Extract ACL information

Open Denis4545 opened this issue 8 years ago • 21 comments

Hello. We are crawling FTP servers, and I've noticed that basic metadata fields (meta.created, meta.metadata_date, meta.author) aren't extracted from txt, log, csv, html, and htm files. Could you please explain why these fields aren't extracted for those file types? See the example for text files below: (image)

Denis4545 avatar Dec 01 '17 16:12 Denis4545

Because a txt file has no metadata embedded in the file itself.

dadoonet avatar Dec 01 '17 17:12 dadoonet

@dadoonet Hello David. +1 Could you explain in more detail, please? I thought that there are some base attributes for any file in any file system, like "date created", "date modified", "size", etc., that should be extracted by fscrawler and loaded into Elasticsearch.

The problem is that we are working on a global search inside our company and chose your crawler (thanks for this project!) to create indexes for all FTP servers. We have crawled all FTP servers, and we are now working on the front end for the search system. It should offer users some filters (like in Google) by file extension, date created, date modified, etc. to narrow the results. But we ran into this problem: there is no "date created" information for at least *.txt, *.log, and *.html files. I mean that the crawler doesn't extract this field from these files. That makes our front-end filter partial, since it will not work for all files. Do you know why this happens with the "date created" field and how we can fix it? Thanks.

konovalcev avatar Dec 05 '17 08:12 konovalcev

Reopening to think about it.

dadoonet avatar Dec 05 '17 09:12 dadoonet

@dadoonet Hello David. Thanks for your time. I will wait for your thoughts, because this is important to us and we cannot move forward without understanding this issue. Thanks again.

konovalcev avatar Dec 05 '17 10:12 konovalcev

Maybe I should try to extract more data coming from the filesystem and add that here:

https://github.com/dadoonet/fscrawler/blob/master/src/main/java/fr/pilato/elasticsearch/crawler/fs/FsCrawlerImpl.java#L627-L653

And then if something is provided when the Tika extraction is done, overwrite the "FS" value. Like what we do in: https://github.com/dadoonet/fscrawler/blob/master/src/main/java/fr/pilato/elasticsearch/crawler/fs/tika/TikaDocParser.java#L108-L110

That said, I'm pretty sure that some data will depend on the FS implementation. So we might only be able to capture a few things... Or it will mean detecting the FS first, then trying to extract more or less data depending on that, which will take more time to fix IMO.
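As a rough illustration of the idea above (take a value from the filesystem first, then let a value parsed from the document override it), here is a minimal Java sketch. It assumes a local path readable through `java.nio.file`; a remote FTP source would need a different way to obtain the timestamp. The class and method names (`FsAttributesSketch`, `fsCreated`, `resolveCreated`) are hypothetical and are not taken from the FSCrawler code base.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.time.Instant;

// Hypothetical sketch, not FSCrawler code.
class FsAttributesSketch {

    // Creation time as reported by the filesystem. On filesystems that do not
    // track creation time, the JDK returns an implementation-specific value
    // (often the last-modified time).
    static Instant fsCreated(Path file) throws IOException {
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
        return attrs.creationTime().toInstant();
    }

    // Prefer the value parsed from the document content (for example by Tika),
    // and fall back to the filesystem value when the document provides none.
    static Instant resolveCreated(Instant fromDocument, Instant fromFilesystem) {
        return fromDocument != null ? fromDocument : fromFilesystem;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("fs-attrs", ".txt");
        try {
            // A plain text file carries no embedded metadata, so only the
            // filesystem value is available here.
            Instant created = resolveCreated(null, fsCreated(file));
            System.out.println("created = " + created);
        } finally {
            Files.delete(file);
        }
    }
}
```

With this ordering, a txt or log file still gets a "created" value from the filesystem, while formats with embedded metadata keep the value found in the document.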

Can you list exactly which fields you need for your project as an MVP?

dadoonet avatar Dec 05 '17 18:12 dadoonet

@dadoonet Hello David. Since you are asking, I would also like to request the ability to extract the ACL from every file. That would be a very cool feature: it would let us enforce security in the front end, so that a user only gets results for files to which they have at least read-only access on the file system. Since all our FTP servers run on Windows Server, I'm talking about Windows NTFS ACLs, like this:

acl

If you add "Date Created" and the ACL to the index, it would be incredible support from your side. If adding the ACL is not possible, then the only field we need is "Date Created". All other fields are OK for us. Thanks!

konovalcev avatar Dec 06 '17 10:12 konovalcev

Maybe I can use this: https://myshittycode.com/2013/09/10/reading-directoryfiles-acl-directly-from-java/

I need to give it a try when I have time.
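For reference, a minimal sketch of reading a file's owner and ACL entries with the standard `java.nio.file.attribute.AclFileAttributeView` API. The class name `AclReadSketch` and the method `describeAcl` are hypothetical. The view is only available where the filesystem provider supports it (typically NTFS on Windows); on other platforms `Files.getFileAttributeView` returns `null`, which this sketch treats as "no ACL information".

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.AclEntry;
import java.nio.file.attribute.AclFileAttributeView;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not FSCrawler code.
class AclReadSketch {

    // Returns one line for the owner and one line per ACL entry, or an empty
    // list when the filesystem does not expose an ACL view.
    static List<String> describeAcl(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        AclFileAttributeView view = Files.getFileAttributeView(file, AclFileAttributeView.class);
        if (view == null) {
            return lines; // ACL view not supported on this filesystem
        }
        lines.add("Owner: " + view.getOwner().getName());
        for (AclEntry entry : view.getAcl()) {
            // type() is ALLOW, DENY, AUDIT or ALARM; permissions() is a set of AclEntryPermission
            lines.add(entry.type() + " " + entry.principal().getName() + " " + entry.permissions());
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("acl-read", ".txt");
        try {
            List<String> lines = describeAcl(file);
            if (lines.isEmpty()) {
                System.out.println("No ACL view available for " + file);
            } else {
                lines.forEach(System.out::println);
            }
        } finally {
            Files.delete(file);
        }
    }
}
```

Note that each entry has a type as well as a permission set, so a DENY entry has to be distinguished from an ALLOW entry when the data is interpreted.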

dadoonet avatar Dec 06 '17 13:12 dadoonet

@dadoonet Hello David. That's perfect. I will wait for your response. And could you perhaps implement the "Date created" field first, because it is more important to us for now :) Thanks again!

konovalcev avatar Dec 06 '17 15:12 konovalcev

@dadoonet Happy new year! First, I want to thank you for all the work you've done. I need to get ACLs from Windows files with fscrawler, and I hope there is news on this issue. Thanks.

hatemjaafar avatar Jan 02 '18 09:01 hatemjaafar

So I'm working on that. Sadly I'm a bit blind, as I'm working on macOS and not on Windows :( I'll try to see if I can run tests on Windows afterwards, though.

Anyway, what could be a good representation of the data? Let's take as an example that we have:

Owner:
    BUILTIN\Administrators (Alias)

ACL:
    NT AUTHORITY\SYSTEM (Well-known group)        [APPEND_DATA, WRITE_ATTRIBUTES, DELETE, SYNCHRONIZE, READ_DATA, WRITE_ACL, WRITE_DATA, READ_ATTRIBUTES, WRITE_NAMED_ATTRS, READ_ACL, DELETE_CHILD, WRITE_OWNER, EXECUTE, READ_NAMED_ATTRS]
    BUILTIN\Administrators (Alias)        [APPEND_DATA, WRITE_ATTRIBUTES, DELETE, SYNCHRONIZE, READ_DATA, WRITE_ACL, WRITE_DATA, READ_ATTRIBUTES, WRITE_NAMED_ATTRS, READ_ACL, DELETE_CHILD, WRITE_OWNER, EXECUTE, READ_NAMED_ATTRS]
    MYDOMAIN\thundercat (User)        [APPEND_DATA, WRITE_ATTRIBUTES, DELETE, SYNCHRONIZE, READ_DATA, WRITE_ACL, WRITE_DATA, READ_ATTRIBUTES, WRITE_NAMED_ATTRS, READ_ACL, DELETE_CHILD, WRITE_OWNER, EXECUTE, READ_NAMED_ATTRS]

This is somewhat related to #567 where we generated:

  "attributes": {
     "owner": "david",
     "group": "staff",
     "permissions": 764
  },

I believe that owner in such a case should be replaced by BUILTIN\Administrators (Alias). And that I should add an acl structure like:

  "attributes": {
     "owner": "david",
     "acl": [{
        "user": "NT AUTHORITY\\SYSTEM (Well-known group)",
        "access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
       },{
        "user": "BUILTIN\\Administrators (Alias)",
        "access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
       },{
        "user": "MYDOMAIN\\thundercat (User)",
        "access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
       }
     ]
  },

attributes.acl would be defined in the mapping as a nested type. But this will affect the number of documents indexed in Lucene in the end, because each nested object is indexed as a separate hidden Lucene document.

I'm wondering what the best representation would be. What do you think @konovalcev @hatemjaafar @Denis4545 ?
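A minimal sketch of how ACL entries could be turned into the `attributes` structure proposed above, using plain Java maps and lists. The names `AclDocumentSketch`, `toAttributes`, and `principal` are hypothetical. One assumption is labeled in the code: the proposed structure has no field for the entry type, so this sketch keeps only ALLOW entries and drops the rest, which may or may not be the desired behavior.

```java
import java.nio.file.attribute.AclEntry;
import java.nio.file.attribute.AclEntryPermission;
import java.nio.file.attribute.AclEntryType;
import java.nio.file.attribute.UserPrincipal;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not FSCrawler code.
class AclDocumentSketch {

    // Builds {"owner": ..., "acl": [{"user": ..., "access": [...]}, ...]}.
    static Map<String, Object> toAttributes(String owner, List<AclEntry> entries) {
        List<Map<String, Object>> acl = new ArrayList<>();
        for (AclEntry entry : entries) {
            // Assumption: only ALLOW entries are kept, because the proposed
            // structure cannot express DENY, AUDIT or ALARM entries.
            if (entry.type() != AclEntryType.ALLOW) {
                continue;
            }
            List<String> access = new ArrayList<>();
            for (AclEntryPermission permission : entry.permissions()) {
                access.add(permission.name());
            }
            Collections.sort(access); // stable ordering for the indexed document
            Map<String, Object> item = new LinkedHashMap<>();
            item.put("user", entry.principal().getName());
            item.put("access", access);
            acl.add(item);
        }
        Map<String, Object> attributes = new LinkedHashMap<>();
        attributes.put("owner", owner);
        attributes.put("acl", acl);
        return attributes;
    }

    // Test helper: a principal that only carries a name, so the sketch can run
    // without looking up real accounts.
    static UserPrincipal principal(String name) {
        return new UserPrincipal() {
            @Override
            public String getName() {
                return name;
            }
        };
    }

    public static void main(String[] args) {
        AclEntry allow = AclEntry.newBuilder()
                .setType(AclEntryType.ALLOW)
                .setPrincipal(principal("MYDOMAIN\\thundercat"))
                .setPermissions(AclEntryPermission.READ_DATA, AclEntryPermission.READ_ACL)
                .build();
        System.out.println(toAttributes("BUILTIN\\Administrators", List.of(allow)));
    }
}
```

With one object per principal, a file with many ACL entries produces as many nested objects, which is the document-count trade-off mentioned above.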

dadoonet avatar Aug 01 '18 14:08 dadoonet

Ping @konovalcev @hatemjaafar @Denis4545. Any thoughts?

dadoonet avatar Jan 28 '19 00:01 dadoonet

@dadoonet Thank you for this project, it's helping us index thousands of files for our client.

Did you add the ACL feature to FSCrawler? It would be really useful.

sn0opr avatar Mar 13 '19 15:03 sn0opr

@sn0opr Nope. This issue is still open. Do you have any thoughts on my proposals here: https://github.com/dadoonet/fscrawler/issues/464#issuecomment-409589061 ?

dadoonet avatar Mar 13 '19 19:03 dadoonet

Thank you @dadoonet for the quick reply. Actually, that's what we are looking for :) I think this structure lets us know the access types for each user.

"attributes": {
     "owner": "david",
     "acl": [{
        "user": "NT AUTHORITY\\SYSTEM (Well-known group)",
        "access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
       },{
        "user": "BUILTIN\\Administrators (Alias)",
        "access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
       },{
        "user": "MYDOMAIN\\thundercat (User)",
        "access": ["APPEND_DATA", "WRITE_ATTRIBUTES", "DELETE", "SYNCHRONIZE", "READ_DATA", "WRITE_ACL", "WRITE_DATA", "READ_ATTRIBUTES", "WRITE_NAMED_ATTRS", "READ_ACL", "DELETE_CHILD", "WRITE_OWNER", "EXECUTE", "READ_NAMED_ATTRS"]
       }
     ]
  },

Could you please push the branch you are working on for this feature? I would like to help with this part if possible. Thanks.

sn0opr avatar Mar 14 '19 17:03 sn0opr

I pushed that here: https://github.com/dadoonet/fscrawler/tree/wip/acl

But nothing fancy yet... Just a few lines of code to see where this can go. Let me know if you want to take it from here and contribute, or if I need to implement it (longer delay, I'm afraid :) ).

dadoonet avatar Mar 14 '19 18:03 dadoonet

@dadoonet Thank you for pushing the branch. It looks clear; we will work on it and open a pull request ASAP. Thanks again!

sn0opr avatar Mar 15 '19 10:03 sn0opr

@sn0opr I'm wondering if you did anything on your side regarding this feature?

dadoonet avatar Feb 16 '23 10:02 dadoonet

Howdy, I was wondering if there has been any progress on extracting and indexing Windows ACLs.

Thurdi avatar Dec 14 '23 21:12 Thurdi

Not on my side sadly. Wanna work on it?

dadoonet avatar Jan 16 '24 01:01 dadoonet

I have a WIP fork that I've been tinkering with here. I'll get it polished up as best as I can, but feel free to give it a look over and let me know your thoughts.

Thurdi avatar Jan 16 '24 20:01 Thurdi

Hey @Thurdi

Sorry for the delay. Would you like to create a proper branch for this in your fork and then send a draft PR so we can discuss it more easily there?

dadoonet avatar May 03 '24 09:05 dadoonet