identify Added parsing for saltstack interpreted files

Hey! This is the follow up to https://github.com/chriskuehl/identify/issues/61 and the continuation of https://github.com/chriskuehl/identify/pull/60, I hope this is more inline with how you would like. Let me know what you think.

I'm not 100% positive of how different tags should be added, IE if all the salt fiels should be tagged with 'salt' and then additionally the 'salt-' tag, for now I only added the non-default renderer tags since I think matching a default renderer tag when the contents of the file is not actually readable by that renderer would cause issues. I'm on the fence though.

Nov 14 '19 18:11 raddessi

Also I'm not sure of a way to cleanly deal with the possible combination of any of these renderers as you are able to mix and match them at will, IE: #! jinja|py.. Should they just not be tagged?

Nov 14 '19 18:11 raddessi

Actually now that I think about it I do think this needs some work, but I'll need your input for direction. I've added some failing tests that illustrate the problem.. The test expectations should probably be set first though and I'm not sure about that. Thanks

Nov 14 '19 18:11 raddessi

I think the approach is better than last time, I'm not sure the #! foo|bar syntax is something that identify should try and handle -- it's not at all standard and seems invented by saltstack (?)

let me know what you think, sorry for the brevity I'm trying to get through my backlog of email :S happy to elaborate on things!

Nov 19 '19 06:11 asottile

Definitely cleaner than last time :) Yes you are correct, they allow for multiple rendering passes on files so foo|bar means it would first be rendered with foo then with bar. The default renderer is no shebang is specified is actually jinja|yaml (docs here).. so I'm not really sure how to represent all the possible combinations in a tag as anything other than just "salt" which is why in the current way I set this up a file with the .sls extension should only get the salt tag if there is no renderer specified, and it gets a more specific tag for renderers when there is one and only one specifed. IE files with #! py get tagged with both salt-py and python, but not salt since their format is not the default jinja/yaml.. I'm not 100% on the logic there though like I said.

The backstory here is that I have a lot of pillars written in pure python using either the py or pyobjects renderer and I want pre-commit to be able to identify those and run checks against them, but since it currently only tags the files as salt that is too broad as that matches all .sls files.

I think the usefulness of tagging the mix-and-match combinations is very low since it would not be easy to run linters or other tools against them as the combinations of syntaxes usually produces some hard to parse contents, but tagging the files when there is only one renderer that is not the normal jinja|yaml would be very useful. At least to me.

Nov 19 '19 17:11 raddessi

Maybe it is better to keep tagging anything ending with .sls as salt only, but then add the extra interpreter tag when we match a single renderer? So #! py would get tagged with salt and python rather than trying to make a subset of mixed tags like salt-py etc?

Nov 19 '19 18:11 raddessi

yeah I think that makes the most sense, and would be less of a breaking change from the current behaviour. I think it's also fine / expected to not handle the foo|bar syntax since that's salt specific and imo not really in the spirit of this tool (to handle ever weird combination that other tools might do)

Nov 19 '19 18:11 asottile

Hey there! I'm so sorry for letting this sit in your PR queue, I'll do my best to get this closed as quickly as possible. Check out this iteration, I think it is much cleaner. I wasn't able to make the code completely generic since the logic to parse shebang formats would probably change with different types but I think this is pretty decent.

One remaining issue is that coverage thinks some lines are not being tested and as far as I can see I think they are.. do you see an obvious issue?

Thanks again for keeping this alive, I hope your year is going well :)

Feb 14 '21 03:02 raddessi

I'm just a random contributor, not @asottile , but I had some comments on your file type additions based on my experience.

With regard to your overall approach, it would simplify the implementation considerably, make it more flexible/general, and less brittle/special case, if you avoided implementing a bunch of bespoke logic just for parsing salt files (requiring any other extension added to EXTENSIONS_NEED_SHEBANG_CHECK to do the same, or silently not work properly). Instead, why not move the salt extension key/value as-is to the EXTENSIONS_NEED_SHEBANG_CHECK dict, add the appropriate interpreters to the interpreters file, add a function that returns whether the file needs a shebang check (e.g. via the first 3-4 lines of your existing new one), and tweak the execution flow to cut out the else on line 57 and change the if below to if (not len(t) and executable) or needs_shebang_check(os.path.basename(path)), so the shebang check runs on such files? This is also a little more efficient (the extra extension check only runs when the shebang isn't already known to be needed), matches the format of the existing EXTENSIONS_NEED_BINARY_CHECK, and avoids unnecessary bespoke code and data to retrieve extensions from tags and having to hardcode interpreters for each extension, and extensions for each interpreter. Plus, it completely avoids needing any of the code the test coverage is complaining about.

There is a lot to unpack here :) Yeah I wasn't very happy about the setup here overall and I think this sounds like a good idea but I would have to think it over more. I'm currently on a work trip and will be knee deep in patch panels for the next week or so but as soon as I get back I'll look this over. Thanks for the reviews!

Feb 25 '21 16:02 raddessi

FYI only, I'm playing around with this still. I don't really love any of these approaches so far as there is either a spaghetti code path or some duplicated processing.. I'm playing with either a custom path object class with @property getters to cache the duplicated calls or just a functools.lru_cache on top of some functions.. I need to do some performance comparisons before I send anything up. I don't want to add duplicated calls but I also don't want to add a ton of changes just to avoid them either.. I'm still on the fence about the whole path forward.

EDIT: When I mentioned a custom path object I mean something like:

class IdentifyPath:

    def __init__(self, path: str) -> None:
        self._path = path

        super().__init__()

    @property
    def mode(self):
        try:
            return self._mode
        except AttributeError:
            try:
                self._mode = os.lstat(self.path).st_mode
            except (OSError, ValueError):  # same error-handling as `os.lexists()`
                raise ValueError(f'{self.path} does not exist.')

        return self._mode

    @property
    def path(self):
        return self._path

    @property
    def filename(self):
        try:
            return self._filename
        except AttributeError:
            _, filename = os.path.split(self.path)
            self._filename = filename

        return self._filename

    @property
    def ext(self):
        try:
            return self._ext
        except AttributeError:
            _, ext = os.path.splitext(self.filename)
            self._ext = ext

        return self._ext

    @property
    def executable(self):
        try:
            return self._executable
        except AttributeError:
            try:
                self._executable = os.access(self.path, os.X_OK)
            except (OSError, ValueError):  # same error-handling as `os.lexists()`
                raise ValueError(f'{self.path} does not exist.')

        return self._executable

    @property
    def shebang(self):
        try:
            return self._shebang
        except AttributeError:
            self._shebang = parse_shebang_from_file(self)

        return self._shebang

It's named terribly I know, the entire test suite would have to be changed, and I'm worried about setup cost performance like I said. I'll keep poking at it..

Apr 08 '21 17:04 raddessi