Scan between 2 SHAs
Is your feature request related to a problem? Please describe.
For really large codebases, we may not always want to scan the entire git history.
Describe the solution you'd like
There should be a way to scan between 2 SHAs. So, when I run, say, talisman --scan <sha1> <sha2>, the scan should begin at sha1 and continue up to sha2 (with both sha1 and sha2 included in the scan).
The execution command and parameter structure doesn't have to be exactly as mentioned here. Feel free to experiment with that technically.
Additional info
Currently the scanner gets all commits. See https://github.com/thoughtworks/talisman/blob/master/scanner/scanner.go#L66. Git revision ranges could possibly be used to solve this: https://git-scm.com/docs/gitrevisions
I think the issue comes down to this line: https://github.com/thoughtworks/talisman/blob/master/scanner/scanner.go#L45
git ls-tree -r will list all files in the repo at the time of that commit, regardless of whether a file was changed in that commit.
Might be able to use git diff <sha1> <sha2> --name-only to get the file names and then feed that into git ls-tree -r <commit> <list of file names>
Hello 😄 I am interested in working/collaborating on this issue. @tinamthomas are you still working on it? Can I assign myself or collaborate with you? I have used talisman in a project before and can relate to the issue of wanting faster scans.
Regarding:
> I think the issue comes down to this line: https://github.com/thoughtworks/talisman/blob/master/scanner/scanner.go#L45
> git ls-tree -r will list all files in the repo at the time of that commit, regardless of if the change was made in that commit.
From reading https://github.com/thoughtworks/talisman#talisman-in-action, I think that talisman will only check the diff of staged files when used as a pre-commit hook. The pre-push hook checks the entirety of the files in the diffs to be pushed.
When it comes to --scan and --ignoreHistory, the docs are not super clear IMHO: "--scan scanner scans the git commit history for potential secrets". Does that mean it checks every file touched in each commit, or every file you would have in the working directory when checking out each commit? I assume from your comment that it's the latter. Since the pre-push hook checks the entire file in a diff, I would also assume it's a deliberate decision to do the same when checking the entire history. But it would be great to hear what the design decisions were.
Scanning every file entirely on each commit in the history is probably not necessary. If a secret was added in some commit, it should show up as an addition in that commit's diff, and we check every diff anyway.
I am interested in hearing your thoughts and how I can help 😄
Hi @teleivo, feel free to assign yourself to it and we can collaborate on it.
@tinamthomas thank you :) I cannot assign myself. Could this be because I am not a contributor/collaborator?
@teleivo I've assigned myself to the issue for now.
What I see happening now from the code:
- Get all commits
- For each commit, find the files that were changed, and check the entire file contents at that time for any secrets.
I like your idea of only looking in the additions for potential secrets!
This could potentially improve scan speed. Let me start putting the changes together and we can collaborate on it.