Performance issue in repositories with lots of gitignored files
Hi, thanks for creating this awesome tool. It works great as a git hook.
Unfortunately, if the current working directory includes a lot of gitignored files, it appears that validating the paths of GitHub Actions workflows suffers from a performance issue, taking the execution time from 195.10ms to 15.72s. 15 seconds of execution time is slow enough that it's unappealing to use action-validator in Git hooks.
I think the issue arises from the following functionality described in the main readme:
and it makes sure that any globs used in paths / paths-ignore match at least one file in the repo.
this has been one of the more helpful features and the feature that makes this tool stand out from actionlint
I think a way to only compare against version-controlled files would be to use git ls-files, in a manner similar to https://github.com/anttiharju/relcheck/blob/b1476f36403249d1776bc127c2f05d73b11b56bf/internal/git/git.go#L10:
out, err := exec.CommandContext(ctx, "git", "ls-files", "-z", "*.md").Output()
A very simple benchmark illustrating the issue:
# https://github.com/anttiharju/vmatch/tree/66cd2a3c3bbaf51ed44d397dc9f4ff9a272764a7
# we start with nix-direnv activated, 'direnv allow'
$ time git ls-files -z '.github/workflows/*.yml' '*/action.yml' | xargs -0 action-validator --verbose
Treating action.yml as an Action definition
Treating action.yml as an Action definition
Treating action.yml as an Action definition
Treating build.yml as a Workflow definition
Treating plan.yml as a Workflow definition
Treating release.yml as a Workflow definition
Treating wildcard-action-validator.yml as a Workflow definition
Treating wildcard-actionlint.yml as a Workflow definition
Treating wildcard-brew.yml as a Workflow definition
Treating wildcard-github.yml as a Workflow definition
Treating wildcard-go-binary.yml as a Workflow definition
Treating wildcard-golangci-lint.yml as a Workflow definition
Treating wildcard-mkdocs.yml as a Workflow definition
Treating wildcard-nix.yml as a Workflow definition
Treating wildcard-prettier.yml as a Workflow definition
Treating wildcard-shellcheck.yml as a Workflow definition
________________________________________________________
Executed in 15.72 secs fish external
usr time 1.01 secs 2.33 millis 1.00 secs
sys time 11.30 secs 4.07 millis 11.30 secs
$ git reset --hard && git clean -dfx
HEAD is now at 66cd2a3 Add missing permissions (#365)
Removing .direnv/
Removing result
$ time git ls-files -z '.github/workflows/*.yml' '*/action.yml' | xargs -0 action-validator --verbose
Treating action.yml as an Action definition
Treating action.yml as an Action definition
Treating action.yml as an Action definition
Treating build.yml as a Workflow definition
Treating plan.yml as a Workflow definition
Treating release.yml as a Workflow definition
Treating wildcard-action-validator.yml as a Workflow definition
Treating wildcard-actionlint.yml as a Workflow definition
Treating wildcard-brew.yml as a Workflow definition
Treating wildcard-github.yml as a Workflow definition
Treating wildcard-go-binary.yml as a Workflow definition
Treating wildcard-golangci-lint.yml as a Workflow definition
Treating wildcard-mkdocs.yml as a Workflow definition
Treating wildcard-nix.yml as a Workflow definition
Treating wildcard-prettier.yml as a Workflow definition
Treating wildcard-shellcheck.yml as a Workflow definition
________________________________________________________
Executed in 195.10 millis fish external
usr time 88.72 millis 2.19 millis 86.53 millis
sys time 87.74 millis 4.44 millis 83.30 millis
My questions:
- Is a PR to fix this welcome?
- Is it preferred to use a dependency for git ls-files or is shelling out considered acceptable? I imagine dependency is preferred.
Also if someone familiar with the codebase is able to spend a few minutes (really, feel free to timebox it) pointing out where to start in the code I would consider that very helpful.
The advice does not need to be perfect, providing starting points is fine, just hoping to make the pr review easy.
Turns out there a similar issue already open, but with a bit different pov ("it's matching stuff against a ton of files that are OUTSIDE the repo"):
- https://github.com/mpalmer/action-validator/issues/77
I'm strictly only interested files in the repo, not sure what's going on in that other issue
In answer to your initial questions:
- A PR to improve performance would be accepted, as long as it didn't adversely impact the other goals of the project; and
- I'd actually prefer to shell out, as that gets you the "official" implementation, rather than (presumably) a Rust reimplementation.
The similar issue you found never really made a huge amount of sense to me, and without a clean repro I doubt it'll ever be fully understood and fixed. However, there is one important point that I made in that issue that I think bears highlighting here, too: the current implementation of glob-checking is really derpy, in that it just says "match me this glob" and checks whether the list of matched files is empty or not. Since we don't actually care what files match, it'd be a lot quicker if matching was short-circuited to be "return true as soon as we find a file that matches".
Thinking about how this would be implemented, I'm having trouble coming up with a solution which doesn't require basically reimplementing globbing support anyway. The (derpy) way it's done now matches filesystem paths using the glob crate. Presumably if we instead matched against the output of git ls-files, that couldn't be done using the glob crate, and we'd need to find another crate (that would accept a list of files, rather than hitting the filesystem) or implement list-o'-files globbing ourselves. Does that match with your understanding?
Does that match with your understanding?
What you've written down makes sense to me :+1:
-
I think I'll try to fix this, presumably can also fix https://github.com/mpalmer/action-validator/issues/27#issuecomment-3272786758 while I'm at it
-
But I'll tackle https://github.com/mpalmer/action-validator/issues/106#issuecomment-3476441475 first
I'll post any progress I have here, so in case I don't get around to it someone else can also go ahead
Promised to share progress, see here
Writing a library has been progressing well, I now have a maintainable setup for it and a CI-passing changeset for action-validator in my fork.
In trying to benchmark how the execution time has improved, I noticed that the time reported in this issue description is from v0.6.0. By simply updating to v0.8.0 the execution time improves from 13.04 secs to 4.63 secs.
Fortunately, there's still a very nice speedup to 235 millis by replacing glob with [email protected]. Some tests were initially not passing, but as a bonus it appears they were faulty and it's just a case of accuracy improving. Yielding a roughly 19x (4630/235) speedup! :grin: As an action-validator Git hook user this makes me very happy.
I might still work on the library a bit for the purposes of the namesake github action (unrelated to action-validator), but since all the tests I've written and all the tests that action-validator has, I think submitting a PR for review with the current version is appropriate.
Good to hear you're making progress! That's a solid speed improvement you've got.
actionlint (MIT license) iirc does validation for paths, I think I could port it to Rust as well and have the library return errors
edit: perhaps just get the public function signature so that it may return errors, and see about implementing path validation later -- for now using actionlint alongside ought to be enough and less to maintain for me
edit2: turns out just trying to use proper crates for the performance-sensitive bits already surfaces a lot of validation errors so didn't get around to actionlint
There's now a PR open:
- https://github.com/mpalmer/action-validator/pull/113