action-validator icon indicating copy to clipboard operation
action-validator copied to clipboard

Performance issue in repositories with lots of gitignored files

Open anttiharju opened this issue 3 months ago • 9 comments

Hi, thanks for creating this awesome tool. It works great as a git hook.

Unfortunately, if the current working directory includes a lot of gitignored files, it appears that validating the paths of GitHub Actions workflows suffers from a performance issue, taking the execution time from 195.10ms to 15.72s. 15 seconds of execution time is slow enough that it's unappealing to use action-validator in Git hooks.

I think the issue arises from the following functionality described in the main readme:

and it makes sure that any globs used in paths / paths-ignore match at least one file in the repo.

this has been one of the more helpful features and the feature that makes this tool stand out from actionlint

I think a way to only compare against version-controlled files would be to use git ls-files, in a manner similar to https://github.com/anttiharju/relcheck/blob/b1476f36403249d1776bc127c2f05d73b11b56bf/internal/git/git.go#L10:

	out, err := exec.CommandContext(ctx, "git", "ls-files", "-z", "*.md").Output()

A very simple benchmark illustrating the issue:

# https://github.com/anttiharju/vmatch/tree/66cd2a3c3bbaf51ed44d397dc9f4ff9a272764a7
# we start with nix-direnv activated, 'direnv allow'
$ time git ls-files -z '.github/workflows/*.yml' '*/action.yml' | xargs -0 action-validator --verbose
Treating action.yml as an Action definition
Treating action.yml as an Action definition
Treating action.yml as an Action definition
Treating build.yml as a Workflow definition
Treating plan.yml as a Workflow definition
Treating release.yml as a Workflow definition
Treating wildcard-action-validator.yml as a Workflow definition
Treating wildcard-actionlint.yml as a Workflow definition
Treating wildcard-brew.yml as a Workflow definition
Treating wildcard-github.yml as a Workflow definition
Treating wildcard-go-binary.yml as a Workflow definition
Treating wildcard-golangci-lint.yml as a Workflow definition
Treating wildcard-mkdocs.yml as a Workflow definition
Treating wildcard-nix.yml as a Workflow definition
Treating wildcard-prettier.yml as a Workflow definition
Treating wildcard-shellcheck.yml as a Workflow definition

________________________________________________________
Executed in   15.72 secs    fish           external
   usr time    1.01 secs    2.33 millis    1.00 secs
   sys time   11.30 secs    4.07 millis   11.30 secs

$ git reset --hard && git clean -dfx
HEAD is now at 66cd2a3 Add missing permissions (#365)
Removing .direnv/
Removing result
$ time git ls-files -z '.github/workflows/*.yml' '*/action.yml' | xargs -0 action-validator --verbose
Treating action.yml as an Action definition
Treating action.yml as an Action definition
Treating action.yml as an Action definition
Treating build.yml as a Workflow definition
Treating plan.yml as a Workflow definition
Treating release.yml as a Workflow definition
Treating wildcard-action-validator.yml as a Workflow definition
Treating wildcard-actionlint.yml as a Workflow definition
Treating wildcard-brew.yml as a Workflow definition
Treating wildcard-github.yml as a Workflow definition
Treating wildcard-go-binary.yml as a Workflow definition
Treating wildcard-golangci-lint.yml as a Workflow definition
Treating wildcard-mkdocs.yml as a Workflow definition
Treating wildcard-nix.yml as a Workflow definition
Treating wildcard-prettier.yml as a Workflow definition
Treating wildcard-shellcheck.yml as a Workflow definition

________________________________________________________
Executed in  195.10 millis    fish           external
   usr time   88.72 millis    2.19 millis   86.53 millis
   sys time   87.74 millis    4.44 millis   83.30 millis

My questions:

  1. Is a PR to fix this welcome?
  2. Is it preferred to use a dependency for git ls-files or is shelling out considered acceptable? I imagine dependency is preferred.

anttiharju avatar Oct 31 '25 12:10 anttiharju

Also if someone familiar with the codebase is able to spend a few minutes (really, feel free to timebox it) pointing out where to start in the code I would consider that very helpful.

The advice does not need to be perfect, providing starting points is fine, just hoping to make the pr review easy.

anttiharju avatar Oct 31 '25 12:10 anttiharju

Turns out there a similar issue already open, but with a bit different pov ("it's matching stuff against a ton of files that are OUTSIDE the repo"):

  • https://github.com/mpalmer/action-validator/issues/77

I'm strictly only interested files in the repo, not sure what's going on in that other issue

anttiharju avatar Oct 31 '25 12:10 anttiharju

In answer to your initial questions:

  1. A PR to improve performance would be accepted, as long as it didn't adversely impact the other goals of the project; and
  2. I'd actually prefer to shell out, as that gets you the "official" implementation, rather than (presumably) a Rust reimplementation.

The similar issue you found never really made a huge amount of sense to me, and without a clean repro I doubt it'll ever be fully understood and fixed. However, there is one important point that I made in that issue that I think bears highlighting here, too: the current implementation of glob-checking is really derpy, in that it just says "match me this glob" and checks whether the list of matched files is empty or not. Since we don't actually care what files match, it'd be a lot quicker if matching was short-circuited to be "return true as soon as we find a file that matches".

Thinking about how this would be implemented, I'm having trouble coming up with a solution which doesn't require basically reimplementing globbing support anyway. The (derpy) way it's done now matches filesystem paths using the glob crate. Presumably if we instead matched against the output of git ls-files, that couldn't be done using the glob crate, and we'd need to find another crate (that would accept a list of files, rather than hitting the filesystem) or implement list-o'-files globbing ourselves. Does that match with your understanding?

mpalmer avatar Nov 01 '25 03:11 mpalmer

Does that match with your understanding?

What you've written down makes sense to me :+1:

  • I think I'll try to fix this, presumably can also fix https://github.com/mpalmer/action-validator/issues/27#issuecomment-3272786758 while I'm at it

  • But I'll tackle https://github.com/mpalmer/action-validator/issues/106#issuecomment-3476441475 first

I'll post any progress I have here, so in case I don't get around to it someone else can also go ahead

anttiharju avatar Nov 01 '25 15:11 anttiharju

Promised to share progress, see here

anttiharju avatar Nov 03 '25 19:11 anttiharju

Writing a library has been progressing well, I now have a maintainable setup for it and a CI-passing changeset for action-validator in my fork.

In trying to benchmark how the execution time has improved, I noticed that the time reported in this issue description is from v0.6.0. By simply updating to v0.8.0 the execution time improves from 13.04 secs to 4.63 secs.

Fortunately, there's still a very nice speedup to 235 millis by replacing glob with [email protected]. Some tests were initially not passing, but as a bonus it appears they were faulty and it's just a case of accuracy improving. Yielding a roughly 19x (4630/235) speedup! :grin: As an action-validator Git hook user this makes me very happy.

I might still work on the library a bit for the purposes of the namesake github action (unrelated to action-validator), but since all the tests I've written and all the tests that action-validator has, I think submitting a PR for review with the current version is appropriate.

anttiharju avatar Nov 27 '25 16:11 anttiharju

Good to hear you're making progress! That's a solid speed improvement you've got.

mpalmer avatar Nov 27 '25 19:11 mpalmer

actionlint (MIT license) iirc does validation for paths, I think I could port it to Rust as well and have the library return errors

edit: perhaps just get the public function signature so that it may return errors, and see about implementing path validation later -- for now using actionlint alongside ought to be enough and less to maintain for me

edit2: turns out just trying to use proper crates for the performance-sensitive bits already surfaces a lot of validation errors so didn't get around to actionlint

anttiharju avatar Nov 27 '25 22:11 anttiharju

There's now a PR open:

  • https://github.com/mpalmer/action-validator/pull/113

anttiharju avatar Dec 01 '25 09:12 anttiharju