codeql icon indicating copy to clipboard operation
codeql copied to clipboard

Go: support extracting test code

Open smowton opened this issue 1 year ago • 2 comments

This implements support for test extraction by two mechanisms:

  • In autobuild mode, setting CODEQL_EXTRACTOR_GO_EXTRACT_TESTS to true.
  • In manual build mode, tracing a go test command (go test -c is to be recommended for efficiency).

Go deals with test compilation by creating several extra packages on top of those expected from inspection of the source code (see docs of packages.Load for more detail): packages whose IDs include a suffix like mydomain.com/mypackage [mydomain.com/mypackage.test], and packages containing generated test driver code like mydomain.com/mypackage.test. There are also additional packages like mydomain.com/mypackage_tests which are explicitly present in source code, but not compiled by a normal go build.

So far as I can tell, the purpose of the two variants of the package is to resolve dependency cycles (because the tests variant of the package can have more dependencies than the non-tests variant, and non-test code can compile against non-test package variants). Since the test package variants seems to be a superset of the non-tests variant, I employ the simple heuristic of ignoring the variant of each package with the shortest ID. I haven't seen a case where there are three or more variants of a package, so I expect this to always identify the tests variant as the preferred one. If several variants were extracted, and we were to attempt to match Golang's linkage strategy among the different variants, we would need to extend trap-file name and most top-level symbol trap IDs with the package variant they come from; I hope this won't prove necessary.

"Real" _tests packages, and wholly synthetic driver code packages, are extracted just like normal.

Implementation status: needs tests. Have manually tested on one project with a substantial test-suite (https://github.com/oauth2-proxy/oauth2-proxy) which produced a consistent database and an expected-sized increase in extracted files in both auto and manual build modes.

smowton avatar Aug 13 '24 18:08 smowton

@owen-mc comments addressed and tests added; this is ready for re-review. I'll run DCA on a branch that forces the environment variable on to check for database inconsistencies, and QA on the same branch to check for new extraction failures.

smowton avatar Aug 21 '24 11:08 smowton

DCA results were as expected: lots of new files extracted, some new front-end errors in projects that already had some but not in previously-clean projects, some new results from various queries when tests are extracted, crucially no new database inconsistencies.

Started QA at https://github.com/github/codeql-qa-ops/issues/188 to cross-check if there is anything overtly worrying; if not I'll add a change-note and merge this.

Followups to do:

  • Amend our existing support for vendor dir extraction to also use a "proper" extraction option rather than an ad-hoc environment variable.
  • Amend the log line that reports extraction mode (with tests or without) to also report vendor-dir extraction status)
  • Make a docs PR to briefly document both of these options.

smowton avatar Aug 22 '24 14:08 smowton

QA revealed two problems:

  1. My simple extraction strategy of skipping the normal variant of a package when a test variant exists didn't quite work, because we can get package visit orders like [normal variant (skipped?), normal variant user, test variant, test variant user] (the order is topological, so that packages always come before their users), and so skipping the normal variant led to crashes when in particular type parameters were referenced by "normal variant user" that hadn't been extracted yet. Instead, I now only skip the normal variant for file extraction purposes; for object and type extraction I extract at both the normal and test variant steps. This could be made more efficient by skipping entities present in both variants; for now I just accept the cost of visiting them twice and producing duplicate information. Solved in https://github.com/github/codeql/pull/17216/commits/bcb84a84e1184b08efccd9d9edec39c41d9d4fc6
  2. GetPkgsInfo needed to also add a flag to retrieve information about test packages; otherwise packages that only contain tests didn't have a map entry documenting the package name -> filesystem path mapping for those packages. Solved in https://github.com/github/codeql/pull/17216/commits/bb44a2fc8ca094560d98d1503bdd75360b7d45ed

DCA repeating a sample of the projects showing a crash in QA reveals they all now produce the expected results: switching on test extraction results in more files extracted, a larger database, more alerts. Therefore I think this is ready to merge.

One more possible followup: trying to reproduce the problems mentioned above in test cases.

smowton avatar Sep 21 '24 21:09 smowton

@owen-mc please re-review

smowton avatar Sep 21 '24 21:09 smowton