codeql-action codeql-action/init should explicitly delete an already present database (for self-hosted runners)

We are running CodeQL as an action on self-hosted runners and have enabled the 'over-write' option to clean the CodeQL database before each run.

However, some environments fail with the following message:

Failure invoking /__w/_tool/CodeQL/2.13.3-20230524/x64/codeql/codeql with arguments database,init,--db-cluster,/__w/_temp/codeql_databases,--source-root=/__w/a1_aggregation_mtf/a1_aggregation_mtf,--language=java,--begin-tracing,--trace-process-name=Runner.Worker.exe,--codescanning-config=/__w/_temp/user-config.yaml,--overwrite.
 
       Exit code 2 and error was:

       A fatal error occurred: Ran with database overwriting enabled, but the directory does not appear to be a CodeQL database or database cluster. Please check you do indeed wish to delete it, and do so manually.

As suggested in the error message, we can manually delete the directory, allowing for a successful run.

We do not understand what is causing CodeQL to believe the directory is not a CodeQL database. We are not explicitly altering the contents or structure. As the directories involved are created by CodeQL within the workflow, the hope is to have CodeQL able to clean up it's workspace from within the workflow as well.

Ideally, we want to avoid having to implement an external cleanup step for the process.

We are curious to see what logic is being used to determine the 'CodeQL database' status of the targeted directory.

Regards, Richard Seeton

Jun 09 '23 17:06 rseeton

In brief, the directory is taken for a codeql database if it contains codeql-database.yml, or a database cluster if it contains exactly dirs log, working, temp and one or more directories that look like databases (i.e., which contain codeql-database.yml). I suggest logging what's really in the directory so you can see what is either unexpectedly missing or present when that error is raised.

Jun 09 '23 20:06 smowton

I'm assuming you're using the github/codeql-action action on your self-hosted runner. But the action doesn't seem to have any overt feature for asking for the database directory to be overwritten. Did you use $CODEQL_ACTIONS_EXTRA_OPTIONS to inject the --overwrite option on the codeql database init command line?

The original use case for --overwrite was as a convenience feature for interactive use where one executes the same codeql database create command again and again after making changes to the source tree and/or extractor. In those cases, deleting the database manually is a low-cost workaround, so the "is this actually a database we're deleting" check deliberately tries to err in the direction of not deleting a completely wrong directory tree, e.g. because the interactive user misremembered the command-line syntax.

Looking through the code that implements the check, I can see a few theoretical possibilities where an earlier run that was killed a just the wrong time may have left the directory in a state that's not recognizable as a database. This would be a fairly rare occurrence -- but once a particular self-hosted runner has a directory in that state, there would be nothing that automatically clears it away. (In contrast, GitHub-hosted runners always start all just in a pristine container without anything left over from earlier jobs).

As Chris says, it's hard to say exactly what has go wrong without looking at the actual directory content.

Arguably the right fix ought to be that the action should clear out the database directory itself when it starts up rather than leaving it to the CLI to do so. (At least when it finds itself on a self-hosted runner, if that condition is easily detectable).

Jun 09 '23 20:06 hmakholm

(More precisely, if an earlier attempt to run codeql-action/init was killed after it had created a language subdirectory of the database cluster but before it wrote a codeql-database.yml into it, then I think codeql database init --overwrite would not recognize the resulting directory tree being safe to delete).

Jun 09 '23 20:06 hmakholm

If this guess is right, there seems to be a good chance that cleaning out the offending directory on the affected runner manually could get your setup working again while we decide whether to change the behavior of the action.

Jun 09 '23 21:06 hmakholm

Hello @hmakholm

Your assumption is correct - we are running a GitHub Enterprise environment with selfhosted runners calling the codeql-action. To enable the 'overwrite' functionality, we use the $CODEQL_ACTIONS_EXTRA_OPTIONS environment setup.

The problem occurs on every run of the workflow, regardless of whether the previous run completed successfully or not. As suggested above, if we force the deletion of the codeql_databases directory, we can get around the problem (this is being implemented in the runners using the ACTIONS_RUNNER_HOOK_JOB_STARTED option)

The concern/question though, is why is this external cleanup needed? On a successful execution the codeql_databases directory is populated and happily generates the sarif results, but the subsequent run aborts saying that if doesn't think the directory is a codeql_database - this suggests we are violating the rules used to identify a codeql_database, but we don't have clarity on what that rule requires.

A clean run produces the following directories and a codeql-database.yml file ... and yet this triggers a failure on the subsequent run (if not cleaned up):

**_pwd_**
/<runner_dir>/_work/_temp/codeql_databases

**_ls -lrta ./_**
total 0
drwxr-xr-x 3 root    root          20 Jun  9 23:21 diagnostic
drwxr-xr-x 3 root    root          32 Jun  9 23:21 temp
drwxr-xr-x 7 root    root          74 Jun  9 23:21 .
drwxr-xr-x 2 root    root          93 Jun  9 23:21 working
drwxr-xr-x 7 root    root         145 Jun  9 23:24 java
drwxr-xr-x 2 root    root         134 Jun  9 23:27 log
drwxr-xr-x 7 rseeton role-algodev 147 Jun  9 23:27 ..

**_$ ls -lrta ./java/codeql-database.yml_** 
-rw-r--r-- 1 root root 310 Jun  9 23:24 ./java/codeql-database.yml

We aren't blocked but we are confused.
Regards.

Jun 10 '23 03:06 rseeton

Thanks, that was helpful.

The diagnostic directory is one we have started generating since the --overwrite logic was written, and it looks like we missed updating the latter accordingly to deal with it.

I'll file an internal note to get that fixed.

Jun 12 '23 11:06 hmakholm

@hmakholm - Excellent, thanks. Testing in our environments confirms that if we remove the codeql_databases/diagnostic library, the overwrite function recognizes the codeql_databases as a CodeQL database and allows the over-write operation to complete.

Jun 12 '23 15:06 rseeton