Set up `scripts/es-sarif/` for MRVA -> SARIF -> Elasticsearch

Open data-douser opened this issue 2 months ago • 0 comments

Description

This pull request (PR) sets up a new environment and documentation for running gh mrva to produce large SARIF result sets and for indexing those results into Elasticsearch for visualization in Kibana. It adds scripts and documentation that help users create a Python virtual environment, manage dependencies, and work with Elasticsearch and the MRVA tooling. The changes are organized into environment setup, documentation, and supporting files.

Outline of Changes

Environment setup and utilities:

Added setup-venv.sh to automate creation of a Python 3.11 virtual environment, install dependencies, and provide usage instructions for MRVA and SARIF indexing workflows.
Added activate.sh as a convenience script to activate the SARIF Elasticsearch Indexer environment with a custom shell prompt.
Added a comprehensive .gitignore for Python, editor, and platform-specific files, as well as local service directories.
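
As an illustration of the environment these scripts are meant to produce, here is a minimal sketch of a sanity check that could be run after activation. It is not part of this PR: the function name `environment_problems` is hypothetical, and it assumes that "Python 3.11" means an exact major.minor match rather than a minimum version.

```python
import sys
from typing import List, Sequence, Tuple


def environment_problems(
    version: Sequence[int] = sys.version_info,
    prefix: str = sys.prefix,
    base_prefix: str = sys.base_prefix,
    required: Tuple[int, int] = (3, 11),  # assumption: exact match on major.minor
) -> List[str]:
    """Return human-readable problems; an empty list means the environment looks right."""
    problems = []
    if tuple(version[:2]) != required:
        problems.append(
            f"expected Python {required[0]}.{required[1]}, "
            f"found {version[0]}.{version[1]}"
        )
    # Inside a virtual environment, sys.prefix differs from sys.base_prefix.
    if prefix == base_prefix:
        problems.append("no virtual environment is active")
    return problems


if __name__ == "__main__":
    for problem in environment_problems():
        print(f"warning: {problem}")
```

The parameters default to the running interpreter's values, so the check can also be exercised with explicit arguments.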

Documentation and dependency management:

Added requirements.txt specifying the elasticsearch Python package required for SARIF indexing.
Added detailed usage and workflow documentation for both the SARIF Elasticsearch indexer (index-sarif-results-in-elasticsearch.md) and the MRVA query suite runner (run-gh-mrva-for-query-suite.md), including setup, example commands, and troubleshooting.
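
For readers unfamiliar with the SARIF-to-Elasticsearch step, the following is a rough sketch of how SARIF results might be flattened into bulk-index actions for the elasticsearch Python package. It is an illustration under stated assumptions, not the indexer documented in this PR: the function names, the document field names, the `sarif-results` index name, and the `http://localhost:9200` URL are all hypothetical. Only the SARIF 2.1.0 fields (`runs`, `results`, `ruleId`, `locations`, and so on) and `elasticsearch.helpers.bulk` come from the respective specifications and library.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator


def sarif_to_actions(
    sarif: Dict[str, Any], index: str, source: str = ""
) -> Iterator[Dict[str, Any]]:
    """Yield one bulk-index action per SARIF result (hypothetical document shape)."""
    for run in sarif.get("runs", []):
        tool = run.get("tool", {}).get("driver", {}).get("name")
        for result in run.get("results", []):
            # Only the first location is used here; a fuller indexer might keep all.
            locations = result.get("locations") or [{}]
            physical = locations[0].get("physicalLocation", {})
            yield {
                "_index": index,
                "_source": {
                    "source_file": source,
                    "tool": tool,
                    "rule_id": result.get("ruleId"),
                    "level": result.get("level"),
                    "message": result.get("message", {}).get("text"),
                    "uri": physical.get("artifactLocation", {}).get("uri"),
                    "start_line": physical.get("region", {}).get("startLine"),
                },
            }


def index_sarif_file(
    path: str,
    es_url: str = "http://localhost:9200",  # assumption: local, unauthenticated cluster
    index: str = "sarif-results",  # assumption: illustrative index name
) -> int:
    """Bulk-index every result in one SARIF file; return the number indexed."""
    # Imported lazily so the flattening logic above works without the package.
    from elasticsearch import Elasticsearch, helpers

    sarif = json.loads(Path(path).read_text(encoding="utf-8"))
    client = Elasticsearch(es_url)
    indexed, _errors = helpers.bulk(client, sarif_to_actions(sarif, index, source=path))
    return indexed
```

Keeping the flattening step separate from the network call makes it possible to inspect the generated documents before anything is sent to a cluster.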

Change request type

[X] Release or process automation (GitHub workflows, internal scripts)
[X] Internal documentation
[ ] External documentation
[ ] Query files (.ql, .qll, .qls or unit tests)
[ ] External scripts (analysis report or other code shipped as part of a release)

Rules with added or modified queries

[X] No rules added
[ ] Queries have been added for the following rules:
- rule number here
[ ] Queries have been modified for the following rules:
- rule number here

Release change checklist

A change note (development_handbook.md#change-notes) is required for any pull request which modifies:

The structure or layout of the release artifacts.
The evaluation performance (memory, execution time) of an existing query.
The results of an existing query in any circumstance.

If you are only adding new rule queries, a change note is not required.

Author: Is a change note required?

[ ] Yes
[X] No

🚨🚨🚨 Reviewer: Confirm that the format of shared queries (the .ql file that imports the shared .qll file, not the .qll file itself) is valid by running them within VS Code.

[ ] Confirmed

Reviewer: Confirm that either a change note is not required or the change note is required and has been added.

[ ] Confirmed

Query development review checklist

For PRs that add new queries or modify existing queries, the following checklist should be completed by both the author and reviewer:

Author

[ ] Have all the relevant rule package description files been checked in?
[ ] Have you verified that the metadata properties of each new query are set appropriately?
[ ] Do all the unit tests contain both "COMPLIANT" and "NON_COMPLIANT" cases?
[ ] Are the alert messages properly formatted and consistent with the style guide?
[ ] Have you run the queries on OpenPilot and verified that the performance and results are acceptable?
As a rule of thumb, predicates specific to the query should take no more than 1 minute, and simple queries should finish in under 10 seconds. If this is not the case, it should be highlighted and agreed upon during code review.
[ ] Does the query have an appropriate level of in-query comments/documentation?
[ ] Have you considered/identified possible edge cases?
[ ] Does the query not reinvent features in the standard library?
[ ] Can the query be simplified further (not golfed!)?

Reviewer

[ ] Have all the relevant rule package description files been checked in?
[ ] Have you verified that the metadata properties of each new query are set appropriately?
[ ] Do all the unit tests contain both "COMPLIANT" and "NON_COMPLIANT" cases?
[ ] Are the alert messages properly formatted and consistent with the style guide?
[ ] Have you run the queries on OpenPilot and verified that the performance and results are acceptable?
As a rule of thumb, predicates specific to the query should take no more than 1 minute, and simple queries should finish in under 10 seconds. If this is not the case, it should be highlighted and agreed upon during code review.
[ ] Does the query have an appropriate level of in-query comments/documentation?
[ ] Have you considered/identified possible edge cases?
[ ] Does the query not reinvent features in the standard library?
[ ] Can the query be simplified further (not golfed!)?

Oct 26 '25 21:10 data-douser