Query for existing issues might fail silently, and a new issue is then created for every issue detected by the task
2024-07-23 06:00:42.000212 [INFO ] [info ] Checking for existing issues in the backend base_revision_changeset=ac4a1f84adfa69b77ccec3589f2a28ec7089fe10
2024-07-23 06:17:48.000023 [INFO ] [info ] Found 780 new issues (over 780 total detected issues) task=ZIJe3nlqQ4CvIivkOYtMNg
It took 17 min 06 s to query for known issues, yet no known issue has been detected by the task.
That duration is close to 16 min 40 s (1000 s), which suggests a timeout.
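As a quick sanity check, the elapsed time between the two log lines can be computed directly. This is only a minimal sketch: the timestamp format string is an assumption based on how the log lines are printed.

```python
from datetime import datetime

# Timestamps copied from the two log lines above.
# The format string is an assumption based on how they are printed.
FMT = "%Y-%m-%d %H:%M:%S.%f"
start = datetime.strptime("2024-07-23 06:00:42.000212", FMT)
end = datetime.strptime("2024-07-23 06:17:48.000023", FMT)

elapsed = (end - start).total_seconds()
minutes, seconds = divmod(round(elapsed), 60)

print(f"{elapsed:.0f} s elapsed ({minutes} min {seconds:02d} s)")
print(f"{elapsed - 1000:.0f} s more than 1000 s")
```

The query ran for about 1026 s, so roughly 26 s beyond the 1000 s mark.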
If there is a performance issue which would cause the retrieval of the data to fail, the creation of tickets for every issue afterwards will further degrade the performance of the code review server.
Should the bulk of the known issues be served from a downloaded artifact, with only the newest known issues fetched through an incremental query?
@La0 @marco-c
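To make the artifact idea a bit more concrete, here is a rough sketch of how the bot side could look. Everything in it is hypothetical: the artifact format (`generated_at`, `hashes`), the function names, and the `fetch_since` callable that stands in for an incremental backend query. None of this exists in the bot or the backend today.

```python
import io
import json
from typing import Callable, Iterable, List, Set, Tuple


def load_artifact(fp) -> Tuple[str, Set[str]]:
    """Read the bulk of known issue hashes from a downloaded JSON artifact.

    Hypothetical format: {"generated_at": "<ISO date>", "hashes": ["...", ...]}
    """
    data = json.load(fp)
    return data["generated_at"], set(data["hashes"])


def known_hashes(fp, fetch_since: Callable[[str], Iterable[str]]) -> Set[str]:
    """Merge the artifact with hashes created after it was generated.

    fetch_since is a hypothetical callable wrapping an incremental backend
    query; it receives the artifact date and returns the newer hashes.
    """
    generated_at, hashes = load_artifact(fp)
    return hashes | set(fetch_since(generated_at))


def new_issues(detected: Iterable[str], known: Set[str]) -> List[str]:
    """Keep only the detected hashes that are not already known."""
    return [h for h in detected if h not in known]


# Usage with in-memory stand-ins for the artifact and the incremental query.
artifact = io.StringIO('{"generated_at": "2024-07-22", "hashes": ["h1", "h2"]}')
known = known_hashes(artifact, lambda since: ["h3"])
print(new_issues(["h1", "h3", "h4"], known))
```

With this split, the expensive part becomes a static download, and the backend only has to answer a small query covering the time since the artifact was built.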
Time to create a ticket for a new issue varies between 0.5 and 5 seconds per ticket.
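For the 780 issues reported in the log above, that per-ticket range gives the following total, assuming tickets are created one after another (the sequential assumption is mine, not something confirmed in the logs):

```python
issue_count = 780            # "Found 780 new issues" in the log above
fastest, slowest = 0.5, 5.0  # seconds per ticket, as measured

# Assumes sequential creation, one ticket at a time.
low = issue_count * fastest
high = issue_count * slowest

print(f"{low:.0f} s to {high:.0f} s")
print(f"{low / 60:.1f} to {high / 60:.1f} minutes")
```

That is between 6.5 minutes and 65 minutes of ticket creation for a single task.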
The bot code only iterates over all issue paths and queries the list_repo_issues endpoint for each of them.
We could look into performance, and also into whether the whole output is even needed (the bot only consumes hashes).
Or even build a new endpoint that directly checks whether a hash is known for a specific path + repo: it may be way faster to query in the DB.
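A direct "is this hash known?" lookup could boil down to an indexed existence check. The sketch below uses an in-memory SQLite database purely as a stand-in; the table, columns, and index are hypothetical and do not reflect the backend's real schema or database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical flat table; the real backend models are more involved.
conn.executescript(
    """
    CREATE TABLE issue (
        repo TEXT NOT NULL,
        path TEXT NOT NULL,
        hash TEXT NOT NULL
    );
    -- Composite index so the lookup below never scans the whole table.
    CREATE INDEX issue_repo_path_hash ON issue (repo, path, hash);
    """
)
conn.executemany(
    "INSERT INTO issue (repo, path, hash) VALUES (?, ?, ?)",
    [
        ("mozilla-central", "dom/base/a.cpp", "aaa"),
        ("mozilla-central", "dom/base/b.cpp", "bbb"),
    ],
)


def is_known(conn: sqlite3.Connection, repo: str, path: str, issue_hash: str) -> bool:
    """Return True if this hash is already recorded for the repo and path."""
    row = conn.execute(
        "SELECT EXISTS("
        "SELECT 1 FROM issue WHERE repo = ? AND path = ? AND hash = ?"
        ")",
        (repo, path, issue_hash),
    ).fetchone()
    return bool(row[0])


print(is_known(conn, "mozilla-central", "dom/base/a.cpp", "aaa"))
print(is_known(conn, "mozilla-central", "dom/base/a.cpp", "zzz"))
```

The response would then be a single boolean per hash instead of a full serialized list of issues.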
I just started a manual backup on heroku so we can test locally for performance issues.
I was able to restore the backup, and test API queries. The list issue endpoint is indeed super-slow (taking several seconds per hit...)
I noticed a few immediate issues:
- no index on Revision.head_changeset & Issue.path, which are used to filter the endpoint
- we only need to serialize issue id & hash (so we only need to load these in the queryset)
- the main slow query is joining twice on IssueLink just because of the multiple .filter ORM calls: by aggregating all filters into a dict, then calling .filter once, the ORM becomes smarter and only makes a single join
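The last two points can be sketched as follows. The lookup names in the dict are hypothetical (the real relation names depend on the backend models), and the function takes any Django-style queryset rather than importing the actual models.

```python
def list_issue_hashes(queryset, repo_slug, path, revision_changeset):
    """Apply every filter in one .filter() call and load only id and hash.

    The lookup names below are hypothetical placeholders; adapt them to the
    real Issue / IssueLink / Revision relations.
    """
    filters = {
        "issue_links__revision__head_repository__slug": repo_slug,
        "issue_links__revision__head_changeset": revision_changeset,
        "path": path,
    }
    # A single .filter(**filters) lets the ORM reuse one join on the
    # multi-valued relation, whereas chained .filter() calls each add a join.
    # values_list restricts the SELECT to the two columns the bot consumes.
    return queryset.filter(**filters).values_list("id", "hash")
```

In a view this would be called with something like `Issue.objects.all()` as the queryset, and the serializer would then only need to expose the id and hash.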
I used the following test code & payload, but you can also simply hit the following url
from datetime import datetime
import json
from code_review_bot.backend import BackendAPI
from code_review_bot import taskcluster

taskcluster.secrets = {
    "backend": {
        "url": "http://localhost:8000",
        "username": "bot",
        "password": "Teklia12345",
    }
}

current_date = datetime.now().strftime("%Y-%m-%d")
api = BackendAPI()

with open("payload.json") as f:
    payload = json.load(f)

for path in payload["paths"]:
    print(path)
    out = api.list_repo_issues(
        "mozilla-central", date=current_date, revision_changeset=payload['revision_changeset'], path=path
    )
    print(out)