SOLR phantom duplicate discovery
A duplicate dataset record exists in SOLR but not in the DB
- https://catalog.data.gov/api/action/package_search?q=identifier:%2217253338-966d-43d8-975e-444a0a4ce05c%22
- https://catalog.data.gov/dataset/influences-of-water-chemistry-on-eight-populations-of-rio-grande-cutthroat-trout-in-northe-24977 (first dataset, exists normally)
- https://catalog.data.gov/dataset/influences-of-water-chemistry-on-eight-populations-of-rio-grande-cutthroat-trout-in-northe-c5c0c (second dataset, 404 not found)
Note that these have different names, so it is not a bug in SOLR that is causing this duplicate. For some reason CKAN is creating the record twice, but only in SOLR.
How to reproduce
- Unknown
Expected behavior
If a dataset doesn't exist in the DB, it should not exist in SOLR
Actual behavior
Duplicate record only exists in SOLR
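As a rough sketch of how the mismatch could be checked (not the project's actual tooling): take the dataset names returned by the search index and test each one against a DB-backed lookup. `find_phantoms` and `exists_in_db` are hypothetical names; in practice `exists_in_db` might wrap CKAN's `package_show` action, which is an assumption here. The sample data below simply mirrors the two names linked above.

```python
from typing import Callable, Iterable, List


def find_phantoms(
    indexed_names: Iterable[str],
    exists_in_db: Callable[[str], bool],
) -> List[str]:
    """Return names present in the search index that have no DB record."""
    return [name for name in indexed_names if not exists_in_db(name)]


# Illustrative data only: the two names from the links in this issue.
PREFIX = (
    "influences-of-water-chemistry-on-eight-populations-of-"
    "rio-grande-cutthroat-trout-in-northe-"
)
indexed = [PREFIX + "24977", PREFIX + "c5c0c"]  # what the search index returns
db_names = {PREFIX + "24977"}                   # what the DB actually holds

# Hypothetical DB check: a simple set lookup stands in for a real query.
phantoms = find_phantoms(indexed, lambda name: name in db_names)
print(phantoms)
```

Passing the lookup in as a callable keeps the check independent of how the DB is actually queried.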
Sketch
Since this came from CKAN, we expect that it is related to a logic issue. It doesn't seem to be reproducible (it didn't occur in dev, and more duplicates aren't created when re-harvesting). This will be mitigated by https://github.com/GSA/data.gov/issues/2213, but that won't fix how this occurred initially. It could be that a restart caused the system to fail at the wrong moment, but we are not sure. This could theoretically be validated by examining logs. The goal of this ticket is to solve the problem (code, infrastructure, restarts, whatever it is) and stop this from occurring.
There should be a follow-up ticket to this to:
- Discover how many items this affects
- Clean up any affected records. This may be done by utilizing https://github.com/GSA/data.gov/issues/2213.
-
We can finish https://github.com/GSA/data.gov/issues/2213 first to clean Solr of duplicates, then wait for it to happen again and examine the logs.
-
One possible cause of the issue is that we force-restart the fetch process every 30 minutes. Let us change the way we restart it; that might help resolve this one. A new ticket has been created.
Moving back to the backlog. Hopefully this is no longer an issue once the two tickets above have been addressed.
Monitoring the scheduled db-solr-sync job:
- 10/12: 0 packages need to be removed from Solr; 1 packages need to be updated/added to Solr
- 10/13: 1 packages need to be removed from Solr; 4 packages need to be updated/added to Solr
- 10/14: 0 packages need to be removed from Solr; 5 packages need to be updated/added to Solr
- 10/17: 0 packages need to be removed from Solr; 1 packages need to be updated/added to Solr
- 10/18: 0 packages need to be removed from Solr; 1 packages need to be updated/added to Solr
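For readers unfamiliar with this kind of job, here is a minimal sketch of the comparison that counts like these imply. It is not the actual db-solr-sync code. It assumes each side can be reduced to a mapping from package ID to a last-modified marker; `plan_sync` and the sample IDs are hypothetical.

```python
from typing import Dict, Set, Tuple


def plan_sync(db: Dict[str, str], solr: Dict[str, str]) -> Tuple[Set[str], Set[str]]:
    """Compare DB and Solr snapshots (package ID -> last-modified marker).

    Returns (to_remove, to_update):
      to_remove: IDs indexed in Solr but absent from the DB.
      to_update: IDs in the DB that are missing from Solr or stale there.
    """
    to_remove = set(solr) - set(db)
    to_update = {pid for pid, modified in db.items() if solr.get(pid) != modified}
    return to_remove, to_update


# Illustrative snapshots only.
db = {"a": "t1", "b": "t2", "c": "t1"}
solr = {"a": "t1", "b": "t0", "ghost": "t1"}  # "b" is stale, "ghost" is a phantom

to_remove, to_update = plan_sync(db, solr)
print(f"{len(to_remove)} packages need to be removed from Solr")
print(f"{len(to_update)} packages need to be updated/added to Solr")
```

Under this sketch, a phantom duplicate like the one reported above would land in `to_remove`, since its ID has no counterpart in the DB.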
Marking as done
This is awesome! These are the fruits of the daily db-solr-sync task that we run. Granted, this bug should not need to be fixed with a custom script in the way that it is written, but that's a different story haha...
This is indeed fixed right now, but if the db-solr-sync task were to break, this would come back. Ticket that automated this fix:
- https://github.com/GSA/data.gov/issues/3985