SOLR phantom duplicate discovery

Open jbrown-xentity opened this issue 3 years ago • 3 comments

A duplicate record exists in SOLR that doesn't exist in the DB

  • https://catalog.data.gov/api/action/package_search?q=identifier:%2217253338-966d-43d8-975e-444a0a4ce05c%22
  • https://catalog.data.gov/dataset/influences-of-water-chemistry-on-eight-populations-of-rio-grande-cutthroat-trout-in-northe-24977 (first dataset, exists normally)
  • https://catalog.data.gov/dataset/influences-of-water-chemistry-on-eight-populations-of-rio-grande-cutthroat-trout-in-northe-c5c0c (second dataset, 404 not found)

Note that these have different names, so it's not a bug in SOLR that is causing this duplicate. For some reason CKAN is creating the dataset twice, but only in SOLR.

How to reproduce

  1. Unknown

Expected behavior

If a dataset doesn't exist in the DB, it can't exist in SOLR

Actual behavior

Duplicate record only exists in SOLR
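
To make the expected invariant concrete, here is a minimal sketch of how a phantom record could be detected: take every dataset name the search index returns and keep the ones the database does not know about. The function name `find_phantoms` and the `exists_in_db` callback are hypothetical and are not part of CKAN or of the data.gov tooling; in practice the callback might be backed by a `package_show` call or a direct DB query.

```python
from typing import Callable, Iterable, List


def find_phantoms(
    indexed_names: Iterable[str],
    exists_in_db: Callable[[str], bool],
) -> List[str]:
    """Return the indexed dataset names that have no matching DB record.

    `exists_in_db` is a hypothetical lookup supplied by the caller; this
    sketch makes no assumption about how the DB is actually queried.
    """
    return [name for name in indexed_names if not exists_in_db(name)]


# Illustration using the two names from this issue. The set below stands in
# for the DB, which only contains the first dataset.
first = (
    "influences-of-water-chemistry-on-eight-populations-of-"
    "rio-grande-cutthroat-trout-in-northe-24977"
)
second = (
    "influences-of-water-chemistry-on-eight-populations-of-"
    "rio-grande-cutthroat-trout-in-northe-c5c0c"
)
db_names = {first}
phantoms = find_phantoms([first, second], db_names.__contains__)
```

Here `phantoms` contains only the second name, which matches the 404 seen above.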

Sketch

Since this came from CKAN, we expect that it is related to a logic issue. It doesn't seem to be reproducible (it didn't occur in dev, and more duplicates aren't created when re-harvesting). This will be mitigated by https://github.com/GSA/data.gov/issues/2213, but that won't fix how this occurred initially. It could be that a restart caused the system to fail at the wrong moment, but we're not sure. This could theoretically be validated by examining logs. The goal of this ticket is to solve the problem (code, infrastructure, restarts, whatever it is) and stop this from occurring.

There should be a follow-up ticket to this one to:

  1. Discover how many items this affects
  2. Clean up any affected records. This may be done by utilizing https://github.com/GSA/data.gov/issues/2213.

jbrown-xentity avatar Sep 07 '22 20:09 jbrown-xentity

  • We can finish https://github.com/GSA/data.gov/issues/2213 first to clean the duplicates out of Solr, then wait for it to happen again and examine the logs.

  • One possible cause of the issue is that we force-restart the fetch process every 30 minutes. Let us change the way we restart it; that might help resolve this one. A new ticket has been created.

FuhuXia avatar Sep 20 '22 15:09 FuhuXia

Moving back to the backlog. Hopefully this is no longer an issue after the above two tickets have been addressed.

FuhuXia avatar Sep 28 '22 17:09 FuhuXia

Monitoring the scheduled db-solr-sync job:

10/12: 0 packages need to be removed from Solr; 1 packages need to be updated/added to Solr

10/13: 1 packages need to be removed from Solr; 4 packages need to be updated/added to Solr

10/14: 0 packages need to be removed from Solr; 5 packages need to be updated/added to Solr

Jin-Sun-tts avatar Oct 14 '22 14:10 Jin-Sun-tts

10/17: 0 packages need to be removed from Solr; 1 packages need to be updated/added to Solr

10/18: 0 packages need to be removed from Solr; 1 packages need to be updated/added to Solr
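
For readers unfamiliar with the job, the two counts in these log lines can be read as two comparisons between the DB and the index. The sketch below is an assumption about the general shape of such a sync, not the actual db-solr-sync code: `plan_sync`, `SyncPlan`, and the use of a last-modified string as the freshness check are all hypothetical.

```python
from typing import Dict, NamedTuple, Set


class SyncPlan(NamedTuple):
    to_remove: Set[str]  # indexed in Solr, but absent from the DB
    to_update: Set[str]  # in the DB, but missing or stale in Solr


def plan_sync(db: Dict[str, str], solr: Dict[str, str]) -> SyncPlan:
    """Compare two {package_id: last_modified} maps and plan the sync.

    Both maps are hypothetical inputs; how the real job collects them
    is not shown in this thread.
    """
    # Anything Solr knows about that the DB does not is a phantom record.
    to_remove = set(solr) - set(db)
    # Anything in the DB that Solr lacks, or holds with a different
    # modification stamp, needs to be (re)indexed.
    to_update = {pid for pid, modified in db.items() if solr.get(pid) != modified}
    return SyncPlan(to_remove, to_update)


# Made-up example data: "ghost" exists only in Solr, "b" is stale in Solr.
db = {"a": "2022-10-12T00:00:00", "b": "2022-10-13T00:00:00"}
solr = {
    "a": "2022-10-12T00:00:00",
    "b": "2022-10-01T00:00:00",
    "ghost": "2022-09-07T00:00:00",
}
plan = plan_sync(db, solr)
print(f"{len(plan.to_remove)} packages need to be removed from Solr")
print(f"{len(plan.to_update)} packages need to be updated/added to Solr")
```

A non-zero "removed" count, as on 10/13, would correspond to the phantom case described in this issue.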

Jin-Sun-tts avatar Oct 18 '22 13:10 Jin-Sun-tts

Marking as done

hkdctol avatar Oct 20 '22 20:10 hkdctol

This is awesome! This is the fruit of the daily db-solr-sync task that we run. Granted, this bug should not need to be fixed with a custom script in the way that it is written, but that's a different story haha...

This is indeed fixed right now, but if the db-solr-sync task were to break, this would come back. Ticket that automated this fix:

  • https://github.com/GSA/data.gov/issues/3985

nickumia-reisys avatar Feb 02 '23 22:02 nickumia-reisys