data.gov icon indicating copy to clipboard operation
data.gov copied to clipboard

Clear bad production data

Open jbrown-xentity opened this issue 3 years ago • 1 comments

Since we re-implemented the db-solr sync, we found that we have data in a bad state sitting in prod (not indexed on solr, but still valid). We need to clear this bad data

How to reproduce

  1. Examine duplicate records that don't have a correct/corresponding record in harvest_object table. A full table scan doesn't seem to complete, but you can look by harvest source:
SELECT COUNT(*), max(package.metadata_created) 
FROM package 
LEFT JOIN "group" ON package.owner_org = "group".id 
WHERE "group".name = 'dhs-gov' 
AND package.type = 'dataset' 
AND package.id NOT IN (SELECT package_id FROM harvest_object WHERE harvest_source_id = '803bdba9-bfcb-453c-ae2a-ed81f240ff5a' AND current);

Expected behavior

The above query should result in 0 datasets

Actual behavior

Thousands

Sketch

The following organizations have duplicates, this may be affecting all of them (or just some), and they are sorted in highest value order:

  • [ ] noaa-gov
  • [x] doi-gov (complete, re-harvest ongoing)
  • [ ] doc-gov
  • [ ] ca-gov
  • [x] dhs-gov (complete, re-harvest ongoing)
  • [ ] city-of-ferndale-michigan
  • [ ] lake-county-illinois
  • [ ] hhs-gov
  • [ ] state-of-oklahoma
  • [ ] ed-gov
  • [ ] usaid-gov
  • [ ] city-of-new-york
  • [ ] doe-gov
  • [ ] epa-gov
  • [ ] federal-laboratory-consortium
  • [ ] state-of-connecticut
  • [ ] usitc-gov
  • [ ] king-county-washington
  • [ ] city-of-tempe
  • [ ] state-of-new-york
  • [ ] national-institute-of-standards-and-technology
  • [ ] centers-for-disease-control-and-prevention
  • [ ] dot-gov
  • [ ] city-of-baltimore
  • [ ] city-of-austin
  • [ ] usda-gov
  • [ ] city-of-bloomington
  • [ ] city-of-los-angeles
  • [ ] va-gov
  • [ ] state-of-maryland
  • [ ] fcc-gov
  • [ ] rrb-gov
  • [ ] city-of-baton-rouge
  • [ ] fgdc-gov
  • [ ] doj-gov

The process to follow for each organization:

  • Find the offending harvest source (if more than one, will involve more analysis)
  • Get the harvest source id (via UI)
  • Get the organization id: SELECT id FROM "group" WHERE name = 'doc-gov';
  • If more than one harvest source for the org, need to update the following query
  • Replace the following query values org-id and harvest-source-id with the values above
UPDATE package 
SET state = 'to_delete' 
WHERE package.owner_org = 'org-id' 
AND package.type = 'dataset' 
AND package.id NOT IN (
  SELECT package_id 
  FROM harvest_object 
  WHERE harvest_source_id = 'harvest-source-id' 
  AND current
);
  • Clear the harvest source (either via UI or via run-task)
  • Re-Harvest the source

Please note that the above marking to_delete is to match ckanext-harvest clearing. We also need to delete the records in SOLR, so we need to do a full harvest clear. An alternative approach would be manually running the db-solr-sync job after deleting the records, and validating that the job removed the records from solr...

jbrown-xentity avatar Oct 12 '22 21:10 jbrown-xentity

Should review #3742 after resolution of this ticket, as that may be resolved as well.

jbrown-xentity avatar Oct 13 '22 18:10 jbrown-xentity

The following SQL script picks up the duplicates (from https://github.com/GSA/data.gov/issues/3567)

SELECT "group".name, COUNT(*) FROM package JOIN "group" ON package.owner_org = "group".id LEFT JOIN harvest_object ON package.id = harvest_object.package_id WHERE package.state='active' AND package.type='dataset' AND harvest_object.package_id IS NULL GROUP BY 1 ORDER BY 2 DESC ;

10-17-2022

doc-gov                                        | 23425
ca-gov                                         | 12455
noaa-gov                                       | 10869
city-of-ferndale-michigan                      |   795
usaid-gov                                      |   511
city-of-seattle                                |   494
state-of-connecticut                           |   474
lake-county-illinois                           |   348
vcgi-org                                       |   186
loudoun-county-virginia                        |   147
hhs-gov                                        |   138
city-of-baltimore                              |   111
state-of-oklahoma                              |   100
ed-gov                                         |    90
doe-gov                                        |    64
usgs-gov                                       |    54
city-of-new-york                               |    40
sba-gov                                        |    36
national-institute-of-standards-and-technology |    35
wake-county                                    |    33
epa-gov                                        |    33
doi-gov                                        |    30
federal-laboratory-consortium                  |    29
city-of-sioux-falls                            |    14
dot-gov                                        |    13
city-of-austin                                 |    11
usitc-gov                                      |     6
city-of-tempe                                  |     5
king-county-washington                         |     5
state-of-new-york                              |     5
centers-for-disease-control-and-prevention     |     4
fema-gov                                       |     3
usda-gov                                       |     3
city-of-los-angeles                            |     2
state-of-maryland                              |     2
city-of-bloomington                            |     2
va-gov                                         |     2
state-gov                                      |     1
rrb-gov                                        |     1
census-gov                                     |     1
doj-gov                                        |     1
city-of-baton-rouge                            |     1
fcc-gov                                        |     1 

Following sql to update the package state:

UPDATE package 
SET state = 'to_delete' 
where id in (
    SELECT package.id
    FROM package
    JOIN "group" ON package.owner_org = "group".id 
    LEFT JOIN harvest_object ON package.id = harvest_object.package_id 
    WHERE package.state='active' AND package.type='dataset'   
    AND harvest_object.package_id IS NULL  
    AND "group".name in (<group_names>)
);

Clear the harvest source in the sandbox test org. Run db-sync-solr job manually to clear solr.

Jin-Sun-tts avatar Oct 18 '22 13:10 Jin-Sun-tts

Will do following cleanup today:

10-18-2022

 doc-gov                                        | 23425
 ca-gov                                         | 12455
 noaa-gov                                       | 10869

Jin-Sun-tts avatar Oct 18 '22 13:10 Jin-Sun-tts

10-24-2022 There is new one duplicate in dhs-gov today.

Jin-Sun-tts avatar Oct 24 '22 19:10 Jin-Sun-tts

just cleaned up duplicates for ca-gov, it only took about 4 min for 12455 records with new deletion method (defer the commit to the end).

Jin-Sun-tts avatar Oct 24 '22 19:10 Jin-Sun-tts

just cleaned up duplicates for ca-gov, it only took about 4 min for 12455 records with new deletion method (defer the commit to the end).

That is 50/sec on deleting, faster than 10/sec adding/updating. What makes the speed difference?

FuhuXia avatar Oct 24 '22 19:10 FuhuXia

The new delete function, only has one solr connection for all deletions. And the adding/updating has new connection for each call.

Jin-Sun-tts avatar Oct 24 '22 19:10 Jin-Sun-tts

The following duplicates are also be cleared : doc-gov 23425 noaa-gov 10869

So there is no duplicates in DB as of today. Will continue monitor for couple days to see if we get new duplicates.

Jin-Sun-tts avatar Oct 24 '22 21:10 Jin-Sun-tts

checked the duplicate today, no new item returned

SELECT "group".name, COUNT(*) FROM package JOIN "group" ON package.owner_org = "group".id LEFT JOIN harvest_object ON package.id = harvest_object.package_id WHERE package.state='active' AND package.type='dataset' AND harvest_object.package_id IS NULL GROUP BY 1 ORDER BY 2 DESC ;

name | count ------+------- (0 rows)

Jin-Sun-tts avatar Oct 26 '22 17:10 Jin-Sun-tts

To eliminate packages that have no current harvest_object, we can use this query.

SELECT "group".name, COUNT(*)
FROM package
JOIN "group" ON package.owner_org = "group".id 
LEFT JOIN harvest_object ON package.id = harvest_object.package_id AND harvest_object.current
WHERE package.state='active'
AND package.type='dataset'
AND harvest_object.package_id IS NULL
GROUP BY 1 ORDER BY 2 DESC;

   name    | count
-----------+-------
 noaa-gov  | 37695
 usda-gov  |    27
 epa-gov   |    25
 usaid-gov |     8
 fgdc-gov  |     1
 dot-gov   |     1
 doi-gov   |     1
(7 rows)

FuhuXia avatar Nov 16 '22 17:11 FuhuXia

query result returns 0. This ticket can be closed.

FuhuXia avatar Nov 21 '22 20:11 FuhuXia

I might have confused https://github.com/GSA/data.gov/issues/3567 with this ticket. Or maybe the same thing happened twice? Even having 1.6 years of experience on data.gov, I wouldn't touch the production DB without Fuhu around. @Jin-Sun-tts did it twice!

This ticket removed the bad data from being searchable or discoverable by users. But it is still in the system (hence https://github.com/GSA/data.gov/issues/3999). As @FuhuXia mentioned above, the validating query above shows success for this ticket.

nickumia-reisys avatar Feb 02 '23 22:02 nickumia-reisys