
Optimize sitemaps and improve integration into Google Search Console

Open dlennox24 opened this issue 2 years ago • 9 comments

User Story

Currently, Google Search Console reports more pages from our sitemaps than are actually available in the catalog. We need to determine why this is happening and ensure that the sitemaps are generated correctly and ingested correctly by Google.

Acceptance Criteria

  • [ ] GIVEN the current extra sitemap data in Google Search Console THEN research the cause of the extra pages and of the lack of ingestion AND suggest a solution that reduces the extra pages and improves Google's ingestion of our sitemap data

Sketch

  • [ ] Determine the root cause of the mismatch between the number of records in the catalog and the number in the dynamic sitemap
  • [ ] Create suggestions to fix these discovered issues
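
One way to approach the first sketch item is to count what the sitemap index actually lists and compare that with the catalog's own record count. The following is a minimal sketch, not the team's tooling: it assumes the sitemaps follow the standard sitemaps.org XML format, and the function names and the `fetch` callable are hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemaps(index_xml: str) -> list[str]:
    """Return the child sitemap URLs listed in a <sitemapindex> document."""
    root = ET.fromstring(index_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:sitemap/sm:loc", NS)
        if loc.text
    ]


def count_urls(sitemap_xml: str) -> int:
    """Count the <url> entries in a single <urlset> sitemap."""
    root = ET.fromstring(sitemap_xml)
    return len(root.findall("sm:url", NS))


def total_urls(index_xml: str, fetch) -> int:
    """Sum the URL counts of every sitemap listed in the index.

    `fetch` is a hypothetical callable that maps a sitemap URL to that
    sitemap's XML text (for example, a thin wrapper around an HTTP GET).
    """
    return sum(count_urls(fetch(url)) for url in list_sitemaps(index_xml))
```

Comparing the result of `total_urls` with the number of datasets the catalog reports would show whether the mismatch originates in sitemap generation or only in how Search Console interprets the sitemaps.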

dlennox24 avatar Jun 15 '23 19:06 dlennox24

Search Console has detected our changes to the sitemap sizes and begun parsing the larger sitemaps. There are now a total of 38 individual sitemaps in the sitemap index with each containing roughly 10k records.

Image

dlennox24 avatar Jun 26 '23 17:06 dlennox24

Image

Image

Search Console is showing nearly 2.5 million potential pages to be indexed. Data.gov doesn't have this many pages (the sitemap is at ~380k pages). A large portion of these (~1.1M) are marked as 404s.

dlennox24 avatar Nov 03 '23 16:11 dlennox24

Image

Image

dlennox24 avatar Dec 08 '23 18:12 dlennox24

I do not have write access to 18F/dns. @FuhuXia it looks like you have access. Would you be able to add the following to the data.gov.tf file?

https://github.com/18F/dns/blob/main/terraform/data.gov.tf#L503

  records = [
    "621df521f1e44ac69a670f325dc86889",
    "v=spf1 ip4:34.193.244.109 include:gsa.gov ~all",
-   "n6fgn8dyh1hhqsmghskdplss7zp7yt7q"
+   "n6fgn8dyh1hhqsmghskdplss7zp7yt7q",
+   "google-site-verification=K1_M1KkxyZYMiqHHAmlUVcXgYxV6myWSNYAyLrUk_PA"
  ]
}
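
Once the change is merged, the published record can be checked from the command line. This is a rough sketch under two assumptions: that the `dig` CLI is installed, and that the TXT values come back one per line wrapped in quotes. The helper names are hypothetical.

```python
import subprocess
from typing import Optional

PREFIX = "google-site-verification="


def has_google_verification(txt_records: list[str], token: Optional[str] = None) -> bool:
    """Return True if any TXT value is a google-site-verification record.

    If `token` is given, the record must carry exactly that token.
    """
    for record in txt_records:
        value = record.strip().strip('"')  # dig wraps TXT values in quotes
        if value.startswith(PREFIX) and (token is None or value == PREFIX + token):
            return True
    return False


def lookup_txt(domain: str) -> list[str]:
    """Fetch the TXT records for `domain` using the `dig` CLI (assumed installed)."""
    result = subprocess.run(
        ["dig", "+short", "TXT", domain],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling `has_google_verification(lookup_txt("data.gov"))` should return `True` once the record has propagated.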

dlennox24 avatar Dec 21 '23 17:12 dlennox24

Hi @dlennox24, do you think this will be a one-time update, or would it be worth getting you added to the 18F organization?

btylerburton avatar Dec 21 '23 17:12 btylerburton

Hi @btylerburton! One time. This should give us verification for the whole data.gov domain including all subdomains within Google Search Console.

dlennox24 avatar Dec 21 '23 18:12 dlennox24

I can push up a PR today then.

btylerburton avatar Dec 21 '23 18:12 btylerburton

Ah yes, I remember now. We have to fork and then PR that into here.

btylerburton avatar Dec 21 '23 18:12 btylerburton

Thanks Tyler! The 18F PR was merged, and I verified that the data.gov domain is available in Search Console. I gave the team the same permissions they had on the other domain. This will allow us to capture and monitor any subdomain on data.gov.

Sitemap ingestion also appears stable, and Google reports that all sitemaps have been parsed.

Image

Non-indexed pages are still trending slightly down, and indexed pages are trending up. The vast majority of the non-indexed pages are 404s or duplicates. Google considers query URLs to be duplicates of one another, and that accounts for most of the duplicates (e.g., https://catalog.data.gov/dataset?tags=asthma&_organization_limit=0 is a dup of https://catalog.data.gov/dataset?organization=noaa-gov&_tags_limit=0). I believe the harvest process naturally creates some churn, so there will always be some 404s as datasets are removed or their names/URLs change.

Image
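
For ongoing monitoring, a list of non-indexed URLs exported from Search Console could be triaged by separating query-string variants from plain pages. This is only an illustrative heuristic and not Google's deduplication logic; the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def is_query_variant(url: str) -> bool:
    """True when the URL carries a query string, such as a faceted /dataset search."""
    return bool(urlsplit(url).query)


def group_by_path(urls: list[str]) -> dict[tuple[str, str], list[str]]:
    """Group URLs by (host, path), ignoring their query strings.

    URLs that land in the same group differ only by query parameters and
    are the likely candidates for being reported as duplicates.
    """
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        groups[(parts.netloc, parts.path)].append(url)
    return dict(groups)
```

With the two example URLs above, both fall into the single group `("catalog.data.gov", "/dataset")`. Any remaining 404s without a query string are then the ones worth checking against harvest churn.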

I believe we can move this ticket to done and move to a monitoring state for Search Console.

dlennox24 avatar Jan 04 '24 17:01 dlennox24