Add robots.txt & sitemap to catalog on cloud.gov

Open jbrown-xentity opened this issue 4 years ago • 14 comments

User Story

In order to make sure crawls can occur on our site, data.gov admins want the sitemap to be available in cloud.gov.

Acceptance Criteria

[ACs should be clearly demoable/verifiable whenever possible. Try specifying them using BDD.]

  • [ ] GIVEN an s3 bucket is provisioned for catalog
    WHEN the s3-to-sitemap function is run as a task on cloud.gov
    THEN all datasets are cataloged in the sitemap

Background

The current robots.txt links to filestore.data.gov, which is the FCS s3 bucket. We need to port this information to cloud.gov and set up a recurring job (it currently runs every day at 5 am, see here).

Also, in a rough analysis, all datasets are published in the sitemap. This could be optimized to exclude collection-level data and include only the parent record, which might set us up to improve search engine optimization.

Security Considerations (required)

None, all public data

Sketch

  • [ ] Create an s3 bucket for catalog in each environment (see inventory for an example), and make sure it is for public use.
  • [ ] Update code to take the credentials of s3 bucket to use in the ckanext-geodatagov code
  • [ ] Validate code can create and push sitemap to configured s3 bucket in dev
  • [ ] Set up a GitHub Action to run regularly (probably daily, time doesn't really matter)
  • [ ] Add/update robots.txt to point to the new s3 bucket.
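
As a rough illustration of the "create and push sitemap" step in the sketch above, here is a minimal Python example. It is not the ckanext-geodatagov implementation. The file naming scheme (`sitemap-N.xml`), the helper names, and the `upload_sitemaps` function are assumptions made for illustration. The 50,000-URL limit and the XML namespace come from the sitemaps.org protocol. The upload helper expects an object with a boto3-style `put_object` method, such as a boto3 S3 client.

```python
from xml.sax.saxutils import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50000  # per-file limit from the sitemaps.org protocol


def build_sitemap(urls):
    """Return one <urlset> document listing the given URLs."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        f'<urlset xmlns="{SITEMAP_NS}">',
    ]
    for url in urls:
        # escape() protects characters such as & in dataset URLs
        lines.append(f"  <url><loc>{escape(url)}</loc></url>")
    lines.append("</urlset>")
    return "\n".join(lines)


def build_sitemap_files(urls, chunk_size=MAX_URLS_PER_FILE):
    """Split URLs into numbered files; returns {filename: xml_text}.

    The "sitemap-N.xml" naming is hypothetical.
    """
    files = {}
    for start in range(0, len(urls), chunk_size):
        # Derive the number from the offset so it always advances per chunk.
        filename_number = start // chunk_size + 1
        chunk = urls[start:start + chunk_size]
        files[f"sitemap-{filename_number}.xml"] = build_sitemap(chunk)
    return files


def upload_sitemaps(files, s3_client, bucket):
    """Upload each file using a boto3-style client (assumed interface)."""
    for name, xml_text in files.items():
        s3_client.put_object(
            Bucket=bucket,
            Key=name,
            Body=xml_text.encode("utf-8"),
            ContentType="application/xml",
        )
```

Deriving the file number from the chunk offset, rather than incrementing a separate counter, is one way to avoid a counter that silently fails to advance.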

jbrown-xentity avatar Jan 19 '22 17:01 jbrown-xentity

We will edit this and should probably remove this

jbrown-xentity avatar Jan 19 '22 19:01 jbrown-xentity

More context: https://github.com/ckan/ckan/pull/5648#issuecomment-948274173

nickumia-reisys avatar Jan 20 '22 17:01 nickumia-reisys

For now we can just point to the version that we pulled over from the FCS environment, which is substantially up-to-date.

mogul avatar Jan 20 '22 21:01 mogul

This will require fixing ckanext-geodatagov extension to upgrade the cli to py3 and ckan 2.9. See similar work done on ckanext-dcat_usmetadata:

  • https://github.com/GSA/ckanext-dcat_usmetadata/pull/289/files
  • https://github.com/GSA/ckanext-dcat_usmetadata/pull/291/files (make sure to bump the version)

jbrown-xentity avatar Jul 22 '22 16:07 jbrown-xentity

Yesterday, I spent a lot of time trying to track down issues with the requested change of moving requirements into the setup.py. Something down the chain is still requiring boto, but I am not finding it and have been unable to run tests because of it.

This morning I have had an issue with getting local make commands to run: Error response from daemon: invalid mount config for type "volume": invalid mount path: 'docker-entrypoint.d/* /docker-entrypoint.d' mount path must be absolute. I have not made any changes to the entry point... am investigating. This does not appear to happen with GitHub Actions, and it is the same reason I raised this issue last week: a lot of dev time has been spent tracking down these sorts of things.

robert-bryson avatar Sep 07 '22 16:09 robert-bryson

Thanks to everyone jumping on the huddle yesterday. Especially thanks to @nickumia-reisys for getting this unstuck with your PR. I should be able to do the last couple things on this ticket and move along.

robert-bryson avatar Sep 09 '22 16:09 robert-bryson

This had been blocked by issues with building catalog that were resolved with this PR. My sitemap code is now available with the ckan geodatagov sitemap-to-s3 cli command. Running it, however:

Image

Am trying to figure out how to debug in this environment.

robert-bryson avatar Sep 14 '22 16:09 robert-bryson

The fix for above is simply running the cli commands as a cf run-task instead of sshing into an app. Running the task raises a small bug with the filename_number not getting incremented, but also an exception: raise ValueError(f'Required parameter {identifier} not set') from the s3 uploading code.
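
An error of the form "Required parameter ... not set" often means a required identifier, such as the bucket name, reached the upload code as None, for example because an environment variable was not set in the task environment. That is only a guess about this particular failure. One way to surface the problem earlier is to validate the settings before uploading, as in the sketch below. The setting names `S3_BUCKET_NAME` and `S3_REGION` are hypothetical and are not the names used by ckanext-geodatagov.

```python
import os

# Hypothetical setting names, used only for illustration.
REQUIRED_SETTINGS = ("S3_BUCKET_NAME", "S3_REGION")


def load_s3_settings(environ=None):
    """Return the required S3 settings, or fail with a clear message.

    Checking up front names the missing setting directly instead of
    letting a None value fail deep inside the upload code.
    """
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_SETTINGS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing S3 settings: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_SETTINGS}
```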

robert-bryson avatar Sep 20 '22 16:09 robert-bryson

I have been blocked on platform issues. Our own upstream containers aren't building correctly (and taking 2k+ seconds) on my system architecture:

Image

I can change some build parameters, but I think I would need to rebuild the images upstream to support multi-platform. There is a related issue with the pyproj wheel as the binary fails to build in my local venv.

In the absence of a solve locally, I am looking into running tmate in test action with the fancy new action.

robert-bryson avatar Sep 28 '22 17:09 robert-bryson

My blockers yesterday turned out to be related to an issue with the ckan 2.9.6 release that same day. Pinning the ckan version to 2.9.5 allowed tests to pass (though the pin will need to be updated at some point), and PRs in ckanext-geodatagov and catalog bumped the versions to allow testing with a new action: sitemap-to-s3.

robert-bryson avatar Sep 29 '22 19:09 robert-bryson

Well, same issue with my refactor:

Image

I have another idea to call the actual aws s3 cli from python, but that feels very janky.

robert-bryson avatar Oct 06 '22 23:10 robert-bryson

Thanks to @nickumia-reisys's hard work, I'm unblocked and testing his solve.

robert-bryson avatar Oct 11 '22 16:10 robert-bryson

Well, it's not quite right but it's something:

Image

Huzzah!

robert-bryson avatar Oct 12 '22 18:10 robert-bryson

The s3 upload test has been failing. Hmm..

Image

It's some sort of special magic that it can upload a file to a bucket, but with only the message that the bucket doesn't exist.

robert-bryson avatar Oct 14 '22 17:10 robert-bryson

With https://github.com/GSA/ckanext-geodatagov/pull/224, the work on ckanext-geodatagov should be done (hopefully). The work on the catalog side should be done with https://github.com/GSA/catalog.data.gov/pull/578.

robert-bryson avatar Oct 18 '22 18:10 robert-bryson

Catalog work is merged and building!

robert-bryson avatar Oct 20 '22 19:10 robert-bryson

Looks like just staging catalog-proxy has an issue with the sed command: ERR sed: -e expression #1, char 22: unknown option to `s'. It should be the same as the similar command above. Am investigating.

robert-bryson avatar Oct 20 '22 20:10 robert-bryson

🥳 https://catalog.data.gov/robots.txt now has a good link to the sitemap bucket! And a mostly valid (should be on merge of PR 615) sitemap file!

robert-bryson avatar Oct 25 '22 15:10 robert-bryson

Image

!!!

robert-bryson avatar Oct 31 '22 19:10 robert-bryson

With https://github.com/GSA/catalog.data.gov/pull/637 I believe all the work on this is done. The robots file correctly points to the sitemap. The sitemaps are being generated nightly by a github action. Each sitemap record correctly refers to a dataset:

Image

robert-bryson avatar Nov 03 '22 16:11 robert-bryson
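
The final checks described above (robots.txt points to the sitemap, and each sitemap record refers to a dataset) could be automated along these lines. This is a sketch using only the Python standard library, and it works on text that has already been fetched. The function names and the assumption that dataset pages live under a `/dataset/` path are illustrative, not taken from the catalog code.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls_from_robots(robots_text):
    """Return the values of all 'Sitemap:' lines in a robots.txt body."""
    urls = []
    for line in robots_text.splitlines():
        # Split on the first colon only, so the URL's own colons survive.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def sitemap_locs(sitemap_xml):
    """Return every <loc> value from a <urlset> sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", NS)
        if loc.text
    ]


def non_dataset_locs(locs, prefix="/dataset/"):
    """Return the locs whose path does not start with the assumed prefix."""
    return [loc for loc in locs if not urlparse(loc).path.startswith(prefix)]
```

An empty list from `non_dataset_locs` would indicate that every record in that sitemap file points at a dataset page under the assumed prefix.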