# Add robots.txt and sitemap to catalog on cloud.gov
## User Story
In order to make sure crawls can occur on our site, data.gov admins want the sitemap to be available on cloud.gov.
## Acceptance Criteria
- [ ] GIVEN an s3 bucket is provisioned for catalog
      WHEN the s3-to-sitemap function is run as a task on cloud.gov
      THEN all datasets are cataloged in the sitemap
## Background
The current robots.txt links to filestore.data.gov, which is the FCS s3 bucket. We need to port this information to cloud.gov and set up a recurring job (it currently runs every day at 5 am; see here).
Also, rough analysis shows that all datasets are published in the sitemap. This could likely be optimized to exclude collection-level records and include only the parent record, which might improve search engine optimization.
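As a rough illustration of that optimization, a filter along these lines could drop collection members before the sitemap is written. This is a hypothetical sketch, not the actual ckanext-geodatagov code; the `collection_package_id` extra used to mark child records is an assumption.

```python
# Hypothetical sketch: keep only parent records in the sitemap.
# Assumes each dataset dict has an "extras" list, and that child
# records point at their parent via a "collection_package_id" extra
# (the field name is an assumption, not confirmed from the codebase).

def is_collection_member(dataset):
    """Return True if the dataset is a child record of a collection."""
    return any(
        extra.get("key") == "collection_package_id" and extra.get("value")
        for extra in dataset.get("extras", [])
    )

def sitemap_datasets(datasets):
    """Return only the datasets that should appear in the sitemap."""
    return [d for d in datasets if not is_collection_member(d)]
```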
## Security Considerations (required)
None; all data is public.
## Sketch
- [ ] Create an s3 bucket for catalog in each environment (see inventory for an example); make sure it is for public use.
- [ ] Update code to pass the s3 bucket credentials to the ckanext-geodatagov code
- [ ] Validate that the code can create and push a sitemap to the configured s3 bucket in dev
- [ ] Set up a github action to run regularly (probably daily; the exact time doesn't really matter)
- [ ] Add/update robots.txt to point to the new s3 bucket.
More context: https://github.com/ckan/ckan/pull/5648#issuecomment-948274173
For now we can just point to the version that we pulled over from the FCS environment, which is substantially up-to-date.
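For illustration, the updated robots.txt would need little more than a `Sitemap` directive pointing at the new bucket (the bucket URL below is a placeholder, not the real one):

```
User-agent: *
Allow: /
Sitemap: https://<public-sitemap-bucket>.s3.amazonaws.com/sitemap/sitemap.xml
```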
This will require fixing the ckanext-geodatagov extension to upgrade the cli to py3 and ckan 2.9. See similar work done in ckanext-dcat_usmetadata:
- https://github.com/GSA/ckanext-dcat_usmetadata/pull/289/files
- https://github.com/GSA/ckanext-dcat_usmetadata/pull/291/files (make sure to bump the version)
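The recurring job in the sketch above could be driven by a scheduled GitHub Action that runs the cli as a cloud.gov task. A minimal sketch, assuming a cf v7 client and secrets named `CF_USER`/`CF_PASSWORD` (the org, space, and secret names are placeholders, not the actual workflow):

```yaml
name: sitemap-to-s3
on:
  schedule:
    - cron: "0 5 * * *"   # daily at 05:00, mirroring the old FCS job
  workflow_dispatch: {}

jobs:
  sitemap:
    runs-on: ubuntu-latest
    steps:
      - name: Run sitemap task on cloud.gov
        env:
          CF_USER: ${{ secrets.CF_USER }}
          CF_PASSWORD: ${{ secrets.CF_PASSWORD }}
        run: |
          cf api https://api.fr.cloud.gov
          cf auth "$CF_USER" "$CF_PASSWORD"
          cf target -o <org> -s <space>
          cf run-task catalog --command "ckan geodatagov sitemap-to-s3"
```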
Yesterday I spent a lot of time trying to track down issues with the requested change of moving requirements into setup.py. Something down the chain still requires boto, but I am not finding it and have been unable to run tests because of it.
This morning I had an issue getting local make commands to run: `Error response from daemon: invalid mount config for type "volume": invalid mount path: 'docker-entrypoint.d/* /docker-entrypoint.d' mount path must be absolute`. I have not made any changes to the entrypoint and am investigating. This does not happen with github actions, and it is the same reason I raised this issue last week; a lot of dev time has been spent tracking down these sorts of things.
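The error message suggests a volume spec where the container-side path is relative (and contains a glob), while compose requires it to be absolute. A hedged illustration of the shape of the fix, not the actual catalog compose file:

```yaml
services:
  nginx:
    volumes:
      # Broken: relative container path with a glob is rejected
      # - docker-entrypoint.d/* /docker-entrypoint.d
      # Working: bind-mount the whole directory; the container path is absolute
      - ./docker-entrypoint.d:/docker-entrypoint.d
```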
Thanks to everyone jumping on the huddle yesterday. Especially thanks to @nickumia-reisys for getting this unstuck with your PR. I should be able to do the last couple things on this ticket and move along.
This had been blocked by issues with building catalog that were resolved with this PR. My sitemap code is now available with the ckan geodatagov sitemap-to-s3 cli command. Running it, however:

Am trying to figure out how to debug in this environment.
The fix for the above is simply running the cli commands as a `cf run-task` instead of sshing into an app. Running the task surfaces a small bug where the `filename_number` is not incremented, but also an exception from the s3 uploading code: `raise ValueError(f'Required parameter {identifier} not set')`.
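For context on the increment bug, chunked sitemaps are typically written as a numbered series of files, with the number advancing once per chunk. A hypothetical sketch of that naming logic (the function and filenames are illustrative, not the actual ckanext-geodatagov code):

```python
# Hypothetical sketch of numbered sitemap chunk naming. Sitemap files
# are capped (the protocol allows at most 50,000 URLs per file), so
# large catalogs are split into sitemap-1.xml, sitemap-2.xml, ...

def chunk_names(total_datasets, per_file=50000):
    """Return one sitemap filename per chunk of datasets."""
    names = []
    filename_number = 1
    for _start in range(0, total_datasets, per_file):
        names.append(f"sitemap-{filename_number}.xml")
        filename_number += 1  # the reported bug: this increment was missing
    return names
```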
I have been blocked on platform issues. Our own upstream containers aren't building correctly (and are taking 2k+ seconds) on my system architecture:

I can change some build parameters, but I think I would need to rebuild the images upstream to support multi-platform builds. There is a related issue with the pyproj wheel, as the binary fails to build in my local venv.
In the absence of a local fix, I am looking into running tmate in the test action with the fancy new action.
My blockers yesterday turned out to be related to an issue with the ckan 2.9.6 release yesterday. Pinning the ckan version to 2.9.5 allowed tests to pass (though this will need to be updated at some point), and PRs in ckanext-geodatagov and catalog bumped the versions to allow testing with a new action: sitemap-to-s3.
Well, same issue with my refactor:

I have another idea to call the actual aws s3 cli from python, but that feels very janky.
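That "janky" fallback would amount to shelling out to the aws CLI from Python. A minimal sketch, assuming the `aws` binary is installed and credentialed in the task environment (bucket and key below are placeholders):

```python
import subprocess

def build_cp_command(local_path, bucket, key):
    """Build the `aws s3 cp` invocation for a sitemap upload."""
    return ["aws", "s3", "cp", local_path, f"s3://{bucket}/{key}"]

def upload_via_aws_cli(local_path, bucket, key):
    """Shell out to the aws CLI instead of calling boto directly."""
    subprocess.run(build_cp_command(local_path, bucket, key), check=True)
```

Keeping the command construction in its own function at least makes the escape hatch testable without touching s3.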
Well, it's not quite right but it's something:

Huzzah!
The s3 upload test has been failing. Hmm..

It's some sort of special magic: it can upload a file to the bucket, yet fails with only a message saying the bucket doesn't exist.
With https://github.com/GSA/ckanext-geodatagov/pull/224, the work on ckanext-geodatagov should be done (hopefully). The work on the catalog side should be done with https://github.com/GSA/catalog.data.gov/pull/578.
Looks like only the staging catalog-proxy has an issue with the sed command: ``ERR sed: -e expression #1, char 22: unknown option to `s'``. It should be the same as the similar command above. Am investigating.
🥳 https://catalog.data.gov/robots.txt now has a good link to the sitemap bucket! And a mostly valid sitemap file (it should be fully valid once PR 615 merges)!

!!!
With https://github.com/GSA/catalog.data.gov/pull/637 I believe all the work on this is done. The robots file correctly points to the sitemap. The sitemaps are being generated nightly by a github action. Each sitemap record correctly refers to a dataset:
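For reference, each record in the generated file follows the standard sitemap protocol shape; the dataset slug and date below are illustrative, not taken from the real output:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://catalog.data.gov/dataset/example-dataset-name</loc>
    <lastmod>2022-01-01</lastmod>
  </url>
</urlset>
```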
