Replace Broken Link Checker
User Story
In order to ensure the quality of our sites, the datagov team would like a reliable report on broken links.
Acceptance Criteria
- [ ] GIVEN a list of pages
- [ ] (optional) GIVEN a sitemap
- [ ] (optional) GIVEN a list of sitemaps (ex. catalog)
- [ ] WHEN I run a scan THEN a list of dead URLs is reported
- [ ] (optional) THEN a new issue is created in the datagov repo
Background
The datagov team currently uses a broken link checker for our static sites, but it's unreliable and consistently fails with false positives. The new link checker should, ideally, be configurable to ignore certain status codes or a list of pages, and should produce a report that can be "made green" in the near term, so that a failing report can then be used to fail the build. As it stands now, the report is always failing, and not for valid reasons, so no triggers can be configured around its status.
Security Considerations (required)
Fixing old links will improve the quality of the site and the user experience, but will likely not address any security concerns related to any domains that have come into the possession of bad actors.
Sketch
- [ ] Spike on the available options for link checkers
  - There are a number of pages like this: https://medevel.com/os-broken-link-checkers-to-improve-your-seo/
- [ ] Test reliability, configurability, activity of the repo
- [ ] Implement link checker in Static Site QA Template
- [ ] Implement dependency / config upgrades in static sites:
  - [ ] https://github.com/GSA/datagov-11ty
  - [ ] https://github.com/GSA/resources.data.gov
  - [ ] https://github.com/GSA/data-strategy
  - [ ] https://github.com/GSA/us-data-federation
  - [ ] (optionally if supported) https://github.com/GSA/catalog.data.gov
Also related:
- [ ] (optionally) Create an issue when broken links are reported https://github.com/GSA/data.gov/issues/2922
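If we go the GitHub Actions route, one way the issue-creation piece could be wired up is a follow-up step that opens an issue with the GitHub CLI whenever the link-check step fails. The sketch below is illustrative only; the step name, issue text, and token handling are assumptions, not a settled design.

```yaml
# Hypothetical follow-up step: opens an issue when an earlier link-check step fails.
# The issue title/body and token usage here are illustrative assumptions.
- name: Open an issue for broken links
  if: failure()
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    gh issue create \
      --repo GSA/data.gov \
      --title "Broken links reported by the link checker" \
      --body "See the failing run: ${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }}"
```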
@btylerburton so might this one also be addressed by #4476?
Yes ideally @hkdctol
Added a New Relic link crawler here.
Clicking a point in a location graph navigates to the list of links tested. There's a difference in which links are tested between htmlproofer and New Relic; htmlproofer may be traversing more than we need?
Notes on the New Relic link checker:
- so far it seems like there's no way to filter/ignore status codes
- the link checker appears to focus entirely on anchor elements
- images aren't checked
- scripts aren't checked
- the least frequent check interval is 1 day and the most frequent is 5 minutes; currently, our link checker generally runs every 1-2 weeks
htmlproofer currently checks:
- links
- images
- scripts
- HTML validation errors
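For reference, selecting those checks explicitly with the htmlproofer CLI might look like the sketch below; the ./_site output path and bundle exec invocation are assumptions about the site build, not the current config.

```yaml
# Hypothetical CI step; assumes the static site has already been built into ./_site
- name: Run htmlproofer
  run: |
    bundle exec htmlproofer ./_site \
      --checks='Links,Images,Scripts,Html'
```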
Update: it looks like the New Relic link checker can identify a variety of types.
Can the link checker alert us to 404s? Can it post to Slack?
I just checked resources.data.gov and it shows no 404s, but I know that's not the case, as there are a few I confirmed from this run...
https://github.com/GSA/resources.data.gov/actions/runs/7006861012/job/19059663585
ex.
- https://www.whitehouse.gov/sites/whitehouse.gov/files/omb/memoranda/2013/m-13-13.pdf
- https://data.gov/glossary/
- I assume it would be able to alert us to 404s, but surprisingly none have occurred for resources yet in any of the 6 locations in the monitor.
- looks like there's a hook to Slack
- I've noticed a discrepancy between what htmlproofer and New Relic check. So far, neither of those links appears to be checked in New Relic.
After upgrading htmlproofer from 3.x to 5.x to potentially address some issues, the resources site produces 284 failures. This includes checks on links, images, scripts, and HTML validation. This is a considerable number of failures, and switching to another utility (see the alternatives linked in the sketch) won't fix them. Some examples of failures worth mentioning:
- localhost links
  - http://localhost:4000/resources/data-gov-open-data-howto/
- insecure connections (e.g. ERR_SSL_VERSION_OR_CIPHER_MISMATCH)
  - https://viewer.nationalmap.gov/advanced-viewer/
- anchor elements not containing a hyperlink reference (HTML validation error)
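A couple of these categories could likely be filtered using flags that come up later in this thread (--ignore-urls and --allow-missing-href); a hedged sketch with illustrative values is below. Genuinely unreachable hosts like the nationalmap viewer would still fail unless they are ignored explicitly or the content is fixed.

```yaml
# Hypothetical step; flag values are illustrative, not a settled config.
# --ignore-urls skips localhost links left over from local builds;
# --allow-missing-href covers anchor elements flagged for having no href.
- name: Run htmlproofer with noisy failure classes filtered
  run: |
    bundle exec htmlproofer ./_site \
      --ignore-urls '/localhost/' \
      --allow-missing-href
```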
Summary of failures using htmlproofer with the following flags: --ignore-status-codes "301,302,401,403,429" --checks='Links,Images,Scripts,Html' --no-check-external-hash --no-check-internal-hash --no-enforce-https
- datagov-11ty: 216 failures
- resources.data.gov: 284 failures
- data-strategy: 550 failures
- us-data-federation: 5 failures
Pausing work on this until a group discussion on how we want to proceed.
Let's chat about this at sync. Looks like you found some good flags to use. However, I do believe we should be tracking the 4xx series as errors, since that means those pages are not publicly accessible.
htmlproofer offers a --only-4xx flag
Here are the errors for the 4 static sites. This is the raw data from the terminal, so if it's best that I format them, let me know. I used these flags for the runs: --checks='Links,Images,Scripts,Html' --only-4xx --no-enforce-https --allow-missing-href --ignore-urls '/localhost./'. The error count for these will differ slightly from what I reported before because I'm using different flags. I think the ones I've chosen this time make sense, but I'm okay with changing them to whatever we want.
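For reference, that flag set rendered as a CI step might look like the sketch below; the ./_site path and bundler invocation are assumptions about the build, while the flag values are copied from the runs described above.

```yaml
# Hypothetical CI step using the flag set from these runs; the build output
# path and bundler invocation are assumptions, not the current workflow.
- name: Run htmlproofer (4xx-only report)
  run: |
    bundle exec htmlproofer ./_site \
      --checks='Links,Images,Scripts,Html' \
      --only-4xx \
      --no-enforce-https \
      --allow-missing-href \
      --ignore-urls '/localhost./'
```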
In my prior role at NIST, we had great success using lychee to check links against a generated version of the site in CI. This workflow builds the site and runs link checking on the generated sources. The workflow is set up to work with Hugo, but other static site generators can easily be configured.
I am considering setting up something like this for fedramp.gov and marketplace.fedramp.gov.
Thanks for the recommendation @david-waltermire! I also found that lychee has a GitHub Action, so it's even easier to road test than before: https://github.com/lycheeverse/lychee-action
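A minimal sketch of what that action could look like in one of our workflows is below; the version tag, schedule, build step, and args are assumptions to verify against the action's README rather than a working config.

```yaml
# Hypothetical workflow sketch for lychee-action; the version tag, paths,
# schedule, and args should be confirmed against the action's documentation.
name: Link check
on:
  schedule:
    - cron: "0 6 * * 1"   # weekly, roughly matching our current 1-2 week cadence
  workflow_dispatch:
jobs:
  link-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # (site build step elided; lychee is pointed at generated HTML output here)
      - name: Run lychee
        uses: lycheeverse/lychee-action@v2
        with:
          args: --no-progress './_site/**/*.html'
```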