
Replace Broken Link Checker

Open btylerburton opened this issue 2 years ago • 16 comments

User Story

In order to ensure the quality of our sites, the data.gov team would like a reliable report on broken links.

Acceptance Criteria

  • [ ] GIVEN a list of pages
  • [ ] (optional) GIVEN a sitemap
  • [ ] (optional) GIVEN a list of sitemaps (e.g. catalog)
  • [ ] WHEN I run a scan THEN a list of dead URLs is reported
  • [ ] (optional) THEN a new issue is created in the data.gov repo

Background

The data.gov team currently uses a broken link checker for our static sites, but it is unreliable and consistently fails with false positives. The new link checker should, ideally, be configurable to ignore certain status codes or a list of pages, and should produce a report that can be "made green" in the near term, so that a failing report can be used to fail the build. As it stands now, the report is always failing, and not for valid reasons, so no triggers can be configured around its status.

Security Considerations (required)

Fixing old links will improve the quality of the site and the user experience, but will likely not address any security concerns related to any domains that have come into the possession of bad actors.

Sketch

  • [ ] Spike on the available options for link checkers
    • There are a number of pages like this: https://medevel.com/os-broken-link-checkers-to-improve-your-seo/
  • [ ] Test reliability, configurability, activity of the repo
  • [ ] Implement link checker in Static Site QA Template
  • [ ] Implement dependency / config upgrades in static sites:
    • [ ] https://github.com/GSA/datagov-11ty
    • [ ] https://github.com/GSA/resources.data.gov
    • [ ] https://github.com/GSA/data-strategy
    • [ ] https://github.com/GSA/us-data-federation
    • [ ] (optionally if supported) https://github.com/GSA/catalog.data.gov

Also related:

  • [ ] (optionally) Create an issue when broken links are reported https://github.com/GSA/data.gov/issues/2922
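The optional "create an issue" step could be sketched roughly as follows. This is a hypothetical sketch using the GitHub CLI, not anything implemented in the linked issue; the report filename and issue title are placeholder assumptions.

```shell
# Hypothetical sketch: open an issue in GSA/data.gov only when the link
# checker produced a non-empty report. REPORT is a placeholder path.
REPORT=broken-links.txt
if [ -s "$REPORT" ]; then
  gh issue create \
    --repo GSA/data.gov \
    --title "Broken link report $(date +%F)" \
    --body-file "$REPORT"
fi
```

In CI this would run as a final step after the scan, so the issue is only filed on real failures.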

btylerburton avatar Apr 11 '23 16:04 btylerburton

@btylerburton so might this one also be addressed by #4476 ?

hkdctol avatar Sep 29 '23 20:09 hkdctol

Yes ideally @hkdctol

btylerburton avatar Sep 29 '23 21:09 btylerburton

added new relic link crawler here

rshewitt avatar Nov 28 '23 17:11 rshewitt

clicking a point in a location graph navigates to the list of links tested. there's a difference in the links tested between htmlproofer and New Relic; htmlproofer may be traversing more than we need?

Image

rshewitt avatar Nov 28 '23 18:11 rshewitt

notes on new relic link checker:

  • so far it seems like there's no way to filter/ignore status codes
    • apparently a variety of status codes are monitored and should be visible in the resources page source & context
  • the link checker appears to focus entirely on anchor elements
    • images aren't checked
    • scripts aren't checked
  • check intervals range from every 5 minutes (most frequent) to once a day (least frequent). currently, our link checker generally runs every 1-2 weeks.

htmlproofer currently checks:

  • links
  • images
  • scripts
  • html validation errors

update: looks like the New Relic link checker can identify a variety of types

Image

rshewitt avatar Nov 28 '23 23:11 rshewitt

Can the link checker alert us to 404s? Can it post to Slack?

I just checked resources.data.gov and it shows no 404s, but I know that's not the case, as there are a few I confirmed from this run...

https://github.com/GSA/resources.data.gov/actions/runs/7006861012/job/19059663585

ex.

  • https://www.whitehouse.gov/sites/whitehouse.gov/files/omb/memoranda/2013/m-13-13.pdf
  • https://data.gov/glossary/

btylerburton avatar Nov 28 '23 23:11 btylerburton

  • I assume it would be able to alert us to 404s, but surprisingly none have occurred for resources.data.gov yet in any of the 6 locations in the monitor.
  • looks like there's a hook to Slack
  • I've noticed a discrepancy between what htmlproofer and New Relic check. so far, neither of those links appears to be checked in New Relic.

rshewitt avatar Nov 29 '23 01:11 rshewitt

after upgrading htmlproofer from 3.x to 5.x to potentially address some issues, the resources site produces 284 failures. this includes checks on links, images, scripts, and HTML validation. this is a considerable number of failures, and switching to another utility (see the alternatives linked in the sketch) won't fix them. some examples of failures worth mentioning:

  • localhost links
    • http://localhost:4000/resources/data-gov-open-data-howto/
  • insecure connections (e.g. ERR_SSL_VERSION_OR_CIPHER_MISMATCH)
    • https://viewer.nationalmap.gov/advanced-viewer/
  • anchor elements not containing a hyperlink reference (HTML validation error)

rshewitt avatar Dec 04 '23 17:12 rshewitt

summary of failures using htmlproofer with the following flags: --ignore-status-codes "301,302,401,403,429" --checks='Links,Images,Scripts,Html' --no-check-external-hash --no-check-internal-hash --no-enforce-https

  • datagov-11ty
    • 216 failures
  • resources.data.gov
    • 284 failures
  • data-strategy
    • 550 failures
  • us-data-federation
    • 5 failures

rshewitt avatar Dec 04 '23 18:12 rshewitt

pausing work on this until group discussion on how we want to proceed.

rshewitt avatar Dec 04 '23 18:12 rshewitt

let's chat about this at sync. looks like you found some good flags to use. however, I do believe we should be tracking the 4xx series as errors, since that means those pages aren't publicly accessible.

btylerburton avatar Dec 05 '23 17:12 btylerburton

htmlproofer offers a --only-4xx flag

rshewitt avatar Dec 05 '23 18:12 rshewitt

here are the errors for the 4 static sites. this is the raw data from the terminal, so if it's best I format them, let me know. I used these flags for the runs: --checks='Links,Images,Scripts,Html' --only-4xx --no-enforce-https --allow-missing-href --ignore-urls '/localhost./'. the error count for these will differ slightly from what I reported before because I'm using different flags. I think the ones I've chosen this time make sense, but I'm okay with changing them to whatever we want.
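Put together as a command line, the run with those flags would look roughly like this. This is a sketch, not the exact CI step: `./_site` is an assumed build output directory, and it presumes the `html-proofer` 5.x gem is installed.

```shell
# Sketch of an htmlproofer 5.x run with the flags discussed above;
# ./_site is an assumed output directory from the static site build.
htmlproofer ./_site \
  --checks='Links,Images,Scripts,Html' \
  --only-4xx \
  --no-enforce-https \
  --allow-missing-href \
  --ignore-urls '/localhost./'
```

The `--ignore-urls` value is a regex pattern, so localhost links from the local build are skipped rather than reported as dead.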

rshewitt avatar Dec 05 '23 20:12 rshewitt

In my prior role at NIST, we had great success using lychee to check links against a generated version of the site in CI. This workflow builds the site and runs link checking on the generated sources. The workflow is set up to work with Hugo, but other static site generators can easily be configured.

I am considering setting up something like this for fedramp.gov and marketplace.fedramp.gov.

david-waltermire avatar Mar 21 '24 15:03 david-waltermire

Thanks for the recommendation @david-waltermire! I also found that lychee has a GitHub Action, so it's even easier to road test than before: https://github.com/lycheeverse/lychee-action
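A minimal workflow using that action might look like the sketch below. The schedule, accepted status codes, glob, and version tags are assumptions for illustration, not a tested config for these repos; a real setup would also build the site first so generated pages get checked.

```yaml
# Hypothetical sketch of a lychee-action workflow; adjust paths and
# accepted status codes to match the team's chosen policy.
name: Broken Link Check
on:
  schedule:
    - cron: "0 6 * * 1"   # weekly; placeholder cadence
  workflow_dispatch:

jobs:
  link-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Check links with lychee
        uses: lycheeverse/lychee-action@v1
        with:
          # mirror the "tolerate redirects and rate limits" idea from above
          args: --no-progress --accept 200,301,302,429 './**/*.html'
          fail: true
```

With `fail: true`, a failing scan fails the workflow, which is exactly the "made green, then gate the build" behavior the user story asks for.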

btylerburton avatar Mar 21 '24 17:03 btylerburton