Invalid html in BillStatus documents

Open evan-benoit opened this issue 4 years ago • 4 comments

Hello GovInfo! About 2% of the BILLSTATUS documents that we examine have faulty HTML in the <billSummaries> section. For example:

https://www.govinfo.gov/bulkdata/BILLSTATUS/117/s/BILLSTATUS-117s294.xml

The <billSummaries> section has unclosed <p> tags. Some of the <p> tags have a corresponding </p> tag, but others do not. Any idea why this is? Can anything be done about it?

Thanks! -Evan

evan-benoit avatar Jun 16 '21 10:06 evan-benoit

@evan-benoit - updated your comment to include code fencing around the tags.

I'm looking into this. If you can provide a few additional example IDs, that will help me to investigate. My initial thinking is that this is in the source data.

jonquandt avatar Jun 16 '21 10:06 jonquandt

Sure, here are a few other examples, all with unmatched <p> tags:

  • https://www.govinfo.gov/bulkdata/BILLSTATUS/117/hr/BILLSTATUS-117hr7.xml
  • https://www.govinfo.gov/bulkdata/BILLSTATUS/117/hr/BILLSTATUS-117hr1037.xml
  • https://www.govinfo.gov/bulkdata/BILLSTATUS/117/hr/BILLSTATUS-117hr78.xml

I'm finding this problem in about 2% of the BILLSTATUS documents.
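
For anyone who wants to check their own copies, here is a minimal sketch of one way to flag summaries with unmatched <p> tags. It is not the check used in this thread. It assumes the summary HTML sits as escaped or CDATA text inside `text` elements beneath `billSummaries`, and the helper names (`ParagraphCounter`, `unclosed_paragraphs`, `summaries_with_unclosed_paragraphs`) are made up for illustration. It only compares counts of opening and closing tags, so it will not catch every kind of malformed markup.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class ParagraphCounter(HTMLParser):
    """Counts <p> start tags and </p> end tags in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.opened = 0
        self.closed = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.opened += 1

    def handle_endtag(self, tag):
        if tag == "p":
            self.closed += 1


def unclosed_paragraphs(fragment):
    """Return (# of <p>) minus (# of </p>) for one HTML fragment."""
    counter = ParagraphCounter()
    counter.feed(fragment)
    counter.close()
    return counter.opened - counter.closed


def summaries_with_unclosed_paragraphs(xml_text):
    """Return (index, imbalance) pairs for summaries whose <p> counts differ.

    Assumption: each summary's HTML is the text content of a `text`
    element somewhere under a `billSummaries` element.
    """
    root = ET.fromstring(xml_text)
    flagged = []
    index = 0
    for summaries in root.iter("billSummaries"):
        for text_el in summaries.iter("text"):
            imbalance = unclosed_paragraphs(text_el.text or "")
            if imbalance != 0:
                flagged.append((index, imbalance))
            index += 1
    return flagged
```

Running `summaries_with_unclosed_paragraphs` over the contents of a downloaded BILLSTATUS file returns an empty list when every summary is balanced, which makes it easy to estimate the share of affected documents across a bulk download.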

evan-benoit avatar Jun 29 '21 13:06 evan-benoit

Thank you -- the team that helps supply this is aware of the issue and working to address it by replacing a legacy system. I don't know the exact timeline for this to be completed.

jonquandt avatar Jun 29 '21 13:06 jonquandt

Thanks, I appreciate the speedy response!

evan-benoit avatar Jun 29 '21 13:06 evan-benoit

As an update, this is still in progress upstream of us. It is being tracked by the Library of Congress here: https://github.com/LibraryOfCongress/api.congress.gov/issues/2

I am closing the issue here because it will end up being resolved upstream and then we will update our BILLSTATUS and BILLSUM files.

jonquandt avatar Jul 07 '23 14:07 jonquandt