ODT format support - gov.uk scraper

Open jdu opened this issue 6 years ago • 2 comments

Looking at gov.uk, it seems that a lot of documents are moving to ODT (OpenDocument Text). I'm not sure whether any of the documents in the gov_uk site that we want data from are published as ODT, but it's worth investigating. At the moment their policies search page has predominantly HTML and PDF documents for download.
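
For context on what parsing would involve: an ODT file is a ZIP archive whose body text lives in `content.xml`, so a first pass at extraction needs only the standard library. Below is a minimal sketch, assuming we only want heading and paragraph text. The function name `extract_odt_text` is hypothetical and not part of reach, and the sketch ignores tabs, spacing elements, and footnotes.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument "text" namespace, as defined by the ODF specification.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_odt_text(data: bytes) -> str:
    """Return heading and paragraph text from the raw bytes of an ODT file.

    Hypothetical helper for illustration; not part of reach.
    """
    # An ODT file is a ZIP archive; the document body is in content.xml.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    # Collect headings (text:h) and paragraphs (text:p) in document order.
    wanted = {f"{{{TEXT_NS}}}h", f"{{{TEXT_NS}}}p"}
    lines = []
    for element in root.iter():
        if element.tag in wanted:
            # itertext() also picks up text inside nested spans and links.
            text = "".join(element.itertext()).strip()
            if text:
                lines.append(text)
    return "\n".join(lines)
```

A scraper could call this on the downloaded response body, e.g. `extract_odt_text(response_bytes)`, where `response_bytes` stands in for however the scraper exposes the file contents.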

jdu avatar Dec 02 '19 16:12 jdu

Yup, this is super common among govt. There is pressure to move away from PDF into a more 'open' format, which has traditionally meant ODT, but you may also see some publications just being switched over to straight HTML (probably a smaller proportion).

ivyleavedtoadflax avatar Dec 11 '19 16:12 ivyleavedtoadflax

That's fine. This is pegged against a 1.1.0 release of reach (so post-alpha) to address scraping and parsing ODT and the other file formats available on some of the sites. Some of the ones I've seen lead into what are essentially micro-sites where content is split across a sub-menu, so if we need to pull in that data as well, we'll need to scrape a set of sub-pages as a single unit comprising one policy document. Luckily our scrapers are fairly simple at the moment, so we can make some design decisions now on how we're going to identify and handle the different "content types" that we're likely to need to scrape.

This will likely break down into multiple concrete development tasks to handle the different types once we get past alpha.
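
As a rough illustration of the "content type" idea, and not a design proposal, the sketch below identifies a type from the `Content-Type` header or the URL extension and groups several scraped pages under one policy document. All names here (`identify_content_type`, `PolicyDocument`, the two lookup tables) are hypothetical and do not exist in reach.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical lookup tables; extend as new formats turn up.
MIME_TYPES = {
    "application/pdf": "pdf",
    "application/vnd.oasis.opendocument.text": "odt",
    "text/html": "html",
}
EXTENSION_TYPES = {".pdf": "pdf", ".odt": "odt", ".html": "html", ".htm": "html"}


def identify_content_type(url, content_type_header=None):
    """Guess a content type, preferring the HTTP header over the URL suffix."""
    if content_type_header:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        mime = content_type_header.split(";", 1)[0].strip().lower()
        if mime in MIME_TYPES:
            return MIME_TYPES[mime]
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return EXTENSION_TYPES.get(suffix, "unknown")


@dataclass
class PolicyDocument:
    """One policy document that may be split across several scraped pages."""

    root_url: str
    parts: list = field(default_factory=list)  # (url, content_type) pairs

    def add_part(self, url, content_type_header=None):
        self.parts.append((url, identify_content_type(url, content_type_header)))
```

With something like this, a micro-site could be represented as one `PolicyDocument` whose `parts` hold each sub-page, and each part's content type would decide which parser handles it.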

jdu avatar Dec 12 '19 11:12 jdu