Add dataset: europeana_newspapers
A URL for this dataset
https://pro.europeana.eu/page/iiif#download
Dataset description
This is a dataset of historic newspapers digitised by various national libraries and made available via the Europeana platform.
Dataset modality
Text
Dataset licence
Other licence
Public Domain Mark for full text and http://creativecommons.org/publicdomain/zero/1.0/ for the metadata
How can you access this data
As a download from a repository/website
Confirm the dataset has an open licence
- [X] To the best of my knowledge, this dataset is accessible via an open licence
Contact details for data custodian
No response
I suggest leaving this as a candidate dataset until we have worked out the best approach. Tagging others who have been discussing this: @bmschmidt @stefan-it
Data access:
Currently, we have a few options for accessing the data:
- use the data from https://pro.europeana.eu/page/iiif#download
- use the API for access (do this once and save output)
- use the API inside the loading script (I think this is only a good idea if we're sure more titles will be added regularly)
Sharing the data:
Since this is a large corpus, and there may also be a little bit of concern about things being removed from the Europeana website, I think it makes sense to upload a version of the data that has been made more amenable for computational research. @bmschmidt has some code for doing this we could use as a starting point.
There are then a few options we need to decide on:
- do we create one 'top level' dataset and use the loading script to allow people to control what they download?
- what filters are likely to be helpful when downloading data? The ones I can immediately think of:
  - date of publication (min, max)
  - OCR quality
  - language
  - possibly title
  - possibly source country
- what fields do we want to include in the data? We want a balance between not losing too much fidelity and including surplus data that isn't going to be useful for most users (for very bespoke use cases, people can always go back to the XML).
Could you explain the notion of a "loading script"? I don't think I understand how the huggingface model--which seems to be basically organized hierarchically--works with something like this.
Especially around what seems like the fundamental question which is file ordering. Like I think it makes sense to have files be individual newspapers (or chronological subsets of newspapers), but that means there's waste if you try to subset by date of publication; and vice versa.
> Could you explain the notion of a "loading script"? I don't think I understand how the huggingface model--which seems to be basically organized hierarchically--works with something like this.
This depends a bit on how we decide to distribute the files. But one option is to have a dataset_script, which allows some control over what parts of the data are loaded. For example, if we have a bunch of files with a naming structure like TITLE_ID_YEAR.arrow, a script could be used to only load the requested parts. This means in practice when someone downloads those files, they only need to download the files they actually plan to use. It is also possible to do this filtering once files are downloaded but obviously, this only saves some processing time/space since the data/files still had to be downloaded before.
> Especially around what seems like the fundamental question which is file ordering. Like I think it makes sense to have files be individual newspapers (or chronological subsets of newspapers), but that means there's waste if you try to subset by date of publication; and vice versa.
Perhaps a compromise between granularity and keeping the total number of files reasonable would be to organize each title into decade (or some other time span) buckets. Something like:
TITLE_A_1850_1859.arrow
TITLE_A_1860_1869.arrow
TITLE_A_1870_1879.arrow
TITLE_B_1850_1859.arrow
Then the dataset script can filter which files to load/download. I'll try and dig out some example scripts that have this kind of functionality and link them here. Happy to hear other suggestions for structuring things too.
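As a rough illustration of the filtering step, here is a minimal sketch that picks files by title and year range from the naming pattern above. The `TITLE_START_END.arrow` pattern is only the example scheme from this comment, and `select_files` is a hypothetical helper, not part of any existing loading script.

```python
import re
from typing import Iterable, List, Optional, Set

# Hypothetical naming scheme taken from the example above: TITLE_START_END.arrow
FILENAME_RE = re.compile(r"^(?P<title>.+)_(?P<start>\d{4})_(?P<end>\d{4})\.arrow$")


def select_files(
    filenames: Iterable[str],
    titles: Optional[Set[str]] = None,
    min_year: Optional[int] = None,
    max_year: Optional[int] = None,
) -> List[str]:
    """Return the files whose title matches and whose year bucket overlaps the request."""
    selected = []
    for name in filenames:
        match = FILENAME_RE.match(name)
        if match is None:
            continue  # ignore files that do not follow the naming scheme
        start, end = int(match["start"]), int(match["end"])
        if titles is not None and match["title"] not in titles:
            continue
        if min_year is not None and end < min_year:
            continue  # bucket ends before the requested range starts
        if max_year is not None and start > max_year:
            continue  # bucket starts after the requested range ends
        selected.append(name)
    return selected


files = [
    "TITLE_A_1850_1859.arrow",
    "TITLE_A_1860_1869.arrow",
    "TITLE_A_1870_1879.arrow",
    "TITLE_B_1850_1859.arrow",
]
wanted = select_files(files, min_year=1855, max_year=1865)
```

A bucket is kept whenever it overlaps the requested range, so a request for 1855 to 1865 still downloads whole decade files and any finer filtering has to happen after loading.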
It seems like it would be possible to create the dataset according to reasonable chunkings and then afterwards write any post-hoc loading scripts that seemed like they'd be especially important? ("Austrian Papers," "German-language papers", "Communist papers", "Papers that published in the 1870s," etc.?)
FWIW, my solution for this was to break up newspapers into multiple files only when they got above a certain size. There are a lot of weekly or monthly publications of only a few pages which run for 20-30 years that it might be overkill to break up by decade. I think that I chunked by day but no more--it would certainly make sense to round to the nearest year to trim corpus items.
In terms of metadata--I think that the smallest unit of text should be the page, and that as much non-redundant metadata as possible should be supplied about each page. Columnar compression means that this won't be especially wasteful.
> FWIW, my solution for this was to break up newspapers into multiple files only when they got above a certain size. There are a lot of weekly or monthly publications of only a few pages which run for 20-30 years that it might be overkill to break up by decade. I think that I chunked by day but no more--it would certainly make sense to round to the nearest year to trim corpus items.
I guess we could also use bigger chunks if we're going to end up with some very small slices. This is probably a little unavoidable for titles with very few issues but perhaps using 20 year chunks or larger would make sense for these titles?
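One way to combine both ideas is to let the chunk span grow until a size limit is reached, so small titles end up in one long chunk and large titles are split more finely. The sketch below assumes page counts per year are already known and uses them as a stand-in for file size; `chunk_years` and the numbers are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def chunk_years(pages_per_year: Dict[int, int], max_pages: int) -> List[Tuple[int, int]]:
    """Greedily merge consecutive years into (start, end) chunks of at most
    max_pages pages. A single year above the limit still gets its own chunk."""
    chunks: List[Tuple[int, int]] = []
    start: Optional[int] = None
    end: Optional[int] = None
    total = 0
    for year in sorted(pages_per_year):
        pages = pages_per_year[year]
        if start is not None and total + pages > max_pages:
            chunks.append((start, end))  # close the current chunk
            start, total = None, 0
        if start is None:
            start = year
        end = year
        total += pages
    if start is not None:
        chunks.append((start, end))
    return chunks


# Hypothetical page counts for a small weekly and a large daily title.
small_weekly = {1850: 200, 1851: 200, 1852: 200}
large_daily = {1850: 800, 1851: 900, 1852: 100}
weekly_chunks = chunk_years(small_weekly, max_pages=1000)
daily_chunks = chunk_years(large_daily, max_pages=1000)
```

The trade-off is that chunk boundaries then differ per title, so the file names (or an index file) need to record the actual year range of each chunk.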
> In terms of metadata--I think that the smallest unit of text should be the page, and that as much non-redundant metadata as possible should be supplied about each page. Columnar compression means that this won't be especially wasteful.
Agree with this. I also feel uneasy using suggested article segmentation information since the quality of that can be so variable and will vary between titles/dates of publication. Do you start from the ALTO XML in your current script?
Hi guys!
I worked with the Europeana dumps last week/weekend. Here are some observations:
The language information is stored in the dc:language field of edm:ProvidedCHO. Normally, you would expect either a string or an array of languages. But... it is mixed: for some issues it is a string and for others it is an array. So in our final metadata representation we should always use an array. Also, the language code is two-letter, e.g. "de" instead of "deu".
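A minimal sketch of that normalisation, assuming the parsed metadata yields either a single string or a list of strings for dc:language. `normalize_languages` is a hypothetical helper, and dropping non-alphabetic values is an extra assumption rather than something the dump documentation prescribes.

```python
from typing import List, Optional, Union


def normalize_languages(value: Optional[Union[str, List[str]]]) -> List[str]:
    """Always return dc:language as a list of lower-cased codes without duplicates."""
    if value is None:
        return []
    if isinstance(value, str):
        value = [value]  # single language stored as a plain string
    cleaned: List[str] = []
    for code in value:
        code = code.strip().lower()
        # Assumption: anything that is not purely alphabetic is a placeholder
        # or data error and is dropped instead of being kept as a language.
        if code.isalpha() and code not in cleaned:
            cleaned.append(code)
    return cleaned


single = normalize_languages("de")
mixed = normalize_languages(["DE", "fr", "de"])
```

Keeping the original value in a separate field would make it possible to audit what was dropped.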
Regarding the OCR confidence: it is not stored in the metadata dump. It is only stored in ALTO, at word level (!), so page-level or issue-level values have to be calculated manually. And you definitely need to download the ALTO dump for that!
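To make the aggregation concrete, here is a sketch that averages the word-level `WC` attribute of ALTO `String` elements for one page. It uses only the standard library and matches on local element names, so it does not depend on the exact ALTO namespace; `page_ocr_confidence` is a hypothetical helper, and an unweighted mean is just one possible choice.

```python
import statistics
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def page_ocr_confidence(alto_xml: str) -> Optional[Tuple[float, float]]:
    """Return (mean, population std) of word confidences, or None if there are none."""
    root = ET.fromstring(alto_xml)
    confidences = []
    for element in root.iter():
        # Compare local names so the sketch works regardless of the namespace URI.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        wc = element.get("WC")
        if wc is not None:
            confidences.append(float(wc))
    if not confidences:
        return None
    return statistics.mean(confidences), statistics.pstdev(confidences)


# Tiny made-up ALTO fragment, for illustration only.
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Vorarlberger" WC="0.9"/><SP/>
    <String CONTENT="Zeitung" WC="0.7"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>"""
result = page_ocr_confidence(sample)
```

Weighting each word by its length, or aggregating over all pages of an issue, would be straightforward variations.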
For German and French I did create some plots that show the number of issues per year, based on the language information in the metadata and using the dcterms:issued information.
For German it is:
[Plot: number of issues per year, German]

For French:
[Plot: number of issues per year, French]
So it seems that French data is very limited. I talked to @cneud and a license change from public domain to Gallica could explain that.
I've also extracted plain text data from ALTO files for German. The resulting plain text file has a size of 63GB. For pretraining the German Europeana BERT models I've used an older dump and the resulting plain text data had a size of 51GB, so this newer dump is larger.
Thanks so much for that @stefan-it. @bmschmidt @stefan-it, my suggested next step is to start with the smallest dataset from that dump to get to a format we're happy with. This will likely involve starting from the ALTO XML.
I think between us we probably all have some code for doing the ALTO XML parsing. As a starting point, I suggest we share that code (either linking here or adding a pull request to this repository), so we're not starting from scratch.
Does that sound okay to you both?
Hi, just to briefly chime in (I hope I can devote more time to this tomorrow) - I have a lot of background info, provenance and documentation about these datasets. While I am not passionate about the data formatting, I would appreciate a lot if this information can somehow be integrated with the dataset (e.g. as a simple README.txt), as I often get questions about this and I believe there is a lot more relevant information available than what is shared on Europeana. Any thoughts on how to best include this are very welcome! Otherwise I can offer to write sth down as Markdown or plain text when we have a shared repo.
> Hi, just to briefly chime in (I hope I can devote more time to this tomorrow) - I have a lot of background info, provenance and documentation about these datasets. While I am not passionate about the data formatting, I would appreciate a lot if this information can somehow be integrated with the dataset (e.g. as a simple README.txt), as I often get questions about this and I believe there is a lot more relevant information available than what is shared on Europeana. Any thoughts on how to best include this are very welcome! Otherwise I can offer to write sth down as Markdown or plain text when we have a shared repo.
That would be great — one option would be to include this in the datacard? We could also include it as part of the dataset too.
It would also be great to have any context for this data. If you think there is anyone in particular at Europeana who would be good to keep in the loop about this work, let me know.
> one option would be to include this in the datacard
Good suggestion, but indeed I wonder if the datacard will always be distributed with the data? If not, a simple README.txt might be more suitable perhaps?
> anyone in particular at Europeana
Well, that would mainly be me as I was coordinator of the project where the data was produced :) I have also been working with/been in contact with ~20-30 researchers/initiatives that used this dataset, created subsets and derivatives etc which may also be worthwhile sharing. And I can also name a colleague employed by Europeana whom we should loop in once any concrete steps are taken.
> Well, that would mainly be me as I was coordinator of the project where the data was produced :) I have also been working with/been in contact with ~20-30 researchers/initiatives that used this dataset, created subsets and derivatives etc which may also be worthwhile sharing. And I can also name a colleague employed by Europeana whom we should loop in once any concrete steps are taken.
Perfect! If you have time I'm happy to set up a meeting to discuss an approach that also works from the Europeana side? Would also be good to hear about any similar efforts, definitely don't want to duplicate existing work.
Great! I don't want to overload this with things from the past, but I think this would present a great opportunity to capture and document some of the background and context that have been sitting in my head/inbox/fragmented over multiple project websites for a while. Should we try to find a suitable date/time via email?
> Great! I don't want to overload this with things from the past, but I think this would present a great opportunity to capture and document some of the background and context that have been sitting in my head/inbox/fragmented over multiple project websites for a while. Should we try to find a suitable date/time via email?
That sounds good, I'll drop you an email.
> code for doing the ALTO XML parsing
Perhaps some of this code could be useful/repurposed:
Some initial input for the dataset card/README:
- The dataset was produced by the project partners in the Europeana Newspapers project (2012-2014)
- A subset out of the newspaper collections of 12 national and university libraries from Europe was selected for the project. Some background on the selection criteria is available in D-2.1_Dataset_for_refinement.pdf
- The OCR was produced using ABBYY FineReader SDK v11, additional details on the workflow and setup are available in D-2.2_Specification_of_requirements
- An evaluation of the OCR quality using Ground Truth was made with results published in D-3.5_Performance_Evaluation_Report
- For the ingest into Europeana, metadata and OCR were transformed into the EDM data model for newspapers which is documented in D-4.4_EDM_for_Newspapers
- The main portal for search & browse in the collection was TEL, which was shut down suddenly due to lack of funding in 2016
- While Europeana took over the data from TEL, it has since struggled to find resources to re-implement core access features
- Some derivative datasets have been produced for specific purposes such as
- List of researchers/projects that made use of the dataset
- ... (TBC)
Hi @cneud , many thanks for that list!
I have one question left : was there any re-ocr done in the past years?
> was there any re-ocr done in the past years
Unfortunately no. We are currently finalizing a report where we compare the old OCR quality with the performance that can be achieved with state-of-the-art neural OCR/layout analysis methods (such as e.g. our eynollah) and I can already say that the quality improvements by re-OCRing would be considerable. Europeana currently has no capacity to re-OCR though, and the computational and organisational effort for doing this in a distributed setting would likely require another project with funding :(
Here is a quick mapping from Europeana Dataset IDs to content providers
| europeana-ID | library |
|---|---|
| 9200359 | National Library of the Netherlands |
| 9200356 | National Library of Estonia |
| 9200301 | National Library of Finland |
| 9200408 | National Library of France (unpublished due to license) |
| 9200333 | Tessmann Library South-Tyrol |
| 9200303 | National Library of Latvia |
| 9200357 | National Library of Poland |
| 9200300 | Austrian National Library |
| 9200338 | Hamburg State and University Library |
| 9200355 | Berlin State Library |
| 9200339 | Belgrade University Library |
| 9200396 | National Library of Luxembourg |
Notes for discussion:
Background
- Background documentation
- Are ALTO formats consistent across collections?
Documentation
- What to document
source
- API or bulk downloads (https://pro.europeana.eu/page/iiif#download)
target format
- Target output format: jsonl.gz, arrow?
- What to include in target output (metadata + content)
- how to split between files and/or directories
Info to include for each page:
{'OCRProcessing': {'processingDateTime': '2014-09-08',
'softwareCreator': 'ABBYY',
'softwareName': 'ABBYY FineReader Engine',
'softwareVersion': '11'},
'language': 'FR',
'mean_ocr': 0.8,
'std_ocr': 0.1,
'text': 'Text for page'}
@id: large_string
nc:text: large_string
newspaper_id: large_string
page: int32
dc:identifier: large_string
dc:language: large_string
dc:source: large_string
dc:subject: large_string
dc:title: large_string
dc:type: large_string
dc:extent: large_string
dc:isPartOf: large_string
dc:spatial: large_string
dc:relation: large_string
dc:hasPart: large_string
newspaper: large_string
dc:issued: date32[day]
- IIIF image URLs for each page
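For the jsonl.gz option, a per-page record like the one above could be written and read back with the standard library alone. This is only a sketch under that assumption; `write_pages` and `read_pages` are hypothetical names and the example record is made up.

```python
import gzip
import json
from typing import Dict, Iterable, Iterator


def write_pages(path: str, pages: Iterable[Dict]) -> None:
    """Write one JSON object per line to a gzip-compressed file."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for page in pages:
            # ensure_ascii=False keeps characters such as umlauts readable.
            handle.write(json.dumps(page, ensure_ascii=False) + "\n")


def read_pages(path: str) -> Iterator[Dict]:
    """Stream page records back without loading the whole file into memory."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield json.loads(line)


# Made-up record using field names from the notes above.
example_page = {
    "text": "Text for page",
    "language": ["fr"],
    "mean_ocr": 0.8,
    "std_ocr": 0.1,
    "page": 1,
}
```

Arrow or Parquet would compress the repeated metadata better thanks to columnar storage; jsonl.gz is mainly attractive because it needs no extra dependencies.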
Info to store in the path:
- the title of the newspaper?
- year(s) of publication (or range)?
- language?
configuration
- sample pack
- text mining pack
- XML co-ordinates
@davanstrien I will investigate some of the issues that have multiple languages in the dc:language field (resulting in an array as the data type) for both the dump and the API.
> Are ALTO formats consistent across collections?
Within Europeana Newspapers, all OCR xml files are consistent, in that they are all using ALTO schema version 2.0.
> Info to include for each page:
> {'OCRProcessing': {'processingDateTime': '2014-09-08', 'softwareCreator': 'ABBYY', 'softwareName': 'ABBYY FineReader Engine', 'softwareVersion': '11'},
This part should be identical for most files and could also be documented on global dataset level. If there are any different entries though, this would allow identifying pages that were also processed with article separation by CCS software (docWORKS) merely from the ALTO (i.e. without EDM or METS).
> - sample pack
> - text mining pack
> - XML co-ordinates
I personally like the different "packs" example from the National Library of Luxembourg (see https://data.bnl.lu/data/historical-newspapers/ and scroll down a bit) - they offer different sizes and different flavours of the data. I wonder how much of the creation of such "packs" could be done dynamically by the loading script?
The plain text should be straightforward to extract (beware of hyphenation and reading order), but I suppose @stefan-it has already done that.
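To illustrate the hyphenation caveat, here is a sketch that extracts text line by line and rejoins words split across lines using the ALTO `SUBS_TYPE` / `SUBS_CONTENT` attributes. It assumes those attributes are present on hyphenated words and that document order is an acceptable reading order, neither of which is guaranteed for every file; `alto_to_text` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import List


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(alto_xml: str) -> str:
    """Return one output line per TextLine, in document order."""
    root = ET.fromstring(alto_xml)
    lines: List[str] = []
    for line in root.iter():
        if local_name(line.tag) != "TextLine":
            continue
        words = []
        for child in line:
            if local_name(child.tag) != "String":
                continue  # skip SP (space) and HYP (hyphen) elements
            subs_type = child.get("SUBS_TYPE")
            full_word = child.get("SUBS_CONTENT")
            if subs_type == "HypPart2" and full_word:
                continue  # second half, already emitted with the first half
            if subs_type == "HypPart1" and full_word:
                words.append(full_word)  # emit the rejoined word once
            else:
                words.append(child.get("CONTENT", ""))
        if words:
            lines.append(" ".join(words))
    return "\n".join(lines)


# Tiny made-up ALTO fragment with one word hyphenated across two lines.
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Vorarlberger"/><SP/>
      <String CONTENT="Zei" SUBS_TYPE="HypPart1" SUBS_CONTENT="Zeitung"/><HYP CONTENT="-"/>
    </TextLine>
    <TextLine>
      <String CONTENT="tung" SUBS_TYPE="HypPart2" SUBS_CONTENT="Zeitung"/><SP/>
      <String CONTENT="1850"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""
text = alto_to_text(sample)
```

Files where only one half of a hyphenated word carries `SUBS_CONTENT` would need extra handling that this sketch leaves out.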
A simplistic way to calculate OCR confidence per page is here, but there are certainly better ways that consider string length, compute mean/avg etc.
For those interested in image content in newspapers, it may be sufficient to extract bounding box coordinates of illustrations (and possibly also check <GraphicalElement>) and keep that with the IIIF image URLs for each page, so that snippets of any image content detected on the pages can be automatically collected.
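A sketch of that idea: read the `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes of ALTO `Illustration` elements and turn them into IIIF Image API region URLs. It assumes the ALTO coordinates are in pixels of the served image (the ALTO measurement unit should be checked first) and that the image service follows the `{service}/{region}/{size}/{rotation}/default.jpg` pattern; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height


def illustration_boxes(alto_xml: str) -> List[Box]:
    """Collect bounding boxes of all Illustration elements on a page."""
    root = ET.fromstring(alto_xml)
    boxes: List[Box] = []
    for element in root.iter():
        if element.tag.rsplit("}", 1)[-1] != "Illustration":
            continue
        try:
            x, y, w, h = (
                int(round(float(element.attrib[key])))
                for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
            )
        except (KeyError, ValueError):
            continue  # skip elements with missing or malformed coordinates
        boxes.append((x, y, w, h))
    return boxes


def iiif_region_url(service_url: str, box: Box) -> str:
    """Build a IIIF Image API URL for a rectangular region at full size."""
    x, y, w, h = box
    return f"{service_url.rstrip('/')}/{x},{y},{w},{h}/full/0/default.jpg"


# Made-up ALTO fragment; coordinates are assumed to be pixels.
sample = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace>
    <Illustration ID="ill1" HPOS="100" VPOS="200.4" WIDTH="300" HEIGHT="400"/>
    <Illustration ID="ill2" HPOS="10"/>
  </PrintSpace></Page></Layout>
</alto>"""
service = "http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001"
urls = [iiif_region_url(service, box) for box in illustration_boxes(sample)]
```

Storing the boxes next to the page-level IIIF service URL keeps the records small, since the cropped URLs can be rebuilt on demand.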
As mentioned in the call, I'm slapping my parsing code online. As mentioned in the blog post, this is all throwaway notebooks I wrote primarily just to get the Neue Freie Presse out for a grad student in my class, but I suspect it wouldn't be crazy hard to get it working on the other ALTO-XML dumps, if desirable.
Repo at https://github.com/bmschmidt/europapers
@bmschmidt @cneud @stefan-it
Just to let you know, I am currently putting some processing code together for this. I'm essentially Frankensteining the code you all shared already. I'll hopefully have something to share tomorrow.
Hi @davanstrien , I prepared a GIST that shows how to parse metadata information.
You basically just need to download the zip archives, there's no need to unpack them (it is all done in-memory):
https://gist.github.com/stefan-it/2b9b04caad3fd1d3ec94e5f1456cbd63
There are two examples:
- issue per year distribution for German and French issues
- Extract all issue ids and metadata for issues with more than one detected language
Here's the list of issues with more than one detected language:
issues_with_more_languages.txt
It is pretty interesting, because we need to discuss them, e.g. this kind of entry:
3000051869684: MetaData(title='Österreichische Buchhändler-Correspondenz - 1870-02-20', year='1870', pages=8, languages=['==', 'de'])
Where == is mistakenly used as a language identifier?!
The ALTO parsing stuff is coming in another GIST, soon :)
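For anyone reading without the gist open, the in-memory approach can be sketched roughly as follows. This is not the gist's code: it assumes each zip member ending in `.xml` is one issue's EDM record containing a `dcterms:issued` element, and `issues_per_year` is a hypothetical helper.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from collections import Counter
from typing import IO, Union

# dcterms namespace as used in the Europeana EDM records.
DCTERMS_ISSUED = "{http://purl.org/dc/terms/}issued"


def issues_per_year(archive: Union[str, IO[bytes]]) -> Counter:
    """Count issues per publication year without unpacking the archive to disk."""
    counts: Counter = Counter()
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if not name.endswith(".xml"):
                continue
            root = ET.fromstring(zf.read(name))  # parsed straight from memory
            issued = root.find(f".//{DCTERMS_ISSUED}")
            if issued is not None and issued.text:
                counts[issued.text[:4]] += 1  # assumes dates start with YYYY
    return counts


# Build a tiny made-up archive in memory to show the call.
template = (
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    'xmlns:dcterms="http://purl.org/dc/terms/" '
    'xmlns:edm="http://www.europeana.eu/schemas/edm/">'
    "<edm:ProvidedCHO><dcterms:issued>{date}</dcterms:issued></edm:ProvidedCHO>"
    "</rdf:RDF>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    for i, date in enumerate(["1850-01-15", "1850-01-16", "1851-03-01"]):
        zf.writestr(f"issue_{i}.xml", template.format(date=date))
buffer.seek(0)
counts = issues_per_year(buffer)
```

The same loop could collect dc:language values to find issues with more than one detected language.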
Thanks for this, @stefan-it. I have the alto parsing done (adapting code from @cneud) but feel free to share if it's ready anyway :) For the metadata, I'm currently getting this via the API (adapting @bmschmidt's code).
I will check to see how different these two sets of metadata are for an item. I assume the API should hold fresher metadata in theory, but I don't know how much of a difference this makes. If it's possible to use the dumps and there isn't much of a difference in the metadata, then we'll probably prefer to use the dump files instead.
I'm also adding the IIIF manifest URLs for each item, plus links to the IIIF image of the page (and for those items where the ALTO XML predicts illustrations, I'm including a list of IIIF URLs with those regions cropped).
@bmschmidt @cneud I don't feel it makes sense to include the full IIIF manifest in the records (just the URL) but let me know if you disagree. I would suggest including some demo code of how to grab the full manifest in the dataset card.
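The demo code for the dataset card could look something like the sketch below. It assumes the manifest follows the IIIF Presentation API 2.x layout (`sequences` → `canvases` → `images` → `resource`), which should be verified against a real Europeana manifest; `fetch_manifest` and `canvas_image_urls` are hypothetical names.

```python
import json
import urllib.request
from typing import Dict, List


def fetch_manifest(manifest_url: str, timeout: float = 30.0) -> Dict:
    """Download a IIIF manifest and parse it as JSON."""
    with urllib.request.urlopen(manifest_url, timeout=timeout) as response:
        return json.load(response)


def canvas_image_urls(manifest: Dict) -> List[str]:
    """List the image resource URLs of every canvas, in manifest order.

    Assumes a IIIF Presentation 2.x manifest; a 3.x manifest uses
    'items' instead and would need a different traversal.
    """
    urls: List[str] = []
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                resource = image.get("resource", {})
                if "@id" in resource:
                    urls.append(resource["@id"])
    return urls


# Made-up, heavily trimmed manifest to show the expected shape.
example_manifest = {
    "sequences": [
        {
            "canvases": [
                {"images": [{"resource": {"@id": "http://example.org/iiif/p1/full/full/0/default.jpg"}}]},
                {"images": [{"resource": {"@id": "http://example.org/iiif/p2/full/full/0/default.jpg"}}]},
            ]
        }
    ]
}
page_urls = canvas_image_urls(example_manifest)
```

In real use, `canvas_image_urls(fetch_manifest(url))` would be called with the manifest URL stored in a record.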
> Where == is mistakenly used as language identifier?!
I'm shocked you don't speak == 😜
Example of metadata from dump:
{'rdf:RDF': {'@xmlns:cc': 'http://creativecommons.org/ns#',
'@xmlns:dc': 'http://purl.org/dc/elements/1.1/',
'@xmlns:dcterms': 'http://purl.org/dc/terms/',
'@xmlns:doap': 'http://usefulinc.com/ns/doap#',
'@xmlns:edm': 'http://www.europeana.eu/schemas/edm/',
'@xmlns:foaf': 'http://xmlns.com/foaf/0.1/',
'@xmlns:ore': 'http://www.openarchives.org/ore/terms/',
'@xmlns:owl': 'http://www.w3.org/2002/07/owl#',
'@xmlns:rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
'@xmlns:rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
'@xmlns:skos': 'http://www.w3.org/2004/02/skos/core#',
'@xmlns:svcs': 'http://rdfs.org/sioc/services#',
'@xmlns:wgs84_pos': 'http://www.w3.org/2003/01/geo/wgs84_pos#',
'edm:Place': {'@rdf:about': 'http://d-nb.info/gnd/4016680-6',
'skos:prefLabel': 'Feldkirch'},
'edm:ProvidedCHO': {'@rdf:about': 'http://data.theeuropeanlibrary.org/BibliographicResource/3000073475663',
'dc:identifier': 'oai:fue.onb.at:EuropeanaNewspapers_Delivery_2:ONB_00268/1850/ONB_00268_18500115.zip',
'dc:language': 'de',
'dc:source': {'@rdf:resource': 'http://anno.onb.ac.at/cgi-content/anno?apm=0&aid=voz&datum=18500115'},
'dc:subject': {'@rdf:resource': 'http://d-nb.info/gnd/4067510-5'},
'dc:title': 'Vorarlberger Zeitung - 1850-01-15',
'dc:type': [{'@rdf:resource': 'http://schema.org/PublicationIssue'},
{'#text': 'Analytic serial', '@xml:lang': 'en'},
{'#text': 'Newspaper', '@xml:lang': 'en'},
{'#text': 'Newspaper Issue', '@xml:lang': 'en'}],
'dcterms:extent': {'#text': 'Pages: 4', '@xml:lang': 'en'},
'dcterms:isPartOf': [{'@rdf:resource': 'http://data.theeuropeanlibrary.org/BibliographicResource/3000073527530'},
{'@rdf:resource': 'http://data.theeuropeanlibrary.org/Collection/a0600'},
{'#text': 'Europeana Newspapers', '@xml:lang': 'en'}],
'dcterms:issued': '1850-01-15',
'dcterms:spatial': {'@rdf:resource': 'http://d-nb.info/gnd/4016680-6'},
'edm:isNextInSequence': {'@rdf:resource': 'http://data.theeuropeanlibrary.org/BibliographicResource/3000073479497'},
'edm:type': 'TEXT'},
'edm:WebResource': [{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001/full/full/0/default.jpg',
'svcs:has_service': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001'}},
{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000002/full/full/0/default.jpg',
'edm:isNextInSequence': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001/full/full/0/default.jpg'},
'svcs:has_service': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000002'}},
{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000003/full/full/0/default.jpg',
'edm:isNextInSequence': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000002/full/full/0/default.jpg'},
'svcs:has_service': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000003'}},
{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000004/full/full/0/default.jpg',
'edm:isNextInSequence': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000003/full/full/0/default.jpg'},
'svcs:has_service': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000004'}}],
'ore:Aggregation': {'@rdf:about': 'http://data.theeuropeanlibrary.org/BibliographicResource/3000073475663#aggregation',
'edm:aggregatedCHO': {'@rdf:resource': 'http://data.theeuropeanlibrary.org/BibliographicResource/3000073475663'},
'edm:dataProvider': 'Österreichische Nationalbibliothek - Austrian National Library',
'edm:hasView': [{'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000002/full/full/0/default.jpg'},
{'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000003/full/full/0/default.jpg'},
{'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000004/full/full/0/default.jpg'}],
'edm:isShownAt': {'@rdf:resource': 'http://anno.onb.ac.at/cgi-content/anno?apm=0&aid=voz&datum=18500115'},
'edm:isShownBy': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001/full/full/0/default.jpg'},
'edm:object': {'@rdf:resource': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001/full/full/0/default.jpg'},
'edm:provider': {'#text': 'The European Library', '@xml:lang': 'en'},
'edm:rights': {'@rdf:resource': 'http://creativecommons.org/publicdomain/mark/1.0/'}},
'svcs:Service': [{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000003',
'dcterms:conformsTo': {'@rdf:resource': 'http://iiif.io/api/image'},
'doap:implements': {'@rdf:resource': 'http://iiif.io/api/image/2/level2.json'}},
{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000002',
'dcterms:conformsTo': {'@rdf:resource': 'http://iiif.io/api/image'},
'doap:implements': {'@rdf:resource': 'http://iiif.io/api/image/2/level2.json'}},
{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000004',
'dcterms:conformsTo': {'@rdf:resource': 'http://iiif.io/api/image'},
'doap:implements': {'@rdf:resource': 'http://iiif.io/api/image/2/level2.json'}},
{'@rdf:about': 'http://iiif.onb.ac.at/images/ANNO/voz18500115/00000001',
'dcterms:conformsTo': {'@rdf:resource': 'http://iiif.io/api/image'},
'doap:implements': {'@rdf:resource': 'http://iiif.io/api/image/2/level2.json'}}]}}
@albertvillanova does the Hugging Face datasets API support any standards for rich descriptions like this in the arrow metadata, at either the file or recordbatch level? It seems like a shame to throw it away. I've had on the back burner for a while a scheme to get ML people using the column description format of the W3C's CSV on the Web spec, which is a bit too much to bite off here; but as a stopgap I often try to put some of this stuff into arrow metadata where it won't get in anyone's way. But sometimes loading scripts won't copy the metadata parts of the arrow schema.
(Sorry if I'm just making this over-complicated--I'm asking b/c I think this is an interesting test case of some places where these fields don't speak each other's language.)
@davanstrien Thanks for tackling all this. One small note--all the metadata I could find was of the form 'dc:title': 'Vorarlberger Zeitung - 1850-01-15', but I think for typical use cases it's important to drop the date (which is already captured in dcterms:issued) from the end of the title to allow more regular filtering.
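A minimal sketch of that clean-up, assuming titles end in " - YYYY-MM-DD" as in the examples in this thread. `split_title` is a hypothetical helper, and titles whose trailing date disagrees with dcterms:issued are deliberately left untouched.

```python
import re
from typing import Optional, Tuple

# Matches a trailing " - YYYY-MM-DD" at the very end of the title.
TRAILING_DATE_RE = re.compile(r"\s+-\s+(\d{4}-\d{2}-\d{2})$")


def split_title(dc_title: str, issued: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Return (title without trailing date, date), or (title, None) if nothing was stripped."""
    match = TRAILING_DATE_RE.search(dc_title)
    if match is None:
        return dc_title, None
    date = match.group(1)
    if issued is not None and date != issued:
        return dc_title, None  # unexpected mismatch: keep the title as it is
    return dc_title[: match.start()], date


title, date = split_title("Vorarlberger Zeitung - 1850-01-15", issued="1850-01-15")
```

Hyphens inside a title, as in 'Österreichische Buchhändler-Correspondenz - 1870-02-20', are unaffected because only the final date suffix is matched.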