Feature request: extract PDF references
Hi
I don't know if this is already possible, but I think a useful feature would be to extract all citations (to other papers) from a PDF file and automatically add them to the current (or another) library. There could be a right-click menu entry "Extract PDF references to library/new library" for any entry that has a PDF file in its 'file' field. The goal is to quickly build a library of related papers based on the references inside papers you already have in your database.
Once all references are extracted, this can also answer the question "what to read next", because you no longer need to open the papers themselves to look up their references.
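To illustrate the "what to read next" idea, here is a minimal Python sketch. It assumes the references have already been extracted and reduced to identifiers such as DOIs. The function name `rank_unread`, the sample data, and the ranking rule (count how many papers in the library cite each missing reference) are all hypothetical and only meant to show the principle.

```python
from collections import Counter


def rank_unread(references_by_paper, library_keys):
    """Rank references that are not yet in the library by citation count.

    references_by_paper: dict mapping a library entry key to the list of
        reference identifiers extracted from its PDF (hypothetical input).
    library_keys: set of identifiers already present in the library.
    """
    counts = Counter()
    for refs in references_by_paper.values():
        # Count each reference once per citing paper and skip known entries.
        counts.update(set(refs) - library_keys)
    # Most cited first; ties are broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up example data: three papers in the library and their references.
extracted = {
    "paperA": ["doi:1", "doi:2"],
    "paperB": ["doi:2", "doi:3"],
    "paperC": ["doi:2", "doi:3", "paperA"],
}
library = {"paperA", "paperB", "paperC"}

for identifier, cited_by in rank_unread(extracted, library):
    print(identifier, cited_by)
```

References cited by many papers you already own end up at the top, which is one plausible reading order. Other rankings (for example by publication year) would work just as well.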
Great idea. I want to have this too.
What we currently use to extract PDF metadata is Grobid. It offers the following functionality:
- Header extraction and parsing from article in PDF format. The extraction here covers the usual bibliographical information (e.g. title, abstract, authors, affiliations, keywords, etc.).
- References extraction and parsing from articles in PDF format, around .87 F1-score against an independent PubMed Central set of 1943 PDFs containing 90,125 references, and around .90 on a similar bioRxiv set of 2000 PDFs (using the Deep Learning citation model). All the usual publication metadata are covered (including DOI, PMID, etc.).
- Citation contexts recognition and resolution of the full bibliographical references of the article. The accuracy of citation contexts resolution is between .76 and .91 F1-score depending on the evaluation collection (this corresponds to both the correct identification of the citation callout and its correct association with a full bibliographical reference).
- Full text extraction and structuring from PDF articles, including a model for the overall document segmentation and models for the structuring of the text body (paragraph, section titles, reference and footnote callouts, figures, tables, etc.).
- PDF coordinates for extracted information, which make it possible to create "augmented" interactive PDFs based on bounding boxes of the identified structures.
- Parsing of references in isolation (above .90 F1-score at instance-level, .95 F1-score at field level, using the Deep Learning model).
- Parsing of names (e.g. person title, forenames, middle name, etc.), in particular author names in header, and author names in references (two distinct models).
- Parsing of affiliation and address blocks.
- Parsing of dates, ISO normalized day, month, year.
- Consolidation/resolution of the extracted bibliographical references using the biblio-glutton service or the CrossRef REST API. In both cases, DOI/PMID resolution performance is higher than 0.95 F1-score from PDF extraction.
- Extraction and parsing of patent and non-patent references in patent publications.
I don't know how to implement this myself, but Grobid would be the starting point.
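As a rough idea of what the reference extraction step involves, here is a minimal Python sketch that turns Grobid-style TEI XML into simple entries. It makes several assumptions: the TEI has already been obtained (for example from a Grobid server's `/api/processReferences` endpoint, which is not called here), the sample fragment is hand-written to resemble Grobid output rather than copied from a real run, and `parse_references` is a hypothetical helper, not part of Grobid or JabRef. An actual JabRef implementation would be written in Java.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written fragment for illustration only; real Grobid output is richer.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><back><div type="references"><listBibl>
    <biblStruct xml:id="b0">
      <analytic>
        <title level="a" type="main">An example article title</title>
        <author><persName><forename type="first">Ada</forename><surname>Example</surname></persName></author>
        <idno type="DOI">10.1234/example.1</idno>
      </analytic>
      <monogr>
        <title level="j">Journal of Examples</title>
        <imprint><date type="published" when="2020"/></imprint>
      </monogr>
    </biblStruct>
    <biblStruct xml:id="b1">
      <monogr>
        <title level="m" type="main">An example book</title>
        <author><persName><surname>Sample</surname></persName></author>
        <imprint><date type="published" when="2015-03"/></imprint>
      </monogr>
    </biblStruct>
  </listBibl></div></back></text>
</TEI>"""


def parse_references(tei_xml):
    """Return one dict per <biblStruct> with title, authors, year and DOI."""
    root = ET.fromstring(tei_xml)
    entries = []
    for bibl in root.iterfind(".//tei:listBibl/tei:biblStruct", TEI_NS):
        # Prefer the title marked as main, otherwise take the first title.
        title = bibl.find(".//tei:title[@type='main']", TEI_NS)
        if title is None:
            title = bibl.find(".//tei:title", TEI_NS)
        doi = bibl.find(".//tei:idno[@type='DOI']", TEI_NS)
        date = bibl.find(".//tei:imprint/tei:date", TEI_NS)
        authors = []
        for pers in bibl.iterfind(".//tei:author/tei:persName", TEI_NS):
            # Join forename(s) and surname in document order.
            authors.append(" ".join(el.text for el in pers if el.text))
        entries.append({
            "title": (title.text or "").strip() if title is not None else "",
            "authors": authors,
            # The 'when' attribute is an ISO date; keep only the year.
            "year": date.get("when", "")[:4] if date is not None else "",
            "doi": doi.text.strip() if doi is not None and doi.text else None,
        })
    return entries


references = parse_references(SAMPLE_TEI)
for ref in references:
    print(ref)
```

Each resulting dict could then be mapped to a library entry. References that have a DOI are good candidates for fetching complete metadata, while the others would probably need manual review.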
This feature also matters for the reviews organized by some IEEE groups. They are waiting for it, and it would save them a lot of time once implemented, so I have raised its priority.
I would like to take up this issue!
A similar wish came up again in the forum: https://discourse.jabref.org/t/creating-bibtex-or-doi-list-from-bibliography/4109