
Collect 70B S2 tokens

kyleclo opened this issue Feb 21 '23 · 2 comments

Exact spec is still WIP, but the TODOs are basically:

  1. Athena query to get Titles & Abstracts from S2AG. Form a JSON blob per document of the form:
{"text": "...", "paper_id": <identifier>}
  • To concatenate, " ".join([title, abstract]) should be sufficient (see the sketch after this list).
  • Double-check whether structured abstracts preserve whitespace.
  2. Athena query to get S2ORC-OA papers. Form a JSON blob per document of the form:
{"text": "...", "paper_id": <identifier>}
  • To concatenate, also whitespace-join to linearize structured content.
  • Keep everything for now, including tables, bibliographies, etc.
  3. Build a blocklist of papers. For now, this should just be a single file mapping paper_ids to a note or reason for removal. To start, it should contain documents that are part of the test set for Catwalk evaluation, especially Pubmed/arXiv abstract generation.
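A minimal sketch of the record formation and blocklist check described above; the `title`/`abstract` arguments and the example blocklist entry are hypothetical, not the actual Athena output schema:

```python
import json

# Hypothetical blocklist: a single mapping from paper_id to a reason for removal,
# e.g. membership in a Catwalk test set (Pubmed/arXiv abstract generation).
blocklist = {"123456": "pubmed abstract-generation test set"}

def make_blob(paper_id, title, abstract):
    # Whitespace-join title and abstract to linearize them (step 1);
    # the same whitespace join applies to structured S2ORC-OA content (step 2).
    text = " ".join([title, abstract])
    return json.dumps({"text": text, "paper_id": paper_id})

def keep(paper_id):
    # Drop any paper that appears in the blocklist.
    return str(paper_id) not in blocklist
```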

kyleclo · Feb 21 '23

Quick estimate for open-access papers in S2ORC:

Titles + Abstracts = 14.4B characters
Body Text = 468B characters
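As a rough sanity check, at a ballpark of 5-6 characters per whitespace-separated English token, the ~482B total characters above work out to roughly 80-95B tokens, which puts the 70B-token target within reach even after filtering.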

rodneykinney · Feb 21 '23

Collected a first version of the corpus. The steps I followed are here; a summary is as follows:

Data info:

  • Corpus is located at s3://ai2-s2-research-public/lucas/s2orc_oa_2022_01_03
  • It comprises 30 gzipped JSONL files.
  • Each line is a JSON object with the following fields:
    • id: the corpus ID of the paper in Semantic Scholar. If you want to look up the paper, use https://api.semanticscholar.org/CorpusID:<id>
    • text: the text of the paper. Sections are separated by double newlines, i.e. \n\n (see the reading sketch below).
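A minimal sketch of reading this layout, assuming the shards have been synced locally from the S3 prefix above (the glob pattern is a placeholder):

```python
import glob
import gzip
import json

def iter_documents(pattern="s2orc_oa_2022_01_03/*.gz"):
    # Iterate over the 30 gzipped JSONL shards.
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                doc = json.loads(line)
                # Sections are separated by double newlines.
                yield doc["id"], doc["text"].split("\n\n")
```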

The current set of filters, sketched in code after this list, is:

  • language is en as identified by pycld3
  • number of whitespace-separated tokens is at least 50
    • abstracts below 50 tokens are typically parsing errors.
  • number of whitespace-separated tokens is at most 50,000
    • past 50k tokens, you typically get large books, vocabulary lists, number-heavy reports, etc. Not worth it.
  • the most frequent token matches the regex ^[A-Za-z][a-z]+$
    • documents with parsing errors or heavy use of numbers usually have a non-alphabetic token as their most frequent one, e.g. . or \n.
  • for documents with at least 500 tokens, the most frequent token accounts for at most 7.5% of all tokens.
    • estimates for English put the frequency of the top word in a document at 5-10% of all tokens; splitting the difference, we go with 7.5%.
  • for documents with fewer than 500 tokens, the most frequent token accounts for at most 30% of all tokens.
    • for shorter documents, the frequency estimates above are less reliable, so we allow a more generous 30%.
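Putting these together, a sketch of the predicate the filters imply; the function name and the assumption that the pycld3 language code is passed in are mine, and the actual pipeline code may differ:

```python
import re
from collections import Counter

ALPHA_RE = re.compile(r"^[A-Za-z][a-z]+$")

def passes_filters(text: str, lang: str) -> bool:
    # Language must be English (lang assumed to come from pycld3).
    if lang != "en":
        return False
    tokens = text.split()
    n = len(tokens)
    # Token-count bounds: <50 is usually a parsing error, >50k is books/reports.
    if n < 50 or n > 50_000:
        return False
    top_token, top_count = Counter(tokens).most_common(1)[0]
    # The most frequent token must look like an ordinary word.
    if not ALPHA_RE.match(top_token):
        return False
    # Cap the relative frequency of the most frequent token.
    max_ratio = 0.075 if n >= 500 else 0.30
    return top_count / n <= max_ratio
```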

Final counts:

  • Number of whitespace-separated tokens: 72,582,009,602
  • Number of documents: 74,772,626

soldni · Feb 25 '23