
Collect 70B S2 tokens

kyleclo opened this issue Feb 21 '23 · 2 comments

Exact spec is still WIP, but the TODOs are basically:

  1. Athena query to get Titles & Abstracts from S2AG. Form a JSON blob per document of the form:
{"text": "...", "paper_id": <identifier>}
  • To concatenate, " ".join([title, abstract]) should be sufficient (see the sketch after this list).
  • Double-check whether structured abstracts preserve whitespace.
  2. Athena query to get S2ORC-OA papers. Form a JSON blob per document of the form:
{"text": "...", "paper_id": <identifier>}
  • To concatenate, also whitespace-join to linearize structured content.
  • Keep everything for now, including tables, bibliographies, etc.
  3. Build a blocklist of papers. For now, this should just be a single file mapping paper_ids to a note or reason for removal. To start, it should contain documents that are part of the test set for Catwalk evaluation, especially Pubmed/arXiv abstract generation.
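A minimal sketch of the record formation and blocklist check described above; the `title`/`abstract` arguments and the example blocklist entry are hypothetical, not the actual Athena output schema:

```python
import json

# Hypothetical blocklist: a single mapping from paper_id to a reason for removal,
# e.g. membership in a Catwalk test set (Pubmed/arXiv abstract generation).
blocklist = {"123456": "pubmed abstract-generation test set"}

def make_blob(paper_id, title, abstract):
    # Whitespace-join title and abstract to linearize them (step 1);
    # the same whitespace join applies to structured S2ORC-OA content (step 2).
    text = " ".join([title, abstract])
    return json.dumps({"text": text, "paper_id": paper_id})

def keep(paper_id):
    # Drop any paper that appears in the blocklist.
    return str(paper_id) not in blocklist
```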

kyleclo · Feb 21 '23

Quick estimate for open-access papers in S2ORC:

Titles + Abstracts = 14.4B characters
Body Text = 468B characters
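As a rough sanity check, at a ballpark of 5-6 characters per whitespace-separated English token, the ~482B total characters above work out to roughly 80-95B tokens, which puts the 70B-token target within reach even after filtering.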

rodneykinney · Feb 21 '23

Collected a first version of the corpus. The steps I followed are here; a summary is as follows:

Data info:

  • Corpus is located at s3://ai2-s2-research-public/lucas/s2orc_oa_2022_01_03
  • It comprises 30 gzipped JSONL files.
  • Each line is a JSON object with the following fields:
    • id: the corpus ID of the paper in Semantic Scholar. If you want to look up the paper, use https://api.semanticscholar.org/CorpusID:<id>
    • text: the text of the paper. Sections are separated by double newlines, i.e. \n\n (see the reading sketch below).
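A minimal sketch of reading this layout, assuming the shards have been synced locally from the S3 prefix above (the glob pattern is a placeholder):

```python
import glob
import gzip
import json

def iter_documents(pattern="s2orc_oa_2022_01_03/*.gz"):
    # Iterate over the 30 gzipped JSONL shards.
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                doc = json.loads(line)
                # Sections are separated by double newlines.
                yield doc["id"], doc["text"].split("\n\n")
```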

The current set of filters, sketched in code after this list, is:

  • language is en as identified by pycld3
  • number of whitespace-separated tokens is at least 50
    • abstracts below 50 tokens are typically parsing errors.
  • number of whitespace-separated tokens is at most 50,000
    • past 50k tokens, you typically get large books, vocabulary lists, number-heavy reports, etc. Not worth it.
  • the most frequent token matches the regex ^[A-Za-z][a-z]+$
    • documents with parsing errors or heavy use of numbers usually have a non-alphabetic token as their most frequent one, e.g. . or \n.
  • for documents with at least 500 tokens, the most frequent token accounts for at most 7.5% of all tokens.
    • estimates for English put the frequency of the top word in a document at 5-10% of all tokens; splitting the difference, we go with 7.5%.
  • for documents with fewer than 500 tokens, the most frequent token accounts for at most 30% of all tokens.
    • for shorter documents, the frequency estimates above are less reliable, so we allow a more generous 30%.
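Putting these together, a sketch of the predicate the filters imply; the function name and the assumption that the pycld3 language code is passed in are mine, and the actual pipeline code may differ:

```python
import re
from collections import Counter

ALPHA_RE = re.compile(r"^[A-Za-z][a-z]+$")

def passes_filters(text: str, lang: str) -> bool:
    # Language must be English (lang assumed to come from pycld3).
    if lang != "en":
        return False
    tokens = text.split()
    n = len(tokens)
    # Token-count bounds: <50 is usually a parsing error, >50k is books/reports.
    if n < 50 or n > 50_000:
        return False
    top_token, top_count = Counter(tokens).most_common(1)[0]
    # The most frequent token must look like an ordinary word.
    if not ALPHA_RE.match(top_token):
        return False
    # Cap the relative frequency of the most frequent token.
    max_ratio = 0.075 if n >= 500 else 0.30
    return top_count / n <= max_ratio
```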

Final counts:

  • Number of whitespace-separated tokens: 72,582,009,602
  • Number of documents: 74,772,626

soldni · Feb 25 '23