Collect 70B S2 tokens
Exact spec still WIP, but TODOs are basically:
- Athena query to get Titles & Abstracts from S2AG. Form JSON blob per document of the form: `{"text": "...", "paper_id": <identifier>}`
  - To concatenate, `" ".join([title, abstract])` should be sufficient (see the sketch after this list).
    - Double-check whether structured abstracts preserve whitespace.
- Athena query to get S2ORC-OA papers. Form JSON blob per document of the form: `{"text": "...", "paper_id": <identifier>}`
  - To concatenate, also whitespace-join to linearize structured content.
  - Keep everything for now, including tables, bibliographies, etc.
- Build a blocklist of papers. For now, this should just be a single file mapping `paper_id`s to some note or reason for removal. To start, this should be documents that are part of the test set for Catwalk evaluation, especially PubMed/arXiv abstract generation.
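As a reference point, here is a minimal sketch of the blob formation for the titles-and-abstracts case, assuming rows have already been pulled via the Athena query; the row field names (`corpus_id`, `title`, `abstract`) and the output filename are hypothetical.

```python
import json

def make_blob(paper_id, title, abstract):
    """Form the per-document JSON blob: whitespace-join title and abstract."""
    text = " ".join([title or "", abstract or ""]).strip()
    return {"text": text, "paper_id": paper_id}

# Hypothetical usage: `rows` would come from the Athena query over S2AG.
rows = [{"corpus_id": 123, "title": "A Title", "abstract": "An abstract."}]
with open("s2ag_titles_abstracts.jsonl", "w") as f:
    for row in rows:
        blob = make_blob(row["corpus_id"], row["title"], row["abstract"])
        f.write(json.dumps(blob) + "\n")
```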
Quick estimate for open-access papers in S2ORC:
- Titles + Abstracts = 14.4B characters
- Body Text = 468B characters
Collected a first version of the corpus. Steps I followed are here, but a summary is as follows:
Data info (a reading sketch follows the list):
- Corpus is located at `s3://ai2-s2-research-public/lucas/s2orc_oa_2022_01_03`
- It is comprised of 30 gzipped JSONL files.
- Each line is a JSON object with the following fields:
  - `id`: the corpus ID of the paper in Semantic Scholar. If you want to look up the paper, use `https://api.semanticscholar.org/CorpusID:<id>`
  - `text`: the text of the paper. Sections are separated by double newlines, i.e. `\n\n`
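For reference, a minimal sketch of reading one shard of the corpus and splitting a document back into sections; the local path is hypothetical and assumes the shard has already been downloaded from the S3 prefix above.

```python
import gzip
import json

# Hypothetical local copy of one shard from the S3 prefix above.
path = "s2orc_oa_2022_01_03/part-00000.jsonl.gz"

with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        corpus_id = doc["id"]                  # Semantic Scholar corpus ID
        sections = doc["text"].split("\n\n")   # sections are separated by double newlines
        print(corpus_id, len(sections), "sections")
```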
The current set of filters is as follows (a sketch implementing them appears after the list):
- language is `en` as identified by pycld3
- number of whitespace-separated tokens is at least 50
  - abstracts below 50 tokens are typically parsing errors.
- number of whitespace-separated tokens is at most 50,000
  - past 50k, you typically have large books, vocabularies, number-heavy reports, etc. Not worth it.
- the most frequent token matches the regex `^[A-Za-z][a-z]+$`
  - documents that have parsing errors or are number heavy usually have a non-alpha token as the most frequent, e.g. `.` or `\n`.
- for documents that have at least 500 tokens, the most frequent token is at most 7.5% of the total number of tokens.
  - estimates for English put the frequency of the top word in a document at 5-10% of the total number of tokens; splitting the difference and going with 7.5%.
- for documents that have fewer than 500 tokens, the most frequent token is at most 30% of the total number of tokens.
  - for shorter documents, the frequency estimates above are not as reliable; going with a more generous 30%.
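A sketch of the full filter set as a single keep/drop predicate. It assumes the pycld3 API exposes `cld3.get_language` returning a prediction with a `.language` field; the thresholds are the ones listed above.

```python
import re
from collections import Counter

import cld3  # pycld3

ALPHA_TOKEN = re.compile(r"^[A-Za-z][a-z]+$")

def keep_document(text: str) -> bool:
    """Return True if a document passes the filters described above."""
    # Language must be English according to pycld3.
    pred = cld3.get_language(text)
    if pred is None or pred.language != "en":
        return False

    tokens = text.split()
    n = len(tokens)

    # Bounds on the number of whitespace-separated tokens.
    if n < 50 or n > 50_000:
        return False

    # Most frequent token must look like a normal alphabetic word.
    top_token, top_count = Counter(tokens).most_common(1)[0]
    if not ALPHA_TOKEN.match(top_token):
        return False

    # Cap on how dominant the most frequent token can be:
    # 7.5% for documents with >= 500 tokens, 30% for shorter ones.
    max_ratio = 0.075 if n >= 500 else 0.30
    return top_count / n <= max_ratio
```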
Final counts:
- Number of whitespace-separated tokens: 72,582,009,602
- Number of documents: 74,772,626