Original Documents Not Found
Hello, I love your dataset! I was wondering if you had a version of the dataset with the original documents. I see there are document keys, and in the README for the dataset it says there should be original source documents in some directory, but I do not think they are there on the latest version. I would love to be able to use the scraped documents rather than trying to do the scraping again using the source links. Any help would be great.
Thanks!
The data should be downloadable from here: https://nlp.jhu.edu/rams/. Let me know if the download link doesn't work, or if you were looking for something else.
Thanks for the quick response! When I download the files I do not see an "individual_files" directory. I get the jsonlines data with the 5-sentence windows, but not the entire original documents. Am I missing something?
Sorry, I don't think I have the full documents readily accessible anymore either. I could dig around a bit more, but not until next week. As described in the paper, the data that we scraped went through several steps of processing, filtering, and annotation to produce the RAMS dataset.
The "individual files" directory contained one file per example, but each file still held only a 5-sentence window. So it was equivalent to the train/dev/test.jsonlines files, and each file looked like the example.json on our website (just a bit more human-readable). Since this data was identical to the jsonlines files, we removed the directory to make the download smaller.
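If a per-example layout is still useful, something like it can probably be rebuilt from the jsonlines files. Below is a minimal sketch, not the script we originally used: the function name `split_jsonlines` and the `example_000000.json` naming are made up for illustration, and files are named by line index so that no particular field name in the data is assumed.

```python
import json
from pathlib import Path


def split_jsonlines(jsonl_path, out_dir):
    """Write each non-empty line of a JSON Lines file to its own indented JSON file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(jsonl_path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            example = json.loads(line)
            # Hypothetical naming scheme: the line index, not a field from the data.
            out_path = out_dir / f"example_{i:06d}.json"
            out_path.write_text(
                json.dumps(example, indent=2, ensure_ascii=False),
                encoding="utf-8",
            )
            written.append(out_path)
    return written
```

For example, `split_jsonlines("train.jsonlines", "individual_files/train")` would write one indented JSON file per training example, assuming the file sits in the current directory. Each file would still contain only the 5-sentence window, not the full source document.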