Closes #42
Implements the data loader for the source schema only. Per conversation with @jason-fries, the BigBio KB schema does not currently support all features that exist in the source data.
- Name: RadGraph
- Description: This dataset is derived from radiology reports and is designed for named entity recognition and relation extraction.
- Paper: https://doi.org/10.13026/hm87-5p47
- Data: https://physionet.org/content/radgraph/1.0.0/
Checkbox
- [x] Confirm that this PR is linked to the dataset issue.
- [x] Create the dataloader script `biodatasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- [x] Provide values for the `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_BIGBIO_VERSION` variables.
- [x] Implement `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- [x] Make sure that the `BUILDER_CONFIGS` class attribute is a list with at least one `BigBioConfig` for the source schema and one for a bigbio schema.
- [x] Confirm dataloader script works with `datasets.load_dataset` function.
- [ ] Confirm that your dataloader script passes the test suite run with `python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py`.
- [ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
@jason-fries @ruisi-su this is ready for review. Per guidance from Jason, this version includes only functionality to generate examples for the source schema, as it is not possible to represent RadGraph records properly using the KB schema as is. Thus, it will not pass tests.
@alisoncallahan is this a local dataset? Can you give us a print out of the following command?
from datasets import load_dataset
x = load_dataset("biodatasets/radgraph/radgraph.py")
print(x["train"]["entities"][-1])
print(x["train"]["relations"][-1])
@hakunanatasha yes, it is a local dataset b/c RadGraph is provided by PhysioNet, which requires user registration and vetting.
In the source schema, relations are nested in entities. The output of print(x["train"]["entities"][-1]) is:
[{'entity_id': '1', 'tokens': 'lungs', 'label': 'ANAT-DP', 'start_ix': 24, 'end_ix': 24, 'labeler': '', 'relations': []}, {'entity_id': '2', 'tokens': 'clear', 'label': 'OBS-DP', 'start_ix': 26, 'end_ix': 26, 'labeler': '', 'relations': [{'relation_id': '9667', 'type': 'located_at', 'arg': '1'}]}, {'entity_id': '3', 'tokens': 'Cardiomediastinal', 'label': 'ANAT-DP', 'start_ix': 28, 'end_ix': 28, 'labeler': '', 'relations': []}, {'entity_id': '4', 'tokens': 'hilar', 'label': 'ANAT-DP', 'start_ix': 30, 'end_ix': 30, 'labeler': '', 'relations': []}, {'entity_id': '5', 'tokens': 'contours', 'label': 'ANAT-DP', 'start_ix': 31, 'end_ix': 31, 'labeler': '', 'relations': [{'relation_id': '9668', 'type': 'modify', 'arg': '3'}, {'relation_id': '9669', 'type': 'modify', 'arg': '4'}]}, {'entity_id': '6', 'tokens': 'normal', 'label': 'OBS-DP', 'start_ix': 33, 'end_ix': 33, 'labeler': '', 'relations': [{'relation_id': '9670', 'type': 'located_at', 'arg': '3'}, {'relation_id': '9671', 'type': 'located_at', 'arg': '4'}]}, {'entity_id': '7', 'tokens': 'pleural', 'label': 'ANAT-DP', 'start_ix': 38, 'end_ix': 38, 'labeler': '', 'relations': []}, {'entity_id': '8', 'tokens': 'effusions', 'label': 'OBS-DA', 'start_ix': 39, 'end_ix': 39, 'labeler': '', 'relations': [{'relation_id': '9672', 'type': 'located_at', 'arg': '7'}]}, {'entity_id': '9', 'tokens': 'pneumothorax', 'label': 'OBS-DA', 'start_ix': 41, 'end_ix': 41, 'labeler': '', 'relations': []}, {'entity_id': '10', 'tokens': 'acute', 'label': 'OBS-DA', 'start_ix': 46, 'end_ix': 46, 'labeler': '', 'relations': [{'relation_id': '9673', 'type': 'modify', 'arg': '12'}]}, {'entity_id': '11', 'tokens': 'cardiopulmonary', 'label': 'ANAT-DP', 'start_ix': 47, 'end_ix': 47, 'labeler': '', 'relations': []}, {'entity_id': '12', 'tokens': 'process', 'label': 'OBS-DA', 'start_ix': 48, 'end_ix': 48, 'labeler': '', 'relations': [{'relation_id': '9674', 'type': 'located_at', 'arg': '11'}]}]
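For reviewers who want to inspect the relations on their own, here is a minimal sketch of how the nested structure could be flattened into one row per relation. It assumes that `arg` holds the `entity_id` of the target entity within the same record, which is how the output above reads (e.g. `clear` is `located_at` `lungs`). The helper name `flatten_relations` and the output keys are hypothetical and not part of the dataloader.

```python
from typing import Dict, List, Optional


def flatten_relations(entities: List[dict]) -> List[Dict[str, Optional[str]]]:
    """Turn entities with nested relations into a flat list of relation rows.

    Assumption: each relation's 'arg' is the 'entity_id' of the target
    entity in the same record.
    """
    # Index entities by id so relation targets can be resolved.
    by_id = {ent["entity_id"]: ent for ent in entities}
    flat = []
    for ent in entities:
        for rel in ent["relations"]:
            target = by_id.get(rel["arg"])
            flat.append(
                {
                    "relation_id": rel["relation_id"],
                    "type": rel["type"],
                    "source_id": ent["entity_id"],
                    "source_tokens": ent["tokens"],
                    "target_id": rel["arg"],
                    # None if the target id is not found in this record.
                    "target_tokens": target["tokens"] if target else None,
                }
            )
    return flat


# First two entities copied from the printed output above.
sample = [
    {"entity_id": "1", "tokens": "lungs", "label": "ANAT-DP",
     "start_ix": 24, "end_ix": 24, "labeler": "", "relations": []},
    {"entity_id": "2", "tokens": "clear", "label": "OBS-DP",
     "start_ix": 26, "end_ix": 26, "labeler": "",
     "relations": [{"relation_id": "9667", "type": "located_at", "arg": "1"}]},
]

for row in flatten_relations(sample):
    # prints: clear --located_at--> lungs
    print(f'{row["source_tokens"]} --{row["type"]}--> {row["target_tokens"]}')
```

The same function should work on `x["train"]["entities"][-1]` directly, since that value prints as a list of dicts with these keys.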