
Closes #42

Open alisoncallahan opened this issue 3 years ago • 3 comments

Finished the data loader for the source schema only, because the BigBio KB schema does not currently support all features present in the source data (per conversation with @jason-fries).

  • Name: RadGraph
  • Description: This dataset is derived from radiology reports and is designed for named entity recognition and relation extraction.
  • Paper: https://doi.org/10.13026/hm87-5p47
  • Data: https://physionet.org/content/radgraph/1.0.0/

Checkbox

  • [x] Confirm that this PR is linked to the dataset issue.
  • [x] Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • [x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • [x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • [x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • [x] Confirm dataloader script works with datasets.load_dataset function.
  • [ ] Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
  • [ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

alisoncallahan avatar Apr 20 '22 16:04 alisoncallahan

@jason-fries @ruisi-su this is ready for review. Per guidance from Jason, this version includes only functionality to generate examples for the source schema, as it is not possible to represent RadGraph records properly using the KB schema as is. Thus, it will not pass tests.

alisoncallahan avatar Apr 27 '22 05:04 alisoncallahan

@alisoncallahan is this a local dataset? Can you give us a printout of the following commands?

from datasets import load_dataset
x = load_dataset("biodatasets/radgraph/radgraph.py")
print(x["train"]["entities"][-1])
print(x["train"]["relations"][-1])

hakunanatasha avatar Apr 27 '22 05:04 hakunanatasha

@hakunanatasha yes, it is a local dataset because RadGraph is provided by PhysioNet, which requires user registration and vetting.

In the source schema, relations are nested in entities. The output of print(x["train"]["entities"][-1]) is:

[{'entity_id': '1', 'tokens': 'lungs', 'label': 'ANAT-DP', 'start_ix': 24, 'end_ix': 24, 'labeler': '', 'relations': []}, {'entity_id': '2', 'tokens': 'clear', 'label': 'OBS-DP', 'start_ix': 26, 'end_ix': 26, 'labeler': '', 'relations': [{'relation_id': '9667', 'type': 'located_at', 'arg': '1'}]}, {'entity_id': '3', 'tokens': 'Cardiomediastinal', 'label': 'ANAT-DP', 'start_ix': 28, 'end_ix': 28, 'labeler': '', 'relations': []}, {'entity_id': '4', 'tokens': 'hilar', 'label': 'ANAT-DP', 'start_ix': 30, 'end_ix': 30, 'labeler': '', 'relations': []}, {'entity_id': '5', 'tokens': 'contours', 'label': 'ANAT-DP', 'start_ix': 31, 'end_ix': 31, 'labeler': '', 'relations': [{'relation_id': '9668', 'type': 'modify', 'arg': '3'}, {'relation_id': '9669', 'type': 'modify', 'arg': '4'}]}, {'entity_id': '6', 'tokens': 'normal', 'label': 'OBS-DP', 'start_ix': 33, 'end_ix': 33, 'labeler': '', 'relations': [{'relation_id': '9670', 'type': 'located_at', 'arg': '3'}, {'relation_id': '9671', 'type': 'located_at', 'arg': '4'}]}, {'entity_id': '7', 'tokens': 'pleural', 'label': 'ANAT-DP', 'start_ix': 38, 'end_ix': 38, 'labeler': '', 'relations': []}, {'entity_id': '8', 'tokens': 'effusions', 'label': 'OBS-DA', 'start_ix': 39, 'end_ix': 39, 'labeler': '', 'relations': [{'relation_id': '9672', 'type': 'located_at', 'arg': '7'}]}, {'entity_id': '9', 'tokens': 'pneumothorax', 'label': 'OBS-DA', 'start_ix': 41, 'end_ix': 41, 'labeler': '', 'relations': []}, {'entity_id': '10', 'tokens': 'acute', 'label': 'OBS-DA', 'start_ix': 46, 'end_ix': 46, 'labeler': '', 'relations': [{'relation_id': '9673', 'type': 'modify', 'arg': '12'}]}, {'entity_id': '11', 'tokens': 'cardiopulmonary', 'label': 'ANAT-DP', 'start_ix': 47, 'end_ix': 47, 'labeler': '', 'relations': []}, {'entity_id': '12', 'tokens': 'process', 'label': 'OBS-DA', 'start_ix': 48, 'end_ix': 48, 'labeler': '', 'relations': [{'relation_id': '9674', 'type': 'located_at', 'arg': '11'}]}]
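
To make the nesting concrete, here is a minimal sketch of how the nested relations could be flattened into standalone records. It uses the first two entities from the output above as sample data. The helper name flatten_relations and the head/tail naming are hypothetical, and treating the entity that holds a relation as its head (with `arg` pointing at the tail) is an assumption about the source schema, not something the dataloader does.

```python
# Sample data: the first two entities copied from the printed output above.
entities = [
    {"entity_id": "1", "tokens": "lungs", "label": "ANAT-DP",
     "start_ix": 24, "end_ix": 24, "labeler": "", "relations": []},
    {"entity_id": "2", "tokens": "clear", "label": "OBS-DP",
     "start_ix": 26, "end_ix": 26, "labeler": "",
     "relations": [{"relation_id": "9667", "type": "located_at", "arg": "1"}]},
]


def flatten_relations(entities):
    """Hypothetical helper: turn relations nested under entities into flat records.

    Assumption: the entity that holds a relation is its head, and the
    relation's `arg` is the entity_id of its tail.
    """
    by_id = {ent["entity_id"]: ent for ent in entities}
    flat = []
    for ent in entities:
        for rel in ent["relations"]:
            tail = by_id.get(rel["arg"])  # None if the target id is missing
            flat.append({
                "relation_id": rel["relation_id"],
                "type": rel["type"],
                "head_id": ent["entity_id"],
                "head_tokens": ent["tokens"],
                "tail_id": rel["arg"],
                "tail_tokens": tail["tokens"] if tail else None,
            })
    return flat


flat = flatten_relations(entities)
for r in flat:
    print(f"{r['head_tokens']} -{r['type']}-> {r['tail_tokens']}")
# prints: clear -located_at-> lungs
```

A flat list like this is closer to what a relations table expects, but it does not by itself resolve the schema gaps mentioned earlier in this PR.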

alisoncallahan avatar Apr 27 '22 05:04 alisoncallahan