bc5cdr drug/chemical counts don't match paper
When I count the number of chemical and disease entities in the different splits of bc5cdr, I get numbers that differ from those reported in the paper and in BLURB.
From BLURB:

| Dataset | Task | Train | Dev | Test | Evaluation Metrics |
| --- | --- | --- | --- | --- | --- |
| BC5-chem | NER | 5203 | 5347 | 5385 | F1 entity-level |
| BC5-disease | NER | 4182 | 4244 | 4424 | F1 entity-level |
versus the counts from the BigBio dataset:
from collections import Counter
from datasets import load_dataset

def get_entity_type_counts(ds):
    # Count entity mentions per type across all documents in a split.
    counter = Counter()
    for sample in ds:
        for entity in sample['entities']:
            counter[entity['type']] += 1
    return counter

ds = load_dataset('path/to/bc5cdr', name='bc5cdr_bigbio_kb')
for split in ['train', 'validation', 'test']:
    print(get_entity_type_counts(ds[split]))
# -- train: Counter({'Chemical': 5207, 'Disease': 4363})
# -- val: Counter({'Disease': 4421, 'Chemical': 5352})
# -- test: Counter({'Chemical': 5394, 'Disease': 4534})
The Chemical counts are only off by 4 to 9 mentions per split, but the Disease counts are roughly 2.5% to 4.3% higher than the BLURB numbers.
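To put numbers on the gap, here is a quick sketch that computes the per-split differences from the counts quoted above. Nothing is loaded from disk; the dictionary and function names are only for illustration.

```python
# Counts copied from the BLURB table and the Counter output above.
blurb = {
    "Chemical": {"train": 5203, "validation": 5347, "test": 5385},
    "Disease": {"train": 4182, "validation": 4244, "test": 4424},
}
bigbio = {
    "Chemical": {"train": 5207, "validation": 5352, "test": 5394},
    "Disease": {"train": 4363, "validation": 4421, "test": 4534},
}


def relative_diffs(reference, observed):
    """Return {(entity_type, split): (absolute diff, percent diff)}."""
    diffs = {}
    for etype, splits in reference.items():
        for split, ref_count in splits.items():
            delta = observed[etype][split] - ref_count
            diffs[(etype, split)] = (delta, 100.0 * delta / ref_count)
    return diffs


diffs = relative_diffs(blurb, bigbio)
for (etype, split), (delta, pct) in diffs.items():
    print(f"{etype:8s} {split:10s} {delta:+5d} {pct:+.2f}%")
```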
@galtay Interesting. I think there is a mismatch between how entities are represented in the BioC.xml files and the PubTator files; there might be some double counting of mentions going on in BioC. Our BigBio implementation uses BioC, but in prior work I've used the PubTator files. This Trove tutorial notebook dumps the chemical entity counts, and they match BLURB:
Tagged Entities: 5203
Tagged Entities: 5347
Tagged Entities: 5385
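For anyone who wants to reproduce the PubTator-side numbers without the notebook, a rough sketch of counting mentions per type from PubTator-formatted text might look like the following. It assumes the usual layout (title/abstract lines separated by `|`, tab-separated mention lines of the form PMID, start, end, text, type, ID, and shorter relation lines). The sample document is made up, and lines with extra columns are counted once, which is an assumption and may not match how the notebook counts.

```python
from collections import Counter


def count_pubtator_mentions(text):
    """Count mention annotations per entity type in PubTator-formatted text."""
    counter = Counter()
    for line in text.splitlines():
        fields = line.split("\t")
        # Title/abstract lines contain no tabs and relation lines (e.g. CID)
        # have only 4 tab-separated fields, so both are skipped here.
        if len(fields) < 6:
            continue
        _pmid, start, end, _mention, entity_type = fields[:5]
        if start.isdigit() and end.isdigit():
            counter[entity_type] += 1
    return counter


# Made-up document, for illustration only (not taken from BC5CDR).
sample = (
    "1|t|Aspirin induced headache\n"
    "1|a|Aspirin may cause headache.\n"
    "1\t0\t7\tAspirin\tChemical\tD001241\n"
    "1\t16\t24\theadache\tDisease\tD006261\n"
    "1\t25\t32\tAspirin\tChemical\tD001241\n"
    "1\tCID\tD001241\tD006261\n"
)
counts = count_pubtator_mentions(sample)
print(counts)
```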
We should run down the difference and then commit to a version to load.
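One way to start checking the double-counting hypothesis is to compare raw entity counts against counts of unique (type, offsets) spans per document. This is only a sketch: it assumes each entity carries an `offsets` field holding a list of `[start, end]` pairs, as in the bigbio_kb schema, and the function name and toy documents are hypothetical.

```python
from collections import Counter


def count_raw_and_unique(samples):
    """Return (raw, unique) Counters of entities per type.

    `unique` counts each (type, offsets) span at most once per document,
    so raw - unique is the number of exact duplicate annotations.
    """
    raw, unique = Counter(), Counter()
    for sample in samples:
        seen = set()
        for entity in sample["entities"]:
            raw[entity["type"]] += 1
            # Assumption: offsets is a list of [start, end] pairs.
            key = (entity["type"], tuple(tuple(o) for o in entity["offsets"]))
            if key not in seen:
                seen.add(key)
                unique[entity["type"]] += 1
    return raw, unique


# Toy documents, for illustration only.
toy = [
    {"entities": [
        {"type": "Disease", "offsets": [[10, 18]]},
        {"type": "Disease", "offsets": [[10, 18]]},  # exact duplicate
        {"type": "Chemical", "offsets": [[0, 7]]},
    ]},
    {"entities": [
        {"type": "Disease", "offsets": [[10, 18]]},  # same span, other doc
    ]},
]
raw, unique = count_raw_and_unique(toy)
print("raw:", raw)
print("unique:", unique)
```

If `raw` and `unique` agree on the real splits, exact duplicates are not the cause, and the difference is more likely in how the two formats split or merge mentions.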
@jason-fries thanks for taking a look and for the link! This seems like a nice, well-defined task we can ask a volunteer to handle, yeah? Just wondering if we should promote it in Slack or if we can hold off for a while?
@galtay Yup! Let's add this to the paper TODOs and ping the channel for a volunteer.