biomedical bc5cdr drug/chemical counts don't match paper

When I count the number of chemical and disease entities in the different splits of bc5cdr, I get different numbers than what is reported in the paper and in BLURB.

From BLURB:

Dataset     | Task | Train | Dev  | Test | Evaluation Metrics
BC5-chem    | NER  | 5203  | 5347 | 5385 | F1 entity-level
BC5-disease | NER  | 4182  | 4244 | 4424 | F1 entity-level

vs. the bigbio dataset:

from collections import Counter

from datasets import load_dataset


def get_entity_type_counts(ds):
    """Count entities per type across all samples in a split."""
    counter = Counter()
    for sample in ds:
        for entity in sample['entities']:
            counter[entity['type']] += 1
    return counter

ds = load_dataset('path/to/bc5cdr', name='bc5cdr_bigbio_kb')
for split in ['train', 'validation', 'test']:
    print(get_entity_type_counts(ds[split]))

#   -- train: Counter({'Chemical': 5207, 'Disease': 4363})
#   -- val: Counter({'Disease': 4421, 'Chemical': 5352})
#   -- test: Counter({'Chemical': 5394, 'Disease': 4534})

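Some of these counts are off by roughly 4% (the disease counts in train and validation, with test about 2.5% off), while the chemical counts differ by less than 0.2%.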
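
To make the size of the gap concrete, here is a minimal sketch that computes the relative difference per entity type and split from the numbers quoted above. The dictionary and function names are only for illustration; nothing here touches the dataset itself.

```python
# Counts copied from the BLURB table and from the bigbio output above.
blurb = {
    "Chemical": {"train": 5203, "validation": 5347, "test": 5385},
    "Disease": {"train": 4182, "validation": 4244, "test": 4424},
}
bigbio = {
    "Chemical": {"train": 5207, "validation": 5352, "test": 5394},
    "Disease": {"train": 4363, "validation": 4421, "test": 4534},
}


def relative_diff(ours, reference):
    """Return (ours - reference) / reference as a percentage."""
    return 100.0 * (ours - reference) / reference


# Map (entity type, split) -> percentage difference relative to BLURB.
diffs = {
    (etype, split): relative_diff(bigbio[etype][split], blurb[etype][split])
    for etype in blurb
    for split in blurb[etype]
}

for (etype, split), pct in diffs.items():
    print(f"{etype:8s} {split:10s} {pct:+.2f}%")
```

The disease rows account for almost all of the gap, which suggests looking at how disease mentions are represented first.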

May 20 '22 03:05 galtay

@galtay Interesting. I think there is a mismatch between how entities are represented in BioC.xml files and PubTator files; there might be some double counting of mentions going on in BioC. Our BigBio implementation uses BioC, but in prior work I've used the PubTator files. You can see in this Trove tutorial notebook that the chemical entity counts are dumped and match BLURB:

Tagged Entities: 5203
Tagged Entities: 5347
Tagged Entities: 5385

We should run down the difference and then commit to a version to load.
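
One possible starting point for running this down is to check whether a loaded split contains repeated mentions. The sketch below counts entities twice: once raw, and once after deduplicating on (document, type, offsets). It assumes the bigbio_kb layout, where each sample has a `document_id` and each entity has `type` and `offsets`. The helper name `count_raw_and_unique` is hypothetical, and the double-counting explanation itself is still only a guess.

```python
from collections import Counter


def count_raw_and_unique(samples):
    """Count entities per type, both raw and deduplicated by span.

    Assumes each sample is a dict with 'document_id' and 'entities',
    and each entity has 'type' and 'offsets' (a list of [start, end]).
    """
    raw, unique = Counter(), Counter()
    seen = set()
    for sample in samples:
        for entity in sample["entities"]:
            raw[entity["type"]] += 1
            # Offsets are lists of lists; convert to tuples so they hash.
            span = tuple(tuple(pair) for pair in entity["offsets"])
            key = (sample["document_id"], entity["type"], span)
            if key not in seen:
                seen.add(key)
                unique[entity["type"]] += 1
    return raw, unique


# Toy data for illustration only (not taken from bc5cdr).
toy = [
    {
        "document_id": "doc1",
        "entities": [
            {"type": "Disease", "offsets": [[0, 5]]},
            {"type": "Disease", "offsets": [[0, 5]]},  # repeated mention
            {"type": "Chemical", "offsets": [[10, 15]]},
        ],
    },
]
raw, unique = count_raw_and_unique(toy)
print("raw:", raw)
print("unique:", unique)
```

On the real data this could be called as `count_raw_and_unique(ds[split])` for each split. If the raw and unique counts agree, the gap likely has a different cause than repeated spans.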

May 21 '22 01:05 jason-fries

@jason-fries thanks for taking a look and for the link! This seems like a nice, well-defined task we can ask a volunteer to handle, yeah? Just wondering if we should promote it in Slack or if we can hold off for a while?

May 23 '22 03:05 galtay

@galtay Yup! Let's add this to the paper TODOs and ping the channel for a volunteer.

May 23 '22 03:05 jason-fries