biomedical Multiple kb entities are created when one would do b/c entity type is not a list

Describe the bug

Medmentions can have multiple entries with the same span/text/type, yet different normalisation terms

Steps to reproduce the bug

def conflicting_normalisation(row):
    span_to_entity = {}
    for entity in row["entities"]:
        dedup_key = tuple(tuple(span) for span in entity["offsets"])
        if (dedup_key) not in span_to_entity:
            span_to_entity[dedup_key] = entity
        else:
            primary_entity = span_to_entity[dedup_key]
            assert primary_entity["text"] == entity["text"]
            if primary_entity["normalized"] != entity["normalized"]:
                print(f"doc_id: {row['document_id']}, span: {dedup_key}, "
                      f"norm1: {span_to_entity[dedup_key]['normalized']}, "
                      f"norm2: {entity['normalized']}")


medmentions = [x for x in BigBioConfigHelpers() if x.config.name == "medmentions_full_bigbio_kb"][0].load_dataset()
for row in medmentions["train"]:
    conflicting_normalisation(row)

Expected results

in cases where span/text/type are the same, multiple normalisation annotations should be grouped under a single entity

Actual results

doc_id: 27708774, span: ((2036, 2052),), norm1: [{'db_name': 'UMLS', 'db_id': 'C1135971'}], norm2: [{'db_name': 'UMLS', 'db_id': 'C0600225'}] doc_id: 27778115, span: ((1085, 1088),), norm1: [{'db_name': 'UMLS', 'db_id': 'C0679083'}], norm2: [{'db_name': 'UMLS', 'db_id': 'C4284008'}] doc_id: 27801847, span: ((496, 500),), norm1: [{'db_name': 'UMLS', 'db_id': 'C0034821'}], norm2: [{'db_name': 'UMLS', 'db_id': 'C1366529'}] doc_id: 27979025, span: ((100, 104),), norm1: [{'db_name': 'UMLS', 'db_id': 'C0016452'}], norm2: [{'db_name': 'UMLS', 'db_id': 'C0016470'}] doc_id: 28464073, span: ((2110, 2112),), norm1: [{'db_name': 'UMLS', 'db_id': 'C0282507'}], norm2: [{'db_name': 'UMLS', 'db_id': 'C0282498'}] doc_id: 28526565, span: ((1372, 1377),), norm1: [{'db_name': 'UMLS', 'db_id': 'C1456627'}], norm2: [{'db_name': 'UMLS', 'db_id': 'C0087111'}] doc_id: 28548949, span: ((1809, 1812),), norm1: [{'db_name': 'UMLS', 'db_id': 'C1337030'}], norm2: [{'db_name': 'UMLS', 'db_id': 'C1506534'}] doc_id: 28548949, span: ((1809, 1812),), norm1: [{'db_name': 'UMLS', 'db_id': 'C1337030'}], norm2: [{'db_name': 'UMLS', 'db_id': 'C1506534'}]

the original annotation for two of the results are provided 27708774 2036 2052 antral follicles T023 C1135971 27708774 2036 2052 antral follicles T023 C0600225

28548949 1809 1812 XPA T028 C1337030 28548949 1809 1812 XPA T116,T123 C1506534

medmentions includes separate rows where multiple normalisation ids are found. I assume the schema is intended to be populated by grouping these together. tmVar v3 for examples here creates additional normalisation dicts under the same entity.

The second original annotation example given is more complicated. Under the current guidelines this is expected behaviour. imo this is one of the downsides of the decision to duplicate rows where multiple types are found, preventing the ability to assert that no two entries have the same span (which would pick up the first example as being unintended). fwiw I discovered this behaviour while implementing a util to merge the duplicate row/different type entities

Apologies if this is intended behaviour. Maybe some clarification on the schema page would help

May 31 '22 16:05 CallumMcMahon

thanks for this @CallumMcMahon !

good writeup of the problem and indeed, agree that we should have made type a list in the kb schema. the decision to make multiple entities when there were multiple types was a hack that was easier than going back and changing most other implemented datasets to create single element lists for type. in the future we will revisit the kb schema and make a few changes. i've created a new issue tag for these schema improvements. for now we'll reserve the bug keyword for code that is not behaving as we intended (here the code is behaving as intended, but we can improve our intentions)

for now, is it clear to you how you would merge the existing entities to do the offset assertion you want to do?

May 31 '22 22:05 galtay

Ahh nice! happy to hear there will eventually be an amendment to make type a list. I guess my comment was pretty minor

original annotation 27708774 2036 2052 antral follicles T023 C1135971 27708774 2036 2052 antral follicles T023 C0600225

current {'id': '83874', 'type': 'T023', 'text': ['antral follicles'], 'offsets': [[2036, 2052]], 'normalized': [{'db_name': 'UMLS', 'db_id': 'C1135971'}]} {'id': '83875', 'type': 'T023', 'text': ['antral follicles'], 'offsets': [[2036, 2052]], 'normalized': [{'db_name': 'UMLS', 'db_id': 'C0600225'}]}

expected {'id': '83874', 'type': 'T023', 'text': ['antral follicles'], 'offsets': [[2036, 2052]], 'normalized': [{'db_name': 'UMLS', 'db_id': 'C1135971'}, {'db_name': 'UMLS', 'db_id': 'C0600225'}]} (because only one type entry, row not expected to be duplicated)

but because of the extra logic needed to not confuse these cases with the intentional ones for multiple types, it might make more sense to just leave it until that change happens.

Either way it's clear how I can do the merge, thanks!

Jun 01 '22 09:06 CallumMcMahon