biomedical JNLPBA implementation issues -- missing passages / entity only implementation

Currently JNLPBA is set up such that every token and tag is a single entity. This is not the correct setup for this task/schema -- we need to create passages with entity spans.

Jun 28 '22 19:06 jason-fries

#self-assign

Jun 28 '22 20:06 shamikbose

@jason-fries Am I reading this wrong or is this dataset loader using itself?

Jun 28 '22 23:06 shamikbose

Looks like it. I would look at the datasets hub implementation and copy over any code that supports reading the source schema directly vs. calling datasets like this.

Jun 29 '22 00:06 jason-fries

@jason-fries Here's an example datapoint from the corpus:

###MEDLINE:95369245

IL-2	B-DNA
gene	I-DNA
expression	O
and	O
NF-kappa	B-protein
B	I-protein
activation	O
through	O
CD28	B-protein
requires	O
reactive	O
oxygen	O
production	O
by	O
5-lipoxygenase	B-protein
.	O

What should go in passages and entities in the kb schema? Looking through some training examples and comparing them with Medline, it seems that the ###MEDLINE value is the Medline ID, the first sentence is the title, and the remaining sentences make up the abstract. Should the title and abstract be recreated to stay consistent with the other PubMed datasets?

As an aside, the datasets implementation of this also seems to be wrong. There are two files in the corpus with the same information, but the datasets implementation reads both of them and inserts them into the dataset, so the information is repeated.
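For reference, here is a rough sketch of how the token/tag format in the example above could be grouped into documents and collapsed into entity spans with character offsets. It is only an illustration: the function names and dictionary keys (`id`, `sentences`, `type`, `start`, `end`, `text`) are made up for this comment and are not the actual kb schema fields, and joining tokens with single spaces is an assumption, since the original whitespace cannot be recovered from the tokenized file.

```python
from typing import Dict, Iterator, List, Tuple

Token = Tuple[str, str]  # (token, IOB tag)


def iter_documents(lines: List[str]) -> Iterator[Dict]:
    """Group raw lines into documents keyed by the ###MEDLINE header.

    Blank lines are treated as sentence boundaries.
    """
    doc_id, sentences, tokens = None, [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("###MEDLINE:"):
            if doc_id is not None:
                if tokens:
                    sentences.append(tokens)
                yield {"id": doc_id, "sentences": sentences}
            doc_id, sentences, tokens = line.split(":", 1)[1], [], []
        elif not line.strip():
            if tokens:
                sentences.append(tokens)
                tokens = []
        else:
            token, tag = line.rsplit("\t", 1)
            tokens.append((token, tag))
    if doc_id is not None:
        if tokens:
            sentences.append(tokens)
        yield {"id": doc_id, "sentences": sentences}


def sentence_to_spans(tokens: List[Token], offset: int = 0) -> Tuple[str, List[Dict]]:
    """Join tokens with single spaces (an assumption) and collapse
    B-/I- tags into entity spans with character offsets."""
    entities, current, pos = [], None, offset
    for token, tag in tokens:
        start, end = pos, pos + len(token)
        if tag == "O":
            current = None
        elif tag.startswith("I-") and current and current["type"] == tag[2:]:
            current["end"] = end  # extend the open entity
        else:
            # B- tag, or a stray I- tag treated leniently as a new entity
            current = {"type": tag[2:], "start": start, "end": end}
            entities.append(current)
        pos = end + 1  # account for the joining space
    text = " ".join(tok for tok, _ in tokens)
    for ent in entities:
        ent["text"] = text[ent["start"] - offset : ent["end"] - offset]
    return text, entities


SAMPLE = (
    "###MEDLINE:95369245\n\n"
    "IL-2\tB-DNA\ngene\tI-DNA\nexpression\tO\nand\tO\n"
    "NF-kappa\tB-protein\nB\tI-protein\nactivation\tO\nthrough\tO\n"
    "CD28\tB-protein\nrequires\tO\nreactive\tO\noxygen\tO\n"
    "production\tO\nby\tO\n5-lipoxygenase\tB-protein\n.\tO\n"
)

docs = list(iter_documents(SAMPLE.splitlines()))
text, entities = sentence_to_spans(docs[0]["sentences"][0])
for ent in entities:
    print(ent["type"], ent["start"], ent["end"], ent["text"])
```

If the first sentence really is the title, a loader could emit it as one passage and the remaining sentences as a second passage, passing a running `offset` so that entity offsets stay relative to the whole document.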

Jun 29 '22 21:06 shamikbose

this is an example of a bigbio dataset loader that attempted to start with the existing huggingface datasets implementation and then modify it. there was a full discussion in the PR ... let me see if I can track it down ... [EDIT] it's here: https://github.com/bigscience-workshop/biomedical/pull/589

initially we were attempting to leverage the existing implementation by using it directly ... now I think it would be cleaner (as jason said) to use the fundamentals of the code from the HF datasets implementation but not directly "load with HF datasets and then modify"

Jun 30 '22 16:06 galtay

@galtay Thank you for that link. That's really helpful! I'm going to build it out like the one you outlined in this comment.

Jun 30 '22 21:06 shamikbose

@shamikbose following the guide @galtay outlined will work great. One request -- make certain you are loading the raw JNLPBA annotated data available on the GENIA BioNLP / JNLPBA Shared Task website (http://www.geniaproject.org/shared-tasks/bionlp-jnlpba-shared-task-2004) and not wrapping the datasets dataloader.

Jun 30 '22 22:06 jason-fries

Yeah, I’m reusing the code used in the datasets dataloader to download the raw data from the website.

Jun 30 '22 23:06 shamikbose