
chemprot dataset problem with run_re.py script

Open ghost opened this issue 5 years ago • 3 comments

Hi, I would like to know how to run the run_re.py script on the ChemProt dataset. ChemProt is a multi-class classification dataset. I ran run_re.py the same way as for the "gad" task, but it failed with an error saying that no train.tsv file was found.

I have attached the ChemProt dataset below (ChemProt_Corpus.zip). Could you please help me out?

Thanks.

ghost avatar Nov 27 '20 06:11 ghost

The ChemProt dataset format differs from that of the other RE datasets. The usual RE datasets, such as EU-ADR and GAD, have three TSV files: train.tsv, test.tsv and dev.tsv.

The ChemProt training set, however, consists of several files: chemprot_training_abstracts.tsv, chemprot_training_entities.tsv, chemprot_training_gold_standard.tsv and chemprot_training_relations.tsv. So how can run_re.py in BioBERT be used with the ChemProt dataset?
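To make the question concrete, here is a rough sketch of how the separate files might be combined into GAD-style (text, label) rows. Everything in it is an assumption rather than a confirmed recipe: the column layouts (abstracts as pmid, title, abstract; entities as pmid, entity id, type, start, end, surface text; gold standard as pmid, CPR label, Arg1:Tx, Arg2:Ty), the idea that entity offsets index into the title and abstract joined by one character, the placeholder tokens @CHEMICAL$ and @GENE$, and the "false" label for pairs without a gold relation. The helper names are made up for illustration. It also masks pairs in the whole abstract instead of splitting into sentences, which GAD-like data would normally require, and it does not settle whether run_re.py accepts more than two labels.

```python
import csv
import io
from itertools import product


def read_tsv(text):
    """Parse tab-separated text into a list of rows, without quote handling."""
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def mask_pair(text, chem, gene):
    """Replace two (start, end) character spans with placeholder tokens.

    The later span is replaced first so the earlier offsets stay valid.
    The placeholder tokens are an assumption, modelled on GAD-style masking.
    """
    spans = sorted([(chem, "@CHEMICAL$"), (gene, "@GENE$")],
                   key=lambda s: s[0][0], reverse=True)
    for (start, end), token in spans:
        text = text[:start] + token + text[end:]
    return text


def build_rows(abstracts, entities, gold):
    """Build (masked_text, label) rows from parsed ChemProt-like tables.

    Assumed columns (not verified against the corpus):
      abstracts: pmid, title, abstract
      entities:  pmid, entity id, type, start, end, surface text
      gold:      pmid, CPR label, "Arg1:<id>", "Arg2:<id>"
    """
    # Assumption: offsets index into title + one separator char + abstract.
    docs = {pmid: title + " " + body for pmid, title, body in abstracts}

    ents = {}
    for pmid, eid, etype, start, end, _surface in entities:
        ents.setdefault(pmid, {})[eid] = (etype, int(start), int(end))

    # (pmid, chemical id, gene id) -> relation label
    labels = {(pmid, a1.split(":")[1], a2.split(":")[1]): cpr
              for pmid, cpr, a1, a2 in gold}

    rows = []
    for pmid, text in docs.items():
        doc_ents = ents.get(pmid, {})
        chems = [(eid, e) for eid, e in doc_ents.items() if e[0] == "CHEMICAL"]
        genes = [(eid, e) for eid, e in doc_ents.items() if e[0].startswith("GENE")]
        for (cid, chem), (gid, gene) in product(chems, genes):
            if chem[1] < gene[2] and gene[1] < chem[2]:
                continue  # skip overlapping spans, which cannot both be masked
            # Assumption: pairs without a gold relation get the label "false".
            label = labels.get((pmid, cid, gid), "false")
            rows.append((mask_pair(text, chem[1:], gene[1:]), label))
    return rows
```

The rows could then be written out with csv.writer using a tab delimiter to produce a train.tsv, after parsing each ChemProt file with read_tsv. Keeping every unrelated chemical and gene pair as a "false" row can produce far more negatives than positives, so some filtering would probably be needed.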

ghost avatar Nov 27 '20 08:11 ghost

Thank you for asking that question @wangxinyi-gsafety,

I have tried many different ways of preprocessing ChemProt, combining the files to match the format used by the GAD dataset and others. The script is functional, but the results are quite poor, and I strongly suspect the preprocessing is the cause (I think it alters the nature of the original data too much).

How did you manage to preprocess the data so that enough of the information is preserved for the model to perform well on it?

Best regards, Arthur

LedaguenelArthur avatar Feb 28 '21 18:02 LedaguenelArthur

Has anybody found a way to run ChemProt RE?

arunpatala avatar Mar 23 '22 05:03 arunpatala