psytar schema is not implemented correctly

Open galtay opened this issue 3 years ago • 0 comments

In [3]: dsd = load_dataset('bigbio/biodatasets/psytar/psytar.py', name='psytar_bigbio_text', data_dir='/home/galtay/data/
   ...: bigbio/psytar/PsyTAR_dataset.xlsx')
Using custom data configuration psytar_bigbio_text-7247dd615c830efa
Reusing dataset psy_tar_dataset (/home/galtay/.cache/huggingface/datasets/psy_tar_dataset/psytar_bigbio_text-7247dd615c830efa/1.0.0/149b2465b2445f8a388bc2f7af48f0d136d246f718f59743564f154ea3c2dfbf)
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1193.94it/s]

In [4]: dsd['train']
Out[4]: 
Dataset({
    features: ['id', 'document_id', 'text', 'labels'],
    num_rows: 6003
})

In [5]: dsd['train'][0]
Out[5]: 
{'id': '0',
 'document_id': 'lexapro.1_1',
 'text': "['ADR']",         
 'labels': ['s', 's', 'r', 'i']}

text should be a plain string, not a stringified list (here "['ADR']"), and labels should be a list of whole label strings, not a list of single characters (here ['s', 's', 'r', 'i']).
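
A minimal sketch of a sanity check for these two symptoms, which could be run over loaded examples. The helper names (looks_stringified_list, looks_char_split, check_example) are hypothetical and not part of bigbio or datasets, and the root cause in psytar.py is not shown here; the checks only assume that the bad values look like the output above.

```python
import ast


def looks_stringified_list(value: str) -> bool:
    """Return True if value parses as a Python list literal, e.g. "['ADR']"."""
    stripped = value.strip()
    if not (stripped.startswith("[") and stripped.endswith("]")):
        return False
    try:
        return isinstance(ast.literal_eval(stripped), list)
    except (ValueError, SyntaxError):
        # Bracketed free text that is not a valid literal is left alone.
        return False


def looks_char_split(labels) -> bool:
    """Return True if labels has several entries that are all single characters."""
    return len(labels) > 1 and all(
        isinstance(label, str) and len(label) == 1 for label in labels
    )


def check_example(example: dict) -> list:
    """Collect human-readable problems for one example (hypothetical helper)."""
    problems = []
    if looks_stringified_list(example["text"]):
        problems.append("text is a stringified list")
    if looks_char_split(example["labels"]):
        problems.append("labels is a list of single characters")
    return problems


# The first training example from the session above.
example = {
    "id": "0",
    "document_id": "lexapro.1_1",
    "text": "['ADR']",
    "labels": ["s", "s", "r", "i"],
}
print(check_example(example))
```

A pattern like ['s', 's', 'r', 'i'] is what calling list() on a string produces, and "['ADR']" is what calling str() on a list produces, so one possible explanation (an assumption, not confirmed by this report) is that the two fields are populated from the wrong columns or wrapped with the wrong conversion.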

Jun 04 '22 03:06 galtay