
Create dataset loader for PubMedQA

Open · hakunanatasha opened this issue 4 years ago • 10 comments

From https://github.com/pubmedqa/pubmedqa

hakunanatasha, Jan 21 '22

Found here: https://huggingface.co/datasets/pubmed_qa, however it comes without the official splits.
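
For reference, a minimal sketch of what that Hub loader exposes. The config names (pqa_labeled, pqa_unlabeled, pqa_artificial) are taken from the Hub page and may have changed since; each config appears to provide only a single train split rather than the official folds.

from datasets import load_dataset

# Hub config names as listed on the dataset page (assumption: unchanged).
# Each config exposes only a "train" split, so the official PQA-L
# train/dev/test folds are not reproduced here.
pqa_l = load_dataset("pubmed_qa", "pqa_labeled")
print(pqa_l)  # expect a DatasetDict with just a "train" split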

nomisto, Feb 22 '22

#self-assign

SamuelCahyawijaya, Mar 28 '22

@nomisto good catch - I think we'll implement it with the official splits. There is a small number of datasets currently overlapping with the original library.

@jason-fries @galtay @leonweber thoughts?

hakunanatasha, Apr 06 '22

@SamuelCahyawijaya can you let us know if you still intend to work on this? We'd like to update our project board. Please let us know by Friday, April 8, so we can plan accordingly. You can ping me in a comment via @hakunanatasha or on Discord with @admins

hakunanatasha, Apr 06 '22

@hakunanatasha: Yes, I am actually working on this one right now, and I found a problem with PQA-L(abeled): the official split on the GitHub link above is actually a 10-fold CV over the training & dev sets. Should I use only a single split (combining both train & dev), or should I provide separate splits for each fold?

SamuelCahyawijaya, Apr 06 '22

Hi @SamuelCahyawijaya

For multiple splits, I see two default approaches:

(1) Create a source/bigbio config for the combined splits (so 1 train/dev set, I think)

(2) Create a source/bigbio config for each split individually

BioASQ may be a useful example (see also the rough config sketch below): https://github.com/bigscience-workshop/biomedical/blob/0e35df219519fea9b14c58d26b6e26c81415160f/examples/bioasq.py#L494
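
As an illustration only, here is a minimal sketch of option (2) using plain datasets.BuilderConfig objects; the actual loader would use the repo's BigBioConfig helper in the same way, and the version string here is just a placeholder.

import datasets

_VERSION = "1.0.0"  # placeholder version for this sketch

# One source config and one bigbio_qa config per PQA-L fold.
BUILDER_CONFIGS = []
for fold in range(10):
    for schema in ("source", "bigbio_qa"):
        BUILDER_CONFIGS.append(
            datasets.BuilderConfig(
                name=f"pubmed_qa_labeled_fold{fold}_{schema}",
                version=datasets.Version(_VERSION),
                description=f"PubMedQA PQA-L fold {fold}, {schema} schema",
            )
        )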

hakunanatasha, Apr 08 '22

Also - I noticed you have a PR open for this; would you mind updating it with the splits? I'll change the reqs at some point too.

hakunanatasha, Apr 08 '22

@hakunanatasha: I see, noted, let me add the 10-fold split then 👍🏻

SamuelCahyawijaya, Apr 08 '22
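
Once the fold configs are in, a specific fold could be selected by config name, for example as in this sketch (the loader path and config names match the test output pasted below):

import datasets

# Load fold 0 of PQA-L in the bigbio_qa schema from the local loader script.
fold0 = datasets.load_dataset(
    "biodatasets/pubmed_qa/pubmed_qa.py",
    name="pubmed_qa_labeled_fold0_bigbio_qa",
)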

@SamuelCahyawijaya @galtay @hakunanatasha unit tests are failing on bigbio_schema; please advise whether this is unexpected behavior. Unit test output is pasted below:

INFO:__main__:args: Namespace(dataloader_path='biodatasets/pubmed_qa/pubmed_qa.py', data_dir=None, config_name=None)
INFO:__main__:all_config_names: ['pubmed_qa_artificial_source', 'pubmed_qa_unlabeled_source', 'pubmed_qa_artificial_bigbio_qa', 'pubmed_qa_unlabeled_bigbio_qa', 'pubmed_qa_labeled_fold0_source', 'pubmed_qa_labeled_fold1_source', 'pubmed_qa_labeled_fold2_source', 'pubmed_qa_labeled_fold3_source', 'pubmed_qa_labeled_fold4_source', 'pubmed_qa_labeled_fold5_source', 'pubmed_qa_labeled_fold6_source', 'pubmed_qa_labeled_fold7_source', 'pubmed_qa_labeled_fold8_source', 'pubmed_qa_labeled_fold9_source', 'pubmed_qa_labeled_fold0_bigbio_qa', 'pubmed_qa_labeled_fold1_bigbio_qa', 'pubmed_qa_labeled_fold2_bigbio_qa', 'pubmed_qa_labeled_fold3_bigbio_qa', 'pubmed_qa_labeled_fold4_bigbio_qa', 'pubmed_qa_labeled_fold5_bigbio_qa', 'pubmed_qa_labeled_fold6_bigbio_qa', 'pubmed_qa_labeled_fold7_bigbio_qa', 'pubmed_qa_labeled_fold8_bigbio_qa', 'pubmed_qa_labeled_fold9_bigbio_qa']
INFO:__main__:self.PATH: biodatasets/pubmed_qa/pubmed_qa.py
INFO:__main__:self.NAME: pubmed_qa_artificial_source
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.pubmed_qa.pubmed_qa' from '/Users/skang/repo/bigscience/biomedical/biodatasets/pubmed_qa/pubmed_qa.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name pubmed_qa_artificial_source
Downloading and preparing dataset pubmed_qa_dataset/pubmed_qa_artificial_source to /Users/skang/.cache/huggingface/datasets/pubmed_qa_dataset/pubmed_qa_artificial_source/1.0.0/22a90484922661fc556054d92c28a4474b5a3fa05de00581ae0a53a569b7e972...
Downloading: 2.21kB [00:00, 715kB/s]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
    self.dataset = datasets.load_dataset(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
    self._download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 685, in _download_and_prepare
    raise OSError(
OSError: Cannot find data file.
Original error:
[Errno 20] Not a directory: '/Users/skang/.cache/huggingface/datasets/downloads/6d6d623774b8015a704724c6ab74b78515b2a3a5376e6caee3ef9525dfb60eee/pqaa_train_set.json'

----------------------------------------------------------------------
Ran 1 test in 3.657s

FAILED (errors=1)
INFO:__main__:self.PATH: biodatasets/pubmed_qa/pubmed_qa.py
INFO:__main__:self.NAME: pubmed_qa_unlabeled_source
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.pubmed_qa.pubmed_qa' from '/Users/skang/repo/bigscience/biomedical/biodatasets/pubmed_qa/pubmed_qa.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name pubmed_qa_unlabeled_source
Downloading and preparing dataset pubmed_qa_dataset/pubmed_qa_unlabeled_source to /Users/skang/.cache/huggingface/datasets/pubmed_qa_dataset/pubmed_qa_unlabeled_source/1.0.0/22a90484922661fc556054d92c28a4474b5a3fa05de00581ae0a53a569b7e972...
Downloading: 2.21kB [00:00, 204kB/s]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
    self.dataset = datasets.load_dataset(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
    self._download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 685, in _download_and_prepare
    raise OSError(
OSError: Cannot find data file.
Original error:
[Errno 20] Not a directory: '/Users/skang/.cache/huggingface/datasets/downloads/c7dce2d93387f8e9dd420b3e238dff21faeddafcf933f4f02b8b619a9d73b242/ori_pqau.json'

----------------------------------------------------------------------
Ran 1 test in 0.532s

FAILED (errors=1)
INFO:__main__:self.PATH: biodatasets/pubmed_qa/pubmed_qa.py
INFO:__main__:self.NAME: pubmed_qa_artificial_bigbio_qa
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.pubmed_qa.pubmed_qa' from '/Users/skang/repo/bigscience/biomedical/biodatasets/pubmed_qa/pubmed_qa.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name pubmed_qa_artificial_bigbio_qa
Downloading and preparing dataset pubmed_qa_dataset/pubmed_qa_artificial_bigbio_qa to /Users/skang/.cache/huggingface/datasets/pubmed_qa_dataset/pubmed_qa_artificial_bigbio_qa/1.0.0/22a90484922661fc556054d92c28a4474b5a3fa05de00581ae0a53a569b7e972...
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
    self.dataset = datasets.load_dataset(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
    self._download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 685, in _download_and_prepare
    raise OSError(
OSError: Cannot find data file.
Original error:
[Errno 20] Not a directory: '/Users/skang/.cache/huggingface/datasets/downloads/6d6d623774b8015a704724c6ab74b78515b2a3a5376e6caee3ef9525dfb60eee/pqaa_train_set.json'

----------------------------------------------------------------------
Ran 1 test in 0.136s

FAILED (errors=1)
INFO:__main__:self.PATH: biodatasets/pubmed_qa/pubmed_qa.py
INFO:__main__:self.NAME: pubmed_qa_unlabeled_bigbio_qa
INFO:__main__:self.DATA_DIR: None
INFO:__main__:importing module ....
INFO:__main__:imported module <module 'biodatasets.pubmed_qa.pubmed_qa' from '/Users/skang/repo/bigscience/biomedical/biodatasets/pubmed_qa/pubmed_qa.py'>
INFO:__main__:Checking for _SUPPORTED_TASKS ...
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.QUESTION_ANSWERING: 'QA'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'QA'}
INFO:__main__:Checking load_dataset with config name pubmed_qa_unlabeled_bigbio_qa
Downloading and preparing dataset pubmed_qa_dataset/pubmed_qa_unlabeled_bigbio_qa to /Users/skang/.cache/huggingface/datasets/pubmed_qa_dataset/pubmed_qa_unlabeled_bigbio_qa/1.0.0/22a90484922661fc556054d92c28a4474b5a3fa05de00581ae0a53a569b7e972...
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/skang/repo/bigscience/biomedical/tests/test_bigbio.py", line 126, in runTest
    self.dataset = datasets.load_dataset(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/load.py", line 1702, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 594, in download_and_prepare
    self._download_and_prepare(
  File "/Users/skang/repo/bigscience/biomedical/bigbioenv/lib/python3.9/site-packages/datasets/builder.py", line 685, in _download_and_prepare
    raise OSError(
OSError: Cannot find data file.
Original error:
[Errno 20] Not a directory: '/Users/skang/.cache/huggingface/datasets/downloads/c7dce2d93387f8e9dd420b3e238dff21faeddafcf933f4f02b8b619a9d73b242/ori_pqau.json'

----------------------------------------------------------------------
Ran 1 test in 0.140s

FAILED (errors=1)

sunnnymskang, May 01 '22

Hi @sunnnymskang, just wondering whether this is caused by the datasets package version problem. We had a similar issue before with the pqaa and pqau data splits, which we discussed here.

We need to upgrade the dependency to datasets>=2.0.0, since there is a datasets package bug with the Google Drive link download, as mentioned here. Could you confirm that you tested using datasets>=2.0.0?
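
A quick way to confirm the installed version before re-running the tests (a sketch; packaging is assumed to be available, which it normally is since datasets depends on it):

import datasets
from packaging import version

# Fail fast if the installed datasets release predates the Google Drive download fix.
assert version.parse(datasets.__version__) >= version.parse("2.0.0"), (
    f"datasets {datasets.__version__} is too old; run: pip install -U 'datasets>=2.0.0'"
)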

If the problem remains even with the correct datasets version, I can investigate further on this issue this Wednesday.

SamuelCahyawijaya, May 02 '22