Dataset Issue: 713 examples have duplicate choices

Open bviggiano opened this issue 2 years ago • 2 comments

Hello! We are implementing a benchmark evaluation that uses medmcqa, and we believe we have found an issue with the underlying dataset that we wanted to bring to your attention: many multiple-choice questions contain duplicate answer choices.

We found 692 examples from the training set and 21 examples from the test set that have this issue.

Here are 10 ids of offending examples from the training set:

- 476a3ecd-7c42-4c85-9982-1ce80c95ab82
- 9f553c15-928f-41f8-8e94-021521702b9b
- 6d893f23-4404-4711-97df-e266c407ecdc
- 1154e512-eec5-4eae-b944-3de530532c4e
- deb53386-ca4b-48e0-b6de-489537df647b
- 5ba3d7de-9e3f-42cf-9ba8-7330fd1c1701
- 6c110742-768c-4dbd-8d12-7fa08d8d7d9c
- 67fddd3c-c80c-46e1-b28e-90b27214be8d
- 0be9175d-db8c-4da5-840c-38cb1060028d
- c5b72144-dbe1-47a6-8312-cdb42994bb01

In some cases, the correct answer is duplicated.
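For anyone who wants to check their own copy of the data, here is a minimal sketch of one way to flag such examples. It assumes each example is a dict with an `id` field and a `choices` list of strings (as in the record dump in this thread), and it treats two choices as duplicates when they match exactly after stripping surrounding whitespace. That matching rule, the helper names, and the toy records are assumptions for illustration, not necessarily the procedure used to obtain the counts above.

```python
def has_duplicate_choices(example):
    """Return True if any choice text occurs more than once in the example."""
    # Assumption: exact string match after stripping surrounding whitespace.
    normalized = [choice.strip() for choice in example["choices"]]
    return len(set(normalized)) < len(normalized)


def find_duplicate_ids(examples):
    """Collect the ids of all examples that contain a repeated choice."""
    return [ex["id"] for ex in examples if has_duplicate_choices(ex)]


# Toy records (hypothetical, not taken from the dataset).
sample_split = [
    {"id": "a", "choices": ["Mean", "Mode", "Median", "Mode"]},
    {"id": "b", "choices": ["1", "2", "3", "4"]},
]

flagged_ids = find_duplicate_ids(sample_split)
print(len(flagged_ids), flagged_ids)
```

Running `find_duplicate_ids` over each split separately would give per-split counts. A looser rule (for example, case-insensitive matching) would likely flag more examples.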

bviggiano • Aug 24 '23 18:08

Here is example 476a3ecd-7c42-4c85-9982-1ce80c95ab82:

{'answer': '(B)',
 'choices': ['Mode - Mean/ SD',
             'Mean - Mode/ SD',
             'SD/Mode - mean',
             'Mean - Mode/ SD'],
 'context': "Ans. is b' i.e., Mean-Mode Measures of Skewness o There are "
            "following measures of skewness 1. Karl pearson's measure The "
            'formula for measuring skewness is divided into a) absolute '
            'measure Skewness = Mean - Mode b) relative measure The relative '
            'measure is known as the Coefficient of Skewness and is more '
            'frequently used than the absolute measure of skewness. Fuher, '
            'when a comparison between two or more distributions is involved, '
            'it is the relative measure of Skewness which is used.',
 'document_id': '476a3ecd-7c42-4c85-9982-1ce80c95ab82',
 'id': '476a3ecd-7c42-4c85-9982-1ce80c95ab82',
 'question': 'Pearsonian measure of skewness -',
 'question_id': '476a3ecd-7c42-4c85-9982-1ce80c95ab82',
 'type': 'single'}

'Mean - Mode/ SD' appears in both the second position (choice B) and the fourth position (choice D). The correct answer is marked as B.
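To make that kind of check repeatable, here is a rough sketch that reports which answer letters share the same text and whether the labeled answer is one of them. It assumes the `answer` field has the form `'(B)'` and that choices map to letters A, B, C, D in list order, as in the record above. The function names are hypothetical.

```python
import string
from collections import defaultdict


def duplicate_groups(choices):
    """Map each repeated choice text to the letters of the positions it occupies."""
    positions = defaultdict(list)
    for letter, text in zip(string.ascii_uppercase, choices):
        positions[text.strip()].append(letter)
    # Keep only texts that occur in more than one position.
    return {text: letters for text, letters in positions.items() if len(letters) > 1}


def answer_is_duplicated(example):
    """Return True if the labeled answer's text also appears under another letter."""
    answer_letter = example["answer"].strip("()")  # assumes the '(B)' format
    groups = duplicate_groups(example["choices"])
    return any(answer_letter in letters for letters in groups.values())


# The choices and answer from the record shown above.
example = {
    "answer": "(B)",
    "choices": [
        "Mode - Mean/ SD",
        "Mean - Mode/ SD",
        "SD/Mode - mean",
        "Mean - Mode/ SD",
    ],
}

groups = duplicate_groups(example["choices"])
print(groups)
print(answer_is_duplicated(example))
```

Examples where `answer_is_duplicated` returns True are the ones most likely to distort an accuracy score, since a model that picks the other copy of the correct text is scored as wrong.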

bviggiano • Aug 30 '23 21:08

Facing the same problem here. Also wondering how you are handling this issue #4

abhinand5 • Sep 10 '23 17:09