[C-MTEB] How to convert a QA dataset to a Retrieval & Reranking dataset
I noticed that some datasets, such as CmedqaRetrieval, CMedQAv1, and CMedQAv2, are built from QA datasets and converted to the 'query-pos-neg' format. Do you have any instructions for building this kind of data?
QA dataset sample:
Instruct:
Output:
Reranking dataset sample:
Query:
Pos:
Neg:
Retrieval dataset sample:
Query:
Context:
Id:
For QA datasets, we use the question as `query` and the answer/context as `pos`. We use the candidates provided by the original dataset (excluding the ground truth) as `neg`.
If your dataset provides no candidates, you can retrieve some with an embedding model and use them to construct `neg`.
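As a rough illustration of the conversion described above, here is a minimal sketch for the case where the dataset has no candidates. It is not the actual C-MTEB build script. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and `build_triplets` and `num_neg` are hypothetical names chosen for this example. For each question, the other answers in the dataset are ranked by similarity and the top ones are taken as `neg`.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector. Assumption: replace with a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_triplets(qa_pairs, num_neg=2):
    """Turn (question, answer) pairs into query/pos/neg records."""
    answers = [answer for _, answer in qa_pairs]
    answer_vecs = [embed(answer) for answer in answers]
    records = []
    for question, answer in qa_pairs:
        q_vec = embed(question)
        # Score every answer except the ground truth against the question.
        scored = [
            (cosine(q_vec, vec), cand)
            for cand, vec in zip(answers, answer_vecs)
            if cand != answer
        ]
        # The most similar non-matching answers serve as negatives.
        scored.sort(key=lambda item: item[0], reverse=True)
        records.append({
            "query": question,
            "pos": [answer],
            "neg": [cand for _, cand in scored[:num_neg]],
        })
    return records


qa_pairs = [
    ("what causes a headache", "a headache is often caused by stress or dehydration"),
    ("how to treat a cold", "rest and fluids help treat a cold"),
    ("what is a fever", "a fever is a raised body temperature"),
]
records = build_triplets(qa_pairs, num_neg=1)
```

Note that a highly ranked candidate may actually be a valid answer to the question (a false negative), so it can be worth spot-checking the mined negatives.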
Thanks for answering. I have a follow-up question: is there a way to filter out complex questions (tricky questions with subtext, whose answers are usually not directly related to the question)?
One possible method is to use GPT to filter these questions. Using the cosine similarity between questions and answers is simpler, but the threshold is difficult to set.
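The cosine-similarity option could look roughly like the sketch below. Everything here is an assumption for illustration: `embed` is a toy bag-of-words stand-in for a real embedding model, `split_by_similarity` is a hypothetical helper, and the `threshold` value of 0.2 is arbitrary. In practice the threshold would need tuning on your own data, for example by inspecting the score distribution.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector. Assumption: replace with a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_by_similarity(qa_pairs, threshold=0.2):
    """Separate pairs whose question/answer similarity falls below the threshold."""
    kept, flagged = [], []
    for question, answer in qa_pairs:
        score = cosine(embed(question), embed(answer))
        # Low similarity suggests the answer is not directly related to the question.
        target = kept if score >= threshold else flagged
        target.append((question, answer, score))
    return kept, flagged


qa_pairs = [
    ("how to treat a cold", "rest and fluids help treat a cold"),
    ("why is the sky sad", "my grandmother moved away last winter"),
]
kept, flagged = split_by_similarity(qa_pairs, threshold=0.2)
```

Returning the score with each pair makes it easier to review the flagged items by hand before discarding them.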