data_tooling
Create dataset multilingual_knowledge_questions_answers
- uid: multilingual_knowledge_questions_answers
- type: processed
- description:
- name: Multilingual Knowledge Questions & Answers
- description: MKQA is an open-domain question answering evaluation set comprising question-answer pairs aligned across typologically diverse languages.
- homepage: https://github.com/apple/ml-mkqa
- validated: True
- languages:
- language_names:
- Arabic
- English
- Spanish
- French
- Vietnamese
- Chinese
- Portuguese
- Danish
- German
- Finnish
- Hebrew
- Hungarian
- Italian
- Japanese
- Korean
- Norwegian
- Polish
- Russian
- Swedish
- Thai
- Turkish
- language_comments:
- language_locations:
- validated: False
- language_names:
- custodian:
- name: Shayne Longpre
- in_catalogue:
- type: A private individual
- location:
- contact_name:
- contact_email: [email protected]
- contact_submitter: True
- additional: https://www.shaynelongpre.com/
- validated: False
- availability:
- procurement:
- for_download: Yes - it has a direct download link or links
- download_url: https://github.com/apple/ml-mkqa/raw/master/dataset/mkqa.jsonl.gz
- download_email:
- licensing:
- has_licenses: Yes
- license_text: https://github.com/apple/ml-mkqa/blob/master/LICENSE
- license_properties:
- open license
- license_list:
- cc-by-sa-3.0: Creative Commons Attribution Share Alike 3.0 Unported
- pii:
- has_pii: Unclear
- generic_pii_likely:
- generic_pii_list:
- numeric_pii_likely:
- numeric_pii_list:
- sensitive_pii_likely:
- sensitive_pii_list:
- no_pii_justification_class: other
- no_pii_justification_text: In the paper (https://arxiv.org/pdf/1911.02116.pdf), the authors state that they follow Wenzek et al. (https://arxiv.org/abs/1911.00359) to build a clean CommonCrawl corpus. On further investigation, no description of PII handling was found in Wenzek et al.
- validated: False
- procurement:
- processed_from_primary:
- from_primary: Taken from primary source
- primary_availability: Yes - they are fully available
- primary_license: Unclear / I don't know
- primary_types:
- web | other
- validated: False
- from_primary_entries:
- media:
- category:
- text
- text_format:
- .TXT
- audiovisual_format:
- image_format:
- database_format:
- .JSON
- text_is_transcribed: No
- instance_type: An original English query, and then queries and answers in 26 languages.
- instance_count: 10K<n<100K
- instance_size: 10<n<100
- validated: False
- category:
- fname: multilingual_knowledge_questions_answers.json
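
For anyone working from the direct `download_url` rather than the Hub copy, here is a minimal sketch of reading a gzipped JSON Lines file with the Python standard library. The helper names `iter_jsonl_gz` and `count_records` are illustrative and not part of the MKQA repository. The sketch assumes the file holds one JSON object per line and has already been downloaded to a local path.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def iter_jsonl_gz(path: str) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of a gzipped JSONL file."""
    # "rt" opens the compressed stream in text mode so lines arrive as str.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


def count_records(path: str) -> int:
    """Count records, e.g. to compare against the instance_count range above."""
    return sum(1 for _ in iter_jsonl_gz(path))
```

Calling `count_records("mkqa.jsonl.gz")` on a local copy (the path is an assumption) gives a quick sanity check against the `10K<n<100K` range recorded in the entry.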
Already available: https://huggingface.co/datasets/mkqa
#self-assign
Done! LM repos:
- ar: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ar_multilingual_knowledge_questions_answers
- da: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_da_multilingual_knowledge_questions_answers
- de: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_de_multilingual_knowledge_questions_answers
- en: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_multilingual_knowledge_questions_answers
- es: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_es_multilingual_knowledge_questions_answers
- fi: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_fi_multilingual_knowledge_questions_answers
- fr: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_fr_multilingual_knowledge_questions_answers
- he: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_he_multilingual_knowledge_questions_answers
- hu: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_hu_multilingual_knowledge_questions_answers
- it: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_it_multilingual_knowledge_questions_answers
- ja: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ja_multilingual_knowledge_questions_answers
- ko: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ko_multilingual_knowledge_questions_answers
- km: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_km_multilingual_knowledge_questions_answers
- ms: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ms_multilingual_knowledge_questions_answers
- nl: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_nl_multilingual_knowledge_questions_answers
- no: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_no_multilingual_knowledge_questions_answers
- pl: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_pl_multilingual_knowledge_questions_answers
- pt: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_pt_multilingual_knowledge_questions_answers
- ru: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ru_multilingual_knowledge_questions_answers
- sv: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_sv_multilingual_knowledge_questions_answers
- th: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_th_multilingual_knowledge_questions_answers
- tr: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_tr_multilingual_knowledge_questions_answers
- vi: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_vi_multilingual_knowledge_questions_answers
- zh-CN: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_zh-CN_multilingual_knowledge_questions_answers
- zh-HK: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_zh-HK_multilingual_knowledge_questions_answers
- zh-TW: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_zh-TW_multilingual_knowledge_questions_answer
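
The repo names in this list appear to follow the pattern `lm_<language code>_<dataset uid>`. That pattern is inferred from the list itself, not from a documented convention. Under that assumption, a short sketch can regenerate the expected URLs and flag any posted link that deviates from them. The names `lm_repo_url` and `find_mismatches` are hypothetical helpers.

```python
from typing import Dict

# Base URL and dataset uid copied from the links and catalogue entry above.
BASE = "https://huggingface.co/datasets/bigscience-catalogue-lm-data"
DATASET_UID = "multilingual_knowledge_questions_answers"

# Language codes exactly as they appear in the list above.
LANG_CODES = [
    "ar", "da", "de", "en", "es", "fi", "fr", "he", "hu", "it", "ja", "ko",
    "km", "ms", "nl", "no", "pl", "pt", "ru", "sv", "th", "tr", "vi",
    "zh-CN", "zh-HK", "zh-TW",
]


def lm_repo_url(lang_code: str, uid: str = DATASET_UID) -> str:
    """Build the URL expected under the assumed lm_<code>_<uid> pattern."""
    return f"{BASE}/lm_{lang_code}_{uid}"


def find_mismatches(posted: Dict[str, str]) -> Dict[str, str]:
    """Return {code: expected_url} for every posted URL that differs from the pattern."""
    return {
        code: lm_repo_url(code)
        for code, url in posted.items()
        if url != lm_repo_url(code)
    }
```

Passing a `{code: url}` mapping of the posted links to `find_mismatches` returns only the entries worth a second look. An empty result means every link matches the assumed pattern.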
Thanks @mariosasko.
Here we have more languages than the ones targeted by the BigScience workshop. Should we remove or keep the additional languages?
CC: @yjernite
@mariosasko this repo is empty: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ar_multilingual_knowledge_questions_answers