Create dataset multilingual_knowledge_questions_answers

Open albertvillanova opened this issue 4 years ago • 5 comments

  • uid: multilingual_knowledge_questions_answers
  • type: processed
  • description:
    • name: Multilingual Knowledge Questions & Answers
    • description: MKQA is an open-domain question answering evaluation set comprising question-answer pairs aligned across typologically diverse languages.
    • homepage: https://github.com/apple/ml-mkqa
    • validated: True
  • languages:
    • language_names:
      • Arabic
      • English
      • Spanish
      • French
      • Vietnamese
      • Chinese
      • Portuguese
      • Danish
      • German
      • Finnish
      • Hebrew
      • Hungarian
      • Italian
      • Japanese
      • Korean
      • Norwegian
      • Polish
      • Russian
      • Swedish
      • Thai
      • Turkish
    • language_comments:
    • language_locations:
    • validated: False
  • custodian:
    • name: Shayne Longpre
    • in_catalogue:
    • type: A private individual
    • location:
    • contact_name:
    • contact_email: [email protected]
    • contact_submitter: True
    • additional: https://www.shaynelongpre.com/
    • validated: False
  • availability:
    • procurement:
      • for_download: Yes - it has a direct download link or links
      • download_url: https://github.com/apple/ml-mkqa/raw/master/dataset/mkqa.jsonl.gz
      • download_email:
    • licensing:
      • has_licenses: Yes
      • license_text: https://github.com/apple/ml-mkqa/blob/master/LICENSE
      • license_properties:
        • open license
      • license_list:
        • cc-by-sa-3.0: Creative Commons Attribution Share Alike 3.0 Unported
    • pii:
      • has_pii: Unclear
      • generic_pii_likely:
      • generic_pii_list:
      • numeric_pii_likely:
      • numeric_pii_list:
      • sensitive_pii_likely:
      • sensitive_pii_list:
      • no_pii_justification_class: other
      • no_pii_justification_text: In the paper (https://arxiv.org/pdf/1911.02116.pdf), the author states that the work follows Wenzek et al. (https://arxiv.org/abs/1911.00359) to build a clean CommonCrawl Corpus. On further investigation, Wenzek et al. contains no description regarding PII.
    • validated: False
  • processed_from_primary:
    • from_primary: Taken from primary source
    • primary_availability: Yes - they are fully available
    • primary_license: Unclear / I don't know
    • primary_types:
      • web | other
    • validated: False
    • from_primary_entries:
  • media:
    • category:
      • text
    • text_format:
      • .TXT
    • audiovisual_format:
    • image_format:
    • database_format:
      • .JSON
    • text_is_transcribed: No
    • instance_type: An original English query, and then queries and answers in 26 languages.
    • instance_count: 10K<n<100K
    • instance_size: 10<n<100
    • validated: False
  • fname: multilingual_knowledge_questions_answers.json
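For readers who want to inspect the file behind download_url, here is a minimal sketch of reading a gzipped JSON Lines file such as mkqa.jsonl.gz after it has been downloaded locally. The function name read_jsonl_gz and the local path are illustrative assumptions, not part of the catalogue entry, and no particular record schema is assumed.

```python
import gzip
import json
from typing import Iterator, Union
from os import PathLike


def read_jsonl_gz(path: Union[str, PathLike]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped JSON Lines file.

    `path` is assumed to point at a local copy of the file (hypothetical
    location; the download itself is not handled here).
    """
    # "rt" opens the gzip stream in text mode so lines decode as UTF-8.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines rather than failing on them
                yield json.loads(line)
```

Counting the records with sum(1 for _ in read_jsonl_gz("mkqa.jsonl.gz")) would give a quick check against the instance_count range stated above.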

albertvillanova, Nov 23 '21 11:11

Already available: https://huggingface.co/datasets/mkqa

albertvillanova, Jan 04 '22 16:01

#self-assign

mariosasko, Jan 27 '22 13:01

Done! LM repos:

  • ar: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ar_multilingual_knowledge_questions_answers
  • da: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_da_multilingual_knowledge_questions_answers
  • de: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_de_multilingual_knowledge_questions_answers
  • en: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_multilingual_knowledge_questions_answers
  • es: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_es_multilingual_knowledge_questions_answers
  • fi: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_fi_multilingual_knowledge_questions_answers
  • fr: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_fr_multilingual_knowledge_questions_answers
  • he: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_he_multilingual_knowledge_questions_answers
  • hu: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_hu_multilingual_knowledge_questions_answers
  • it: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_it_multilingual_knowledge_questions_answers
  • ja: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ja_multilingual_knowledge_questions_answers
  • ko: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ko_multilingual_knowledge_questions_answers
  • km: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_km_multilingual_knowledge_questions_answers
  • ms: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ms_multilingual_knowledge_questions_answers
  • nl: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_nl_multilingual_knowledge_questions_answers
  • no: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_no_multilingual_knowledge_questions_answers
  • pl: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_pl_multilingual_knowledge_questions_answers
  • pt: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_pt_multilingual_knowledge_questions_answers
  • ru: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ru_multilingual_knowledge_questions_answers
  • sv: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_sv_multilingual_knowledge_questions_answers
  • th: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_th_multilingual_knowledge_questions_answers
  • tr: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_tr_multilingual_knowledge_questions_answers
  • vi: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_vi_multilingual_knowledge_questions_answers
  • zh-CN: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_zh-CN_multilingual_knowledge_questions_answers
  • zh-HK: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_zh-HK_multilingual_knowledge_questions_answers
  • zh-TW: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_zh-TW_multilingual_knowledge_questions_answer
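The repository names listed above appear to follow one pattern, lm_<language code>_<dataset uid>, under the bigscience-catalogue-lm-data organization. A small sketch of that pattern follows; the helper names lm_repo_id and lm_repo_url are hypothetical and only reproduce the naming seen in the list, they are not taken from the project's tooling.

```python
# Organization and dataset uid as they appear in the URLs listed in this thread.
ORG = "bigscience-catalogue-lm-data"
DATASET_UID = "multilingual_knowledge_questions_answers"


def lm_repo_id(lang: str) -> str:
    """Build the repository id for one language code (hypothetical helper)."""
    return f"{ORG}/lm_{lang}_{DATASET_UID}"


def lm_repo_url(lang: str) -> str:
    """Build the dataset page URL for one language code (hypothetical helper)."""
    return f"https://huggingface.co/datasets/{lm_repo_id(lang)}"
```

Generating the expected URLs this way makes it easier to compare them against the posted list and spot an entry that deviates from the pattern.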

mariosasko, Jan 27 '22 20:01

Thanks @mariosasko.

Here we have more languages than the ones targeted by the BigScience workshop. Should we remove or keep the additional languages?

CC: @yjernite

albertvillanova, Jan 31 '22 08:01

@mariosasko this repo is empty: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_ar_multilingual_knowledge_questions_answers

albertvillanova, Feb 01 '22 10:02