Namuwikitext

Wikitext format Korean corpus

This is a set of wikitext-format text files built from the Namuwiki dump data. For training and evaluation, the corpus is split by wiki page into train (99%), dev (0.5%), and test (0.5%).
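The page-level split can be illustrated with a minimal sketch. This is not the script used to build the corpus; the function name `split_pages`, the shuffling step, and the fixed seed are assumptions made for illustration. It only shows how whole pages, rather than individual lines, can be assigned to train / dev / test at the stated ratios.

```python
import random


def split_pages(pages, dev_ratio=0.005, test_ratio=0.005, seed=0):
    """Shuffle whole pages and split them into train / dev / test lists.

    Hypothetical helper: the ratios follow the README (99% / 0.5% / 0.5%),
    but the shuffling and seed are assumptions, not the original procedure.
    """
    pages = list(pages)
    rng = random.Random(seed)
    rng.shuffle(pages)

    n_dev = round(len(pages) * dev_ratio)
    n_test = round(len(pages) * test_ratio)

    dev = pages[:n_dev]
    test = pages[n_dev:n_dev + n_test]
    train = pages[n_dev + n_test:]
    return train, dev, test


# Toy example with 1000 placeholder pages
pages = [f"page-{i}" for i in range(1000)]
train, dev, test = split_pages(pages)
print(len(train), len(dev), len(test))  # 990 5 5
```

Splitting by page keeps all text from one wiki page in a single subset, so no page contributes to both training and evaluation data.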

Corpus size

To fetch the data, run the script below. It downloads the three corpus files (train, dev, and test) to ./data/

python fetch.py
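Once the files are downloaded, they can be read page by page. The sketch below assumes the files follow the usual wikitext convention in which each page begins with a title line of the form ` = Title = ` and deeper headings use more `=` signs; that layout, the `TITLE` pattern, and the helper name `iter_pages` are assumptions for illustration and are not taken from this repository.

```python
import re

# Assumed title line: a single "=" on each side, e.g. " = Page A = ".
# Deeper headings such as " == Section == " or " = = Section = = " do not match.
TITLE = re.compile(r"^\s*=\s+([^=\s].*?)\s+=\s*$")


def iter_pages(lines):
    """Yield (title, text) pairs from an iterable of wikitext lines."""
    title, body = None, []
    for line in lines:
        match = TITLE.match(line)
        if match:
            # A new title closes the previous page, if there was one
            if title is not None:
                yield title, "\n".join(body).strip()
            title, body = match.group(1), []
        elif title is not None:
            body.append(line.rstrip("\n"))
    # Emit the last page
    if title is not None:
        yield title, "\n".join(body).strip()


# Small in-memory sample standing in for a downloaded file
sample = """ = Page A =
first page text
 == Section ==
more text
 = Page B =
second page text
""".splitlines()

parsed = list(iter_pages(sample))
for page_title, page_text in parsed:
    print(page_title, len(page_text))
```

Because `iter_pages` accepts any iterable of lines, an open file handle for one of the files under ./data/ (opened with `encoding="utf-8"`) can be passed in place of `sample`.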

This corpus is licensed under CC BY-NC-SA 2.0 KR, the same license that Namuwiki uses. For details, visit https://creativecommons.org/licenses/by-nc-sa/2.0/kr/

Fetch and load using Korpora

Korpora is an archive of Korean corpora, implemented in Python. This corpus is available through the Korpora project, which provides functions to fetch and load it.

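from Korpora import Korpora

# Download the corpus if it is not present yet, then load it
namuwikitext = Korpora.load('namuwikitext')

# or, to only download the files without loading them
Korpora.fetch('namuwikitext')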