CoNLL 2003 dataset not including German
Hello, thanks for all the work on developing and maintaining this amazing platform, which I am enjoying working with!
I was wondering whether there is a reason the German CoNLL 2003 dataset is not included in the repository, since copies of it can be found in some places on the internet, such as GitHub. I could help add the German data to the hub, unless there are copyright issues that I am unaware of...
I ask because many papers use the union of the CoNLL 2002 and 2003 datasets to compare cross-lingual NER transfer performance in en, de, es, and nl, e.g., XLM-R.
Adding a Dataset
- Name: CoNLL 2003 German
- Paper: https://www.aclweb.org/anthology/W03-0419/
- Data: https://github.com/huggingface/datasets/tree/master/datasets/conll2003
Hello. I've been looking for information about the German CoNLL 2003 data and found your question. The official site (https://www.clips.uantwerpen.be/conll2003/ner/) mentions that the organizers provide only the annotations. The German texts (ECI Multilingual Text Corpus) are not freely available and can be ordered from the Linguistic Data Consortium.
But maybe something has changed since 2003.
You can find the reason for not including the German data here: https://github.com/huggingface/datasets/issues/4230.