datasets
[data request] brWaC
- Name of dataset: brWaC (Brazilian Portuguese Web as Corpus)
- URL of dataset: https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC
- License of dataset: not specified
- Short description of dataset and use case(s): brWaC is a corpus built by crawling web pages in Portuguese, and many works on the Portuguese language use it as input data for semi-supervised training of language models. Having this dataset in TFDS would make future work easier; it is already available on the Hugging Face Hub.
I am already working on this dataset and would like to have this issue assigned to me.
Thank you @marcospiau for opening this issue! As requested, we have assigned this to you. Feel free to send a PR for review!
Hi @ccl-core, could you please take a look at my pull request? Thanks, Marcos