
[data request] brWaC

Open marcospiau opened this issue 4 years ago • 2 comments

  • Name of dataset: brWaC (Brazilian Portuguese Web as Corpus)
  • URL of dataset: https://www.inf.ufrgs.br/pln/wiki/index.php?title=BrWaC
  • License of dataset: not specified
  • Short description of dataset and use case(s): brWaC is a corpus built by crawling Portuguese-language web pages. Many works on Portuguese use it as input data for semi-supervised training of language models. Having this dataset in TFDS would make future work easier; it is already available on the Hugging Face Hub.

I am already working on this dataset and would like to have this issue assigned to me.

marcospiau avatar Dec 22 '21 04:12 marcospiau

Thank you @marcospiau for opening this issue! As requested, we have assigned this to you. Feel free to send a PR for review!

ccl-core avatar Dec 22 '21 13:12 ccl-core

Hi @ccl-core, could you please take a look at my pull request? Thanks, Marcos

marcospiau avatar Jun 28 '22 14:06 marcospiau