datatrove
Fastwarc reader
Including FastWARC would be nice. However, in the current text extraction pipeline for FineWeb, the WARC reader is not a bottleneck: it accounts for less than 5% of runtime on my machine, while trafilatura accounts for about 95%. Of course, this might differ for other datasets.
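For anyone who wants to check this split on their own data, here is a minimal sketch of how per-stage runtime shares could be measured. It uses only the standard library. `profile_stages`, `fake_read`, and `fake_extract` are hypothetical names, not part of datatrove, and the two stage functions are stand-ins for the real WARC reader and the trafilatura extraction call.

```python
import time
from collections import defaultdict


def profile_stages(records, stages):
    """Pass each record through the stages in order and return
    each stage's share of the total measured time (0.0 to 1.0)."""
    totals = defaultdict(float)
    for record in records:
        value = record
        for name, fn in stages:
            start = time.perf_counter()
            value = fn(value)
            totals[name] += time.perf_counter() - start
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {name: 0.0 for name, _ in stages}
    return {name: totals[name] / grand_total for name, _ in stages}


# Hypothetical stand-in for the WARC reading step: just decodes bytes.
def fake_read(raw):
    return raw.decode("utf-8")


# Hypothetical stand-in for text extraction: deliberately slow,
# it repeatedly re-normalizes whitespace to simulate heavy work.
def fake_extract(html):
    text = html
    for _ in range(200):
        text = " ".join(text.split())
    return text


if __name__ == "__main__":
    records = [b"<html><body>" + b"word " * 2000 + b"</body></html>"] * 50
    shares = profile_stages(records, [("read", fake_read), ("extract", fake_extract)])
    for name, share in shares.items():
        print(f"{name}: {share:.1%}")
```

To profile a real pipeline, the stand-ins would be replaced with the actual reader and extractor calls. If the reader's share stays in the low single digits, a faster reader cannot speed up the run by much.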