
Fastwarc reader

Open • jordane95 opened this issue 1 year ago • 1 comment

Can we add a new WARC reader based on FastWARC?

It is reported to be much more efficient than warcio.

jordane95, May 13 '24 13:05

Including FastWARC would be nice. However, in the current text extraction pipeline for FineWeb, the WARC reader is not a bottleneck: it takes <5% of runtime on my machine, while trafilatura takes 95%. Of course, this might differ for other datasets.

maxidl, Jun 13 '24 10:06
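For anyone who wants to check where the time goes on their own dataset before swapping readers, here is a minimal sketch of measuring each stage's share of runtime. It uses only the standard library. `profile_pipeline`, `timed_iter`, `fake_reader`, and `fake_extractor` are hypothetical names made up for this illustration, not part of datatrove, FastWARC, warcio, or trafilatura; in practice you would pass your real record iterator and extraction function in their place.

```python
import time
from typing import Callable, Dict, Iterable, Iterator


def timed_iter(name: str, items: Iterable, totals: Dict[str, float]) -> Iterator:
    """Yield from items, charging the time spent producing each item to totals[name]."""
    it = iter(items)
    while True:
        start = time.perf_counter()
        try:
            item = next(it)
        except StopIteration:
            totals[name] += time.perf_counter() - start
            return
        totals[name] += time.perf_counter() - start
        yield item


def profile_pipeline(
    read_records: Callable[[], Iterable[bytes]],
    extract_text: Callable[[bytes], str],
) -> Dict[str, float]:
    """Run reader and extractor once; return each stage's fraction of total time."""
    totals = {"reader": 0.0, "extractor": 0.0}
    for raw in timed_iter("reader", read_records(), totals):
        start = time.perf_counter()
        extract_text(raw)
        totals["extractor"] += time.perf_counter() - start
    total = sum(totals.values())
    if total == 0:
        return {name: 0.0 for name in totals}
    return {name: spent / total for name, spent in totals.items()}


# Hypothetical stand-ins: replace with a real WARC reader and text extractor.
def fake_reader() -> Iterator[bytes]:
    for i in range(200):
        yield b"<html><body>" + b"word " * 500 + str(i).encode() + b"</body></html>"


def fake_extractor(raw: bytes) -> str:
    # Deliberately heavier than the reader, to mimic an expensive extraction step.
    text = raw.decode("utf-8", errors="replace")
    return " ".join(sorted(text.split()))


if __name__ == "__main__":
    for stage, share in profile_pipeline(fake_reader, fake_extractor).items():
        print(f"{stage}: {share:.1%}")
```

If the reader's share comes out small, as reported above, a faster reader will barely change end-to-end runtime; if it is large for your data, switching readers is more likely to pay off.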