sungjun lee

Results 8 issues of sungjun lee

Hello, I'm currently working on text processing that involves filtering (like gopher) in various languages. However, the default word_tokenization in datatrove filters is currently based on English, as shown in...

I fixed a typo. Are you seriously using this prompt with a typo to generate cosmopedia? It probably didn’t greatly affect the generation of synthetic data, but it could increase the...

Added the expand_metadata option to JsonlWriter, available in HuggingfaceWriter and ParquetWriter. This enables consistent metadata handling across different writer types.

Adds support for zstd compression in both JSONL and Parquet file formats. Parquet files: - The implementation applies compression directly within the internal write function (pq.ParquetWriter) using the compression option. -...

I’ve been using HuggingFaceDatasetWriter and noticed that it seems to default to uploading to the hub when I intended to save locally only. Could we consider adding a parameter to...

Added a shuffle option to the huggingface reader, along with test code for shuffling. Before merging this commit, please check the fixed seed value and the buffer size.

I've been using datatrove to read .jsonl files and count tokens with token_counter on a local node. I'm encountering an issue where the process was killed due to memory overflow...

I'm working on collecting Korean-language fineweb (kofineweb) and am currently using `v0.4.4`. While processing a new snapshot, I noticed an issue that has been recurring. When running tasks in `local_executor`, some...