
How to preserve original dataset fields when tokenizing with overflow?


Tokenizers supports return_overflowing_tokens=True, which yields multiple token sequences per input string. When used under Dataset.map with batched=True, this requires dropping the original columns (the output batch has more rows than the input, so the old columns no longer line up), as documented at https://huggingface.co/docs/datasets/about_map_batch.
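
For concreteness, here's a minimal sketch of the setup I mean (the SQuAD dataset, checkpoint, and tokenization parameters are just illustrative):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

ds = load_dataset("squad", split="train")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Long contexts are split into overlapping chunks, so one input row
    # can yield several output rows.
    return tokenizer(
        batch["question"],
        batch["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
    )

# The output batch has more rows than the input batch, so map() requires
# dropping the original columns.
tokenized = ds.map(tokenize, batched=True, remove_columns=ds.column_names)
```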

This means that the resulting tokens are not easily correlated with the original dataset rows. The correlation can be reconstructed positionally from overflow_to_sample_mapping, but this is hard to do in batches via a DataLoader.
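
Inside a single map call, while the batch indices still line up, that positional reconstruction would look something like this (a sketch; copying `id` by hand is my own workaround, not anything built in):

```python
def tokenize_keep_id(batch):
    tokenized = tokenizer(
        batch["question"],
        batch["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
    )
    # overflow_to_sample_mapping[i] is the index, within this batch, of the
    # input row that produced output chunk i, so any original field can be
    # copied through it positionally.
    sample_map = tokenized.pop("overflow_to_sample_mapping")
    tokenized["id"] = [batch["id"][i] for i in sample_map]
    return tokenized

tokenized = ds.map(tokenize_keep_id, batched=True, remove_columns=ds.column_names)
```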

How can I tell the tokenizer to copy over the original columns from the dataset?

Here's a specific example: the Hugging Face metrics for QA require that each prediction carry the id of its question. But this id is lost during tokenization, and it is very hard to reconstruct in a batch without loading the entire dataset into memory. How can I do this?
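
For reference, this is the metric interface I'm referring to, where every prediction must carry its question's id (sketched with the squad metric and dummy values):

```python
from datasets import load_metric

metric = load_metric("squad")

# Both predictions and references are keyed by the question id.
predictions = [{"id": "q0", "prediction_text": "Denver Broncos"}]
references = [{
    "id": "q0",
    "answers": {"text": ["Denver Broncos"], "answer_start": [177]},
}]
print(metric.compute(predictions=predictions, references=references))
```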

srobertjames · Jul 18 '22