How to preserve original dataset fields when tokenizing with overflow?
Tokenizers support return_overflowing_tokens=True, which can yield multiple token sequences per input string. When this is used inside Dataset.map with batched=True, the original columns have to be dropped (via remove_columns), because the tokenized output has more rows than the input batch, as documented at https://huggingface.co/docs/datasets/about_map_batch .
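For concreteness, here is a minimal sketch of the setup I mean (the checkpoint, dataset, and tokenization parameters are just placeholders, not necessarily what you would use):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("squad", split="validation")

def tokenize(batch):
    # One long context can be split into several overlapping features.
    return tokenizer(
        batch["question"],
        batch["context"],
        truncation="only_second",
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        padding="max_length",
    )

# remove_columns is required here: a batch of N examples can produce more than
# N tokenized rows, so the original columns (including "id") no longer line up.
tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
```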
This means the resulting tokenized dataset cannot easily be correlated with the original dataset. The correlation can be reconstructed by position from overflow_to_sample_mapping, but this is hard to do once the features come out of a DataLoader in batches.
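To illustrate what I mean by "reconstructing by position" (again with placeholder names): overflow_to_sample_mapping gives, for each produced feature, the index of the batch example it came from.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

questions = ["first question", "second question"]
contexts = ["a very long context " * 300, "a short context"]

encoded = tokenizer(
    questions,
    contexts,
    truncation="only_second",
    max_length=64,
    stride=16,
    return_overflowing_tokens=True,
)

# Something like [0, 0, 0, ..., 1]: feature i came from example
# overflow_to_sample_mapping[i] of this batch.
print(encoded["overflow_to_sample_mapping"])

# In principle the original row can be looked up by that index, but after
# Dataset.map the per-batch indices are gone, and once features are shuffled
# and collated by a DataLoader there is nothing left in the batch to do the
# lookup with.
```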
How can I tell the tokenizer to copy the original columns from the dataset over into the tokenized output?
Here's a specific example: the Hugging Face metrics for QA require that each batch carry the id of the question it came from. But this id is lost during tokenization and is very hard to reconstruct per batch (without loading the entire dataset into memory). How can I do this?
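For reference, this is the sort of metric call I mean (using the squad metric via the evaluate library here; the exact loading API may differ depending on your version). The metric keys everything on the original question id, which is exactly the field that disappears after tokenization with overflow:

```python
import evaluate

squad_metric = evaluate.load("squad")

# Both predictions and references are matched up by the original question id,
# i.e. the column that gets dropped during tokenization.
results = squad_metric.compute(
    predictions=[{"id": "56be4db0acb8001400a502ec", "prediction_text": "Denver Broncos"}],
    references=[{
        "id": "56be4db0acb8001400a502ec",
        "answers": {"text": ["Denver Broncos"], "answer_start": [177]},
    }],
)
print(results)  # {'exact_match': ..., 'f1': ...}
```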