
Fails to process the SQuADv1.1 dataset with max_seq_length=128, doc_stride=96.


Describe the bug

datasets fails to process SQuADv1.1 with max_seq_length=128 and doc_stride=96: the call to train_dataset.map() on datasets["train"] crashes.

Steps to reproduce the bug

I used the Hugging Face TF2 question-answering example (run_qa.py). My script is as follows:

python run_qa.py \
  --model_name_or_path $BERT_DIR \
  --dataset_name $SQUAD_DIR \
  --do_train \
  --do_eval \
  --per_device_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2 \
  --max_seq_length 128 \
  --doc_stride 96 \
  --output_dir $OUTPUT \
  --save_steps 10000 \
  --overwrite_cache \
  --overwrite_output_dir
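For reference, here is a minimal sketch of what I believe is a standalone reproduction of the same panic, going through the fast tokenizer directly instead of datasets.map(). This is my own guess at the trigger (a question long enough that, with truncation="only_second" and max_length=128, the context is left with fewer than stride=96 tokens of room), not code taken from run_qa.py:

```python
# Minimal sketch (assumed reproduction, not from run_qa.py): a long question plus
# max_length=128 leaves the "only_second"-truncated context with fewer than
# stride=96 tokens, which should trip the `stride < max_len` assertion in tokenizers.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

question = "why " * 40                                      # ~40 tokens, more than 128 - 96
context = "The quick brown fox jumps over the lazy dog. " * 100

encoded = tokenizer(
    question,
    context,
    truncation="only_second",      # same truncation strategy as prepare_train_features
    max_length=128,
    stride=96,
    return_overflowing_tokens=True,
    return_offsets_mapping=True,
    padding="max_length",
)
```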

Expected results

The SQuADv1.1 dataset should be processed normally with max_seq_length=128 and doc_stride=96.

Actual results

INFO:__main__:Padding all batches to max length because argument was set or we're on TPU.
WARNING:datasets.fingerprint:Parameter 'function'=<function main.<locals>.prepare_train_features at 0x7f15bc2d07a0> of the transform datasets.arrow_dataset.Dataset._map_single couldn't be hashed properly, a random hash was used instead. Make sure your transforms and parameters are serializable with pickle or dill for the dataset fingerprinting and caching to work. If you reuse this transform, the caching mechanism will consider it to be different from the previous calls and recompute everything. This warning is only showed once. Subsequent hashing failures won't be showed.
  0%|                                                                                                                                                               | 0/88 [00:00<?, ?ba/s]thread '<unnamed>' panicked at 'assertion failed: stride < max_len', /__w/tokenizers/tokenizers/tokenizers/src/tokenizer/encoding.rs:311:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
  0%|                                                                                                                                                               | 0/88 [00:00<?, ?ba/s]
Traceback (most recent call last):
  File "run_qa.py", line 743, in <module>
    main()
  File "run_qa.py", line 485, in main
    load_from_cache_file=not data_args.overwrite_cache,
  File "/anaconda3/envs/py37/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2394, in map
    desc=desc,
  File "/anaconda3/envs/py37/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 551, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/anaconda3/envs/py37/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 518, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/anaconda3/envs/py37/lib/python3.7/site-packages/datasets/fingerprint.py", line 458, in wrapper
    out = func(self, *args, **kwargs)
  File "anaconda3/envs/py37/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2768, in _map_single
    offset=offset,
  File "anaconda3/envs/py37/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2644, in apply_function_on_filtered_inputs
    processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
  File "anaconda3/envs/py37/lib/python3.7/site-packages/datasets/arrow_dataset.py", line 2336, in decorated
    result = f(decorated_item, *args, **kwargs)
  File "run_qa.py", line 410, in prepare_train_features
    padding=padding,
  File "anaconda3/envs/py37/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2512, in __call__
    **kwargs,
  File "anaconda3/envs/py37/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2703, in batch_encode_plus
    **kwargs,
  File "anaconda3/envs/py37/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 429, in _batch_encode_plus
    is_pretokenized=is_split_into_words,
pyo3_runtime.PanicException: assertion failed: stride < max_len
Traceback (most recent call last):
  File "./data/SQuADv1.1/evaluate-v1.1.py", line 92, in <module>
    with open(args.prediction_file) as prediction_file:
FileNotFoundError: [Errno 2] No such file or directory: './output/bert_base_squadv1.1_tf2/eval_predictions.json'
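
My reading of the panic (an assumption on my part, not confirmed anywhere in the logs): with truncation="only_second", the context is truncated to max_seq_length minus the question length and the special tokens, so for any question longer than roughly max_seq_length - doc_stride tokens the remaining room drops below doc_stride and the tokenizers assertion stride < max_len fires. The second traceback (evaluate-v1.1.py) is just a consequence: training crashed, so eval_predictions.json was never written. Until this is handled upstream, a guard in user code could look like the sketch below; the names (safe_doc_stride, longest_question_len, effective_doc_stride) are hypothetical and not part of run_qa.py:

```python
# Hypothetical workaround sketch: pick a stride that stays safely below the room
# left for the context after the question and special tokens (3 for BERT-style
# [CLS] question [SEP] context [SEP] inputs). Names are illustrative only.
def safe_doc_stride(max_seq_length, doc_stride, longest_question_len, num_special_tokens=3):
    # Room left for the context once the question and special tokens are counted.
    room_for_context = max_seq_length - longest_question_len - num_special_tokens
    # Keep the stride strictly below that room (tokenizers asserts stride < max_len).
    return min(doc_stride, max(room_for_context - 1, 0))

effective_doc_stride = safe_doc_stride(128, 96, longest_question_len=40)
print(effective_doc_stride)  # 84 in this example
```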

Environment info

  • datasets version: 2.3.2
  • Platform: Ubuntu, pytorch=1.11.0, tensorflow-gpu=2.9.1
  • Python version: 3.7
  • PyArrow version: 8.0.0

zhuango · Jul 29 '22