Receiving a `JSONDecodeError` when running `tevatron.driver.encode` on WQ dataset
I first used Tevatron to train DPR from `bert-base-uncased`:
```shell
python -m torch.distributed.launch --nproc_per_node=1 -m tevatron.driver.train \
--output_dir model_wq \
--dataset_name Tevatron/wikipedia-wq \
--model_name_or_path bert-base-uncased \
--do_train \
--save_steps 20000 \
--fp16 \
--per_device_train_batch_size 128 \
--train_n_passages 2 \
--learning_rate 1e-5 \
--q_max_len 32 \
--p_max_len 156 \
--num_train_epochs 40 \
--negatives_x_device \
--overwrite_output_dir
```
After the model was saved to model_wq/ (see footnote), I continued to follow the instructions to encode the passages:
```shell
export ENCODE_DIR="wq_corpus_encoded"
mkdir $ENCODE_DIR
for s in $(seq -f "%02g" 0 19)
do
python -m tevatron.driver.encode \
--output_dir=temp \
--model_name_or_path model_wq \
--fp16 \
--per_device_eval_batch_size 156 \
--dataset_name Tevatron/wikipedia-wq-corpus \
--encoded_save_path corpus_emb.$s.pkl \
--encode_num_shard 20 \
--encode_shard_index $s
done
```
I saved that inside a bash file and ran it, but I got multiple `JSONDecodeError`s along the way, which does not seem expected (which is why I stopped the process):
```
$ bash encode_wq_corpus.sh
mkdir: cannot create directory ‘wq_corpus_encoded’: File exists
07/11/2022 19:29:13 - INFO - tevatron.modeling.encoder - try loading tied weight
07/11/2022 19:29:13 - INFO - tevatron.modeling.encoder - loading model weight from model_wq
Downloading and preparing dataset wikipedia-wq-corpus/default to /tmp/.cache/huggingface/datasets/Tevatron___wikipedia-wq-corpus/default/0.0.1/69d8ab11b0c3a7443dd4f41ec73edeb30ffe1f7a0b56fe2a6b316fb77c2ec033...
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4573.94it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 429.92it/s]
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 111, in <module>
main()
File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 70, in main
cache_dir=data_args.data_cache_dir or model_args.cache_dir)
File "/tmp/.local/lib/python3.7/site-packages/tevatron/datasets/dataset.py", line 83, in __init__
data_files=data_files, cache_dir=cache_dir)[data_args.dataset_split]
File "/opt/conda/lib/python3.7/site-packages/datasets/load.py", line 1684, in load_dataset
use_auth_token=use_auth_token,
File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 705, in download_and_prepare
dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1221, in _download_and_prepare
super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 793, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1210, in _prepare_split
desc=f"Generating {split_info.name} split",
File "/opt/conda/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
for obj in iterable:
File "/tmp/.cache/huggingface/modules/datasets_modules/datasets/Tevatron--wikipedia-wq-corpus/69d8ab11b0c3a7443dd4f41ec73edeb30ffe1f7a0b56fe2a6b316fb77c2ec033/wikipedia-wq-corpus.py", line 82, in _generate_examples
data = json.loads(line)
File "/opt/conda/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/opt/conda/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/opt/conda/lib/python3.7/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 30 (char 29)
07/11/2022 19:31:45 - INFO - tevatron.modeling.encoder - try loading tied weight
07/11/2022 19:31:45 - INFO - tevatron.modeling.encoder - loading model weight from model_wq
Downloading and preparing dataset wikipedia-wq-corpus/default to /tmp/.cache/huggingface/datasets/Tevatron___wikipedia-wq-corpus/default/0.0.1/69d8ab11b0c3a7443dd4f41ec73edeb30ffe1f7a0b56fe2a6b316fb77c2ec033...
Downloading data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 5849.80it/s]
Extracting data files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 517.24it/s]
Traceback (most recent call last): ^C
File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 111, in <module>
main()
File "/tmp/.local/lib/python3.7/site-packages/tevatron/driver/encode.py", line 70, in main
cache_dir=data_args.data_cache_dir or model_args.cache_dir)
File "/tmp/.local/lib/python3.7/site-packages/tevatron/datasets/dataset.py", line 83, in __init__
data_files=data_files, cache_dir=cache_dir)[data_args.dataset_split]
File "/opt/conda/lib/python3.7/site-packages/datasets/load.py", line 1684, in load_dataset
use_auth_token=use_auth_token,
File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 705, in download_and_prepare
dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1221, in _download_and_prepare
super()._download_and_prepare(dl_manager, verify_infos, check_duplicate_keys=verify_infos)
File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 793, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/opt/conda/lib/python3.7/site-packages/datasets/builder.py", line 1212, in _prepare_split
example = self.info.features.encode_example(record)
File "/opt/conda/lib/python3.7/site-packages/datasets/features/features.py", line 1579, in encode_example
return encode_nested_example(self, example)
File "/opt/conda/lib/python3.7/site-packages/datasets/features/features.py", line 1136, in encode_nested_example
def encode_nested_example(schema, obj, level=0):
KeyboardInterrupt
```
Is this normal?
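For debugging, here is a minimal sketch that could locate the first unparsable line in the extracted corpus file. The path is hypothetical; the real file sits somewhere under the datasets cache directory shown in the log above:

```python
import json

# Hypothetical path: point this at the extracted corpus JSONL under the
# datasets cache directory shown in the log above.
path = "/tmp/.cache/huggingface/datasets/downloads/extracted/<hash>/corpus.jsonl"

# Scan for the first line that the stdlib json module cannot parse.
with open(path, encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            print(f"line {lineno}: {err}")
            print(repr(line[:200]))
            break
    else:
        print("all lines parsed cleanly")
```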
Libraries
This is my requirements file:
```
git+https://github.com/texttron/tevatron@b8f33900895930f9886012580e85464a5c1f7e9a
torch==1.12.*
faiss-cpu==1.7.2
transformers==4.15.0
datasets==1.17.0
pyserini
```
Footnote
- I originally saved it as `model_nq` but renamed it to `model_wq`. I don't think this makes a difference, but let me know if it does.
- I also tested with wikipedia-nq, with both the latest version on `master` and the 0.1 version on PyPI, and I'm getting the same error.
Hi @xhluca,
Sorry for the late reply.
Is the issue specific to Tevatron/wikipedia-wq-corpus? Does Tevatron/wikipedia-nq-corpus also not work?
It seems like an issue caused by the JSON environment?
```
data = json.loads(line)
File "/opt/conda/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/opt/conda/lib/python3.7/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/opt/conda/lib/python3.7/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
```
Let me know if you are still having the issue.
Xueguang
I'm not sure what "json environment" means here. I'm using the standard Python 3.7 library in a fresh virtualenv.
I tried different datasets and the problem is still present.
Could you check whether a simple JSONL file can be read in your environment? Or could you try a conda environment? My environment is Python 3.8 with conda.
Yes, I tried the following example: https://stackoverflow.com/questions/50475635/loading-jsonl-file-as-json-objects
And it works fine in my environment.
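For reference, the check was roughly along these lines (`sample.jsonl` is a small hand-made file in the format from the linked answer):

```python
import json

# sample.jsonl: a small hand-written file with one JSON object per line,
# mirroring the StackOverflow example linked above.
with open("sample.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(f"parsed {len(records)} records without error")
```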
@MXueguang My bad, I was indeed using conda. However, do you think it should make a difference whether I'm using conda or virtualenv, since the libraries were installed with pip and there's no conda-specific dependency?
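If it helps narrow things down, here is a quick fingerprint I can run in both setups to see whether the interpreters actually differ (standard-library introspection only):

```python
import json
import sys

# Compare where the interpreter and the stdlib json module live in the
# virtualenv vs. the conda environment.
print(sys.version)
print(sys.executable)
print(json.__file__)
```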