Testing UTF-8 model panics when not using a scorer
- Have I written custom code (as opposed to running examples on an unmodified clone of the repository): No
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04.1 LTS
- TensorFlow installed from (our builds, or upstream TensorFlow): Using Docker
- TensorFlow version (use command below): Using Docker
- Python version: Using Docker
- Bazel version (if compiling from source): Docker
- GCC/Compiler version (if compiling from source): Docker
- CUDA/cuDNN version: Docker
- GPU model and memory: NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] - 4 GB RAM
- Exact command to reproduce: Provided below
I have been trying to create a Japanese model; however, it errors out during the test phase. This is the command I am using to test the model:
python -u DeepSpeech.py --test_files /home/anon/Downloads/jaSTTDatasets/final-test.csv --test_batch_size 4 --epochs 5 --bytes_output_mode --checkpoint_dir /home/anon/Downloads/jaSTTDatasets/checkpoint/
Following is the log of the error received:
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Testing model on /home/anon/Downloads/jaSTTDatasets/final-test.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00 Traceback (most recent call last):
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
absl.app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/DeepSpeech/training/deepspeech_training/train.py", line 958, in main
test()
File "/DeepSpeech/training/deepspeech_training/train.py", line 682, in test
samples = evaluate(FLAGS.test_files.split(','), create_model)
File "/DeepSpeech/training/deepspeech_training/evaluate.py", line 132, in evaluate
samples.extend(run_test(init_op, dataset=csv))
File "/DeepSpeech/training/deepspeech_training/evaluate.py", line 114, in run_test
cutoff_prob=FLAGS.cutoff_prob, cutoff_top_n=FLAGS.cutoff_top_n)
File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 228, in ctc_beam_search_decoder_batch
for beam_results in batch_beam_results
File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 228, in <listcomp>
for beam_results in batch_beam_results
File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 227, in <listcomp>
[(res.confidence, alphabet.Decode(res.tokens)) for res in beam_results]
File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 138, in Decode
return res.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
Everything is run inside the provided Docker container, so it shouldn't be a setup issue.
My CSV files appear to be properly UTF-8 encoded; you can check them in final-test.zip.
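For what it's worth, here is a minimal sketch of how the transcripts can be checked for invalid UTF-8 byte sequences (the path is just my test CSV; adjust as needed):

# Minimal sketch: flag any line in the CSV whose bytes are not valid UTF-8.
with open('/home/anon/Downloads/jaSTTDatasets/final-test.csv', 'rb') as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode('utf-8')
        except UnicodeDecodeError as err:
            print('line', lineno, 'is not valid UTF-8:', err)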
https://discourse.mozilla.org/t/help-with-japanese-model/72139/33 - this may help in debugging as well
The idea is that, as long as you don't have invalid UTF-8 byte sequences in the text you use to generate the scorer, the scorer will prevent the model from predicting them. Adding 'ignore' or 'replace' to the decode call will patch the problem, but the goal is to avoid predicting those sequences at all.
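To make the 'ignore'/'replace' idea concrete, here is a small self-contained sketch of how those error handlers change the behaviour of a UTF-8 decode on an invalid continuation byte (the byte string below is just an illustrative example, not taken from the failing data):

# Sketch: effect of the error handlers on an invalid continuation byte,
# the same class of failure as res.decode('utf-8') in the traceback above.
bad = b'\xc3\x28'  # 0xC3 starts a 2-byte sequence; 0x28 ('(') is not a valid continuation byte

try:
    bad.decode('utf-8')  # strict (default): raises UnicodeDecodeError
except UnicodeDecodeError as err:
    print('strict :', err)

print('replace:', bad.decode('utf-8', errors='replace'))  # substitutes U+FFFD
print('ignore :', bad.decode('utf-8', errors='ignore'))   # drops the offending byte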
Ah, it looks like you're not passing a scorer. For development, using 'ignore'/'replace' is a reasonable workaround, but you definitely want to use a scorer, exactly for this reason: it's the component responsible for preventing the model from exploring invalid beams.
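For completeness, passing a scorer to the test run would look roughly like the command below; the --scorer_path flag name and the scorer file location are assumptions, so check the flags available in the DeepSpeech version inside the container (and, as far as I understand, the scorer itself needs to have been generated for UTF-8/bytes output mode to match --bytes_output_mode):

python -u DeepSpeech.py --test_files /home/anon/Downloads/jaSTTDatasets/final-test.csv --test_batch_size 4 --bytes_output_mode --checkpoint_dir /home/anon/Downloads/jaSTTDatasets/checkpoint/ --scorer_path /path/to/japanese.scorer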
@reuben I can confirm that this error is not thrown when using a scorer.