DeepSpeech: testing a UTF-8 model panics when not using a scorer

Have I written custom code (as opposed to running examples on an unmodified clone of the repository): No
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04.1 LTS
TensorFlow installed from (our builds, or upstream TensorFlow): Using Docker
TensorFlow version (use command below): Using Docker
Python version: Using Docker
Bazel version (if compiling from source): Docker
GCC/Compiler version (if compiling from source): Docker
CUDA/cuDNN version: Docker
GPU model and memory: NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] - 4GB ram
Exact command to reproduce: Provided below

I have been trying to create a Japanese model, but it errors out during the test phase. This is the command I am using to test the model:

python -u DeepSpeech.py --test_files /home/anon/Downloads/jaSTTDatasets/final-test.csv --test_batch_size 4 --epochs 5 --bytes_output_mode --checkpoint_dir /home/anon/Downloads/jaSTTDatasets/checkpoint/

The following is the log of the error received:

I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Testing model on /home/anon/Downloads/jaSTTDatasets/final-test.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00
Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 958, in main
    test()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 682, in test
    samples = evaluate(FLAGS.test_files.split(','), create_model)
  File "/DeepSpeech/training/deepspeech_training/evaluate.py", line 132, in evaluate
    samples.extend(run_test(init_op, dataset=csv))
  File "/DeepSpeech/training/deepspeech_training/evaluate.py", line 114, in run_test
    cutoff_prob=FLAGS.cutoff_prob, cutoff_top_n=FLAGS.cutoff_top_n)
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 228, in ctc_beam_search_decoder_batch
    for beam_results in batch_beam_results
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 228, in <listcomp>
    for beam_results in batch_beam_results
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 227, in <listcomp>
    [(res.confidence, alphabet.Decode(res.tokens)) for res in beam_results]
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 138, in Decode
    return res.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte

Everything is run inside the provided Docker container, so it shouldn't be a setup issue.

My CSV files seem to be properly UTF-8 encoded; you can check them in final-test.zip
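One way to double-check the claim that the CSV files are valid UTF-8 is to read them as raw bytes and try decoding each line strictly. The following is only a minimal sketch: `find_invalid_utf8_lines` is a hypothetical helper written for illustration and is not part of DeepSpeech.

```python
from typing import List, Tuple


def find_invalid_utf8_lines(path: str) -> List[Tuple[int, bytes]]:
    """Return (line_number, raw_bytes) for every line that is not valid UTF-8.

    Hypothetical helper for illustration; not part of DeepSpeech.
    """
    bad = []
    # Read in binary mode so Python does not decode (and fail) on our behalf.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")  # strict decoding raises on invalid bytes
            except UnicodeDecodeError:
                bad.append((lineno, raw))
    return bad
```

An empty list means every line decoded cleanly, which would point away from the input data and toward the decoding of model output instead.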

https://discourse.mozilla.org/t/help-with-japanese-model/72139/33 - this may help in debugging as well

Dec 22 '20 21:12 shravanshetty1

The idea is that as long as you don't have invalid UTF-8 byte sequences in the text you use to generate the scorer, it'll prevent the model from predicting those. Adding 'ignore' or 'replace' to the decode call will patch the problem but the goal is to avoid predicting those things at all.

Dec 23 '20 09:12 reuben

Ah, it looks like you're not passing a scorer. For development, using 'ignore'/'replace' is a reasonable workaround, but you definitely want to use a scorer for exactly this reason: it's the component responsible for preventing the model from exploring invalid beams.
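To illustrate what the 'ignore'/'replace' workaround does, here is a small standalone sketch of Python's `bytes.decode` error handlers. It does not modify `ds_ctcdecoder`; the `decode_lenient` name and the sample bytes are made up for this example.

```python
# Sample bytes that are NOT valid UTF-8: a lead byte (0xE3) that expects
# continuation bytes is followed by ASCII 'A' instead. The trailing three
# bytes (E3 81 82) are a valid encoding of the hiragana letter "あ".
RAW = b"\xe3\x41\xe3\x81\x82"


def decode_lenient(data: bytes, errors: str = "replace") -> str:
    """Decode UTF-8 without raising on invalid sequences.

    errors="replace" substitutes U+FFFD for each invalid sequence;
    errors="ignore" drops the invalid bytes silently.
    Hypothetical helper for illustration only.
    """
    return data.decode("utf-8", errors=errors)


if __name__ == "__main__":
    try:
        RAW.decode("utf-8")  # strict mode, as in the traceback above
    except UnicodeDecodeError as exc:
        print("strict decoding failed:", exc)
    print(repr(decode_lenient(RAW, "replace")))
    print(repr(decode_lenient(RAW, "ignore")))
```

Both handlers only hide the invalid bytes after the fact; they do not stop the model from predicting them, which is why this is suggested for development only.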

Dec 23 '20 09:12 reuben

@reuben I can confirm that this error is not thrown when using a scorer.

Dec 23 '20 10:12 shravanshetty1