Testing UTF-8 model panics when not using a scorer

Open shravanshetty1 opened this issue 5 years ago • 3 comments

  • Have I written custom code (as opposed to running examples on an unmodified clone of the repository): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04.1 LTS
  • TensorFlow installed from (our builds, or upstream TensorFlow): Using Docker
  • TensorFlow version (use command below): Using Docker
  • Python version: Using Docker
  • Bazel version (if compiling from source): Docker
  • GCC/Compiler version (if compiling from source): Docker
  • CUDA/cuDNN version: Docker
  • GPU model and memory: NVIDIA Corporation TU117M [GeForce GTX 1650 Mobile / Max-Q] - 4GB ram
  • Exact command to reproduce: Provided below

I have been trying to create a Japanese model, but it errors out during the test phase. This is the command I am using to test the model:

python -u DeepSpeech.py --test_files /home/anon/Downloads/jaSTTDatasets/final-test.csv --test_batch_size 4 --epochs 5 --bytes_output_mode --checkpoint_dir /home/anon/Downloads/jaSTTDatasets/checkpoint/

The following is the log of the error received:

I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Testing model on /home/anon/Downloads/jaSTTDatasets/final-test.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00
Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 958, in main
    test()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 682, in test
    samples = evaluate(FLAGS.test_files.split(','), create_model)
  File "/DeepSpeech/training/deepspeech_training/evaluate.py", line 132, in evaluate
    samples.extend(run_test(init_op, dataset=csv))
  File "/DeepSpeech/training/deepspeech_training/evaluate.py", line 114, in run_test
    cutoff_prob=FLAGS.cutoff_prob, cutoff_top_n=FLAGS.cutoff_top_n)
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 228, in ctc_beam_search_decoder_batch
    for beam_results in batch_beam_results
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 228, in <listcomp>
    for beam_results in batch_beam_results
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 227, in <listcomp>
    [(res.confidence, alphabet.Decode(res.tokens)) for res in beam_results]
  File "/usr/local/lib/python3.6/dist-packages/ds_ctcdecoder/__init__.py", line 138, in Decode
    return res.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte

Everything is run inside the provided Docker container, so it shouldn't be a setup issue.

My CSV files seem to be properly UTF-8 encoded; you can check them in final-test.zip.

https://discourse.mozilla.org/t/help-with-japanese-model/72139/33 - this may help in debugging as well

shravanshetty1, Dec 22 '20 21:12
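
For readers who want to verify the "properly UTF-8 encoded" claim on their own data, here is a minimal sketch. The helper name `find_invalid_utf8_lines` and the sample rows are hypothetical and not part of DeepSpeech; the check reads the file as raw bytes and tries to decode each line strictly.

```python
def find_invalid_utf8_lines(raw: bytes):
    """Return (line_number, error_message) for every line that is not valid UTF-8.

    Splitting raw bytes on line breaks is safe here because the bytes of a
    multi-byte UTF-8 sequence never include the newline or carriage-return bytes.
    """
    problems = []
    for line_number, line in enumerate(raw.splitlines(), start=1):
        try:
            line.decode("utf-8")  # strict decoding raises on any invalid sequence
        except UnicodeDecodeError as exc:
            problems.append((line_number, str(exc)))
    return problems


# Hypothetical CSV content: a header, one valid row, and one deliberately broken row.
good_rows = "wav_filename,wav_filesize,transcript\na.wav,123,こんにちは\n".encode("utf-8")
broken_row = b"b.wav,456,\xe3\x81A\n"  # truncated 3-byte sequence followed by ASCII "A"

clean_report = find_invalid_utf8_lines(good_rows)
broken_report = find_invalid_utf8_lines(good_rows + broken_row)
```

For a file on disk, the same function can be called as `find_invalid_utf8_lines(open(path, "rb").read())`. An empty result only shows that the input text is valid; as the replies in this thread explain, the decoder can still emit invalid byte sequences at test time when no scorer is used.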

The idea is that as long as you don't have invalid UTF-8 byte sequences in the text you use to generate the scorer, the scorer will prevent the model from predicting them. Adding 'ignore' or 'replace' to the decode call will patch the problem, but the goal is to avoid predicting invalid sequences at all.

reuben, Dec 23 '20 09:12

Ah, it looks like you're not passing a scorer. I guess for development using 'ignore'/'replace' is a reasonable workaround, but you definitely want to use a scorer for exactly this reason: it's the component responsible for preventing the model from exploring invalid beams.

reuben, Dec 23 '20 09:12
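
The 'ignore'/'replace' workaround mentioned above refers to the `errors` argument of Python's built-in `bytes.decode`. The following is a rough sketch of how the two modes behave, not a patch to `ds_ctcdecoder`: the helper `decode_lenient` and the sample byte sequence are assumptions made up for illustration.

```python
def decode_lenient(token_bytes: bytes, errors: str = "replace") -> str:
    """Decode predicted bytes without raising on invalid UTF-8.

    errors="replace" substitutes U+FFFD for invalid sequences;
    errors="ignore" silently drops them.
    """
    return token_bytes.decode("utf-8", errors=errors)


# Hypothetical prediction: the first two bytes of "あ" (E3 81 82),
# followed by ASCII "A" where the third byte should have been.
predicted = bytes([0xE3, 0x81, 0x41])

# Strict decoding, as in the traceback above, raises UnicodeDecodeError.
try:
    predicted.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

replaced = decode_lenient(predicted, "replace")  # invalid bytes become U+FFFD
ignored = decode_lenient(predicted, "ignore")    # invalid bytes are dropped
```

Either mode only hides the symptom: the transcript still contains a wrong or missing character, which is why using a scorer is the recommended fix.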

@reuben I can confirm that this error is not thrown when using a scorer.

shravanshetty1, Dec 23 '20 10:12