spm decoding doesn't handle byte fallback

Open erip opened this issue 3 years ago • 0 comments

🐛 Bug

When an spm (SentencePiece) model is trained with byte fallback, decoding in fairseq leaves the byte-fallback pieces (e.g. <0x..>) in the output instead of replacing them with the characters they encode.

To Reproduce

Train an spm model with --byte_fallback enabled. Train a fairseq model on the encoded text, run fairseq inference, and observe <0x..> pieces in the outputs.

Code sample

Expected behavior

fairseq should perform proper spm decoding, mirroring spm_decode: consecutive byte-fallback pieces should be merged back into the characters they encode.
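
A minimal sketch of the difference, in plain Python with no fairseq or sentencepiece dependency. Both function names are hypothetical. `naive_decode` is only an assumption about what the current string-replacement post-processing amounts to, and `byte_fallback_decode` is one possible way to merge `<0x..>` pieces into UTF-8 text, not the actual spm_decode implementation.

```python
import re
from typing import List

# A single SentencePiece byte-fallback piece, e.g. "<0xE4>".
_BYTE_PIECE = re.compile(r"<0x([0-9A-Fa-f]{2})>")


def naive_decode(pieces: str) -> str:
    """Assumed approximation of the current behavior: join pieces, map U+2581 to space."""
    return pieces.replace(" ", "").replace("\u2581", " ").strip()


def byte_fallback_decode(pieces: str) -> str:
    """Hypothetical helper: also merges runs of <0x..> pieces into UTF-8 characters."""
    out: List[str] = []
    pending = bytearray()  # bytes collected from consecutive byte-fallback pieces

    def flush() -> None:
        # Decode the collected bytes; invalid sequences become U+FFFD.
        if pending:
            out.append(pending.decode("utf-8", errors="replace"))
            pending.clear()

    for piece in pieces.split():
        match = _BYTE_PIECE.fullmatch(piece)
        if match:
            pending.append(int(match.group(1), 16))
        else:
            flush()
            out.append(piece)
    flush()
    return "".join(out).replace("\u2581", " ").strip()


if __name__ == "__main__":
    hypo = "\u2581Hello \u2581 <0xE4> <0xBD> <0xA0> <0xE5> <0xA5> <0xBD>"
    print(naive_decode(hypo))          # Hello <0xE4><0xBD><0xA0><0xE5><0xA5><0xBD>
    print(byte_fallback_decode(hypo))  # Hello 你好
```

Working on the space-separated pieces rather than on the joined string avoids rewriting text that merely looks like a byte piece after joining. Decoding through the trained model itself (for example with the sentencepiece Python package) would likely be the more faithful fix.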

Environment

  • fairseq Version (e.g., 1.0 or main): main
  • PyTorch Version (e.g., 1.0): 1.12
  • OS (e.g., Linux): all
  • How you installed fairseq (pip, source): source
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Additional context

erip, Jul 24 '22 23:07