
Handling special characters for character seq2seq model

Open shubhamagarwal92 opened this issue 8 years ago • 3 comments

I was trying to train a character-level seq2seq model. My source and target files (both UTF-8 encoded) contain special characters such as the pound symbol.

I was hitting "UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c" at several places in the code.

I think the issue is that UTF-8 encodes these special characters as two bytes; the pound sign, for example, is represented as '\xc2\xa3':

print '\xc2\xa3'.decode('utf-8')
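For reference, the one-character/two-byte mismatch is easy to confirm in plain Python 3 (a quick sanity check, not codebase-specific):

pound = u'\u00a3'                # the pound sign
encoded = pound.encode('utf-8')  # b'\xc2\xa3'
print(len(pound), len(encoded))  # 1 character, 2 bytes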

The vocab generated through generate_vocab.py doesn't cater for this either.

To create a character-level vocab, it uses list when the delimiter is '', which treats a special character (two bytes in UTF-8) as two separate characters. The same goes for the tf.string_split function in split_tokens_decoder.py. For example, consider the result of:

import tensorflow as tf

sess = tf.InteractiveSession()
data = tf.constant("This \xc2\xa3 is a string")
# An empty delimiter splits on every byte, so the two-byte pound sign
# ends up as two separate tokens
tokens = tf.string_split([data], delimiter='')
tokens.values.eval()

compared with

print (data.eval())

and you get the same UnicodeDecodeError when you try to decode one of those bytes on its own:
character = data.eval()[5].decode('utf-8')
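Both behaviours are reproducible without TensorFlow; a minimal Python 3 sketch (note that slicing bytes with [i:i+1] keeps a bytes object rather than an int):

raw = u"This \u00a3 is a string".encode('utf-8')
tokens = [raw[i:i+1] for i in range(len(raw))]  # byte-level "character" split
print(tokens[5], tokens[6])                     # b'\xc2' b'\xa3', the pound sign split in two
tokens[5].decode('utf-8')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc2 in position 0:
# unexpected end of data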

If we create our own vocabulary file (to handle these special characters), we still get the UnicodeDecodeError later in the code (in hooks.py).

To solve this, I converted both my source and target files to Latin-1 encoding (which, unlike UTF-8, encodes these characters as a single byte) and changed the encoding to latin-1 in hooks.py, decode_text.py and metrics_specs.py. The model then ran out of the box on the Latin-1 files.
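The conversion step itself is straightforward; a rough sketch (file names are illustrative), assuming every character in the data actually exists in Latin-1:

import codecs

with codecs.open('train.utf8.txt', 'r', encoding='utf-8') as src, \
     codecs.open('train.latin1.txt', 'w', encoding='latin-1') as dst:
    dst.write(src.read())  # raises UnicodeEncodeError for non-Latin-1 chars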

Another solution would be to directly convert the sequence of characters into a sequence of integers. I have created this gist for that.
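A minimal Python 3 sketch of the idea (the function names are mine, not taken from the gist): map each Unicode character to its code point and back.

def text_to_ints(text):
    # One integer per character, regardless of its byte length in UTF-8
    return [ord(ch) for ch in text]

def ints_to_text(ints):
    return ''.join(chr(i) for i in ints)

ids = text_to_ints("This \u00a3 is a string")
assert ints_to_text(ids) == "This \u00a3 is a string"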

@dennybritz I think the simplest solution would be to decode the text to Unicode (which represents each of these special characters as a single code point) while reading, and encode back to the desired file encoding (typically UTF-8) while writing the results. What do you suggest?
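For concreteness, a minimal sketch of that decode-on-read, encode-on-write scheme (the helper names are illustrative, not from the codebase), assuming plain line-based text files:

import codecs

def read_lines(path, encoding='utf-8'):
    # Decode at the boundary; everything in memory is Unicode
    with codecs.open(path, 'r', encoding=encoding) as f:
        return [line.rstrip('\n') for line in f]

def write_lines(path, lines, encoding='utf-8'):
    # Encode back to the desired file encoding only when writing out
    with codecs.open(path, 'w', encoding=encoding) as f:
        for line in lines:
            f.write(line + '\n')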

shubhamagarwal92 avatar Apr 07 '17 16:04 shubhamagarwal92

#96 may help you, but I'm still not sure how to fix this the right way.

dennybritz avatar Apr 17 '17 20:04 dennybritz

In my case the solution was to use Python 3 and the correct delimiter config:

buckets: 5,10,15,20
hooks:
  - class: PrintModelAnalysisHook
  - class: MetadataCaptureHook
  - class: SyncReplicasOptimizerHook
  - class: TrainSampleHook
    params:
      every_n_steps: 1000
      source_delimiter: ''
      target_delimiter: ''
  --input_pipeline_train "
    class: ParallelTextInputPipeline
    params:
      source_delimiter: ''
      target_delimiter: ''
      source_files:
        - $TRAIN_SOURCES
      target_files:
        - $TRAIN_TARGETS" \
  --input_pipeline_dev "
    class: ParallelTextInputPipeline
    params:
      source_delimiter: ''
      target_delimiter: ''
      source_files:
        - $DEV_SOURCES
      target_files:
        - $DEV_TARGETS" \

sheerun avatar Jul 09 '17 16:07 sheerun

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 18: invalid start byte

anindita0619 avatar Aug 03 '20 09:08 anindita0619