Handling special characters for character seq2seq model
I was trying to train a character-level seq2seq model. My source and target files (both UTF-8 encoded) contain special characters such as the pound sign.
I kept hitting "UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c" at several places in the code.
I think the issue is that utf-8 encodes these special characters as two bytes; the pound sign, for example, is represented as '\xc2\xa3':
print '\xc2\xa3'.decode('utf-8')
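A quick check in Python 2 makes the two-byte encoding visible:
# The UTF-8 encoded pound sign is two bytes but only one code point.
s = '\xc2\xa3'
print len(s)                  # 2 (bytes)
print len(s.decode('utf-8'))  # 1 (code point)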
The vocab generated by generate_vocab.py doesn't handle this either. To build a character-level vocab with the delimiter set to '', it uses 'list', which treats a special character (two bytes in utf-8) as two separate characters. The same goes for the tf.string_split function in split_tokens_decoder.py. For example, consider the result of:
import tensorflow as tf
sess = tf.InteractiveSession()
data = tf.constant("This \xc2\xa3 is a string")
# With an empty delimiter the split produces one token per byte,
# so '\xc2' and '\xa3' come out as two separate "characters".
tokens = tf.string_split([data], delimiter='')
tokens.values.eval()
compared with
print(data.eval())
and you get the same UnicodeDecodeError when you try to decode one of those bytes:
character = data.eval()[5].decode('utf-8')  # '\xc2' alone is not valid utf-8
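The plain Python 'list' path used by generate_vocab.py behaves the same way; a quick Python 2 illustration:
# list() over a UTF-8 byte string yields one entry per byte, so the
# two-byte pound sign becomes two bogus "characters" in the vocab.
print list('This \xc2\xa3 is a string')
# ['T', 'h', 'i', 's', ' ', '\xc2', '\xa3', ' ', 'i', 's', ...]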
If we create our own vocabulary file (to handle these special characters), we still get the UnicodeDecodeError later in the code (in hooks.py).
To solve this, I converted both my source and target files to latin-1 encoding (which, unlike utf-8, encodes these characters as a single byte) and changed the encoding to latin-1 in the code in hooks.py, decode_text.py and metrics_specs.py. The model then ran out of the box on the latin-1 data.
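The re-encoding itself is a one-off preprocessing step; a minimal Python 2 sketch (file names are placeholders, and this is only safe if every character in the data actually exists in latin-1):
# Re-encode a UTF-8 file as latin-1.
with open('train.sources.txt') as f:            # placeholder name
    text = f.read().decode('utf-8')
with open('train.sources.latin1.txt', 'w') as f:  # placeholder name
    f.write(text.encode('latin-1'))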
Another solution could be to convert the sequence of characters directly into a sequence of integers. I have created this gist for that.
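For reference, a minimal Python 2 sketch of that character-to-integer idea (not the actual gist):
# Decode first so each code point is one character, then map to ids,
# growing the vocab as new characters appear.
def text_to_ids(line, vocab):
    return [vocab.setdefault(ch, len(vocab)) for ch in line.decode('utf-8')]

vocab = {}
print text_to_ids('This \xc2\xa3 is a string', vocab)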
@dennybritz I think the simplest solution would be to decode the text to unicode while reading (so each special character becomes a single code point) and encode it back to the desired file encoding (usually utf-8) while writing the results. What do you suggest?
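A minimal sketch of that decode-on-read / encode-on-write round trip, using io.open (available in both Python 2 and 3; file names are placeholders):
import io

# Read as unicode: one code point per character, so the pound sign
# is a single element when iterating or splitting.
with io.open('input.txt', encoding='utf-8') as f:       # placeholder name
    lines = f.read().splitlines()

# ... work with unicode internally ...

# Encode back to the desired file encoding when writing results.
with io.open('output.txt', 'w', encoding='utf-8') as f:  # placeholder name
    f.write(u'\n'.join(lines))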
#96 may help you, but I'm still not sure how to fix this the right way.
In my case the solution was to use Python 3 and the correct delimiter config:
buckets: 5,10,15,20
hooks:
  - class: PrintModelAnalysisHook
  - class: MetadataCaptureHook
  - class: SyncReplicasOptimizerHook
  - class: TrainSampleHook
    params:
      every_n_steps: 1000
      source_delimiter: ''
      target_delimiter: ''
--input_pipeline_train "
  class: ParallelTextInputPipeline
  params:
    source_delimiter: ''
    target_delimiter: ''
    source_files:
      - $TRAIN_SOURCES
    target_files:
      - $TRAIN_TARGETS" \
--input_pipeline_dev "
  class: ParallelTextInputPipeline
  params:
    source_delimiter: ''
    target_delimiter: ''
    source_files:
      - $DEV_SOURCES
    target_files:
      - $DEV_TARGETS" \
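Presumably the Python 3 switch helps because str is unicode there, so splitting a string yields code points rather than raw bytes; a quick check:
# Python 3: list() over a str yields code points, not bytes,
# so the pound sign stays a single character.
text = b'This \xc2\xa3 is a string'.decode('utf-8')
print(list(text))  # ['T', 'h', 'i', 's', ' ', '£', ' ', 'i', ...]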
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 18: invalid start byte