Does it support character-level embedding?

Open zxu7 opened this issue 8 years ago • 7 comments

If yes, how do I turn it on during training?

zxu7 avatar Oct 24 '17 03:10 zxu7

@zxu7 The code doesn't have a character level embedding option.

However, you may tokenize data at character level, and prepare a character level vocab file to train a character model with the codebase.
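A minimal sketch of how such a character-level vocab file could be prepared. Everything here is an assumption for illustration, not part of the nmt codebase: the helper names `build_char_vocab` and `write_vocab` are hypothetical, `~` is an arbitrary placeholder for the space character (whitespace separates tokens, so a literal space cannot be a vocab entry), and the three leading special tokens reflect my understanding of what the nmt vocab check expects.

```python
from typing import Iterable, List

# Hypothetical placeholder standing in for the space character.
SPACE_SYMBOL = "~"
# Assumed reserved tokens at the top of the vocab file.
SPECIAL_TOKENS = ["<unk>", "<s>", "</s>"]


def build_char_vocab(lines: Iterable[str],
                     space_symbol: str = SPACE_SYMBOL) -> List[str]:
    """Collect every character in the corpus, one vocab entry per character."""
    chars = set()
    for line in lines:
        for ch in line.rstrip("\n"):
            # Only the plain space is mapped; other whitespace is kept as-is.
            chars.add(space_symbol if ch == " " else ch)
    return SPECIAL_TOKENS + sorted(chars)


def write_vocab(vocab: List[str], path: str) -> None:
    """Write one token per line."""
    with open(path, "w", encoding="utf-8") as f:
        for token in vocab:
            f.write(token + "\n")


vocab = build_char_vocab(["ab a\n"])
print(vocab)  # ['<unk>', '<s>', '</s>', 'a', 'b', '~']
```

The training data would then need to be tokenized the same way, with one character per space-separated token, so that the tokens match the vocab entries.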

oahziur avatar Nov 14 '17 01:11 oahziur

Hi @oahziur, I'm trying to train a character-level NMT model, so I basically built a vocabulary consisting of characters:

[image: char_vocab]

The problem I'm facing is that the space character is also part of my vocabulary, but if I add it to the vocab file I get an error:

[image: vocab_error]

So I tried replacing the space with a special character that is not in my vocab, say ~. But during training the model keeps predicting the unknown token <unk>, implying that there may be a tokenization problem. My question is: how should I tokenize the data so that the model works with the character-level vocabulary? For now I have left my data in the same format as for word-level models:

  • source file: [image: src_exple]

  • target file: [image: tgt_exple]

thanks!

AlkaSaliss avatar Aug 02 '18 14:08 AlkaSaliss

You need to make sure your code splits each sentence into characters instead of words. By default, the code splits sentences on spaces, which is what causes the problem for you.

https://www.tensorflow.org/api_docs/python/tf/string_split

oahziur avatar Aug 03 '18 04:08 oahziur

Hi @oahziur, to train it as a character-level model, should I change the delimiter from a space to an empty string in the TensorFlow string_split file, or is there a way I can do that in the nmt code?

eswarjal09 avatar Aug 16 '18 05:08 eswarjal09

Hi @eswarjal09. I ran into the same situation and solved it with a workaround. Not knowing which part of the script to change to make it split the data at the character level, I preprocessed my data this way:

  1. Replace all the spaces in my data with a special symbol that I'm sure is not part of my vocabulary (say ~, for example). Thus, a sentence like "I eat food." becomes "I~eat~food."
  2. Split each sentence into characters separated by spaces.

So, to summarize, my data goes from this:

[image: gitnmt1]

to this:

[image: gitnmt2]

I then trained the model with the normal training procedure provided by tensorflow-nmt. To revert back to word level, I take the results from inference, remove the whitespace, and replace the ~ (or whatever special symbol you used) with spaces.

I am sure there is a better way to handle this by modifying the nmt scripts, but this could work as a temporary solution (at least it worked for me).
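The two preprocessing steps and the reverse step described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `to_char_level` and `from_char_level` are hypothetical, and `~` is assumed not to occur anywhere in the raw data.

```python
# Hypothetical placeholder; it must not appear in the raw data.
SPACE_SYMBOL = "~"


def to_char_level(sentence: str, space_symbol: str = SPACE_SYMBOL) -> str:
    """Replace spaces with the placeholder, then separate characters by spaces."""
    sentence = sentence.rstrip("\n")
    if space_symbol in sentence:
        raise ValueError("placeholder symbol already occurs in the data")
    # Joining a string with " " puts a space between every character.
    return " ".join(sentence.replace(" ", space_symbol))


def from_char_level(line: str, space_symbol: str = SPACE_SYMBOL) -> str:
    """Undo to_char_level on a line of model output."""
    # Drop the separating whitespace, then restore the real spaces.
    return "".join(line.split()).replace(space_symbol, " ")


encoded = to_char_level("I eat food.")
print(encoded)                   # I ~ e a t ~ f o o d .
print(from_char_level(encoded))  # I eat food.
```

Applying `to_char_level` to every line of the source and target files before training, and `from_char_level` to every line of the inference output, would reproduce the workflow described above without touching the nmt scripts.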

AlkaSaliss avatar Aug 18 '18 13:08 AlkaSaliss

@AlkaSaliss were you able to generate character embeddings and run NMT at the character level?

shanalikhan avatar May 11 '19 12:05 shanalikhan

@shanalikhan Yes, I managed to get it to work with a character-level vocabulary. See my comment above. I'm not sure it is the best way, but it could work as a workaround.

AlkaSaliss avatar May 11 '19 16:05 AlkaSaliss