How to create vocabulary for my dataset?
Hi, I have dug into the documentation of the tutorial here: https://google.github.io/seq2seq/nmt/
However, it is unclear to me how to create vocab_source and vocab_target for my own dataset.
You can have a look at the preprocessing script for the wmt16 data here. The script creates vocabularies of characters, words and BPE units.
As @geert-heyman mentioned, you can go through those scripts. In short, the steps are:
- Build a vocabulary from both languages.
- Index it (word → index).
- If you have pretrained embeddings such as GloVe or word2vec, assign each word its vector.
- Otherwise, initialize the vectors randomly and let them be learned during training. Read about TensorFlow's embedding lookup function (`tf.nn.embedding_lookup`).
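The steps above can be sketched in Python. This is a minimal, hedged sketch, not the seq2seq toolkit's actual code: the tokenization (whitespace split), the embedding dimension, and the tiny "pretrained" table are all assumptions for illustration.

```python
import numpy as np

def build_vocab(source_sentences, target_sentences):
    """Step 1: collect the set of tokens from both languages."""
    tokens = set()
    for sent in list(source_sentences) + list(target_sentences):
        tokens.update(sent.split())
    return sorted(tokens)

def index_vocab(vocab):
    """Step 2: map each word to an integer index (word -> index)."""
    return {word: i for i, word in enumerate(vocab)}

def build_embedding_matrix(vocab, pretrained, dim):
    """Steps 3/4: use a pretrained vector when one exists, otherwise
    initialize randomly so training can learn it."""
    rng = np.random.default_rng(0)
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    for i, word in enumerate(vocab):
        vec = pretrained.get(word)
        matrix[i] = vec if vec is not None else rng.normal(scale=0.1, size=dim)
    return matrix

# Toy usage with a fake one-entry "GloVe" table (hypothetical data):
src = ["the cat sat", "a dog ran"]
tgt = ["le chat", "un chien"]
vocab = build_vocab(src, tgt)
word2idx = index_vocab(vocab)
emb = build_embedding_matrix(vocab, {"cat": np.ones(4)}, dim=4)
```

The resulting matrix is what you would hand to an embedding lookup: row `word2idx[w]` holds the vector for word `w`.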
@shamanez, let's say I generate pre-trained word embeddings for each word in my vocabulary (using GloVe/word2vec).
How do I feed both the words and the word embeddings to seq2seq? Do I need a special format for the vocabulary file?
In the tutorial each word in the vocabulary is followed by its count (toy data reverse).
I think the vocabulary file format is just one token per line, optionally followed by a tab and its count.
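A minimal sketch of producing such a file: one token per line with a tab-separated count, most frequent first, matching what the toy-data vocabulary looks like. The file name and whitespace tokenization here are assumptions, not the toolkit's exact conventions.

```python
from collections import Counter

def write_vocab(sentences, path):
    """Count whitespace-separated tokens and write 'token<TAB>count'
    lines, most frequent first (assumed format)."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    with open(path, "w") as f:
        for token, count in counts.most_common():
            f.write("%s\t%d\n" % (token, count))

# Hypothetical output path for the source-side vocabulary:
write_vocab(["the cat sat on the mat"], "vocab.sources.txt")
```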
Does it matter if we re-sort the tokens in the vocabulary file (ascending or descending) or otherwise change their order? Will NMT still work correctly?