How to create vocabulary for my dataset?
Hi, I have dug into the documentation of the tutorial here: https://google.github.io/seq2seq/nmt/
However, it is unclear to me how to create vocab_source and vocab_target for my own dataset.
You can have a look at the preprocessing script for the wmt16 data here. The script creates vocabularies of characters, words and BPE units.
As @geert-heyman mentioned, you can go through those scripts. In short, the steps are:
- Build a vocabulary from both languages.
- Index it (word → index).
- If you have pretrained embeddings such as GloVe or word2vec, assign each word its vector.
- Otherwise, initialize the vectors randomly and let them be learned during training. Read about TensorFlow's embedding lookup function (`tf.nn.embedding_lookup`).
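The steps above can be sketched in Python. This is a minimal, hedged sketch, not the seq2seq toolkit's actual code: the tokenization (whitespace split), the embedding dimension, and the tiny "pretrained" table are all assumptions for illustration.

```python
import numpy as np

def build_vocab(source_sentences, target_sentences):
    """Step 1: collect the set of tokens from both languages."""
    tokens = set()
    for sent in list(source_sentences) + list(target_sentences):
        tokens.update(sent.split())
    return sorted(tokens)

def index_vocab(vocab):
    """Step 2: map each word to an integer index (word -> index)."""
    return {word: i for i, word in enumerate(vocab)}

def build_embedding_matrix(vocab, pretrained, dim):
    """Steps 3/4: use a pretrained vector when one exists, otherwise
    initialize randomly so training can learn it."""
    rng = np.random.default_rng(0)
    matrix = np.zeros((len(vocab), dim), dtype=np.float32)
    for i, word in enumerate(vocab):
        vec = pretrained.get(word)
        matrix[i] = vec if vec is not None else rng.normal(scale=0.1, size=dim)
    return matrix

# Toy usage with a fake one-entry "GloVe" table (hypothetical data):
src = ["the cat sat", "a dog ran"]
tgt = ["le chat", "un chien"]
vocab = build_vocab(src, tgt)
word2idx = index_vocab(vocab)
emb = build_embedding_matrix(vocab, {"cat": np.ones(4)}, dim=4)
```

The resulting matrix is what you would hand to an embedding lookup: row `word2idx[w]` holds the vector for word `w`.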
@shamanez, let's say I generate pre-trained word embeddings for each word in my vocabulary (using GloVe/word2vec).
How do I feed both the words and the word embeddings to seq2seq? Do I need a special format for the vocabulary file?
In the tutorial each word in the vocabulary is followed by its count (toy data reverse).
I think the vocabulary file format is just one token per line, optionally followed by a tab and its count.
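A minimal sketch of producing such a file: one token per line with a tab-separated count, most frequent first, matching what the toy-data vocabulary looks like. The file name and whitespace tokenization here are assumptions, not the toolkit's exact conventions.

```python
from collections import Counter

def write_vocab(sentences, path):
    """Count whitespace-separated tokens and write 'token<TAB>count'
    lines, most frequent first (assumed format)."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    with open(path, "w") as f:
        for token, count in counts.most_common():
            f.write("%s\t%d\n" % (token, count))

# Hypothetical output path for the source-side vocabulary:
write_vocab(["the cat sat on the mat"], "vocab.sources.txt")
```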
Does it matter if we re-sort the tokens in the vocabulary file (ascending or descending) or otherwise change their order? Will NMT still work correctly?