
How to create vocabulary for my dataset?

Open ayushidalmia opened this issue 8 years ago • 5 comments

Hi, I have dug into the documentation of the tutorial here: https://google.github.io/seq2seq/nmt/

However, it is unclear to me on how to create vocab_source and vocab_target for my own dataset.

ayushidalmia avatar Aug 07 '17 11:08 ayushidalmia

You can have a look at the preprocessing script for the wmt16 data here. The script creates vocabularies of characters, words and BPE units.

geert-heyman avatar Aug 07 '17 13:08 geert-heyman

As @geert-heyman mentioned, you can go through those scripts. In short, the steps are the following (see the sketch after this list):

  1. Create a vocabulary using both languages.
  2. Index the words (word to index).
  3. If you have pre-trained vectors such as GloVe or word2vec, assign each word its vector.
  4. Otherwise, let the words learn their vectors during training. Read about TensorFlow's embedding lookup function.
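
Here is a minimal sketch of those steps in TF 1.x-style Python (which this library targets). The file names `train.src`/`train.tgt` and the `glove_vectors` dict are placeholders for your own data, not anything the library provides:

```python
# Sketch only, not the seq2seq library's own code: build a joint vocabulary,
# index it, and either plug in pre-trained vectors or learn embeddings in training.
from collections import Counter

import numpy as np
import tensorflow as tf  # written against TF 1.x


def build_vocab(paths, max_size=50000):
    """Count whitespace tokens in the given text files and keep the most frequent."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
    # Step 2: word -> index; reserve extra ids for special tokens if your setup needs them.
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(max_size))}


# Step 1: build one vocabulary over both source and target files (hypothetical names).
vocab = build_vocab(["train.src", "train.tgt"])

# Step 3: if you have pre-trained vectors (GloVe / word2vec), copy them into a matrix
# aligned with the vocabulary indices; words without a vector keep a random init.
embedding_dim = 300
embedding_matrix = np.random.uniform(
    -0.1, 0.1, (len(vocab), embedding_dim)).astype(np.float32)
# for word, vec in glove_vectors.items():   # glove_vectors is a dict you load yourself
#     if word in vocab:
#         embedding_matrix[vocab[word]] = vec

# Step 4: otherwise just let TensorFlow learn the embeddings during training.
embeddings = tf.get_variable(
    "embeddings",
    initializer=embedding_matrix,  # drop this and pass shape= to learn from scratch
    trainable=True)
token_ids = tf.placeholder(tf.int32, shape=[None, None])  # a batch of indexed sentences
embedded = tf.nn.embedding_lookup(embeddings, token_ids)  # [batch, time, embedding_dim]
```

Whether or not you initialize from pre-trained vectors, the `tf.nn.embedding_lookup` call is the same; the only difference is what goes into the embedding matrix before training starts.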

shamanez avatar Aug 07 '17 13:08 shamanez

@shamanez, let's say I generate pre-trained word embeddings for each word in my vocabulary (using GloVe/word2vec).

How do I feed both the words and word embeddings to seq2seq? Do I need a special format for the vocabulary file?

In the tutorial, each word in the vocabulary is followed by its count (toy data reverse).

micheletufano avatar Dec 08 '17 17:12 micheletufano

I think the vocabulary file format is just the token followed by its count (as in its prevalence in the training text used to source the vocab).
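
For reference, a hypothetical snippet of such a file: one token per line, optionally followed by a tab and its count (the tokens and counts here are made up for illustration):

```
the	120934
of	67893
to	54321
a	50012
```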

ruohoruotsi avatar Feb 15 '18 01:02 ruohoruotsi

Does it matter if we re-sort the tokens in the vocabulary file (ascending or descending) or otherwise change their order? Will NMT still work correctly?

mohammedayub44 avatar Apr 04 '18 13:04 mohammedayub44