logbert
Why was the WordVocab generated using only the training set data?
If words in the test set are not recorded in the vocab, will they all be mapped to unk_index during testing?
This step is to avoid data leakage. The embedding layer has a fixed size, so even if you included a new log event from the test set, its corresponding embedding could not be learned during training. These new events are therefore all mapped to the unknown index. This does, of course, cause an OOV (out-of-vocabulary) issue for a parser-based method.
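A minimal sketch of this behavior (function and variable names here are illustrative, not the actual logbert API): the vocabulary is built from training sequences only, so any event first seen at test time falls back to a shared unknown index.

```python
UNK_INDEX = 1  # assumption: one index is reserved for out-of-vocabulary events

def build_vocab(train_sequences):
    """Assign an index to every log key seen in the training set."""
    vocab = {"<unk>": UNK_INDEX}
    next_index = UNK_INDEX + 1
    for seq in train_sequences:
        for event in seq:
            if event not in vocab:
                vocab[event] = next_index
                next_index += 1
    return vocab

def encode(vocab, sequence):
    """Map each event to its index; unseen events become UNK_INDEX."""
    return [vocab.get(event, UNK_INDEX) for event in sequence]

train = [["E1", "E2"], ["E2", "E3"]]
vocab = build_vocab(train)
# "E9" never appeared in training, so it collapses to the unknown index.
print(encode(vocab, ["E1", "E9"]))  # -> [2, 1]
```

Because the embedding matrix is sized to this fixed vocabulary at training time, there is simply no row for "E9" to train, which is why all such events share the unknown embedding.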