logbert
Why was the WordVocab generated using only the training set data?
If words in the test set are not recorded in the vocab, will they all be mapped to unk_index during testing?
This step is to avoid data leakage. The embedding layer has a fixed size, so even if you included a new log event from the test set, its corresponding embedding could not be learned during training. These new events are therefore all mapped to the unknown index. This does, of course, cause an OOV (out-of-vocabulary) issue for a parser-based method.
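A minimal sketch of this behavior (function and variable names here are illustrative, not the actual logbert API): the vocabulary is built from training sequences only, so any event first seen at test time falls back to a shared unknown index.

```python
UNK_INDEX = 1  # assumption: one index is reserved for out-of-vocabulary events

def build_vocab(train_sequences):
    """Assign an index to every log key seen in the training set."""
    vocab = {"<unk>": UNK_INDEX}
    next_index = UNK_INDEX + 1
    for seq in train_sequences:
        for event in seq:
            if event not in vocab:
                vocab[event] = next_index
                next_index += 1
    return vocab

def encode(vocab, sequence):
    """Map each event to its index; unseen events become UNK_INDEX."""
    return [vocab.get(event, UNK_INDEX) for event in sequence]

train = [["E1", "E2"], ["E2", "E3"]]
vocab = build_vocab(train)
# "E9" never appeared in training, so it collapses to the unknown index.
print(encode(vocab, ["E1", "E9"]))  # -> [2, 1]
```

Because the embedding matrix is sized to this fixed vocabulary at training time, there is simply no row for "E9" to train, which is why all such events share the unknown embedding.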