Question about corpus
Hello matatusko, you said: [For the data preprocessing I'm using spaCy as I've come to rely heavily on it for any NLP task - it's simply brilliant. In this case, spaCy is used to remove stopwords and punctuation from the dataset, as well as to exchange any entities in the text for their corresponding labels, such as LOC, PERSON, DATE etc., so the model learns dependencies between paragraph and question and does not overfit.] Do you mean that word embeddings for LOC, PERSON, DATE etc. are already in your corpus when you train the word-embedding model? If so, could you explain more about your corpus? Do you tag every word of a paragraph, or only the special entities (LOC, PERSON, ORG)? And how many QA training/dev pairs do you think are needed to get a good result? Thanks & regards
Hi! Apologies for the late answer.
The standard word embeddings are taken from ConceptNet Numberbatch (it might be a better, and certainly faster, option to use spaCy's vectors, though). I only replace the special entities with their labels, so that the model can learn to generalize. Beyond that, I only remove stopwords and punctuation, but you could experiment with lemmatizing the words or keeping only the important verbs/nouns within a sentence. The entity labels in the model's output can later be replaced with the real entities found in the new data using some clever function (a pointer-generator method? haven't really tried that out).
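To make that concrete, here's a minimal sketch of that kind of spaCy preprocessing; the function name and model choice are just for illustration, and the exact code in the repo may differ:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    """Replace named entities with their labels, then drop stopwords and punctuation."""
    doc = nlp(text)
    tokens = []
    for token in doc:
        if token.ent_type_:
            # Emit the entity label once, at the entity's first token.
            if token.ent_iob_ == "B":
                tokens.append(token.ent_type_)
        elif not (token.is_stop or token.is_punct):
            tokens.append(token.text.lower())
    return tokens

print(preprocess("Barack Obama visited Paris on Tuesday."))
# -> ['PERSON', 'visited', 'GPE', 'DATE']
```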
The entity types you ask about are not in the pre-trained vectors, so a new random vector is created for each of them in the create_embedding_matrix function. As for the data, it's the usual story - the more, the better. I tried it on SQuAD for question generation and the results were rather average, but with more data I believe it could be improved.
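In case it helps, here's a hedged sketch of what such a create_embedding_matrix could look like; the signature, the pretrained lookup dict, and the random-init scale are my assumptions for illustration, not the repo's actual code:

```python
import numpy as np

def create_embedding_matrix(word_index, pretrained, dim=300, scale=0.1):
    """Build an embedding matrix from pre-trained vectors.

    word_index: dict mapping token -> row index (1-based; row 0 is padding)
    pretrained: dict mapping token -> np.ndarray of shape (dim,)
    Tokens missing from the pretrained vectors (e.g. entity labels like
    PERSON, LOC, DATE) get a new random vector that training can adjust.
    """
    rng = np.random.default_rng(42)
    matrix = np.zeros((len(word_index) + 1, dim), dtype=np.float32)
    for word, idx in word_index.items():
        vector = pretrained.get(word)
        if vector is None:
            # Entity labels aren't in Numberbatch, so initialize randomly.
            vector = rng.uniform(-scale, scale, dim).astype(np.float32)
        matrix[idx] = vector
    return matrix
```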