Testing
In your code there is no testing yet, but I see the testing data already exists. How do I implement it? Is it the same as validation? Can you help me with the steps I need to do?
> is it the same as validation
Yes.
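In this repo, testing would look like the same loop as validation, just run once on the test split and with no optimizer step. A rough sketch (the names `sess`, `loss_op`, `text_ph`, `summary_ph`, `test_texts`, and `test_summaries` are placeholders, not the exact names in the notebook):

```python
# Hypothetical sketch: evaluate on the test split the same way as validation.
# Only the forward pass runs here; no train op, so no weights are updated.
batch_size = 32
test_losses = []
for i in range(0, len(test_texts), batch_size):
    feed = {text_ph: test_texts[i:i + batch_size],
            summary_ph: test_summaries[i:i + batch_size]}
    test_losses.append(sess.run(loss_op, feed_dict=feed))
print("test loss:", sum(test_losses) / len(test_losses))
```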
I'm trying to understand the code. In the summary results of the training you did, why are the results only 3 words long, while the maximum summary length is 31? I tried using longer data and only 3 words were produced. Are there any parameters that must be changed other than the maximum summary length? Please answer, thanks in advance.
I don't think you can directly control how many words are generated. The maximum length is just that: an upper bound. In practice, the model is trained to predict a special token called the end-of-sequence (eos) token after the summary. The eos token marks the end of the generation. During inference I ignore every token after eos. If you are getting only 3 words, it means the model is generating eos after those 3 words.
The reason it is mostly around 3 words could be partly that a lot of the training data has similar 3-4 word summaries. Otherwise, it may be a model issue, a hyperparameter issue, or simply a lack of training. If the data mostly has short summaries and you generally want longer ones, it's probably best to try a different dataset. There are also some papers on length-controlled generation if you want more control. But in general you can't do much here.
You can change the code where I filter out tokens after the eos token to print all the generated tokens, if you want to see the whole generated sequence. You will find that code around the section where I print the generated text.
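Something like this (a minimal sketch; `generated_ids`, `eos_id`, and `id2word` stand in for whatever the notebook actually calls them):

```python
# Hypothetical names: `generated_ids` is the raw list of output token ids,
# `eos_id` is the id of the eos token, `id2word` maps ids back to words.

# What the notebook does now (roughly): cut everything after the first eos.
if eos_id in generated_ids:
    trimmed = generated_ids[:generated_ids.index(eos_id)]
else:
    trimmed = generated_ids

# To inspect the full generation, skip the trimming and print everything:
print(" ".join(id2word[i] for i in generated_ids))
```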
Umm, okay. I understand that problem now; maybe because the data I use is not so good, the accuracy produced is not as good as the accuracy you produced. This text summarization model can be used on datasets in other languages, right? Maybe I just need to replace the pre-trained GloVe embeddings?
Yes. You can also try other models from GitHub.
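Swapping the embeddings mostly means rebuilding the embedding matrix from a different vector file; roughly like this (the file name, dimension, and `vocab` are placeholders for whatever you use):

```python
import numpy as np

# Hypothetical example: load word vectors in the usual GloVe/fastText text
# format ("word v1 v2 ... vd" per line) for another language.
word2vec = {}
with open("other_language_vectors.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        if len(parts) < 10:  # skip a possible fastText header line
            continue
        word2vec[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

# Build the embedding matrix in vocabulary order; random-init OOV words.
embd_dim = 300
embedding = np.stack([
    word2vec.get(w, np.random.uniform(-0.1, 0.1, embd_dim).astype(np.float32))
    for w in vocab  # `vocab` is whatever word list the notebook builds
])
```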
```python
Vp = tf.get_variable("Vp", shape=[128, 1], dtype=tf.float32, trainable=True,
                     initializer=tf.glorot_uniform_initializer())
```

Okay, I'm still curious about the number 128 in the code above. What is that size for?
It's the number of neurons in the layer used to predict the local attention window position. See https://arxiv.org/pdf/1508.04025.pdf (eqn. 9): if Wp transforms some vector of dimension d to 128, then Vp transforms the 128 dimensions to 1. It's a hyperparameter.
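Concretely, eqn. 9 of that paper computes the window center as p_t = S * sigmoid(v_p^T tanh(W_p h_t)), where S is the source length and h_t is the current decoder state. A minimal TF1-style sketch of that computation (d and S here are placeholder values, not the ones in the notebook):

```python
import tensorflow as tf

d = 256   # decoder hidden size (placeholder)
S = 100   # source sequence length (placeholder)

Wp = tf.get_variable("Wp", shape=[d, 128], dtype=tf.float32,
                     initializer=tf.glorot_uniform_initializer())
Vp = tf.get_variable("Vp", shape=[128, 1], dtype=tf.float32,
                     initializer=tf.glorot_uniform_initializer())

h_t = tf.placeholder(tf.float32, [None, d])  # current decoder state

# Eqn. 9: p_t = S * sigmoid(v_p^T tanh(W_p h_t)) -> shape [batch, 1]
p_t = S * tf.sigmoid(tf.matmul(tf.tanh(tf.matmul(h_t, Wp)), Vp))
```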
Why use the value 128? And maybe this is my last question: I don't quite understand the value 5 in gradient clipping. Can you explain a little? This is the code:

```python
capped_gvs = [(tf.clip_by_norm(grad, 5), var) for grad, var in gvs]
```

Thanks for all the answers!
I don't remember; I probably chose 128 randomly. Ideally, we are supposed to hyperparameter-tune it. Same for 5: I have seen 1 or 5 used as reasonable values for gradient clipping, and I just chose 5.
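For context, the clipping step sits between computing and applying the gradients; 5 is just the L2-norm ceiling applied to each gradient tensor, so gradients with a smaller norm pass through unchanged. A sketch of the surrounding pattern (the optimizer and learning rate are placeholders):

```python
import tensorflow as tf

optimizer = tf.train.AdamOptimizer(learning_rate=0.001)  # placeholder choice
gvs = optimizer.compute_gradients(loss)  # list of (gradient, variable) pairs

# Rescale any gradient whose L2 norm exceeds 5 down to norm 5.
capped_gvs = [(tf.clip_by_norm(grad, 5), var) for grad, var in gvs]
train_op = optimizer.apply_gradients(capped_gvs)
```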