Two Questions about the Datasets Used in Your Experiments

Open YasumiKurashima opened this issue 1 year ago • 0 comments

Hello, I am a PhD student in AI at a Japanese university. I am trying to replicate the results of your paper, "Is Everything in Order? A Simple Way to Order Sentences." I have two questions about the datasets used in your experiments.

Why do some stories in your SIND dataset not have 5 sentences? Every story in the official SIND dataset has 5 sentences, and Table 1 of your ReBART paper also states that the SIND dataset consists of five-sentence stories. However, some stories in your SIND data appear to have 4 or 6 sentences.
Why do the numbers of sentences in your AAN and NeurIPS datasets differ from those in the official datasets? For example, the original AAN dataset has 8,569 training examples, and Table 1 of your ReBART paper states the same number. However, your data has 11,119. This discrepancy appears in all of the training, evaluation, and test splits of the AAN and NeurIPS datasets. In the SIND and ROCStories datasets, on the other hand, the numbers match those of the official datasets.
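
A minimal sketch of the kind of check that would surface both discrepancies, assuming each example has already been loaded as a list of sentences. The function name `summarize` and the toy data are hypothetical and only illustrate the idea; they are not the repository's loader or its actual data.

```python
from collections import Counter


def summarize(examples):
    """Return (number of examples, histogram of sentences per example).

    `examples` is assumed to be a list in which each item is a list of
    sentence strings. How the files are parsed into this form depends on
    the data format and is left out here.
    """
    lengths = Counter(len(sentences) for sentences in examples)
    return len(examples), dict(lengths)


# Toy stand-in data (hypothetical), not taken from any real dataset.
toy = [
    ["s1", "s2", "s3", "s4", "s5"],
    ["s1", "s2", "s3", "s4"],
    ["s1", "s2", "s3", "s4", "s5", "s6"],
]

n_examples, histogram = summarize(toy)
print("examples:", n_examples)
print("sentences per example:", histogram)
```

Running this on each split gives the example count to compare against Table 1, and any histogram key other than 5 on the SIND data marks a story that is not five sentences long.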

I have posted the same questions on the ReBART GitHub page. I would appreciate it if you could give me an answer.

NOTE: "Your dataset" refers to the data stored at the following URL. https://drive.google.com/file/d/17r9D_l-jdhHhpLsa86FGuWgeLgeJkQ19/view?usp=sharing (This URL is linked on the ReBART GitHub page as "The exact data used in our experiments can be found here.")

"Official dataset" refers to the data linked from the following page. http://visionandlanguage.net/VIST/dataset.html (This URL is linked on the ReBART GitHub page as "Please find the links for the various datasets: arXiv, Wiki Movie Plots, SIND, NSF, ROCStories, NeurIPS, AAN.")

Feb 14 '24 05:02 YasumiKurashima