Confusion about the dataset

Open bellytina opened this issue 3 years ago • 1 comments

Hello, thanks for your dataset and code! Table 1 shows that #Docs is much smaller than #Events, which indicates that a document can contain multiple events. Is there a clear boundary between these events? That is, do different events in the same document share arguments? In addition, I found that the doc_key of each instance in the jsonlines files is unique. How did you count the number of documents (3194, 399, and 400)? Any help would be great.

bellytina, Jul 31 '22 06:07

As you've noticed, given the stats, there are documents with multiple events. In those cases, there's a good chance an argument or two will be shared across events. However, determining the amount of argument overlap would require combining the examples back into full documents, which is a bit tricky.

The number of documents (top row in the table) is the number of unique source document URLs. That is, it's the number of documents that were then processed to create the individual examples.

We added a script to generate the numbers in the table in https://github.com/pitrack/arglinking/pull/11.
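
For readers who just want the idea, here is a minimal sketch of that counting approach. It is not the script from the pull request. It assumes each line of a jsonlines split is a JSON object with a field holding the source document URL; the field name `source_url`, the function name `count_docs_and_examples`, and the sample records are assumptions for illustration, so check them against the actual files.

```python
import json
from collections import Counter


def count_docs_and_examples(lines, url_field="source_url"):
    """Return (unique source URLs, total examples) for one split.

    `url_field` is an assumed field name; adjust it to match the data.
    """
    examples_per_url = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        example = json.loads(line)
        examples_per_url[example[url_field]] += 1
    return len(examples_per_url), sum(examples_per_url.values())


# Made-up sample: three examples with unique doc_keys from two source documents.
sample_lines = [
    '{"doc_key": "ex_0001", "source_url": "http://example.com/a"}',
    '{"doc_key": "ex_0002", "source_url": "http://example.com/a"}',
    '{"doc_key": "ex_0003", "source_url": "http://example.com/b"}',
]

n_docs, n_examples = count_docs_and_examples(sample_lines)
print(n_docs, n_examples)
```

To run it on a real split, pass an open file handle instead of `sample_lines`. Every doc_key can be unique while the document count stays smaller, because several examples can point to the same URL.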

sethebner, Aug 06 '22 18:08