Where do test dirs props, null, and ne come from?

Open Adamits opened this issue 4 years ago • 6 comments

Hi!

I noticed in make-wsj-test.sh and make-brown-test.sh that we try to zcat a props, null, and ne file from test.wsj. However, in the extract_test_from_ptb.sh and extract_test_from_brown.sh scripts, none of these dirs/files are generated. Where are these supposed to come from?

Thanks!

Adamits avatar Jul 09 '21 22:07 Adamits

Those dirs should be under the train directory in the conll05 data.

strubell avatar Jul 10 '21 15:07 strubell

Thanks for the response!

I am probably missing something, but I thought the train directory only had data for WSJ sections 02-21, whereas the test set is section 23. To be sure, I am referring to e.g. this line: https://github.com/strubell/preprocess-conll05/blob/master/bin/basic/make-wsj-test.sh#L13, whereas https://github.com/strubell/preprocess-conll05/blob/master/bin/basic/extract_test_from_ptb.sh only generates words/syntax for section 23.

Adamits avatar Jul 10 '21 17:07 Adamits

Hello, I have the same problem as you. Have you found a solution yet? Thanks!

XueBingo avatar Jul 12 '21 12:07 XueBingo

It sounds like you're describing the ptb training data, not the conll data - the directory I'm referring to is the $CONLL05 dir as defined in get_data.sh.

strubell avatar Jul 20 '21 14:07 strubell

Yeah, I guess so. I am asking about the test data in particular, which appears to be section 23 of PTB.

Running ./bin/basic/extract_test_from_ptb.sh only extracts words and synt from section 23.

However, bin/basic/make-wsj-test.sh expects props, null, and ne as well. For the train/dev data, I think these dirs come from the conll05 release downloaded in get_data.sh, but section 23 (the test data) does not seem to be included there.

But for the test data, where do these dirs come from? These lines in bin/basic/make-wsj-test.sh:

    zcat < $CONLL05/$FILE/words/$FILE.words.gz > /tmp/$$.words
    zcat < $CONLL05/$FILE/props/$FILE.props.gz > /tmp/$$.props
    zcat < $CONLL05/$FILE/synt/$FILE.$s.synt.gz > /tmp/$$.synt

    # no senses, set to null
    zcat < $CONLL05/$FILE/null/$FILE.null.gz > /tmp/$$.senses
    zcat < $CONLL05/$FILE/ne/$FILE.ne.gz > /tmp/$$.ne

cannot find the props, null (senses), or ne files, and the script then writes an empty archive.
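
One way to confirm exactly which inputs are absent before running the script is a small check like the one below. It is only a sketch: it assumes the `$CONLL05/$FILE/<kind>/$FILE.<kind>.gz` layout from the snippet above, and `missing_inputs` is a hypothetical helper name, not something in the repo. The synt file is skipped because its name also depends on `$s`.

```shell
#!/bin/sh
# Hypothetical helper (not part of preprocess-conll05): print every expected
# input file that does not exist for a given split.
# Assumed layout: <conll05_dir>/<file>/<kind>/<file>.<kind>.gz
missing_inputs() {
    conll05_dir=$1
    file=$2
    for kind in words props null ne; do
        path="$conll05_dir/$file/$kind/$file.$kind.gz"
        # Print the path only when the file is absent.
        [ -f "$path" ] || printf '%s\n' "$path"
    done
}

# Example usage (prints one line per missing file, nothing if all exist):
# missing_inputs "$CONLL05" test.wsj
```

If this prints the props, null, and ne paths for test.wsj, the empty archive is explained by missing inputs rather than by a problem in the later steps.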

Adamits avatar Jul 26 '21 19:07 Adamits

Oh, that's so strange! I guess the senses/ne lines (and corresponding entries in the paste) should be removed, but I'm surprised this non-working version is in the repo. Unfortunately I no longer have access to the old server where I originally developed/ran these scripts, so I can't go back and see if there were uncommitted changes, etc.
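
As a rough illustration of that suggestion, a reduced version that reads only words, synt, and props might look like the sketch below. The real paste command is not shown in this thread, so the column order, the tab-separated output, and the function name `combine_without_senses_ne` are all assumptions made for illustration. It also still needs a props file for the test split, which this change alone does not provide.

```shell
#!/bin/sh
# Hypothetical reduced assembly: words, synt and props only, no senses or ne.
# Column order and plain `paste` (tab-separated) are assumptions; the actual
# paste line in make-wsj-test.sh may differ.
combine_without_senses_ne() {
    conll05_dir=$1
    file=$2
    s=$3    # synt variant, i.e. the $s used in the original snippet
    tmp=$(mktemp -d) || return 1
    # Stop at the first input that cannot be read instead of pasting empty files.
    gzip -dc < "$conll05_dir/$file/words/$file.words.gz" > "$tmp/words" &&
    gzip -dc < "$conll05_dir/$file/synt/$file.$s.synt.gz" > "$tmp/synt" &&
    gzip -dc < "$conll05_dir/$file/props/$file.props.gz" > "$tmp/props" &&
    paste "$tmp/words" "$tmp/synt" "$tmp/props"
    status=$?
    rm -rf "$tmp"
    return $status
}
```

Returning a non-zero status when an input is missing is a deliberate choice here, so that a missing file surfaces as an error rather than as an empty archive.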

strubell avatar Aug 06 '21 18:08 strubell