
How to train model with RAMS 1.0?

Open fairy-of-9 opened this issue 5 years ago • 8 comments

Hi!

I want to train a new model with RAMS 1.0, so I edited the rams.yaml file:

# Parameters of the Argument Linking model

# Device-sensitive parameters
arglink_data_dir: &arglink_data_dir ../../data/RAMS_1.0/data/
glove: &glove </path/to/glove_embeddings>

# Frequently modified parameters
serialization_dir: &serialization_dir ../../data/serialization
train_data: &train_data "train.jsonlines"
dev_data: &dev_data "dev.jsonlines"
test_data: &test_data "test.jsonlines"
dev_gold_data_path: &dev_gold_data_path ""
test_gold_data_path: &test_gold_data_path ""
finetune: &finetune False
pretrain_dir: &pretrain_dir ""

lm_file: &lm_file <path/to/train_dev_contextualizedembeddings>
test_lm_file: &test_lm_file <path/to/dev_test_contextualizedembeddings>

But the following error occurs:

[2020-09-10 19:21:16,336 INFO] Init random seeds => tseed: 2 numpy_seed: 2 torch_seed: 2
[2020-09-10 19:21:19,036 INFO] Building train datasets ...
[2020-09-10 19:21:19,036 INFO] Reading RAMS arglinking instances from dataset files at: ../../data/RAMS_1.0/data/train.jsonlines
0it [00:00, ?it/s]Error in sys.excepthook:
Traceback (most recent call last):
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 941, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
ModuleNotFoundError: No module named 'IPython'

Original exception was:
Traceback (most recent call last):
  File "/home/fairy_of_9/.pycharm_helpers/pydev/pydevd.py", line 1741, in <module>
    main()
  File "/home/fairy_of_9/.pycharm_helpers/pydev/pydevd.py", line 1735, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/fairy_of_9/.pycharm_helpers/pydev/pydevd.py", line 1135, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/fairy_of_9/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/fairy_of_9/arglinking/miso/commands/train.py", line 182, in <module>
    train_model(params)
  File "/home/fairy_of_9/arglinking/miso/commands/train.py", line 96, in train_model
    dataset = dataset_from_params(data_params)
  File "/home/fairy_of_9/arglinking/miso/data/dataset_builder.py", line 66, in dataset_from_params
    train_data = load_dataset(train_data, data_type, **params)
  File "/home/fairy_of_9/arglinking/miso/data/dataset_builder.py", line 57, in load_dataset
    return load_dataset_reader(dataset_type, *args, **kwargs).read(path)
  File "/home/fairy_of_9/arglinking/miso/data/dataset_readers/dataset_reader.py", line 73, in read
    instances = [instance for instance in Tqdm.tqdm(instances)]
  File "/home/fairy_of_9/arglinking/miso/data/dataset_readers/dataset_reader.py", line 73, in <listcomp>
    instances = [instance for instance in Tqdm.tqdm(instances)]
  File "/home/fairy_of_9/anaconda3/envs/bert/lib/python3.6/site-packages/tqdm/_tqdm.py", line 1032, in __iter__
    for obj in iterable:
  File "/home/fairy_of_9/arglinking/miso/data/dataset_readers/rams.py", line 77, in _read_from_json
    input_ = json.load(f)
  File "/home/fairy_of_9/anaconda3/envs/bert/lib/python3.6/json/__init__.py", line 299, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/home/fairy_of_9/anaconda3/envs/bert/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/home/fairy_of_9/anaconda3/envs/bert/lib/python3.6/json/decoder.py", line 342, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 1705)

I think this error occurs because RAMS is in jsonlines format, not json.
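For reference, a minimal illustration of what I mean (json.load expects a single json value, while a .jsonlines file has one json object per line; the file name here is just the RAMS train split):

import json

# Reading a .jsonlines file line by line works:
with open("train.jsonlines") as f:
    docs = [json.loads(line) for line in f]

# json.load(f) on the same file raises JSONDecodeError("Extra data")
# as soon as it reaches the second json object.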

Also, I can't understand the following config parameters (I'm sorry):

dev_gold_data_path: &dev_gold_data_path ""
test_gold_data_path: &test_gold_data_path ""
pretrain_dir: &pretrain_dir ""
lm_file: &lm_file <path/to/train_dev_contextualizedembeddings>
test_lm_file: &test_lm_file <path/to/dev_test_contextualizedembeddings>

Can you tell me more specifically how to run this? Thanks!

fairy-of-9 avatar Sep 10 '20 10:09 fairy-of-9

Sorry, we didn't document or organize our data consistently (and each of our datareaders did things a bit differently). For the RAMS datareader, the paths should be the files themselves, e.g. train/*.json or dev/*.json (or even just train or dev should work if the only files in the directory are the json files). This means you will need to split the jsonlines file into individual files in a directory of their own. Each line in the jsonlines file is valid json, so splitting each line into its own file should work.
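For example, something like this (a minimal sketch, not the script we used; paths are placeholders matching the rams.yaml above):

import json
import os

src = "../../data/RAMS_1.0/data/train.jsonlines"
out_dir = "../../data/RAMS_1.0/data/train"
os.makedirs(out_dir, exist_ok=True)

with open(src) as f:
    for i, line in enumerate(f):
        line = line.strip()
        if not line:
            continue
        # each line of the jsonlines file is already a complete json document
        with open(os.path.join(out_dir, "{}.json".format(i)), "w") as out:
            out.write(line)

Run it once per split (train/dev/test) so each directory contains only the json files for that split.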

You can ignore the dev/test gold data paths. I believe those are only used for evaluating with the official CoNLL scorer, which is only relevant for the CoNLL-2012 SRL experiments/dataset. pretrain_dir is only used to restore a pretrained model checkpoint for finetuning purposes (see L551 in training/trainer.py).

For lm_file and test_lm_file, these are cached (BERT) embeddings. You can find more instructions about how to cache them in the header of scripts/cache_bert_hf.py. The lm_file needs to contain the embeddings for the documents in train and dev, while the test_lm_file needs to contain the embeddings for the documents in dev and test.
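If it helps as a sanity check, here is a minimal sketch for peeking at one of the caches, assuming it is an ordinary HDF5 file (the exact key layout depends on how scripts/cache_bert_hf.py writes it, and the file name is a placeholder):

import h5py

with h5py.File("train_dev_bert_embeddings.h5", "r") as f:
    keys = list(f.keys())
    print(len(keys), "top-level entries")
    print("first entry:", keys[0], f[keys[0]])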

Please let us know if you have more questions!

pitrack avatar Sep 10 '20 15:09 pitrack

Thanks for your comment.

I want to step through the code in a debugger to learn how the deep learning model is structured, so I want to keep the setup as simple as possible. What should I do? Are lm_file, test_lm_file, and glove needed even in this situation?

The glove embeddings I have are the ones provided by Stanford NLP, and the lm_file I have is from BERT as released by Google.

fairy-of-9 avatar Sep 14 '20 04:09 fairy-of-9

You should just need lm_file and test_lm_file. glove should not be needed.

pitrack avatar Sep 17 '20 14:09 pitrack

Wait, sorry, you do need all three. The ones you have should be fine (300-dim GloVe embeddings).

pitrack avatar Sep 17 '20 14:09 pitrack

I've been trying to follow the instructions in this thread dutifully, but I'm still struggling to get a model trained on RAMS 1.0. In particular, the scripts/cache_bert_hf.py script, which generates the h5 files for the lm_file and test_lm_file in rams.yaml, is not working for me. The issue is that the RAMS 1.0 data does not have a "subtokens_map" property for each json document.

From what I gather, in order to get a subtokens_map property added to the RAMS dataset, I need to run minimize.py from the repo linked at the top of scripts/cache_bert_hf.py. When I try to do this, I get the following:

(venv) lillianthistlethwaite@PRNT-MBP-0020 coref-master % python minimize.py cased_config_vocab/vocab.txt ../arglinking-master/RAMS_1.0/data ../arglinking-master/RAMS_1.0/data false
False
Minimizing ../arglinking-master/RAMS_1.0/data/dev.english.v4_gold_conll
Traceback (most recent call last):
  File "minimize.py", line 237, in <module>
    minimize_language("english", labels, stats, vocab_file, seg_len, input_dir, output_dir, do_lower_case)
  File "minimize.py", line 222, in minimize_language
    minimize_partition("dev", language, "v4_gold_conll", labels, stats, tokenizer, seg_len, input_dir, output_dir)
  File "minimize.py", line 198, in minimize_partition
    with open(input_path, "r") as input_file:
FileNotFoundError: [Errno 2] No such file or directory: '../arglinking-master/RAMS_1.0/data/dev.english.v4_gold_conll'

This indicates that the ONLY data minimize.py will accept is the CoNLL-2012 data output by the setup_training.sh script from the coref-master repository. This has nothing to do with RAMS 1.0, so I feel I've hit a wall. Any help/insight would be much appreciated!

Update: I've been looking through minimize.py, and it doesn't seem to work with the RAMS 1.0 .jsonlines files at all. The get_document() function, for example, is not suited to the data shared by RAMS 1.0. I would ideally love to have the exact script the authors ran on the RAMS 1.0 data when training their model.

lashmore avatar Nov 29 '21 16:11 lashmore

Sorry, the in-code comment brought up minimize.py as an example of what the expected output format should look like, but that script won't run on RAMS, and cache_bert_hf.py won't run directly on RAMS either. There was another script we used, based around the BERT tokenizer, that we didn't include in the repo.

@lashmore, can you see if these files + instructions work for you? https://github.com/pitrack/arglinking/tree/preprocessing/scripts/preprocess_rams

pitrack avatar Nov 29 '21 20:11 pitrack

Excellent - I'll check it out and let you know how it goes!

lashmore avatar Nov 29 '21 20:11 lashmore

Nice!! I got cache_bert_hf.py to work with the aid of minimize_json.py - thank you! The training partition hdf5 file is ~12 GB, and the dev and test partitions are ~1.5 GB each.

I've now started training, and with a few tweaks (I'll send those tweaks in a separate MR or suggested changes request so others can benefit), it's running! Thanks so much for your help on this!

lashmore avatar Nov 30 '21 19:11 lashmore