Invalid argument error during training
Hello Ms. Strubell :-) I am trying to train and evaluate your LISA model on the CoNLL05 dataset. I followed the recipe at https://github.com/strubell/preprocess-conll05 for preprocessing the CoNLL-2005 dataset and adapted the data paths in the configuration file accordingly. When I run training, the initialization steps of the TensorFlow model seem to work normally, but right after "filling up the shuffle buffer" I get the following error. Do you have any idea what causes it? Also, do you have any pretrained models for the CoNLL05 dataset?
2018-10-18 23:39:20.446629: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:135] Shuffle buffer filled.
Traceback (most recent call last):
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1278, in _do_call
    return fn(*args)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1263, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/Users/xiaotang/Documents/soft/miniconda3/envs/deep_nlp/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1350, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[862] = 5199 is not in [0, 1968)
  [[Node: LISA/Nadam/update_LISA/word_type_embeddings/embeddings/GatherV2_3 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@LISA/Nadam/update_LISA/word_type_embeddings/embeddings/ScatterAdd"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](LISA/Nadam/update_LISA/word_type_embeddings/embeddings/add_1, LISA/Nadam/update_LISA/word_type_embeddings/embeddings/Unique, LISA/Nadam/update_LISA/word_type_embeddings/embeddings/GatherV2_3/axis)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "src/train.py", line 143, in
Caused by op 'LISA/Nadam/update_LISA/word_type_embeddings/embeddings/GatherV2_3', defined at:
File "src/train.py", line 143, in
InvalidArgumentError (see above for traceback): indices[862] = 5199 is not in [0, 1968) [[Node: LISA/Nadam/update_LISA/word_type_embeddings/embeddings/GatherV2_3 = GatherV2[Taxis=DT_INT32, Tindices=DT_INT32, Tparams=DT_FLOAT, _class=["loc:@LISA/Nadam/update_LISA/word_type_embeddings/embeddings/ScatterAdd"], _device="/job:localhost/replica:0/task:0/device:CPU:0"](LISA/Nadam/update_LISA/word_type_embeddings/embeddings/add_1, LISA/Nadam/update_LISA/word_type_embeddings/embeddings/Unique, LISA/Nadam/update_LISA/word_type_embeddings/embeddings/GatherV2_3/axis)]]
This looks like an optimizer bug. I just tested the master branch w/ tensorflow v1.9 and 1.10 on gpu and can't replicate. What version of tensorflow are you using, and are you running on gpu or cpu?
Thanks for the help, Ms. Strubell :-) My environment is Python 3.6.6 and TensorFlow 1.10.1, and the error occurred when running on CPU. I also suspect that the format of my processed CoNLL05 data is incorrect, since it looks a bit different from the examples in your repository (I set the path to the WSJ test set to /treebank2/combined/wsj, and I did not find a valid path to the Brown test set...)
It could be a cpu-specific issue, or it could be the data formatting (or both). Is it possible for you to try running on a gpu?
I don't think this specific error is caused by the data format, but that could also be a separate issue. Can you paste a few example lines of your pre-processed data? It should look exactly like the example in the data preprocessing repo here: https://github.com/strubell/preprocess-conll05#further-pre-processing-eg-for-lisa
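For anyone else reading the traceback: the message `indices[862] = 5199 is not in [0, 1968)` says that a gather on a tensor with 1968 rows was asked for row 5199. Whether the bad index comes from the input data or from the optimizer update is not settled in this thread. The sketch below is only a generic sanity check, not part of LISA: `find_out_of_range_ids` is a hypothetical helper and the id sequences are made-up toy data. It shows one way to confirm that integer ids fed to an embedding lookup stay inside the table.

```python
from typing import Iterable, List, Tuple


def find_out_of_range_ids(
    sequences: Iterable[Iterable[int]], vocab_size: int
) -> List[Tuple[int, int, int]]:
    """Return (sequence_index, position, id) for every id outside [0, vocab_size).

    Hypothetical helper for illustration; it is not part of the LISA code base.
    """
    bad = []
    for seq_idx, seq in enumerate(sequences):
        for pos, token_id in enumerate(seq):
            if not 0 <= token_id < vocab_size:
                bad.append((seq_idx, pos, token_id))
    return bad


# Toy data (assumption): 1968 is the table size reported in the error,
# and 5199 is the offending index from the same message.
sequences = [[0, 17, 1967], [3, 5199, 42]]
print(find_out_of_range_ids(sequences, vocab_size=1968))
```

If a check like this comes back empty on the real inputs, the data is less likely to be the culprit and the TensorFlow version or device becomes the more plausible suspect.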
My training dataset looks correct to me:
conll05 141 0 They PRP PRP 2 nsubj _ - - - - O B-A0
conll05 141 1 attached VBD VBD 0 root _ 01 attach - - O B-V
conll05 141 2 a DT DT 5 det _ - - - - O B-A1
conll05 141 3 second JJ JJ 5 amod _ - - - - O I-A1
conll05 141 4 gene NN NN 2 dobj _ - - - - O I-A1
conll05 141 5 , , , 5 punct _ - - - - O I-A1
conll05 141 6 for IN IN 5 prep _ - - - - O I-A1
conll05 141 7 herbicide NN NN 9 nn _ - - - - O I-A1
conll05 141 8 resistance NN NN 7 pobj _ - - - - O I-A1
conll05 141 9 , , , 5 punct _ - - - - O I-A1
conll05 141 10 to TO TO 2 prep _ - - - - O B-A1
conll05 141 11 the DT DT 14 det _ - - - - O I-A1
conll05 141 12 pollen-inhibiting JJ JJ 14 amod _ - - - - O I-A1
conll05 141 13 gene NN NN 11 pobj _ - - - - O I-A1
conll05 141 14 . . . 2 punct _ - - - - O O

conll05 142 0 Both DT DT 2 det _ - - - - O B-A1 O O O O
conll05 142 1 genes NNS NNS 5 nsubjpass _ - - - - O I-A1 O O O O
conll05 142 2 are VBP VBP 5 auxpass _ - - - - O O O O O O
conll05 142 3 then RB RB 5 advmod _ - - - - O B-AM-TMP O O O O
conll05 142 4 inserted VBN VBN 0 root _ 01 insert - - O B-V O O O O
conll05 142 5 into IN IN 5 prep _ - - - - O B-A2 O O O O
conll05 142 6 a DT DT 10 det _ - - - - O I-A2 B-A1 B-A1 B-A1 B-A0
conll05 142 7 few JJ JJ 10 amod _ - - - - O I-A2 I-A1 I-A1 I-A1 I-A0
conll05 142 8 greenhouse NN NN 10 nn _ - - - - O I-A2 I-A1 I-A1 I-A1 I-A0
conll05 142 9 plants NNS NNS 6 pobj _ - - - - O I-A2 I-A1 I-A1 I-A1 I-A0
conll05 142 10 , , , 10 punct _ - - - - O I-A2 O O O O
conll05 142 11 which WDT WDT 15 nsubjpass _ - - - - O I-A2 B-R-A1 B-C-A1 B-R-A1 B-R-A0
conll05 142 12 are VBP VBP 15 auxpass _ - - - - O I-A2 O O O O
conll05 142 13 then RB RB 15 advmod _ - - - - O I-A2 B-AM-TMP B-AM-TMP O O
conll05 142 14 pollinated VBN VBN 10 rcmod _ 01 pollinate - - O I-A2 B-V O O O
conll05 142 15 and CC CC 15 cc _ - - - - O I-A2 O O O O
conll05 142 16 allowed VBN VBN 15 conj _ 01 allow - - O I-A2 O B-V O O
conll05 142 17 to TO TO 19 aux _ - - - - O I-A2 O B-C-A1 O O
conll05 142 18 mature VB VB 17 xcomp _ 01 mature - - O I-A2 O I-C-A1 B-V O
conll05 142 19 and CC CC 19 cc _ - - - - O I-A2 O I-C-A1 O O
conll05 142 20 produce VB VB 19 conj _ 01 produce - - O I-A2 O I-C-A1 O B-V
conll05 142 21 seed NN NN 21 dobj _ - - - - O I-A2 O I-C-A1 O B-A1
conll05 142 22 . . . 5 punct _ - - - - O O O O O O
But my WSJ test set only contains these columns:
conll05 7 0 DT - -
conll05 7 1 NN - -
conll05 7 2 VBZ - -
conll05 7 3 RB - -
conll05 7 4 VBN - -
conll05 7 5 . - -

conll05 8 0 `` - -
conll05 8 1 DT - -
conll05 8 2 NN - -
conll05 8 3 NN - -
conll05 8 4 VBD - -
conll05 8 5 JJ - -
conll05 8 6 . - -
Is this data format correct? I will also try running the experiments on a GPU later :-) Thanks for the advice!
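A quick way to see the mismatch described above is to count whitespace-separated fields per row in each file. This is a minimal sketch: `column_counts` is a hypothetical helper, not something from the LISA or preprocess-conll05 repositories, and the rows are copied from the samples in this thread. Note that the number of columns in a correctly processed file can vary from sentence to sentence, since the trailing SRL columns depend on how many predicates a sentence has, so the useful signal is rows with far fewer fields than the training data.

```python
from collections import Counter
from typing import Iterable


def column_counts(lines: Iterable[str]) -> Counter:
    """Count how many non-blank rows have each number of whitespace-separated fields.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip the blank lines that separate sentences
            counts[len(fields)] += 1
    return counts


# Rows copied from the training and test samples quoted in this thread.
train_rows = [
    "conll05 141 0 They PRP PRP 2 nsubj _ - - - - O B-A0",
    "conll05 141 1 attached VBD VBD 0 root _ 01 attach - - O B-V",
]
test_rows = [
    "conll05 7 0 DT - -",
    "conll05 7 1 NN - -",
]

print("train:", dict(column_counts(train_rows)))
print("test: ", dict(column_counts(test_rows)))
```

On real data the same function can be fed an open file handle, since iterating over a text file yields its lines.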
@acDante Did you fix this problem? I got exactly the same error as you.
You need to add some dummy parse/srl labels to the test set, as right now the code expects to evaluate with respect to gold labels.
I think my test data is in the same format as the training and dev data (with parse and SRL info).
Did you try w/ tf 1.9 or 1.10 on gpu?
Yeah. It gives me a segmentation fault, with no error message at all.
Is there any output before the segfault, and are you sure that your tensorflow installation otherwise works?
2019-01-17 20:24:31.656029: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:135] Shuffle buffer filled.
bin/train.sh: line 20: 18297 Segmentation fault (core dumped) python3 src/train.py --train_files $train_files --dev_files $dev_files --transition_stats $transition_stats --data_config $data_config --model_configs $model_configs --task_configs $task_configs --layer_configs $layer_configs --attention_configs $attention_configs $params
Hi impavidily, I managed to fix this error by downgrading TensorFlow to 1.9.0. I guess it results from an incompatibility between the TensorFlow version and CUDA. What are your CUDA and cuDNN versions?
I tried both 1.9.0 and 1.10.0 with CUDA 9.0 and cuDNN 7.
I think you are right. There might be some incompatibility here. @acDante @strubell Thank you all.