
Question: out of memory when generating translate_enzh_wmt32k data set

Open Qiaoxl opened this issue 7 years ago • 18 comments

Has nobody else hit a MemoryError? Problem: translate_enzh_wmt32k. When generating data with the toy set (220k lines), it takes at most about 12 GB of memory. But with the whole dataset (about 24M lines), data generation needs a huge amount of memory and I get a MemoryError (even 50 GB is not enough). Has anybody tried this? How can I fix it?

Qiaoxl avatar Jun 06 '18 08:06 Qiaoxl

Hi, I have another question about Chinese-English here. When I wanted to train a Chinese-to-English model, I used "--problem=translate_enzh_wmt32k_rev" / "--problem=translate_enzh_wmt8k_rev" to generate data. However, I got an error telling me I should define the right "problem".

Have you met this problem?

houyu0930 avatar Jun 06 '18 13:06 houyu0930

@houyu0930: For t2t-datagen, don't use _rev in problem names. (This is not related to the topic of this issue, which is about OOM.)
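
A minimal sketch of how the name resolution works, assuming the usual problem-registry API (verify against your installed T2T version):

```python
# Sketch: the "_rev" suffix is resolved by the problem registry at lookup time,
# so data generation only needs the base problem name; the reversed variant
# reuses the same generated files and just flips the translation direction.
from tensor2tensor.utils import registry

base = registry.problem("translate_enzh_wmt32k")      # name to pass to t2t-datagen
rev = registry.problem("translate_enzh_wmt32k_rev")   # name to pass to t2t-trainer

# Both resolve to the same problem class.
print(type(base).__name__, type(rev).__name__)
```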

martinpopel avatar Jun 06 '18 13:06 martinpopel

@martinpopel OK, I had the wrong idea about this. Thank you for your answer, I get it now. (Sorry for posting an unrelated question here; I will be more careful next time.)

As for this topic: I actually generated data for "problem=translate_enzh_wmt32k" yesterday, and no error occurred; everything looks fine to me. Are you having trouble generating the data? I don't know what you mean by "50G". In my case, the total size of the generated data is 33M. Sorry I can't be of more help. :(

houyu0930 avatar Jun 06 '18 14:06 houyu0930

@houyu0930 See translate_enzh.py: "This is far from being the real WMT17 task - only toyset here you need to register to get UN data and CWT data..." The default dataset is only 220k lines; training on it won't give a good result. The whole dataset includes the UN and CWT data, about 24,000k lines in total. Generating with 220k lines takes at most about 12 GB of memory; generating with 24,000k lines runs out of memory even with 50 GB.

Qiaoxl avatar Jun 07 '18 00:06 Qiaoxl

@Qiaoxl: This is strange; t2t-datagen should need only a little memory even for very big data (shuffling uses 100 data shards by default). Can you report your T2T (and TF and Python) versions? Can you retry with the newest version? Can you try generating another small translation dataset?

martinpopel avatar Jun 07 '18 06:06 martinpopel

@martinpopel T2T: 1.6.2, TF: 1.7.0, Python: 3.6.5. I haven't tried the newest version yet, but I have used t2t-datagen to generate the librispeech_clean_small dataset and it went fine with very little memory. Maybe you can find some details in the log file gen_20180605_155220.log

Qiaoxl avatar Jun 07 '18 08:06 Qiaoxl

Tagging with question for now, but if we find that there is indeed something wrong I'll change to bug.

rsepassi avatar Jun 15 '18 19:06 rsepassi

Me too. T2T: 1.6.2, TF: 1.7.0, Python: 3.5.3 output_gen.txt

robotzheng avatar Jun 29 '18 07:06 robotzheng

Me too. T2T: 1.6.6, TF: 1.8.0, Python: 3.6.3

xuekun90 avatar Jul 16 '18 08:07 xuekun90

@rsepassi Hi,

When I use huge Chinese datasets to generate data, it costs a lot of memory, but if I put the vocab files in the data folder first, it is fine. So I think the subword tokenizer needs some improvements; building the vocabulary costs a lot of memory.
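
A rough sketch of what reusing a pre-built vocab file looks like (the vocab filename below is an assumption; check what your T2T version actually generates for this problem):

```python
# Sketch: if the subword vocab file already exists in --data_dir, the problem's
# generator loads it instead of rebuilding it from the raw corpus, which is the
# step that eats memory for large unsegmented Chinese text.
from tensor2tensor.data_generators import text_encoder

vocab_path = "t2t_data/vocab.translate_enzh_wmt32k.32768.subwords"  # assumed name
encoder = text_encoder.SubwordTextEncoder(vocab_path)

print("vocab size:", encoder.vocab_size)
print(encoder.decode(encoder.encode("今天天气很好。")))
```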

hpulfc avatar Aug 30 '18 07:08 hpulfc

I'm facing the same issue with T2T 1.6.3, getting the same result as mentioned in the log by @Qiaoxl. Any help on this, @rsepassi?

sugeeth14 avatar Sep 18 '18 09:09 sugeeth14

Try calling SubwordTextEncoder.build_from_generator directly and pass max_subtoken_length; see the docstring there. We should find a value for Zh problems that keeps memory under control but still produces good results.
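
A minimal sketch of that suggestion (the corpus paths, vocab filename, and the max_subtoken_length value are placeholders, not tested recommendations):

```python
# Sketch: build the subword vocab yourself with a capped max_subtoken_length,
# store it in the data dir, and let datagen pick it up. A small cap bounds the
# candidate subtokens considered per token, which is where memory goes when
# whole unsegmented Chinese sentences are treated as single tokens.
from tensor2tensor.data_generators import text_encoder

def corpus_lines(paths):
  """Stream lines from the raw training files (placeholder paths)."""
  for path in paths:
    with open(path, encoding="utf-8") as f:
      for line in f:
        yield line.strip()

encoder = text_encoder.SubwordTextEncoder.build_from_generator(
    corpus_lines(["train.en", "train.zh"]),
    2**15,                      # target vocab size for the *_wmt32k problem
    max_subtoken_length=4)      # small cap for Zh; trade quality vs. memory
encoder.store_to_file("t2t_data/vocab.translate_enzh_wmt32k.32768.subwords")
```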

rsepassi avatar Sep 18 '18 15:09 rsepassi

@rsepassi Can I use SentencePiece to generate the vocabulary and keep it in the t2t_data folder? In that case, can I simply run t2t-trainer, or are other changes needed? And if I pass a value for max_subtoken_length instead of "None", is there an ideal value for Chinese?

sugeeth14 avatar Sep 20 '18 11:09 sugeeth14

I would also find it helpful if there were a suggested max_subtoken_length value.

I haven't hit a memory issue yet, but t2t-datagen is taking a long time to run. How long did it take for you? @hpulfc @xuekun90 @robotzheng

echan00 avatar Nov 03 '18 14:11 echan00

It seems encode() in data_generators/tokenizer.py doesn't support Chinese: it cannot tokenize a Chinese sentence.

The attached patch makes it support character-based tokenization.

cjk.txt
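
(The attached cjk.txt is not reproduced here; below is only a rough sketch of the character-based idea, not the actual patch.)

```python
# Rough sketch of character-based handling for CJK (NOT the attached patch):
# surround CJK codepoints with spaces so each character becomes its own token,
# instead of an entire unsegmented sentence being treated as one huge "word".
import re

# Common CJK ranges only; an approximation, not a full Unicode property check.
_CJK = re.compile(u"([\u4e00-\u9fff\u3400-\u4dbf\uf900-\ufaff])")

def split_cjk_chars(text):
  """Insert spaces around CJK characters, then collapse repeated whitespace."""
  return re.sub(r"\s+", " ", _CJK.sub(r" \1 ", text)).strip()

print(split_cjk_chars(u"今天天气很好, let's go out."))
# -> 今 天 天 气 很 好 , let's go out.
```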

torshie avatar Mar 16 '19 16:03 torshie

I would also find it helpful if there were a suggested max_subtoken_length value.

I haven't hit a memory issue yet, but t2t-datagen is taking a long time to run. How long did it take for you? @hpulfc @xuekun90 @robotzheng

@echan00 Did you ever find a suggested max_subtoken_length value? The default of 200 seems very large.

Santosh-Gupta avatar Jul 15 '19 21:07 Santosh-Gupta

Hello all, any updates on this? I also ran into the massive memory use when trying to run t2t-datagen for translate_enzh_wmt32k.

timxzz avatar Aug 06 '19 21:08 timxzz

https://github.com/tensorflow/tensor2tensor/issues/855#issuecomment-473559816

It seems encode() in data_generators/tokenizer.py doesn't support Chinese: it cannot tokenize a Chinese sentence.

The attached patch makes it support character-based tokenization.

cjk.txt

It works for me. Thanks!

qpzhao avatar Aug 03 '20 08:08 qpzhao