Bug report when running GPipe with lm.one_billion_wds.OneBWdsGPipeTransformerWPM
Hi, when I try to run the GPipe one_billion_wds example using the given command:
trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformerWPM --logdir=/tmp/lm/log --logtostderr --worker_split_size=4 --worker_gpus=4
I get the following error:
Traceback (most recent call last):
File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/main/lingvo/trainer.py", line 1957, in
I have 4 Tesla V100-SXM2 GPUs. When I run the one_billion example without GPipe, it works. I don't know how to fix this problem. Could you please give some advice? Thanks!
Hey, were you able to resolve this issue? I'm facing the same problem.
Not yet :(
Is there any other example using GPipe? If not, I think it should be easy enough to copy-paste the necessary parts from that class and use them in another example like MNIST.
Hi, have you found the solution to this issue?
I did a lot of debugging and found that there is an issue with input generation (for example, a tensor shaped (32,) is being converted to (32, None)). This causes assertion errors at later stages. However, I was able to develop a working model using GPipe for the MNIST dataset (by extending LeNet5).
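To make the symptom concrete, here is a minimal sketch in plain TensorFlow (not code taken from the Lingvo repository) of how a static batch dimension can get lost when a dynamic size is used in a reshape, and how re-asserting the known shape avoids later static-shape assertions. The batch size of 32, the `build` function, and the `tf.function` wrapper are just illustrative assumptions; `tf.ensure_shape` is one generic workaround, not necessarily the fix used in the modified code:

```python
import tensorflow as tf

@tf.function(input_signature=[tf.TensorSpec([None], tf.int32)])
def build(ids):
    # The input pipeline only declares a dynamic batch dimension, so any
    # tensor derived from it carries a static shape of (None, 1) rather
    # than the (32, 1) we actually expect.
    batch = tf.shape(ids)[0]
    ids_2d = tf.reshape(ids, [batch, 1])
    print("static shape after reshape:", ids_2d.shape)        # (None, 1)

    # Pin the shape we really know, so downstream static-shape checks
    # (like the per-split assertions GPipe relies on) see concrete sizes.
    ids_2d = tf.ensure_shape(ids_2d, [32, 1])
    print("static shape after ensure_shape:", ids_2d.shape)   # (32, 1)
    return ids_2d

build(tf.zeros([32], dtype=tf.int32))  # the prints run once, at trace time
```

In Lingvo terms, the idea would be to make sure the batches coming out of the input generator carry fully defined static shapes before they reach the pipelined layers.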
Yeah, I found that input generation problem. I tried to solve it, but I haven't figured it out yet. Could you share your modified model with me? I also want to learn how to develop GPipe models for other datasets. Thanks a lot.
Sure, how would you like me to share it? There are a lot of new additions and fixes to various library files.
You can find the necessary files at https://github.com/adis98/Lingvo_modified
Thanks!
You have a system with multiple GPUs, right? Could you let me know if it is running fine? If there are any bugs, do let me know.
Sure, I will let you know.
I've tested the code (the image processing model) with 2 GPUs (8 GB Tesla M60) and it seems to be working fine.
Oh, nice! I haven't had time to try it yet. I will let you know once I've finished running the code.
@adis98 I have the same issue with "AssertionError: [1, None]" when running lm.one_billion_wds.OneBWdsGPipeTransformerWPM. I saw your comment that it was an issue with the input generator; does that mean the issue is in "input_generator.py"? I compared the code in modified_library_codes at https://github.com/adis98/Lingvo_modified/tree/main/modified_library_codes/lingvo/tasks/lm, and it seems there is no change in that file. Can you point out which code needs to be fixed? Very much appreciated, thanks.
@jzhoulon I didn't work towards resolving the issue in the lm task, but I did make some modifications to the image task, making use of GPipe.
Hi, I also encountered the same error as you. Have you solved it? If so, could you please give me some advice? Thanks. @xsppp