
Bug report when running GPipe with lm.one_billion_wds.OneBWdsGPipeTransformerWPM

Open xsppp opened this issue 5 years ago • 16 comments

Hi, when I try to run the GPipe one_billion_wds example using the given command:

```shell
trainer --run_locally=gpu --mode=sync --model=lm.one_billion_wds.OneBWdsGPipeTransformerWPM \
  --logdir=/tmp/lm/log --logtostderr --worker_split_size=4 --worker_gpus=4
```

There is an error:

```
Traceback (most recent call last):
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1957, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1941, in main
    RunnerManager(FLAGS.model).Start()
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1937, in Start
    self.StartRunners(self.CreateRunners(FLAGS.job.split(','), FLAGS.logdir))
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1666, in CreateRunners
    runner = self._CreateRunner(j, FLAGS.model_task_name, logdir, tf_master,
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 1617, in _CreateRunner
    return self.Controller(cfg, *common_args)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/trainer.py", line 241, in __init__
    self._model.ConstructFPropBPropGraph()
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 1056, in ConstructFPropBPropGraph
    self._task.FPropDefaultTheta()
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 554, in FPropDefaultTheta
    return self.FProp(self.theta, input_batch)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 471, in FProp
    metrics, per_example = self._FPropSplitInputBatch(theta, input_batch)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/base_model.py", line 518, in _FPropSplitInputBatch
    metrics, per_example = self.FPropTower(theta_local, batch)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/tasks/lm/model.py", line 267, in FPropTower
    xent_output, _ = self.lm.FProp(theta.lm, ids, paddings, state0, labels)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/tasks/lm/layers.py", line 1308, in FProp
    per_example_xent, logits = self.stack.FProp(theta.stack, ids, paddings,
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/layers_with_gpipe.py", line 865, in FProp
    logits = super().FProp(theta, source_input, source_paddings, target_input,
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/gpipe.py", line 454, in FProp
    state_shapes = self._CalculateOutputShapes(input_shapes)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/gpipe.py", line 366, in _CalculateOutputShapes
    shapes = py_utils.Transform(_ToTShape, input_shapes)
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/py_utils.py", line 810, in Transform
    return tf.nest.map_structure(fn, *v)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/nest.py", line 635, in map_structure
    structure[0], [func(*x) for x in entries],
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/nest.py", line 635, in <listcomp>
    structure[0], [func(*x) for x in entries],
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/gpipe.py", line 364, in _ToTShape
    return tshape.Shape(x.as_list())
  File "/tmp/lingvo/bazel-bin/lingvo/trainer.runfiles/__main__/lingvo/core/tshape.py", line 43, in __init__
    assert x is not None, str(dims)
AssertionError: [1, None]
```

I have 4 Tesla V100-SXM2 GPUs, and when I run the one_billion example without GPipe, it works. I don't know how to fix this problem. Could you please give some advice? Thanks!

xsppp avatar Jan 20 '21 14:01 xsppp

Hey, were you able to resolve this issue? I'm facing the same problem.

adis98 avatar Feb 03 '21 06:02 adis98

Not yet :(

xsppp avatar Feb 03 '21 08:02 xsppp

Is there any other example using GPipe? If not, I think it should be easy enough to copy-paste the necessary parts from that class and use them on another example like MNIST.

adis98 avatar Feb 03 '21 10:02 adis98

> Hey, were you able to resolve this issue? I'm facing the same problem.

Hi, have you found the solution to this issue?

xsppp avatar Feb 17 '21 08:02 xsppp

I did a lot of debugging and found out that there is an issue with input generation (for example, a tensor shaped (32,) is being converted to (32, None)). This causes assertion errors at later stages. However, I was able to develop a working model using GPipe for the MNIST dataset (by extending LeNet5).
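A simple way to localize the bug adis98 describes is to validate static shapes right after input generation, before anything reaches GPipe. Below is a hedged, pure-Python sketch (the helper `check_static_shapes` and the tensor names are hypothetical, not part of Lingvo's API); with real tensors you would pass each tensor's `shape.as_list()`:

```python
def check_static_shapes(batch):
    """batch: dict mapping tensor name -> static shape tuple (None = unknown dim).

    Raises early, with the offending names, instead of letting GPipe's
    shape calculation fail later with a bare AssertionError.
    """
    bad = {k: s for k, s in batch.items() if any(d is None for d in s)}
    if bad:
        raise ValueError(f"tensors with unknown dims: {bad}")

check_static_shapes({"ids": (32, 128), "paddings": (32, 128)})  # passes
try:
    check_static_shapes({"ids": (32, None)})  # the reported symptom
except ValueError as e:
    print(e)
```

If this check fails on the input batch, the fix belongs in the input generator (e.g. pinning the lost dimension back with TensorFlow's `set_shape`/`tf.ensure_shape`), not in the GPipe layers.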

adis98 avatar Feb 17 '21 09:02 adis98

> I did a lot of debugging and found out that there is an issue with input generation (for example, a tensor shaped (32,) is being converted to (32, None)). This causes assertion errors at later stages. However, I was able to develop a working model using GPipe for the MNIST dataset (by extending LeNet5).

Yeah, I found that input generation problem. I tried to solve it, but I haven't figured it out yet. Could you share your modified model with me? I also want to learn how to develop GPipe for other datasets. Thanks a lot.

xsppp avatar Feb 17 '21 09:02 xsppp

Sure, how would you like me to share it? There are a lot of new additions and fixes to various library files.

adis98 avatar Feb 17 '21 09:02 adis98

You can find the necessary files here: https://github.com/adis98/Lingvo_modified

adis98 avatar Feb 17 '21 09:02 adis98

> You can find it here

Thanks!

xsppp avatar Feb 17 '21 09:02 xsppp

You have a system with multiple GPUs, right? Could you let me know if it runs fine? If there are any bugs, do let me know.

adis98 avatar Feb 17 '21 10:02 adis98

> You have a system with multiple GPUs, right? Could you let me know if it runs fine? If there are any bugs, do let me know.

Sure, I will let you know.

xsppp avatar Feb 17 '21 11:02 xsppp

I've tested the code (the image processing model) with 2 GPUs (8 GB Tesla M60), and it seems to be working fine.

adis98 avatar Mar 03 '21 05:03 adis98

> I've tested the code (the image processing model) with 2 GPUs (8 GB Tesla M60), and it seems to be working fine.

Oh, nice! I haven't had time to try it yet. I will let you know once I've finished running the code.

xsppp avatar Mar 05 '21 15:03 xsppp

@adis98 I have the same issue with "AssertionError: [1, None]" when running lm.one_billion_wds.OneBWdsGPipeTransformerWPM. I saw your comment that it was an issue with the input generator; does that mean the issue is in "input_generator.py"? I compared the code in modified_library_codes at https://github.com/adis98/Lingvo_modified/tree/main/modified_library_codes/lingvo/tasks/lm, and it seems there is no change in that file. Can you point out which code needs to be fixed? Much appreciated. Thanks.

jzhoulon avatar Aug 09 '21 08:08 jzhoulon

@jzhoulon I didn't work toward resolving the issue in the lm task, but I did make some modifications to the image task to make use of GPipe.

adis98 avatar Aug 09 '21 09:08 adis98

Hi, I also encountered the same error as you. Have you solved it? If you have, could you please give me some advice? Thanks. @xsppp

rank2008 avatar Oct 18 '21 07:10 rank2008