`runner.evaluate_loader` does not work with DataParallelEngine
## 🐛 Bug Report

### How To Reproduce
I have two GPUs, and both of them are enabled. I copied the linear regression minimal example. After that, I checked the engine:

```python
runner.engine
# <catalyst.engines.torch.DataParallelEngine at 0x7f67e25d72b0>
```
Then, the following line produced a long error message:

```python
runner.evaluate_loader(loaders['valid'])
```
```
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/home/shuhua/GitHub/Learn-DL/catalyst-tutorial/linear-regression.ipynb Cell 10' in <cell line: 1>()
----> 1 runner.evaluate_loader(loaders['valid'])

File ~/miniconda3/lib/python3.9/site-packages/catalyst/runners/runner.py:490, in Runner.evaluate_loader(self, loader, callbacks, model, engine, seed, verbose)
    487 model = self.model
    488 assert model is not None
--> 490 self.train(
    491     model=model,
    492     engine=engine,
    493     loaders=OrderedDict([("valid", loader)]),
    494     num_epochs=1,
    495     verbose=verbose,
    496     callbacks=callbacks,
    497     valid_loader="valid",
    498     seed=seed,
    499 )
    501 return self.loader_metrics

File ~/miniconda3/lib/python3.9/site-packages/catalyst/runners/runner.py:377, in Runner.train(self, loaders, model, engine, criterion, optimizer, scheduler, callbacks, loggers, seed, hparams, num_epochs, logdir, resume, valid_loader, valid_metric, minimize_valid_metric, verbose, timeit, check, overfit, profile, load_best_on_end, cpu, fp16, ddp)
    375 self._load_best_on_end = load_best_on_end
    376 # run
--> 377 self.run()

File ~/miniconda3/lib/python3.9/site-packages/catalyst/core/runner.py:422, in IRunner.run(self)
    420 except (Exception, KeyboardInterrupt) as ex:
    421     self.exception = ex
--> 422 self._run_event("on_exception")
    423 return self

File ~/miniconda3/lib/python3.9/site-packages/catalyst/core/runner.py:365, in IRunner._run_event(self, event)
    363     getattr(callback, event)(self)
    364 if is_str_intersections(event, ("_end", "_exception")):
--> 365     getattr(self, event)(self)

File ~/miniconda3/lib/python3.9/site-packages/catalyst/core/runner.py:357, in IRunner.on_exception(self, runner)
    355 def on_exception(self, runner: "IRunner"):
    356     """Event handler."""
--> 357     raise self.exception

File ~/miniconda3/lib/python3.9/site-packages/catalyst/core/runner.py:419, in IRunner.run(self)
    413 """Runs the experiment.
    414
    415 Returns:
    416     self, `IRunner` instance after the experiment
    417 """
    418 try:
--> 419     self._run()
    420 except (Exception, KeyboardInterrupt) as ex:
    421     self.exception = ex

File ~/miniconda3/lib/python3.9/site-packages/catalyst/core/runner.py:410, in IRunner._run(self)
    408 def _run(self) -> None:
    409     self.engine = self.get_engine()
--> 410 self.engine.spawn(self._run_local)

File ~/miniconda3/lib/python3.9/site-packages/catalyst/core/engine.py:59, in Engine.spawn(self, fn, *args, **kwargs)
     42 def spawn(self, fn: Callable, *args, **kwargs):
     43     """Spawns processes with specified ``fn`` and ``args``/``kwargs``.
     44
     45     Args:
    (...)
     57         wrapped function (if needed).
     58     """
---> 59     return fn(*args, **kwargs)

File ~/miniconda3/lib/python3.9/site-packages/catalyst/core/runner.py:405, in IRunner._run_local(self, local_rank, world_size)
    403 self._local_rank, self._world_size = local_rank, world_size
    404 self._run_event("on_experiment_start")
--> 405 self._run_experiment()
    406 self._run_event("on_experiment_end")

File ~/miniconda3/lib/python3.9/site-packages/catalyst/core/runner.py:399, in IRunner._run_experiment(self)
    397     break
...
File "/home/shuhua/miniconda3/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 103, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument mat1 in method wrapper_addmm)
```
By contrast, if I use one GPU or the CPU by setting `os.environ["CUDA_VISIBLE_DEVICES"]`, then it works.
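The single-device workaround can be sketched as follows (the environment variable must be set before `torch` is first imported; the device index `"0"` is just an example):

```python
import os

# Make only one GPU visible, so Catalyst selects a single-device engine
# instead of DataParallelEngine. This must run before `import torch`.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```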
PyTorch `DataParallel` supports inference on multiple GPUs, right? I don't understand why `evaluate_loader` fails with `DataParallelEngine`.
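For context, plain PyTorch `DataParallel` does support multi-GPU inference, provided the wrapped model and the input batch both start on the source device (`cuda:0`) — which is exactly the invariant the traceback above shows being violated. A minimal sketch that falls back to CPU when fewer than two GPUs are available:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
x = torch.rand(32, 10)

if torch.cuda.device_count() > 1:
    # DataParallel replicates the module across GPUs and scatters `x`
    # along dim 0; parameters and input must start on cuda:0.
    model = nn.DataParallel(model).cuda()
    x = x.cuda()

with torch.no_grad():
    out = model(x)

print(out.shape)  # torch.Size([32, 1])
```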
### Environment
- Catalyst version: 20.04
- PyTorch version: 1.11.0
- Python version: 3.9
- CUDA runtime version: 11.4
- Nvidia driver version: 472.39
Hi, thanks for the issue!
Could you please try using `evaluate_loader` without any training?
As far as I can see from our implementation, we just run the experiment, so the problem could be with transferring the model/data from the train experiment to the validation experiment.
I tried:

1. Setting `num_epochs=0` in `runner.train`; the same error occurred.
2. Commenting out `runner.train` entirely and changing the `evaluate_loader` call to

   ```python
   runner.evaluate_loader(loaders['valid'], model=model)
   ```

   Then there was no error, though the code is not useful.
So, it looks like we have some problems with the hardware backend 😢 Maybe @ditwoo @bagxi could also look into it :)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.