The PyTorch Lightning adapter raises a TypeError after calling `finalize()` for remote runs
🐛 Bug
After calling `finalize()` on the `AimLogger` adapter for PyTorch Lightning, the logger fails to reconnect to the Aim `Run`. This happens, for example, in a training loop like:
```python
trainer = Trainer(logger=aim_logger, ...)
trainer.fit(...)
trainer.test(...)
```
Full traceback

```
File "/home/cas/Documents/repositories/Yggdrasil/yggdrasil/apps/train.py", line 172, in train_cli
  aim_logger.experiment.add_tag("Interrupted")
File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/lightning_fabric/loggers/logger.py", line 117, in experiment
  return get_experiment() or _DummyExperiment()
File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py", line 27, in wrapped_fn
  return fn(*args, **kwargs)
File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/lightning_fabric/loggers/logger.py", line 115, in get_experiment
  return fn(self)
File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/sdk/adapters/pytorch_lightning.py", line 80, in experiment
  self._run = Run(
File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/ext/exception_resistant.py", line 70, in wrapper
  _SafeModeConfig.exception_callback(e, func)
File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/ext/exception_resistant.py", line 47, in reraise_exception
  raise e
File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/ext/exception_resistant.py", line 68, in wrapper
  return func(*args, **kwargs)
File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/sdk/run.py", line 828, in __init__
  super().__init__(run_hash, repo=repo, read_only=read_only, experiment=experiment, force_resume=force_resume)
File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/sdk/run.py", line 276, in __init__
  super().__init__(run_hash, repo=repo, read_only=read_only, force_resume=force_resume)
File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/sdk/base_run.py", line 50, in __init__
  self._lock.lock(force=force_resume)
File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/storage/lock_proxy.py", line 38, in lock
  return self._rpc_client.run_instruction(self._hash, self._handler, 'lock', (force,))
File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/ext/transport/client.py", line 258, in run_instruction
  return self._run_read_instructions(queue_id, resource, method, args)
File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/ext/transport/client.py", line 283, in _run_read_instructions
  raise_exception(status_msg.header.exception)
File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
  raise exception(*args) if args else exception()
TypeError: __init__() missing 1 required positional argument: 'lock_file'
```
To reproduce

```python
aim_repo = ...  # Remote repo
logger = AimLogger(experiment="test", repo=aim_repo)
logger.log_metrics({"A": 1, "B": 2}, step=0)
logger.log_metrics({"A": 10, "B": 20}, step=1)
logger.finalize()
logger.experiment.add_tag("working_or_not")  # raises the TypeError
```
Expected behavior
Accessing the `.experiment` attribute should reconnect to the remote Aim `Run`.
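For context, the adapter's `experiment` property follows a lazy-reconnect pattern: `finalize()` drops the cached run object, and the next `.experiment` access is supposed to recreate it from the stored run hash. Below is a minimal, self-contained sketch of that pattern using dummy `Run` and `Logger` stand-ins (illustrative only, not Aim's real classes):

```python
class Run:
    """Stand-in for aim.Run: attaches to a run identified by a hash."""
    def __init__(self, run_hash=None):
        self.hash = run_hash if run_hash is not None else "new-run-hash"
        self.closed = False

    def close(self):
        self.closed = True


class Logger:
    """Illustrates the lazy-reconnect pattern used by logger adapters."""
    def __init__(self):
        self._run = None
        self._run_hash = None

    @property
    def experiment(self):
        if self._run is None:
            # Reattach to the existing run if we already have its hash,
            # otherwise create a fresh run and remember the hash.
            self._run = Run(self._run_hash)
            self._run_hash = self._run.hash
        return self._run

    def finalize(self):
        if self._run is not None:
            self._run.close()
            self._run = None  # next .experiment access must reconnect


logger = Logger()
first = logger.experiment
logger.finalize()
second = logger.experiment            # reconnects using the stored hash
assert second.hash == first.hash
```

The bug reported here is that, for remote repos, this reconnection step fails inside `Run.__init__` while acquiring the run lock over RPC.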
Environment
- Aim Version (e.g., 3.0.1): 3.17.3
- Python version: 3.9.16
- pip version: 23.0.1
- OS (e.g., Linux): Linux
Additional context
With the following monkey patch, the exception is no longer raised. The key change is passing `force_resume=True` when reattaching to the run by hash:
```python
# Monkey patch for the win!
import aim
from pytorch_lightning.loggers.logger import rank_zero_experiment
from aim.pytorch_lightning import AimLogger


@property
@rank_zero_experiment
def monkey_patch_valtracker_experiment(self) -> aim.Run:
    if self._run is None:
        if self._run_hash:
            self._run = aim.Run(
                self._run_hash,
                repo=self._repo_path,
                system_tracking_interval=self._system_tracking_interval,
                capture_terminal_logs=self._capture_terminal_logs,
                force_resume=True,
            )
            if self._run_name is not None:
                self._run.name = self._run_name
        else:
            self._run = aim.Run(
                repo=self._repo_path,
                experiment=self._experiment_name,
                system_tracking_interval=self._system_tracking_interval,
                log_system_params=self._log_system_params,
                capture_terminal_logs=self._capture_terminal_logs,
            )
            self._run_hash = self._run.hash
    return self._run


AimLogger.experiment = monkey_patch_valtracker_experiment
```
Hey @cwognum! Thanks a lot for the detailed report. I'll have a look at it and will keep you posted here once there are any updates.
I have not been able to reproduce this in a simple, small example, but in a larger training loop the monkey patch above (with `force_resume=True`) seems to slow the application down by a few minutes.
My training loop is organized like:

```python
trainer = Trainer(logger=aim_logger, ...)
trainer.fit(...)
trainer.test(...)
```
The loop freezes for a few minutes between `fit()` and `test()`, which it didn't do with a previous version of Aim and which I cannot attribute to any other piece of logic.
I understand this is not the most helpful report, but I wanted to share it nonetheless in case it makes sense to someone with a better understanding of the Aim codebase than I have. In the meantime I'll try to reproduce this in a simple example.
@cwognum That's actually expected when a lot of data is tracked: when `fit()` finishes, Aim tries to drain the queue of pending values, since it is also finalizing the run.
So the problem is not in the patch; the delay is more or less expected when using remote tracking.
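The flush-on-finalize behavior described above can be sketched roughly as follows (a simplified illustration with dummy names, not Aim's actual implementation): tracked values are queued locally and sent by a background worker, and finalizing blocks until the queue is empty, which is why a large backlog translates into a long pause between `fit()` and `test()`.

```python
import queue
import threading
import time


class RemoteTracker:
    """Simplified sketch: values are queued locally, sent by a worker."""
    def __init__(self):
        self._queue = queue.Queue()
        self.sent = 0
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def track(self, value):
        self._queue.put(value)  # non-blocking for the caller

    def _drain(self):
        while True:
            self._queue.get()
            time.sleep(0.001)  # stand-in for per-value network latency
            self.sent += 1
            self._queue.task_done()

    def finalize(self):
        # Blocks until every queued value has been processed -- with a
        # large backlog this is where the multi-minute pause comes from.
        self._queue.join()


tracker = RemoteTracker()
for step in range(100):
    tracker.track(step)
tracker.finalize()   # returns only after all 100 values were sent
print(tracker.sent)  # → 100
```

The trade-off is deliberate: `track()` stays cheap during training, and the cost of the network round trips is paid once, at finalization time.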
Clear, thank you! In my mind it didn't take this long before, but maybe I'm just having a day of bad WiFi, or I'm misremembering. Thanks for the quick reply!
Perhaps unrelated, but I now also get warnings like the one below (the process does not exit) in between training and testing. Again (sorry!), I have not been able to reproduce this in a simple example, as it happens only sporadically in longer trainings (>30 min).
```
Exception ignored in: <finalize object at 0x7f9ba2601620; dead>
Traceback (most recent call last):
  File "/home/cas/local/conda/envs/my_env/lib/python3.9/weakref.py", line 591, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/home/cas/local/conda/envs/my_env/lib/python3.9/site-packages/aim/ext/transport/remote_resource.py", line 14, in _close
    self.rpc_client.release_resource(self.handler)
  File "/home/cas/local/conda/envs/my_env/lib/python3.9/site-packages/aim/ext/transport/client.py", line 240, in release_resource
    del self._resource_pool[resource_handler]
  File "/home/cas/local/conda/envs/my_env/lib/python3.9/weakref.py", line 146, in __delitem__
    del self.data[key]
KeyError: 'e61a8904-b4a7-4a9f-b8de-98d872b86800'
```
Encountered the same error: `TypeError: __init__() missing 1 required positional argument: 'lock_file'`. As described in the issue, the following simple code produced the error:

```python
trainer = Trainer(logger=aim_logger, ...)
trainer.fit(...)
trainer.test(...)
```

The error arose inside the test function, after fit had finished.
Modifying the code to use `aim.Run(..., force_resume=True)` helped, but I'm not sure if this is the correct way to do it.
Used aim==3.17.3.
@Zakhar17 I'm seeing the same issue.
Do you mean that instead of using `aim.pytorch_lightning.AimLogger` you simply create an instance of `aim.Run` and attach that to your Lightning module?
Hi, is there going to be a fix for this? I'm trying to inherit from `AimLogger` with the monkey patch above, and the iteration count seems to be degraded.