
The PyTorch Lightning adapter raises a TypeError after calling `finalize()` for remote runs

Open cwognum opened this issue 3 years ago • 12 comments

🐛 Bug

After calling finalize() on the AimLogger adapter for PyTorch Lightning, the logger fails to reconnect to the Aim Run. This happens, for example, in a training loop like:

trainer = Trainer(logger=aim_logger, ...)
trainer.fit(...)
trainer.test(...)
Full traceback

  File "/home/cas/Documents/repositories/Yggdrasil/yggdrasil/apps/train.py", line 172, in train_cli
    aim_logger.experiment.add_tag("Interrupted")
  File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/lightning_fabric/loggers/logger.py", line 117, in experiment
    return get_experiment() or _DummyExperiment()
  File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py", line 27, in wrapped_fn
    return fn(*args, **kwargs)
  File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/lightning_fabric/loggers/logger.py", line 115, in get_experiment
    return fn(self)
  File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/sdk/adapters/pytorch_lightning.py", line 80, in experiment
    self._run = Run(
  File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/ext/exception_resistant.py", line 70, in wrapper
    _SafeModeConfig.exception_callback(e, func)
  File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/ext/exception_resistant.py", line 47, in reraise_exception
    raise e
  File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/ext/exception_resistant.py", line 68, in wrapper
    return func(*args, **kwargs)
  File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/sdk/run.py", line 828, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, experiment=experiment, force_resume=force_resume)
  File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/sdk/run.py", line 276, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, force_resume=force_resume)
  File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/sdk/base_run.py", line 50, in __init__
    self._lock.lock(force=force_resume)
  File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/storage/lock_proxy.py", line 38, in lock
    return self._rpc_client.run_instruction(self._hash, self._handler, 'lock', (force,))
  File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/ext/transport/client.py", line 258, in run_instruction
    return self._run_read_instructions(queue_id, resource, method, args)
  File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/ext/transport/client.py", line 283, in _run_read_instructions
    raise_exception(status_msg.header.exception)
  File "/home/cas/local/conda/envs/yggdrasil/lib/python3.9/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
    raise exception(*args) if args else exception()
TypeError: __init__() missing 1 required positional argument: 'lock_file'

To reproduce

aim_repo = ...  # Remote repo
logger = AimLogger(experiment="test", repo=aim_repo)

logger.log_metrics({"A": 1, "B": 2}, step=0)
logger.log_metrics({"A": 10, "B": 20}, step=1)

logger.finalize()

logger.experiment.add_tag("working_or_not")  # raises the TypeError

Expected behavior

Accessing the .experiment attribute should reconnect to the remote Aim Run.

Environment

  • Aim Version (e.g., 3.0.1): 3.17.3
  • Python version: 3.9.16
  • pip version: 23.0.1
  • OS (e.g., Linux): Linux

Additional context

With the following monkey patch, the exception is no longer raised. The key change is passing force_resume=True when re-opening the run by hash.

# Monkey patch for the win!
import aim
from pytorch_lightning.loggers.logger import rank_zero_experiment
from aim.pytorch_lightning import AimLogger


@property
@rank_zero_experiment
def monkey_patch_valtracker_experiment(self) -> aim.Run:
    if self._run is None:
        if self._run_hash:
            self._run = aim.Run(
                self._run_hash,
                repo=self._repo_path,
                system_tracking_interval=self._system_tracking_interval,
                capture_terminal_logs=self._capture_terminal_logs,
                force_resume=True,  # the only change vs. the stock property
            )
            if self._run_name is not None:
                self._run.name = self._run_name
        else:
            self._run = aim.Run(
                repo=self._repo_path,
                experiment=self._experiment_name,
                system_tracking_interval=self._system_tracking_interval,
                log_system_params=self._log_system_params,
                capture_terminal_logs=self._capture_terminal_logs,
            )
            self._run_hash = self._run.hash
    return self._run


AimLogger.experiment = monkey_patch_valtracker_experiment
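The mechanics of the patch can be illustrated with a small self-contained stand-in that needs no aim install (StubRun and StubLogger are hypothetical stand-ins, not Aim APIs): finalize() drops the cached run handle without the remote lock being released, so the next .experiment access must re-open the run by hash with force_resume=True, or the lock check fails.

```python
# Self-contained sketch of the patch's logic. StubRun/StubLogger are
# illustrative stand-ins for aim.Run and AimLogger, not real Aim classes.

class StubRun:
    """Holds a run hash; refuses to re-open a still-locked run unless forced."""
    _locked_hashes = set()  # hashes whose lock was never released

    def __init__(self, run_hash=None, force_resume=False):
        self.hash = run_hash or "new-run"
        if run_hash in StubRun._locked_hashes and not force_resume:
            # mimics the reported failure mode on reconnect
            raise TypeError("missing 1 required positional argument: 'lock_file'")
        StubRun._locked_hashes.add(self.hash)


class StubLogger:
    """Caches its run like AimLogger's `experiment` property does."""

    def __init__(self):
        self._run = None
        self._run_hash = None

    @property
    def experiment(self):
        if self._run is None:
            if self._run_hash:
                # the fix: force-resume when reconnecting to a known run
                self._run = StubRun(self._run_hash, force_resume=True)
            else:
                self._run = StubRun()
                self._run_hash = self._run.hash
        return self._run

    def finalize(self):
        self._run = None  # drops the handle; the stub lock stays held


logger = StubLogger()
first = logger.experiment
logger.finalize()
second = logger.experiment  # reconnects instead of raising
```

Without force_resume=True in the reconnect branch, the second .experiment access would hit the TypeError, which is exactly the behavior the patch above works around.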

cwognum avatar Apr 24 '23 21:04 cwognum

Hey @cwognum! Thanks a lot for the detailed report. I'll have a look at it and will post any updates here.

mihran113 avatar Apr 25 '23 13:04 mihran113

I have not been able to reproduce this in a simple, small example, but in a larger training loop the provided monkey patch (force_resume=True) seems to slow down the application for a few minutes.

My training loop is organized like:

trainer = Trainer(logger=aim_logger, ...)
trainer.fit(...)
trainer.test(...)

The loop freezes for a few minutes in between fit() and test(), which it didn't do with a previous version of Aim and which I cannot attribute to any other piece of logic.

I understand this is not the most helpful report, but I wanted to share it nonetheless in case it makes sense to someone with a better understanding of the Aim codebase than myself. In the meantime I'll try to reproduce this in a simple example.

cwognum avatar Apr 26 '23 18:04 cwognum

@cwognum that's actually expected when a lot of data is tracked: when fit() finishes, Aim flushes the queue of pending values while it finalizes the run.

mihran113 avatar Apr 26 '23 19:04 mihran113

so the problem is not in the patch; some delay is expected anyway when using remote tracking

mihran113 avatar Apr 26 '23 19:04 mihran113

Clear! Thank you! I don't remember it taking this long before, but maybe I'm just having a day of bad WiFi, or I'm misremembering. Thank you for the quick reply!

cwognum avatar Apr 26 '23 19:04 cwognum

Perhaps unrelated, but I now also get warnings like the one below in between training and testing (the process does not exit). Again (sorry!), I have not been able to reproduce this in a simple example, as it happens sporadically in longer trainings (>30 min).

Exception ignored in: <finalize object at 0x7f9ba2601620; dead>
Traceback (most recent call last):
  File "/home/cas/local/conda/envs/my_env/lib/python3.9/weakref.py", line 591, in __call__
    return info.func(*info.args, **(info.kwargs or {}))
  File "/home/cas/local/conda/envs/my_env/lib/python3.9/site-packages/aim/ext/transport/remote_resource.py", line 14, in _close
    self.rpc_client.release_resource(self.handler)
  File "/home/cas/local/conda/envs/my_env/lib/python3.9/site-packages/aim/ext/transport/client.py", line 240, in release_resource
    del self._resource_pool[resource_handler]
  File "/home/cas/local/conda/envs/my_env/lib/python3.9/weakref.py", line 146, in __delitem__
    del self.data[key]
KeyError: 'e61a8904-b4a7-4a9f-b8de-98d872b86800'

cwognum avatar Apr 28 '23 18:04 cwognum

I encountered the same error, TypeError: __init__() missing 1 required positional argument: 'lock_file'. As described in the issue, the following simple code produced it:

trainer = Trainer(logger=aim_logger, ...)
trainer.fit(...)
trainer.test(...)

The error arose inside the test function, after fit had finished.

Modifying the code to use aim.Run(..., force_resume=True) helped, but I'm not sure if this is the correct way to do it.

Used aim==3.17.3

Zakhar17 avatar Aug 03 '23 14:08 Zakhar17

Modifying the code to use aim.Run(..., force_resume=True) helped, but I'm not sure if this is the correct way to do it.

Used aim==3.17.3

@Zakhar17 I'm seeing the same issue.

Do you mean that instead of using aim.pytorch_lightning.AimLogger you simply create an instance of aim.Run and attach that to your lightning module?

whlteXbread avatar Aug 28 '23 21:08 whlteXbread

Hi, is there going to be a fix for this? I'm trying to subclass AimLogger with the monkey patch above, and the iteration count seems to be degraded.

danielphan2003 avatar Nov 17 '23 14:11 danielphan2003