Assert subkey == 'step'
Hello. I get the error below when using SGD as the optimizer. With Adam it works correctly.
optimizer = optim.Adam(params_to_update, lr=1e-4)
#optimizer = optim.AdamW(params_to_update, lr=0.001, weight_decay=0.02)
#optimizer = optim.SGD(params_to_update, lr=0.01)
Basically, in this setting it works, but if I comment out Adam and uncomment SGD, I get this error:
[11:07:13] INFO Using Interactive Python API collaborator.py:237
ERROR Collaborator failed with error: : envoy.py:93
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/component/envoy/envoy.py", line 91, in run
    self._run_collaborator()
  File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/component/envoy/envoy.py", line 164, in _run_collaborator
    col.run()
  File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/component/collaborator/collaborator.py", line 145, in run
    self.do_task(task, round_number)
  File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/component/collaborator/collaborator.py", line 255, in do_task
    global_output_tensor_dict, local_output_tensor_dict = func(
  File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/federated/task/task_runner.py", line 108, in collaborator_adapted_task
    self.rebuild_model(input_tensor_dict, validation=validation_flag, device=device)
  File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/federated/task/task_runner.py", line 229, in rebuild_model
    self.set_tensor_dict(input_tensor_dict, with_opt_vars=True, device=device)
  File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/federated/task/task_runner.py", line 381, in set_tensor_dict
    return self.framework_adapter.set_tensor_dict(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/plugins/frameworks_adapters/pytorch_adapter.py", line 55, in set_tensor_dict
    _set_optimizer_state(optimizer, device, tensor_dict)
  File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/plugins/frameworks_adapters/pytorch_adapter.py", line 70, in _set_optimizer_state
    temp_state_dict = expand_derived_opt_state_dict(
  File "/home/ubuntu/anaconda3/envs/openfl/lib/python3.8/site-packages/openfl/plugins/frameworks_adapters/pytorch_adapter.py", line 236, in expand_derived_opt_state_dict
    assert subkey == 'step'
AssertionError
Here is my analysis of the issue:

Issue Reproduction:
- Use SGD as the optimizer in the PyTorch_TinyImageNet tutorial: optimizer_SGD = optim.SGD(model.parameters(), lr=1e-1)
- The issue is observed with torch >= 1.8.0; in my case it is torch 1.13.1.

The issue is observed with torch >= 1.8.0 but not with torch <= 1.7.1.

torch <= 1.7.1
- By default, the optimizer's state dictionary is empty.
- Inside the method _derive_opt_state_dict(opt_state_dict), len(opt_state_dict['state']) evaluates to 0, indicating that the optimizer is stateless (see the sketch after the logs below).
Logs:
INFO     #### opt_state_dict {'state': {}, 'param_groups': [{'lr': 0.1, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': []}]} ####   pytorch_adapter.py:121
WARNING  tried to remove tensor: __opt_state_needed not present in the tensor dict   utils.py:170
INFO     #### opt_state_dict {'state': {}, 'param_groups': [{'lr': 0.1, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': []}]} ####   pytorch_adapter.py:121
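For reference, a minimal sketch outside OpenFL that reproduces the stateless behaviour; the nn.Linear toy model is my own stand-in for the tutorial network:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)                            # toy stand-in model
optimizer = optim.SGD(model.parameters(), lr=0.1)  # momentum defaults to 0

# One training step, so the optimizer would create state if it had any.
optimizer.zero_grad()
model(torch.randn(1, 4)).sum().backward()
optimizer.step()

# On torch <= 1.7.1, SGD without momentum keeps no per-parameter state:
print(optimizer.state_dict()['state'])  # {} -> len(...) == 0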
torch >= 1.8.0
- The state dictionary is populated with momentum buffers, which default to None, so it becomes non-empty.
- Inside _derive_opt_state_dict(opt_state_dict), len(opt_state_dict['state']) evaluates to non-zero, indicating that the optimizer has state.
- This leads to an assertion error (assert subkey == 'step'), because none of the subkeys is 'step' (see the sketch after the logs below).
Logs:
INFO     #### opt_state_dict {'state': {0: {'momentum_buffer': None}, 1: {'momentum_buffer': None}}, 'param_groups': [{'lr': 0.0001, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': [0, 1]}]} ####   pytorch_adapter.py:120
WARNING  tried to remove tensor: __opt_state_needed not present in the tensor dict   utils.py:172
INFO     #### opt_state_dict {'state': {0: {'momentum_buffer': None}, 1: {'momentum_buffer': None}}, 'param_groups': [{'lr': 0.0001, 'momentum': 0, 'dampening': 0, 'weight_decay': 0, 'nesterov': False, 'maximize': False, 'foreach': None, 'differentiable': False, 'params': [0, 1]}]} ####   pytorch_adapter.py:120
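Running the same toy sketch (again my own stand-in model, not the tutorial code) on torch >= 1.8.0 reproduces the non-empty state seen in the logs above:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)                            # toy stand-in model
optimizer = optim.SGD(model.parameters(), lr=0.1)  # momentum still 0

optimizer.zero_grad()
model(torch.randn(1, 4)).sum().backward()
optimizer.step()

# On torch >= 1.8.0 each parameter gets a {'momentum_buffer': None} entry,
# so len(opt_state_dict['state']) is non-zero although there is nothing to
# transfer; the subkey 'momentum_buffer' is not 'step', which is exactly
# what fires OpenFL's assert subkey == 'step'.
print(optimizer.state_dict()['state'])
# {0: {'momentum_buffer': None}, 1: {'momentum_buffer': None}}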
Additional Information: the issue is not observed if momentum != 0
- When we add a momentum parameter to the optimizer definition, the state dict is populated with real tensors.
- In this case the optimizer state is genuinely needed, derived_opt_state_dict['__opt_state_needed'] = 'true' is set, and everything continues to work as expected (sketch below).
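A sketch of this working path with the same toy model, assuming any non-zero momentum value:

import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(4, 2)                                          # toy stand-in model
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # non-zero momentum

optimizer.zero_grad()
model(torch.randn(1, 4)).sum().backward()
optimizer.step()

# With non-zero momentum each momentum_buffer is a real tensor, so OpenFL
# serializes it, sets derived_opt_state_dict['__opt_state_needed'] = 'true',
# and rebuilding the model on the collaborator works as expected.
print(optimizer.state_dict()['state'][0]['momentum_buffer'].shape)  # torch.Size([2, 4])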
Next Steps: I will raise a PR to fix this issue.
Thank you for the answer. So, for now, the only fix is to change the momentum?
Yes, for now we can work around it by giving momentum a non-zero value.
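Concretely, in the snippet from the first post that means changing the SGD line to something like the following (momentum=0.9 is an arbitrary non-zero choice; tune it as you normally would):

optimizer = optim.SGD(params_to_update, lr=0.01, momentum=0.9)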