python-amazon-sp-api icon indicating copy to clipboard operation
python-amazon-sp-api copied to clipboard

Global cache usage is not thread-safe and can cause bugs due to race conditions

Open dbent opened this issue 3 years ago • 1 comments

As far as I can tell there are three global caches in use:

All three are instances of cachetools.TTLCache from the cachetools package.

(There is also a cache class member on CredentialProvider at credential_provider.py#L113 but that appears to be dead code, it's not used anywhere.)

The cachetools documentation states:

Please be aware that all these classes are not thread-safe. Access to a shared cache from multiple threads must be properly synchronized, e.g. by using one of the memoizing decorators with a suitable lock object.

Since these caches are global it is impractical to avoid potential concurrent access in a multithreaded environment, even if multiple Client instances are used. The only way to guarantee safe usage in a multithreaded environment, as-is, would be to synchronize usage of all methods and fields across all Client objects. Since these operations are potentially lengthy requests, this would cause significant lock contention.

The lack of synchronization leads to race conditions and strange bugs deep inside the cachetools package such as this exception (potentially confidential parts removed and replaced with <...snip...>):

Message
'access_token_<...snip...>'

Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/cachetools/__init__.py", line 68, in __getitem__
return self.__data[key]
KeyError: 'access_token_<...snip...>'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/sp_api/auth/access_token_client.py", line 47, in get_auth
access_token = cache[cache_key]
File "/usr/local/lib/python3.10/site-packages/cachetools/__init__.py", line 418, in __getitem__
return cache_getitem(self, key)
File "/usr/local/lib/python3.10/site-packages/cachetools/__init__.py", line 70, in __getitem__
return self.__missing__(key)
File "/usr/local/lib/python3.10/site-packages/cachetools/__init__.py", line 97, in __missing__
raise KeyError(key)
KeyError: 'access_token_<...snip...>'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
<...snip...>
File "/usr/local/lib/python3.10/site-packages/sp_api/base/helpers.py", line 19, in wrapper
return function(*args, **kwargs)
File "/usr/local/lib/python3.10/site-packages/sp_api/api/reports/reports.py", line 113, in create_report
return self._request(kwargs.pop('path'), data=kwargs)
File "/usr/local/lib/python3.10/site-packages/sp_api/base/client.py", line 140, in _request
headers=headers or self.headers,
File "/usr/local/lib/python3.10/site-packages/sp_api/base/client.py", line 81, in headers
'x-amz-access-token': self.restricted_data_token or self.auth.access_token,
File "/usr/local/lib/python3.10/site-packages/sp_api/base/client.py", line 88, in auth
return self._auth.get_auth()
File "/usr/local/lib/python3.10/site-packages/sp_api/auth/access_token_client.py", line 51, in get_auth
cache[cache_key] = access_token
File "/usr/local/lib/python3.10/site-packages/cachetools/__init__.py", line 422, in __setitem__
self.expire(time)
File "/usr/local/lib/python3.10/site-packages/cachetools/__init__.py", line 476, in expire
cache_delitem(self, curr.key)
File "/usr/local/lib/python3.10/site-packages/cachetools/__init__.py", line 90, in __delitem__
del self.__data[key]
KeyError: 'access_token_<...snip...>'

The code attempts to delete a key which has apparently already been deleted (presumably by another thread).

One potential solution is to simply make the caches instance members of the Client classes and shift responsibility for ensuring proper synchronization to the calling code. Since this was already my assumption this works for my use case and I've implemented it in my own fork in the bugfix/non_global_caches branch.

The obvious side-effect is that Client instances no longer share a single cache. This is irrelevant for my use case, but is a functional change I'm not sure you wish to make. A more complete solution for this may be to encapsulate all the cachetools.TTLCache objects in a single custom cache object, allow callers to instantiate their own instance, and pass it into new instances of Client objects. Callers would then be responsible for synchronizing any Client instances that share the same cache object.

Retaining global caches but synchronizing them internally is another possibility, but potentially hazardous in my opinion. This would necessitate the use of one or more internal locks. If multiple locks are used there is a very real threat that potential deadlocks could be introduced now and forever into the future even with seemingly innocuous refactoring and code changes. A single internal lock is safer and more robust in the face of future changes but still leaves the possibility of deadlock if calling code uses its own locks and calls into the library along different code paths.

dbent avatar Sep 25 '22 15:09 dbent

Hey, thanks for bringing this up, the cache has to be fixed and enhanced - wanted to start working on this since some time.

saleweaver avatar Sep 25 '22 21:09 saleweaver