Make it possible for multiple processes to download data to the same CloudCache
Multiple users for a single cache: Doug attempted to parallelize the download of the entire cache and hit collisions on the JSON file that tracks which files have been downloaded.
- We could implement a file lock on that file, which would allow for this. I also see these collisions even when downloads are not taking place. (A sketch of this idea follows the list.)
- We could make the StaticLocalCache work not just with an S3 bucket but also with a local directory on disk. This currently fails because the local caches do not have a manifests folder, while the S3 one does.
- The use case here is a single maintainer and multiple read-only (no-download) users. We'd perhaps want to lock such users out of instantiating an S3 cache.
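A minimal sketch of the file-lock idea from the first bullet, using the third-party `filelock` package; the tracking-file name and the helper below are hypothetical, not CloudCache's actual internals:

```python
import json
from pathlib import Path

from filelock import FileLock  # pip install filelock

def record_download(cache_dir: str, file_id: str, local_path: str) -> None:
    """Update the download-tracking JSON under an exclusive inter-process lock."""
    record_path = Path(cache_dir) / "downloaded_files.json"  # hypothetical name
    with FileLock(str(record_path) + ".lock"):  # blocks until the lock is free
        record = json.loads(record_path.read_text()) if record_path.exists() else {}
        record[file_id] = local_path
        record_path.write_text(json.dumps(record, indent=2))
```

Readers would take the same lock before loading the JSON, which would also cover the collisions seen when no downloads are running.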
This issue is really killing me with analysis right now... I can't parallelize anything because I can only read each file one at a time, which makes things extremely slow :(
Which points out that this isn't just an issue of multiple users for a single cache; a single user with a single cache can hit the same issues.
I should also point out that we have four other scientists who need to use this same cache for the platform-paper analysis. I anticipate that this particular issue is going to become a fire that needs to be put out very soon.
@wbwakeman just want to flag this for you as a priority (or at least warn you that it's probably about to become one).
I had originally told Marina that if she downloaded all of the data first, parallel access to the cache should not be a problem. Only parallel downloading of data was a problem. I now see a flaw in our implementation that makes that not true.
I can do a quick bugfix that will make it possible to download all the data with one process and then access the data in parallel from multiple processes (downloading data in parallel will be a heavier lift).
I'll let you know when that bugfix is ready.
Sorry about the confusion.
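For concreteness, the workflow that fix targets would be a single serial download pass like the sketch below. The class and accessor names come from the VisualBehaviorNeuropixelsProjectCache discussed later in this thread, and the cache directory is made up:

```python
# Phase 1: one process downloads the entire cache serially.
# (Parallel downloading is not supported; parallel *access* afterwards is
# what the bugfix enables.)
from allensdk.brain_observatory.behavior.behavior_project_cache import (
    VisualBehaviorNeuropixelsProjectCache,
)

cache = VisualBehaviorNeuropixelsProjectCache.from_s3_cache(
    cache_dir="/data/vbn_cache"  # hypothetical local cache directory
)

for session_id in cache.get_ecephys_session_table().index:
    # get_ecephys_session() downloads the session's NWB file if it is not
    # already present in cache_dir, then returns a session object.
    cache.get_ecephys_session(ecephys_session_id=session_id)
```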
I’m actually more concerned about parallel access than parallel download at this point. Not having the ability to download in parallel is not ideal, but it's really not a huge problem since you only have to do it once. Parallel access is more of an obstruction because it's a continual, ongoing need. If you have to prioritize, I would go for that first.
Thanks for investigating and being so responsive about this stuff!
How can I parallelize code that requires a cache (`cache = VisualBehaviorNeuropixelsProjectCache.from_s3_cache(cache_dir=neuropixel_dataset_behavior)`)? Is this possible? joblib raises a PicklingError when dealing with the cache object:
```
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/site-packages/joblib/externals/loky/backend/queues.py", line 125, in _feed
obj_ = dumps(obj, reducers=reducers)
File "/alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/site-packages/joblib/externals/loky/backend/reduction.py", line 211, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "/alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/site-packages/joblib/externals/loky/backend/reduction.py", line 204, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "/alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 632, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.lock' object
"""
The above exception was the direct cause of the following exception:
PicklingError Traceback (most recent call last)
Cell In[395], line 17
14 num_cores = joblib.cpu_count()
16 # Use Parallel and delayed to run the function in parallel.
---> 17 results = Parallel(n_jobs=num_cores)(
18 delayed(process_session)(id) for id in tqdm(ids)
19 )
21 # `results` will be a list of DataFrames.
22 out = [result for result in results if not result.empty]
File /alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/site-packages/joblib/parallel.py:1098, in Parallel.__call__(self, iterable)
1095 self._iterating = False
1097 with self._backend.retrieval_context():
-> 1098 self.retrieve()
1099 # Make sure that we get a last message telling us we are done
1100 elapsed_time = time.time() - self._start_time
File /alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/site-packages/joblib/parallel.py:975, in Parallel.retrieve(self)
973 try:
974 if getattr(self._backend, 'supports_timeout', False):
--> 975 self._output.extend(job.get(timeout=self.timeout))
976 else:
977 self._output.extend(job.get())
File /alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/site-packages/joblib/_parallel_backends.py:567, in LokyBackend.wrap_future_result(future, timeout)
564 """Wrapper for Future.result to implement the same behaviour as
565 AsyncResults.get from multiprocessing."""
566 try:
--> 567 return future.result(timeout=timeout)
568 except CfTimeoutError as e:
569 raise TimeoutError from e
File /alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/concurrent/futures/_base.py:446, in Future.result(self, timeout)
444 raise CancelledError()
445 elif self._state == FINISHED:
--> 446 return self.__get_result()
447 else:
448 raise TimeoutError()
File /alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/concurrent/futures/_base.py:391, in Future.__get_result(self)
389 if self._exception:
390 try:
--> 391 raise self._exception
392 finally:
393 # Break a reference cycle with the exception in self._exception
394 self = None
PicklingError: Could not pickle the task to send it to the workers.
```
@RobertoDF
I don't have any experience using joblib. The conversation in this PR
https://github.com/AllenInstitute/AllenSDK/pull/2237
contains an example script that uses multiprocessing to access data in parallel from a cache. Note: you will need to download the entire cache before running your parallelized job. As discussed earlier in this ticket, parallelized downloading of data from S3 is not supported.
Hope this helps.
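One pattern that usually sidesteps the PicklingError itself is to construct the cache inside each worker rather than capturing it in the parallelized function, so that only picklable arguments cross the process boundary. A sketch, assuming the full cache is already on disk and using a made-up cache directory:

```python
# Sketch: build the cache inside each worker so no unpicklable object
# (e.g. the cache's internal _thread.lock) is sent between processes.
import joblib
from allensdk.brain_observatory.behavior.behavior_project_cache import (
    VisualBehaviorNeuropixelsProjectCache,
)

CACHE_DIR = "/data/vbn_cache"  # hypothetical; must already hold all the data

def process_session(session_id):
    # Worker-local cache: only `session_id` (picklable) crosses the boundary.
    cache = VisualBehaviorNeuropixelsProjectCache.from_s3_cache(cache_dir=CACHE_DIR)
    session = cache.get_ecephys_session(ecephys_session_id=session_id)
    return session.metadata  # placeholder for the real per-session analysis

if __name__ == "__main__":
    ids = (VisualBehaviorNeuropixelsProjectCache
           .from_s3_cache(cache_dir=CACHE_DIR)
           .get_ecephys_session_table().index.tolist())
    results = joblib.Parallel(n_jobs=joblib.cpu_count())(
        joblib.delayed(process_session)(i) for i in ids
    )
```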