Make it possible for multiple processes to download data to the same CloudCache
Multiple users for a single cache: Doug attempted to parallelize the download of the entire cache and hit collisions on the JSON file that tracks which files have been downloaded.
- We could implement a file lock on that file, which would allow for this. I also see these collisions even when downloads are not taking place. (A sketch of this idea follows the list.)
- We could make the StaticLocalCache work not just with an S3 bucket but also with a local directory on disk. This currently fails because the local caches do not have a manifests folder, while the S3 one does.
- The use case here is a single maintainer and multiple read-only (no-download) users. We'd perhaps want to lock such users out of instantiating an S3 cache.
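A minimal sketch of the file-lock idea from the first bullet, using the third-party `filelock` package; the tracking-file name and the helper below are hypothetical, not CloudCache's actual internals:

```python
import json
from pathlib import Path

from filelock import FileLock  # pip install filelock

def record_download(cache_dir: str, file_id: str, local_path: str) -> None:
    """Update the download-tracking JSON under an exclusive inter-process lock."""
    record_path = Path(cache_dir) / "downloaded_files.json"  # hypothetical name
    with FileLock(str(record_path) + ".lock"):  # blocks until the lock is free
        record = json.loads(record_path.read_text()) if record_path.exists() else {}
        record[file_id] = local_path
        record_path.write_text(json.dumps(record, indent=2))
```

Readers would take the same lock before loading the JSON, which would also cover the collisions seen when no downloads are running.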
This issue is really killing me with analysis right now... I can't parallelize anything because I can only read each file one at a time, which makes things extremely slow :(
Which points out that this isn't just an issue of multiple users for a single cache; a single user with a single cache can hit the same issues.
I should also point out that we have four other scientists who need to use this same cache for the platform-paper analysis. I anticipate that this particular issue is going to become a fire that needs to be put out very soon.
@wbwakeman just want to flag this for you as a priority (or at least warn you that it's probably about to become one).
I had originally told Marina that if she downloaded all of the data first, parallel access to the cache should not be a problem. Only parallel downloading of data was a problem. I now see a flaw in our implementation that makes that not true.
I can do a quick bugfix that will make it possible to download all the data with one process and then access the data in parallel from multiple processes (downloading data in parallel will be a heavier lift).
I'll let you know when that bugfix is ready.
Sorry about the confusion.
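For concreteness, the workflow that fix targets would be a single serial download pass like the sketch below. The class and accessor names come from the VisualBehaviorNeuropixelsProjectCache discussed later in this thread, and the cache directory is made up:

```python
# Phase 1: one process downloads the entire cache serially.
# (Parallel downloading is not supported; parallel *access* afterwards is
# what the bugfix enables.)
from allensdk.brain_observatory.behavior.behavior_project_cache import (
    VisualBehaviorNeuropixelsProjectCache,
)

cache = VisualBehaviorNeuropixelsProjectCache.from_s3_cache(
    cache_dir="/data/vbn_cache"  # hypothetical local cache directory
)

for session_id in cache.get_ecephys_session_table().index:
    # get_ecephys_session() downloads the session's NWB file if it is not
    # already present in cache_dir, then returns a session object.
    cache.get_ecephys_session(ecephys_session_id=session_id)
```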
I’m actually more concerned about parallel access than parallel download at this point. Not having the ability to download in parallel is not ideal, but it's really not a huge problem since you only have to do it once. Parallel access is more of an obstruction because it's a continual, ongoing need. If you have to prioritize, I would go for that first.
Thanks for investigating and being so responsive about this stuff!
How can I parallelize code that requires a cache (`cache = VisualBehaviorNeuropixelsProjectCache.from_s3_cache(cache_dir=neuropixel_dataset_behavior)`)? Is this possible? joblib raises a PicklingError when dealing with the cache object:
```
---------------------------------------------------------------------------
_RemoteTraceback Traceback (most recent call last)
_RemoteTraceback:
"""
Traceback (most recent call last):
File "/alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/site-packages/joblib/externals/loky/backend/queues.py", line 125, in _feed
obj_ = dumps(obj, reducers=reducers)
File "/alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/site-packages/joblib/externals/loky/backend/reduction.py", line 211, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "/alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/site-packages/joblib/externals/loky/backend/reduction.py", line 204, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "/alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/site-packages/joblib/externals/cloudpickle/cloudpickle_fast.py", line 632, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.lock' object
"""
The above exception was the direct cause of the following exception:
PicklingError Traceback (most recent call last)
Cell In[395], line 17
14 num_cores = joblib.cpu_count()
16 # Use Parallel and delayed to run the function in parallel.
---> 17 results = Parallel(n_jobs=num_cores)(
18 delayed(process_session)(id) for id in tqdm(ids)
19 )
21 # `results` will be a list of DataFrames.
22 out = [result for result in results if not result.empty]
File /alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/site-packages/joblib/parallel.py:1098, in Parallel.__call__(self, iterable)
1095 self._iterating = False
1097 with self._backend.retrieval_context():
-> 1098 self.retrieve()
1099 # Make sure that we get a last message telling us we are done
1100 elapsed_time = time.time() - self._start_time
File /alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/site-packages/joblib/parallel.py:975, in Parallel.retrieve(self)
973 try:
974 if getattr(self._backend, 'supports_timeout', False):
--> 975 self._output.extend(job.get(timeout=self.timeout))
976 else:
977 self._output.extend(job.get())
File /alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/site-packages/joblib/_parallel_backends.py:567, in LokyBackend.wrap_future_result(future, timeout)
564 """Wrapper for Future.result to implement the same behaviour as
565 AsyncResults.get from multiprocessing."""
566 try:
--> 567 return future.result(timeout=timeout)
568 except CfTimeoutError as e:
569 raise TimeoutError from e
File /alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/concurrent/futures/_base.py:446, in Future.result(self, timeout)
444 raise CancelledError()
445 elif self._state == FINISHED:
--> 446 return self.__get_result()
447 else:
448 raise TimeoutError()
File /alzheimer/Roberto/Software/mambaforge/envs/De-Filippo-et-al-2023/lib/python3.9/concurrent/futures/_base.py:391, in Future.__get_result(self)
389 if self._exception:
390 try:
--> 391 raise self._exception
392 finally:
393 # Break a reference cycle with the exception in self._exception
394 self = None
PicklingError: Could not pickle the task to send it to the workers.
```
@RobertoDF
I don't have any experience using joblib. The conversation in this PR
https://github.com/AllenInstitute/AllenSDK/pull/2237
contains an example script that uses multiprocessing to access data in parallel from a cache. Note: you will need to download the entire cache before running your parallelized job. As discussed earlier in this ticket, parallelized downloading of data from S3 is not supported.
Hope this helps.
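One pattern that usually sidesteps the PicklingError itself is to construct the cache inside each worker rather than capturing it in the parallelized function, so that only picklable arguments cross the process boundary. A sketch, assuming the full cache is already on disk and using a made-up cache directory:

```python
# Sketch: build the cache inside each worker so no unpicklable object
# (e.g. the cache's internal _thread.lock) is sent between processes.
import joblib
from allensdk.brain_observatory.behavior.behavior_project_cache import (
    VisualBehaviorNeuropixelsProjectCache,
)

CACHE_DIR = "/data/vbn_cache"  # hypothetical; must already hold all the data

def process_session(session_id):
    # Worker-local cache: only `session_id` (picklable) crosses the boundary.
    cache = VisualBehaviorNeuropixelsProjectCache.from_s3_cache(cache_dir=CACHE_DIR)
    session = cache.get_ecephys_session(ecephys_session_id=session_id)
    return session.metadata  # placeholder for the real per-session analysis

if __name__ == "__main__":
    ids = (VisualBehaviorNeuropixelsProjectCache
           .from_s3_cache(cache_dir=CACHE_DIR)
           .get_ecephys_session_table().index.tolist())
    results = joblib.Parallel(n_jobs=joblib.cpu_count())(
        joblib.delayed(process_session)(i) for i in ids
    )
```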