Move pytorch shuffling out of main thread
This can more than double the performance of a simple pytorch micro-benchmark.
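For context, a minimal sketch of the general idea (hypothetical names, not the code in this diff): shuffling happens on a background thread that feeds a bounded queue, so the main/training thread only dequeues samples that are already shuffled instead of blocking on the shuffle itself.

```python
import random
from queue import Queue
from threading import Thread

def shuffled_stream(samples, buffer_size=1024, queue_size=8):
    """Yield `samples` in shuffled order, with the shuffle work done on a
    background thread so the consuming (training) thread never blocks on it."""
    q = Queue(queue_size)
    sentinel = object()

    def worker():
        buf = []
        for item in samples:
            buf.append(item)
            if len(buf) >= buffer_size:
                random.shuffle(buf)
                for x in buf:
                    q.put(x)   # blocks if the consumer falls behind (backpressure)
                buf.clear()
        random.shuffle(buf)    # flush the last partial buffer
        for x in buf:
            q.put(x)
        q.put(sentinel)

    Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item
```

Usage would be e.g. `for sample in shuffled_stream(dataset_iterable): ...`.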
🚀 🚀 Pull Request
Checklist:
- [ ] My code follows the style guidelines of this project and the Contributing document
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have kept the coverage rate up
- [ ] I have performed a self-review of my own code and resolved any problems
- [ ] I have checked to ensure there aren't any other open Pull Requests for the same change
- [ ] I have described and made corresponding changes to the relevant documentation
- [ ] New and existing unit tests pass locally with my changes
Changes
Hi @hakanardo! Thanks a lot for your contribution. Do you mind signing the CLA while we review the PR? Also, if you haven't already, please consider joining our community on Slack.
Hi @hakanardo, thanks for the contribution. Can you provide the code and results for the benchmarks mentioned in the PR description?
Sure, they'll have to be extracted from a bigger experimental tangle, but I'll get back to you with a stand-alone micro-benchmark...
Hi, here is the benchmark, with the results included as comments at the bottom.
@hakanardo thank you very much for these, these would be extremely helpful for us internally to improve Hub. I'll let @farizrahman4u and @levongh provide any additional updates once they have them. Have a nice start of the week in the meantime!
Hey @hakanardo ! Thanks for your PR. I was looking into the benchmark file you sent but was unable to recreate your results. The speed seems to be almost the same on main vs on your branch. Any specifics that you can share about your machine that might help us recreate this would be helpful.
On another note, we have recently introduced a new experimental dataloader implemented in C++ to Hub (it only works on Linux for now). In the near future we'll be deprecating the existing dataloader. If you're interested, you can take a look at this Getting started with Deep Lake notebook and the associated docs.
I use a "Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz" system with an "GeForce RTX 2080 Ti" GPU. Thanx for the pointer, I will have a look!
I've benchmarked the deeplake loader using the attached benchmark. It performs slightly better than my branch with shuffle=True, but significantly worse than the python loader with shuffle=False. Am I doing something wrong? The exact numbers are at the bottom of the script.
Also, the distribution of the indexes in the produced batches looks a lot more uniform. That is very nice!
Hey @hakanardo, can you please share the hub and deeplake versions?
Thanks @hakanardo! You might want to use num_workers=0 for deeplake (we're spinning up threads for fetching and decompression; num_workers spins up processes separately for transformation and collate), as in the current implementation there's an interprocess communication overhead that we'll be working on fixing shortly.
Thanks for the tip, num_workers=0 improves things, but I still only get about 80% GPU utilization in my training runs as opposed to about 98% with the shuffle_thread branch. I also tried to move the entire data loading to a separate process by wrapping it in another dataloader, similar to how the old loader works:
```python
from torch.utils.data import DataLoader, IterableDataset

class IterableDatasetWrapper(IterableDataset):
    def __init__(self, dl):
        self.dl = dl

    def __iter__(self):
        for b in self.dl:
            yield b

# run the wrapped loader in a single worker process
dataloader = DataLoader(IterableDatasetWrapper(dataloader), num_workers=1, collate_fn=lambda a: a[0])
```
But that seems to deadlock on the api.dataset call on line 60 of convert_to_hub3.py.
See also https://github.com/activeloopai/Hub/issues/1888
I think using `multiprocessing.set_start_method("spawn", force=True)` at the top should fix the issue of it getting stuck.
There's an issue with forking the Hub3Dataset object that we're aware of.
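For reference, a minimal sketch of where that call would go (the main guard and the ordering are the important parts; the rest is illustrative):

```python
import multiprocessing

if __name__ == "__main__":
    # must run before any DataLoader worker processes are created,
    # so the Hub3Dataset object is never forked
    multiprocessing.set_start_method("spawn", force=True)
    # ... build the deeplake dataset / dataloader and start training here ...
```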
That worked, but it requires the dataloader to be picklable, which is a bit annoying. However, it seems to be enough to wrap it in a separate thread instead:
```python
from queue import Queue
from threading import Thread

class DONE:
    pass

class ThreadedIterator(Thread):
    def __init__(self, iterable):
        super().__init__(daemon=True)
        self.iterable = iterable
        self.queue = Queue(2)   # small prefetch queue
        self.start()

    def run(self) -> None:
        while True:
            for val in self.iterable:
                val['images'] = val['images'].pin_memory()
                self.queue.put(val)
            self.queue.put(DONE)   # marks the end of one pass over the iterable

    def __iter__(self):
        while True:
            val = self.queue.get()
            if val is DONE:
                break
            yield val
```
This allows me to train at full GPU utilization again. Would you consider including something like this as an option in the new deeplake dataloader?
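For context, a hypothetical usage sketch of the wrapper above (the random tensors just stand in for real batches from a dataloader):

```python
import torch

# any iterable of batches with an 'images' tensor works; here we fake a few
batches = [{"images": torch.randn(4, 3, 224, 224)} for _ in range(8)]

loader = ThreadedIterator(batches)
for batch in loader:              # one pass, terminated by the DONE sentinel
    images = batch["images"]       # already pinned by the background thread
    # images = images.cuda(non_blocking=True)  # overlap the H2D copy with compute
    # ... forward/backward pass would go here ...
```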
Hey @hakanardo! Thanks for the suggestion, I tried it but couldn't really get a performance boost while iterating. Do you think that the better performance here is due to pinning memory in the separate thread that you're running?
I think the main effect comes from doing the data loading on the CPU in parallel with the training on the GPU instead of in series with it. That way the data-loading time is hidden entirely, since the data is already available when the GPU needs it. Are you benchmarking with a dataset big enough that it doesn't fit in the cache, to make sure there is a real I/O cost to the data loading?
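To make the overlap argument concrete (illustrative numbers only, not measurements from either branch):

```python
# hypothetical per-batch timings in milliseconds
load_ms, gpu_ms = 40, 50
serial = load_ms + gpu_ms         # loader blocks the training loop: 90 ms/batch
pipelined = max(load_ms, gpu_ms)  # loading hidden behind GPU work:  50 ms/batch
print(serial, pipelined)          # -> 90 50, i.e. the GPU stays busy ~100% of the time
```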
Got it. Yup, I was using imagenet. The thing is that the experimental dataloader is already spinning up threads in C++ to prefetch and decompress data in parallel with training on the GPU. The collation and transform are however done serially, so perhaps that could be the difference in our results. Were you using some transform or collate? Is it the same script as the one you sent before?
Closing this for now as we didn't see a lot of performance changes.