Move pytorch shuffling out of main thread
This can more than double the performance of a simple pytorch micro-benchmark.
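For context, a minimal sketch of the general idea (hypothetical names, not the code in this diff): shuffling happens on a background thread that feeds a bounded queue, so the main/training thread only dequeues samples that are already shuffled instead of blocking on the shuffle itself.

```python
import random
from queue import Queue
from threading import Thread

def shuffled_stream(samples, buffer_size=1024, queue_size=8):
    """Yield `samples` in shuffled order, with the shuffle work done on a
    background thread so the consuming (training) thread never blocks on it."""
    q = Queue(queue_size)
    sentinel = object()

    def worker():
        buf = []
        for item in samples:
            buf.append(item)
            if len(buf) >= buffer_size:
                random.shuffle(buf)
                for x in buf:
                    q.put(x)   # blocks if the consumer falls behind (backpressure)
                buf.clear()
        random.shuffle(buf)    # flush the last partial buffer
        for x in buf:
            q.put(x)
        q.put(sentinel)

    Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item
```

Usage would be e.g. `for sample in shuffled_stream(dataset_iterable): ...`.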
🚀 🚀 Pull Request
Checklist:
- [ ] My code follows the style guidelines of this project and the Contributing document
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have kept the coverage rate up
- [ ] I have performed a self-review of my own code and resolved any problems
- [ ] I have checked to ensure there aren't any other open Pull Requests for the same change
- [ ] I have described and made corresponding changes to the relevant documentation
- [ ] New and existing unit tests pass locally with my changes
Changes
Hi @hakanardo! Thanks a lot for your contribution. Do you mind signing the CLA while we review the PR? Also, if you haven't already, please consider joining our community on Slack.
Hi @hakanardo, thanks for the contribution. Can you provide the code and results for the benchmarks mentioned in the PR description?
Sure, they'll have to be extracted from a bigger experimental tangle, but I'll get back to you with a stand-alone micro-benchmark...
Hi, here is the benchmark, with the results included as comments at the bottom.
@hakanardo thank you very much for these, these would be extremely helpful for us internally to improve Hub. I'll let @farizrahman4u and @levongh provide any additional updates once they have them. Have a nice start of the week in the meantime!
Hey @hakanardo ! Thanks for your PR. I was looking into the benchmark file you sent but was unable to recreate your results. The speed seems to be almost the same on main vs on your branch. Any specifics that you can share about your machine that might help us recreate this would be helpful.
On another note, we have recently introduced a new experimental dataloader implemented in C++ to Hub (it only works on Linux for now). In the near future we'll be deprecating the existing dataloader. If you're interested, you can take a look at this Getting started with Deep Lake notebook and the associated docs.
I use a "Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz" system with an "GeForce RTX 2080 Ti" GPU. Thanx for the pointer, I will have a look!
I've benchmarked the deeplake loader using the attached benchmark. It performs slightly better than my branch with shuffle=True, but significantly worse than the python loader with shuffle=False. Am I doing something wrong? The exact numbers are at the bottom of the script.
Also, the distribution of the indexes in the produced batches looks a lot more uniform. That is very nice!
Hey @hakanardo, can you please share the hub and deeplake versions?
Thanks @hakanardo! You might want to use num_workers=0 for deeplake (we're spinning up threads for fetching and decompression; num_workers spins up processes separately for transformation and collate), as in the current implementation there's an interprocess communication overhead that we'll be working on fixing shortly.
Thanks for the tip, num_workers=0 improves things, but I still only get about 80% GPU utilization in my training runs as opposed to about 98% with the shuffle_thread branch. I also tried to move the entire data loading to a separate process by wrapping it in another dataloader, similar to how the old loader works:
```python
from torch.utils.data import DataLoader, IterableDataset

class IterableDatasetWrapper(IterableDataset):
    def __init__(self, dl):
        self.dl = dl

    def __iter__(self):
        for b in self.dl:
            yield b

# run the wrapped loader in a single worker process
dataloader = DataLoader(IterableDatasetWrapper(dataloader), num_workers=1, collate_fn=lambda a: a[0])
```
But that seems to deadlock on the api.dataset call on line 60 of convert_to_hub3.py.
See also https://github.com/activeloopai/Hub/issues/1888
I think using `multiprocessing.set_start_method("spawn", force=True)` at the top should fix the issue of it getting stuck.
There's an issue with forking the Hub3Dataset object that we're aware of.
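For reference, a minimal sketch of where that call would go (the main guard and the ordering are the important parts; the rest is illustrative):

```python
import multiprocessing

if __name__ == "__main__":
    # must run before any DataLoader worker processes are created,
    # so the Hub3Dataset object is never forked
    multiprocessing.set_start_method("spawn", force=True)
    # ... build the deeplake dataset / dataloader and start training here ...
```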
That worked, but it requires the dataloader to be picklable, which is a bit annoying. However, it seems to be enough to wrap it in a separate thread instead:
```python
from queue import Queue
from threading import Thread

class DONE:
    pass

class ThreadedIterator(Thread):
    def __init__(self, iterable):
        super().__init__(daemon=True)
        self.iterable = iterable
        self.queue = Queue(2)   # small prefetch queue
        self.start()

    def run(self) -> None:
        while True:
            for val in self.iterable:
                val['images'] = val['images'].pin_memory()
                self.queue.put(val)
            self.queue.put(DONE)   # marks the end of one pass over the iterable

    def __iter__(self):
        while True:
            val = self.queue.get()
            if val is DONE:
                break
            yield val
```
This allows me to train at full GPU utilization again. Would you consider including something like this as an option in the new deeplake dataloader?
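For context, a hypothetical usage sketch of the wrapper above (the random tensors just stand in for real batches from a dataloader):

```python
import torch

# any iterable of batches with an 'images' tensor works; here we fake a few
batches = [{"images": torch.randn(4, 3, 224, 224)} for _ in range(8)]

loader = ThreadedIterator(batches)
for batch in loader:              # one pass, terminated by the DONE sentinel
    images = batch["images"]       # already pinned by the background thread
    # images = images.cuda(non_blocking=True)  # overlap the H2D copy with compute
    # ... forward/backward pass would go here ...
```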
Hey @hakanardo! Thanks for the suggestion, I tried it but couldn't really get a performance boost while iterating. Do you think that the better performance here is due to pinning memory in the separate thread that you're running?
I think the main effect comes from doing the data loading on the CPU in parallel with the training on the GPU instead of in series with it. That way the data-loading time is hidden entirely, since the data is already available when the GPU needs it. Are you benchmarking with a dataset big enough that it doesn't fit in the cache, to make sure there is a real I/O cost to the data loading?
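To make the overlap argument concrete (illustrative numbers only, not measurements from either branch):

```python
# hypothetical per-batch timings in milliseconds
load_ms, gpu_ms = 40, 50
serial = load_ms + gpu_ms         # loader blocks the training loop: 90 ms/batch
pipelined = max(load_ms, gpu_ms)  # loading hidden behind GPU work:  50 ms/batch
print(serial, pipelined)          # -> 90 50, i.e. the GPU stays busy ~100% of the time
```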
Got it. Yup, I was using imagenet. The thing is that the experimental dataloader is already spinning up threads in C++ to prefetch and decompress data in parallel with training on the GPU. The collation and transform are however done serially, so perhaps that could be the difference in our results. Were you using some transform or collate? Is it the same script as the one you sent before?
Closing this for now as we didn't see a lot of performance changes.