
UNet3D MLPerf

Open francislata opened this issue 1 year ago • 24 comments

Overview

This is the training script for the UNet3D MLPerf model.

How to run training script

Download dataset

Run the following script from the root folder of tinygrad:

./examples/mlperf/scripts/setup_kits19_dataset.sh

Training model

For training on a tinybox green:

time PYTHONPATH=. WANDB=1 TRAIN_BEAM=3 FUSE_CONV_BW=1 GPUS=6 BS=6 MODEL=unet3d python3 examples/mlperf/model_train.py

francislata avatar Feb 22 '24 03:02 francislata

This branch is currently behind tinygrad/master. The line count difference bot is disabled.

github-actions[bot] avatar Feb 22 '24 03:02 github-actions[bot]

Bounty locked. We have two more tinyboxes being built today, lmk when you are ready for an attempt. What's your username on discord?

geohot avatar Feb 22 '24 11:02 geohot

@geohot - thanks for the heads up! The only thing I haven’t tried is training on an AMD GPU, so I might be ready to give it a quick run on the original image size.

Also, my Discord name is flata.

francislata avatar Feb 22 '24 11:02 francislata

Does this support multiGPU? Have you tested that?

How sure are you this code is correct? What's the expected training time?

geohot avatar Feb 22 '24 11:02 geohot

I have added support for multiGPU for this one.
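For reference, it follows the usual tinygrad data-parallel pattern of replicating the weights and sharding the batch across devices. A minimal sketch, not the exact PR code (the device count and the Conv2d stand-in model are just illustrative, and the multi-device API may differ slightly between tinygrad versions):

```python
from tinygrad import Tensor, Device
from tinygrad.nn import Conv2d
from tinygrad.nn.state import get_parameters

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(2))  # illustrative device list

model = Conv2d(1, 4, 3)          # stand-in for the UNet3D model
x = Tensor.rand(8, 1, 16, 16)    # stand-in for a batch of images

# replicate the weights on every device, shard the batch along axis 0
for p in get_parameters(model): p.to_(GPUS)
out = model(x.shard(GPUS, axis=0))
```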

I’ll do some more testing to get an accurate training time and will let you know once I’m ready for training on a tinybox.

francislata avatar Feb 22 '24 11:02 francislata

Recent update: I ran this training script on my M3 Max using BS=1 on the original image size of (128, 128, 128) and it actually trains just fine. I synced with @chaosagent recently, and we think there could be something going on with autograd and/or the allocator that’s causing the OOM issue on CUDA.

I’ll do an investigation to narrow it down further + sanity check training correctness before moving to a tinybox for an attempt.

francislata avatar Feb 29 '24 16:02 francislata

You have tinybox access? Ask on discord if not

geohot avatar Mar 02 '24 13:03 geohot

You have tinybox access? Ask on discord if not

I recently pasted my SSH public key on Discord for tinybox access here.

francislata avatar Mar 05 '24 16:03 francislata

Sorry, missed this. Gave you access to tiny4, ssh instructions in discord

geohot avatar Mar 07 '24 00:03 geohot

After trying my training script on tiny4, I hit the same OOM error I saw on CUDA. I re-introduced gradient accumulation to my code, and running the forward pass multiple times works fine. It's only when optim.step() is executed and the gradients are realized that it runs out of memory.

Whether multi-GPU is used or not, it still runs out of memory (as mentioned in my description before). What I'm thinking now is to implement activation checkpointing to see if I can reduce memory pressure when calling optim.step(). This is going to be my focus for this one, and I'll report back once I have something.
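For context, the gradient accumulation I re-introduced just means running several small forward/backward passes before a single optim.step(): the gradients keep accumulating into .grad, and the memory-heavy realization happens once at the step. A rough sketch with placeholder model and data (a Linear stand-in and random micro-batches, not the UNet3D code):

```python
from tinygrad import Tensor
from tinygrad.nn import Linear
from tinygrad.nn.optim import SGD
from tinygrad.nn.state import get_parameters

model = Linear(16, 4)                        # stand-in for the UNet3D model
optim = SGD(get_parameters(model), lr=1e-3)
micro_batches = [(Tensor.rand(2, 16), Tensor.rand(2, 4)) for _ in range(4)]

with Tensor.train():
  optim.zero_grad()
  for x, y in micro_batches:
    # scale so the accumulated gradient matches one big batch
    loss = ((model(x) - y) ** 2).mean() / len(micro_batches)
    loss.backward()                          # gradients accumulate across micro-batches
  optim.step()                               # the expensive realization happens once, here
```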

francislata avatar Mar 11 '24 01:03 francislata

Why is it using so much RAM? Gradient checkpointing is complex, have you looked into the root cause? Do you expect it to use less RAM?

geohot avatar Mar 12 '24 16:03 geohot

You can try again with the new scheduler and sgd contiguous!

chaosagent avatar Mar 17 '24 01:03 chaosagent

I tried the latest changes to the scheduler + SGD contiguous and it didn't work. I also found out that the model isn't actually using a lot of RAM. In addition, I tried @chaosagent's #3780 and had no luck with that either. So it may be related to this issue: #3572. I'll investigate further to confirm whether fixing that issue also fixes UNet3D MLPerf training here.

francislata avatar Mar 19 '24 01:03 francislata

So it may be related to this issue: https://github.com/tinygrad/tinygrad/issues/3572

Just an update on this one: in the conv2d implementation, I saw that the second expand is bypassed when I run with WINO=1, and it now trains with JIT enabled and BS=1 on the full image size.

From this point on, I'll be running with WINO=1. I will also test out FP16 to make it faster (thanks @chaosagent). I'll also add some time logs, similar to resnet, to keep tabs on the timings of various things during training.
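For the time logs, the idea is just to wrap each training step in a wall-clock measurement, roughly like this (a sketch only; the actual resnet-style logging in model_train.py is more detailed, and step_fn/batches here are placeholders):

```python
import time
from typing import Callable, Iterable

def timed_steps(step_fn: Callable[[object], float], batches: Iterable) -> None:
  # print wall-clock time per step next to the loss
  for i, batch in enumerate(batches):
    st = time.perf_counter()
    loss = step_fn(batch)
    print(f"step {i:4d}  {(time.perf_counter() - st) * 1000:8.1f} ms  loss {loss:.4f}")
```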

francislata avatar Mar 21 '24 05:03 francislata

@chenyuxyz - thanks for the heads up with BEAM. I have this run (https://wandb.ai/flata/tinygrad_unet3d_mlperf/runs/9gj7qasz/workspace?nw=nwuserflata) going right now with BEAM=2. BEAM itself took a while, but I'm getting ~1.5 it/s right now.

I'll keep improving the iteration speed by scaling the batch size with the number of GPUs and by working on the dataloader for the training set.

francislata avatar Apr 01 '24 02:04 francislata

cool. Feel free to share on discord too. How many steps are there in total?


chenyuxyz avatar Apr 01 '24 02:04 chenyuxyz

not sure exactly how many steps until convergence just yet, but per epoch there are 84 iterations. it will start evaluating at around 500 epochs and, as of writing this message, it's at 130 epochs. will provide updates on discord.

francislata avatar Apr 01 '24 03:04 francislata

@francislata, hey, I'm a new contributor and I would love to help you folks solve #3572. As far as I have searched, this issue is only mentioned in your PR.

Can you please help me get started with this issue?

thanks in advance :D and sorry if it's not the right place to ask for this

xXWarMachineRoXx avatar Apr 06 '24 19:04 xXWarMachineRoXx

on a newly set up tinybox, when i run time PYTHONPATH=. WANDB=1 TRAIN_BEAM=3 FUSE_CONV_BW=1 GPUS=6 BS=6 MODEL=unet3d python3 examples/mlperf/model_train.py, it finishes immediately without an error. I think it's due to a missing dataset. Can you add a check to make sure the pipeline reads the correct number of samples?

also for completeness, add a small README like bert / resnet
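something like the following would be enough as a sanity check (just a sketch; the data directory path is an assumption and should be wherever setup_kits19_dataset.sh actually writes to):

```python
import os, sys

DATA_DIR = "extra/datasets/kits19/data"  # assumed location of the downloaded KiTS19 cases
cases = sorted(d for d in os.listdir(DATA_DIR) if d.startswith("case_")) if os.path.isdir(DATA_DIR) else []
if not cases:
  sys.exit(f"no KiTS19 cases found under {DATA_DIR}; re-run ./examples/mlperf/scripts/setup_kits19_dataset.sh")
print(f"found {len(cases)} KiTS19 cases")
```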

chenyuxyz avatar Aug 31 '24 14:08 chenyuxyz

@chenyuxyz - Regarding the script exiting immediately without an error when attempting to train, I'll give it a look. I'll simulate it again on a tinybox green.

About the README, did you want a directory layout similar to BERT and ResNet, as if we're submitting it to MLPerf?

francislata avatar Sep 02 '24 13:09 francislata

I also tested by following the instructions in the description and was able to get the training started: [screenshot: 2024-09-02 at 10:08:53]

Were there any steps in your attempt to train the model that differed from mine and that I missed, so I can reproduce the issue?

francislata avatar Sep 02 '24 14:09 francislata

oh i did not re-run the dataset download because there's one in raid, maybe that's why. running from scratch with your instructions now

started here https://wandb.ai/chenyuxyz/tinygrad_unet3d_mlperf/runs/kr61rmub

chenyuxyz avatar Sep 05 '24 05:09 chenyuxyz

By the way, when I click on your W&B link, I get a 404. I think it could be a project-level access control setting.

francislata avatar Sep 05 '24 19:09 francislata

finished in 22 hours! fixed wandb permission too https://wandb.ai/chenyuxyz/tinygrad_unet3d_mlperf/runs/kr61rmub?nw=nwuserchenyuxyz

can you add a small README with the steps to run this? it's easier to track in the codebase than in the PR description

chenyuxyz avatar Sep 06 '24 04:09 chenyuxyz