UNet3D MLPerf
Overview
This is the training script for the UNet3D MLPerf model.
How to run the training script
Download dataset
Run the following script from the root folder of tinygrad:
./examples/mlperf/scripts/setup_kits19_dataset.sh
Training the model
For training on a tinybox green:
time PYTHONPATH=. WANDB=1 TRAIN_BEAM=3 FUSE_CONV_BW=1 GPUS=6 BS=6 MODEL=unet3d python3 examples/mlperf/model_train.py
Bounty locked. We have two more tinyboxes being built today, lmk when you are ready for an attempt. What's your username on discord?
@geohot - thanks for the heads up! The only thing I haven’t tried is training on an AMD GPU, so I might be ready to give it a quick run on the original image size.
Also, my Discord name is flata.
Does this support multi-GPU? Have you tested that?
How sure are you this code is correct? What's the expected training time?
I have added multi-GPU support for this one.
I'll do some more testing to get an accurate training time and will let you know once I'm ready to train on a tinybox.
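For context, the usual data-parallel pattern in tinygrad looks roughly like this (a minimal sketch only; `to_multi_gpu` is a made-up helper name and the actual code in this PR may differ):

```python
# Sketch of data-parallel multi-GPU in tinygrad: replicate weights, split the batch.
from tinygrad import Tensor, Device
from tinygrad.helpers import getenv
from tinygrad.nn.state import get_parameters

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(getenv("GPUS", 1)))

def to_multi_gpu(model, x: Tensor, y: Tensor):
  if len(GPUS) > 1:
    for p in get_parameters(model): p.to_(GPUS)           # replicate weights on every GPU
    x, y = x.shard(GPUS, axis=0), y.shard(GPUS, axis=0)   # split the batch along axis 0
  return x, y
```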
Recent update:
I ran this training script on my M3 Max using BS=1 on the original size of (128, 128, 128) and it actually trains just fine. Synced with @chaosagent recently, and we think there could be something going on with the autograd and/or allocator that's causing the OOM issue on CUDA.
I’ll do an investigation to narrow it down further + sanity check training correctness before moving to a tinybox for an attempt.
You have tinybox access? Ask on discord if not
I recently pasted my SSH public key on Discord for tinybox access here.
Sorry, missed this. Gave you access to tiny4, ssh instructions in discord
After trying my training script on tiny4, it hits the same OOM error seen on CUDA. I re-introduced gradient accumulation to my code, and running the forward pass multiple times works fine. It's only when optim.step() is executed and the gradients are realized that it runs out of memory.
Whether multi-GPU is used or not, it still runs out of memory (as mentioned in my description before). What I'm thinking now is to implement activation checkpointing to see if I can reduce memory pressure when calling optim.step(). So this is going to be my focus for this one, and I'll report back once I have something for it.
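For reference, the gradient-accumulation pattern described above is just several forward/backward passes before a single optimizer step. A minimal sketch (loss_fn and micro_batches are illustrative names, not the PR's actual code):

```python
# Sketch of gradient accumulation in tinygrad; backward() adds into existing .grad tensors.
from tinygrad import Tensor

def accumulate_and_step(model, optim, loss_fn, micro_batches):
  Tensor.training = True  # enable training mode
  optim.zero_grad()
  for x, y in micro_batches:
    loss = loss_fn(model(x), y) / len(micro_batches)  # scale so the sum matches one big batch
    loss.backward()                                   # multiple forward/backward passes work fine
  optim.step()                                        # the OOM shows up here, when gradients realize
```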
Why is it using so much RAM? Gradient checkpointing is complex, have you looked into the root cause? Do you expect it to use less RAM?
You can try again with the new scheduler and sgd contiguous!
I tried the latest changes to the scheduler + SGD contiguous and it didn't work. I also found out that the model isn't actually using a lot of RAM. In addition, I tried @chaosagent's #3780 and no luck on that as well. So it may be related to https://github.com/tinygrad/tinygrad/issues/3572. I'll investigate further to confirm whether that's the issue here in UNet3D MLPerf training.
Just an update on this one: in the conv2d implementation, I saw that it bypasses the second expand when I run with WINO=1, and it now trains with JIT enabled and BS=1 on the full image size.
From this point, I'll be running with WINO=1. I will also test out FP16 to make it faster (thanks @chaosagent). I'll also add some time logs, similar to resnet, to keep tabs on the timings of various things during training.
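Something like this is what I have in mind for the per-step timing (a rough sketch only; the actual log format in model_train.py may differ, and step_fn is assumed to return the loss as a Python float):

```python
# Sketch of a per-step timing log, similar in spirit to the resnet training logs.
import time

def timed_step(step_fn, i):
  st = time.perf_counter()
  loss = step_fn()                       # one full training step, returns loss as a float
  dur = time.perf_counter() - st
  print(f"step {i:4d}  {dur*1000:8.2f} ms  loss {loss:.4f}  {1/dur:5.2f} it/s")
  return loss
```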
@chenyuxyz - thanks for the heads up with BEAM. I have this run going right now with BEAM=2. BEAM itself took a while, but I'm getting ~1.5 it/s right now.
I'll keep improving the iteration speed by scaling the batch size by the number of GPUs and adding a dataloader for the training set.
cool. Feel free to share on discord too. How many steps are there in total?
Not sure exactly how many steps until convergence just yet, but per epoch there are 84 iterations. It will start evaluating at around 500 epochs, and as of writing this message, it's at 130 epochs. Will provide updates on Discord.
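Rough back-of-envelope, assuming the ~1.5 it/s from the BEAM=2 run holds: 84 it/epoch ÷ 1.5 it/s ≈ 56 s/epoch, so reaching ~500 epochs is roughly 500 × 56 s ≈ 7.8 hours of pure training before the first eval, not counting BEAM compile time or evaluation itself.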
@francislata, hey, I'm a new contributor and I would love to help solve #3572. As far as I have searched, this issue is only mentioned in your PR.
Can you please help me get started with it?
Thanks in advance :D, and sorry if this isn't the right place to ask.
On a newly set up tinybox, when I run time PYTHONPATH=. WANDB=1 TRAIN_BEAM=3 FUSE_CONV_BW=1 GPUS=6 BS=6 MODEL=unet3d python3 examples/mlperf/model_train.py, it finished immediately without an error. I think it's due to a missing dataset. Can you add a check to make sure the pipeline reads the correct number of data samples?
Also, for completeness, add a small README like bert / resnet.
@chenyuxyz - Regarding the instant exit without an error when attempting to train, I'll give it a look. I'll simulate it again on a tinybox green.
About the README, did you want a similar directory path as BERT and ResNet, as if we're submitting it to MLPerf?
I also tested by following the instructions in the description and was able to get training started:
Were there any steps different from mine that I missed when you attempted to train the model, so I can reproduce the issue?
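For the dataset check, something along these lines could work (a minimal sketch only; the directory layout of one case_XXXXX folder per KiTS19 case, and the expected count, are assumptions rather than what the script necessarily uses):

```python
# Sketch of a dataset sanity check before training starts.
import glob, os

def verify_kits19(basedir: str, expected_cases: int) -> list[str]:
  # count the case_XXXXX directories produced by the KiTS19 download/preprocessing
  cases = sorted(glob.glob(os.path.join(basedir, "case_*")))
  assert len(cases) == expected_cases, \
    f"expected {expected_cases} cases under {basedir}, found {len(cases)} -- did the dataset download finish?"
  return cases
```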
Oh, I did not re-run the dataset download because there's one in raid, maybe that's why. Running from scratch with your instructions now.
started here https://wandb.ai/chenyuxyz/tinygrad_unet3d_mlperf/runs/kr61rmub
By the way, when I click on your W&B link, I get a 404. I think it could be a project-level access control setting.
finished in 22 hours! fixed wandb permission too https://wandb.ai/chenyuxyz/tinygrad_unet3d_mlperf/runs/kr61rmub?nw=nwuserchenyuxyz
can you add a small README with the steps to run this? it's easier to track in the code base than the pr description