UNet3D MLPerf
Overview
This is the training script for the UNet3D MLPerf model.
How to run the training script
Download dataset
Run the following script from the root folder of tinygrad:
./examples/mlperf/scripts/setup_kits19_dataset.sh
Training the model
For training on a tinybox green:
time PYTHONPATH=. WANDB=1 TRAIN_BEAM=3 FUSE_CONV_BW=1 GPUS=6 BS=6 MODEL=unet3d python3 examples/mlperf/model_train.py
Bounty locked. We have two more tinyboxes being built today, lmk when you are ready for an attempt. What's your username on discord?
@geohot - thanks for the heads up! The only thing I haven’t tried is training on an AMD GPU, so I might be ready to give it a quick run on the original image size.
Also, my Discord name is flata.
Does this support multi-GPU? Have you tested that?
How sure are you this code is correct? What's the expected training time?
I have added multi-GPU support for this one.
I'll do some more testing to get an accurate training time and will let you know once I'm ready to train on a tinybox.
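For context, the usual data-parallel pattern in tinygrad looks roughly like this (a minimal sketch only; `to_multi_gpu` is a made-up helper name and the actual code in this PR may differ):

```python
# Sketch of data-parallel multi-GPU in tinygrad: replicate weights, split the batch.
from tinygrad import Tensor, Device
from tinygrad.helpers import getenv
from tinygrad.nn.state import get_parameters

GPUS = tuple(f"{Device.DEFAULT}:{i}" for i in range(getenv("GPUS", 1)))

def to_multi_gpu(model, x: Tensor, y: Tensor):
  if len(GPUS) > 1:
    for p in get_parameters(model): p.to_(GPUS)           # replicate weights on every GPU
    x, y = x.shard(GPUS, axis=0), y.shard(GPUS, axis=0)   # split the batch along axis 0
  return x, y
```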
Recent update:
I ran this training script on my M3 Max using BS=1 on the original size of (128, 128, 128) and it actually trains just fine. Synced with @chaosagent recently, and we think there could be something going on with the autograd and/or allocator that's causing the OOM issue on CUDA.
I’ll do an investigation to narrow it down further + sanity check training correctness before moving to a tinybox for an attempt.
You have tinybox access? Ask on discord if not
I recently pasted my SSH public key on Discord for tinybox access here.
Sorry, missed this. Gave you access to tiny4, ssh instructions in discord
After trying my training script on tiny4, it hits the same OOM error seen on CUDA. I re-introduced gradient accumulation to my code, and running the forward pass multiple times works fine. It's only when optim.step() is executed and the gradients are realized that it runs out of memory.
Whether multi-GPU is used or not, it still runs out of memory (as mentioned in my description before). What I'm thinking now is to implement activation checkpointing to see if I can reduce memory pressure when calling optim.step(). So this is going to be my focus for this one, and I'll report back once I have something for it.
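For reference, the gradient-accumulation pattern described above is just several forward/backward passes before a single optimizer step. A minimal sketch (loss_fn and micro_batches are illustrative names, not the PR's actual code):

```python
# Sketch of gradient accumulation in tinygrad; backward() adds into existing .grad tensors.
from tinygrad import Tensor

def accumulate_and_step(model, optim, loss_fn, micro_batches):
  Tensor.training = True  # enable training mode
  optim.zero_grad()
  for x, y in micro_batches:
    loss = loss_fn(model(x), y) / len(micro_batches)  # scale so the sum matches one big batch
    loss.backward()                                   # multiple forward/backward passes work fine
  optim.step()                                        # the OOM shows up here, when gradients realize
```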
Why is it using so much RAM? Gradient checkpointing is complex, have you looked into the root cause? Do you expect it to use less RAM?
You can try again with the new scheduler and sgd contiguous!
I tried the latest changes to the scheduler + SGD contiguous and it didn't work. I also found out that the model isn't actually using a lot of RAM. In addition, I tried @chaosagent's #3780 and no luck on that as well. So it may be related to https://github.com/tinygrad/tinygrad/issues/3572. I'll investigate further to confirm whether that's the issue here in UNet3D MLPerf training.
Just an update on this one: in the conv2d implementation, I saw that it bypasses the second expand when I run with WINO=1, and it now trains with JIT enabled and BS=1 on the full image size.
From this point, I'll be running with WINO=1. I will also test out FP16 to make it faster (thanks @chaosagent). I'll also add some time logs, similar to resnet, to keep tabs on the timings of various things during training.
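Something like this is what I have in mind for the per-step timing (a rough sketch only; the actual log format in model_train.py may differ, and step_fn is assumed to return the loss as a Python float):

```python
# Sketch of a per-step timing log, similar in spirit to the resnet training logs.
import time

def timed_step(step_fn, i):
  st = time.perf_counter()
  loss = step_fn()                       # one full training step, returns loss as a float
  dur = time.perf_counter() - st
  print(f"step {i:4d}  {dur*1000:8.2f} ms  loss {loss:.4f}  {1/dur:5.2f} it/s")
  return loss
```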
@chenyuxyz - thanks for the heads up with BEAM. I have this run going right now with BEAM=2. BEAM itself took a while, but I'm getting ~1.5 it/s right now.
I'll keep improving the iteration speed by scaling the batch size by the number of GPUs and adding a dataloader for the training set.
cool. Feel free to share on discord too. How many steps are there in total?
Not sure exactly how many steps until convergence just yet, but per epoch there are 84 iterations. It will start evaluating at around 500 epochs, and as of writing this message, it's at 130 epochs. Will provide updates on Discord.
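Rough back-of-envelope, assuming the ~1.5 it/s from the BEAM=2 run holds: 84 it/epoch ÷ 1.5 it/s ≈ 56 s/epoch, so reaching ~500 epochs is roughly 500 × 56 s ≈ 7.8 hours of pure training before the first eval, not counting BEAM compile time or evaluation itself.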
@francislata, hey, I'm a new contributor and I would love to help solve #3572. As far as I have searched, this issue is only mentioned in your PR.
Can you please help me get started with it?
Thanks in advance :D, and sorry if this isn't the right place to ask.
On a newly set up tinybox, when I run time PYTHONPATH=. WANDB=1 TRAIN_BEAM=3 FUSE_CONV_BW=1 GPUS=6 BS=6 MODEL=unet3d python3 examples/mlperf/model_train.py, it finished immediately without an error. I think it's due to a missing dataset. Can you add a check to make sure the pipeline reads the correct number of data samples?
Also, for completeness, add a small README like bert / resnet.
@chenyuxyz - Regarding the instant exit without an error when attempting to train, I'll give it a look. I'll simulate it again on a tinybox green.
About the README, did you want a similar directory path as BERT and ResNet, as if we're submitting it to MLPerf?
I also tested by following the instructions in the description and was able to get training started:
Were there any steps different from mine that I missed when you attempted to train the model, so I can reproduce the issue?
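For the dataset check, something along these lines could work (a minimal sketch only; the directory layout of one case_XXXXX folder per KiTS19 case, and the expected count, are assumptions rather than what the script necessarily uses):

```python
# Sketch of a dataset sanity check before training starts.
import glob, os

def verify_kits19(basedir: str, expected_cases: int) -> list[str]:
  # count the case_XXXXX directories produced by the KiTS19 download/preprocessing
  cases = sorted(glob.glob(os.path.join(basedir, "case_*")))
  assert len(cases) == expected_cases, \
    f"expected {expected_cases} cases under {basedir}, found {len(cases)} -- did the dataset download finish?"
  return cases
```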
Oh, I did not re-run the dataset download because there's one in raid, maybe that's why. Running from scratch with your instructions now.
started here https://wandb.ai/chenyuxyz/tinygrad_unet3d_mlperf/runs/kr61rmub
By the way, when I click on your W&B link, I get a 404. I think it could be a project-level access control setting.
finished in 22 hours! fixed wandb permission too https://wandb.ai/chenyuxyz/tinygrad_unet3d_mlperf/runs/kr61rmub?nw=nwuserchenyuxyz
can you add a small README with the steps to run this? it's easier to track in the code base than the pr description