Running on Apple M1/M2/M3 chips

Open alisonpeard opened this issue 1 year ago • 0 comments

Hi, I'm trying to run the PyTorch training implementation on an Apple M2 chip with MPS. I can run StyleGAN-ADA image generation following these steps but when I try to train DiffAugment I get this error:

/AppleInternal/Library/BuildRoots/20d6c351-ee94-11ec-bcaf-7247572f23b4/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayConvolutionA14.mm:3237: failed assertion `destination datatype must be fp32'

My steps so far:

Clone the repo and cd data-efficient-gans/DiffAugment-stylegan2-pytorch
conda create -n DiffAug python=3.9
conda activate DiffAug
conda install pytorch torchvision torchaudio -c pytorch
pip install click requests tqdm pyspng ninja imageio-ffmpeg==0.4.3
pip install Pillow psutil scipy
Following the advice here,
- I replace all instances of torch.device('cuda') with torch.device('mps')
- I replace random array generation with random_array = np.random.RandomState(seed).randn(1, G.z_dim).astype(np.float32) in generate.py as described.
- In training_loop.py I replace instances of torch.cuda.Event(enable_timing=True) with time.perf_counter()
- I remove torch.backends.cuda.matmul.allow_tf32 = allow_tf32, torch.backends.cudnn.allow_tf32 = allow_tf32, torch.cuda.reset_peak_memory_stats(), and all_gen_c = torch.from_numpy(np.stack(all_gen_c)).pin_memory().to(device) as they have no MPS equivalent.

At this point I can generate images from the pretrained model, e.g.,

python generate.py --outdir=out --trunc=1 --seeds=85,265,297,849 \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metfaces.pkl

but training aborts with the message below:

python train.py --outdir=../training-runs --data=../datasets/100-shot-obama.zip --gpus=1 --kimg 1
# /AppleInternal/Library/BuildRoots/20d6c351-ee94-11ec-bcaf-#7247572f23b4/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSNDArray/Kernels/MPSNDArrayConvolutionA14.mm:3237: failed assertion `destination datatype must be fp32'

Using pdb I can trace the error from .../DiffAugment-stylegan2-pytorch/training/loss.py(80)accumulate_gradients() -> loss_Gmain.mean().mul(gain).backward():80 totorch/autograd/graph.py(769)_engine_run_backward() -> return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass but I'm not really able to figure out what's going on. I have checked all tensors in the training loop are float-32 type.

Any suggestions would be appreciated! I don't have access to NVIDIA GPUs at the moment and the Colab also seems to be outdated.

Oct 01 '24 19:10 alisonpeard