Build sdist and wheel in CI
Summary
People often run into issues building from source, so it would help if there were pre-built wheels with sensible defaults. This PR adds a GitHub Actions workflow that builds wheels for various Python and CUDA versions and uploads them to the workflow artifacts on each push.
To get a wheel, go to the workflow run page and download the matching artifact, following https://docs.github.com/en/actions/managing-workflow-runs-and-deployments/managing-workflow-runs/downloading-workflow-artifacts. For example:
gh -R calebho/apex run download 12365956032 -n dist-py3.10-cu12.1.1
Still clunky, but much faster than building from source.
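Once downloaded (gh extracts the artifact's files into the destination directory), the wheel can be installed directly. A minimal sketch; the -D destination and the glob are assumptions about local layout, and the exact wheel filename varies by version and platform:

gh -R calebho/apex run download 12365956032 -n dist-py3.10-cu12.1.1 -D dist  # extract into ./dist
pip install dist/*.whl  # install whatever wheel the artifact contains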
The versions were chosen based on what PyTorch stable (2.5 at the time of writing) currently supports. Specifically:
- Python 3.9–3.12 (I believe 3.13 is still experimental at this point): https://pypi.org/project/torch/2.5.1/#files
- CUDA 11.8, 12.1, and 12.4 (based on the version matrix at https://pytorch.org/); a quick way to find the matching artifact is shown below
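To figure out which artifact matches your environment, check the interpreter version and the CUDA build of the installed torch. A quick sketch, assuming torch is already installed and the artifact naming used above:

python --version
python -c "import torch; print(torch.__version__, torch.version.cuda)"
# e.g. Python 3.10.x and "2.5.1+cu121 12.1" point at the dist-py3.10-cu12.1.1 artifact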
Notes
- The build containers run Ubuntu 20.04, which ships glibc 2.31, so the runtime environment needs at least that version of glibc
- PyTorch 2.5 is hardcoded, and I don't think there are ABI guarantees across versions, so the runtime environment will probably also need PyTorch 2.5 (see the check sketched below)
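A quick way to sanity-check both constraints on the target machine; a rough sketch, not an exhaustive compatibility test:

ldd --version | head -n 1  # glibc must be >= 2.31 (what Ubuntu 20.04 ships)
python -c "import torch; print(torch.__version__)"  # should print 2.5.x to match the wheels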
Good follow-ups
- Do periodic GitHub releases and attach the wheels to the release; this way people can run pip install https://github.com/NVIDIA/apex/releases/... on the appropriate wheel (see the sketch after this list)
- #209: this PR doesn't publish an sdist or wheels to PyPI; someone from NVIDIA ought to own that process
- Add PyTorch versions to the build matrix
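If releases with attached wheels existed, installation could look roughly like the following; the <tag> and the wheel filename are hypothetical placeholders, not real assets:

# install straight from a (hypothetical) release asset URL
pip install https://github.com/NVIDIA/apex/releases/download/<tag>/apex-0.1-cp310-cp310-linux_x86_64.whl
# or download every wheel attached to a release via the gh CLI
gh -R NVIDIA/apex release download <tag> -p '*.whl'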
Test plan
Builds are green: https://github.com/calebho/apex/actions/runs/12365956032/job/34511835345
Ran the L0 tests on Python 3.10 + CUDA 12.1 on 2x A100 40GB:
❯ pip list
Package Version
------------------------ -----------
apex 0.1
cxxfilt 0.3.0
exceptiongroup 1.2.2
expecttest 0.3.0
filelock 3.13.1
fsspec 2024.2.0
iniconfig 2.0.0
Jinja2 3.1.3
MarkupSafe 2.1.5
mpmath 1.3.0
networkx 3.2.1
numpy 2.2.0
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.1.105
nvidia-nvtx-cu12 12.1.105
packaging 24.2
pip 24.3.1
pluggy 1.5.0
pytest 8.3.4
PyYAML 6.0.2
setuptools 75.6.0
sympy 1.13.1
tomli 2.2.1
torch 2.5.1+cu121
tqdm 4.67.1
triton 3.1.0
typing_extensions 4.9.0
❯ nvidia-smi
Tue Dec 17 05:00:05 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-40GB On | 00000000:07:00.0 Off | 0 |
| N/A 30C P0 55W / 400W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-40GB On | 00000000:0F:00.0 Off | 0 |
| N/A 30C P0 52W / 400W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
❯ python tests/L0/run_test.py
testGradScaler (test_adam.AdamTest) ... ok
testGradScalerCapturable (test_adam.AdamTest) ... /scratch/slurm_tmpdir/38660/env/lib/python3.10/site-packages/torch/amp/grad_scaler.py:415: FutureWarning: GradScaler is going to stop passing itself as a keyword argument to the passed optimizer. In the near future GradScaler registers `grad_scale: Tensor` and `found_inf: Tensor` to the passed optimizer and let the optimizer use them directly.
warnings.warn(
ok
testGradScalerCapturableMaster (test_adam.AdamTest) ... ok
testLargeTensor (test_adam.AdamTest) ... skipped 'Insufficient cuda memory'
testNative (test_adam.AdamTest) ... ok
test_float (test_fused_novograd.TestFusedNovoGrad) ... /scratch/slurm_tmpdir/38660/env/lib/python3.10/site-packages/apex/optimizers/fused_novograd.py:176: UserWarning: The torch.cuda.*DtypeTensor constructors are no longer recommended. It's best to use methods such as torch.tensor(data, dtype=*, device='cuda') to create tensors. (Triggered internally at ../torch/csrc/tensor/python_tensor.cpp:78.)
group['exp_avg_sq'][0] = torch.cuda.FloatTensor(v_16, device=self.param_groups[0]["params"][0].device)
ok
test_half (test_fused_novograd.TestFusedNovoGrad) ... ok
test_multi_device (test_fused_novograd.TestFusedNovoGrad) ... ok
test_multi_params (test_fused_novograd.TestFusedNovoGrad) ... ok
test_adagrad_option (test_fused_optimizer.TestFusedAdagrad) ... ok
test_float (test_fused_optimizer.TestFusedAdagrad) ... ok
test_half (test_fused_optimizer.TestFusedAdagrad) ... skipped 'PyTorch optimizer is not numerically correct for fp16'
test_multi_device (test_fused_optimizer.TestFusedAdagrad) ... ok
test_multi_params (test_fused_optimizer.TestFusedAdagrad) ... ok
test_multi_params_different_devices_throws (test_fused_optimizer.TestFusedAdagrad) ... ok
test_adam_option (test_fused_optimizer.TestFusedAdam) ... ok
test_bfloat16 (test_fused_optimizer.TestFusedAdam) ... ok
test_float (test_fused_optimizer.TestFusedAdam) ... ok
test_fp16_output (test_fused_optimizer.TestFusedAdam) ... skipped 'No longer support output fp16 param'
test_frozen_model (test_fused_optimizer.TestFusedAdam) ... ok
test_half (test_fused_optimizer.TestFusedAdam) ... ok
test_multi_device (test_fused_optimizer.TestFusedAdam) ... ok
test_multi_params (test_fused_optimizer.TestFusedAdam) ... skipped 'Disable until 8/1/2019 adam/adamw upstream picked'
test_scale (test_fused_optimizer.TestFusedAdam) ... skipped 'No longer support fuse scaling'
test_float (test_fused_optimizer.TestFusedSGD) ... ok
test_half (test_fused_optimizer.TestFusedSGD) ... ok
test_multi_device (test_fused_optimizer.TestFusedSGD) ... ok
test_float (test_lamb.TestFusedLAMB) ... ok
test_half (test_lamb.TestFusedLAMB) ... skipped 'PyTorch optimizer is not numerically correct for fp16'
test_lamb_option (test_lamb.TestFusedLAMB) ... ok
test_multi_device (test_lamb.TestFusedLAMB) ... ok
test_multi_params (test_lamb.TestFusedLAMB) ... ok
test_bfloat16 (test_lamb.TestFusedMixedPrecisionLamb) ... ok
test_float (test_lamb.TestFusedMixedPrecisionLamb) ... ok
test_half (test_lamb.TestFusedMixedPrecisionLamb) ... ok
test_lamb_option (test_lamb.TestFusedMixedPrecisionLamb) ... ok
test_multi_device (test_lamb.TestFusedMixedPrecisionLamb) ... ok
test_multi_params (test_lamb.TestFusedMixedPrecisionLamb) ... ok
----------------------------------------------------------------------
Ran 38 tests in 6.665s
OK (skipped=6)
test_autocast_fused_layer_norm_bfloat16_elementwise_affine_False_memory_efficient_False_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... /scratch/slurm_tmpdir/38660/env/lib/python3.10/site-packages/apex/_autocast_utils.py:26: FutureWarning: `torch.cuda.amp.autocast_mode._cast(value, dtype)` is deprecated. Please use `torch.amp.autocast_mode._cast(value, 'cuda', dtype)` instead.
return torch.cuda.amp.autocast_mode._cast(args, torch.get_autocast_gpu_dtype())
ok
test_autocast_fused_layer_norm_bfloat16_elementwise_affine_False_memory_efficient_True_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_layer_norm_bfloat16_elementwise_affine_True_memory_efficient_False_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_layer_norm_bfloat16_elementwise_affine_True_memory_efficient_True_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_layer_norm_float16_elementwise_affine_False_memory_efficient_False_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_layer_norm_float16_elementwise_affine_False_memory_efficient_True_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_layer_norm_float16_elementwise_affine_True_memory_efficient_False_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_layer_norm_float16_elementwise_affine_True_memory_efficient_True_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_rms_norm_bfloat16_elementwise_affine_False_memory_efficient_False_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_rms_norm_bfloat16_elementwise_affine_False_memory_efficient_True_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_rms_norm_bfloat16_elementwise_affine_True_memory_efficient_False_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_rms_norm_bfloat16_elementwise_affine_True_memory_efficient_True_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_rms_norm_float16_elementwise_affine_False_memory_efficient_False_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_rms_norm_float16_elementwise_affine_False_memory_efficient_True_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_rms_norm_float16_elementwise_affine_True_memory_efficient_False_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_autocast_fused_rms_norm_float16_elementwise_affine_True_memory_efficient_True_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_compile_fused_layer_norm_elementwise_affine_False_cuda (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_compile_fused_layer_norm_elementwise_affine_True_cuda (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_compile_fused_rms_norm_elementwise_affine_False_cuda (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_compile_fused_rms_norm_elementwise_affine_True_cuda (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_bfloat16_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_bfloat16_memory_efficient_False_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_bfloat16_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_bfloat16_memory_efficient_True_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_bfloat16_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_bfloat16_memory_efficient_False_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_bfloat16_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_bfloat16_memory_efficient_True_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_elemwise_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_elemwise_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_elemwise_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_elemwise_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_elemwise_batch_size_65536_contiguous_False_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_elemwise_batch_size_65536_contiguous_False_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_elemwise_batch_size_65536_contiguous_True_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_elemwise_batch_size_65536_contiguous_True_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_export_cuda (test_fused_layer_norm.TestFusedLayerNormCUDA) ... /scratch/slurm_tmpdir/38660/env/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:632: FutureWarning: 'torch.onnx.utils.export_to_pretty_string' is deprecated in version 2.5 and will be removed in the future. Please use onnx.printer.to_text() instead.
return fn(*args, **kwargs)
ok
test_layer_norm_half_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_float16_memory_efficient_False_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_half_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_float16_memory_efficient_True_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_half_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_float16_memory_efficient_False_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_half_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_float16_memory_efficient_True_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_mixed_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_mixed_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_mixed_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_mixed_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_mixed_batch_size_65536_contiguous_False_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_mixed_batch_size_65536_contiguous_False_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_mixed_batch_size_65536_contiguous_True_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_mixed_batch_size_65536_contiguous_True_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_regular_batch_size_16_contiguous_False_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_regular_batch_size_16_contiguous_False_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_regular_batch_size_16_contiguous_True_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_regular_batch_size_16_contiguous_True_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_regular_batch_size_65536_contiguous_False_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_regular_batch_size_65536_contiguous_False_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_regular_batch_size_65536_contiguous_True_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_layer_norm_regular_batch_size_65536_contiguous_True_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_export_cuda (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_bfloat16_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_bfloat16_memory_efficient_False_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_bfloat16_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_bfloat16_memory_efficient_True_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_bfloat16_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_bfloat16_memory_efficient_False_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_bfloat16_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_bfloat16_memory_efficient_True_cuda_bfloat16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_elemwise_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_elemwise_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_elemwise_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_elemwise_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_elemwise_batch_size_65536_contiguous_False_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_elemwise_batch_size_65536_contiguous_False_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_elemwise_batch_size_65536_contiguous_True_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_elemwise_batch_size_65536_contiguous_True_elementwise_affine_True_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_half_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_float16_memory_efficient_False_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_half_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_False_float16_memory_efficient_True_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_half_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_float16_memory_efficient_False_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_half_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_False_float16_memory_efficient_True_cuda_float16 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_mixed_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_mixed_batch_size_16_contiguous_False_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_mixed_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_mixed_batch_size_16_contiguous_True_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_mixed_batch_size_65536_contiguous_False_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_mixed_batch_size_65536_contiguous_False_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_mixed_batch_size_65536_contiguous_True_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_mixed_batch_size_65536_contiguous_True_elementwise_affine_True_mixed_fused_True_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_regular_batch_size_16_contiguous_False_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_regular_batch_size_16_contiguous_False_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_regular_batch_size_16_contiguous_True_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_regular_batch_size_16_contiguous_True_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_regular_batch_size_65536_contiguous_False_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_regular_batch_size_65536_contiguous_False_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_regular_batch_size_65536_contiguous_True_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_False_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
test_rms_norm_regular_batch_size_65536_contiguous_True_elementwise_affine_False_mixed_fused_False_float32_memory_efficient_True_cuda_float32 (test_fused_layer_norm.TestFusedLayerNormCUDA) ... ok
----------------------------------------------------------------------
Ran 86 tests in 161.233s
OK
test_creation_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_autocast_fp16_use_activation_none_bias_False_cuda (test_mlp.TestMLPCUDA) ... /home/calebh/rsc/apex/tests/L0/run_mlp/test_mlp.py:79: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
with torch.cuda.amp.autocast_mode.autocast(enabled=enable_autocast):
/scratch/slurm_tmpdir/38660/env/lib/python3.10/site-packages/apex/_autocast_utils.py:26: FutureWarning: `torch.cuda.amp.autocast_mode._cast(value, dtype)` is deprecated. Please use `torch.amp.autocast_mode._cast(value, 'cuda', dtype)` instead.
return torch.cuda.amp.autocast_mode._cast(args, torch.get_autocast_gpu_dtype())
ok
test_mlp_autocast_fp16_use_activation_none_bias_True_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_autocast_fp16_use_activation_relu_bias_False_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_autocast_fp16_use_activation_relu_bias_True_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_autocast_fp16_use_activation_sigmoid_bias_False_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_autocast_fp16_use_activation_sigmoid_bias_True_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_use_activation_none_bias_False_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_use_activation_none_bias_True_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_use_activation_relu_bias_False_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_use_activation_relu_bias_True_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_use_activation_sigmoid_bias_False_cuda (test_mlp.TestMLPCUDA) ... ok
test_mlp_use_activation_sigmoid_bias_True_cuda (test_mlp.TestMLPCUDA) ... ok
test_no_grad_cuda (test_mlp.TestMLPCUDA) ... ok
test_numeric_cuda (test_mlp.TestMLPCUDA) ... ok
test_performance_half_cuda (test_mlp.TestMLPCUDA) ... ok
----------------------------------------------------------------------
Ran 16 tests in 0.979s
OK
Fail to import hypothesis in common_utils, tests are not derandomized
Executing tests from /home/calebh/rsc/apex/tests/L0/run_optimizers
Executing tests from /home/calebh/rsc/apex/tests/L0/run_fused_layer_norm
Executing tests from /home/calebh/rsc/apex/tests/L0/run_mlp
Pytorch MLP time 0.9917 ms
C++ MLP time 0.5324 ms
cc @crcrpar
I think this repo is a better implementation for building apex wheels.