Add ROCm Build Support to `smart` CLI
## Description
SmartSim has a non-traditional installation process compared to most Python projects. It is a two-step process:
```bash
# A pip install step
pip install smartsim[ml]
# And a CLI build step
smart build # ... args ...
```
This is because SmartSim cannot legally ship a RedisAI binary. To work around this, SmartSim ships the `smart build` CLI tool, which clones and builds RedisAI from source so that SmartSim can rely on it as a dependency.
Previously, the `smart build` CLI relied on RedisAI's `get_deps.sh` script to fetch RedisAI's dependencies. This meant that SmartSim was effectively constrained to the hardware platforms that RedisAI directly supported (e.g. CPU, Nvidia), despite the fact that SmartSim has been successfully built by hand with ROCm support and offers ROCm support when built with a more sophisticated package manager.
With the merge of https://github.com/CrayLabs/SmartSim/pull/451, the `smart build` CLI no longer utilizes (and is therefore no longer constrained to) the `get_deps.sh` script from RedisAI. Because we are now fetching our own dependencies, we should use this opportunity to offer a first-class `smart build --device=rocm`.
## Justification
There are many potential users of SmartSim who are utilizing exclusively ROCm machines, and thus cannot use the "out of the box" SmartSim installation. Asking these users to build SmartSim by hand or jump to more esoteric package managers than pip is likely preventing many from attempting to adopt SmartSim.
## Implementation Strategy
The first step to offering ROCm support through `smart build` would be to add it as a valid option to the CLI arguments in `smartsim._core._cli.build`:
```python
parser.add_argument(
    "--device",
    type=str.lower,
    default="cpu",
    choices=["cpu", "gpu", "rocm"],
    #                      ^^^^^^ Added as a target here
    help="Device to build ML runtimes for",
)
```
The `"gpu"` option should remain as an alias for Nvidia support for backwards compatibility, but there is also a good argument to be made that an explicit `"nvidia"` target should be added here as well. The `_TDeviceStr` type should be updated to reflect these new values.
The other (arguably more correct) option would be to add a new `--gpu-lib` flag that takes values of `"cuda"` or `"rocm"`:
```python
parser.add_argument(
    "--gpu-lib",  # <-- note the totally new target
    type=str.lower,
    choices=["cuda", "rocm"],
    # etc. for defaults, help, other params
)
```
A warning or error should be raised if a user attempts to specify this flag in conjunction with `--device=cpu`.
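A minimal sketch of that validation using stock `argparse` (the real parser wiring in `smartsim._core._cli.build` may differ; names here mirror the ticket, not necessarily the actual code):

```python
import argparse

parser = argparse.ArgumentParser(prog="smart build")
parser.add_argument(
    "--device", type=str.lower, default="cpu", choices=["cpu", "gpu"],
    help="Device to build ML runtimes for",
)
parser.add_argument(
    "--gpu-lib", type=str.lower, choices=["cuda", "rocm"],
    help="GPU library to target when --device=gpu",
)

def parse_build_args(argv):
    args = parser.parse_args(argv)
    # Reject the nonsensical combination of a CPU build with a GPU library
    if args.gpu_lib is not None and args.device == "cpu":
        parser.error("--gpu-lib may only be used with --device=gpu")
    return args

args = parse_build_args(["--device", "gpu", "--gpu-lib", "rocm"])
print(args.device, args.gpu_lib)  # gpu rocm
```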
Next, ROCm builds of `libtorch` and `libtensorflow` (and ideally `onnxruntime`, but this is not a necessity for a first iteration) will need to be located. If we cannot find anywhere to pull these dependencies from, we could always opt to host them ourselves, similar to how we offer Apple Silicon builds.
Finally, the `url` property of the `_RAIDependency` subclasses defined in `smartsim._core._install.builder` will need to be updated to pull each dependency from the correct link when `--device=rocm` is specified. For most, this will mean handling the case where a `device` instance attribute is equal to `"rocm"`.
## Acceptance Criteria
- [ ] `smart build --device=rocm` or `smart build --device=gpu --gpu-lib=rocm` is made a valid call of `smart build`
- [ ] The `_TDeviceStr` type is updated to reflect newly accepted values
- [ ] A ROCm build of `libtorch` is found or supplied
- [ ] A ROCm build of `libtensorflow` is found or supplied
- [ ] [Optional] A ROCm build of `onnxruntime` is found or supplied
- [ ] `_PTArchiveLinux.url` points to the ROCm build when a ROCm device is requested
- [ ] `_PTArchiveMacOSX.url` throws a `BuildError` when a ROCm device is requested
- [ ] `_TFArchive.url` points to the ROCm build when a ROCm device is requested
- [ ] [Optional] `_ORTArchive.url` points to the ROCm build when a ROCm device is requested
- [ ] The existing test suite passes on a ROCm machine when installed with ROCm support and `SMARTSIM_TEST_DEVICE=gpu`
One possible update to this ticket is for `--device=gpu` to auto-detect the type of accelerator (e.g. Nvidia, AMD, Intel), with an optional `--gpu-lib=cuda` argument to override the auto-detection. The names of these parameters can be changed, but the thought was to split device auto-detection from explicit specification of the ML library.
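A hedged sketch of what such auto-detection might look like. Checking for vendor management CLIs on `PATH` is only one possible heuristic (a real implementation could inspect drivers or sysfs instead), and the function name is hypothetical:

```python
import shutil

def detect_gpu_lib(which=shutil.which) -> str:
    # Hypothetical heuristic: infer the accelerator vendor from which
    # management CLI is installed. `which` is injectable for testing.
    if which("nvidia-smi"):
        return "cuda"
    if which("rocm-smi"):
        return "rocm"
    raise RuntimeError(
        "No supported accelerator detected; specify --gpu-lib explicitly"
    )

# Simulated environment where only rocm-smi is available:
fake_which = lambda cmd: "/usr/bin/rocm-smi" if cmd == "rocm-smi" else None
print(detect_gpu_lib(which=fake_which))  # rocm
```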