Add ROCm Build Support to `smart` CLI
## Description
SmartSim has a non-traditional installation process compared to most Python projects. It is a two-step process:
```bash
# A pip install step
pip install smartsim[ml]
# And a CLI build step
smart build # ... args ...
```
This is because SmartSim cannot legally ship a RedisAI binary. To work around this, SmartSim ships the `smart build` CLI tool, which clones and builds RedisAI from source so that SmartSim can rely on it as a dependency.
Previously, the `smart build` CLI relied on RedisAI's `get_deps.sh` script to fetch RedisAI's dependencies. This meant that SmartSim was effectively constrained to the hardware platforms that RedisAI directly supported (e.g. CPU, Nvidia), despite the fact that SmartSim has been successfully built by hand with ROCm support and offers ROCm support when built with a more sophisticated package manager.
With the merge of https://github.com/CrayLabs/SmartSim/pull/451, the `smart build` CLI no longer utilizes (and is therefore no longer constrained to) the `get_deps.sh` script from RedisAI. Because we are now fetching our own dependencies, we should use this opportunity to offer a first-class `smart build --device=rocm`.
## Justification
There are many potential users of SmartSim who are utilizing exclusively ROCm machines, and thus cannot use the "out of the box" SmartSim installation. Asking these users to build SmartSim by hand or jump to more esoteric package managers than pip is likely preventing many from attempting to adopt SmartSim.
## Implementation Strategy
The first step to offering ROCm support through `smart build` would be to add it as a valid option to the CLI arguments in `smartsim._core._cli.build`:
```python
parser.add_argument(
    "--device",
    type=str.lower,
    default="cpu",
    choices=["cpu", "gpu", "rocm"],
    #                      ^^^^^^ Added as a target here
    help="Device to build ML runtimes for",
)
```
The `"gpu"` option should remain as an alias for Nvidia support for backwards compatibility, but there is also a good argument to be made that an explicit `"nvidia"` target should be added here as well. The `_TDeviceStr` type should be updated to reflect these new values.
The other (arguably more correct) option would be to add a new `--gpu-lib` flag that takes values of `"cuda"` or `"rocm"`:
```python
parser.add_argument(
    "--gpu-lib",  # <-- note the totally new target
    type=str.lower,
    choices=["cuda", "rocm"],
    # etc. for defaults, help, other params
)
```
A warning or error should be raised if a user attempts to specify this flag in conjunction with `--device=cpu`.
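A minimal sketch of that validation using stock `argparse` (the real parser wiring in `smartsim._core._cli.build` may differ; names here mirror the ticket, not necessarily the actual code):

```python
import argparse

parser = argparse.ArgumentParser(prog="smart build")
parser.add_argument(
    "--device", type=str.lower, default="cpu", choices=["cpu", "gpu"],
    help="Device to build ML runtimes for",
)
parser.add_argument(
    "--gpu-lib", type=str.lower, choices=["cuda", "rocm"],
    help="GPU library to target when --device=gpu",
)

def parse_build_args(argv):
    args = parser.parse_args(argv)
    # Reject the nonsensical combination of a CPU build with a GPU library
    if args.gpu_lib is not None and args.device == "cpu":
        parser.error("--gpu-lib may only be used with --device=gpu")
    return args

args = parse_build_args(["--device", "gpu", "--gpu-lib", "rocm"])
print(args.device, args.gpu_lib)  # gpu rocm
```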
Next, ROCm builds of `libtorch` and `libtensorflow` (and ideally `onnxruntime`, but this is not a necessity for a first iteration) will need to be located. If we cannot find anywhere to pull these dependencies from, we could always opt to host them ourselves, similar to how we offer Apple Silicon builds.
Finally, the `url` property of the `_RAIDependency` subclasses defined in `smartsim._core._install.builder` will need to be updated to pull each dependency from the correct link when `--device=rocm` is specified. For most, this will mean handling the case where a `device` instance attribute is equal to `"rocm"`.
## Acceptance Criteria
- [ ] `smart build --device=rocm` or `smart build --device=gpu --gpu-lib=rocm` is made a valid call of `smart build`
- [ ] The `_TDeviceStr` type is updated to reflect newly accepted values
- [ ] A ROCm build of `libtorch` is found or supplied
- [ ] A ROCm build of `libtensorflow` is found or supplied
- [ ] [Optional] A ROCm build of `onnxruntime` is found or supplied
- [ ] `_PTArchiveLinux.url` points to the ROCm build when a ROCm device is requested
- [ ] `_PTArchiveMacOSX.url` throws a `BuildError` when a ROCm device is requested
- [ ] `_TFArchive.url` points to the ROCm build when a ROCm device is requested
- [ ] [Optional] `_ORTArchive.url` points to the ROCm build when a ROCm device is requested
- [ ] The existing test suite passes on a ROCm machine when installed with ROCm support and `SMARTSIM_TEST_DEVICE=gpu`
One possible update to this ticket is for `--device=gpu` to auto-detect the type of accelerator (e.g. Nvidia, AMD, Intel), with an optional `--gpu-lib=cuda` argument to override the auto-detection. The names of these parameters can be changed, but the thought was to split device auto-detection from explicit specification of the ML library.
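A hedged sketch of what such auto-detection might look like. Checking for vendor management CLIs on `PATH` is only one possible heuristic (a real implementation could inspect drivers or sysfs instead), and the function name is hypothetical:

```python
import shutil

def detect_gpu_lib(which=shutil.which) -> str:
    # Hypothetical heuristic: infer the accelerator vendor from which
    # management CLI is installed. `which` is injectable for testing.
    if which("nvidia-smi"):
        return "cuda"
    if which("rocm-smi"):
        return "rocm"
    raise RuntimeError(
        "No supported accelerator detected; specify --gpu-lib explicitly"
    )

# Simulated environment where only rocm-smi is available:
fake_which = lambda cmd: "/usr/bin/rocm-smi" if cmd == "rocm-smi" else None
print(detect_gpu_lib(which=fake_which))  # rocm
```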