`cml runner`: Request spot instances from requirements
## What?
Would it be possible to add the ability to request spot instances from a list of requirements rather than an instance type or a GPU type?
For example, I would like to tell `cml runner`: I want the lowest-priced instance that:
- has 2 nvidia GPUs
- has at least 8 GB of ram
- is the latest instance generation
- is in any availability zone
- etc.
(more context: discord#cml/1000042237830373406)
## Why?
Spot instances are not available 100% of the time and, as explained in the AWS best practices guide, the fewer constraints we set, the better our chances of fulfilling a spot instance request.
## Possible solutions
I think there are multiple ways of implementing it.

The first, low-cost solution would be to allow multiple values for the `--cloud-type` option:

```shell
cml runner \
  --cloud-spot \
  --cloud-type=g3.4xlarge,g4dn.xlarge,g5.8xlarge
```

The requirements-to-instance-type conversion would need to be done beforehand; then again, instance types don't change often.
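To make the intent of option 1 concrete, here is a minimal sketch of the fallback behaviour the runner could apply to such a comma-separated list. The `provision` callback and the selection logic are assumptions for illustration, not existing cml behaviour:

```python
# Hypothetical fallback logic for a comma-separated --cloud-type value:
# try each instance type in order until one can be provisioned.
def pick_instance_type(cloud_type: str, provision) -> str:
    for instance_type in cloud_type.split(","):
        if provision(instance_type):  # e.g. spot capacity is available
            return instance_type
    raise RuntimeError("no spot capacity for any requested instance type")

# Example: pretend only g4dn.xlarge has spot capacity right now.
available = {"g4dn.xlarge"}
chosen = pick_instance_type(
    "g3.4xlarge,g4dn.xlarge,g5.8xlarge",
    provision=lambda t: t in available,
)
print(chosen)  # g4dn.xlarge
```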
The second solution would be to implement all the requirement logic in `cml runner`. I'm not sure what the API could look like, but something like this could be useful:

```shell
cml runner \
  --cloud-spot \
  --cloud-spot-requirement="AcceleratorCount>=1" \
  --cloud-spot-requirement="AcceleratorManufacturers=NVIDIA" \
  ...
```
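A minimal sketch of how such requirement strings could be parsed and matched against an instance description. The `Key>=Value` syntax and the field names are taken from the example above; the instance dict shape is an assumption, not a real API:

```python
import operator
import re

# Supported comparison operators for the hypothetical
# --cloud-spot-requirement syntax ("AcceleratorCount>=1", ...).
_OPS = {">=": operator.ge, "<=": operator.le, "=": operator.eq}

def parse_requirement(spec: str):
    """Split a requirement string into (key, comparison, value)."""
    match = re.match(r"^(\w+)(>=|<=|=)(.+)$", spec)
    if not match:
        raise ValueError(f"invalid requirement: {spec!r}")
    key, op, value = match.groups()
    return key, _OPS[op], int(value) if value.isdigit() else value

def satisfies(instance: dict, specs: list) -> bool:
    """True if the instance description meets every requirement."""
    return all(op(instance[key], value)
               for key, op, value in map(parse_requirement, specs))

# Illustrative instance description (fields are assumptions).
g4dn = {"AcceleratorCount": 1, "AcceleratorManufacturers": "NVIDIA"}
print(satisfies(g4dn, ["AcceleratorCount>=1",
                       "AcceleratorManufacturers=NVIDIA"]))  # True
```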
A third solution (basically the second one, but probably easier to implement) would be to accept the requirements as a JSON file:

```json
{
  "AcceleratorCount": {
    "Min": 1
  },
  "AcceleratorManufacturers": [
    "nvidia"
  ]
}
```

```shell
cml runner \
  --cloud-spot \
  --cloud-spot-json-requirements=path_to_requirements.json \
  ...
```
## See also
- https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/service/ec2#Client.GetInstanceTypesFromInstanceRequirements
- https://docs.aws.amazon.com/autoscaling/ec2/userguide/create-asg-instance-type-requirements.html
@courentin what are your thoughts on providing a list to `--cloud-type` when `--cloud-spot` is active, sequentially trying the instance types and using the first one that is immediately available? (I haven't researched whether all the providers have some form of requirements-spec API like the one @0x2b3bfa0 linked for AWS.)
@dacbd it would be very useful
Thanks for raising this, @courentin. I think this is very important for viable spot (and even on-demand) GPU instance allocation in the "wild". My thoughts about implementation/UX options:
- Option 1 looks like a nice stop gap solution, but it's putting the burden of researching the instance types on the user.
- Option 2 is the primary way to go imo.
- Option 3 would be a nice additional input imo, but not instead of straightforward options for the useful dimensions: CPU/memory/GPU/GPU-memory ranges (min/max)
- Option 1 is rather simple to implement but, indeed, makes users responsible for figuring out instance types, which is not ideal
- Option 2 is related to https://github.com/iterative/terraform-provider-iterative/issues/158#issuecomment-965625347 and would be handy on every cloud, albeit not easily portable
- Option 3 sounds like a nested field in a hypothetical `cml.yaml` (or `toml`, or `xlsx` for that matter), in addition to option 2 as @omesser said