`cml runner`: Request spot instances from requirements

Open courentin opened this issue 3 years ago • 5 comments

What?

Would it be possible to add the ability to request spot instances from a list of requirements rather than an instance type or a GPU type?

For example, I would like to tell cml runner that I want the lowest-priced instance that:

  • has 2 NVIDIA GPUs
  • has at least 8 GB of RAM
  • is of the latest instance generation
  • can be in any availability zone
  • etc.

(more context: discord#cml/1000042237830373406)

Why?

Spot instances are not available 100% of the time and, as explained in the AWS best practices guide, the fewer constraints we impose, the better the chance that our spot instance request is fulfilled.

Possible solutions

I think there are multiple ways of implementing it.

The first, low-cost solution would be to allow multiple values for the --cloud-type option:

cml runner \
  --cloud-spot \
  --cloud-type=g3.4xlarge,g4dn.xlarge,g5.8xlarge

The conversion from requirements to instance types would need to be done beforehand, but after all, instance types don't change often.
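As a rough illustration of that offline conversion, the Python sketch below filters a hand-maintained catalog by minimum GPU count and memory and emits a comma-separated --cloud-type value, cheapest first. The InstanceSpec class, the cloud_type_value helper, and every catalog entry and price are hypothetical placeholders; real values would have to come from the provider's documentation or pricing data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceSpec:
    """Hypothetical catalog entry, to be filled in from provider documentation."""

    name: str
    gpus: int
    memory_gib: float
    hourly_price: float  # placeholder, not a real price quote


def cloud_type_value(catalog, min_gpus, min_memory_gib):
    """Return a comma-separated --cloud-type value, cheapest matching type first."""
    matching = [
        spec
        for spec in catalog
        if spec.gpus >= min_gpus and spec.memory_gib >= min_memory_gib
    ]
    # Cheapest first, so a sequential fallback would try the lowest price first.
    matching.sort(key=lambda spec: spec.hourly_price)
    return ",".join(spec.name for spec in matching)
```

For example, with a made-up catalog of type-a (1 GPU, 16 GiB, 0.5/h), type-b (2 GPUs, 32 GiB, 1.2/h) and type-c (2 GPUs, 64 GiB, 0.9/h), requiring 2 GPUs and 8 GiB yields "type-c,type-b".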

The second solution would be to implement all the requirements logic in cml runner. I'm not sure what the API could look like, but something like this could be useful:

cml runner \
  --cloud-spot \
  --cloud-spot-requirement="AcceleratorCount>=1" \
  --cloud-spot-requirement="AcceleratorManufacturers=nvidia" \
  ...
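One possible way to interpret such flags, sketched in Python under the assumption of a hypothetical "Key>=N", "Key<=N" or "Key=a,b" syntax, is to fold them into a dictionary shaped like EC2's InstanceRequirements: numeric comparisons become Min/Max bounds and equality becomes a list of allowed values. parse_requirements is an illustrative helper, not part of cml.

```python
import re

# Hypothetical syntax for the proposed --cloud-spot-requirement flag:
#   "Key>=N" / "Key<=N" set a numeric Min / Max bound,
#   "Key=a,b" sets a list of allowed values.
_PATTERN = re.compile(r"^(?P<key>[A-Za-z]+)(?P<op>>=|<=|=)(?P<value>.+)$")


def parse_requirements(flags):
    """Merge requirement flags into an InstanceRequirements-style dict."""
    requirements = {}
    for flag in flags:
        match = _PATTERN.match(flag)
        if match is None:
            raise ValueError(f"cannot parse requirement: {flag!r}")
        key, op, value = match["key"], match["op"], match["value"]
        if op == "=":
            # Equality: a comma-separated list of accepted values.
            requirements[key] = [item.strip() for item in value.split(",")]
        else:
            # Comparison: merge into a {"Min": ..., "Max": ...} range.
            bound = "Min" if op == ">=" else "Max"
            requirements.setdefault(key, {})[bound] = int(value)
    return requirements
```

With the two flags above, parse_requirements returns {"AcceleratorCount": {"Min": 1}, "AcceleratorManufacturers": ["nvidia"]}, the same structure as the JSON in the third solution, which is one reason the two options overlap.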

Third solution (basically the second one, but probably easier to implement): read the requirements from a JSON file such as path_to_requirements.json:

{
  "AcceleratorCount": {
    "Min": 1
  },
  "AcceleratorManufacturers": [
    "nvidia"
  ]
}

cml runner \
  --cloud-spot \
  --cloud-spot-json-requirements=path_to_requirements.json \
  ...
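If the requirements live in a file, a light sanity check before any cloud call could catch typos early. The hypothetical load_requirements helper below is a minimal Python sketch: it parses the JSON and, assuming every object-valued field is an InstanceRequirements-style Min/Max range, only verifies those keys and that Min does not exceed Max. Everything else is left for the provider to validate.

```python
import json


def load_requirements(path):
    """Load an InstanceRequirements-style JSON file and sanity-check range fields."""
    with open(path, encoding="utf-8") as handle:
        requirements = json.load(handle)
    if not isinstance(requirements, dict):
        raise ValueError("requirements must be a JSON object")
    for key, value in requirements.items():
        if not isinstance(value, dict):
            continue  # lists, strings, numbers: left for the provider to validate
        unknown = set(value) - {"Min", "Max"}
        if unknown:
            raise ValueError(f"{key}: unexpected keys {sorted(unknown)}")
        low, high = value.get("Min"), value.get("Max")
        if low is not None and high is not None and low > high:
            raise ValueError(f"{key}: Min {low} is greater than Max {high}")
    return requirements
```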

courentin • Jul 22 '22 21:07

See also

  • https://pkg.go.dev/github.com/aws/aws-sdk-go-v2/service/ec2#Client.GetInstanceTypesFromInstanceRequirements
  • https://docs.aws.amazon.com/autoscaling/ec2/userguide/create-asg-instance-type-requirements.html
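For AWS, the operation in the first link is also available in Python through boto3 as get_instance_types_from_instance_requirements. The sketch below pages through its results with NextToken and returns the matching instance type names, which could then feed a multi-valued --cloud-type. Assumptions: the caller passes in a boto3 EC2 client; because InstanceRequirements makes VCpuCount and MemoryMiB mandatory, the sketch defaults them to minimal lower bounds when omitted; and the x86_64/hvm defaults are illustrative choices.

```python
def matching_instance_types(ec2_client, requirements,
                            architectures=("x86_64",),
                            virtualization=("hvm",)):
    """List instance type names that satisfy an InstanceRequirements dict.

    ec2_client is expected to behave like boto3.client("ec2").
    """
    # VCpuCount and MemoryMiB are required by the API; as an assumption,
    # default them to minimal lower bounds when the user leaves them out.
    request = dict(requirements)
    request.setdefault("VCpuCount", {"Min": 1})
    request.setdefault("MemoryMiB", {"Min": 1})

    names, token = [], None
    while True:
        kwargs = {
            "ArchitectureTypes": list(architectures),
            "VirtualizationTypes": list(virtualization),
            "InstanceRequirements": request,
        }
        if token:
            kwargs["NextToken"] = token
        response = ec2_client.get_instance_types_from_instance_requirements(**kwargs)
        names.extend(item["InstanceType"] for item in response.get("InstanceTypes", []))
        token = response.get("NextToken")
        if not token:  # no more pages
            return names
```

A real call might look like matching_instance_types(boto3.client("ec2", region_name="us-east-1"), {"AcceleratorCount": {"Min": 2}, "AcceleratorManufacturers": ["nvidia"]}); the result depends on the region and on what AWS currently offers.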

0x2b3bfa0 • Jul 22 '22 23:07

@courentin what are your thoughts on providing a list to --cloud-type and, when --cloud-spot is active, sequentially trying the instance types until one is immediately available? (I haven't researched whether all the providers have some form of requirements spec API like the one @0x2b3bfa0 linked for AWS.)
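The sequential strategy could look roughly like the Python sketch below. request_spot and CapacityUnavailable are hypothetical stand-ins for whatever a provider backend would use to request one spot instance and to signal that no capacity is available right now; the only point illustrated is the ordered fallback over the --cloud-type list.

```python
class CapacityUnavailable(Exception):
    """Hypothetical error: the spot request cannot be fulfilled right now."""


def first_available(instance_types, request_spot):
    """Try each instance type in order; return (type, result) for the first success.

    request_spot is a hypothetical callable that requests one spot instance of
    the given type and raises CapacityUnavailable when none is available.
    """
    failures = {}
    for instance_type in instance_types:
        try:
            return instance_type, request_spot(instance_type)
        except CapacityUnavailable as error:
            failures[instance_type] = str(error)  # remember why, then move on
    raise CapacityUnavailable(f"no capacity for any of: {failures}")
```

Splitting "g3.4xlarge,g4dn.xlarge,g5.8xlarge" on commas gives the order to try; the first type whose request succeeds wins, and the collected errors are reported if none does.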

dacbd • Jul 25 '22 05:07

@dacbd that would be very useful

courentin • Jul 28 '22 21:07

Thanks for raising this, @courentin. I think this is very important for viable allocation of spot, and even on-demand, GPU instances in the "wild". My thoughts on the implementation/UX options:

  • Option 1 looks like a nice stopgap solution, but it puts the burden of researching the instance types on the user.
  • Option 2 is the primary way to go imo.
  • Option 3 would be a nice additional input imo, but not a replacement for straightforward options covering the useful dimensions: CPU/memory/GPU/GPU-memory ranges (min/max).
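To make the "straightforward options" idea concrete, here is a minimal Python sketch of min/max range flags for CPU, memory, GPU count and GPU memory, translated into the matching EC2 InstanceRequirements range fields (VCpuCount, MemoryMiB, AcceleratorCount, AcceleratorTotalMemoryMiB). Flag names such as --cloud-gpu-min and --cloud-memory-mib-max are invented for illustration and do not exist in cml.

```python
import argparse

# Hypothetical flag stems mapped to EC2 InstanceRequirements range fields.
RANGE_FLAGS = {
    "cpu": "VCpuCount",
    "memory-mib": "MemoryMiB",
    "gpu": "AcceleratorCount",
    "gpu-memory-mib": "AcceleratorTotalMemoryMiB",
}


def requirements_from_args(argv):
    """Turn --cloud-<stem>-min/--cloud-<stem>-max flags into Min/Max ranges."""
    parser = argparse.ArgumentParser()
    for stem in RANGE_FLAGS:
        parser.add_argument(f"--cloud-{stem}-min", type=int)
        parser.add_argument(f"--cloud-{stem}-max", type=int)
    args = vars(parser.parse_args(argv))

    requirements = {}
    for stem, field in RANGE_FLAGS.items():
        dest = f"cloud_{stem}".replace("-", "_")  # argparse's dest naming
        bounds = {}
        if args[f"{dest}_min"] is not None:
            bounds["Min"] = args[f"{dest}_min"]
        if args[f"{dest}_max"] is not None:
            bounds["Max"] = args[f"{dest}_max"]
        if bounds:  # only emit fields the user actually constrained
            requirements[field] = bounds
    return requirements
```

For instance, ["--cloud-gpu-min", "2", "--cloud-memory-mib-min", "8192"] becomes {"MemoryMiB": {"Min": 8192}, "AcceleratorCount": {"Min": 2}}, which could then be merged with a JSON requirements file.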

omesser • Sep 20 '22 20:09

  • Option 1 is rather simple to implement but, indeed, makes users responsible for figuring out instance types, which is not ideal
  • Option 2 is related to https://github.com/iterative/terraform-provider-iterative/issues/158#issuecomment-965625347 and would be handy on every cloud, albeit not easily portable
  • Option 3 sounds like a nested field in a hypothetical cml.yaml (or toml or xlsx for that matter), in addition to option 2, as @omesser said

0x2b3bfa0 • Sep 21 '22 03:09