[Azure] Some issues during create

Open linuslabo opened this issue 1 year ago • 5 comments

I want to report the following issues that occurred during the cloudypad create azure process:

  1. With a dynamic IP, the installation fails, apparently because of mismatched parameters somewhere:

Params:

You are about to provision Azure machine with the following details:
    Azure subscription: **************
    Azure location: italynorth
    Instance name: cloudypad
    SSH key: *******
    VM Size: Standard_NV6ads_A10_v5
    Spot instance: true
    Public IP Type: dynamic
    Disk size: 50

Error:

azure-native:compute:VirtualMachine cloudydev-vm created (47s) 
     +  pulumi:pulumi:Stack CloudyPad-Azure-cloudydev creating (66s) error: Expected a single IP, got: [{"etag":"W/\"***********************\"","id":"/subscriptions/**********************/resourceGroups/CloudyPad-cloudydev/providers/Microsoft.Network/networkInterfaces/cloudydev-network-interface********/ipConfigurations/cloudydev-ipcfg","name":"cloudydev-ipcfg","primary":true,"privateIPAddress":"10.0.0.4","privateIPAddressVersion":"IPv4","privateIPAllocationMethod":"Dynamic","provisioningState":"Succeeded","subnet":{"id":"/subscriptions/*****************/resourceGroups/CloudyPad-cloudydev/providers/Microsoft.Network/virtualNetworks/cloudydev-vnet/subnets/cloudydev-subnet"},"type":"Microsoft.Network/networkInterfaces/ipConfigurations"}]
  2. When choosing NV6ads A10 v5 as the instance type, the NVIDIA drivers fail to install:

Params:

You are about to provision Azure machine with the following details:
    Azure subscription: ***************
    Azure location: italynorth
    Instance name: mypad
    SSH key: ***********
    VM Size: Standard_NV6ads_A10_v5
    Spot instance: true
    Public IP Type: static
    Disk size: 60

Error:

[  105.914213] nvidia-nvlink: Nvlink Core is being initialized, major device number 236
[  105.915500] nvidia 0002:00:00.0: enabling device (0000 -> 0002)
[  105.918834] NVRM: The NVIDIA GPU 0002:00:00.0 (PCI ID: 10de:2236)
               NVRM: installed in this system is not supported by the
               NVRM: NVIDIA 550.127.05 driver release.
               NVRM: Please see 'Appendix A - Supported NVIDIA GPU Products'
               NVRM: in this release's README, available on the operating system
               NVRM: specific graphics driver download page at www.nvidia.com.
[  105.919146] nvidia: probe of 0002:00:00.0 failed with error -1
[  105.919177] NVRM: The NVIDIA probe routine failed for 1 device(s).
[  105.919180] NVRM: None of the NVIDIA devices were initialized.

If you prefer, I can create a separate issue for each problem. If more logs are needed, I will try to reproduce the failures and share them.

linuslabo avatar Dec 31 '24 16:12 linuslabo

Thanks for reporting these issues!

  • NV6ads A10 v5 uses an Nvidia A10 GPU; I'll check why the installed driver doesn't support it
  • The dynamic IP bug should be straightforward to fix; I'll look into it ASAP

PierreBeucher avatar Jan 02 '25 07:01 PierreBeucher

The dynamic IP bug will be fixed in the next release: https://github.com/PierreBeucher/cloudypad/pull/92

PierreBeucher avatar Jan 02 '25 15:01 PierreBeucher

I looked into the A10 instance issue a bit. It looks like Cloudy Pad should use the data center driver in some situations instead of the default one. This is trickier, because the right driver depends on the instance type: I'll have to map supported instance types to the proper driver, which is no small feat!

In the meantime I'll remove these instances from the list.

PierreBeucher avatar Jan 02 '25 15:01 PierreBeucher

Hey @PierreBeucher, any updates on the Nvidia driver installation issue? The A10 instance still fails to install the Nvidia drivers, and it's still in the list.

marcosace avatar Mar 13 '25 17:03 marcosace

I haven't had time to look into this driver issue; I should remove this instance from the list ASAP.

PierreBeucher avatar Mar 13 '25 17:03 PierreBeucher

Same issue. I looked for the cheapest Nvidia GPU on Azure, and it seems to be the NV6ads A10 v5.
It seems that the Nvidia GRID driver must be installed for this instance type (https://learn.microsoft.com/fr-fr/azure/virtual-machines/linux/n-series-driver-setup).
I’ll test it; it could be economically interesting to use the NV6ads A10 v5 instead of the NC4as T4 v3.

tounefr avatar Sep 07 '25 19:09 tounefr

Unsupported machines have been removed from the list. We now support Datacenter drivers, which allowed most Datacenter GPUs to be supported on AWS, GCP and others, except for Azure A10 instances, which require custom Azure GRID drivers. We will need to implement custom driver selection specifically for Azure A10 instances.

See https://github.com/PierreBeucher/cloudypad/blob/2063b2115b27d9c74d5768a552388a44a46e3d80/src/providers/azure/cli.ts#L61 and https://github.com/PierreBeucher/cloudypad/pull/277
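
For illustration, mapping instance types to a driver family could look roughly like the sketch below. This is not Cloudy Pad's actual code: the `DriverKind` names, the `AZURE_DRIVER_RULES` table and the `selectDriver` helper are all hypothetical. The only fact taken from this thread is that Azure A10 sizes such as Standard_NV6ads_A10_v5 need Azure GRID drivers; every other instance type simply falls back to a caller-supplied default.

```typescript
// Hypothetical driver families; these names are illustrative only,
// not identifiers from the Cloudy Pad code base.
type DriverKind = "default" | "datacenter" | "azure-grid";

interface DriverRule {
  // Pattern matched against the provider's instance type name.
  pattern: RegExp;
  // Driver family to install when the pattern matches.
  driver: DriverKind;
}

// Rules are checked in order and the first match wins.
// Only the A10 entry comes from this thread; a real table would need
// one rule per instance family that requires a non-default driver.
const AZURE_DRIVER_RULES: DriverRule[] = [
  // Azure A10 sizes, e.g. Standard_NV6ads_A10_v5, need Azure GRID drivers.
  { pattern: /^Standard_NV\d+adm?s_A10_v5$/i, driver: "azure-grid" },
];

// Return the driver family for an instance type, or the fallback
// when no rule matches.
function selectDriver(
  instanceType: string,
  rules: DriverRule[],
  fallback: DriverKind = "default",
): DriverKind {
  const match = rules.find((rule) => rule.pattern.test(instanceType));
  return match ? match.driver : fallback;
}
```

With such a table, `selectDriver("Standard_NV6ads_A10_v5", AZURE_DRIVER_RULES)` yields `"azure-grid"`, while an instance type with no matching rule gets the fallback. Keeping the rules as data rather than branching logic means a new instance family can be supported by adding one entry.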

PierreBeucher avatar Sep 08 '25 07:09 PierreBeucher

Closing, as unsupported Azure instances were removed from the listing.

PierreBeucher avatar Oct 07 '25 10:10 PierreBeucher