[Bug] HAMi Framework Fails to Execute nanoGPT with CUDA, While NVIDIA k8s-device-plugin Succeeds
1. Issue or feature description
Running https://github.com/karpathy/nanoGPT under the HAMi framework fails, while the same code runs smoothly under the official https://github.com/NVIDIA/k8s-device-plugin. The inconsistency may stem from HAMi's CUDA hijacking (ref #46). A closer examination of HAMi-core's functionality or configuration may be needed to pinpoint the problem.
Related GPU Environments
- k8s version: v1.26.15
- NVIDIA driver version: 535.86.10
- NVIDIA Container Toolkit version: v1.13.4
- containerd version: v1.6.24
- CUDA version: 12.2
- Linux kernel version: 6.6.22-amd64
- HAMi version: v2.3.12
- PyTorch version: 2.1.0+cu121
- Pod image: kubeflownotebookswg/jupyter-pytorch-cuda-full:v1.8.0
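Before examining the failure, it may help to confirm what the container actually provides. A minimal diagnostic sketch, run inside the pod (the ld.so.preload path is an assumption based on how HAMi-core is typically injected; see the debugging notes at the end of this issue):

```bash
# Confirm PyTorch sees the HAMi-virtualized GPU
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

# TorchInductor shells out to a host C compiler; its absence would match the error below
which gcc cc || echo "no C compiler on PATH"

# HAMi-core is typically injected via ld.so.preload (path assumed)
cat /etc/ld.so.preload 2>/dev/null
```

With HAMi scheduling the pod, the training run fails during model compilation: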
```
Initializing a new model from scratch
number of parameters: 10.65M
num decayed parameter tensors: 26, with 10,740,096 parameters
num non-decayed parameter tensors: 13, with 4,992 parameters
using fused AdamW: True
compiling the model... (takes a ~minute)
Traceback (most recent call last):
File "/home/jovyan/nanoGPT/train.py", line 264, in <module>
losses = estimate_loss()
^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/jovyan/nanoGPT/train.py", line 224, in estimate_loss
logits, loss = model(X, Y)
^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
return callback(frame, cache_entry, hooks, frame_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 641, in _convert_frame
result = inner_convert(frame, cache_size, hooks, frame_state)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 133, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 389, in _convert_frame_assert
return _compile(
^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 569, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 491, in compile_inner
out_code = transform_code_object(code, transform)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/bytecode_transformation.py", line 1028, in transform_code_object
transformations(instructions, code_options)
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/convert_frame.py", line 458, in transform
tracer.run()
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 2074, in run
super().run()
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 724, in run
and self.step()
^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 688, in step
getattr(self, inst.opname)(inst)
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/symbolic_convert.py", line 2162, in RETURN_VALUE
self.output.compile_subgraph(
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 857, in compile_subgraph
self.compile_and_call_fx_graph(tx, pass2.graph_output_vars(), root)
File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 957, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 1024, in call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/output_graph.py", line 1009, in call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/repro/after_dynamo.py", line 117, in debug_wrapper
compiled_gm = compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/__init__.py", line 1568, in __call__
return compile_fx(model_, inputs_, config_patches=self.config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1150, in compile_fx
return aot_autograd(
^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/backends/common.py", line 55, in compiler_fn
cg = aot_module_simplified(gm, example_inputs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 3891, in aot_module_simplified
compiled_fn = create_aot_dispatcher_function(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 3429, in create_aot_dispatcher_function
compiled_fn = compiler_fn(flat_fn, fake_flat_args, aot_config, fw_metadata=fw_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 2212, in aot_wrapper_dedupe
return compiler_fn(flat_fn, leaf_flat_args, aot_config, fw_metadata=fw_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 2392, in aot_wrapper_synthetic_base
return compiler_fn(flat_fn, flat_args, aot_config, fw_metadata=fw_metadata)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 1573, in aot_dispatch_base
compiled_fw = compiler(fw_module, flat_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 1092, in fw_compiler_base
return inner_compile(
^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/repro/after_aot.py", line 80, in debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/debug.py", line 228, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 54, in newFunction
return old_func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 341, in compile_fx_inner
compiled_graph: CompiledFxGraph = fx_codegen_and_compile(
^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 565, in fx_codegen_and_compile
compiled_fn = graph.compile_to_fn()
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/graph.py", line 970, in compile_to_fn
return self.compile_to_module().call
^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/utils.py", line 189, in time_wrapper
r = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/graph.py", line 941, in compile_to_module
mod = PyCodeCache.load_by_key_path(key, path, linemap=linemap)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 1139, in load_by_key_path
exec(code, mod.__dict__, mod.__dict__)
File "/tmp/torchinductor_jovyan/w7/cw7ravsc5anhkpigvbwronnhuedvnysdh7qebhz5f5ahxmyxvbhx.py", line 905, in <module>
async_compile.wait(globals())
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 1418, in wait
scope[key] = result.result()
^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 1277, in result
self.future.result()
File "/opt/conda/lib/python3.11/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Failed to find C compiler. Please specify via CC environment variable.
Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
```
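The failure originates in TorchInductor's C++ codegen step, which needs a working host C compiler inside the container. Two hedged workarounds (not root-cause fixes): nanoGPT's --compile flag comes from its configurator, the CC variable is named in the error message itself, and the install line assumes a Debian-based image:

```bash
# Option 1: skip torch.compile entirely via nanoGPT's config override
python train.py config/train_shakespeare_char.py --compile=False

# Option 2: install a compiler and point Inductor at it explicitly
apt-get update && apt-get install -y gcc   # assumes a Debian-based image
export CC=$(which gcc)
python train.py config/train_shakespeare_char.py
```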
2. Steps to reproduce the issue
Follow the Quick Start in https://github.com/karpathy/nanoGPT?tab=readme-ov-file#quick-start, as summarized below.
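For concreteness, the quick start amounts to the following commands (taken from the nanoGPT README; the character-level Shakespeare config is the smallest reproducer):

```bash
git clone https://github.com/karpathy/nanoGPT && cd nanoGPT
pip install torch numpy transformers datasets tiktoken wandb tqdm
python data/shakespeare_char/prepare.py
python train.py config/train_shakespeare_char.py   # fails under HAMi as shown above
```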
3. Information to attach (optional if deemed irrelevant)
Full output of nvidia-smi -a:
```
(base) jovyan@nanogpt-0:~/nanoGPT$ nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Thu Jun 6 08:39:56 2024
Driver Version : 535.86.10
[HAMI-core Msg(571:140025334466368:libvgpu.c:836)]: Initializing.....
CUDA Version : 12.2
Attached GPUs : 1
GPU 00000000:AF:00.0
Product Name : Tesla V100-PCIE-16GB
Product Brand : Tesla
Product Architecture : Volta
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : N/A
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0320218190518
GPU UUID : GPU-3a6ec8f0-24eb-1905-1f17-7bdb4e850ffa
Minor Number : 1
VBIOS Version : 88.00.1A.00.03
MultiGPU Board : No
Board ID : 0xaf00
Board Part Number : 900-2G500-0100-030
GPU Part Number : 1DB4-893-A1
FRU Part Number : N/A
Module ID : 1
Inforom Version
Image Version : G500.0200.00.03
OEM Object : 1.1
ECC Object : 5.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
GPU Reset Status
Reset Required : No
Drain and Reset Recommended : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xAF
Device : 0x00
Domain : 0x0000
Device Id : 0x1DB410DE
Bus Id : 00000000:AF:00.0
Sub System Id : 0x121410DE
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Device Current : 3
Device Max : 3
Host Max : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Atomic Caps Inbound : N/A
Atomic Caps Outbound : N/A
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 16384 MiB
Reserved : 232 MiB
Used : 0 MiB
Free : 13707 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 10 MiB
Free : 16374 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : N/A
OFA : N/A
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Texture Memory : N/A
Texture Shared : N/A
CBU : 0
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending Page Blacklist : No
Remapped Rows : N/A
Temperature
GPU Current Temp : 34 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 90 C
GPU Slowdown Temp : 87 C
GPU Max Operating Temp : 83 C
GPU Target Temperature : N/A
Memory Current Temp : 32 C
Memory Max Operating Temp : 85 C
GPU Power Readings
Power Draw : 38.95 W
Current Power Limit : 250.00 W
Requested Power Limit : 250.00 W
Default Power Limit : 250.00 W
Min Power Limit : 100.00 W
Max Power Limit : 250.00 W
Module Power Readings
Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 1245 MHz
SM : 1245 MHz
Memory : 877 MHz
Video : 1117 MHz
Applications Clocks
Graphics : 1245 MHz
Memory : 877 MHz
Default Applications Clocks
Graphics : 1245 MHz
Memory : 877 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1380 MHz
SM : 1380 MHz
Memory : 877 MHz
Video : 1237 MHz
Max Customer Boost Clocks
Graphics : 1380 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Fabric
State : N/A
Status : N/A
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2216157
Type : C
Name :
Used GPU Memory : 426 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2664335
Type : C
Name :
Used GPU Memory : 686 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3189308
Type : C
Name :
Used GPU Memory : 1328 MiB
[HAMI-core Msg(571:140025334466368:multiprocess_memory_limit.c:468)]: Calling exit handler 571
```
@wawa0210 @archlitchi PTAL
You can use the environment variable LIBCUDA_LOG_LEVEL to increase HAMi-core's logging level and obtain more context.
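A minimal sketch of that suggestion (level 4 and the output.txt capture match the follow-up below):

```bash
# Raise HAMi-core logging to debug level and capture everything
export LIBCUDA_LOG_LEVEL=4
python train.py config/train_shakespeare_char.py > output.txt 2>&1
grep -i error output.txt
```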
Log appended after setting LIBCUDA_LOG_LEVEL to 4:
```
(base) (⎈|N/A:N/A)➜ cat output.txt | grep -i error
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlErrorString:2
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceClearEccErrorCounts:10
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceGetDetailedEccErrors:38
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceGetMemoryErrorCounter:67
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceGetNvLinkErrorCounter:75
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceGetTotalEccErrors:108
[HAMI-core Debug(492:140563747359616:hook.c:293)]: loading nvmlDeviceResetNvLinkErrorCounters:125
[HAMI-core Debug(492:140563747359616:libvgpu.c:79)]: into dlsym cuGetErrorString
[HAMI-core Debug(492:140563747359616:libvgpu.c:79)]: into dlsym cuGetErrorName
[HAMI-core Debug(492:140563747359616:libvgpu.c:79)]: into dlsym cuGetErrorString
[HAMI-core Debug(492:140563747359616:libvgpu.c:79)]: into dlsym cuGetErrorName
[HAMI-core Info(492:140563747359616:hook.c:343)]: into cuGetProcAddress_v2 symbol=cuGetErrorString:6000
[HAMI-core Info(492:140563747359616:hook.c:343)]: into cuGetProcAddress_v2 symbol=cuGetErrorName:6000
[HAMI-core Info(492:140563747359616:hook.c:343)]: into cuGetProcAddress_v2 symbol=cuGetErrorString:6000
[HAMI-core Info(492:140563747359616:hook.c:343)]: into cuGetProcAddress_v2 symbol=cuGetErrorName:6000
File "/opt/conda/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 490, in catch_errors
RuntimeError: Failed to find C compiler. Please specify via CC environment variable.
torch._dynamo.config.suppress_errors = True
```
This looks more like a container environment issue.
Today I had an offline debugging session with @archlitchi. Despite setting CUDA_DISABLE_CONTROL to true and removing ld.so.preload from the GPU node, the issue persisted. We suspect this is because HAMi bundles NVIDIA device plugin v0.14.0, which may be why nanoGPT cannot run. We need to test with a clean NVIDIA device plugin v0.14.0 to confirm this. If our suspicion is correct, we may need to upgrade the device plugin bundled with HAMi.
Confirmed: the issue also occurs with a clean k8s-device-plugin v0.14.0. We should therefore update the bundled k8s-device-plugin to at least v0.14.5.
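For verification, the stock plugin can be deployed at the suggested version. A sketch using the official Helm chart (repo URL and chart name from the k8s-device-plugin README; the namespace is an assumption):

```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --version 0.14.5
```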