TimWang issues

Results: 13 issues of TimWang

Workloads keep in hang state except cuda-sample:vectoradd under MPS mode

### 1. Quick Debug Information

* OS/Version(Garden Linux 934.11):
* Kernel Version: 5.15.135-gardenlinux-amd64
* Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): containerd:/1.6.20
* K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s/v1.26.11...

[Refactor] Simplify UnmarshalJSON with Dedicated Handlers

This PR refactors the `UnmarshalJSON` method for `ReplicatedDevices` to improve code clarity and structure. Key changes include:

- **Modularization**: Created separate handlers (`handleStringInput`, `handleNumericInput`, `handleListInput`) for different input types, enhancing...

feat: Add GPU node selector to scheduler deployment

The code changes in `deployment.yaml` add the GPU node selector to the scheduler deployment. This change allows the scheduler to select nodes with the `gpu` label set to `"on"`. The...

Add feature to use the node selector in the scheduler to filter BM Nodes with GPUs

### 1. Issue or feature description

#### Description

In our cluster, we have both VM nodes and BM (Bare Metal) nodes. However, only the BM nodes have GPUs. Therefore, I...

Use tag with different version to manage the git branch

Tags in Git serve as a means to designate significant moments in your repository's history. They are commonly employed to indicate release versions, such as v1.0, v2.0, and so on....

[Doc]: Add Kubernetes (K8s) Deployment Documentation for vLLM with GPUs

### 📚 The doc issue

### Context:

I am currently managing H100 GPUs using Kubernetes (K8s). However, I’ve noticed that the vLLM documentation only provides deployment instructions for Docker, which...

documentation

[Doc]: Add deploying_with_k8s guide

### Description

This Pull Request introduces detailed documentation on how to deploy vLLM with Kubernetes. The new documentation is designed to help users efficiently manage and scale their machine learning...

ready

feat:Update k8s-device-plugin to v0.14.5 to Resolve nanoGPT Runtime Issue

**What type of PR is this?**

During an offline debugging session with @archlitchi, we identified that the current NVIDIA device plugin (v1.4.0) is causing compatibility issues with nanoGPT, preventing...

kind/bug

dco-signoff: yes

do-not-merge/work-in-progress

size/XL

[Bug]HAMi Framework Fails to Execute nanoGPT with CUDA, While NVIDIA k8s-device-plugin Succeeds

---

### 1. Issue or feature description

An issue has been identified when trying to run https://github.com/karpathy/nanoGPT with the HAMi framework; it's currently unsuccessful. However, when the same code is...

[Docs]: update model deployment with dedicated namespace

## Pull Request Description

This pull request updates the `samples/quickstart/model.yaml` file to enhance resource allocation, namespace organization, and deployment configuration for the `deepseek-r1-distill-llama-8b` model. Key changes include migrating to a...