Hakan Baba
For some reason this pull request shows many more commits than there are. :( Something may be messed up with the rebase/merge operations.
Dear maintainers, we would very much appreciate it if you could add some comments to this PR.
If I may, could I suggest alternative paths to `/etc`:

- `join(base_dir, "etc")`
- `join(base_dir, "conf")`
- `join(base_dir, "config")`

**Rationale**

`sagemaker-inference` already has a concept of [`base_dir`](https://github.com/aws/sagemaker-inference-toolkit/blob/45fa4fb33c13a70640aa200fbfa576c323f973da/src/sagemaker_inference/environment.py#L30). `base_dir` defaults to...
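A minimal sketch of how that lookup could work; the `resolve_config_dir` helper and its fallback behavior are hypothetical, just to illustrate the proposal, not existing `sagemaker-inference` code:

```python
import os

# Candidate subdirectory names come from the suggestion above.
CANDIDATE_SUBDIRS = ("etc", "conf", "config")

def resolve_config_dir(base_dir):
    """Return the first existing config directory under base_dir, else /etc."""
    for sub in CANDIDATE_SUBDIRS:
        candidate = os.path.join(base_dir, sub)
        if os.path.isdir(candidate):
            return candidate
    # Fall back to the current hard-coded location.
    return "/etc"
```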
I see the defragment error with 0.9.1 as well. (Changed the issue title to reflect 0.9.1)
The assertion comes from [this line](https://github.com/microsoft/DeepSpeed/blame/39b429d56ef12b3dc82fc177e2f0f801db744a3d/deepspeed/runtime/zero/stage3.py#L410). According to the blame, it has not changed for a year or so. Looking further up the call stack, that defragment function is called...
> You can't use bf16 on the V100. Did you make the change in the README? https://github.com/databrickslabs/dolly#v100-gpus

Yes. Otherwise one gets a clear error message for the unsupported bf16. Also the...
How about giving the workaround a try first, @jamesrmccall?

```
"offload_param": {
    "device": "cpu",
    "pin_memory": true
},
```

in the DeepSpeed config? That fixed the issue for me...
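For context, here is a minimal sketch of where that fragment sits in a ZeRO stage-3 config, written as a Python dict; the surrounding values are illustrative placeholders, not taken from the original issue:

```python
# Sketch only: placement of the offload_param workaround inside a ZeRO-3
# DeepSpeed config. Surrounding values are illustrative defaults.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",      # park parameters in CPU memory
            "pin_memory": True,
        },
    },
}
```

A dict like this can be passed to `deepspeed.initialize(..., config=ds_config)` instead of a JSON file.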
> For the JIT version, if you properly warm up the kernel compilation, this would not bring a performance decrease. We have a cache for JIT-compiled kernels.

I am mainly worried about a...
@hsanson Would there be any interest in this capability? If the maintainers are open to it, I could take a stab at it.
The issue here is that vllm_backend does not support the V1 metrics from vLLM (as far as I can tell). At the time of writing, the latest vllm_backend's...