Xibin Liu
Ran the "[Running Megatron-LM/llama2 on A3 Mega](https://github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning/blob/main/sample_workloads/megatron-gke/README.md#running-megatron-lmllama2-on-a3-mega)" test. Created the node pool with a3_highgpu_8g machines. Modified helm/values.yaml to use the tcpx stack: ``` stack: "tcpx" # one of {"tcp", "tcpx",...
Ran the "[Running Megatron-LM/llama2 on A3 Mega](https://github.com/GoogleCloudPlatform/ai-infra-cluster-provisioning/tree/main/sample_workloads/megatron-gke#running-megatron-lmllama2-on-a3-mega)" sample workload following the instructions. Installed the Helm chart:

```
helm install a3-high-exp-1 helm/ --values helm/values.yaml
```

Got the following events from the pods:...
# Description

Integrate SDK for managed diagnostics

- include new SDK google-cloud-mldiagnostics seed-env: --seed-commit=3d8c7df56aa55f363d5453fc82b64c5093ea2863
- add new config params
- modify profiler.py to add ML run and profiling
- modify...