Dong Meng

Results: 11 comments by Dong Meng

Hello @eladeban, thanks for the prompt response. I don't think complexity is the issue, as LeNet also hits a similar error. I suspect the additional nodes/ops created by the TensorFlow Estimator interface....

Sorry for the late reply. Yes, I am using channels_first. Let me modify the regularizer and give it a try.

Hello, thank you for looking into the code. I have tried to modify the `output_boundary` to: ``` name: "resnet_model/block_layer4" op: "Identity" input: "resnet_model/Relu_48" device: "/replica:0/task:0/device:GPU:0" attr { key: "T" value...

I see. Let me try this again. Thanks for clarifying.

Ah, I think autozone requires every zone in the region to have a2 instances; however, in the n1-central1 region, only n1-central1-a and c have a2 instances.

Hello, per the updated instructions at https://github.com/triton-inference-server/server/tree/main/deploy/gke-marketplace-app#demo-instruction, we now install Istio on the GKE cluster instead.

@DoctorTeeth, could you please help review this? Thanks.

I observe the same issue. Ideally, I want to run: Node 1: python -m torch.distributed.launch --nproc_per_node=2 / --nnodes=2 --node_rank=0 --master_addr=masterAddr / --master_port=1234 train.py ......... Node 2: python -m torch.distributed.launch --nproc_per_node=2 /...

@gaocegege Could you elaborate on the conditions under which inter-pod cost is less than inter-process cost? Thanks

> @gaocegege I guess if some pod previously had 2 GPUs (on one node) and was then divided into 2 pods with 1 GPU each, both pods will be allocated...