Dong Meng
Hello @eladeban, thanks for the prompt response. I don't think complexity is the issue, as LeNet also has a similar error. I suspect the additional nodes/ops created by the TensorFlow Estimator interface....
Sorry for the late reply. Yes, I am using channels_first. Let me modify the regularizer and give it a try.
Hello, thank you for looking into the code. I have tried to modify the `output_boundary` to: ``` name: "resnet_model/block_layer4" op: "Identity" input: "resnet_model/Relu_48" device: "/replica:0/task:0/device:GPU:0" attr { key: "T" value...
I see. Let me try this again. Thanks for clarifying.
Ah, I think autozone requires every zone in the region to have a2 instances; however, in the n1-central1 region, only n1-central1-a and c have a2 instances.
Hello, per the updated instructions at https://github.com/triton-inference-server/server/tree/main/deploy/gke-marketplace-app#demo-instruction, we now install Istio on the GKE cluster instead.
@DoctorTeeth could you please help and review? Thanks
I observe the same issue. Ideally, I want to run: Node 1: python -m torch.distributed.launch --nproc_per_node=2 / --nnodes=2 --node_rank=0 --master_addr=masterAddr / --master_port=1234 train.py ......... Node 2: python -m torch.distributed.launch --nproc_per_node=2 /...
@gaocegege Could you elaborate on the conditions under which inter-pod costs less than inter-process? Thanks
> @gaocegege I guess if some pod previously had 2 gpus (on one node) and then it divided into 2 pods with 1 gpus each, both pods will be allocated...