Update triton_ops.py from triton/python/tutorials/06-fused-attention.py
Update deepspeed/ops/transformer/inference/triton_ops.py with the latest triton/python/tutorials/06-fused-attention.py.

One intentional deviation from the tutorial: num_stages is set to 1 in deepspeed/ops/transformer/inference/triton_ops.py, whereas the tutorial uses num_stages=2. With num_stages=2, running Stable Diffusion inference through the DeepSpeed inference engine gives an out-of-memory error; setting either num_stages=1 or BLOCK=64 avoids it.

With this update, Stable Diffusion inference with the DeepSpeed inference engine works on the latest Triton on an A100 (with either num_stages=1 or BLOCK=64). However, the output image quality is not as good as with the older Triton version, which also still works.
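For context, the functional deviation from the tutorial is only in the kernel launch configuration. A hypothetical sketch of the change (argument and kernel names are illustrative, not copied from triton_ops.py):

```python
# Illustrative excerpt only: how the fused-attention kernel launch here
# differs from the Triton tutorial. Names are assumptions, not the
# actual signature in triton_ops.py.
_fwd_kernel[grid](
    q, k, v, sm_scale, output,
    BLOCK_M=128,    # tutorial value; reducing BLOCK to 64 also avoids OOM
    BLOCK_N=128,
    num_warps=4,
    num_stages=1,   # tutorial uses num_stages=2, which runs out of memory
                    # under the DeepSpeed inference engine on Stable Diffusion
)
```

num_stages controls software pipelining depth: more stages means more shared-memory buffers in flight, which is why lowering it (or shrinking the block size) reduces memory pressure.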
@microsoft-github-policy-service agree company="AMD"