Jiajia Qin
This PR merges MatMulPackedVec4Program into MatMulPackedProgram and refactors MatMulSplitKProgram.
hand_detector

Kernel | Time(ms) | Inputs | Output | GPUPrograms
-- | -- | -- | -- | --
Conv2DBackpropInput | 9.73 | input0: 4D[1,16,16,256] input1: 4D[2,2,128,256] | 1,32,32,128 | UnpackProgram:...
Tested using https://honry.github.io/webnn-samples/style_transfer/?backend=webgl

Type | Time(ms) | Inputs | Output
-- | -- | -- | --
Conv2D | 82.86 | input0: 4D[1,548,548,4] input1: 4D[9,9,4,3] | 1,540,540,3
Conv2D | 63.09 |...
Fix #6822

Problem 1: On some GPUs, even if `a` and `b` are both non-NaN, the value of `isNaN` in `vec4 isNaN = min(vec4(isnan(a)) + vec4(isnan(b)), vec4(1.0));` is still larger...
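For reference, the intended per-lane behavior of that GLSL expression can be emulated on the CPU. This is only an illustrative sketch: `nanMask` is a hypothetical helper, not part of tfjs, and it shows what the shader line is meant to compute (0 when neither input is NaN, 1 otherwise), not the fix made in the PR.

```typescript
// Hypothetical CPU-side emulation of:
//   vec4 isNaN = min(vec4(isnan(a)) + vec4(isnan(b)), vec4(1.0));
// Each lane should be 1 if a[i] or b[i] is NaN, and 0 otherwise.
function nanMask(a: number[], b: number[]): number[] {
  if (a.length !== b.length) {
    throw new Error('a and b must have the same length');
  }
  return a.map((ai, i) => {
    const aIsNaN = Number.isNaN(ai) ? 1 : 0;   // vec4(isnan(a))
    const bIsNaN = Number.isNaN(b[i]) ? 1 : 0; // vec4(isnan(b))
    return Math.min(aIsNaN + bIsNaN, 1);       // clamp the sum to 1.0
  });
}

// Lanes: neither NaN, only a NaN, only b NaN, both NaN.
const mask = nanMask([1, NaN, 3, NaN], [1, 2, NaN, NaN]);
console.log(mask);
```

The bug described above is that some GPUs do not honor this expected result for non-NaN inputs.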
The `async read` in `backend_webgl.ts` always creates a new `PIXEL_PACK_BUFFER` [buffer](https://github.com/tensorflow/tfjs/blob/master/tfjs-backend-webgl/src/backend_webgl.ts#L336). Once the download is finished, the buffer is [deleted](https://github.com/tensorflow/tfjs/blob/master/tfjs-backend-webgl/src/backend_webgl.ts#L370) on the GPU. So this buffer is created and deleted over and...
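One common way to avoid this kind of create/delete churn is to keep released buffers in a small pool keyed by size. The sketch below is an assumption about a possible approach, not the change made in this PR: `BufferPool` and its callbacks are hypothetical names, and in a WebGL2 backend the callbacks would wrap `gl.createBuffer()`/`gl.bufferData()` and `gl.deleteBuffer()`. Plain objects stand in for GPU buffers here so the example runs without a GL context.

```typescript
// Hypothetical size-keyed pool; not part of tfjs.
class BufferPool<B> {
  private readonly free = new Map<number, B[]>();
  private readonly create: (byteSize: number) => B;
  private readonly destroy: (buffer: B) => void;

  constructor(create: (byteSize: number) => B, destroy: (buffer: B) => void) {
    this.create = create;
    this.destroy = destroy;
  }

  // Reuse a released buffer of the same size if one exists, else create one.
  acquire(byteSize: number): B {
    const cached = this.free.get(byteSize);
    if (cached !== undefined && cached.length > 0) {
      return cached.pop() as B;
    }
    return this.create(byteSize);
  }

  // Return a buffer to the pool instead of deleting it.
  release(byteSize: number, buffer: B): void {
    const cached = this.free.get(byteSize);
    if (cached === undefined) {
      this.free.set(byteSize, [buffer]);
    } else {
      cached.push(buffer);
    }
  }

  // Delete every pooled buffer, e.g. when the backend is disposed.
  dispose(): void {
    this.free.forEach((buffers) => buffers.forEach((b) => this.destroy(b)));
    this.free.clear();
  }
}

// Stand-in "buffers" that only count creations and deletions.
let created = 0;
let destroyed = 0;
const pool = new BufferPool<{id: number}>(
    () => ({id: created++}), () => { destroyed++; });

const first = pool.acquire(1024);
pool.release(1024, first);
const second = pool.acquire(1024);  // served from the pool, no new creation
```

The trade-off is that pooled buffers keep occupying GPU memory until `dispose` is called, so a real implementation would likely cap the pool size.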
### Description

#21618

This PR optimizes grouped conv by 1) making memory access on the GPU more sequential and 2) reusing the input's data to reduce the number of global memory accesses. See the `Conv|GroupedConv` op in...
### Description

### Motivation and Context
### Description

This PR makes the intermediate buffers generated in GQA static when the static KV cache is used, so that graph capture can be used with LLMs. The...
This PR enables graph capture in the WebGPU provider, similar to the JSEP implementation in #18989. The limitations are similar to those of the JS/CUDA EPs: 1. Models with control-flow ops (i.e....
Fixed #26766