Shanbin Ke
Just in case anyone wants to run the code with Python 3.7+ and torch 0.4+ ...
@superbobry How do we make sure `test_sdpa_inference` has 4 GPUs? Also, could you share some info on the internal checks failure?
@superbobry Hi, I think all the issues are resolved now. Could you take another look and trigger the internal review?
> Sorry for the delay @Cjkkkk. `DotProductAttentionTest.test_sdpa_inference` seems to fail internally with
>
> ```
> Traceback (most recent call last):
>   File "[...]/jax/_src/test_util.py", line 456, in test_method_wrapper
>     return...
> ```
> Yeah, it seems likely. Can you, perhaps, skip your test if
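The reviewer's suggestion above (skip the test when the required number of devices isn't available) could be sketched roughly as follows. This is a minimal, self-contained illustration: `device_count()` is a hypothetical stand-in for `jax.device_count()`, and the actual JAX test suite uses its own skip helpers in `jax._src.test_util`, not plain `unittest`.

```python
import unittest


def device_count():
    # Hypothetical stand-in for jax.device_count(); here we pretend
    # the host only has a single device, so the test below is skipped.
    return 1


class DotProductAttentionTest(unittest.TestCase):
    def test_sdpa_inference(self):
        # Skip rather than fail on hosts with fewer than 4 GPUs.
        if device_count() < 4:
            self.skipTest("test_sdpa_inference requires at least 4 devices")
        # ... the actual SDPA inference checks would run here ...
```

Running this on a host reporting fewer than 4 devices records the test as skipped instead of failed, which is the behavior the internal CI needs.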
@MoFHeka, it is not correct to say it is implemented in TensorFlow; it is implemented in XLA, and there is a PR https://github.com/openxla/xla/pull/6872 pending to integrate the final piece...
Adding `set(CMAKE_CXX_COMPILER "clang-9")` to skeleton/CMakeList.txt solved my problem. If this line is removed, the default compiler is g++ in my case, which causes this problem. Hope it helps.
> > Compilation: TSL:XlaCompile:#module=pjit__wrapped_step_fn,program_id=24#: 3.754429084 (parallel + inline)
>
> What are the units, seconds?
>
> Mentioning both runtime and compile time in the bug description is a bit...
Updated the compilation results for some more models with these changes: https://docs.google.com/spreadsheets/d/1uIRf66UT9hOBOge3nvRZebDintgM0zmozNts0tOiXQA/edit?usp=sharing. It seems preserveLocals=False is not doing any better than parallel + inline, so I will just remove that.
@cheshire Hi, any updates on this?