Jianyu Huang issues

Results 51 issues of Jianyu Huang

[FEA] FP8 GEMM implementation

**Is your feature request related to a problem? Please describe.** More details about NVIDIA's latest H100 GPU were recently released in https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/ . Tensor Cores will support FP8 E4M3 and...

feature request

inactive-30d

inactive-90d

Clang-format for refactoring the code

Use "clang-format" (https://clang.llvm.org/docs/ClangFormat.html) to reformat the BLISlab code.

Enable UVM using P2P (UVM on a pair of GPUs)

Summary: Add the APIs for using UVM where the preferred location is a GPU device instead of the CPU. Differential Revision: D36657705

fb-exported

cla signed

Extract the quantized comm into common component

Summary: This will be better shared between Trec and HPC. - It's open source so TorchRec can call it from FBGEMM. - Add Codec-based quantized comm support with FP32, FP16,...

fb-exported

cla signed

Reuse quantize utils functions

Summary: Reuse the quantize utils functions and dedup the code. Differential Revision: D37745225

fb-exported

cla signed

Extract the quantized comm into fbgemm

Summary: This will be better shared between Trec and HPC. It's open source so TorchRec can call it from FBGEMM. Differential Revision: D37745301

fb-exported

cla signed

Upgrade asmjit (combined)

Summary: From D35292923 Differential Revision: D36121284

fb-exported

cla signed

Upgrade the ops to fbgemm namespace in the frontend

Summary: Debug https://fb.workplace.com/groups/210783077585773/permalink/407320664598679/ - Use fbgemm namespace for the update - Load CPU ops with `torch.ops.load_library("//deeplearning/fbgemm/fbgemm_gpu:permute_pooled_embedding_ops_cpu")` Differential Revision: D36390480

fb-exported

cla signed

Update nightly and release version

For FBGEMM release v0.1.0

cla signed

Debug the wheel failure

cla signed