Douglas Lehr
Table Batched Embedding Kernel for AMD GPUs. This forward kernel uses GCN intrinsics to pipeline the loads of embedding bag indices with the accumulations and stores.
Add CUDAMallocManagedAllocator Backend. With the new CUDAAllocator class, we have created a new CUDAMallocManagedAllocator, which will handle allocator requests from both CPU and CUDA device types when the backend is...
### 🐛 Describe the bug The following unit tests fail because the shared objects for custom operators and custom backends are not being built. test_pruning_op test_calling_custom_op (__main__.TestCustomOperators) test_pruning_op test_calling_custom_op_inside_script_module (__main__.TestCustomOperators)...
### 🐛 Describe the bug Need to find out why these passing tests are skipped in upstream CI for `test_unary_ufuncs.py` test_unary_ufuncs test_reference_numerics_hard_polygamma_polygamma_n_2_cuda_float16 (__main__.TestUnaryUfuncsCUDA) test_unary_ufuncs test_reference_numerics_hard_polygamma_polygamma_n_2_cuda_float17 (__main__.TestUnaryUfuncsCUDA) test_unary_ufuncs test_reference_numerics_hard_polygamma_polygamma_n_2_cuda_float18 (__main__.TestUnaryUfuncsCUDA) test_unary_ufuncs...
### 🐛 Describe the bug Currently we do not have hipGraph support in PyTorch. We are working to add that implementation. In the meantime, the cudaGraph unit tests test_cuda test_graph_capture_simple...
Add support for serving profiling by request count instead of capturing the entire serving run.