Gene
Is bitrate now supported?
How does the speed compare to uni-directional?
Same problem with CUDA 11.6 and torch 1.11:
[7/8] /usr/local/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=lightseq_layers_new -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1013\" -I/opt/conda/lib/python3.8/site-packages/lightseq/csrc/kernels/includes -I/opt/conda/lib/python3.8/site-packages/lightseq/csrc/ops_new/includes -I/opt/conda/lib/python3.8/site-packages/lightseq/csrc/lsflow/includes -I/opt/conda/lib/python3.8/site-packages/lightseq/csrc/layers_new/includes -isystem /opt/conda/lib/python3.8/site-packages/torch/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.8/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /opt/conda/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=1 -D__CUDA_NO_HALF_OPERATORS__...
I get `error: 'scf.for' op expects region #0 to have 0 or 1 blocks` for q: torch.Size([1120, 280, 4, 32]) (1120, 280, 4, 32) True torch.bfloat16, with the Triton nightly triton_nightly-2.1.0.dev20230822000928-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
I found that this "error: 'scf.for' op expects region #0 to have 0 or 1 blocks" is triggered by the `_bwd_kernel` autotune, under the configuration with "SEQUENCE_PARALLEL": False. I highly doubt it...
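For anyone trying to confirm which autotune configuration triggers the error, one possible approach is to split the candidate configs by the "SEQUENCE_PARALLEL" flag and compile each group separately. The sketch below only illustrates that partitioning step in plain Python: the dicts stand in for real autotune config objects, and the `BLOCK_M`/`BLOCK_N` values and the `split_by_flag` helper are hypothetical, not taken from the kernel source.

```python
# Plain dicts stand in for autotune configs; the BLOCK_M/BLOCK_N values are
# hypothetical and only illustrate the shape of the data.
candidate_configs = [
    {"BLOCK_M": 128, "BLOCK_N": 128, "SEQUENCE_PARALLEL": False},
    {"BLOCK_M": 128, "BLOCK_N": 128, "SEQUENCE_PARALLEL": True},
]


def split_by_flag(configs, flag="SEQUENCE_PARALLEL"):
    """Partition configs into (flag enabled, flag disabled) lists.

    A config that does not define the flag is treated as disabled.
    """
    enabled = [c for c in configs if c.get(flag, False)]
    disabled = [c for c in configs if not c.get(flag, False)]
    return enabled, disabled


# Hypothetical helper usage: each group can then be tried on its own.
parallel_only, sequential_only = split_by_flag(candidate_configs)
```

If the error appears only when autotuning over `sequential_only`, that would be consistent with the report above; it would not by itself explain the cause.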
Using `git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82` solved this problem for me too.