Results: 13 issues by Void Main

Is there any way to show users all of the available achievements registered by the developer, so that the user has a roadmap to work toward?

Add the Chrome T-Rex rush game to PLE. It should be a fun game for learning & playing. :smile: Here's a random agent playing the game: ![t-rex-with-random-agent](https://user-images.githubusercontent.com/552990/81060770-b86d4a00-8f05-11ea-8087-a3bca372a3df.gif) ## Game Spec...
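For reference, a minimal random-agent loop against PLE's standard interface; `FlappyBird` is used as a stand-in since the T-Rex game isn't part of PLE yet, and the new game would presumably be driven the same way.

```python
import random
from ple import PLE
from ple.games.flappybird import FlappyBird  # stand-in: the T-Rex game is not in PLE yet

game = FlappyBird()
env = PLE(game, fps=30, display_screen=False)
env.init()

actions = env.getActionSet()  # valid key codes for this game (includes None = no-op)
for episode in range(3):
    env.reset_game()
    total_reward = 0.0
    while not env.game_over():
        # act() advances the game by one frame and returns the reward
        total_reward += env.act(random.choice(actions))
    print(f"episode {episode}: reward {total_reward:.1f}")
```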

# Description For some models there may be a `None` output from the scripted model; for example, in `torchvision.inception_v3`, the second output is a `None` constant. The current implementation throws an error, as reported... (a minimal repro sketch is included below)

component: conversion
WIP
cla signed
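A minimal repro sketch for the report above (flags and input shape are illustrative; assumes a recent torch/torchvision):

```python
import torch
import torchvision

# In eval mode inception_v3 skips its auxiliary classifier, so the scripted
# model returns InceptionOutputs(logits, None): the second output is a None
# constant, which is what the conversion trips over.
model = torchvision.models.inception_v3(aux_logits=True, init_weights=False).eval()
scripted = torch.jit.script(model)

out = scripted(torch.randn(1, 3, 299, 299))
print(out[0].shape)  # logits, e.g. torch.Size([1, 1000])
print(out[1])        # aux_logits -> None
```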

## ❓ Question I'm trying to optimize Hugging Face's BERT base uncased model using Torch-TensorRT. The code works after disabling full compilation (`require_full_compilation=False`), and the average latency is ~10ms on... (a sketch of the setup is included below)

question
performance
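A sketch of the setup being described; the model variant, sequence length, and precision flags below are assumptions rather than values from the report:

```python
import torch
import torch_tensorrt
from transformers import BertModel

# Trace BERT and compile with Torch-TensorRT, allowing partial compilation
# (require_full_compilation=False) since full compilation fails here.
model = BertModel.from_pretrained("bert-base-uncased", torchscript=True).eval().cuda()
seq_len = 128
example_inputs = (
    torch.randint(0, 30522, (1, seq_len), dtype=torch.int32, device="cuda"),  # input_ids
    torch.ones(1, seq_len, dtype=torch.int32, device="cuda"),                 # attention_mask
)
traced = torch.jit.trace(model, example_inputs)

trt_model = torch_tensorrt.compile(
    traced,
    inputs=[
        torch_tensorrt.Input(shape=[1, seq_len], dtype=torch.int32),
        torch_tensorrt.Input(shape=[1, seq_len], dtype=torch.int32),
    ],
    enabled_precisions={torch.float16},
    require_full_compilation=False,   # full compilation disabled, per the report
    truncate_long_and_double=True,    # BERT uses int64 ids, which TensorRT can't ingest directly
)
```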

It seems there is no @user mention notification when replying?

I'm using the sample code from [tutorial 6](https://triton-lang.org/master/getting-started/tutorials/06-fused-attention.html) and measuring the performance on an A100 (measurement sketch below). Here's the bwd latency graph (ran twice; the results look similar): ![CleanShot 2022-11-28 at 10 58...

help wanted
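A sketch of how that backward latency is typically measured, assuming tutorial 06 has been saved locally as `fused_attention.py`; the `attention` signature and the shapes vary across Triton versions, so treat this as illustrative:

```python
import torch
import triton
from fused_attention import attention  # tutorial 06 saved locally (assumption)

# Time one backward pass of the fused attention op with do_bench, the same
# helper the tutorial's perf_report benchmark uses.
BATCH, HEADS, SEQ, DIM = 4, 48, 4096, 64
q, k, v = (
    torch.randn(BATCH, HEADS, SEQ, DIM, device="cuda", dtype=torch.float16, requires_grad=True)
    for _ in range(3)
)
sm_scale = 1.3

o = attention(q, k, v, sm_scale)  # note: newer tutorial versions add a `causal` argument
do = torch.randn_like(o)
ms = triton.testing.do_bench(lambda: o.backward(do, retain_graph=True))
print(f"bwd: {ms:.3f} ms")
```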

Hi guys, I wonder why it takes 119 GFLOPs for DiT-XL/2 to generate 256x256 images. According to my calculation, it should be over 228 GFLOPs; can anyone please kindly point...
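For reference, a rough back-of-envelope using the DiT-XL config from the paper (depth 28, hidden 1152, MLP ratio 4; patch size 2 on the 32x32 latent gives 256 tokens) and ignoring the adaLN/conditioning layers; the ~119 vs. ~228 gap is consistent with counting MACs versus FLOPs, though that is only a guess:

```python
# DiT-XL/2 @ 256x256: depth 28, hidden 1152, MLP ratio 4, 256 tokens (32x32 latent, patch 2)
L, d, N, mlp = 28, 1152, 256, 4

per_layer_macs = (
    4 * N * d * d          # q/k/v + output projections
    + 2 * N * N * d        # attention scores + weighted sum over values
    + 2 * mlp * N * d * d  # the two MLP linears
)
total_macs = L * per_layer_macs
print(f"~{total_macs / 1e9:.0f} GMACs  (~{2 * total_macs / 1e9:.0f} GFLOPs at 2 FLOPs per MAC)")
# prints roughly 118 GMACs / 237 GFLOPs
```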

Implement LLaMA as requested in issue #506. ## Steps to use First, convert the llama-7b-hf weights from Hugging Face with `huggingface_llama_convert.py`: `python3 huggingface_llama_convert.py -saved_dir=/path/to/export/folder/ -in_file=/path/to/llama-7b-hf -infer_gpu_num=1 -weight_data_type=fp16 -model_name=llama_7b` Next, compile and...

Hi community, I'm building a Triton kernel that first loads some discontinuous indices from one tensor and then loads the actual data at those indices from another tensor. I'm trying to implement...
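A minimal sketch of that two-step load (gather) pattern; the names and block size are illustrative, not taken from the issue:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gather_kernel(idx_ptr, src_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    idx = tl.load(idx_ptr + offs, mask=mask, other=0)    # step 1: load the discontinuous indices
    vals = tl.load(src_ptr + idx, mask=mask, other=0.0)  # step 2: gather the actual data at those indices
    tl.store(out_ptr + offs, vals, mask=mask)

src = torch.randn(4096, device="cuda")
idx = torch.randint(0, src.numel(), (1024,), device="cuda")
out = torch.empty(idx.numel(), device="cuda", dtype=src.dtype)
grid = (triton.cdiv(idx.numel(), 256),)
gather_kernel[grid](idx, src, out, idx.numel(), BLOCK=256)
assert torch.equal(out, src[idx])
```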

Hey team, I'm suffering from high Triton kernel launch overhead. Here's my nsys capture: ![CleanShot 2023-11-10 at 10 28 53](https://github.com/openai/triton/assets/552990/d62f05c8-b00c-43fc-b1c7-0680a9988706) The kernel executes in around 80us on the GPU; however, it takes 220us...
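For context, a rough way to reproduce that kind of GPU-time vs. wall-clock gap outside of nsys; the kernel below is a trivial stand-in, not the one from the capture:

```python
import time
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    offs = tl.program_id(axis=0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(1 << 22, device="cuda")
y = torch.randn_like(x)
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
launch = lambda: add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)

launch()
torch.cuda.synchronize()                  # warm up so JIT compilation isn't timed

gpu_ms = triton.testing.do_bench(launch)  # GPU execution time per launch

n_iters = 1000
t0 = time.perf_counter()
for _ in range(n_iters):
    launch()                              # enqueue only, no sync inside the loop
torch.cuda.synchronize()
wall_us = (time.perf_counter() - t0) * 1e6 / n_iters
print(f"GPU: {gpu_ms * 1e3:.0f} us/launch, wall clock: {wall_us:.0f} us/launch")
```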