AGI-player
Using the new version of the code (commit 33f2c0d4f89cf76671c0fdfbcee79d732b6a020e), I train a llama2-13b model from randomly initialized weights. In the multi-node case I use DeepSpeed ZeRO-3 with bf16; the config is below. The learning-rate (warmup) and loss curves during training look roughly normal overall.

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": { "enabled":...
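A minimal sketch of such a ZeRO-3 bf16 config as a Python dict, e.g. as passed to the HF Trainer's DeepSpeed integration, which resolves the "auto" placeholders at launch. The ZeRO option values here are illustrative assumptions, not the poster's actual settings:

```python
# Hypothetical DeepSpeed ZeRO-3 + bf16 config sketch; "auto" values are
# filled in by the Hugging Face Trainer integration at startup.
ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "bf16": {"enabled": True},
    "fp16": {"enabled": False},
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

# bf16 and fp16 are mutually exclusive: enabling both is a config error.
assert not (ds_config["bf16"]["enabled"] and ds_config["fp16"]["enabled"])
```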
I run the following command:

python convert_checkpoint.py --model_dir /Qwen1.5-32B-Chat/ --dtype bfloat16 --output_dir /Qwen1.5-32B/trt_ckpts/bf16/1-gpu/

error: the config.json for Qwen1.5-32B-Chat may be causing this problem.
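One way to narrow this down is to sanity-check config.json for the fields a checkpoint converter typically reads. The sample dict and the required-key list below are illustrative assumptions, not the actual Qwen1.5-32B-Chat config or the actual keys convert_checkpoint.py requires:

```python
# Hedged sketch: verify that a HF-style model config carries the basic
# fields a converter usually depends on. Values are made up for illustration.
sample_config = {
    "architectures": ["Qwen2ForCausalLM"],
    "hidden_size": 5120,
    "num_attention_heads": 40,
    "num_key_value_heads": 8,
    "torch_dtype": "bfloat16",
}

def check_config(cfg: dict) -> list:
    """Return the names of required keys missing from a model config."""
    required = ["architectures", "hidden_size", "num_attention_heads"]
    return [k for k in required if k not in cfg]

print(check_config(sample_config))  # an empty list means the basics are present
```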
I use GenerationExecutorWorker for a web service, passing the parameter stop_words_list = [["hello, yes"]] by modifying the as_inference_request function in executor.py as follows: the ir parameter is as follows: and then it failed.
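A common pitfall with stop words is passing raw strings where the runtime expects a flattened token-id tensor. As a hedged sketch of the FasterTransformer-style word-list layout (shape [batch, 2, max_len]: row 0 is the concatenated token ids of all stop words, row 1 the cumulative end offsets, both padded with -1), assuming that is the format this executor wants; the token ids below are made up, a real call would tokenize "hello, yes" first:

```python
# Hypothetical packing of per-request stop words into the flattened
# [batch, 2, max_len] word-list format; token ids are illustrative only.
def pack_stop_words(batch_stop_ids):
    """batch_stop_ids: per request, a list of stop words, each a list of token ids."""
    packed = []
    for stop_ids in batch_stop_ids:
        flat, offsets, total = [], [], 0
        for ids in stop_ids:
            flat.extend(ids)          # row 0: concatenated token ids
            total += len(ids)
            offsets.append(total)     # row 1: cumulative end offsets
        max_len = max(len(flat), len(offsets))
        flat += [-1] * (max_len - len(flat))
        offsets += [-1] * (max_len - len(offsets))
        packed.append([flat, offsets])
    return packed

print(pack_stop_words([[[15339, 11, 7566]]]))
# → [[[15339, 11, 7566], [3, -1, -1]]]
```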
hello, in axis_aligned_target_assigner.py, as follows: the idx in line 86 means the index into the ground-truth boxes across all three classes. But in the following code: the gt_ids...
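The mismatch being described can be sketched in isolation: an index into the full ground-truth array (all classes) is not the same as an index into a per-class filtered view, so local indices must be mapped back to global ones. The names below are illustrative, not the actual axis_aligned_target_assigner.py code:

```python
# Hypothetical per-class index remapping: a per-class assigner sees only
# the gt boxes of one class, so its local index i corresponds to the
# global index mapping[i] in the full gt array.
gt_classes = ["car", "ped", "car", "cyc", "ped"]

def class_local_to_global(gt_classes, cls):
    """Map each per-class (local) index back to its global gt index."""
    return [i for i, c in enumerate(gt_classes) if c == cls]

print(class_local_to_global(gt_classes, "ped"))
# → [1, 4]  (local index 0 -> global 1, local 1 -> global 4)
```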
Running inference with /TensorRT-LLM/examples/run.py works:

mpirun -n 4 -allow-run-as-root python3 /load/trt_llm/TensorRT-LLM/examples/run.py \
  --input_text "hello,who are you?" \
  --max_output_len=50 \
  --tokenizer_dir /load/Qwen1.5-32B-Chat/ \
  --engine_dir=/load/output/trt_llm/trt_engines_qw32/f16_sq0.5_4gpu/

but it fails when using TensorRT-LLM/examples/apps/fastapi_server.py...