intellinjun

Results: 12 comments of intellinjun

`model = AutoModelForCausalLM.from_pretrained(model_path, device_map='cpu', torch_dtype=torch.float16, quantization_config=woq_config, trust_remote_code=True, use_neural_speed=False)` Do you want to use Neural Speed? If so, set `use_neural_speed=True`.
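A minimal sketch of the suggested change, assuming the ITREX weight-only quantization API of that release; the model path and the `WeightOnlyQuantConfig` settings below are placeholders, not the reporter's actual values:

```python
import torch
from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig

model_path = "meta-llama/Llama-2-7b-hf"                   # placeholder; use your own model path
woq_config = WeightOnlyQuantConfig(weight_dtype="int4")   # placeholder weight-only quant config

# Same call as in the snippet above, with use_neural_speed flipped to True so the
# Neural Speed backend is used instead of the plain PyTorch path.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map='cpu',
    torch_dtype=torch.float16,
    quantization_config=woq_config,
    trust_remote_code=True,
    use_neural_speed=True,
)
```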

> C:\Windows\system32>D:\o\1\run_whisper.exe -l zh -m D:\o\1\whisper_gpu_int8_gpu-cuda_model.onnx -f D:\o\1\1.wav -osrt
> whisper_init_from_file_no_state: loading model from 'D:\o\1\whisper_gpu_int8_gpu-cuda_model.onnx'
> whisper_model_load: loading model
> NE_ASSERT: E:\whisper_opt\intel_extension_for_transformers\llm\runtime\graph\core\ne_layers.c:643: wtype != NE_TYPE_COUNT

This method uses the cpp model for...

@murilocurti The latest Neural Speed already supports Phi-2; you can try it now.
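A short sketch of one way to try it with the Neural Speed Python API; the Hugging Face model id `microsoft/phi-2`, the prompt, and the quantization dtypes below are assumptions, not something from the original thread:

```python
from transformers import AutoTokenizer, TextStreamer
from neural_speed import Model

model_name = "microsoft/phi-2"   # assumed Hugging Face model id for Phi-2
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids

# Convert/quantize the model on first use, then generate through Neural Speed.
model = Model()
model.init(model_name, weight_dtype="int4", compute_dtype="int8")
outputs = model.generate(inputs, streamer=TextStreamer(tokenizer), max_new_tokens=128)
```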

@RachelShalom please add `-b 2048` to `python scripts/inference.py --model_name llama -m llama_files/ne_llama_int4.bin -c 1500 -n 400 --color -p "$PROMPT1"`
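For reference, the full command with the flag added would look like this; the paths, other flag values, and `$PROMPT1` are taken from the command quoted above, not verified independently:

```sh
# Same command as in the report, with -b 2048 appended as suggested.
python scripts/inference.py --model_name llama -m llama_files/ne_llama_int4.bin \
    -c 1500 -n 400 -b 2048 --color -p "$PROMPT1"
```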

`--keep` should be the number of leading tokens preserved when the context is cut off by [streaming LLM](https://github.com/intel/neural-speed/blob/main/docs/infinite_inference.md). As for the second problem, we haven't developed that feature yet.
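Purely as an illustration of where `--keep` fits, assuming it is passed to the same inference script as in the earlier command (the value 4 is hypothetical):

```sh
# Hypothetical example: preserve the first 4 tokens (e.g. BOS/system tokens)
# whenever streaming LLM discards older context; other flags mirror the command above.
python scripts/inference.py --model_name llama -m llama_files/ne_llama_int4.bin \
    -c 1500 -n 400 --keep 4 --color -p "$PROMPT1"
```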


Accuracy (acc) extension test result: https://inteltf-jenk.sh.intel.com/job/neural_speed_extension/159/artifact/report.html

@irjawais Can you check the memory usage when converting the model? From your description, it seems that there may be insufficient memory.
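If it helps, here is a small sketch (using `psutil`, not part of Neural Speed) for watching system memory while the conversion runs in another terminal:

```python
import time
import psutil

# Print memory headroom once per second; if "available" drops toward zero while
# the model is being converted, the process is likely running out of RAM.
while True:
    mem = psutil.virtual_memory()
    print(f"used {mem.used / 1e9:.1f} GB / {mem.total / 1e9:.1f} GB "
          f"(available {mem.available / 1e9:.1f} GB)")
    time.sleep(1)
```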

@bil-ash Thank you for your suggestion; we will assess the needs internally and get back to you as soon as possible.

> @LJ-underdog are you still working on this?

Yes, this PR is a preparation for mate.