bug: Qwen3-Embedding model doesn't work on Vulkan
Issue description
Qwen3-Embedding-8B-Q4_K_M.gguf (downloaded from Hugging Face) works with gpu: false (CPU) but not when run on Vulkan.
Expected Behavior
await embedContext.getEmbeddingFor(...) returns successfully, or throws an error that can be understood and handled.
Actual Behavior
Logs:
Embedding model loaded, priming with large data.
[node-llama-cpp] state_write_data: writing state
[node-llama-cpp] state_write_data: - writing model info
[node-llama-cpp] state_write_data: - writing output ids
[node-llama-cpp] state_write_data: - writing logits
[node-llama-cpp] state_write_data: - writing embeddings
[node-llama-cpp] state_write_data: - writing memory module
[node-llama-cpp] init: embeddings required but some input tokens were not marked as outputs -> overriding
[node-llama-cpp] output_reserve: reallocating output buffer from size 0.59 MiB to 152.11 MiB
[node-llama-cpp] init: embeddings required but some input tokens were not marked as outputs -> overriding
[node-llama-cpp] init: embeddings required but some input tokens were not marked as outputs -> overriding
D:/a/node-llama-cpp/node-llama-cpp/llama/llama.cpp/src/llama-context.cpp:622: fatal error
Exit code -1073740791 (0xC0000409)
I get the impression that the init: embeddings required but some input tokens were not marked as outputs -> overriding warnings can be safely ignored; I ignore them with the chat version of the Qwen model and they haven't caused issues there (and despite my best efforts, I haven't found a way to fix that warning).
Steps to reproduce
I've omitted the text I used to test; I just copied text from my internal wiki. You can probably copy text from any available wiki for the test; as long as it's between 2000 and 3000 characters it should suffice. Running the embedder on a sample of ~100 tokens seems to work (a short sketch of that case follows the repro script below), so I suspect the crash is related to the output buffer reallocation.
I did manage to get a similar setup to work reliably on the CPU (but very slowly) on a forked thread in my development program, but I haven't been able to replicate that setup in my sample test. The only differences I can think of are that {gpu: false} was passed to getLlama, the context size was 8192, and the batch size wasn't set.
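For reference, a minimal sketch of that CPU-only configuration (the model path mirrors the repro below; the other values are as described above, so treat this as an approximation of my dev setup rather than an exact copy):

```ts
import {getLlama} from "node-llama-cpp";

// CPU-only setup that embeds reliably, just very slowly.
const llamaCpu = await getLlama({gpu: false});
const cpuModel = await llamaCpu.loadModel({
    modelPath: "../Models/Qwen3-Embedding-8B-Q4_K_M.gguf"
});
// contextSize 8192, batchSize left unset (unlike the Vulkan repro below).
const cpuEmbedContext = await cpuModel.createEmbeddingContext({contextSize: 8192});
```

The Vulkan repro script that crashes is below.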
import { getLlama, Llama, LlamaLogLevel } from "node-llama-cpp";

(async function main() {
    let llamacpp = await getLlama();
    llamacpp.logLevel = LlamaLogLevel.debug;

    let embedModel = await llamacpp.loadModel({
        modelPath: "../Models/Qwen3-Embedding-8B-Q4_K_M.gguf",
        useMlock: true,
        useMmap: true,
        gpuLayers: "auto",
        metadataOverrides: {
            general: {}
        }
    });

    let embedContext = await embedModel.createEmbeddingContext({contextSize: 4096, batchSize: 256 /*, threads: 4*/});
    console.log("\nEmbedding model loaded, priming with large data.");

    await embedContext.getEmbeddingFor(`
        A really long text of 2000+ characters.
    `.trim()); // pre-reserve big outputs/KV
    console.log("Embedding model primed.");
})();
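For contrast, a minimal sketch of the short-input case mentioned above, using the embedContext from the script (so it would sit inside the same async function); the placeholder text is assumed to be around 100 tokens:

```ts
// Placeholder input of roughly 100 tokens; inputs of this size complete fine even on Vulkan.
const shortText = "A short sentence of roughly ten tokens, repeated a handful of times. ".repeat(8);
await embedContext.getEmbeddingFor(shortText); // works; only the 2000+ character input crashes
```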
package.json
{
    "type": "module",
    "scripts": {
        "start": "node vulcan.js",
        "test": "echo \"Error: no test specified\" && exit 1",
        "build": "tsc --build",
        "rebuild": "tsc --build --force"
    },
    "dependencies": {
        "@types/node": "^18.0.0",
        "node-llama-cpp": "^3.14.2",
        "typescript": "^5.4.5"
    }
}
tsconfig.json
{
    "compileOnSave": true,
    "compilerOptions": {
        "module": "ESNext", /* Specify what module code is generated. */
        "target": "ES2020", /* Set the JavaScript language version for emitted JavaScript and include compatible library declarations. */
        "moduleResolution": "node",
        "types": ["node"],
        "sourceMap": true,
        "esModuleInterop": true, /* Emit additional JavaScript to ease support for importing CommonJS modules. This enables 'allowSyntheticDefaultImports' for type compatibility. */
        "forceConsistentCasingInFileNames": true, /* Ensure that casing is correct in imports. */
        "strict": true, /* Enable all strict type-checking options. */
        "skipLibCheck": true, /* Skip type checking all .d.ts files. */
        "useDefineForClassFields": false,
        "experimentalDecorators": true,
        "emitDecoratorMetadata": false
    }
}
My Environment
| Dependency | Version |
|---|---|
| Operating System | Windows 11 Pro (10.0.22621) |
| CPU | Intel i7-11700 |
| Node.js version | 20.11.1 |
| Typescript version | 5.4.5 |
| node-llama-cpp version | 3.14.2 |
npx --yes node-llama-cpp inspect gpu output:
OS: Windows 10.0.22621 (x64)
Node: 20.11.1 (x64)
node-llama-cpp: 3.14.2
Prebuilt binaries: b6845
Vulkan: available
Vulkan device: AMD Radeon RX 6800 XT
Vulkan used VRAM: 4.88% (786.42MB/15.73GB)
Vulkan free VRAM: 95.11% (14.97GB/15.73GB)
CPU model: 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz
Math cores: 0
Used RAM: 70.61% (22.5GB/31.86GB)
Free RAM: 29.38% (9.36GB/31.86GB)
Used swap: 51.28% (33.78GB/65.86GB)
Max swap size: 65.86GB
mmap: supported
Additional Context
A slightly different setup results in the program crashing outright, without the error being thrown.
Relevant Features Used
- [ ] Metal support
- [ ] CUDA support
- [x] Vulkan support
- [ ] Grammar
- [ ] Function calling
Are you willing to resolve this issue by submitting a Pull Request?
No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.
The following works (I have no idea why):
let embedContext = await model.createEmbeddingContext({contextSize: 8192 , batchSize: 1024, threads: 4});
This doesn't work:
let embedContext = await model.createEmbeddingContext({contextSize: 8192});
@Griffork I tried reproducing this issue on my end, but I didn't encounter any misbehavior or crash.
I've written the code below to try to reproduce the issue; you can run it with npx --yes vite-node ./repro.ts.
I'd love it if you could run it and modify it until it reproduces the exact issue you're facing:
import {getLlama, resolveModelFile} from "node-llama-cpp";

const modelPath = await resolveModelFile("hf:Qwen/Qwen3-Embedding-8B-GGUF:Q4_K_M");

const llama = await getLlama({
    gpu: "vulkan"
});
console.log("gpu", llama.gpu);

console.log("Loading model:", modelPath);
const model = await llama.loadModel({
    modelPath,
    useMlock: true,
    useMmap: true
});

const desiredEmbeddingTokenLength = 8000;

console.log("Creating embedding context");
const context = await model.createEmbeddingContext({
    contextSize: 8192
    // contextSize: 4096,
    // batchSize: 256
});

function createTextOfTokenLength(destLength: number, baseText: string, maxOff: number = 10) {
    const baseTokenLength = model.tokenize(baseText).length;
    let resText = baseText.repeat(Math.ceil(destLength / baseTokenLength) + 1);

    let currentRestTextTokenLength = model.tokenize(resText).length;
    while (currentRestTextTokenLength > destLength || destLength - currentRestTextTokenLength > maxOff) {
        if (currentRestTextTokenLength > destLength)
            resText = resText.substring(0, resText.length - 1);
        else
            resText += baseText;

        currentRestTextTokenLength = model.tokenize(resText).length;
    }

    return resText;
}

const textToEmbed = createTextOfTokenLength(desiredEmbeddingTokenLength, "Geography is the study of places and the relationships between people and their environments. ");
console.log("Word count:", textToEmbed.split(" ").length);
console.log("Token length:", model.tokenize(textToEmbed).length);

console.log("Generating embedding...");
const startTime = performance.now();
const embedding = await context.getEmbeddingFor(textToEmbed);
const endTime = performance.now();
console.log(`Generated embedding in ${Math.floor(endTime - startTime)}ms`);
Thanks for looking into it. Seems yours is working fine, and mine is not. Maybe my model download has bad defaults?
It's possible that the model file you used has incorrect metadata, which makes it unusable without manual configuration. This isn't a widespread issue, but it can happen with some quantizations found in the wild. If you give me a link to the specific model you used, I can take a look.
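For illustration, a hedged sketch of how such metadata could be overridden at load time using the metadataOverrides option (which the original repro already passes to loadModel); the keys are placeholders, since the actual fix would depend on what an inspection of the file reveals:

```ts
// Hypothetical sketch: inspect the GGUF file first (e.g. `npx --yes node-llama-cpp inspect gguf <model path>`),
// then override only the metadata values that look wrong. The empty object is a placeholder, not a known fix.
const patchedModel = await llama.loadModel({
    modelPath,
    metadataOverrides: {
        // nested keys mirroring the GGUF metadata structure would go here
    }
});
```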