Inference fails on Windows with non-AVX CPU (Intel N6000)
Hello,
I am attempting to run the BitNet model on a Windows 11 machine with an Intel N6000 CPU, which does not have AVX/AVX2 support. The installation completes, but inference results in a repeating character output (e.g., "GGGGGG...").
Key Findings:
- This behavior is reproducible on my Intel N6000 machine.
- I can successfully compile and run the same model on a Raspberry Pi 4 B, which proves that AVX is not a fundamental requirement for the model's logic. This suggests the bug is specific to the Windows x86 non-AVX build.
Steps to Reproduce:
- On a Windows machine with a non-AVX CPU (e.g., Intel N6000), follow the standard installation instructions.
- During the build process, a compilation error occurs in `3rdparty/llama.cpp/common/common.cpp` due to a missing header. Adding `#include <chrono>` fixes this initial error (see the sketch after this list).
- The project then compiles successfully.
- Running inference with a command like `python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf -p "Once upon a time"` results in repeating character output.
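For reference, this is roughly what the header fix looks like; a minimal sketch, assuming the error comes from `std::chrono` usage in `common.cpp` (only the `<chrono>` line is the actual change, the surrounding includes are illustrative):

```cpp
// 3rdparty/llama.cpp/common/common.cpp -- top of file (illustrative include list)
#include "common.h"

#include <chrono>   // added: the file uses std::chrono but did not include this header
#include <string>
#include <vector>
```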
What I've Tried:
- Compiling with the default settings.
- Forcing a build with `-DLLAMA_SSE4_2=ON`.
- Forcing a generic build with no flags.
All of these configurations compile successfully but produce the same incorrect inference output. The system_info log confirms that AVX is disabled.
This seems to be a bug in the x86 fallback code path when compiled with the Windows toolchain.
System Environment:
- Operating System: Windows 11
- CPU: Intel N6000 (non-AVX, supports up to SSE4.2)
- Compilation Toolchains Attempted:
  - MSYS2 with UCRT64 (GCC 15.1.0)
  - MSYS2 with CLANG64 (Clang 20.1.7)
  - Visual Studio 2022 Developer Command Prompt (ClangCL 19.1.5)
Problem Description:
When compiling bitnet.cpp on a non-AVX x86 Windows machine, the build process completes successfully but produces a non-functional executable. When running inference with the resulting llama-cli.exe, the model outputs repetitive garbage text (e.g., "GGGGGGGG..." or "cluster mass cluster...").
Crucially, the system_info log from the compiled executable shows that the specialized BitNet math engine is disabled (MATMUL_INT8 = 0), despite the CPU's SSE3/SSSE3 support being correctly identified (SSE3 = 1, SSSE3 = 1).
This issue does not occur when compiling and running the same model on a Raspberry Pi 4 (ARM), which proves the model file itself is valid. The issue is specific to the Windows x86 build process.
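To rule out a misdetected CPU, the instruction sets the N6000 actually exposes can be checked independently of ggml. Below is a minimal standalone sketch (not part of bitnet.cpp) using the GCC/Clang builtin `__builtin_cpu_supports`; on this machine it should report SSE3/SSSE3/SSE4.2 as present and AVX/AVX2 as absent, consistent with the `system_info` output further down.

```cpp
// cpu_check.cpp -- standalone sketch, not part of bitnet.cpp.
// Build with the MinGW toolchain:  g++ -O2 cpu_check.cpp -o cpu_check
#include <cstdio>

int main() {
    __builtin_cpu_init();  // populate the feature table used by __builtin_cpu_supports
    std::printf("sse3   : %d\n", __builtin_cpu_supports("sse3")   ? 1 : 0);
    std::printf("ssse3  : %d\n", __builtin_cpu_supports("ssse3")  ? 1 : 0);
    std::printf("sse4.2 : %d\n", __builtin_cpu_supports("sse4.2") ? 1 : 0);
    std::printf("avx    : %d\n", __builtin_cpu_supports("avx")    ? 1 : 0);
    std::printf("avx2   : %d\n", __builtin_cpu_supports("avx2")   ? 1 : 0);
    return 0;
}
```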
Steps to Reproduce:
This outlines the cleanest manual build process that demonstrates the issue.
- Install Prerequisites:
  - Install MSYS2 and the UCRT64 toolchain:
    `pacman -S --needed base-devel mingw-w64-ucrt-x86_64-toolchain mingw-w64-ucrt-x86_64-cmake git`
- Clone the Repository:
  - Open the UCRT64 terminal.
  - Clone the repository and its submodules:
    `git clone --recursive https://github.com/microsoft/bitnet.cpp.git`
    `cd bitnet.cpp`
- Configure the Build:
  - Create a build directory:
    `mkdir build`
    `cd build`
  - Run CMake, disabling AVX and attempting to manually enable the required x86 BitNet kernels:
    `cmake .. -G "MinGW Makefiles" -DGGML_AVX=OFF -DGGML_AVX2=OFF -DGGML_AVX512=OFF -DBITNET_X86_TL2=ON`
- Compile the Code:
  `cmake --build . --config Release`
- Run Inference:
  `./bin/llama-cli.exe -m "path/to/ggml-model-i2_s.gguf" -p "Once upon a time" -n 128 --no-mmap`
Expected Behavior:
- The model should generate coherent text.
- The `system_info` log should show `MATMUL_INT8 = 1`, indicating the specialized BitNet math kernels are enabled.
Actual Behavior:
- The model outputs repetitive garbage text.
- The `system_info` log consistently shows `MATMUL_INT8 = 0`.
Log Snippet:
system_info: n_threads = 4 (n_threads_batch = 4) / 4 | AVX = 0 | ... | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
...
Once upon a timeGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
Investigation and Analysis:
A comprehensive debugging process was undertaken to isolate the root cause. The following potential issues were systematically ruled out:
- Faulty Model File: The exact same GGUF model file works perfectly when compiled and run on a Raspberry Pi 4 (ARM), proving the model data is valid.
- Compiler Choice: The issue persists identically across three different toolchains: MinGW GCC, MinGW Clang, and the official Visual Studio ClangCL toolchain.
- Build Environment: Switching from the older `mingw64` MSYS2 environment to the modern `UCRT64` environment successfully enabled SSE3/SSSE3 detection but did not solve the `MATMUL_INT8` issue.
- Build Flags: Manually setting `-DBITNET_X86_TL2=ON` during CMake configuration has no effect on the final executable; `MATMUL_INT8` remains `0` (a direct check is sketched after this list).
- Build Script Logic: Manually editing the top-level or submodule `CMakeLists.txt` files to force-set the `BITNET_X86_TL2` or `GGML_BITNET_X86_TL2` variables also had no effect.
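As a direct check on the Build Flags point, the reporting function can be called from a tiny program linked against the freshly built ggml library, bypassing `llama-cli.exe` entirely. This is only a sketch: the declaration is copied into the file so it stays self-contained, and the library path in the build comment is an assumption about the tree layout.

```cpp
// matmul_int8_check.cpp -- sketch. Example link line (the library path is an assumption):
//   g++ matmul_int8_check.cpp path/to/build/libggml.a -o matmul_int8_check
#include <cstdio>

// Declared in ggml.h in this vintage of llama.cpp; repeated here to keep the sketch self-contained.
extern "C" int ggml_cpu_has_matmul_int8(void);

int main() {
    // On the Windows x86 builds described above this prints 0, regardless of the CMake flags used.
    std::printf("MATMUL_INT8 = %d\n", ggml_cpu_has_matmul_int8());
    return 0;
}
```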
Conclusion and Root Cause:
The root cause was identified by inspecting the C++ source code in 3rdparty/llama.cpp/ggml/src/ggml.c. The function responsible for reporting MATMUL_INT8 support is:
int ggml_cpu_has_matmul_int8(void) {
#if defined(__ARM_ARCH)
return ggml_arm_arch_features.has_i8mm;
#else
return 0;
#endif
}
This code explicitly shows that the MATMUL_INT8 feature is only compiled for ARM architectures. For all other architectures, including x86, the function returns 0.
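The same conclusion can be reached from the preprocessor's point of view: on any of the Windows x86 toolchains listed above, `__ARM_ARCH` is never defined, so the function compiles down to `return 0;` no matter which SSE/AVX options are chosen. Below is a small standalone sketch (my own, not from the repository) that prints what the compiler defines for the target; build it with the same architecture flags as the ggml build (e.g. `-march=native`) so it mirrors what `ggml.c` sees.

```cpp
// target_macros.cpp -- standalone sketch: prints which architecture/SIMD macros the
// compiler defines for this target, i.e. what the #if checks in ggml.c can see.
// Build with the same arch flags as ggml, e.g.:  g++ -march=native target_macros.cpp -o target_macros
#include <cstdio>

int main() {
#if defined(__ARM_ARCH)
    std::puts("__ARM_ARCH defined     -> the ARM i8mm branch is compiled in");
#else
    std::puts("__ARM_ARCH not defined -> ggml_cpu_has_matmul_int8() is hard-coded to return 0");
#endif
#if defined(__AVX__)
    std::puts("__AVX__ defined");
#else
    std::puts("__AVX__ not defined (expected on the N6000)");
#endif
#if defined(__SSSE3__)
    std::puts("__SSSE3__ defined");
#else
    std::puts("__SSSE3__ not defined");
#endif
    return 0;
}
```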
The issue is not a compilation bug on my end but rather that the high-performance BitNet kernels necessary for correct inference are not implemented or enabled for x86 CPUs in the current version of the project. The build scripts and official documentation do not reflect this platform limitation, leading to a frustrating user experience where a seemingly successful compilation produces a non-functional program.
This report is submitted to inform the developers of this issue and to help other Windows 11 x86 users who may be encountering the same problem. Thank you for your work on this project.