llama.cpp JSON parser differential

The safetensors reference implementation uses serde for JSON parsing of the header. However, in L750 of convert.py, Python’s built-in JSON parser is used when lazy loading safetensors files. This results in a parser differential: these two parsers interpret some files differently, resulting in diverging behaviors. For instance, the safetensors reference implementation rejects JSON inputs with duplicate keys, but Python’s built-in JSON parser keeps the last key-value pair. Therefore, one can have a JSON file rejected by the safetensors reference implementation, but accepted by this conversion script. For more information, this behavior is detailed in TOB-SFTN-7 of the Safetensors audit. The safetensors documentation also recommends methods to parse metadata.

May 26 '23 18:05 suhacker1

What is the real world significance of this? Are there any model files that are affected by it?

May 26 '23 20:05 SlyEcho

The conversion script accepts manipulated or corrupted safetensors files that are rejected by the reference implementation. Even if this isn’t critical right now, this behavior is unclear to users of this library, and the fix (aligning with the serde JSON parser) is straightforward.

Jun 09 '23 21:06 suhacker1

aligning with the serde JSON parser) is straightforward

But serde is a Rust library, this would mean increasing the number or dependencies if some other library is needed.

The Hugging Face doc you linked doesn't mention this at all and they are using regular JSON parsing (via requests).

The audit document actually proposes that safetensors itself uses a different format than JSON.

Jun 09 '23 22:06 SlyEcho

This issue was closed because it has been inactive for 14 days since being marked as stale.

Apr 09 '24 01:04 github-actions[bot]