StringZilla icon indicating copy to clipboard operation
StringZilla copied to clipboard

Bug: Last elements in `basic_charset` initialization are discarded

Open alexbarev opened this issue 1 year ago • 2 comments

Describe the bug

When initializing a basic_charset with character arrays such as those returned by digits() and others, the last element of the array is incorrectly discarded.

https://github.com/ashvardanian/StringZilla/blob/152ed046701aceae2c3d4c995b0b661a9aa06d95/include/stringzilla/stringzilla.hpp#L192-L195

This is due to the use of count_characters - 1 in the loop that populates the basic_charset bitset. https://github.com/ashvardanian/StringZilla/blob/152ed046701aceae2c3d4c995b0b661a9aa06d95/include/stringzilla/stringzilla.hpp#L283-L289

While this prevents including a null terminator for string literals like sz::char_set("x"), it causes incorrect behavior when handling character arrays that do not have a null terminator, resulting in the exclusion of the final character.

I believe this local block is the full extent of the affected code.

https://github.com/ashvardanian/StringZilla/blob/152ed046701aceae2c3d4c995b0b661a9aa06d95/include/stringzilla/stringzilla.hpp#L330-L341

Steps to reproduce

#include "stringzilla/stringzilla.hpp"

namespace sz = ashvardanian::stringzilla;

int main() {
        sz::string haystack = "239";

        // Test with null-terminated string
        assert(haystack.contains_only(sz::char_set("0123456789")));
        // Passes: null terminator is correctly discarded

        // Test with initializer list
        static std::initializer_list all = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'};
        assert (haystack.contains_only(sz::char_set {all}));
        // Passes: constructor for initializer list is called

        // Test with carray
        sz::carray<10> all_digits = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'};
        assert(haystack.contains_only(sz::char_set {all_digits}));
        assert(haystack.is_digit());
        // Fails: '9' is incorrectly discarded
}

Expected behavior

No asserts

StringZilla version

v3.11.0

Operating System

Ubuntu 22.04.5

Hardware architecture

x86

Which interface are you using?

C++ bindings

Contact Details

No response

Are you open to being tagged as a contributor?

  • [X] I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

  • [X] I have searched the existing issues

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

alexbarev avatar Dec 06 '24 23:12 alexbarev

I will be able to write fix and add tests tomorrow

Proposed Fix

  1. Using carray with Null Terminator
inline carray<11> const &digits() noexcept { 
    static carray<11> const all = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '\0'}; 
    return all;
}
  1. Using std::initializer_list<char>
inline auto const &digits() noexcept {
    static std::initializer_list all = {'0', '1', '2', '3', '4', '5', '6', '7', '8', '9'};
    return all;
}

Then the constructor for the initializer list will be called here: https://github.com/ashvardanian/StringZilla/blob/152ed046701aceae2c3d4c995b0b661a9aa06d95/include/stringzilla/stringzilla.hpp#L335-L335

alexbarev avatar Dec 07 '24 00:12 alexbarev

Thank you @alexbarev! Let's start with tests, and later merge a patch.

ashvardanian avatar Dec 07 '24 09:12 ashvardanian