Abstract

Following up on discussion #4455, this pull request adds support for retrieving the start and end positions of nested objects in the input JSON during parsing.

Motivation

We have a service implementation with JSON schema where a field within the nested objects contains the hash value for that object. The service verifies the hash value of each of the nested objects before operating on the rest of the data sent.

For example, consider the following JSON:

{
    "name": "foo",
    "data":
    {
        "type": "typeA",
        "value": 1,
        "details": {
            "nested_type": "nested_typeA",
            "nested_value": 2
        }
    },
    "data_hash": "hashA"
}

Here, data_hash contains the hash of the object "details". To verify it, we need the exact substring of the input from which "details" was parsed, including spaces and newlines. The nlohmann/json parser currently offers no way to retrieve it.

Changes proposed

Add two fields to basic_json: size_t start_position and size_t end_position.
Add a reference to the lexer in json_sax_parser to retrieve the current position in the input string.
Whenever a BasicJsonType is created by the parser, calculate the start and end positions for that object from the original string and store those values.
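To illustrate the idea of recording offsets while scanning, here is a minimal standalone sketch. It is not the code in this pull request: object_spans and raw_slice are hypothetical helpers, the input is assumed to be well-formed, and the half-open [start, end) convention is an assumption made for the example.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper, not part of nlohmann/json: returns the [start, end)
// byte offsets of every JSON object in `text`, in the order the objects close.
// It assumes well-formed input and only tracks braces and string literals.
inline std::vector<std::pair<std::size_t, std::size_t>>
object_spans(const std::string& text)
{
    std::vector<std::pair<std::size_t, std::size_t>> spans;
    std::vector<std::size_t> open;  // offsets of '{' that are not closed yet
    bool in_string = false;

    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const char c = text[i];
        if (in_string)
        {
            if (c == '\\') { ++i; }                    // skip the escaped character
            else if (c == '"') { in_string = false; }
        }
        else if (c == '"') { in_string = true; }
        else if (c == '{') { open.push_back(i); }
        else if (c == '}' && !open.empty())
        {
            spans.emplace_back(open.back(), i + 1);    // end is one past '}'
            open.pop_back();
        }
    }
    return spans;
}

// Returns the raw text of a span, whitespace included.
inline std::string raw_slice(const std::string& text,
                             const std::pair<std::size_t, std::size_t>& span)
{
    return text.substr(span.first, span.second - span.first);
}
```

With offsets like these stored per value, the caller can slice the original input and hash exactly the bytes that were sent.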

Memory considerations

We also considered storing the substrings directly in the output JSON objects and sub-objects. Because of the memory footprint that would add, we opted to store only two size_t fields per basic_json created.
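A rough way to see the per-value cost of the two fields is to compare sizes. The structs below are simplified stand-ins, not the real basic_json layout, so the numbers only indicate the order of magnitude.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified stand-in for a JSON value: a type tag plus an 8-byte payload.
struct plain_value
{
    std::uint8_t type;
    union { void* ptr; double number; std::int64_t integer; } payload;
};

// The same stand-in with the two proposed position fields added.
struct value_with_positions
{
    std::uint8_t type;
    union { void* ptr; double number; std::int64_t integer; } payload;
    std::size_t start_position;
    std::size_t end_position;
};

// Extra bytes each value would carry in this simplified model.
constexpr std::size_t extra_bytes_per_value()
{
    return sizeof(value_with_positions) - sizeof(plain_value);
}
```

On a typical 64-bit platform this model likely grows by 16 bytes per value, and the cost applies to every element of every array and object.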

Validation

We have added tests to the class_parser test suite that cover the following cases:

Array inside an object
Objects inside arrays
Doubly nested objects
String fields
Integer and float fields
Float values with insignificant digits
Boolean fields
Null fields

Since the change affects the sax_parser, for each of these test cases we validate scenarios where no callback is passed, a callback is passed that accepts all fields, and a callback is passed that filters specific fields.

Pull request checklist

Read the Contribution Guidelines for detailed information.

[x] Changes are described in the pull request, or an existing issue is referenced.
[x] The test suite compiles and runs without error.
[x] Code coverage is 100%. Test cases can be added by editing the test suite.
[x] The source code is amalgamated; that is, after making changes to the sources in the include/nlohmann directory, run make amalgamate to create the single-header files single_include/nlohmann/json.hpp and single_include/nlohmann/json_fwd.hpp. The whole process is described here.

Please don't

The C++11 support varies between different compilers and versions. Please note the list of supported compilers. Some compilers like GCC 4.7 (and earlier), Clang 3.3 (and earlier), or Microsoft Visual Studio 13.0 and earlier are known not to work due to missing or incomplete C++11 support. Please refrain from proposing changes that work around these compiler's limitations with #ifdefs or other means.
Specifically, I am aware of compilation problems with Microsoft Visual Studio (there even is an issue label for this kind of bug). I understand that even in 2016, complete C++11 support isn't there yet. But please also understand that I do not want to drop features or uglify the code just to make Microsoft's sub-standard compiler happy. The past has shown that there are ways to express the functionality such that the code compiles with the most recent MSVC - unfortunately, this is not the main objective of the project.
Please refrain from proposing changes that would break JSON conformance. If you propose a conformant extension of JSON to be supported by the library, please motivate this extension.
Please do not open pull requests that address multiple issues.

Nov 25 '24 23:11 sushshring

coverage: 99.639% (+0.005%) from 99.634% when pulling c4d1091b03284a92a29ada37589e565f2e2673e8 on sushshring:develop into 6cb099e30ee90b20255f5f3a07cb89c163abd518 on nlohmann:develop.

Nov 26 '24 05:11 coveralls

Thanks for the effort!

However, adding two size_t members is a lot of overhead. When we introduced diagnostics, we hid a single pointer behind a preprocessor macro so that not every client has to pay for the overhead. Issues like #4514 show that the memory efficiency is already quite bad.

I am hesitant how to continue here.

Nov 26 '24 06:11 nlohmann

I wonder if this is something that could be done with the data in the custom base class, so that only those that want to opt in to this behavior could enable it. That does make it a custom class, and not nlohmann::json or nlohmann::ordered_json.

Nov 26 '24 14:11 gregmarr

Taking that advice, I'm going to add a new class like the one below to use as a JSON custom base class:

class json_base_class_with_start_end_markers {
    size_t start_position = std::string::npos;
    size_t end_position = std::string::npos;

public:
    size_t get_start_position() const noexcept
    {
        return start_position;
    }

    size_t get_end_position() const noexcept
    {
        return end_position;
    }

    void set_start_position(size_t start) noexcept
    {
        start_position = start;
    }

    void set_end_position(size_t end) noexcept
    {
        end_position = end;
    }
};

We will check std::is_base_of<json_base_class_with_start_end_markers, BasicJsonType>::value wherever the start_position and end_position setters are called within json_sax.hpp.
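One caveat: in C++11 a plain if still compiles both branches, so a setter call inside it fails to compile for types that lack the setter. A minimal sketch of one way around that is tag dispatch on std::is_base_of. The value types and the store_positions helper are hypothetical and only stand in for the real parser code.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Simplified copy of the marker base class from the comment above.
class json_base_class_with_start_end_markers
{
    std::size_t start_position = std::string::npos;
    std::size_t end_position = std::string::npos;

public:
    std::size_t get_start_position() const noexcept { return start_position; }
    std::size_t get_end_position() const noexcept { return end_position; }
    void set_start_position(std::size_t start) noexcept { start_position = start; }
    void set_end_position(std::size_t end) noexcept { end_position = end; }
};

// Hypothetical value types standing in for two basic_json specializations.
struct value_with_markers : json_base_class_with_start_end_markers {};
struct value_without_markers {};

// Chosen when T derives from the marker base class.
template<typename T>
void store_positions(T& value, std::size_t start, std::size_t end, std::true_type)
{
    value.set_start_position(start);
    value.set_end_position(end);
}

// Chosen otherwise: there is nothing to store.
template<typename T>
void store_positions(T&, std::size_t, std::size_t, std::false_type) noexcept {}

// Entry point: picks one of the overloads above at compile time.
template<typename T>
void store_positions(T& value, std::size_t start, std::size_t end)
{
    using has_markers =
        typename std::is_base_of<json_base_class_with_start_end_markers, T>::type;
    store_positions(value, start, end, has_markers());
}
```

Only the selected overload is instantiated, so the call compiles for both kinds of type.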

Nov 26 '24 23:11 sushshring

Some comments from my side. I am honestly not 100% sold to the use case this PR is solving. As I stated in #4455 (comment), I think we preserve every relevant part of the raw input except whitespace. Did I miss anything?

Whitespace is key to our use case - my colleague will provide a more detailed explanation in #4455 soon.

Dec 06 '24 22:12 sushshring

Some comments from my side. I am honestly not 100% sold to the use case this PR is solving. As I stated in #4455 (comment), I think we preserve every relevant part of the raw input except whitespace. Did I miss anything?

The main issue we are trying to solve arises when we need to sign the string corresponding to a nested object in a JSON document. For example:

{
    "object_to_sign": {
        "int_value": 23,
        "float_value": 1.234,
        "string_value": "hello"
    },
    "signature": ""
}

In this scenario, calculating the signature using a public key to validate the contents of "object_to_sign" depends on every detail of the string that represents "object_to_sign" in the original payload, including spaces and line breaks. When defining a protocol for this, the sender could be using any JSON library, so it's not possible to simply re-serialize the nlohmann::json object produced by parsing to regenerate the string for "object_to_sign": the result could differ slightly even if its contents are equivalent.

Even though this is a somewhat niche scenario, it mostly exposes information nlohmann already has access to and, as an optional feature, I think it can add value to the library. Doing it outside the library would require parsing the contents of the string a second time, which doubles the complexity and involves two separate JSON parsers where this change allows a single one.
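The point about re-serialization can be shown with a small sketch. FNV-1a is used here only as a stand-in for a real digest or signature scheme, and as_sent and reserialized are hypothetical examples of the same object leaving two different serializers.

```cpp
#include <cstdint>
#include <string>

// FNV-1a (64-bit), a stand-in for whatever digest the protocol really uses.
inline std::uint64_t fnv1a(const std::string& bytes)
{
    std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (const char c : bytes)
    {
        hash ^= static_cast<unsigned char>(c);
        hash *= 1099511628211ULL;                   // FNV prime
    }
    return hash;
}

// The object as the sender wrote it, with spaces after separators.
inline std::string as_sent()
{
    return "{ \"int_value\": 23, \"string_value\": \"hello\" }";
}

// The same object after a parse and a compact dump by the receiver.
inline std::string reserialized()
{
    return "{\"int_value\":23,\"string_value\":\"hello\"}";
}

// True only if both byte sequences hash to the same value.
inline bool digests_match()
{
    return fnv1a(as_sent()) == fnv1a(reserialized());
}
```

Both strings describe the same JSON value, but the digest is taken over bytes, so the receiver has to hash the slice of the original payload.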

Dec 09 '24 18:12 raphgianotti

I think it's going to be difficult to use the macro for handle_start_end_pos_for_json_value due to the unused parameters when the base class is used. It may make sense to write those two functions out individually.

I had another thought about these:

#define HANDLE_START_END_POS_DEFINITION(__handlecase, __statement,...)   \
    template <class Q = BasicJsonType>                                                                                                                  \
    typename std::enable_if<std::is_base_of<::nlohmann::detail::json_base_class_with_start_end_markers, Q>::value, void>::type                          \
    handle_start_end_pos_for_##__handlecase(__VA_ARGS__)                                                                                                \
    {                                                                                                                                                   \
        if (m_lexer_ref)                                                                                                                                \
        {                                                                                                                                               \
            __statement                                                                                                                                 \
        }                                                                                                                                               \
    }                                                                                                                                                   \
    \
    template <class Q = BasicJsonType>                                                                                                                  \
    typename std::enable_if<!std::is_base_of<::nlohmann::detail::json_base_class_with_start_end_markers, Q>::value, void>::type                         \
    handle_start_end_pos_for_##__handlecase(__VA_ARGS__){}

    HANDLE_START_END_POS_DEFINITION(start_object,
    {
        ref_stack.back()->start_position = m_lexer_ref->get_position() - 1;
    })

What if we created two set functions, and put them in the base class, with no-ops in the main class if the base class isn't used? Then the function can always be called, and you don't need to protect the one-line callers, but you would want to protect the more involved ones because of the extra code involved.

struct json_base_class_with_start_end_markers {
    size_t start_position = ...;
    size_t end_position = ...;
    // functions for storing the start and end positions when this is the basic_json base class.
    void set_start_position(size_t i) { start_position = i; }
    void set_end_position(size_t i) { end_position = i; }
};

class basic_json {
    // no-op functions for storing the start and end positions when the base class isn't used.
    typename std::enable_if<!std::is_base_of<detail::json_base_class_with_start_end_markers, decltype(*this)>::value, void>::type
    void set_start_position(size_t){}
    typename std::enable_if<!std::is_base_of<detail::json_base_class_with_start_end_markers, decltype(*this)>::value, void>::type
    void set_end_position(size_t){}

};

    void handle_start_end_pos_for_start_object()
    {
        if (m_lexer_ref != nullptr)
        {
            ref_stack.back()->set_start_position(m_lexer_ref->get_position() - 1);
        }
    }

Dec 09 '24 23:12 gregmarr

@gregmarr I'd considered that, but decided against it to minimize changes to basic_json. Since it's a no-op I'm guessing the compiler might just optimize those away. I will update with that change

Dec 10 '24 18:12 sushshring

Yeah, since this is header-only, the call itself should definitely get eliminated, and then any computations that are just related to that will hopefully be eliminated too.

Dec 10 '24 19:12 gregmarr

@gregmarr After some prototyping I have realized that to correctly enable SFINAE in the derived class, we need to add the following:

    template<class T = CustomBaseClass>
    typename std::enable_if<!std::is_same<T, detail::json_base_class_with_start_end_markers>::value, void>::type
    set_start_position(size_t){}
    template<class T = CustomBaseClass>
    typename std::enable_if<!std::is_same<T, detail::json_base_class_with_start_end_markers>::value, void>::type
    set_end_position(size_t){}
 
    template<class T = CustomBaseClass>
    typename std::enable_if<std::is_same<T, detail::json_base_class_with_start_end_markers>::value, void>::type
    set_start_position(size_t pos)
    {
        reinterpret_cast<detail::json_base_class_with_start_end_markers*>(this)->set_start_position(pos);
    }
    template<class T = CustomBaseClass>
    typename std::enable_if<std::is_same<T, detail::json_base_class_with_start_end_markers>::value, void>::type
    set_end_position(size_t pos)
    {
        reinterpret_cast<detail::json_base_class_with_start_end_markers*>(this)->set_end_position(pos);
    }

The methods need the alternate implementation in order to work with SFINAE. The already implemented versions in the base class do not count for the SFINAE pattern's error ignore. This results in a rather ugly reinterpret_cast<> within basic_json that will certainly violate https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#c153-prefer-virtual-function-to-casting. The safer alternative is to add virtual methods to both base classes and override those in basic_json, which will likely affect performance slightly.
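For comparison, here is a reduced model of the same SFINAE pattern that forwards with static_cast, which is enough for an upcast to a public, unambiguous base. value_model, position_markers and empty_base are hypothetical stand-ins; whether this carries over unchanged to basic_json is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// Stand-in for json_base_class_with_start_end_markers.
struct position_markers
{
    std::size_t start_position = std::string::npos;
    void set_start_position(std::size_t pos) noexcept { start_position = pos; }
};

// Stand-in for a custom base class without position storage.
struct empty_base {};

// Hypothetical, heavily reduced model of basic_json<..., CustomBaseClass>.
template<class CustomBaseClass>
class value_model : public CustomBaseClass
{
public:
    // Enabled when the markers are the base: forward through an upcast.
    template<class T = CustomBaseClass>
    typename std::enable_if<std::is_same<T, position_markers>::value, void>::type
    set_start_position(std::size_t pos) noexcept
    {
        static_cast<T&>(*this).set_start_position(pos);
    }

    // Enabled for any other base: a no-op the compiler can drop.
    template<class T = CustomBaseClass>
    typename std::enable_if<!std::is_same<T, position_markers>::value, void>::type
    set_start_position(std::size_t) noexcept {}
};
```

Callers can then invoke set_start_position unconditionally, and the model with empty_base carries no extra data.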

The macros that I added in the recent commit don't work with C++11. That can be fixed by adding a version of the macros that takes no arguments and drops the varargs, which makes the CI checks happy. I will implement that in my next commit.

Dec 11 '24 00:12 sushshring

A very wild idea: why not adding another macro like DIAGNOSTIC_POSITIONS and implement this without inheritance?

Dec 11 '24 05:12 nlohmann

The methods need the alternate implementation in order to work with SFINAE. The already implemented versions in the base class do not count for the SFINAE pattern's error ignore.

There is no SFINAE for the simple functions with this method.

    void handle_start_end_pos_for_end_array()
    {
        if (m_lexer_ref != nullptr)
        {
            ref_stack.back()->set_end_position(m_lexer_ref->get_position());
        }
    }

    bool end_array()
    {
        bool keep = true;

        if (ref_stack.back())
        {
            keep = callback(static_cast<int>(ref_stack.size()) - 1, parse_event_t::array_end, *ref_stack.back());
            if (keep)
            {
                handle_start_end_pos_for_end_array();

Since object and array do the same thing, that could also be simplified to reduce the number of functions:

    void record_start_pos()
    {
        if (m_lexer_ref != nullptr)
        {
            ref_stack.back()->set_start_position(m_lexer_ref->get_position() - 1);
        }
    }
    void record_end_pos()
    {
        if (m_lexer_ref != nullptr)
        {
            ref_stack.back()->set_end_position(m_lexer_ref->get_position());
        }
    }

why not adding another macro like DIAGNOSTIC_POSITIONS and implement this without inheritance?

That's certainly a possibility. I thought the base class might be cleaner, but it changes the types, so maybe it isn't. We'd need to encode this define into the intermediate namespace selection so that we don't link against something with the wrong definition, as we do for JSON_DIAGNOSTICS.

https://github.com/nlohmann/json/blob/develop/include/nlohmann/detail/abi_macros.hpp#L33-L37
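A minimal sketch of that idea, with made-up names: SKETCH_POSITIONS, SKETCH_ABI_NAMESPACE and the sketch namespace are hypothetical and do not match the library's actual macros. The setting is encoded in an inline namespace, so translation units compiled with different settings refer to differently named types.

```cpp
#include <cstddef>
#include <string>

// Hypothetical switch modelled on the JSON_DIAGNOSTICS idea; off unless defined.
#ifndef SKETCH_POSITIONS
    #define SKETCH_POSITIONS 0
#endif

// Pick a namespace name that reflects the setting.
#if SKETCH_POSITIONS
    #define SKETCH_ABI_NAMESPACE abi_positions
#else
    #define SKETCH_ABI_NAMESPACE abi_plain
#endif

namespace sketch
{
inline namespace SKETCH_ABI_NAMESPACE
{

struct value
{
    int payload = 0;
#if SKETCH_POSITIONS
    // Present only when the feature is switched on.
    std::size_t start_position = std::string::npos;
    std::size_t end_position = std::string::npos;
#endif
};

} // inline namespace SKETCH_ABI_NAMESPACE
} // namespace sketch
```

User code keeps writing sketch::value, while the mangled name contains abi_plain or abi_positions, so mixing both settings should surface as a link error rather than a silent layout mismatch.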

Dec 11 '24 16:12 gregmarr

A very wild idea: why not adding another macro like DIAGNOSTIC_POSITIONS and implement this without inheritance?

This seems feasible, thanks for the note on the intermediate namespace @gregmarr. We are prototyping this locally as an option and will post the next iteration based on our findings.

Dec 11 '24 18:12 sushshring

@nlohmann pushed a new commit that implements the diagnostic style macro. I personally think this looks cleaner as well, PTAL.

Dec 12 '24 00:12 sushshring

The ci_clang_tidy job keeps complaining about the parser callback being passed by value in json_sax, but that’s not really any code we’re modifying in this PR. What’s the suggestion for that?

Dec 12 '24 06:12 sushshring

The ci_clang_tidy job keeps complaining about the parser callback being passed by value in json_sax, but that’s not really any code we’re modifying in this PR. What’s the suggestion for that?

I think everywhere else we pass it as non-const parameter, which seems to be OK for Clang-Tidy.

Dec 12 '24 06:12 nlohmann

🔴 Amalgamation check failed! 🔴

The source code has not been amalgamated. @sushshring Please read and follow the Contribution Guidelines.

Dec 17 '24 19:12 github-actions[bot]

🔴 Amalgamation check failed! 🔴

The source code has not been amalgamated. @sushshring Please read and follow the Contribution Guidelines.

Not sure why it's saying this, i've run amalgamate on the recent commit.

Dec 17 '24 19:12 sushshring

🔴 Amalgamation check failed! 🔴

The source code has not been amalgamated. @sushshring Please read and follow the Contribution Guidelines.

Not sure why it's saying this, i've run amalgamate on the recent commit.

The pipeline was quite busy, so maybe this was just delayed.

Dec 17 '24 20:12 nlohmann

Once the pipeline is through, please check why the coverage went down. Note you can download an artifact from the coverage job which contains HTML pages showing which lines are not covered. As always, coverage information is a bit fuzzy and sometimes you see the closing braces of functions in red which makes no sense. Nonetheless you should make sure every added code is covered by a test.

Dec 17 '24 20:12 nlohmann

The coverage check is also odd. The commit here https://github.com/nlohmann/json/pull/4517/commits/8c67186a32f4cb7f36363c697a507fa1e35068bc has the same coverage, which seemed acceptable to it.

Regardless, I can add one more test that improves the coverage for json_type_t::discarded, but the remaining missing coverage check is for the default switch branch which has an assert(false) since it should never be hit.

Dec 17 '24 20:12 sushshring

These lines can be skipped by adding // LCOV_EXCL_LINE.
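As a small illustration of the marker, assuming an lcov-based coverage job: value_kind and kind_name are hypothetical, and only the unreachable default branch is excluded.

```cpp
#include <cassert>

enum class value_kind { null, object, array };

// Maps a kind to its name; the default branch should never be reached.
inline const char* kind_name(value_kind kind)
{
    switch (kind)
    {
        case value_kind::null:
            return "null";
        case value_kind::object:
            return "object";
        case value_kind::array:
            return "array";
        default:               // LCOV_EXCL_LINE
            assert(false);     // LCOV_EXCL_LINE
            return "invalid";  // LCOV_EXCL_LINE
    }
}
```

lcov also supports LCOV_EXCL_START and LCOV_EXCL_STOP for excluding a whole block.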

Dec 17 '24 20:12 nlohmann

@nlohmann looks like CI is all green. Any other blockers before this can be checked in?

Dec 18 '24 16:12 sushshring

Please update to the latest develop branch which contains a fix for Clang-Tidy, see #4558.

Dec 18 '24 17:12 nlohmann

🔴 Amalgamation check failed! 🔴

The source code has not been amalgamated. @sushshring Please read and follow the Contribution Guidelines.

Dec 18 '24 17:12 github-actions[bot]

Thanks a lot!

Dec 18 '24 21:12 nlohmann

Woot woot 🎉!

Dec 18 '24 21:12 sushshring

@sushshring I reworked the documentation of this feature a bit. Could you please have a look at

https://github.com/nlohmann/json/blob/cleanup/docs/mkdocs/docs/api/macros/json_diagnostic_positions.md
https://github.com/nlohmann/json/blob/cleanup/docs/mkdocs/docs/api/basic_json/start_pos.md
https://github.com/nlohmann/json/blob/cleanup/docs/mkdocs/docs/api/basic_json/end_pos.md

If you have any comments, please add them to https://github.com/nlohmann/json/pull/4560. Thanks!

Jan 08 '25 07:01 nlohmann

json start/end position implementation
