GenSON SchemaBuilder incorrecly marking field as required for copied schema

Here's a reproduction of the issue:

builder_1 = genson.SchemaBuilder()
builder_1.add_object({"a": 0})
builder_1.add_object({"b": 0})

# Make a copy of builder_1.
builder_2 = genson.SchemaBuilder()
builder_2.add_schema(builder_1)

# This passes.
assert builder_1.to_schema() == builder_2.to_schema()

another_schema = {
    '$schema': 'http://json-schema.org/schema#',
    'properties': dict(c={'type': 'integer'}),
    'required': ['c'],
    'type': 'object'
}
builder_1.add_schema(another_schema)
builder_2.add_schema(another_schema)

# This fails, builder_2 marks 'c' as required.
assert builder_1.to_schema() == builder_2.to_schema()

Jul 21 '24 19:07 rsteiger

Issue #25 seems to be rearing its ugly head again. You can find my explanation of the basic reason for why this is happening here. I apparently failed to add docs the README as I had said I would, but that said, I thought I had fixed the problem. Let me look a little deeper here.

Jul 21 '24 22:07 wolverdude

Okay, so it looks like I fixed the problem that was specifically described in issue #25, but while yours is the same underlying problem, it wasn't fixed by that particular solution. Yay for treating symptoms, not causes! Unfortunately, I think we're stuck with this particular cause, so I'll have to stay focused on the symptoms.

What's happening is that GenSON implicitly converts a SchemaBuilder into a dict schema if it gets passed to add_schema(). This is logically simple, but it means that any information not serialized by to_schema() gets lost in the transition. This is actually good, because it forces GenSON not to stay very close to the bare JSON-Schema spec, and thus (hopefully) be more intuitive. But for the arcane reasons outlined in issue #25, the presence of an empty required field is a nonintuitive thing that gets lost.

The fix for issue #25 solved this by ensuring output schemas would always contain the required field if any input schema contained it, even if empty. But that doesn't work in this case because the input schema (builder_1) is itself a SchemaBuilder, and when serialized it "helpfully" leaves out the empty required field because no one passed it an input schema that directed it to do otherwise.

Jul 21 '24 22:07 wolverdude

One way around this is just to use deepcopy to create an actual recursive duplicate of the builder object. But that will only work for fresh objects, not for adding a schema to another schema that already has its own data. In theory, SchemaBuilder itself could do something like this instead of the implicit dict conversion when another SchemaBuilder object is passed to add_schema(), but that would require adding an extra internal-only API all the way down, and I would rather not do that.

Outside of that, I unfortunately don't think this is fixable without seriously messing up the seed schema functionality. If this is a serious problem for your use-case, and you can do without seed schemas, I would suggest creating a custom object SchemaStrategy that inherits from the canonical one and sets this ivar to True instead of False. That will tell it to always include the required key in the output, so then the info that no keys are required won't be lost between builder_1 and builder_2.

Jul 21 '24 23:07 wolverdude

Thank you for investigating. In my case I have a distributed workload where I'm loading a shared schema, adding objects, then updating the shared schema. Now, within a worker, instead of passing a schema around I am passing a builder around and only serializing it when updating the shared schema. It would be nice to not rely on this behavior, but it sounds like a difficult fix.

Jul 22 '24 13:07 rsteiger

Hey @wolverdude 🙋‍♂ I am running into another funny behavior that is related and didn't find any reasons from former explanations. If you multiply the calls to add_schema you are losing the required field. Here is a code snippet:

from genson import SchemaBuilder

builder = SchemaBuilder()
builder.add_schema({"type": "object", "properties": {}})
builder.add_object({"name": "Toto", "age": 5})

print(builder.to_schema())
# {'$schema': 'http://json-schema.org/schema#', 'type': 'object', 'properties': {'name': {'type': 'string'}, 'age': {'type': 'integer'}}, 'required': ['age', 'name']}

builder.add_object({"hobby": "sport"})
print(builder.to_schema())
# {'name': {'type': 'string'}, 'age': {'type': 'integer'}, 'hobby': {'type': 'string'}}}

Hope this is useful

Jan 22 '25 15:01 g-prz

@g-prz This is the intended behavior. Your first object has completely different keys (name, age) than your second one (hobby), so GenSON doesn't list any of these as required or one of the two given objects wouldn't validate under the generated schema. However, once required is empty, it isn't included in the output. If you want to change that, add "required": [] to your initial schema. That will "seed" the SchemaBuilder with the knowledge that it should always include the required in its output.

Though your comment makes me realize this required-dropping behavior is not actually documented in the readme. I'll go fix that.

Jan 25 '25 20:01 wolverdude

I understand, I thought that add_object amends an existing schema and would therefore preserve the former declaration (hence enabling dynamic modifications).

Thank you for the clarification.

Jan 27 '25 13:01 g-prz