Support something like OPT_STRICT_INTEGER for deserialization

Open nwalters512 opened this issue 1 year ago • 5 comments

Consider this code:

import orjson
orjson.dumps(orjson.loads("2800000000000000000"), option=orjson.OPT_STRICT_INTEGER)

This will fail, which is currently the expected behavior. The value is deserialized to int(2800000000000000000), which is >= 2 ** 53, and so cannot be serialized per orjson.OPT_STRICT_INTEGER.

However, the following code succeeds:

# 2 ** 64 == 18446744073709551616
orjson.dumps(orjson.loads("18446744073709551616"), option=orjson.OPT_STRICT_INTEGER)
# b'1.8446744073709552e19'

This is because orjson parses that value as float(18446744073709551616), and it can always serialize floats back to strings, albeit not losslessly.
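The 2 ** 53 boundary mentioned above can be checked in plain Python, without orjson. This is only a minimal sketch: the helper name survives_float_round_trip is made up for this illustration and is not part of any library.

```python
# 2 ** 53 is the point past which IEEE 754 doubles can no longer
# represent every integer exactly.
LIMIT = 2 ** 53


def survives_float_round_trip(n: int) -> bool:
    """Return True if converting n to float and back yields n again."""
    return int(float(n)) == n


# Hypothetical helper, used only to show where precision is lost.
print(survives_float_round_trip(LIMIT))                # True
print(survives_float_round_trip(LIMIT + 1))            # False
print(survives_float_round_trip(2800000000000000000))  # True
```

Some integers above the limit, such as 2800000000000000000, happen to be exactly representable as a double, while others, such as 2 ** 53 + 1, are not. That is why the conversion is lossy in general but harmless for specific values.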

What I'd like is to be able to opt-in to parsing any numbers that are >= 2 ** 53 as floats so that they can be round-tripped through orjson without error. We could do this via roughly the same API as orjson.dumps uses:

import orjson
orjson.loads("2800000000000000000", option=orjson.OPT_STRICT_INTEGER)

We could of course use a different constant, which might be reasonable given that the semantics are different ("strict" could imply failure, but what I really want are just stricter bounds on what is parsed as an int instead of a float).
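Until something like this exists, the requested semantics can be approximated by post-processing the parsed result. The sketch below assumes the data has already been parsed (by orjson.loads or anything else) into ordinary dicts, lists, and scalars; the function name coerce_large_ints and the constant MAX_SAFE are hypothetical and not part of orjson.

```python
MAX_SAFE = 2 ** 53  # assumed threshold, taken from the description above


def coerce_large_ints(value):
    """Recursively turn ints with magnitude >= 2 ** 53 into floats."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; leave it untouched.
        return value
    if isinstance(value, int):
        return float(value) if abs(value) >= MAX_SAFE else value
    if isinstance(value, list):
        return [coerce_large_ints(item) for item in value]
    if isinstance(value, dict):
        return {key: coerce_large_ints(item) for key, item in value.items()}
    return value


parsed = {"number": 2800000000000000000, "small": 5, "flag": True}
print(coerce_large_ints(parsed))
```

This walks the whole structure a second time in Python, which is presumably much slower than handling the decision inside the parser. That cost is part of the motivation for a built-in option.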

Would you be open to a PR implementing this feature? If so, do you have thoughts on what the API should look like?

nwalters512 avatar Apr 11 '24 17:04 nwalters512

Do you have a use case for strictly 53-bit numbers, or is this more that the handling of 64-bit integers, 128-bit integers, and how floats can be coerced is not done well? Does deserializing a number that doesn't have a dot or exponent strictly as an integer handle the concern? In general, stricter behavior and support for 128-bit integers and floats would be good. I think the implementation is likely on the scale of writing a yyjson-like parser in Rust, though. Having the parser pass "raw numbers" would probably be a big regression, and it is definitely more difficult to handle now that it's fallible after the tape has been constructed.

ijl avatar Apr 15 '24 21:04 ijl

I do have a use case for strictly 53-bit integers: I'm interoperating with JavaScript. Specifically, I'm round-tripping JSON to/from JavaScript and Python, and I want to be able to do so losslessly. Here's a sequence where things would fail with OPT_STRICT_INTEGER and the current parsing behavior:

  1. Start with JSON serialized from Python:
>>> orjson.dumps({ "number": 2.8e18 }, option=orjson.OPT_STRICT_INTEGER)
b'{"number":2.8e18}'
  2. Parse that with JavaScript:
> JSON.parse('{"number":2.8e18}')
{ number: 2800000000000000000 }
  3. Serialize back to JSON with JavaScript:
> JSON.stringify({ number: 2800000000000000000 })
'{"number":2800000000000000000}'
  4. Parse again with Python:
>>> data = orjson.loads('{"number":2800000000000000000}')
>>> data
{'number': 2800000000000000000}
  5. Serialize again with Python:
>>> orjson.dumps(data, option=orjson.OPT_STRICT_INTEGER)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: Integer exceeds 53-bit range

If I could add the proposed option on step 4 above, things would work perfectly: the value would be deserialized to float(2.8e18), and I could keep going endlessly through steps 1-4 above without loss of information or encountering errors.
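As a point of comparison, the standard library's json module already exposes a parse_int hook that can express this behavior at parse time. The sketch below uses only json from the standard library, not orjson; the names parse_int_53bit, loads_53bit, and MAX_SAFE are hypothetical, and the proposed orjson option might well behave differently in detail.

```python
import json

MAX_SAFE = 2 ** 53  # assumed threshold, matching the 53-bit limit discussed here


def parse_int_53bit(token: str):
    """Parse a JSON integer token, falling back to float outside the 53-bit range."""
    value = int(token)
    return float(value) if abs(value) >= MAX_SAFE else value


def loads_53bit(text: str):
    # json.loads calls parse_int with the raw text of every JSON integer.
    return json.loads(text, parse_int=parse_int_53bit)


data = loads_53bit('{"number":2800000000000000000,"small":42}')
print(data)
print(json.dumps(data))
```

With this hook, the value from the fourth step arrives as a float, so a later serialization that enforces the 53-bit integer limit would no longer see an out-of-range int.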

As for an implementation, I spent some time looking through the codebase, and while I'm not particularly familiar with either Rust or yyjson, I think perhaps we could hook into things here:

https://github.com/ijl/orjson/blob/632345a8aa56bdbe9a7d2bcc69cda849ebd2683d/src/deserialize/yyjson.rs#L184-L196

parse_i64 and parse_u64 could, in theory, take some options as another argument and internally decide to return either a PyLong or a PyFloat, depending on if val is in the desired range or not. Does this sound at all feasible?

nwalters512 avatar Apr 16 '24 00:04 nwalters512

Bumping this so the stalebot doesn't close this out prematurely!

nwalters512 avatar Apr 22 '24 22:04 nwalters512

I think I follow your use case. I don't follow how adding the option is a real solution given you can specify >53 bit integers as floats, yes. I'm not a numerical computing person and I suppose I don't follow how this doesn't quickly become a much wider issue of having defined behavior for all of it, including 128-bit integers and floats. And because none of that is in the spec it would require understanding what is done across as many libraries as possible and looking for a "most compatible without being incorrect" option.

If this is an important issue for you I would suggest experimenting with your own fork instead of waiting on this. I think I bias to not touching number parsing unless there's a general plan.

ijl avatar Apr 30 '24 20:04 ijl

orjson does currently have defined behavior for 128-bit integers and floats, as far as I can tell. The defined behavior does what I want. For instance, orjson.dumps(88291326719355847026813766449910520462) fails with OverflowError: int too big to convert, and orjson.loads("88291326719355847026813766449910520462") gives me 8.829132671935584e+37. Is there something else you meant by "having defined behavior for all of it, including 128-bit integers and floats"?

I'd be happy to work in a fork to try this out, but I'd be hesitant to maintain a fork indefinitely. Do you see any path to this landing in orjson, and if so, is there anything I can do to make that happen (further examples, more detailed descriptions, a proof-of-concept + tests, etc.)? The performance improvements from using orjson have been extremely promising in early testing, so I'm quite motivated to find a way to make this work!

nwalters512 avatar May 04 '24 17:05 nwalters512

@ijl any chance I could get a response from you so I can know if this is worth pursuing?

nwalters512 avatar May 15 '24 21:05 nwalters512