Replace `PyYAML`
Description
PyYAML parses YAML so that msgspec can validate the data, but it doesn't pass the full yaml-test-suite and fails even on small examples like this:
from msgspec.yaml import decode

print(
    decode(
        "! 15",
        type=str,
        strict=True,
    ),  # msgspec.ValidationError: Expected `str`, got `int`
)
From YAML v1.2 spec:
Example 2.27 Invoice:
from msgspec.yaml import decode

yaml = """\
--- !<tag:clarkevans.com,2002:invoice>
invoice: 34843
date : 2001-01-23
bill-to: &id001
  given : Chris
  family : Dumars
  address:
    lines: |
      458 Walkman Dr.
      Suite #292
    city : Royal Oak
    state : MI
    postal : 48046
ship-to: *id001
product:
- sku : BL394D
  quantity : 4
  description : Basketball
  price : 450.00
- sku : BL4438H
  quantity : 1
  description : Super Hoop
  price : 2392.00
tax : 251.42
total: 4443.52
comments:
  Late afternoon is best.
  Backup contact is Nancy
  Billsmer @ 338-4338."""

print(decode(yaml))
# msgspec.DecodeError: could not determine a constructor for the tag 'tag:clarkevans.com,2002:invoice'
#   in "<unicode string>", line 1, column 5
Also:
from msgspec.yaml import decode

# Example 2.23 Various Explicit Tags
yaml = """\
---
not-date: !!str 2002-04-28

picture: !!binary |
 R0lGODlhDAAMAIQAAP//9/X
 17unp5WZmZgAAAOfn515eXv
 Pz7Y6OjuDg4J+fn5OTk6enp
 56enmleECcgggoBADs=

application specific tag: !something |
 The semantics of the tag
 above may be different for
 different documents."""

print(decode(yaml))
# msgspec.DecodeError: could not determine a constructor for the tag '!something'
#   in "<unicode string>", line 10, column 27
As a replacement, I would like to suggest my library, yaml-rs. (It seems that the signature of dumps() would have to change, which would require breaking changes, so I think only loads() can be replaced for now.)
It passes all yaml-test-suite tests and is also faster than PyYAML (written in Rust and based on saphyr parser).
PyYAML is a YAML 1.1 parser; switching it out for yaml-rs, which is a YAML 1.2 parser, will introduce behavior changes.
Don't get me wrong, I'm all for dropping PyYAML, since it hasn't made any progress toward 1.2 support for the past couple of years, but the original post seems to gloss over the behavior changes this would cause for some.
Related issue: #867
It seems so. Could we make two optional dependencies for the different YAML versions? Something like:
[project.optional-dependencies]
yaml_v_1_1 = [
    "pyyaml",
]
yaml_v_1_2 = [
    "yaml-rs",
]
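If both extras existed, the wrapper would have to decide at runtime which parser is available. A minimal sketch of that check, assuming the importable module names are `yaml_rs` and `yaml` (the names used in the import statements elsewhere in this thread); `pick_yaml_backend` is a hypothetical helper, not part of msgspec:

```python
import importlib.util
from typing import Optional, Sequence


def pick_yaml_backend(
    candidates: Sequence[str] = ("yaml_rs", "yaml"),
) -> Optional[str]:
    """Return the first importable top-level module name, or None.

    Hypothetical helper: the default order (1.2 parser first) is an
    assumption for illustration, not something msgspec does today.
    """
    for name in candidates:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


backend = pick_yaml_backend()
print(backend or "no YAML backend installed")
```

Note that the preference order decides which YAML version a user silently gets when both packages are installed, which is exactly the kind of behavior difference raised above.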
It's up to the maintainers to decide but I have a question as I'm unfamiliar with the YAML format: is 1.2 not backwards compatible with 1.1? If yes, msgspec is still in "beta" given that version 0.Y.Z means that there might be backwards incompatible changes (which begs the question when will there be a 1.0.0 release and what are the expected features), so it would be ok to introduce another library for it; if not, I wouldn't go for it since it would mean breaking more than just the API.
@chirizxc is there a strict requirement as to why dumps() has a different signature? Could this be circumvented somehow?
> It's up to the maintainers to decide but I have a question as I'm unfamiliar with the YAML format: is 1.2 not backwards compatible with 1.1? If yes, msgspec is still in "beta" given that version 0.Y.Z means that there might be backwards incompatible changes (which begs the question when will there be a 1.0.0 release and what are the expected features), so it would be ok to introduce another library for it; if not, I wouldn't go for it since it would mean breaking more than just the API.
There is no backward compatibility, for example: https://yaml.org/spec/1.2.2/ext/changes/
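One commonly cited incompatibility is boolean resolution: the YAML 1.1 bool type accepts words such as `yes`, `no`, `on`, and `off`, while the YAML 1.2 core schema only accepts the `true`/`false` spellings. The sketch below does not call any parser; it only encodes the two word lists from the specs, and `resolves_to_bool` is a hypothetical name. Individual parsers may implement only part of the 1.1 list.

```python
# Plain scalars that resolve to a boolean under each spec version.
YAML_1_1_BOOLS = frozenset({
    "y", "Y", "yes", "Yes", "YES", "n", "N", "no", "No", "NO",
    "true", "True", "TRUE", "false", "False", "FALSE",
    "on", "On", "ON", "off", "Off", "OFF",
})
YAML_1_2_CORE_BOOLS = frozenset({
    "true", "True", "TRUE", "false", "False", "FALSE",
})


def resolves_to_bool(scalar: str, version: str) -> bool:
    """Tell whether an untagged plain scalar is a boolean in that version."""
    if version == "1.1":
        return scalar in YAML_1_1_BOOLS
    if version == "1.2":
        return scalar in YAML_1_2_CORE_BOOLS
    raise ValueError(f"unknown YAML version: {version!r}")


for scalar in ("yes", "off", "true", "NO"):
    print(scalar, resolves_to_bool(scalar, "1.1"), resolves_to_bool(scalar, "1.2"))
```

So a document containing `enabled: yes` would validate against a `bool` field with a 1.1 parser and produce the string `"yes"` with a 1.2 parser.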
> @chirizxc is there a strict requirement as to why dumps() has a different signature? Could this be circumvented somehow?
https://github.com/jcrist/msgspec/blob/a46a2c6f8b5991ebbe90fb1d2a9cb96628fd2311/src/msgspec/yaml.py#L86-L99
and
https://github.com/lava-sh/yaml-rs/blob/main/python/yaml_rs/init.py#L63-L64
~~Hmm, generally speaking, there shouldn't be any differences, but I'm not sure about allow_unicode=True.~~
import yaml
import yaml_rs

data = {
    "русский": "текст на русском",
    "emoji": "😀🎉",
    "chinese": "中文",
}

print(yaml_rs.dumps(data))
print("\n\n")
print(yaml.dump(data, allow_unicode=True, sort_keys=False))
# ---
# русский: текст на русском
# emoji: 😀🎉
# chinese: 中文
#
#
# русский: текст на русском
# emoji: 😀🎉
# chinese: 中文
I would also like to note that some features are not yet fully implemented for dumps() in saphyr: literal style (|) and folded style (>).
In general, it would be reasonable to create separate optional-dependencies for different versions of yaml.🤔
Yes, but the burden of maintaining those dependencies falls on the maintainers.
Besides, there has to be a justification for supporting YAML 1.2 over 1.1, and with pyyaml being the most used library for it (and still not supporting 1.2), this will not make msgspec compatible with other parsing libraries.
In my opinion, considering that all msgspec does for TOML and YAML support is provide a few wrapper functions, introducing additional optional dependencies to support a "newer" version of YAML seems like the introduction of a wrapper hell.
https://github.com/jcrist/msgspec/blob/0.20.0/src/msgspec/yaml.py is ~190 lines, about 90% of which are just comments, imports, or type hints/overloads. That leaves ~20 lines of conversion logic, where the general structure is like this:
def encode(obj, ...):
    data = msgspec.to_builtins(obj, ...)
    return external_lib.dump(data)

def decode(buf, ...):
    obj = external_lib.load(buf)
    return msgspec.convert(obj, ...)
Nothing crazy is going on here, it's all just convenience, but YAML is anything but convenient to work with. Writing your own two helper functions to wrap yaml-rs (or insert your favorite library here :D ) when you know you are dealing with YAML 1.2 probably makes more sense; otherwise this repo will just end up littered with issues like #867.