psych icon indicating copy to clipboard operation
psych copied to clipboard

Ruby 2.1/2.2/2.3 yaml parsing anomaly for comma separated integers

Open guidoiaquinti opened this issue 9 years ago • 6 comments

A string of digits separated by comma (without quotes) is evaluated as integer. Is it intentional?

irb(main):001:0> Psych.load("key: 123,456")
=> {"key"=>123456}

irb(main):002:0> Psych.load("key: 123456,7890")
=> {"key"=>1234567890}

I understand that the first example can be seen as a number in American notation but the 2nd example is not following any notation/standard that I'm aware of. I was expecting to have both of the values evaluated as string.

With the quotes it works as expected:

irb(main):003:0> Psych.load("key: \"123,456\"")
=> {"key"=>"123,456"}

irb(main):004:0> Psych.load("key: \"123456,7890\"")
=> {"key"=>"123456,7890"}

Thanks

guidoiaquinti avatar Feb 25 '16 11:02 guidoiaquinti

Apparently Japan and China will separate at four digits, but to be honest Psych internals don't care, they just remove all commas and cast to a number. The only reason for this is legacy support for Syck documents. I agree that both values should evaluate to strings, but I'm not sure how to do that and support old formats.

tenderlove avatar Mar 07 '17 16:03 tenderlove

I've hit this issue today after trying to parse some YAML output by another tool (kustomize build in my case, but not that relevant). I think as YAML becomes more and more ubiquitous it might be a good idea to provide an option for "strict" parsing according to the standard?

It seems to me that we use those regexp to decide whether or not to parse a Scalar as a Float or Integer: https://github.com/ruby/psych/blob/9f8c3655376aecec6056adb2bb12665971980f9f/lib/psych/scalar_scanner.rb#L11-L20

I did this patch on my side as a workaround, which solved the issue I had:

Psych::ScalarScanner.send(:remove_const, "INTEGER")
Psych::ScalarScanner::INTEGER = /^(?:[-+]?0b[0-1_]+          (?# base 2)
                                    |[-+]?0[0-7_]+           (?# base 8)
                                    |[-+]?(?:0|[1-9][0-9_]*) (?# base 10)
                                    |[-+]?0x[0-9a-fA-F_]+    (?# base 16))$/x

(basically replacing the regexp with a subset of the standard one from from http://yaml.org/type/int.html)

Would it be possible to have say two constants:

INTEGER_ENRICHED=/.../
INTEGER_STANDARD=/.../

And chose one or the other based on a flag to keep backward compatibility but offer a stricter/more standard alternative? I can offer to work on a patch if that can help.

Jell avatar Nov 04 '20 19:11 Jell

We would love to see this fixed as well. The current parsing logic does not conform to the YAML specification, so we believe something like @Jell's proposed solution should be considered. Would the maintainers (@tenderlove) of Psych be amenable to a PR implementing the above solution or similar?

sethboyles avatar Apr 13 '21 21:04 sethboyles

@tenderlove Do you have a stance on whether or not you would accept a PR that adds a flag to change the behavior to something stricter? We are willing to work on this, but don't want to get started if you don't think it would be worthwhile.

sethboyles avatar Aug 17 '21 17:08 sethboyles

@tenderlove Do you have a stance on whether or not you would accept a PR that adds a flag to change the behavior to something stricter? We are willing to work on this, but don't want to get started if you don't think it would be worthwhile.

Yes, I'm definitely open to it. There was a PR submitted here for strict hash keys. I made a commit here that demonstrates how to accomplish it.

Maybe this commit should just be extended to be "strict parsing" or something (rather than just specific to hash keys).

tenderlove avatar Aug 19 '21 18:08 tenderlove

@tenderlove I created a PR that implements this (#537). Let me know what you think!

sethboyles avatar Jan 14 '22 20:01 sethboyles

#537 fixed this issue.

hsbt avatar Oct 20 '22 06:10 hsbt