fast_float

Question: parsing strings containing two related values

Open rowlesmr opened this issue 3 years ago • 4 comments

I'm writing a library for scientific data management, and I need to parse strings such as "+12.345e-02(13)"**, which contains the value 0.12345 and the error 0.00013.

I can see how I can adapt your library to cope with the leading +. How difficult do you think it would be to extend the parsing to include the bracketed digits? At this point in time, I only need to go from char to double.

I don't know if my code would be worthy of putting back into the library, but I can make everything available.

.

** In the grammar I'm following:
SIGN = [+-]
DIGIT = [0-9]
UINT = DIGIT+
INT = SIGN? UINT
EXP = [eE] INT
FLOAT = INT EXP | SIGN? DIGIT* '.' UINT EXP? | INT '.' EXP?
NUMB = INT | FLOAT
NUMERIC = NUMB | NUMB '(' UINT ')'

where + = 1 or more of, * = 0 or more of, ? = 0 or 1 of, | = or

rowlesmr avatar Aug 24 '22 06:08 rowlesmr

Is that a known standard? I am not familiar with your notation.

lemire avatar Aug 24 '22 11:08 lemire

It's a standard scientific representation of a value and its uncertainty. The number in brackets maps onto the rightmost digits in the value. 123(45): 123 and 45; -64.3(12): -64.3 and 1.2; 1.23e3(4): 1230 and 40.

The reference for the grammar is https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax

rowlesmr avatar Aug 24 '22 12:08 rowlesmr

Looks like it could be added as a new template function using a few handfuls of lines. You would need to define a data type corresponding to this format because double and float won't do. If the implementation is, as I expect, quite compact, and you can write a reasonable number of tests to make sure that the code is correct, then it looks like something we could merge.

Note that any pull request you provide should be additive. We don't want to change the existing parser or the existing code. We are very deliberate about the syntax we currently follow. E.g., some users may want to put a + in front of their numbers, but the C++ standard (std::from_chars) does not accept a leading +. If you can write new code that meets your needs, and if this code is small enough to be reviewed and does not risk harming other users (bloat, bugs,...), then it is good. We will put the bar rather high: it needs to be good code because we can't risk breaking this library. So it needs to be clean, efficient and well tested. However, we can be constructive about it.

lemire avatar Aug 24 '22 12:08 lemire

Thanks for your feedback. I'll try to put something together to see if I can make it work, and then start on finessing it.

rowlesmr avatar Aug 24 '22 13:08 rowlesmr