Track the absolute position in the drivers of Parser
In the stream function drivers, the abs-position argument is currently at the end. We should make it the first argument, to be consistent with the stream type and the stream constructors.
Edit: Whatever is there should be fine.
Benchmark regressions need to be checked.
@harendra-kumar referred to this PR here: https://github.com/composewell/streamly/issues/2895 so I thought I'd throw in a few cents.
I don't completely understand this PR, but I would be hesitant about making position tracking a hard-coded thing. It looks like in this PR "position" is just the number of tokens, but I'm not sure that's a useful definition in a lot of cases. Sometimes "line and column" is more useful than just an absolute position, but even that's not always what you want. In a JSON document you might want an alice->bob->array element 9->charlie style position.
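For instance, a path-style position for JSON could be modelled roughly like this (an illustrative sketch, the names are made up):

```haskell
-- Illustrative only: one possible shape for a JSON "position".
data JsonPathSegment
    = ObjectKey String -- e.g. "alice", "bob", "charlie"
    | ArrayIndex Int   -- e.g. element 9
    deriving (Show)

type JsonPosition = [JsonPathSegment]

-- alice -> bob -> array element 9 -> charlie
examplePosition :: JsonPosition
examplePosition =
    [ObjectKey "alice", ObjectKey "bob", ArrayIndex 9, ObjectKey "charlie"]
```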
The following parser transformer (which I mentioned in the issue) seems to achieve position tracking anyway:
```haskell
limitParserInput
    :: forall m a b. Applicative m
    => Int
    -> StreamlyParser.Parser a m b
    -> StreamlyParser.Parser (Int, a) m (Int, b)
limitParserInput maxInputSize
    (StreamlyParser.Parser
        (stepFIn :: s -> a -> m (StreamlyParser.Step s b))
        (initialIn :: m (StreamlyParser.Initial s b))
        (extractFIn :: s -> m (StreamlyParser.Step s b))) =
    StreamlyParser.Parser stepF initial extractF

    where

    stepF :: StrictIntPair s -> (Int, a) -> m (StreamlyParser.Step (StrictIntPair s) (Int, b))
    stepF (StrictIntPair currentSize state) (inputSize, input) =
        let nextSize = currentSize + inputSize
         in case nextSize <= maxInputSize of
                True ->
                    BiFunctor.bimap (StrictIntPair nextSize) (nextSize,)
                        <$> stepFIn state input
                False ->
                    -- this should be more informative
                    pure $ StreamlyParser.Error "Hit parser size limit"

    initial :: m (StreamlyParser.Initial (StrictIntPair s) (Int, b))
    initial = BiFunctor.bimap (StrictIntPair 0) (0,) <$> initialIn

    extractF :: StrictIntPair s -> m (StreamlyParser.Step (StrictIntPair s) (Int, b))
    extractF (StrictIntPair currentSize state) =
        BiFunctor.bimap (StrictIntPair currentSize) (currentSize,)
            <$> extractFIn state
```
To achieve what I think this PR does using the above function, you'll just need to pattern match on the result of stepFIn, and if it fails, add the current position to the error message. If you don't need the logic to fail on maxInputSize, just remove that argument and the corresponding branch in the case statement.
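For example, the error annotation could be a small helper applied to the result of stepFIn (an uncompiled sketch, reusing the qualified names from the snippet above):

```haskell
-- Sketch: attach the running position to a failing step result and leave
-- the other constructors untouched.
annotateError :: Int -> StreamlyParser.Step s b -> StreamlyParser.Step s b
annotateError pos (StreamlyParser.Error msg) =
    StreamlyParser.Error (msg ++ " at position " ++ show pos)
annotateError _ other = other

-- Inside stepF this would compose with the existing bimap, e.g.:
--   BiFunctor.bimap (StrictIntPair nextSize) (nextSize,) . annotateError nextSize
--       <$> stepFIn state input
```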
But this has the following benefits:
- It doesn't introduce a cost on parsers that don't use positions AND
- People can define their own concepts of position
By their own concepts of "position" I mean, for example, that limitParserInput allows one to specify the size of incoming input tokens. But hey, if we were doing JSON parsing, maybe I could have a stack that I push/pop nodes on/off when we hit an opening { or a closing }.
That being said, I'm just starting to use this library, but it seems to me that position tracking can be done with parser transforms alone instead of modifying the library to hard-code the logic everywhere. I may be missing something though.
Hi @clintonmead, thank you for the detailed comment.
I can see one problem with the implementation of limitParserInput. It does not
account for any backtracking.
In the current implementation of parsers, the backtracking information is stored
in the Step and the number representing backtracking is in terms of the number
of tokens.
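For context, the Step type used by the parser step function is roughly of this shape (paraphrased; the Int in the first three constructors is the number of tokens to backtrack):

```haskell
data Step s b
    = Partial Int s  -- backtrack by Int tokens, older buffered input can be dropped
    | Continue Int s -- backtrack by Int tokens, buffered input must be retained
    | Done Int b     -- finished with result b, Int tokens remain unconsumed
    | Error String   -- parse failure with a message
```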
Extending your example (I did not compile and check this):
```haskell
limitParserInput
    :: forall m a b. Monad m
    => Int -> Parser a m b -> Parser a m (Int, b)
limitParserInput maxInputSize (Parser step0 initial0 extract0) =
    Parser step initial extract

    where

    initial = do
        st <- initial0
        pure $ case st of
            IPartial s -> IPartial (0, s)
            IDone b -> IDone (0, b)
            IError _ -> IError (show 0) -- Hacky way to encode the error,
                                        -- limited by the implementation.

    step (i, s) inp = do
        let i1 = i + 1
        case i1 <= maxInputSize of
            True -> do
                res <- step0 s inp
                pure $ case res of
                    Partial back s1 -> Partial back (i1 - back, s1)
                    Continue back s1 -> Continue back (i1 - back, s1)
                    Done back b -> Done back (i1 - back, b)
                    Error _ -> Error $ show i1
            False ->
                -- this should be more informative
                pure $ Error "Hit parser size limit"

    extract (i, s) = do
        res <- extract0 s
        pure $ case res of
            Partial back s1 -> Partial back (i - back, s1)
            Continue back s1 -> Continue back (i - back, s1)
            Done back b -> Done back (i - back, b)
            Error _ -> Error $ show i
```
Although this accounts for backtracking, it only keeps track of the number of tokens; each token effectively has size 1, and we cannot use a custom size per token.
We can, however, solve this by implementing a driver that accounts for the token size, making the token size a first-class citizen.
```haskell
type ParserWithToken a m b = Parser (Int, a) m (Int, b)
```
The driver for this parser would keep track of the token size and the number of tokens. While backtracking, it would also subtract the token sizes appropriately. This can be very useful.
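For illustration, the input stream would just carry a size alongside each token before being handed to such a parser. This is a sketch only; it assumes the Functor instance of Stream and the usual Stream.parse driver (adjust module names to your streamly version):

```haskell
import qualified Streamly.Data.Stream as Stream

-- Pair every token with its size. Here each element counts as 1; for chunked
-- input the size could instead be the chunk/array length.
withTokenSize :: Monad m => Stream.Stream m a -> Stream.Stream m (Int, a)
withTokenSize = fmap (\x -> (1, x))

-- e.g. Stream.parse (limitParserInput 1024 someParser) (withTokenSize input)
-- where limitParserInput is the sketch below.
```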
On a side note: I'll need to check if I'm missing anything in limitParserInput.
If this is enough to keep track of the absolute position, then I would prefer
this over changing the parser drivers.
A custom driver might not be required: if we save the backtracking buffer in the parser combinator itself, the combinator can do the job of the driver.
```haskell
-- Total input size of the i most recently buffered tokens.
backtrackSize :: Int -> [Int] -> Int
backtrackSize i = sum . take i

-- Drop the i most recent token sizes from the backtracking buffer.
backtrackBufBy :: Int -> [Int] -> [Int]
backtrackBufBy i = drop i
```
```haskell
limitParserInput
    :: forall m a b. Monad m
    => Int -> Parser a m b -> Parser (Int, a) m (Int, b)
limitParserInput maxInputSize (Parser step0 initial0 extract0) =
    Parser step initial extract

    where

    initial = do
        st <- initial0
        pure $ case st of
            IPartial s -> IPartial (0, [], s)
            IDone b -> IDone (0, [], b)
            IError _ -> IError (show 0) -- Hacky way to encode the error,
                                        -- limited by the implementation.

    step (accInp, backtrackBuf, s) (inpSize, inp) = do
        let newAccInp = accInp + inpSize
            newBacktrackBuf = inpSize:backtrackBuf
        case newAccInp <= maxInputSize of
            True -> do
                res <- step0 s inp
                pure $ case res of
                    Partial back s1 ->
                        Partial back
                            ( newAccInp - backtrackSize back newBacktrackBuf
                            , backtrackBufBy back newBacktrackBuf
                            , s1
                            )
                    Continue back s1 -> ...
                    Done back s1 -> ...
                    Error _ -> ...
            False ->
                -- this should be more informative
                pure $ Error "Hit parser size limit"

    extract (accInp, backtrackBuf, s) = ...
```
Let me verify if this covers everything and solves the issue and your use case properly. If so, we can implement this combinator or add a new parser driver where the token size is a first-class citizen.
We can evaluate both approaches. About the implicit tracking approach in this PR:
- The position is useful quite often, so always passing it in the driver may not be too bad if the performance impact is not significant. Since it is always there, we can use it to report useful information about the error position in any parser.
- Implicit accounting of position in the driver may be easy to use, as it is not visible everywhere but is there when you need to report it. Though it may have to be manipulated in any intermediate combinators that act like drivers.
- Unboxed parsers are the most commonly used ones. Just byte position tracking makes sense for unboxed parsers as they work on byte array inputs. For generic/boxed parsers the position would be in terms of the input tokens but anyway we cannot look inside the tokens, so that is the only sensible way to report position.
About the proposed explicit tracking:
- This seems less invasive as we do not need to change the existing implementation; we need to try it out to see whether there are any difficult issues with it and how ergonomic it is to use, but the good thing is that it can be tried without many changes.
- The position information is flexible and it can potentially be anything.
- For error position reporting, maybe we can change the `Error` constructor to `Error Int String` (a rough sketch follows this list).
- We need to see how this optimizes in practical cases, and what the observed performance characteristics are.
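For illustration, relative to the Step shape shown earlier, only the Error constructor would change:

```haskell
data Step s b
    = Partial Int s
    | Continue Int s
    | Done Int b
    | Error Int String -- currently Error String; the Int would carry the
                       -- position at which the parser failed
```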
Is there any other potential third way like tracking the context in an underlying state monad?
@clintonmead Assuming parser combinators like the above exist, can you think out loud about a sample use case of error reporting or position tracking?
For a CSV file, we may want to track the position as: Line Number + Part Index. An example with pseudo-code would help to see what the usage would look like.
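For example, I am imagining a position of roughly this shape (illustrative sketch, names made up):

```haskell
-- Hypothetical CSV position: which line we are on, and which field (part)
-- within that line.
data CsvPosition = CsvPosition
    { csvLine :: Int -- line number
    , csvPart :: Int -- field index within the line
    } deriving (Show)

-- A CSV tokenizer could emit (CsvPosition, token) pairs, and a transformer in
-- the style of limitParserInput could thread the position through the parser
-- state and attach it to any Error message.
```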
I am seeing the following perf changes. No allocation changes in Parser, but some CPU regressions:
Data.Parser(cpuTime)
Benchmark default(0)(μs) default(1) - default(0)(%)
------------------------------------------------------------------------------- -------------- --------------------------
All.Data.Parser/o-1-space.some 37.41 +199.72
All.Data.Parser/o-n-heap.takeEQ 37.41 +132.92
All.Data.Parser/o-1-space.deintercalateAll 205.40 +59.07
All.Data.Parser/o-n-heap.lookAhead 65.76 +32.54
All.Data.Parser/o-1-space.streamEqBy 90.22 +24.22
All.Data.Parser/o-1-space.sepByAll (words) 205.46 +18.17
All.Data.Parser/o-1-space.parseIterate (take 1) 808.06 +17.97
All.Data.Parser/o-1-space.parseMany (groupBy (<)) 74.71 +16.73
All.Data.Parser/o-1-space.takeBetween 112.22 +16.70
All.Data.Parser/o-1-space.manyTill 713.63 +15.54
All.Data.Parser/o-n-heap.listEqBy 815.97 +14.03
All.Data.Parser/o-1-space.parseMany groupRollingByEither (Right) 239.55 +13.29
All.Data.Parser/o-1-space.takeBeginBy 76.15 +10.46
All.Data.Parser/o-1-space.splitAp2 76.62 +9.79
All.Data.Parser/o-1-space.parseIterate (take all) 895.20 +9.75
All.Data.Parser/o-1-space.splitApAfter 76.81 +9.56
All.Data.Parser/o-1-space.parseMany groupRollingBy (bound groups) 1387.87 +8.66
All.Data.Parser/o-1-space.parseMany (groupBy (==)) 1350.86 +8.00
All.Data.Parser/o-1-space.alt2parseMany 2752.44 +7.36
All.Data.Parser/o-1-space/filesystem.S.parseMany (Fold.take 1 Fold.sum) 141463.00 +6.90
All.Data.Parser/o-n-space.sequence/100 9168.32 +5.89
All.Data.Parser/o-1-space.parseMany 57.04 +4.92
All.Data.Parser/o-1-space.concatSequence 3262.95 +4.44
All.Data.Parser/o-n-space.sequenceA/100 9314.83 +3.97
All.Data.Parser/o-1-space.parseMany/Unfold/1000 arrays/take 1 894.70 +3.47
All.Data.Parser/o-1-space.alt16 594.02 +3.17
All.Data.Parser/o-1-space.splitAp8 209.14 +3.15
All.Data.Parser/o-1-space.takeWhile 37.39 +3.14
There are significant allocation as well as CPU regressions in ParserK:
Data.ParserK(cpuTime)
Benchmark default(0)(μs) default(1) - default(0)(%)
------------------------------------------ -------------- --------------------------
All.Data.ParserK/o-1-space.monad16 2141.37 +261.17
All.Data.ParserK/o-1-space.takeWhile 2027.39 +256.28
All.Data.ParserK/o-1-space.monad2 2039.30 +247.76
All.Data.ParserK/o-1-space.monad8 2173.41 +242.55
All.Data.ParserK/o-1-space.monad4 2190.79 +235.92
All.Data.ParserK/o-1-space.splitAp8 2195.81 +234.58
All.Data.ParserK/o-1-space.one (recursive) 2520.50 +231.90
All.Data.ParserK/o-1-space.splitAp2 2234.45 +217.17
All.Data.ParserK/o-n-heap.sequence_ 2967.16 +203.91
All.Data.ParserK/o-n-heap.sequenceA_ 3011.03 +196.73
All.Data.ParserK/o-n-heap.sepBy1 7103.05 +97.05
All.Data.ParserK/o-1-space.alt2 10419.60 +85.68
All.Data.ParserK/o-1-space.alt16 126392.00 +57.30
All.Data.ParserK/o-1-space.alt8 60537.80 +52.14
All.Data.ParserK/o-n-heap.manyAlt 12991.30 +42.06
All.Data.ParserK/o-n-heap.sequence 17177.40 +35.06
All.Data.ParserK/o-n-heap.sequenceA 17594.40 +32.23
All.Data.ParserK/o-n-heap.someAlt 20200.40 +22.45
All.Data.ParserK/o-n-heap.choice 1188.96 +9.57
All.Data.ParserK/o-1-space.drain 37.40 -0.14
Data.ParserK(Allocated)
Benchmark default(0)(Bytes) default(1) - default(0)(%)
------------------------------------------ ----------------- --------------------------
All.Data.ParserK/o-1-space.drain 0.00 NaN
All.Data.ParserK/o-1-space.takeWhile 16773126.00 +28.60
All.Data.ParserK/o-1-space.monad2 17908730.00 +26.28
All.Data.ParserK/o-1-space.splitAp2 17908777.00 +26.28
All.Data.ParserK/o-1-space.splitAp8 18838099.00 +25.24
All.Data.ParserK/o-1-space.monad8 18837934.00 +24.69
All.Data.ParserK/o-1-space.monad4 18579719.00 +24.67
All.Data.ParserK/o-1-space.monad16 18941591.00 +23.49
All.Data.ParserK/o-1-space.one (recursive) 21570983.00 +21.41
All.Data.ParserK/o-n-heap.sequenceA_ 25532786.00 +18.45
All.Data.ParserK/o-n-heap.sequence_ 25532786.00 +18.45
All.Data.ParserK/o-n-heap.sepBy1 25991090.00 +17.70
All.Data.ParserK/o-1-space.alt2 43139415.00 +16.45
All.Data.ParserK/o-n-heap.manyAlt 29579237.00 +15.67
All.Data.ParserK/o-n-heap.someAlt 33289419.00 +15.20
All.Data.ParserK/o-n-heap.sequenceA 41488335.00 +11.44
All.Data.ParserK/o-n-heap.sequence 41488335.00 +11.18
All.Data.ParserK/o-1-space.alt8 201287445.00 +10.49
All.Data.ParserK/o-1-space.alt16 411933233.00 +10.03
All.Data.ParserK/o-n-heap.choice 3173156.00 -0.00
No significant changes in Array parsers.