
Track the absolute position in the drivers of Parser

Open adithyaov opened this issue 1 year ago • 2 comments

adithyaov avatar Oct 08 '24 09:10 adithyaov

In the stream function drivers, the abs-position argument is at the end. We should make it the first argument, since stream and the stream constructors do the same.

Edit: Whatever is there should be fine.

adithyaov avatar Oct 10 '24 11:10 adithyaov

Benchmark regressions need to be checked.

adithyaov avatar Oct 12 '24 07:10 adithyaov

@harendra-kumar referred to this PR here: https://github.com/composewell/streamly/issues/2895 so I thought I'd throw in a few cents.

I don't completely understand this PR, but I would be hesitant about making position tracking a hard coded thing. It looks like in this PR "position" is just the number of tokens, but I'm not sure that's a useful definition in a lot of cases. Sometimes "line and column" is more useful than an absolute position, but even that is not always what you want. In a JSON document you might want an alice->bob->array element 9->charlie style position.

The following parser transformer (which I mentioned in the issue) seems to achieve position tracking anyway:

limitParserInput
    :: forall m a b. Applicative m
    => Int
    -> StreamlyParser.Parser a m b
    -> StreamlyParser.Parser (Int, a) m (Int, b)
limitParserInput
    maxInputSize
    (StreamlyParser.Parser
        (stepFIn :: s -> a -> m (StreamlyParser.Step s b))
        (initialIn :: m (StreamlyParser.Initial s b))
        (extractFIn :: s -> m (StreamlyParser.Step s b))) =
    StreamlyParser.Parser stepF initial extractF
    where
    stepF (StrictIntPair currentSize state) (inputSize, input) =
        let nextSize = currentSize + inputSize
        in case nextSize <= maxInputSize of
            True -> BiFunctor.bimap (StrictIntPair nextSize) (nextSize,) <$> stepFIn state input
            False -> pure $ StreamlyParser.Error "Hit parser size limit" -- this should be more informative
    initial = BiFunctor.bimap (StrictIntPair 0) (0,) <$> initialIn
    extractF (StrictIntPair currentSize state) =
        BiFunctor.bimap (StrictIntPair currentSize) (currentSize,) <$> extractFIn state

To achieve what I think this PR does using the above function, you just need to pattern match on the result of stepFIn, and if it fails, add the current position to the error message. If you don't need the logic to fail at maxInputSize, just remove the argument and the corresponding case branch.

But this has the following benefits:

  1. It doesn't introduce a cost on parsers that don't use positions, and
  2. People can define their own concepts of position.

By their own concepts of "position" I mean, for example, that limitParserInput lets one specify the size of each incoming input token. But hey, maybe if we were doing JSON parsing I could keep a stack and push/pop nodes when we hit an opening { and a closing }.
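As a sketch of such a user-defined position (the PathSeg type and render function below are hypothetical illustrations, not streamly API), a JSON-style path could simply be a stack of segments:

```haskell
import Data.List (intercalate)

-- Hypothetical position type: a stack of JSON path segments, newest at
-- the head, pushed/popped as the parser enters/leaves objects and arrays.
data PathSeg = Key String | Index Int

-- Render the stack in "alice->bob->array element 9->charlie" form.
render :: [PathSeg] -> String
render = intercalate "->" . map seg . reverse  -- the stack grows at the head
  where
    seg (Key k)   = k
    seg (Index i) = "array element " ++ show i
```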

That being said, I'm just starting to use this library, but it seems to me that position tracking can be done with parser transformers alone, instead of modifying the library to hard code the logic everywhere. I may be missing something, though.

clintonmead avatar Dec 11 '24 02:12 clintonmead

Hi @clintonmead , thank you for the detailed comment.

I can see one problem with the implementation of limitParserInput. It does not account for any backtracking.

In the current implementation of parsers, the backtracking information is stored in the Step and the number representing backtracking is in terms of the number of tokens.
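For reference, the Step type looks roughly like this (paraphrasing the internal definition used in the snippets below; check the installed streamly version for the exact constructors, and note the backtrackCount helper is only for illustration):

```haskell
-- The Int in each constructor is the backtracking information: the number
-- of input tokens the driver must rewind in its buffered input.
data Step s b
    = Partial Int s   -- succeeded so far; backtrack n tokens
    | Continue Int s  -- needs more input; backtrack n tokens
    | Done Int b      -- finished; return the last n tokens to the driver
    | Error String    -- failed; currently carries no position information

-- Illustrative helper: how many tokens a result asks the driver to rewind.
backtrackCount :: Step s b -> Int
backtrackCount (Partial n _)  = n
backtrackCount (Continue n _) = n
backtrackCount (Done n _)     = n
backtrackCount (Error _)      = 0
```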

Extending your example (I did not compile-check this):

limitParserInput
    :: forall m a b. Monad m
    => Int -> StreamlyParser.Parser a m b -> StreamlyParser.Parser a m (Int, b)
limitParserInput maxInputSize (Parser step0 initial0 extract0) =
    Parser step initial extract
    where
    initial = do
        st <- initial0
        pure $ case st of
            IPartial s -> IPartial (0, s)
            IDone b -> IDone (0, b)
            IError _ -> IError (show 0) -- Hacky way to encode error limited by
                                        -- the implementation.

    step (i, s) inp = do
        let i1 = i + 1
        case i1 <= maxInputSize of
              True -> do
                  res <- step0 s inp
                  pure $ case res of
                      Partial back s1 -> Partial back (i1 - back, s1)
                      Continue back s1 -> Continue back (i1 - back, s1)
                      Done back s1 -> Done back (i1 - back, s1)
                      Error _ -> Error $ show i1
              False -> pure $ Error $ "Hit parser size limit" -- this should be
                                                              -- more
                                                              -- informative

    extract (i, s) = do
        res <- extract0 s
        pure $ case res of
            Partial back s1 -> Partial back (i - back, s1)
            Continue back s1 -> Continue back (i - back, s1)
            Done back b -> Done back (i - back, b)
            Error _ -> Error $ show i

Although this accounts for backtracking, we only keep track of the number of tokens; every token effectively has size 1. We cannot have a custom size per token.

We can however solve this by implementing a driver that accounts for the token size. The token size will become a first-class citizen.

type ParserWithToken a m b = Parser (Int, a) m (Int, b)

The driver for this parser would keep track of the token size as well as the number of tokens. While backtracking, it would also subtract the token sizes appropriately. This can be very useful.
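To make the idea concrete, here is a toy, self-contained model of such a driver (not streamly's actual API; backtracking is omitted for brevity and all names are illustrative):

```haskell
-- A Step type without backtracking, for illustration only.
data Step s b = Partial s | Done b | Error String

-- Drive a step function over (size, token) pairs, accumulating the
-- absolute byte position, and attach that position to any error.
driveSized
    :: (s -> a -> Step s b)   -- step function
    -> s                      -- initial parser state
    -> [(Int, a)]             -- input tokens paired with their sizes
    -> Either String (Int, b) -- byte position at completion, or error
driveSized step = go 0
  where
    go pos _ [] = Left ("input exhausted at byte " ++ show pos)
    go pos st ((sz, x) : xs) =
        let pos1 = pos + sz
        in case step st x of
            Partial st1 -> go pos1 st1 xs
            Done b      -> Right (pos1, b)
            Error e     -> Left (e ++ " at byte " ++ show pos1)
```

For example, a step function that finishes after three tokens of sizes 3, 1 and 4 would report byte position 8.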

On a side note: I'll need to check if I'm missing anything in limitParserInput. If this is enough to keep track of the absolute position, then I would prefer this over changing the parser drivers.

adithyaov avatar Dec 12 '24 08:12 adithyaov

A custom driver might not be required: if we can save the buffer in the parser combinator itself, the combinator can do the driver's job.

backtrackSize :: Int -> [Int] -> Int
backtrackSize i = sum . take i

backtrackBufBy :: Int -> [Int] -> [Int]
backtrackBufBy i = drop i

limitParserInput
    :: forall m a b. Monad m
    => Int -> Parser a m b -> Parser (Int, a) m (Int, b)
limitParserInput maxInputSize (Parser step0 initial0 extract0) =
    Parser step initial extract
    where
    initial = do
        st <- initial0
        pure $ case st of
            IPartial s -> IPartial (0, [], s)
            IDone b -> IDone (0, b)
            IError _ -> IError (show 0) -- Hacky way to encode error limited by
                                        -- the implementation.

    step (accInp, backtrackBuf, s) (inpSize, inp) = do
        let newAccInp = accInp + inpSize
            newBacktrackBuf = inpSize:backtrackBuf
        case newAccInp <= maxInputSize of
              True -> do
                  res <- step0 s inp
                  pure $ case res of
                      Partial back s1 ->
                          Partial back
                              ( newAccInp - backtrackSize back newBacktrackBuf
                              , backtrackBufBy back newBacktrackBuf, s1
                              )
                      Continue back s1 -> ...
                      Done back s1 -> ...
                      Error _ -> ...
              False -> pure $ Error $ "Hit parser size limit" -- this should be
                                                              -- more
                                                              -- informative

    extract (i, s) = ...

adithyaov avatar Dec 12 '24 09:12 adithyaov

Let me verify that this covers everything and properly solves both the issue and your use case. If so, we can implement this combinator or add a new parser driver where the token size is a first-class citizen.

adithyaov avatar Dec 12 '24 09:12 adithyaov

We can evaluate both approaches. About the implicit tracking approach in this PR:

  • The position is useful quite often, so always passing it in the driver may not be too bad if the performance impact is insignificant. Since it is always there, we can use it to report useful information about the error position in any parser.
  • Implicit accounting of position in the driver may be easy to use, as it is not visible everywhere but is there when you need to report it. Though it may have to be manipulated in any intermediate combinators that act like drivers.
  • Unboxed parsers are the most commonly used ones. Just byte position tracking makes sense for unboxed parsers as they work on byte array inputs. For generic/boxed parsers the position would be in terms of the input tokens but anyway we cannot look inside the tokens, so that is the only sensible way to report position.

About the proposed explicit tracking:

  • This seems less invasive as we do not need to change anything, but we need to try it out to see whether there are any difficult issues with it and how ergonomic it is to use. The good thing is that it can be tried without many changes to the existing implementation.
  • The position information is flexible and it can potentially be anything.
  • For error position reporting maybe we can change the Error constructor to Error Int String.
  • We need to see how this optimizes in practical cases and what the observed perf characteristics are.

Is there any other potential third way like tracking the context in an underlying state monad?

harendra-kumar avatar Dec 12 '24 09:12 harendra-kumar

@clintonmead Assuming parser combinators like the above exist, can you think out loud about a sample use case of error reporting or position tracking?

For a CSV file, we may want to track the position as line number + part index. An example with pseudo-code would help to see what the usage looks like.
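For instance, a purely hypothetical sketch of such a position (the CsvPos type and advance function are illustrations, not a proposal from this thread):

```haskell
-- Hypothetical CSV position: current line number and field index within it.
data CsvPos = CsvPos { csvLine :: Int, csvField :: Int }
    deriving (Eq, Show)

-- Advance the position over one input character: a newline starts a new
-- line, a comma starts a new field, anything else leaves it unchanged.
advance :: CsvPos -> Char -> CsvPos
advance (CsvPos l _) '\n' = CsvPos (l + 1) 0
advance (CsvPos l f) ','  = CsvPos l (f + 1)
advance p            _    = p
```

A parser transformer in the style of limitParserInput could then thread CsvPos through its state instead of a byte count.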

adithyaov avatar Dec 12 '24 10:12 adithyaov

I am seeing the following perf changes. No allocation changes in Parser, but some CPU regressions:

Data.Parser(cpuTime)
Benchmark                                                                       default(0)(μs) default(1) - default(0)(%)
------------------------------------------------------------------------------- -------------- --------------------------
All.Data.Parser/o-1-space.some                                                           37.41                    +199.72
All.Data.Parser/o-n-heap.takeEQ                                                          37.41                    +132.92
All.Data.Parser/o-1-space.deintercalateAll                                              205.40                     +59.07
All.Data.Parser/o-n-heap.lookAhead                                                       65.76                     +32.54
All.Data.Parser/o-1-space.streamEqBy                                                     90.22                     +24.22
All.Data.Parser/o-1-space.sepByAll (words)                                              205.46                     +18.17
All.Data.Parser/o-1-space.parseIterate (take 1)                                         808.06                     +17.97
All.Data.Parser/o-1-space.parseMany (groupBy (<))                                        74.71                     +16.73
All.Data.Parser/o-1-space.takeBetween                                                   112.22                     +16.70
All.Data.Parser/o-1-space.manyTill                                                      713.63                     +15.54
All.Data.Parser/o-n-heap.listEqBy                                                       815.97                     +14.03
All.Data.Parser/o-1-space.parseMany groupRollingByEither (Right)                        239.55                     +13.29
All.Data.Parser/o-1-space.takeBeginBy                                                    76.15                     +10.46
All.Data.Parser/o-1-space.splitAp2                                                       76.62                      +9.79
All.Data.Parser/o-1-space.parseIterate (take all)                                       895.20                      +9.75
All.Data.Parser/o-1-space.splitApAfter                                                   76.81                      +9.56
All.Data.Parser/o-1-space.parseMany groupRollingBy (bound groups)                      1387.87                      +8.66
All.Data.Parser/o-1-space.parseMany (groupBy (==))                                     1350.86                      +8.00
All.Data.Parser/o-1-space.alt2parseMany                                                2752.44                      +7.36
All.Data.Parser/o-1-space/filesystem.S.parseMany (Fold.take 1 Fold.sum)              141463.00                      +6.90
All.Data.Parser/o-n-space.sequence/100                                                 9168.32                      +5.89
All.Data.Parser/o-1-space.parseMany                                                      57.04                      +4.92
All.Data.Parser/o-1-space.concatSequence                                               3262.95                      +4.44
All.Data.Parser/o-n-space.sequenceA/100                                                9314.83                      +3.97
All.Data.Parser/o-1-space.parseMany/Unfold/1000 arrays/take 1                           894.70                      +3.47
All.Data.Parser/o-1-space.alt16                                                         594.02                      +3.17
All.Data.Parser/o-1-space.splitAp8                                                      209.14                      +3.15
All.Data.Parser/o-1-space.takeWhile                                                      37.39                      +3.14

There are significant allocation as well as CPU regressions in ParserK:

Data.ParserK(cpuTime)
Benchmark                                  default(0)(μs) default(1) - default(0)(%)
------------------------------------------ -------------- --------------------------
All.Data.ParserK/o-1-space.monad16                2141.37                    +261.17
All.Data.ParserK/o-1-space.takeWhile              2027.39                    +256.28
All.Data.ParserK/o-1-space.monad2                 2039.30                    +247.76
All.Data.ParserK/o-1-space.monad8                 2173.41                    +242.55
All.Data.ParserK/o-1-space.monad4                 2190.79                    +235.92
All.Data.ParserK/o-1-space.splitAp8               2195.81                    +234.58
All.Data.ParserK/o-1-space.one (recursive)        2520.50                    +231.90
All.Data.ParserK/o-1-space.splitAp2               2234.45                    +217.17
All.Data.ParserK/o-n-heap.sequence_               2967.16                    +203.91
All.Data.ParserK/o-n-heap.sequenceA_              3011.03                    +196.73
All.Data.ParserK/o-n-heap.sepBy1                  7103.05                     +97.05
All.Data.ParserK/o-1-space.alt2                  10419.60                     +85.68
All.Data.ParserK/o-1-space.alt16                126392.00                     +57.30
All.Data.ParserK/o-1-space.alt8                  60537.80                     +52.14
All.Data.ParserK/o-n-heap.manyAlt                12991.30                     +42.06
All.Data.ParserK/o-n-heap.sequence               17177.40                     +35.06
All.Data.ParserK/o-n-heap.sequenceA              17594.40                     +32.23
All.Data.ParserK/o-n-heap.someAlt                20200.40                     +22.45
All.Data.ParserK/o-n-heap.choice                  1188.96                      +9.57
All.Data.ParserK/o-1-space.drain                    37.40                      -0.14

Data.ParserK(Allocated)
Benchmark                                  default(0)(Bytes) default(1) - default(0)(%)
------------------------------------------ ----------------- --------------------------
All.Data.ParserK/o-1-space.drain                        0.00                        NaN
All.Data.ParserK/o-1-space.takeWhile             16773126.00                     +28.60
All.Data.ParserK/o-1-space.monad2                17908730.00                     +26.28
All.Data.ParserK/o-1-space.splitAp2              17908777.00                     +26.28
All.Data.ParserK/o-1-space.splitAp8              18838099.00                     +25.24
All.Data.ParserK/o-1-space.monad8                18837934.00                     +24.69
All.Data.ParserK/o-1-space.monad4                18579719.00                     +24.67
All.Data.ParserK/o-1-space.monad16               18941591.00                     +23.49
All.Data.ParserK/o-1-space.one (recursive)       21570983.00                     +21.41
All.Data.ParserK/o-n-heap.sequenceA_             25532786.00                     +18.45
All.Data.ParserK/o-n-heap.sequence_              25532786.00                     +18.45
All.Data.ParserK/o-n-heap.sepBy1                 25991090.00                     +17.70
All.Data.ParserK/o-1-space.alt2                  43139415.00                     +16.45
All.Data.ParserK/o-n-heap.manyAlt                29579237.00                     +15.67
All.Data.ParserK/o-n-heap.someAlt                33289419.00                     +15.20
All.Data.ParserK/o-n-heap.sequenceA              41488335.00                     +11.44
All.Data.ParserK/o-n-heap.sequence               41488335.00                     +11.18
All.Data.ParserK/o-1-space.alt8                 201287445.00                     +10.49
All.Data.ParserK/o-1-space.alt16                411933233.00                     +10.03
All.Data.ParserK/o-n-heap.choice                  3173156.00                      -0.00

No significant changes in Array parsers.

harendra-kumar avatar Jun 21 '25 11:06 harendra-kumar