guidance Make `GrammarFunction.match` consume hidden commit points twice

This fixes issue https://github.com/guidance-ai/guidance/issues/624 by having GrammarFuntion.match consume hidden commit points twice.

Namely, if gen has a stop, kwd, previous behavior allowed matches when the match string ended in the stop string. For the grammar to have possibly generated that match string, the stop string would have to be allowable by a part of the grammar following the gen call, so we simply consume the bytes twice.

Feb 20 '24 21:02 hudson-ai

@riedgar-ms request for review :)

If the manual rewinding done here can be achieved with mechanisms built into the parser, I think that would be better. But that code is still somewhat opaque to me and this works without being overly confusing (in my opinion at least)

Feb 21 '24 16:02 hudson-ai

@riedgar-ms request for review :)

If the manual rewinding done here can be achieved with mechanisms built into the parser, I think that would be better. But that code is still somewhat opaque to me and this works without being overly confusing (in my opinion at least)

This is beyond my understanding of the code base right now. @slundberg @marcotcr @Harsha-Nori ?

Feb 21 '24 16:02 riedgar-ms

Thanks @hudson-ai , this is great and subtle point about how the grammars are designed for guiding the generative process and so hidden commit points have odd behavior on fixed string parsing. Unfortunately simply consuming the commit point twice will only work when the hidden commit point is not really meant to be hidden (just happens to be defined as hidden and then followed by a copy of what was hidden). In many cases though the model might have a hidden commit point that is something like the EOS token, and then what comes next in the output is not a literal EOS token but something else.

I am not sure how to make these generative focused grammars play nice with fixed text strings that don't show the hidden commit_point data in them. Once approach would be to prune out hidden commit points that are followed by what they are hiding, this would address all the cases this PR addresses without assuming it in cases where that is not true. Perhaps we can update this PR to do this by keeping track of the bytes that we are "redoing" and if we fail to reprise those in the grammar then we fail with a "hidden commit point" exception. That way .match will work as expected in the cases that are well defined, but will through an error when you try to use hidden commit points that are impossible to parse with a fixed string.

What do you think? Are you up for adding that exception throwing check?

(btw, perhaps we should also export a "join" function just like we export "select" so people can avoid using "+" if the want without accessing internal classes)

Feb 22 '24 19:02 slundberg

I think I follow what you are saying re: hidden commit points that weren't meant to be hidden, but I'm not quite sure I track 100%.

My thinking is as follows:

If we are matching a fixed string (whether it was generated by a grammar or manually specified) and we hit some text that corresponds to a hidden commit point, it's not possible that that text was generated by the grammar node with that hidden commit because, well, it was hidden. Therefore it was either generated by the next node in the grammar or that string cannot match the grammar, and we can verify whether that next node could generate that string by re-consuming the bytes.

In other words, this approach does not assume that commit points are followed by what they are hiding -- rather, it raises a ParserError if the string matches a hidden commit point && the commit point is not followed by what it's hiding.

100% into the idea of raising "hidden commit point" exception in this case!

I'm not quite sure I follow how the collapsing/pruning behavior would work. Can you elaborate a bit? Furthermore, can you think of any tests that my approach would fail on that collapsing/pruning would succeed on?

Feb 22 '24 20:02 hudson-ai

Ok -- I think I see. Here's a test that should pass but doesn't here:

grammar = gen(regex=r'a*', stop='@') + gen(regex=r'b*')
matchstr = 'aaaaaabbbbb'
match = grammar.match(matchstr, allow_partial=True)
assert match is not None

I assume this is what you meant by collapsing hidden commits?

Feb 22 '24 20:02 hudson-ai

@hudson-ai sorry, was out sick for a bit. Yes, that's what I meant by collapsing hidden commit points. In the example you gave I think we should just throw hidden commit point exception since we don't regenerate '@' after the hidden commit point.

So, in summary here are a couple unit tests that should pass:

with pytest.raises(HiddenCommitPointException):
    grammar = gen(regex=r'a*', stop='@') + gen(regex=r'b*')
    matchstr = 'aaaaaabbbbb'
    match = grammar.match(matchstr, allow_partial=True)

grammar = gen(regex=r'a*', stop='@') + "@" + gen(regex=r'b*')
matchstr = 'aaaaaa@bbbbb'
match = grammar.match(matchstr, allow_partial=True)
assert match is not None

Mar 04 '24 16:03 slundberg

@hudson-ai sorry, was out sick for a bit. Yes, that's what I meant by collapsing hidden commit points. In the example you gave I think we should just throw hidden commit point exception since we don't regenerate '@' after the hidden commit point.

So, in summary here are a couple unit tests that should pass:
with pytest.raises(HiddenCommitPointException):
    grammar = gen(regex=r'a*', stop='@') + gen(regex=r'b*')
    matchstr = 'aaaaaabbbbb'
    match = grammar.match(matchstr, allow_partial=True)
grammar = gen(regex=r'a*', stop='@') + "@" + gen(regex=r'b*')
matchstr = 'aaaaaa@bbbbb'
match = grammar.match(matchstr, allow_partial=True)
assert match is not None

No problem @slundberg, glad you took time to recover!

I'm not entirely sure I agree with the first test you sent there, especially since hidden commit points may frequently be things like EOS tokens.

Here are the four main cases I see:

Stop token not repeated in grammar, not present in match
- Shouldn't raise exception, especially if we imagine the stop token could have been EOS?

  grammar = gen(regex=r'a*', stop='@') + gen(regex=r'b*')
  matchstr = 'aaaaaabbbbb'
  match = grammar.match(matchstr, allow_partial=True)

Stop token not repeated in grammar, present in match
- I'm confident that this SHOULD raise an exception, maybe a HiddenCommitPointException exception in particular. Somehow a '@' made it into our string even though it should have been hidden.

  grammar = gen(regex=r'a*', stop='@') + gen(regex=r'b*')
  matchstr = 'aaaaaa@bbbbb'
  match = grammar.match(matchstr, allow_partial=True)

Stop token repeated in grammar, not present in match
- This is just a standard ParserException. We got a b when we were expecting an a or an @.

  grammar = gen(regex=r'a*', stop='@') + '@' + gen(regex=r'b*')
  matchstr = 'aaaaaabbbbb'
  match = grammar.match(matchstr, allow_partial=True)

Stop token repeated in grammar, present in match
- This should not be an exception

  grammar = gen(regex=r'a*', stop='@') + '@' + gen(regex=r'b*')
  matchstr = 'aaaaaa@bbbbb'
  match = grammar.match(matchstr, allow_partial=True)

Let me know if you agree with these behaviors or if you think I'm wrong -- either way, I'm happy to take another stab at getting this working :)

Also need to think through which of these should be partial/full matches...

Mar 04 '24 20:03 hudson-ai

@hudson-ai great concrete cases.

I am not in principle opposed to the cases as you laid them out. It would be nice if we were able to have a .match() that aligns in all cases with what would get produced by a model with that grammar constraint. I am however a bit concerned that there might be some fundamental ambiguity we might run into trying to do that (which is why I was just trying cover the case where the hidden stop token is repeated by the grammar since I think there is not ambiguity there).

Consider this case:

grammar = gen(regex=r'.*', stop='@') + gen(regex=r'[^0-9]*', name="end")
matchstr = 'ajdkjck@sk9dfjjsdfjk'
match = grammar.match(matchstr, allow_partial=True)

The above case cannot be a match because once we see an "@" we know that no numbers can appear (since an "@" can't appear in the first pattern and numbers can't appear in the second).

grammar = gen(regex=r'.*', stop='@') + gen(regex=r'[^0-9]*', name="end")
matchstr = 'ajdkjcksk9dfjjsdfjk'
match = grammar.match(matchstr, allow_partial=True)

If we can make match work in cases like above then I'll be happy to go with the semantics as you proposed :)

...perhaps we can do this by just letting the matching grammar consume hidden commit points when the alternative is a failed match?

Mar 08 '24 00:03 slundberg

@slundberg thank you for taking a look and offering your thoughts. I'll take a crack at this when I have some free time :)

Mar 08 '24 18:03 hudson-ai

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 63.60%. Comparing base (29b5b8d) to head (45236d3). Report is 2 commits behind head on main.

:exclamation: Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #644      +/-   ##
==========================================
- Coverage   66.97%   63.60%   -3.37%     
==========================================
  Files          53       53              
  Lines        3951     3959       +8     
==========================================
- Hits         2646     2518     -128     
- Misses       1305     1441     +136

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Mar 26 '24 02:03 codecov-commenter

@slundberg I feared you were right that there is a fundamental ambiguity... So I naturally went on a walkabout and wrote an Earley recognizer in rust :laughing:

I realized that the parser is more than capable of resolving the ambiguity -- it should be as easy as wrapping the hidden-commit point node in optional when in the context of grammar.match (and consuming the hidden commit point twice if one actually shows up).

My current implementation of that feels a bit hacky, so I'll mark this as a draft until I feel more satisfied with how I'm doing that (and I would like to clean my tests up a bit). One thought I have is that hidden commit points can always be wrapped in optional, but I'm not yet confident in that (in particular, I haven't yet delved too deep into how you're implementing captures). Any thoughts from you to that end?

Mar 26 '24 05:03 hudson-ai

Always wrapping commit points in optional seems to break generation with stop tokens, but I don't quite understand why yet. Can you point me to anything to help me wrap my head around how they're working? It "works" during matching, but I would like to understand a bit more before going forward with that solution.

Mar 28 '24 03:03 hudson-ai

Sorry for the slow reply! Nice that you implemented one in rust! As #736 suggests, we may move towards that here as well.

As for making all hidden commit points optional...I think the issue is that if you have a hidden commit point that is optional as your stop token then there is nothing stopping you from jumping over that stop point without ever generating it. So if I have the grammar gen(stop="end") + regex([0-9]) then the model could generate "asdf0" and then return because that is a valid match (if the hidden commit point is optional).

Apr 08 '24 23:04 slundberg