llama.cpp Constrained decoding with BNF grammar fails to work with some tokens

I am running Mixtral 8x7B via llamafile (https://github.com/Mozilla-Ocho/llamafile), version v0.6.1, on an Apple M1 Max with 64 GB of memory.

I am evaluating the model on the MMLU dataset. I have observed that with certain phrases in the prompt and grammar, llama.cpp's generation speed drops sharply (to as low as 5 tokens/second) and the output ultimately does not adhere to the grammar.

Example success prompt:

Please select the answer to the following question. Prior to selecting your answer, please explain using step-by-step reasoning to how you arrived at your answer. Then provide your answer as "The answer is: <A, B, C, or D>" Question: Suppose X and Y are random variables with E(X) = 37, var(X) = 5, E(Y) = 62, and var(Y) = 12. What are the expected value and variance of the random variable X + Y? A) E(X + Y) = 99, var(X + Y) = 8.5 B) E(X + Y) = 99, var(X + Y) = 13 C) E(X + Y) = 99, var(X + Y) = 17 D) There is insufficient information to answer this question.

Answer:

Corresponding grammar:

root ::= work answer
work ::= [^"The answer is:"]+
answer ::= "The answer is: " ("A" | "B" | "C" | "D")

Example failure prompt:

Please select the answer to the following question. Prior to selecting your answer, please explain using step-by-step reasoning to how you arrived at your answer. Then provide your answer as "FINAL_ANSWER: <A, B, C, or D>". Question: Suppose X and Y are random variables with E(X) = 37, var(X) = 5, E(Y) = 62, and var(Y) = 12. What are the expected value and variance of the random variable X + Y? A) E(X + Y) = 99, var(X + Y) = 8.5 B) E(X + Y) = 99, var(X + Y) = 13 C) E(X + Y) = 99, var(X + Y) = 17 D) There is insufficient information to answer this question.

Answer:

Corresponding grammar:

root ::= work answer
work ::= [^"FINAL_ANSWER:"]+
answer ::= "FINAL_ANSWER: " ("A" | "B" | "C" | "D")
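
A note on the two work rules above: as far as I understand the GBNF format used by llama.cpp, square brackets denote a character class, so [^"..."] excludes each individual character listed (including the quote marks) rather than the phrase as a whole. The Python sketch below only illustrates that reading. The helper names are hypothetical, and it ignores escapes and ranges, which neither class uses.

```python
def negated_class_chars(class_body: str) -> set:
    """Characters excluded by a negated class [^...], read as a plain set.

    Assumption: no escapes or ranges appear in class_body.
    """
    return set(class_body)


def first_blocked(text: str, excluded: set):
    """Index of the first character the work rule could not emit, or None."""
    for i, ch in enumerate(text):
        if ch in excluded:
            return i
    return None


# The bodies of the two classes, copied from the grammars above.
the_answer = negated_class_chars('"The answer is:"')
final_answer = negated_class_chars('"FINAL_ANSWER:"')

sample = "first, add the two means"  # invented lowercase reasoning text

print(sorted(the_answer))
print(sorted(final_answer))
print(first_blocked(sample, the_answer))    # blocked early, at the "i"
print(first_blocked(sample, final_answer))  # never blocked
```

Under this reading, the first work rule cannot emit a space or several common lowercase letters, so the model is pushed into the answer rule almost immediately, while the second work rule permits ordinary lowercase prose indefinitely. That may account for part of the difference between the two runs.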

My suspicion is that FINAL_ANSWER is tokenized into something rare enough that even when the logit bias of the grammar is applied, it is never selected.

Here is an example output with the BNF grammar applied:

[INST]Select the answer to the following question. Do not write explanations or additional text. Just select the ANSWER choice as A, B, C, or D. [/INST]
Question: Many Web browsers allow users to open anonymous windows. During a browsing session in an anonymous window, the browser does not record a browsing history or a list of downloaded files. When the anonymous window is exited, cookies created during the session are deleted. Which of the following statements about browsing sessions in an anonymous window is true? 
A) The activities of a user browsing in an anonymous window will not be visible to people who monitor the user's network, such as the system administrator. 
B) Items placed in a Web store's shopping cart for future purchase during the anonymous browsing session will not be saved on the user's computer. 
C) A user will not be able to log in to e-mail or social media accounts during the anonymous browsing session. 
D) A user browsing in an anonymous window will be protected from viruses launched from any web sites visited or files downloaded. 

Answer:

--- BEGIN mixtral-8x7b RESPONSE ---

B) The activities of a user browsing in an anonymous window will not be visible to people who monitor the user's network, such as the system administrator. (This is incorrect because even though there isn't any local record on the computer about your activity, it can still be seen by those monitoring the connection.) B) Yes, since no browsing or download list will be saved locally and once you exit that window everything gets deleted including cookies which are required for login purposes. D) This is incorrect as anonymity doesn't guarantee virus protection while browsing online or downloading files. Users should always have reliable antivirus software installed regardless of whether they use anonymous mode or not to stay protected against cyber threats and malicious activities.

Thus, the right answer is: B) Yes, since no browsing or download list will be saved locally and once you exit that window everything gets deleted including cookies which are required for login purposes. (Just a note though that some websites might still remember your activity despite using an incognito/anonymous mode based on their server-side records etc., hence it's not entirely untraceable unless those servers are also compromised or breached.)

hence, the answer is: B. Yes, since no browsing or download list will be saved locally and once you exit that window everything gets deleted including cookies which are required for login purposes. But remember this doesn’t mean your activities can’t be traced by websites themselves or those who have access to their servers etc., hence stay cautious! :) /strike] B) Yes, since no browsing or download list will be saved locally and once you exit that window everything gets deleted including cookies which are required for login purposes.

Best regards, Mike Bartram (he/him) Higher Ground Technical Consultants, Chtd. Missoula, MT & Bozeman, MT Phone: +1(406)540-6397 |

Feb 19 '24 22:02 KE7

It would be useful to provide the specific commands needed to reproduce this using the llama.cpp examples.

Feb 20 '24 07:02 ggerganov

For the failure example, does the run ever finish, or does it just hang?

The sampler shouldn't permit end-of-sequence to be generated unless the grammar is fully satisfied, so I think it should be hanging in that case.

Mar 14 '24 13:03 HanClinto

@KE7

> My suspicion is that FINAL_ANSWER is tokenized into something rare enough that even when the logit bias of the grammar is applied, it is never selected.

Right -- I think that this is generally correct. Grammars do not guide the token generation; they only constrain it. If your model has not been trained to output something like "FINAL_ANSWER", and that phrase is not represented strongly enough in your prompt, then as long as your grammar offers an easier path (literally anything other than that phrase), the model is going to keep generating other nonsense and never finish the generation.
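
The distinction between constraining and guiding can be illustrated with a toy sampler. This is a minimal sketch, not llama.cpp's actual sampling code: the token names and logit values are invented, and the grammar is reduced to a set of currently allowed tokens.

```python
import math


def apply_grammar_mask(logits: dict, allowed: set) -> dict:
    """Set the logit of every token the grammar rejects to -inf."""
    return {t: (v if t in allowed else -math.inf) for t, v in logits.items()}


def softmax(logits: dict) -> dict:
    """Convert logits to probabilities; -inf logits get probability 0."""
    m = max(v for v in logits.values() if v != -math.inf)
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


# Invented logits: the model strongly prefers ordinary prose tokens
# and considers "FINAL" very unlikely at this position.
logits = {" the": 5.0, " so": 4.0, "FINAL": -3.0, '"': 6.0}

# Assumed grammar state: the quote is rejected, everything else is allowed.
allowed = {" the", " so", "FINAL"}

probs = softmax(apply_grammar_mask(logits, allowed))
for token, p in probs.items():
    print(repr(token), round(p, 6))
```

The mask removes the rejected token but leaves the relative preferences among allowed tokens untouched, so "FINAL" stays improbable for as long as the grammar also allows the prose tokens.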

In particular, I think your unbounded work rule makes it far too easy for the LLM to continue rambling.
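
One way to bound the work rule, sketched under assumptions: GBNF supports grouping and the ? operator, so a rule matching at most N items can be written as nested optional groups. The helper below is hypothetical and only builds the rule text. The item class [a-z ] is a placeholder, and the bound counts characters matched by the grammar, not tokens.

```python
def bounded_rule(name: str, item: str, max_items: int) -> str:
    """Build a GBNF rule matching between 1 and max_items repetitions of item.

    Uses only grouping and '?', e.g. for 3 items: item (item (item)?)?
    """
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    body = item
    for _ in range(max_items - 1):
        # Wrap what we have so far in an optional group after one more item.
        body = f"{item} ({body})?"
    return f"{name} ::= {body}"


# Small bound so the output stays readable; a real bound would be larger.
print("root ::= work answer")
print(bounded_rule("work", "[a-z ]", 3))
print('answer ::= "FINAL_ANSWER: " ("A" | "B" | "C" | "D")')
```

Once the bound is exhausted, the only path left in the grammar is the answer literal, so the sampler has to pick tokens from it. This has not been tested against llama.cpp here, and large bounds produce deeply nested rules.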

Unless it's a word or phrase that your model was specifically trained on, you might want to fall back to something like your original success prompt.

Mar 14 '24 13:03 HanClinto

This issue was closed because it has been inactive for 14 days since being marked as stale.

Apr 28 '24 01:04 github-actions[bot]