
Consolidated ChatGPT API improvements: improve compatibility, add request-specific token limits, and textual stop sequences

Open joshuacoles opened this issue 1 year ago • 7 comments

Building on #720 and #716, this PR introduces textual stop sequences to the ChatGPT API, aligning with the OpenAI API Reference and allowing a request to specify that generation should cease once a given textual sequence has been generated.

This is implemented by buffering tokens on the final node of the inference loop until enough characters have been generated to guarantee that none of the listed stop sequences are present. If a stop sequence is found in the middle of a token or spanning tokens, the tokens returned from the final node to the ChatGPT API will not be a faithful replication of the tokens emitted by the inference process, but rather the text prior to the stop sequence, retokenised.
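
For illustration, here is a minimal sketch of the buffering idea at the character level (the real implementation buffers tokens and retokenises; the StopBuffer name and its API are hypothetical, not this PR's actual code):

class StopBuffer:
  """Holds back just enough trailing text to rule out a partial stop-sequence match."""

  def __init__(self, stop_sequences: list[str]):
    self.stop_sequences = stop_sequences
    # Any character further than this from the end can no longer be part of
    # a match, so everything before this window is always safe to emit.
    self.hold = max(len(s) for s in stop_sequences) - 1
    self.pending = ""

  def push(self, text: str) -> tuple[str, bool]:
    """Append decoded token text; return (text safe to emit, whether a stop was hit)."""
    self.pending += text
    for seq in self.stop_sequences:
      idx = self.pending.find(seq)
      if idx != -1:
        # Emit only the text before the stop sequence, then signal completion.
        return self.pending[:idx], True
    if self.hold == 0:
      emit, self.pending = self.pending, ""
    else:
      emit, self.pending = self.pending[:-self.hold], self.pending[-self.hold:]
    return emit, False

With StopBuffer(["\n\n"]), pushing "Hello", "\n", "\nWorld" emits "Hell" then "o", and then signals a stop without emitting anything further.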

This data is passed around in the GenerationOptions object I introduced in #720. This PR should be considered after its two dependencies; I will rebase it and apply any changes as needed.

joshuacoles avatar Feb 25 '25 14:02 joshuacoles

@AlexCheema following our discussions yesterday, I've chosen to consolidate all my changes up to, but not including, structured generation (and hence function calling) into this PR to make it easier to review. I am going to add some tests, and then this will be ready to review.

I am going to rebase the latter two features on top of this, with a git history containing fewer false starts.

joshuacoles avatar Mar 07 '25 16:03 joshuacoles

I've added the tests and ensured bench.py runs successfully. This is now ready to review!

joshuacoles avatar Mar 07 '25 16:03 joshuacoles

I have used the official OpenAI SDK to perform the ChatGPT API tests, as it is the reference client implementation. I have not, however, added it to any requirements/setup file, as I could not find the other testing dependencies (e.g. pytest and pytest-asyncio) listed anywhere.

joshuacoles avatar Mar 07 '25 16:03 joshuacoles

Hi Josh! Thanks so much for putting this together. We have had a read through your PR.

One thing we'd like to alter is the attempt_to_match_stop_sequences function, to improve its efficiency (avoid re-concatenating the string on every check, and use a trie data structure for the stop-sequence search). We'll put together a suggested change this afternoon and forward it to see what you think.
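
For concreteness, a matcher along those lines might look like the following sketch (purely illustrative, not the change Matt went on to propose; the names are hypothetical):

class StopTrie:
  """Trie over the stop sequences, used to locate the earliest match in a text."""

  def __init__(self, stop_sequences: list[str]):
    self.root: dict = {}
    for seq in stop_sequences:
      node = self.root
      for ch in seq:
        node = node.setdefault(ch, {})
      node["end"] = True  # marks the end of a complete stop sequence

  def find(self, text: str) -> int:
    """Return the index at which some stop sequence begins, or -1 if none matches."""
    for start in range(len(text)):
      node = self.root
      for ch in text[start:]:
        if ch not in node:
          break
        node = node[ch]
        if "end" in node:
          return start
    return -1

The benefit over checking each stop sequence independently is that shared prefixes are walked once per position, so the cost of the scan no longer grows with the number of stop sequences. E.g. StopTrie(["</s>", "END"]).find("foo END bar") returns 4.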

MattBeton avatar Mar 11 '25 16:03 MattBeton

Hey @MattBeton, happy to look over any suggestions you have! Either link them here or shoot me an email.

I have continued working on #771; structured generation is complete as of a couple of days ago, and I have re-added the basics of my original tool-calling demo with less cruft. I will ping you in that PR when it is ready to review.

joshuacoles avatar Mar 12 '25 09:03 joshuacoles

Hi Josh - we had another look at the BufferedOutput class and decided the speed-up from a trie data structure wasn't necessary for the moment. One small suggestion we do have is to store assembled_text rather than reassembling it every time. Something like this:

diff --git a/exo/orchestration/node.py b/exo/orchestration/node.py
index 6280732..f476b40 100644
--- a/exo/orchestration/node.py
+++ b/exo/orchestration/node.py
@@ -37,9 +37,11 @@ class BufferedOutput:
     self.eos_token_id = eos_token_id
     self.stop_sequences = stop_sequences
     self.tokenizer = tokenizer
+    self.assembled_text = ""
 
   def append(self, token: int):
     self.buffer.append((token, self.tokenizer.decode([token])))
+    self.assembled_text += self.tokenizer.decode([token])
     self._token_count += 1
 
     if token == self.eos_token_id:
@@ -51,11 +53,8 @@ class BufferedOutput:
     elif len(self.stop_sequences) > 0:
       self.attempt_to_match_stop_sequences()
 
-  def assembled_text(self) -> str:
-    return "".join([text for _, text in self.buffer])

Once this is sorted we can get @AlexCheema to merge it in!

MattBeton avatar Mar 12 '25 10:03 MattBeton

Hey Matt, two comments off the top of my head (I'll take a proper look at the effects of the changes when I get a chance later today):

  1. This will obviously cause assembled_text to grow monotonically as the response grows. That probably won't be an issue given the size of LLM responses in general, but it is something to be aware of if this were used in a deep-research-like setting.
  2. How does this affect the windowing behaviour in the buffered output? Previously it used the length of the assembled text to determine when enough characters had accumulated to emit the oldest token from the buffer; since assembled_text now keeps growing, that condition will eventually always be true, and we will emit one token on every step. I am not entirely certain that this is the correct behaviour.

For example, if the first token (T1) is longer than the buffer length (L) (and does not contain a stop sequence), every subsequent token will be emitted as soon as it is generated (since len(str(T1) + str(...x)) > L for all later token sequences x). That would be incorrect, as it would be unable to detect any stop sequence that is not contained within a single token.

To remedy this, we could remove the emitted token's text from the start of assembled_text as it is emitted (see the sketch below). This would resolve both issues, but I don't know the performance impact compared to the initial implementation.
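
A minimal sketch of that remedy, assuming a BufferedOutput-like shape (the TrimmingBuffer name and methods are illustrative, and the stop-sequence matching itself is elided):

from collections import deque

class TrimmingBuffer:
  def __init__(self, max_stop_len: int):
    self.max_stop_len = max_stop_len
    self.buffer: deque[tuple[int, str]] = deque()  # (token, decoded text)
    self.assembled_text = ""

  def append(self, token: int, text: str):
    self.buffer.append((token, text))
    self.assembled_text += text

  def pop_ready(self) -> list[int]:
    """Emit tokens whose text can no longer take part in a stop-sequence match."""
    emitted = []
    while self.buffer:
      token, text = self.buffer[0]
      # Hold a token until at least max_stop_len characters have accumulated
      # beyond its text, so a stop sequence spanning its boundary would still
      # be caught.
      if len(self.assembled_text) - len(text) < self.max_stop_len:
        break
      self.buffer.popleft()
      # Trim the emitted text from the front: assembled_text now holds only
      # undecided text, so it stays bounded (point 1) and the length check
      # above stays meaningful for every later token (point 2).
      self.assembled_text = self.assembled_text[len(text):]
      emitted.append(token)
    return emitted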

joshuacoles avatar Mar 12 '25 11:03 joshuacoles