
Embedding length calculation errors

uglyrobot opened this issue on Feb 7, 2023 · 4 comments

I'm getting this when converting a nakedlibrary (split on new lines):

openai.Embedding.create error: This model's maximum context length is 8191 tokens, however you requested 9501 tokens (9501 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.
Retrying in 20 seconds ... (many times)
Fetching token_count for 2b66e37ecae038884e0f3e825fe13349c7ba16f03a50fb428015667cac0cdb6c
Token indices sequence length is longer than the specified maximum sequence length for this model (9956 > 1024). Running this sequence through the model will result in indexing errors
Saving prematurely due to crash:  2b66e37ecae038884e0f3e825fe13349c7ba16f03a50fb428015667cac0cdb6c had the wrong length of embedding, expected 1536
Traceback (most recent call last):
  File "/Users/aaron/Documents/polymath/convert/main.py", line 157, in <module>
    result.insert_bit(bit)
  File "/Users/aaron/Documents/polymath/polymath/library.py", line 722, in insert_bit
    bit._set_library(self)
  File "/Users/aaron/Documents/polymath/polymath/library.py", line 216, in _set_library
    self.validate()
  File "/Users/aaron/Documents/polymath/polymath/library.py", line 179, in validate

uglyrobot · Feb 07 '23

Yikes! Yeah, we probably need to talk about sizes and limits for the nakedlibrary. Or maybe nakedlibrary import needs to use chunker?

As a way out of this particular situation, I would highly recommend running the content through chunker first to get the right-sized chunks.

The example usage is here: https://github.com/dglazkov/polymath/blob/main/convert/markdown.py#L59
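
For reference, here is a minimal sketch of the kind of token-aware chunking I mean, using tiktoken to measure sizes the same way the embedding model does (this is an illustration only, not the actual chunker API; note that a single line over the limit would still come out oversized, which is the case discussed below):

```python
# Hypothetical illustration (not polymath's actual API): pack newline-separated
# paragraphs into chunks that stay under the embedding model's 8191-token limit.
import tiktoken

MAX_TOKENS = 8191  # limit for text-embedding-ada-002, per the error above
enc = tiktoken.get_encoding("cl100k_base")

def split_to_token_limit(text, max_tokens=MAX_TOKENS):
    """Greedily pack paragraphs into chunks of at most max_tokens tokens."""
    chunks, current, current_len = [], [], 0
    for para in text.split("\n"):
        para_len = len(enc.encode(para))
        if current and current_len + para_len > max_tokens:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += para_len
    if current:
        chunks.append("\n".join(current))
    return chunks
```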

dglazkov · Feb 07 '23

Hmmm, the nakedlibrary importer does run the text through generate_chunks.

@uglyrobot I'm guessing that one of your chunks of text is a single line that is extraordinarily long? Can you confirm?

@dglazkov that implies to me that generate_chunks should forcibly break up content that is very long into multiple chunks: break at sentence boundaries first, fall back to word boundaries if that fails, and failing that just hard-break it in the middle of a run of characters?
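
Something like this rough sketch (hypothetical code, not anything in the repo; it uses character lengths for simplicity where the real thing would count tokens):

```python
# Fallback splitting idea: sentence boundaries, then word boundaries, then hard breaks.
import re

def force_split(text, max_len):
    """Split text into pieces no longer than max_len characters."""
    if len(text) <= max_len:
        return [text]
    # 1. Try sentence boundaries.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    if len(sentences) > 1 and all(len(s) <= max_len for s in sentences):
        return _pack(sentences, max_len)
    # 2. Fall back to word boundaries.
    words = text.split()
    if len(words) > 1 and all(len(w) <= max_len for w in words):
        return _pack(words, max_len)
    # 3. Last resort: hard break in the middle of a run of characters.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

def _pack(pieces, max_len, sep=' '):
    """Greedily rejoin pieces without exceeding max_len."""
    out, current = [], ''
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) > max_len and current:
            out.append(current)
            current = piece
        else:
            current = candidate
    if current:
        out.append(current)
    return out
```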

jkomoros · Feb 07 '23

Yes, I broke it up on newlines. It was a long one, but more importantly, it didn't fail gracefully:

  • It retried 10 times with a sleep, even though it was a permanent error.
  • Even though get_embedding returned None, the bit was not skipped in convert.main (kind of a weird Python error), so it got stuck importing that bit over and over (see the sketch below).
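
Roughly the behavior I'd expect, as a hypothetical sketch (not the current convert code) using the old openai 0.x client this project calls:

```python
# Only retry errors that can actually go away; skip the bit when no embedding comes back.
import time
import openai

def get_embedding_or_none(text, retries=3):
    for attempt in range(retries):
        try:
            response = openai.Embedding.create(
                model="text-embedding-ada-002", input=text)
            return response["data"][0]["embedding"]
        except openai.error.InvalidRequestError:
            # Permanent error (e.g. over the context limit): retrying won't help.
            return None
        except (openai.error.RateLimitError, openai.error.APIError):
            # Transient error: back off and try again.
            time.sleep(20)
    return None

# Caller should skip the bit instead of looping on it:
# embedding = get_embedding_or_none(bit.text)
# if embedding is None:
#     continue  # skip this bit rather than retrying forever
```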

uglyrobot · Feb 07 '23

I'd love to see the input. Would you be up for sharing?

dglazkov · Feb 08 '23