Embedding length calculation errors
I'm getting this error on a convert of a nakedlibrary (content split on newlines):
```
openai.Embedding.create error: This model's maximum context length is 8191 tokens, however you requested 9501 tokens (9501 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.
Retrying in 20 seconds ... (many times)
Fetching token_count for 2b66e37ecae038884e0f3e825fe13349c7ba16f03a50fb428015667cac0cdb6c
Token indices sequence length is longer than the specified maximum sequence length for this model (9956 > 1024). Running this sequence through the model will result in indexing errors
Saving prematurely due to crash: 2b66e37ecae038884e0f3e825fe13349c7ba16f03a50fb428015667cac0cdb6c had the wrong length of embedding, expected 1536
Traceback (most recent call last):
  File "/Users/aaron/Documents/polymath/convert/main.py", line 157, in <module>
    result.insert_bit(bit)
  File "/Users/aaron/Documents/polymath/polymath/library.py", line 722, in insert_bit
    bit._set_library(self)
  File "/Users/aaron/Documents/polymath/polymath/library.py", line 216, in _set_library
    self.validate()
  File "/Users/aaron/Documents/polymath/polymath/library.py", line 179, in validate
```
Yikes! Yeah, we probably need to talk about sizes and limits for the nakedlibrary. Or maybe nakedlibrary import needs to use chunker?
As a way out of this particular situation, I would highly recommend running the content through chunker first to get the right-sized chunks.
The example usage is here: https://github.com/dglazkov/polymath/blob/main/convert/markdown.py#L59
Hmmm, the nakedlibrary importer does run the text through generate_chunks.
@uglyrobot I'm guessing that one of your chunks of text is a single line that is extraordinarily long? Can you confirm?
@dglazkov that implies to me that generate_chunks should forcibly break up content that is very long into multiple chunks: split at sentence boundaries first, fall back to word boundaries, and failing that, just hard-break in the middle of a run of characters?
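Something like this, maybe (a rough sketch with hypothetical helper names, using character lengths where the real code would count tokens; not the actual generate_chunks):

```python
import re

def hard_split(text: str, max_len: int) -> list[str]:
    # Last resort: break in the middle of a run of characters.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

def split_long_text(text: str, max_len: int) -> list[str]:
    if len(text) <= max_len:
        return [text]
    # Try sentence boundaries first, then plain whitespace.
    for boundary in (r"(?<=[.!?])\s+", r"\s+"):
        pieces = re.split(boundary, text)
        if all(len(p) <= max_len for p in pieces):
            # Greedily pack pieces back into chunks up to max_len.
            chunks: list[str] = []
            current = ""
            for piece in pieces:
                if current and len(current) + 1 + len(piece) > max_len:
                    chunks.append(current)
                    current = piece
                else:
                    current = f"{current} {piece}".strip()
            if current:
                chunks.append(current)
            return chunks
    # No boundary produced small-enough pieces: hard-break the run.
    return hard_split(text, max_len)
```

The greedy repacking keeps chunks as large as possible, so the hard break only kicks in for pathological inputs like an unbroken run of characters.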
Yes, I broke it up on newlines. It was a long one, but more importantly it didn't fail gracefully:
- It retried 10 times with a sleep even though it was a permanent error.
- Even though get_embedding returned None, the bit was not skipped in convert.main (a kind of weird Python error), so it got stuck importing that same bit over and over (see the sketch below).
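Both seem fixable in the retry loop. A rough sketch, assuming the openai 0.x client from the traceback above (the function shape here is hypothetical, not the actual polymath code):

```python
import time
import openai

EXPECTED_EMBEDDING_LENGTH = 1536  # text-embedding-ada-002

def get_embedding(text: str, retries: int = 10):
    for _ in range(retries):
        try:
            result = openai.Embedding.create(
                model="text-embedding-ada-002", input=text)
            return result["data"][0]["embedding"]
        except openai.error.InvalidRequestError:
            # Permanent error (e.g. over the context limit): retrying
            # will never succeed, so bail out immediately.
            return None
        except (openai.error.RateLimitError, openai.error.APIError):
            # Transient error: back off and try again.
            time.sleep(20)
    return None

# And in the convert loop, an explicit check so a bad bit is skipped
# instead of being re-imported over and over:
#
#     embedding = get_embedding(bit.text)
#     if embedding is None or len(embedding) != EXPECTED_EMBEDDING_LENGTH:
#         continue
```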
I'd love to see the input. Would you be up for sharing?