Open-Assistant
Implementation of 'use_cache' across generate calls for decoder models
When generating with decoder models, we can cache intermediate activations (the key/value cache) to avoid recomputing them. The transformers implementation does this by default within a single generate call when producing multiple new tokens.
In our case, we can have longer conversations consisting of multiple prompt-reply turns, so we would need to cache activations across calls to generate.
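As a rough sketch of what this could look like (assuming a recent transformers version where generate returns `past_key_values` with `return_dict_in_generate=True` and accepts it back as a keyword argument; the helper `reply_with_cache` and the use of gpt2 are purely illustrative):

```python
# Sketch: reuse past_key_values across generate calls for a multi-turn conversation.
# Assumes a recent transformers release where `generate` returns `past_key_values`
# when `return_dict_in_generate=True` and `use_cache=True`, and accepts it back.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)


def reply_with_cache(conversation_text, past_key_values=None):
    """Generate a reply and return it together with the cache for the next turn."""
    inputs = tokenizer(conversation_text, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        past_key_values=past_key_values,  # None on the first turn
        use_cache=True,
        return_dict_in_generate=True,
        max_new_tokens=64,
    )
    reply = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
    return reply, outputs.past_key_values


# Turn 1: no cache yet.
history = "User: Hello!\nAssistant:"
reply, cache = reply_with_cache(history)

# Turn 2: the full history is tokenized again, but the cached activations for the
# already-seen prefix avoid recomputing it. The prefix tokenization must match the
# tokens the cache was built from, which is a caveat of round-tripping through text.
history = reply + "\nUser: Tell me a joke.\nAssistant:"
reply, cache = reply_with_cache(history, past_key_values=cache)
```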
Whether caching activations in this case is worthwhile is not obvious, because of the memory overhead. We could attach a timeout to each connection, after which the cached results are dumped. For this to work, repeated calls for the same conversation need to be served by the same machine.
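One possible shape for the server-side bookkeeping, as a minimal sketch (names like `ConversationCache` and `ttl_seconds` are illustrative, not an existing Open-Assistant component):

```python
# Sketch of a per-conversation activation cache with timeout-based eviction.
# All names here are illustrative; this is not an existing Open-Assistant API.
import time
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class ConversationCache:
    ttl_seconds: float = 300.0  # dump cached activations after 5 minutes of inactivity
    _entries: Dict[str, Tuple[float, object]] = field(default_factory=dict)  # conv_id -> (timestamp, past_key_values)

    def get(self, conversation_id: str):
        entry = self._entries.get(conversation_id)
        if entry is None:
            return None
        timestamp, past_key_values = entry
        if time.monotonic() - timestamp > self.ttl_seconds:
            # Connection timed out: free the (potentially large) cached tensors.
            del self._entries[conversation_id]
            return None
        return past_key_values

    def put(self, conversation_id: str, past_key_values) -> None:
        self._entries[conversation_id] = (time.monotonic(), past_key_values)

    def evict_expired(self) -> None:
        now = time.monotonic()
        expired = [cid for cid, (ts, _) in self._entries.items() if now - ts > self.ttl_seconds]
        for cid in expired:
            del self._entries[cid]
```

Since the cache lives in a single worker's memory, requests for the same conversation would have to be routed to that worker (e.g. sticky routing keyed on a conversation id), and `evict_expired` would need to run periodically or before each lookup.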