kormann
It seems there are two possible solutions. **Swap idea:** while using the model, take half of the input tokens and start building inference over them in the background. When out...
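Roughly what I have in mind, as a toy sketch. The `Context`/`feed` names and `MAX_TOKENS` are made up for illustration, not a real whisper API:

```python
# Toy sketch of the swap idea; Context, feed(), MAX_TOKENS are hypothetical.
class Context:
    def __init__(self):
        self.tokens = []
    def feed(self, tokens):  # "building inference": feed tokens into the model
        self.tokens.extend(tokens)

MAX_TOKENS = 8

def step(active, background, new_token):
    active.feed([new_token])
    # once the active context is half full, start prefilling a background
    # context with the newer half of the tokens
    if background is None and len(active.tokens) >= MAX_TOKENS // 2:
        background = Context()
        background.feed(active.tokens[len(active.tokens) // 2:])
    elif background is not None:
        background.feed([new_token])
    # when the active context runs out of room, swap to the background one
    if len(active.tokens) >= MAX_TOKENS:
        active, background = background, None
    return active, background
```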
@tjohnman First off, I'm just a layman too :) "Building inference" is just me trying to describe that you need to feed the previous tokens into the model again. I understand...
When the context gets full during recording, I create a new context with only the text from the current frame. I want to make sure to only do this once, to avoid a loop. This...
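Something like this, as a minimal sketch (the names are hypothetical, not the real API):

```python
# Toy sketch of the reset-on-full logic; names are hypothetical.
MAX_TOKENS = 8

def on_frame(ctx_tokens, frame_tokens, already_reset):
    if len(ctx_tokens) + len(frame_tokens) > MAX_TOKENS and not already_reset:
        # context is full: start a fresh context seeded only with the text
        # from the current frame, and do it at most once to avoid a loop
        return list(frame_tokens), True
    return ctx_tokens + list(frame_tokens), already_reset
```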
Working on speed optimization right now; at about 8.5x realtime on a single GPU with whisper large.

```
TIMESTAMPS=1 MODEL=large python examples/whisper.py https://media.blubrry.com/takeituneasy/content.blubrry.com/takeituneasy/lex_ai_balaji_srinivasan.mp3
```
Generalized to `vectorize(gep(val) * n) -> val`.
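To make the rule concrete, here's a self-contained toy version of the rewrite (not the real tinygrad pattern matcher):

```python
# Toy rewrite: VECTORIZE(GEP(val, 0), GEP(val, 1), ..., GEP(val, n-1)) -> val
from dataclasses import dataclass

@dataclass(frozen=True)
class Vec:          # a vector value of width n
    name: str
    n: int

@dataclass(frozen=True)
class Gep:          # extract element i from a vector
    src: Vec
    i: int

@dataclass(frozen=True)
class Vectorize:    # build a vector from n scalar elements
    srcs: tuple

def rewrite(u):
    # if every source is gep(val, i) of the same val, in order 0..n-1,
    # the vectorize is a no-op and collapses back to val
    if isinstance(u, Vectorize) and u.srcs and all(isinstance(s, Gep) for s in u.srcs):
        val = u.srcs[0].src
        if len(u.srcs) == val.n and all(s.src == val and s.i == i for i, s in enumerate(u.srcs)):
            return val
    return u

v = Vec("x", 4)
assert rewrite(Vectorize(tuple(Gep(v, i) for i in range(4)))) == v
```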
Doing some more digging, I found that the CLANG compiler seems to produce n programs of size n.
```python
from tinygrad import nn, Tensor, Device
from tinygrad.engine.realize import method_cache
from tinygrad.helpers import DEBUG

T = 80
Device.DEFAULT = "CLANG"   # force the CLANG backend
DEBUG.value = 3            # print the generated programs
method_cache.clear()       # start from a cold kernel cache
x = Tensor.rand(2)
for _ in range(T):  # loop body elided in the original
    ...
```
create_schedule creates ASTs that don't recompute over and over. As I understand it, that's because:
1. a kernel can only store once
2. a reduce op must be the last operation?

```python
DEVICE = ...
```
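A quick way to see rule 2 in action (this is my expectation from the rule, not verified output): a reduce feeding another reduce can't fuse, so this should schedule as two kernels:

```python
# run with DEBUG=2 to see the kernels; I'd expect two here, since each
# kernel's reduce must be its last op, so the inner sum can't fuse
from tinygrad import Tensor

x = Tensor.rand(4, 4)
y = x.sum(axis=1).sum()  # reduce feeding a reduce
y.realize()
```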
It will only print if the difference is more than 10% and more than 10 ms.
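i.e. roughly this check (a sketch of the thresholds, not the exact code):

```python
def maybe_print(name, t_old_ms, t_new_ms):
    # report only when the difference is both relatively (>10%) and
    # absolutely (>10 ms) significant
    diff = abs(t_new_ms - t_old_ms)
    if diff > 0.10 * t_old_ms and diff > 10:
        print(f"{name}: {t_old_ms:.2f} ms -> {t_new_ms:.2f} ms")
```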
The time results are super off for some kernels; I must be missing some caching or optimization.