mooff
Hey, no worries. I found it tricky to get everything compiled right. I see the Docker-based automation, but I didn't have the resources (mainly disk space) to use it, so...
Hey, mostly just housekeeping. Added the gem changes to the build config for all platforms, then did a full rebuild using Docker. I'd squash / elide some of the commits,...
- 'Server MAY leave connection open pending registration via the web page, and proceed once success or failure has been reported by the backend'
- 'Clients SHOULD play elevator music...
So the [abstract from the Longformer paper](https://arxiv.org/abs/2004.05150) answers some questions about computational complexity: > Transformer-based models are unable to process long sequences due to their self-attention operation, which scales **quadratically**...
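To make the quadratic term concrete, here's a toy sketch of standard scaled dot-product attention scores (random vectors, not a real model): the score matrix alone is n x n, so doubling the context quadruples it.

```python
import numpy as np

def attention_score_matrix(n_tokens, d_model, seed=0):
    """Scaled dot-product attention scores: an n_tokens x n_tokens matrix."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_tokens, d_model))
    k = rng.standard_normal((n_tokens, d_model))
    # The n x n score matrix is the quadratic cost the abstract refers to.
    return q @ k.T / np.sqrt(d_model)

print(attention_score_matrix(512, 64).size)   # 262144 entries
print(attention_score_matrix(1024, 64).size)  # 1048576 entries, 4x for 2x context
```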
I woke up thinking about interpolating input vectors, and/or the early representations built from them - whatever is created when the attention matrices are applied to the input embeddings. For...
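One cheap way to play with that idea, on toy vectors rather than real model activations: linear interpolation, plus spherical interpolation, which better preserves the norm profile when embedding directions are what matters.

```python
import numpy as np

def lerp(a, b, t):
    """Straight-line interpolation between two embedding vectors."""
    return (1 - t) * a + t * b

def slerp(a, b, t, eps=1e-8):
    """Spherical interpolation: rotate from a toward b at constant angular rate."""
    a_n = a / (np.linalg.norm(a) + eps)
    b_n = b / (np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(a_n @ b_n, -1.0, 1.0))
    if omega < eps:  # nearly parallel: fall back to lerp
        return lerp(a, b, t)
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(lerp(a, b, 0.5))   # [0.5 0.5]
print(slerp(a, b, 0.5))  # ~[0.707 0.707], stays on the unit circle
```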
> the huge matrices are multiplied but no matter the size of the context, the output is still the embedding size that is passed to the next layer or is...
> Except it doesn't work, once you go past some token number, everything breaks down and the output becomes more and more garbage. I don't know what the reason is,...
Could you model position with something like a signed distance function? Or maybe some kind of modular arithmetic, so you can fractally divide relative distances into self-contained modular units. I...
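A rough sketch of the modular idea (purely illustrative, not any published scheme): write a position in mixed radix, so each scale wraps modulo its own base and relative distance decomposes into per-scale offsets.

```python
def modular_position(pos, bases=(4, 4, 4)):
    """Decompose an absolute position into per-scale residues, fine scale first.

    Positions sharing the same fine-scale residues are 'locally equivalent'
    even when far apart in absolute terms - the self-contained modular units.
    """
    digits = []
    for base in bases:
        digits.append(pos % base)
        pos //= base
    return digits

def relative_distance(p, q, bases=(4, 4, 4)):
    """Signed per-scale offset between two positions."""
    return [b - a for a, b in zip(modular_position(p, bases),
                                  modular_position(q, bases))]

print(modular_position(21))      # [1, 1, 1]  (21 = 1 + 1*4 + 1*16)
print(relative_distance(5, 21))  # [0, 0, 1]  same fine position, one coarse block apart
```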
> I was thinking that maybe the K and V buffers can be manipulated using some image processing techniques like maxpool or convolutions. Sounds like there are many more experiments...
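As a thought experiment, max-pooling a K or V buffer along the sequence axis might look like this (toy array, `window` and shapes are assumptions): you trade positional detail for a shorter effective context.

```python
import numpy as np

def pool_kv(cache, window=2):
    """Downsample a (seq_len, d_head) K or V buffer by max-pooling
    over non-overlapping windows along the sequence axis."""
    seq_len, d = cache.shape
    usable = seq_len - seq_len % window  # drop the ragged tail
    return cache[:usable].reshape(-1, window, d).max(axis=1)

k = np.arange(12, dtype=float).reshape(6, 2)  # 6 positions, d_head = 2
print(pool_kv(k).shape)  # (3, 2): half the sequence length
print(pool_kv(k)[0])     # [2. 3.]: elementwise max over positions 0 and 1
```

Average-pooling or a learned 1-D convolution would slot in the same way; whether the attention layer downstream tolerates it is exactly the experiment.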
You almost want a dedicated tool to visualise various possible generation buckets. To get a sense for the range with a certain function applied, whether it's temperature or something more...
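The simplest version of "range with a function applied" is just temperature over the raw logits (toy logits below): low temperature sharpens toward the top token, high temperature flattens things out, and that spread is what such a tool would plot.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Token distribution after dividing raw logits by a temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [4.0, 3.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, np.round(softmax(logits, t), 3))
# Sweeping t and plotting these rows side by side is the 'bucket' view.
```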