mooff
Hey, no worries. I found it tricky to get everything compiled right. I see the Docker-based automation, but I didn't have the resources (mainly disk space) to use it, so...
Hey, mostly just housekeeping. Added the gem changes to the build config for all platforms, then did a full rebuild using Docker. I'd squash / elide some of the commits,...
- 'Server MAY leave connection open pending registration via the web page, and proceed once success or failure has been reported by the backend'
- 'Clients SHOULD play elevator music...
So the [abstract from the Longformer paper](https://arxiv.org/abs/2004.05150) answers some questions about computational complexity: > Transformer-based models are unable to process long sequences due to their self-attention operation, which scales **quadratically**...
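To make the quadratic term concrete, here's a toy sketch of standard scaled dot-product attention scores (random vectors, not a real model): the score matrix alone is n x n, so doubling the context quadruples it.

```python
import numpy as np

def attention_score_matrix(n_tokens, d_model, seed=0):
    """Scaled dot-product attention scores: an n_tokens x n_tokens matrix."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((n_tokens, d_model))
    k = rng.standard_normal((n_tokens, d_model))
    # The n x n score matrix is the quadratic cost the abstract refers to.
    return q @ k.T / np.sqrt(d_model)

print(attention_score_matrix(512, 64).size)   # 262144 entries
print(attention_score_matrix(1024, 64).size)  # 1048576 entries, 4x for 2x context
```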
I woke up thinking about interpolating input vectors, and/or the early representations built from them - whatever is created when the attention matrices are applied to the input embeddings. For...
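One cheap way to play with that idea, on toy vectors rather than real model activations: linear interpolation, plus spherical interpolation, which better preserves the norm profile when embedding directions are what matters.

```python
import numpy as np

def lerp(a, b, t):
    """Straight-line interpolation between two embedding vectors."""
    return (1 - t) * a + t * b

def slerp(a, b, t, eps=1e-8):
    """Spherical interpolation: rotate from a toward b at constant angular rate."""
    a_n = a / (np.linalg.norm(a) + eps)
    b_n = b / (np.linalg.norm(b) + eps)
    omega = np.arccos(np.clip(a_n @ b_n, -1.0, 1.0))
    if omega < eps:  # nearly parallel: fall back to lerp
        return lerp(a, b, t)
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
print(lerp(a, b, 0.5))   # [0.5 0.5]
print(slerp(a, b, 0.5))  # ~[0.707 0.707], stays on the unit circle
```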
> the huge matrices are multiplied but no matter the size of the context, the output is still the embedding size that is passed to the next layer or is...
> Except it doesn't work, once you go past some token number, everything breaks down and the output becomes more and more garbage. I don't know what the reason is,...
Could you model position with something like a signed distance function? Or maybe some kind of modular arithmetic, so you can fractally divide relative distances into self-contained modular units. I...
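A rough sketch of the modular idea (purely illustrative, not any published scheme): write a position in mixed radix, so each scale wraps modulo its own base and relative distance decomposes into per-scale offsets.

```python
def modular_position(pos, bases=(4, 4, 4)):
    """Decompose an absolute position into per-scale residues, fine scale first.

    Positions sharing the same fine-scale residues are 'locally equivalent'
    even when far apart in absolute terms - the self-contained modular units.
    """
    digits = []
    for base in bases:
        digits.append(pos % base)
        pos //= base
    return digits

def relative_distance(p, q, bases=(4, 4, 4)):
    """Signed per-scale offset between two positions."""
    return [b - a for a, b in zip(modular_position(p, bases),
                                  modular_position(q, bases))]

print(modular_position(21))      # [1, 1, 1]  (21 = 1 + 1*4 + 1*16)
print(relative_distance(5, 21))  # [0, 0, 1]  same fine position, one coarse block apart
```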
> I was thinking that maybe the K and V buffers can be manipulated using some image processing techniques like maxpool or convolutions. Sounds like there are many more experiments...
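As a thought experiment, max-pooling a K or V buffer along the sequence axis might look like this (toy array, `window` and shapes are assumptions): you trade positional detail for a shorter effective context.

```python
import numpy as np

def pool_kv(cache, window=2):
    """Downsample a (seq_len, d_head) K or V buffer by max-pooling
    over non-overlapping windows along the sequence axis."""
    seq_len, d = cache.shape
    usable = seq_len - seq_len % window  # drop the ragged tail
    return cache[:usable].reshape(-1, window, d).max(axis=1)

k = np.arange(12, dtype=float).reshape(6, 2)  # 6 positions, d_head = 2
print(pool_kv(k).shape)  # (3, 2): half the sequence length
print(pool_kv(k)[0])     # [2. 3.]: elementwise max over positions 0 and 1
```

Average-pooling or a learned 1-D convolution would slot in the same way; whether the attention layer downstream tolerates it is exactly the experiment.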
You almost want a dedicated tool to visualise various possible generation buckets. To get a sense for the range with a certain function applied, whether it's temperature or something more...
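The simplest version of "range with a function applied" is just temperature over the raw logits (toy logits below): low temperature sharpens toward the top token, high temperature flattens things out, and that spread is what such a tool would plot.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Token distribution after dividing raw logits by a temperature."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = [4.0, 3.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    print(t, np.round(softmax(logits, t), 3))
# Sweeping t and plotting these rows side by side is the 'bucket' view.
```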