Max Ryabinin

Results: 21 issues by Max Ryabinin

This PR integrates blockwise quantization from [bitsandbytes](https://github.com/facebookresearch/bitsandbytes) as a new compression mechanism for Hivemind. The important part is that it is an *optional* compression protocol: the user only needs to install...

Currently, all interfaces with libp2p gloss over the inner workings of this library, which might not be very helpful for contributors who want to understand the design decisions...

documentation

Right now, if one decides to train with DecentralizedOptimizer using multiple GPUs with something like DistributedDataParallel from PyTorch, they might face an excessive amount of likely redundant network traffic, since...

enhancement

Given that we request only one expert from the server at a time, it might be possible to keep many experts in CPU memory and to process larger batches in...

enhancement

Currently, it's possible to load experts whose UIDs do not match the expected pattern from the checkpoint directory during server startup. We need to validate each expert UID at initialization.

invalid
server
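The validation proposed above could be sketched as follows; this is a minimal pure-Python illustration, and the `"prefix.[low:high]"` pattern syntax and the helper name are assumptions for the example, not Hivemind's actual implementation.

```python
import re

def uid_matches_pattern(uid: str, pattern: str) -> bool:
    """Check that every dot-separated part of the UID fits the pattern part.

    Pattern parts are either literals (e.g. "expert") or half-open numeric
    ranges written as "[low:high]" (an assumed syntax for this sketch).
    """
    uid_parts = uid.split(".")
    pattern_parts = pattern.split(".")
    if len(uid_parts) != len(pattern_parts):
        return False
    for uid_part, pattern_part in zip(uid_parts, pattern_parts):
        range_match = re.fullmatch(r"\[(\d+):(\d+)\]", pattern_part)
        if range_match:  # numeric range, e.g. [0:256]
            low, high = map(int, range_match.groups())
            if not (uid_part.isdigit() and low <= int(uid_part) < high):
                return False
        elif uid_part != pattern_part:  # literal part, e.g. "expert"
            return False
    return True

print(uid_matches_pattern("expert.13", "expert.[0:256]"))   # True
print(uid_matches_pattern("ffn.13", "expert.[0:256]"))      # False
print(uid_matches_pattern("expert.300", "expert.[0:256]"))  # False
```

A server could run such a check over every checkpoint directory name at initialization and refuse to start on a mismatch, instead of silently loading a stray expert.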

Right now, it's tricky to start a Server with a custom expert or to change optimizer/scheduler parameters without modifying the code.

- [ ] Implement hierarchical YAML configuration that...

enhancement
help wanted
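A hierarchical configuration along these lines might look as follows; the key names and structure below are purely illustrative, not an existing Hivemind schema.

```yaml
# Hypothetical server config sketch: each top-level section maps onto one
# component (server, optimizer, scheduler) so parameters can be changed
# without touching the code.
server:
  expert_cls: my_module.CustomExpert   # assumed import path for a custom expert
  num_experts: 4
optimizer:
  name: Adam
  lr: 1.0e-3
scheduler:
  name: linear_warmup
  warmup_steps: 1000
```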

Right now, we don't fully utilize the Tensor Core capabilities of modern NVIDIA GPUs, since all server-side computations are done in full precision. It might be possible to switch to...

enhancement

Since larger Transformer models are trained with larger batches, it's probably beneficial to accumulate gradients from several backward requests before making a step. This can be implemented in `ExpertBackend.apply_gradients()`, and the...

enhancement
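The accumulation idea can be sketched in pure Python; real gradients would be tensors, but plain floats keep the example self-contained. The class and method names below mirror the issue's `apply_gradients()` only loosely and are not Hivemind's actual implementation.

```python
class GradientAccumulator:
    """Sum gradients over N backward requests, then take one optimizer step."""

    def __init__(self, num_params: int, accumulation_steps: int):
        self.buffer = [0.0] * num_params       # running sum of gradients
        self.accumulation_steps = accumulation_steps
        self.steps_seen = 0

    def apply_gradients(self, grads, step_fn):
        """Add grads to the buffer; call step_fn with the average every N calls."""
        for i, grad in enumerate(grads):
            self.buffer[i] += grad
        self.steps_seen += 1
        if self.steps_seen == self.accumulation_steps:
            averaged = [g / self.accumulation_steps for g in self.buffer]
            step_fn(averaged)                  # one step per N backward requests
            self.buffer = [0.0] * len(self.buffer)
            self.steps_seen = 0

applied = []
acc = GradientAccumulator(num_params=2, accumulation_steps=2)
acc.apply_gradients([1.0, 2.0], applied.append)  # buffered, no step yet
acc.apply_gradients([3.0, 4.0], applied.append)  # triggers one averaged step
print(applied)  # [[2.0, 3.0]]
```

One design question this raises (and which the truncated issue likely discusses) is whether to average or sum the accumulated gradients; averaging, as above, keeps the effective learning rate independent of the accumulation factor.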

As of now, `forward/backward_timeout` arguments correspond only to timeouts for Server interactions. However, this is not the only possible cause of freezes: for example, beam search might take too long...

enhancement
mixture-of-experts
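A generic client-side deadline for any potentially slow stage (beam search included, not just Server RPCs) could be sketched with the standard library; the helper below is illustrative and not part of the Hivemind API.

```python
import concurrent.futures

def run_with_timeout(fn, timeout: float, *args, **kwargs):
    """Run fn in a worker thread and raise TimeoutError past the deadline.

    Caveat: the worker thread itself is not killed on timeout, so the
    executor's shutdown still waits for fn to finish in the background.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as executor:
        future = executor.submit(fn, *args, **kwargs)
        return future.result(timeout=timeout)

print(run_with_timeout(lambda x: x * 2, 1.0, 21))  # 42
```

Wrapping each stage of a request pipeline this way would let a single `timeout` argument bound end-to-end latency rather than only the Server round-trips.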

**Describe the bug** While working on https://github.com/learning-at-home/hivemind/pull/490, I found that if I have bitsandbytes installed in a GPU-enabled environment, I get an error when running [test_adaptive_compression](https://github.com/learning-at-home/hivemind/blob/master/tests/test_compression.py#L152), which happens to be...

bug
ci