Nick Hill
Chained/recursive async `ListenableFuture` transformations (e.g. via `Futures.transformAsync`/`catchingAsync`/`scheduleAsync`/... or `SettableFuture.setFuture`) currently cause indefinite growth of live objects, with intermediate futures not eligible for collection until the entire chain/tree is completed, even...
@nitsanw @franz1981 would be interested in your thoughts on this experiment to make the array-based queues more progressive... I was playing around with it a while back after @belliottsmith's comments...
Hi, we would like to build mleap-serving images using a different base image (Red Hat UBI-based rather than Ubuntu), and so were wondering if the Dockerfile and any other build logic...
I've seen that TorchServe [now supports](https://github.com/pytorch/serve/pull/1190) the [KServe V2 Prediction API](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2) but as far as I can see, this is only the REST flavour of it, via the `kservev2` service...
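For comparison, this is roughly what a call to the REST flavour of the V2 protocol looks like from the client side; a minimal sketch assuming a local server, with a hypothetical model name and input tensor (the `/v2/models/{name}/infer` path and payload shape follow the public V2 inference protocol spec):

```python
import requests

# Hypothetical endpoint and model name; the path and payload shape
# come from the KServe V2 REST inference protocol.
url = "http://localhost:8080/v2/models/my-model/infer"
payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [1.0, 2.0, 3.0, 4.0],
        }
    ]
}
resp = requests.post(url, json=payload)
resp.raise_for_status()
print(resp.json()["outputs"])  # output tensors, per the V2 spec
```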
# What does this PR do?

Identical inputs to GPT-J and CodeGen models will currently generate different outputs if they are padded differently (for example in a batch of variable...
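One common remedy for this class of bug is to derive `position_ids` from the attention mask, so that left-padding doesn't shift the positions fed to the rotary embeddings. A minimal sketch, with a hypothetical helper name (not necessarily the code this PR uses):

```python
import torch

def position_ids_from_mask(attention_mask: torch.Tensor) -> torch.Tensor:
    """Assign position 0 to the first real token rather than the first
    pad token, so identically-tokenized prompts get identical positions
    regardless of how much left-padding precedes them."""
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 1)  # dummy value for pads
    return position_ids

mask = torch.tensor([[0, 0, 1, 1, 1],   # left-padded
                     [1, 1, 1, 1, 1]])  # unpadded
print(position_ids_from_mask(mask))
# tensor([[1, 1, 0, 1, 2],
#         [0, 1, 2, 3, 4]])
```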
#### Motivation

Currently, to avoid OOM you must set a "worst case" max batch size based on the desired max sequence length. This means that (a) throughput is unnecessarily limited...
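To illustrate the alternative, here is a toy admission check that budgets by total tokens in the batch rather than by a fixed worst-case batch size (the names and numbers are hypothetical, not the project's actual scheduler):

```python
def can_admit(batch_seq_lens: list[int], new_seq_len: int, token_budget: int) -> bool:
    """A fixed worst-case max batch size must assume every sequence is at
    max length; budgeting by tokens lets many short requests share a batch
    while still capping memory when sequences are long."""
    return sum(batch_seq_lens) + new_seq_len <= token_budget

# With an 8192-token budget, sixteen 512-token requests batch together,
# but a fifth 2048-token request would be deferred.
assert can_admit([512] * 15, 512, 8192)
assert not can_admit([2048] * 4, 2048, 8192)
```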
Currently, position_ids are always maintained/updated in the CausalLM case, but this is unnecessary for models like BLOOM, which don't use them.
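A sketch of the kind of gating this implies, assuming a hypothetical `uses_position_ids` flag (BLOOM-style models get positional information from ALiBi attention biases, so the tensor is pure overhead for them):

```python
import torch

def prepare_inputs(input_ids: torch.Tensor, uses_position_ids: bool) -> dict:
    inputs = {"input_ids": input_ids}
    if uses_position_ids:
        # Positions 0..n-1 per sequence; skipped entirely for models
        # like BLOOM that never read them.
        batch, seq_len = input_ids.shape
        inputs["position_ids"] = torch.arange(seq_len).expand(batch, -1)
    return inputs
```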
It seems to work fine and loads 4-10x faster for me depending on the storage/page cache (non-sharded 20B parameter model). However, when loaded this way, inference appears to be 10-15%...
Benefits:
- Centralizes this logic that's on the critical inference-loop path and does it in Rust instead of Python
- Simplifies the Python side of the code, decoupling next-token generation...
See https://github.com/huggingface/transformers/pull/24453. I didn't add validation to the `__init__` method since it's not done for other values/warpers.
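For context, eager validation in a warper's constructor would look something like this (a hypothetical illustration, not the PR's code):

```python
import torch
from transformers import LogitsWarper

class ClampedTemperatureWarper(LogitsWarper):
    """Hypothetical warper that validates its parameter in __init__,
    failing fast instead of misbehaving later at call time."""

    def __init__(self, temperature: float):
        if not isinstance(temperature, (int, float)) or temperature <= 0:
            raise ValueError(f"`temperature` must be a positive number, got {temperature}")
        self.temperature = float(temperature)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        return scores / self.temperature
```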