Cannot test with restful_api
Great job!
But I have some problems with the restful_api test and hope to get some help here.
Tested with commit: https://github.com/microsoft/DeepSpeed-MII/commit/ddbc6fc11b914abc2f166f346845f2476f61bfe7
GPU: NVIDIA A10
Launch the service:
import mii
model_name_or_path = "/dataset/huggyllama/llama-7b"
max_model_length = 2048
mii.serve(
    model_name_or_path=model_name_or_path,
    max_length=max_model_length,
    deployment_name="mii_test",
    tensor_parallel=1,
    replica_num=1,
    enable_restful_api=True,
    restful_api_port=8000,
)
Test with curl:
curl --header "Content-Type: application/json" --request POST -d '{"prompts": "[DeepSpeed is]", "max_length": 128}' http://127.0.0.1:8000/mii/mii_test
And I got this error:
[2023-11-15 10:54:46,067] ERROR in app: Exception on /mii/mii_test [POST]
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 867, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 852, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
File "/usr/local/lib/python3.10/dist-packages/flask_restful/__init__.py", line 489, in wrapper
resp = resource(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/flask/views.py", line 109, in view
return current_app.ensure_sync(self.dispatch_request)(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/flask_restful/__init__.py", line 604, in dispatch_request
resp = meth(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/mii/grpc_related/restful_gateway.py", line 31, in post
KeyError: 'request'
127.0.0.1 - - [15/Nov/2023 10:54:46] "POST /mii/mii_test HTTP/1.1" 500 -
And the Python script gives the same result:
import json
import requests
url = f"http://localhost:8000/mii/mii_test"
params = {"prompts": ["DeepSpeed is", "Seattle is"], "max_length": 128}
json_params = json.dumps(params)
output = requests.post(
    url, data=json_params, headers={"Content-Type": "application/json"}
)
print(output)
I just wonder if I'm missing any hyperparameter settings?
Same issue here, I don't know what is wrong.
Hi @irasin it looks like you are using an older version of MII. Your error message for line 31 of mii/grpc_related/restful_gateway.py indicates you are trying to get the request key from the dictionary, but this was changed in https://github.com/microsoft/DeepSpeed-MII/commit/ddbc6fc11b914abc2f166f346845f2476f61bfe7.
Can you please update to the latest source build of DeepSpeed and DeepSpeed-MII?
pip uninstall deepspeed deepspeed-mii -y
pip install git+https://github.com/microsoft/deepspeed.git
pip install git+https://github.com/microsoft/deepspeed-mii.git
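A quick way to confirm which builds are actually installed after reinstalling (a small sketch using only the standard library, assuming a normal pip environment):

import importlib.metadata

# Print the installed versions of both packages to verify the source builds are active.
for pkg in ("deepspeed", "deepspeed-mii"):
    try:
        print(pkg, importlib.metadata.version(pkg))
    except importlib.metadata.PackageNotFoundError:
        print(pkg, "is not installed")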
@mrwyattii yes, this solved it! But when I make requests, another issue comes up. Would you help to check this?
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mii/batching/ragged_batching.py", line 812, in __call__
    self.generate()
  File "/usr/local/lib/python3.8/dist-packages/mii/batching/utils.py", line 31, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mii/batching/ragged_batching.py", line 379, in generate
    next_token_logits = self.put(
  File "/usr/local/lib/python3.8/dist-packages/mii/batching/ragged_batching.py", line 717, in put
    return self.inference_engine.put(uids, tokenized_input)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/v2/engine_v2.py", line 127, in put
    self.model.maybe_allocate_kv(host_seq_desc, tokens.numel())
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/v2/model_implementations/inference_transformer_base.py", line 357, in maybe_allocate_kv
    sequence.extend_kv_cache(new_blocks)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/v2/ragged/sequence_descriptor.py", line 259, in extend_kv_cache
    shadow_alloc_group[cur_blocks:cur_blocks + new_blocks].copy(new_group_ids)
RuntimeError: The size of tensor a (8) must match the size of tensor b (12) at non-singleton dimension 0
Hi, @mrwyattii, many thanks for your reply.
After using the latest source builds of DeepSpeed and DeepSpeed-MII, the RESTful API works now. But maybe because the result contains some escape characters, it cannot be parsed into JSON format. Here is the example:
- Launch the service:
import mii
model_name_or_path = "/dataset/huggyllama/llama-7b"
max_model_length = 2048
mii.serve(
    model_name_or_path=model_name_or_path,
    max_length=max_model_length,
    deployment_name="mii_test",
    tensor_parallel=1,
    replica_num=1,
    enable_restful_api=True,
    restful_api_port=8000,
)
- Test with Python:
import json
import requests
url = f"http://localhost:8000/mii/mii_test"
params = {"prompts": ["DeepSpeed is", "Seattle is a place"], "max_length": 128}
json_params = json.dumps(params)
output = requests.post(
    url, data=json_params, headers={"Content-Type": "application/json"}
)
text = output.text
print(text)
json_res = json.loads(text)
assert isinstance(json_res, str)  # it's still a string, maybe because of escape characters?
print(json_res)
The result is as below; json_res is still a string.
"{\n \"response\": [\n \"the solution for low speed, high-current IGBT switching applications that involve controlling high power from a series of IGBT modules, such as output inverters for PV, wind, motor drives, UPS, or Xenon lighting applications.\\nThe platform provides an open and modular solution for achieving fast switching times, meeting the rapid rise in demand for higher power modules. This is enabled through the modular design of the DeepSpeed core, which offers high-speed operation, reducing the number of components and improving size and cost.\\nDeepSpeed is fully compli\",\n \"I've had the pleasure of knowing, through the virtual ether, for over 15 years. I have also been fortunate enough to visit Seattle on several occasions over the years as well as being able to collaborate and visit artists' studios in the Northwest. When opportunity knocked and the folks at Art Informel extended an invitation to show at their space, I felt the stars were aligned, that this was meant to be. I hope you'll join me in Seattle for the opening this Saturday, December 10th, from 5-9PM, at\"\n ]\n}"
{
"response": [
"the solution for low speed, high-current IGBT switching applications that involve controlling high power from a series of IGBT modules, such as output inverters for PV, wind, motor drives, UPS, or Xenon lighting applications.\nThe platform provides an open and modular solution for achieving fast switching times, meeting the rapid rise in demand for higher power modules. This is enabled through the modular design of the DeepSpeed core, which offers high-speed operation, reducing the number of components and improving size and cost.\nDeepSpeed is fully compli",
"I've had the pleasure of knowing, through the virtual ether, for over 15 years. I have also been fortunate enough to visit Seattle on several occasions over the years as well as being able to collaborate and visit artists' studios in the Northwest. When opportunity knocked and the folks at Art Informel extended an invitation to show at their space, I felt the stars were aligned, that this was meant to be. I hope you'll join me in Seattle for the opening this Saturday, December 10th, from 5-9PM, at"
]
}
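For reference, the body here appears to be JSON that was serialized twice, so applying json.loads a second time recovers the structure. A client-side workaround sketch continuing from the script above (not the intended API behavior):

decoded = json.loads(text)        # first pass returns a str, not a dict
if isinstance(decoded, str):      # double-encoded response: decode once more
    decoded = json.loads(decoded)
print(decoded["response"][0])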
Hope to get an answer again.
@ChristineSeven can you share the full script that you are using to deploy MII? Specifically, I would like to know what model, tensor parallel settings, etc.
@irasin can you please try the following instead?
import json
import requests
url = f"http://localhost:8000/mii/mii_test"
params = {"prompts": ["DeepSpeed is", "Seattle is a place"], "max_length": 128}
json_params = json.dumps(params)
output = requests.post(
    url, data=json_params, headers={"Content-Type": "application/json"}
)
print(output.json())
Hi, @mrwyattii, the results are the same.
I also face the same issue, internal error. Please help us.
@irasin Is your Flask version <3.0.0? If so, I think I have the solution in #328. Can you try with that PR? You can install it with:
pip install git+https://github.com/Microsoft/DeepSpeed-MII@mrwyattii/threaded-rest-api
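A quick way to check the Flask version in the serving environment (a minimal sketch using the standard library):

import importlib.metadata

# The PR above targets environments where this prints a version below 3.0.0.
print(importlib.metadata.version("flask"))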
@cableyang can you please share the full script that you are running so that I can try to reproduce the error? Thanks
With the latest DeepSpeed-MII commit, I can get the JSON-format output now. Thanks a lot, @mrwyattii
BTW, I wonder where I can get the benchmark scripts you used in the Performance Evaluation of https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-fastgen/README.md. It seems that DeepSpeed-MII has much higher throughput and lower latency than vLLM, which is amazing. If possible, I would like to test some other models in my local environment.
I tested with the benchmark_server.py script in the vLLM repo, which sends 1000 requests to the server at the same time, and I keep getting SYN flood error messages in the dmesg output like
[1021332.329430] TCP: request_sock_TCP: Possible SYN flooding on port 8000. Dropping request. Check SNMP counters.
I'm curious whether there is any limit on the maximum number of connections on the server side or in the restful_api.
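One way to avoid flooding the server from the client side is to cap the number of in-flight requests with a semaphore. A minimal sketch, assuming an aiohttp-based client like the benchmark script mentioned above (the endpoint URL here is a placeholder):

import asyncio

import aiohttp

URL = "http://127.0.0.1:8000/mii/mii_test"  # placeholder endpoint
MAX_IN_FLIGHT = 32                          # cap on concurrent connections


async def send_one(session, sem, prompt):
    async with sem:  # wait for a free slot before opening a connection
        async with session.post(URL, json={"prompts": [prompt], "max_length": 128}) as resp:
            return await resp.text()


async def run(prompts):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(send_one(session, sem, p) for p in prompts))


# Example: 1000 prompts, but never more than MAX_IN_FLIGHT open connections at once.
# results = asyncio.run(run(["DeepSpeed is"] * 1000))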
Sorry for the late reply. Here is the script I am running:
import argparse
import asyncio
import json
import random
import time
from typing import AsyncGenerator, List, Tuple

import aiohttp
import numpy as np

token_num = 0


def sample_requests() -> List[Tuple[str, dict]]:
    # Load the dataset.
    content_list = []
    num_all = 0
    with open("457.json", "r", encoding="utf-8") as f:
        lines = f.readlines()
        print(len(lines))
        for line in lines:
            if line:
                data = json.loads(line)
                content_list.append(data)
    print(num_all)
    print(len(content_list))
    print(content_list[0])
    print("read data set finish")
    prompts = [content["question"] for content in content_list]
    tokenized_dataset = []
    for i in range(len(content_list)):
        tokenized_dataset.append((prompts[i], content_list[i]))
    return tokenized_dataset


async def send_request(prompt: str, origin_json: dict) -> dict:
    global token_num
    headers = {"Content-Type": "application/json", "User-Agent": "Benchmark Client"}
    url = "http://10.10.10.10:28093/mii/mistral-deployment"
    params = {"prompts": [prompt], "max_length": 4096}
    json_params = json.dumps(params)
    timeout = aiohttp.ClientTimeout(total=3 * 3600)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        while True:
            async with session.post(url, headers=headers, data=json_params) as response:
                chunks = []
                async for chunk, _ in response.content.iter_chunks():
                    chunks.append(chunk)
            output = b"".join(chunks).decode("utf-8")
            print(output)
            try:
                result = json.loads(output)
                origin_json["model_answer"] = result["response"][0]
            except Exception:
                origin_json["model_answer"] = ""
            token_num += 1
            print(token_num)
            if "error" not in output:
                break
    return origin_json


# NOTE: get_request was not included in the original snippet; a minimal stand-in
# that simply yields each request is assumed here.
async def get_request(
    input_requests: List[Tuple[str, dict]],
) -> AsyncGenerator[Tuple[str, dict], None]:
    for request in input_requests:
        yield request


async def batchmark(input_requests: List[Tuple[str, dict]]) -> List[dict]:
    tasks: List[asyncio.Task] = []
    async for request in get_request(input_requests):
        prompt, origin_json = request
        task = asyncio.create_task(send_request(prompt, origin_json))
        tasks.append(task)
    results = await asyncio.gather(*tasks)
    return results


def main(args: argparse.Namespace):
    print(args)
    random.seed(args.seed)
    np.random.seed(args.seed)
    input_requests = sample_requests()
    batch_start_time = time.time()
    # Send the requests in batches of 50 and append the answers to the output file.
    for i in range(0, len(input_requests), 50):
        total_results = asyncio.run(batchmark(input_requests[i:i + 50]))
        with open("457_deepspeed_out.json", "a+", encoding="utf-8") as f1:
            for origin_json in total_results:
                json_data = json.dumps(origin_json, ensure_ascii=False)
                f1.write(json_data + "\n")
                f1.flush()
    batch_end_time = time.time()
    print(batch_end_time - batch_start_time)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Batchmark the online serving throughput.")
    parser.add_argument("--seed", type=int, default=0)
    args = parser.parse_args()
    main(args)
The server code is like this:
client = mii.serve(
"mistralai/Mistral-7B-v0.1",
deployment_name="mistral-deployment",
enable_restful_api=True,
restful_api_port=28080,
)
@irasin
The benchmarks we ran to collect data for our FastGen blog post can be found here: https://github.com/microsoft/DeepSpeedExamples/tree/master/benchmarks/inference/mii
Note that we did not use the RESTful API in our benchmarks and instead used the Python API (i.e., mii.client). I imagine that sending 1000 requests at once is overloading the Flask server we stand up for the RESTful API. I will investigate how we might be able to better handle a large number of requests like this.
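For reference, a minimal sketch of the Python API path, connecting to an already-running deployment (deployment name taken from earlier in this thread; generation keyword names such as max_new_tokens may vary slightly across MII versions):

import mii

# Connect to the running deployment by its deployment_name.
client = mii.client("mii_test")

# Generate over gRPC directly, bypassing the RESTful gateway.
responses = client.generate(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
for r in responses:
    print(r.generated_text)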
