Cannot test with restful_api
Great job!
But I have some problems with the restful_api test and hope to get some help here.
Tested with commit: https://github.com/microsoft/DeepSpeed-MII/commit/ddbc6fc11b914abc2f166f346845f2476f61bfe7
GPU: NVIDIA A10
Launch the service:
import mii
model_name_or_path = "/dataset/huggyllama/llama-7b"
max_model_length = 2048
mii.serve(
    model_name_or_path=model_name_or_path,
    max_length=max_model_length,
    deployment_name="mii_test",
    tensor_parallel=1,
    replica_num=1,
    enable_restful_api=True,
    restful_api_port=8000,
)
Test with curl:
curl --header "Content-Type: application/json" --request POST -d '{"prompts": "[DeepSpeed is]", "max_length": 128}' http://127.0.0.1:8000/mii/mii_test
And I got this error:
[2023-11-15 10:54:46,067] ERROR in app: Exception on /mii/mii_test [POST]
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 867, in full_dispatch_request
rv = self.dispatch_request()
File "/usr/local/lib/python3.10/dist-packages/flask/app.py", line 852, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
File "/usr/local/lib/python3.10/dist-packages/flask_restful/__init__.py", line 489, in wrapper
resp = resource(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/flask/views.py", line 109, in view
return current_app.ensure_sync(self.dispatch_request)(**kwargs)
File "/usr/local/lib/python3.10/dist-packages/flask_restful/__init__.py", line 604, in dispatch_request
resp = meth(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/mii/grpc_related/restful_gateway.py", line 31, in post
KeyError: 'request'
127.0.0.1 - - [15/Nov/2023 10:54:46] "POST /mii/mii_test HTTP/1.1" 500 -
And the Python script gives the same result:
import json
import requests
url = f"http://localhost:8000/mii/mii_test"
params = {"prompts": ["DeepSpeed is", "Seattle is"], "max_length": 128}
json_params = json.dumps(params)
output = requests.post(
    url, data=json_params, headers={"Content-Type": "application/json"}
)
print(output)
I just wonder if I'm missing any hyperparameter settings?
Same issue here, I don't know what is wrong.
Hi @irasin it looks like you are using an older version of MII. Your error message for line 31 of mii/grpc_related/restful_gateway.py indicates you are trying to get the request key from the dictionary, but this was changed in https://github.com/microsoft/DeepSpeed-MII/commit/ddbc6fc11b914abc2f166f346845f2476f61bfe7.
Can you please update to the latest source build of DeepSpeed and DeepSpeed-MII?
pip uninstall deepspeed deepspeed-mii -y
pip install git+https://github.com/microsoft/deepspeed.git
pip install git+https://github.com/microsoft/deepspeed-mii.git
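A quick way to confirm which builds are actually installed after reinstalling (a small sketch using only the standard library, assuming a normal pip environment):

import importlib.metadata

# Print the installed versions of both packages to verify the source builds are active.
for pkg in ("deepspeed", "deepspeed-mii"):
    try:
        print(pkg, importlib.metadata.version(pkg))
    except importlib.metadata.PackageNotFoundError:
        print(pkg, "is not installed")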
@mrwyattii yes, this solved it! But when I make requests, another issue comes up. Would you help to check this?
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mii/batching/ragged_batching.py", line 812, in __call__
    self.generate()
  File "/usr/local/lib/python3.8/dist-packages/mii/batching/utils.py", line 31, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/mii/batching/ragged_batching.py", line 379, in generate
    next_token_logits = self.put(
  File "/usr/local/lib/python3.8/dist-packages/mii/batching/ragged_batching.py", line 717, in put
    return self.inference_engine.put(uids, tokenized_input)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/v2/engine_v2.py", line 127, in put
    self.model.maybe_allocate_kv(host_seq_desc, tokens.numel())
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/v2/model_implementations/inference_transformer_base.py", line 357, in maybe_allocate_kv
    sequence.extend_kv_cache(new_blocks)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/inference/v2/ragged/sequence_descriptor.py", line 259, in extend_kv_cache
    shadow_alloc_group[cur_blocks:cur_blocks + new_blocks].copy(new_group_ids)
RuntimeError: The size of tensor a (8) must match the size of tensor b (12) at non-singleton dimension 0
Hi, @mrwyattii, many thanks for your reply.
After using the latest source builds of DeepSpeed and DeepSpeed-MII, the RESTful API works now. But maybe because the result contains some escape characters, it cannot be parsed into JSON format. Here is the example:
- Launch the service:
import mii
model_name_or_path = "/dataset/huggyllama/llama-7b"
max_model_length = 2048
mii.serve(
    model_name_or_path=model_name_or_path,
    max_length=max_model_length,
    deployment_name="mii_test",
    tensor_parallel=1,
    replica_num=1,
    enable_restful_api=True,
    restful_api_port=8000,
)
- Test with Python:
import json
import requests
url = f"http://localhost:8000/mii/mii_test"
params = {"prompts": ["DeepSpeed is", "Seattle is a place"], "max_length": 128}
json_params = json.dumps(params)
output = requests.post(
    url, data=json_params, headers={"Content-Type": "application/json"}
)
text = output.text
print(text)
json_res = json.loads(text)
assert isinstance(json_res, str)  # it's still a string, maybe because of escape characters?
print(json_res)
The result is as below; json_res is still a string.
"{\n \"response\": [\n \"the solution for low speed, high-current IGBT switching applications that involve controlling high power from a series of IGBT modules, such as output inverters for PV, wind, motor drives, UPS, or Xenon lighting applications.\\nThe platform provides an open and modular solution for achieving fast switching times, meeting the rapid rise in demand for higher power modules. This is enabled through the modular design of the DeepSpeed core, which offers high-speed operation, reducing the number of components and improving size and cost.\\nDeepSpeed is fully compli\",\n \"I've had the pleasure of knowing, through the virtual ether, for over 15 years. I have also been fortunate enough to visit Seattle on several occasions over the years as well as being able to collaborate and visit artists' studios in the Northwest. When opportunity knocked and the folks at Art Informel extended an invitation to show at their space, I felt the stars were aligned, that this was meant to be. I hope you'll join me in Seattle for the opening this Saturday, December 10th, from 5-9PM, at\"\n ]\n}"
{
"response": [
"the solution for low speed, high-current IGBT switching applications that involve controlling high power from a series of IGBT modules, such as output inverters for PV, wind, motor drives, UPS, or Xenon lighting applications.\nThe platform provides an open and modular solution for achieving fast switching times, meeting the rapid rise in demand for higher power modules. This is enabled through the modular design of the DeepSpeed core, which offers high-speed operation, reducing the number of components and improving size and cost.\nDeepSpeed is fully compli",
"I've had the pleasure of knowing, through the virtual ether, for over 15 years. I have also been fortunate enough to visit Seattle on several occasions over the years as well as being able to collaborate and visit artists' studios in the Northwest. When opportunity knocked and the folks at Art Informel extended an invitation to show at their space, I felt the stars were aligned, that this was meant to be. I hope you'll join me in Seattle for the opening this Saturday, December 10th, from 5-9PM, at"
]
}
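For reference, the body here appears to be JSON that was serialized twice, so applying json.loads a second time recovers the structure. A client-side workaround sketch continuing from the script above (not the intended API behavior):

decoded = json.loads(text)        # first pass returns a str, not a dict
if isinstance(decoded, str):      # double-encoded response: decode once more
    decoded = json.loads(decoded)
print(decoded["response"][0])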
Hope to get an answer again.
@ChristineSeven can you share the full script that you are using to deploy MII? Specifically, I would like to know what model, tensor parallel settings, etc.
@irasin can you please try the following instead?
import json
import requests
url = f"http://localhost:8000/mii/mii_test"
params = {"prompts": ["DeepSpeed is", "Seattle is a place"], "max_length": 128}
json_params = json.dumps(params)
output = requests.post(
    url, data=json_params, headers={"Content-Type": "application/json"}
)
print(output.json())
Hi, @mrwyattii, the results are the same.
I also face the same issue, internal error. Please help us.
@irasin Is your Flask version <3.0.0? If so, I think I have the solution in #328. Can you try with that PR? You can install it with:
pip install git+https://github.com/Microsoft/DeepSpeed-MII@mrwyattii/threaded-rest-api
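A quick way to check the Flask version in the serving environment (a minimal sketch using the standard library):

import importlib.metadata

# The PR above targets environments where this prints a version below 3.0.0.
print(importlib.metadata.version("flask"))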
@cableyang can you please share the full script that you are running so that I can try to reproduce the error? Thanks
With the latest DeepSpeed-MII commit, I can get the JSON-format output now. Thanks a lot, @mrwyattii
BTW, I wonder where I can get the benchmark scripts you used in the Performance Evaluation of https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-fastgen/README.md. It seems that DeepSpeed-MII has much higher throughput and lower latency than vLLM, which is amazing. If possible, I would like to test some other models in my local environment.
I tested with the benchmark_server.py script in the vLLM repo, which sends 1000 requests to the server at the same time, and I keep getting SYN flood error messages in the dmesg output like
[1021332.329430] TCP: request_sock_TCP: Possible SYN flooding on port 8000. Dropping request. Check SNMP counters.
I'm curious whether there is any limit on the maximum number of connections on the server side or in the restful_api.
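One way to avoid flooding the server from the client side is to cap the number of in-flight requests with a semaphore. A minimal sketch, assuming an aiohttp-based client like the benchmark script mentioned above (the endpoint URL here is a placeholder):

import asyncio

import aiohttp

URL = "http://127.0.0.1:8000/mii/mii_test"  # placeholder endpoint
MAX_IN_FLIGHT = 32                          # cap on concurrent connections


async def send_one(session, sem, prompt):
    async with sem:  # wait for a free slot before opening a connection
        async with session.post(URL, json={"prompts": [prompt], "max_length": 128}) as resp:
            return await resp.text()


async def run(prompts):
    sem = asyncio.Semaphore(MAX_IN_FLIGHT)
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(send_one(session, sem, p) for p in prompts))


# Example: 1000 prompts, but never more than MAX_IN_FLIGHT open connections at once.
# results = asyncio.run(run(["DeepSpeed is"] * 1000))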
Sorry for the late reply. Here is the script I am running:
import argparse
import asyncio
import json
import random
import time
from typing import AsyncGenerator, List, Tuple

import aiohttp
import numpy as np

token_num = 0


def sample_requests() -> List[Tuple[str, dict]]:
    # Load the dataset.
    content_list = []
    num_all = 0
    with open("457.json", "r", encoding="utf-8") as f:
        lines = f.readlines()
        print(len(lines))
        for line in lines:
            if line:
                data = json.loads(line)
                content_list.append(data)
    print(num_all)
    print(len(content_list))
    print(content_list[0])
    print("read data set finish")
    prompts = [content["question"] for content in content_list]
    tokenized_dataset = []
    for i in range(len(content_list)):
        tokenized_dataset.append((prompts[i], content_list[i]))
    return tokenized_dataset


async def send_request(prompt: str, origin_json: dict) -> dict:
    global token_num
    headers = {"Content-Type": "application/json", "User-Agent": "Benchmark Client"}
    url = "http://10.10.10.10:28093/mii/mistral-deployment"
    params = {"prompts": [prompt], "max_length": 4096}
    json_params = json.dumps(params)
    timeout = aiohttp.ClientTimeout(total=3 * 3600)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        while True:
            async with session.post(url, headers=headers, data=json_params) as response:
                chunks = []
                async for chunk, _ in response.content.iter_chunks():
                    chunks.append(chunk)
            output = b"".join(chunks).decode("utf-8")
            print(output)
            try:
                result = json.loads(output)
                origin_json["model_answer"] = result["response"][0]
            except Exception:
                origin_json["model_answer"] = ""
            token_num += 1
            print(token_num)
            if "error" not in output:
                break
    return origin_json


# NOTE: get_request was not included in the original snippet; a minimal stand-in
# that simply yields each request is assumed here.
async def get_request(
    input_requests: List[Tuple[str, dict]],
) -> AsyncGenerator[Tuple[str, dict], None]:
    for request in input_requests:
        yield request


async def batchmark(input_requests: List[Tuple[str, dict]]) -> List[dict]:
    tasks: List[asyncio.Task] = []
    async for request in get_request(input_requests):
        prompt, origin_json = request
        task = asyncio.create_task(send_request(prompt, origin_json))
        tasks.append(task)
    results = await asyncio.gather(*tasks)
    return results


def main(args: argparse.Namespace):
    print(args)
    random.seed(args.seed)
    np.random.seed(args.seed)
    input_requests = sample_requests()
    batch_start_time = time.time()
    # Send the requests in batches of 50 and append the answers to the output file.
    for i in range(0, len(input_requests), 50):
        total_results = asyncio.run(batchmark(input_requests[i:i + 50]))
        with open("457_deepspeed_out.json", "a+", encoding="utf-8") as f1:
            for origin_json in total_results:
                json_data = json.dumps(origin_json, ensure_ascii=False)
                f1.write(json_data + "\n")
                f1.flush()
    batch_end_time = time.time()
    print(batch_end_time - batch_start_time)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Batchmark the online serving throughput.")
    parser.add_argument("--seed", type=int, default=0)
    args = parser.parse_args()
    main(args)
The server code is like this:
client = mii.serve(
"mistralai/Mistral-7B-v0.1",
deployment_name="mistral-deployment",
enable_restful_api=True,
restful_api_port=28080,
)
@irasin
The benchmarks we ran to collect data for our FastGen blog post can be found here: https://github.com/microsoft/DeepSpeedExamples/tree/master/benchmarks/inference/mii
Note that we did not use the RESTful API in our benchmarks and instead used the Python API (i.e., mii.client). I imagine that sending 1000 requests at once is overloading the Flask server we stand up for the RESTful API. I will investigate how we might be able to better handle a large number of requests like this.
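For reference, a minimal sketch of the Python API path, connecting to an already-running deployment (deployment name taken from earlier in this thread; generation keyword names such as max_new_tokens may vary slightly across MII versions):

import mii

# Connect to the running deployment by its deployment_name.
client = mii.client("mii_test")

# Generate over gRPC directly, bypassing the RESTful gateway.
responses = client.generate(["DeepSpeed is", "Seattle is"], max_new_tokens=128)
for r in responses:
    print(r.generated_text)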
