bug: Getting 500 when running dolly-v2
Describe the bug
Thank you for creating this great repo! I am running a simple "openllm start dolly-v2" and getting a 500 Internal Server Error related to "model_kwargs". I am not sure how to proceed. See the reproduction steps and errors below.
To reproduce
Created a Docker container from the Dockerfile below and ran it on a T4 instance:

FROM nvidia/cuda:11.0.3-runtime-ubuntu20.04

ENV BENTOML_HOME="/model_store/"
ENV CUDA_VISIBLE_DEVICES=0

# Update apt-get and install pip
RUN apt-get update && apt-get install -y python3-pip
RUN pip3 install openllm

EXPOSE 3000

ENTRYPOINT [ "openllm", "start" ]
CMD [ "dolly-v2", "--model-id", "databricks/dolly-v2-3b", "--device", "0", "-p", "3000", "--verbose" ]
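These are roughly the commands I use to build and run it (the image name is arbitrary; --gpus all assumes the NVIDIA Container Toolkit is installed on the host):

# build the image from the Dockerfile above, then run it with GPU access and port 3000 exposed
docker build -t dolly-v2-openllm .
docker run --gpus all -p 3000:3000 dolly-v2-openllm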
When I go to localhost:3000 and POST to the inference endpoint, I get a 500 internal server error.
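For reference, the request looks roughly like this (the prompt is just an example, and the payload shape is from memory, so the exact schema may differ):

# POST a prompt to the generate endpoint of the running server
curl -X POST http://localhost:3000/v1/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?", "llm_config": {"max_new_tokens": 128}}'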
Logs
2023-06-27T04:29:13+0000 [ERROR] [runner:llm-dolly-v2-runner:1] Exception on runner 'llm-dolly-v2-runner' method 'generate' (trace=666892bd204b5873d08d4ffe96808e94,span=4d1d367c3ded45cb,sampled=1,service.name=llm-dolly-v2-runner)
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/server/runner_app.py", line 352, in _run
ret = await runner_method.async_run(*params.args, **params.kwargs)
File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner_handle/local.py", line 59, in async_run_method
return await anyio.to_thread.run_sync(
File "/usr/local/lib/python3.8/dist-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runnable.py", line 140, in method
return self.func(obj, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/openllm/_llm.py", line 1241, in generate
return self.generate(prompt, **attrs)
File "/usr/local/lib/python3.8/dist-packages/openllm/models/dolly_v2/modeling_dolly_v2.py", line 273, in generate
return self.model(
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1120, in __call__
return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1127, in run_single
model_outputs = self.forward(model_inputs, **forward_params)
File "/usr/local/lib/python3.8/dist-packages/transformers/pipelines/base.py", line 1026, in forward
model_outputs = self._forward(model_inputs, **forward_params)
File "/usr/local/lib/python3.8/dist-packages/openllm/models/dolly_v2/modeling_dolly_v2.py", line 114, in _forward
generated_sequence = self.model.generate(
File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1271, in generate
self._validate_model_kwargs(model_kwargs.copy())
File "/usr/local/lib/python3.8/dist-packages/transformers/generation/utils.py", line 1144, in _validate_model_kwargs
raise ValueError(
ValueError: The following `model_kwargs` are not used by the model: ['accelerator'] (note: typos in the generate arguments will also show up in this list)
2023-06-27T04:29:13+0000 [INFO] [runner:llm-dolly-v2-runner:1] - "POST /generate HTTP/1.1" 500 (trace=666892bd204b5873d08d4ffe96808e94,span=13c7890ba80845b0,sampled=1,service.name=llm-dolly-v2-runner)
2023-06-27T04:29:13+0000 [INFO] [runner:llm-dolly-v2-runner:1] _ (scheme=http,method=POST,path=/generate,type=application/octet-stream,length=888) (status=500,type=text/plain,length=0) 4.769ms (trace=666892bd204b5873d08d4ffe96808e94,span=4d1d367c3ded45cb,sampled=1,service.name=llm-dolly-v2-runner)
2023-06-27T04:29:13+0000 [ERROR] [api_server:4] Exception on /v1/generate [POST] (trace=666892bd204b5873d08d4ffe96808e94,span=2aa0dd91ca45f3c4,sampled=1,service.name=llm-dolly-v2-service)
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/server/http_app.py", line 341, in api_func
output = await api.func(*args)
File "/usr/local/lib/python3.8/dist-packages/openllm/_service.py", line 86, in generate_v1
responses = await runner.generate.async_run(qa_inputs.prompt, **config)
File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner.py", line 55, in async_run
return await self.runner._runner_handle.async_run_method(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 246, in async_run_method
raise RemoteException(
bentoml.exceptions.RemoteException: An unexpected exception occurred in remote runner llm-dolly-v2-runner: [500]
Environment
openllm, version 0.1.14
System information (Optional)
Dockerfile (executing on a T4 with 4 GPUs):

FROM nvidia/cuda:11.0.3-runtime-ubuntu20.04

ENV BENTOML_HOME="/model_store/"
ENV CUDA_VISIBLE_DEVICES=0

# Update apt-get and install pip
RUN apt-get update && apt-get install -y python3-pip
RUN pip3 install openllm

EXPOSE 3000

ENTRYPOINT [ "openllm", "start" ]
CMD [ "dolly-v2", "--model-id", "databricks/dolly-v2-3b", "--device", "0", "-p", "3000", "--verbose" ]
Oh yes, I have fixed this in 0.1.16.
Also, it is recommended to use openllm build and let BentoML handle the packaging for you instead of writing a custom Dockerfile like this.
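Roughly, the flow would look like this (the Bento tag in the second command is a placeholder; the real tag is printed at the end of the build step):

# build a Bento for dolly-v2, then containerize it with BentoML
openllm build dolly-v2 --model-id databricks/dolly-v2-3b
bentoml containerize <bento_tag_printed_by_build>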
Thank you! I will try that. For these LLMs, are there any recommended base images for GPUs, or any recommended settings to pass to "bentoml containerize" for these models in particular?
bentoml containerize should just work. The purpose of containerize is to create the container image for the Bento.
In terms of how to run the container, see https://github.com/NVIDIA/nvidia-docker for leveraging GPUs in OCI-compatible containers (the Bento container is OCI-compatible).
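With the NVIDIA Container Toolkit installed, running the image with GPU access would look something like this (the image name is a placeholder for whatever bentoml containerize produced):

# expose all host GPUs to the container and map the serving port
docker run --gpus all -p 3000:3000 <containerized-bento-image>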
Can you try out with 0.1.17?
It worked with 0.1.15, and I am trying 0.1.17 now.
dolly-v2 worked with 0.1.17, but I am running into an issue downloading falcon (referenced in another issue):
To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
Enabling debug mode for current BentoML session
Make sure to have the following dependencies available: ['einops', 'xformers']
Downloading (…)lve/main/config.json: 100%|██████████| 950/950 [00:00<00:00, 113kB/s]
Downloading (…)/configuration_RW.py: 100%|██████████| 2.61k/2.61k [00:00<00:00, 368kB/s]
A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-7b:
- configuration_RW.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading (…)okenizer_config.json: 100%|██████████| 220/220 [00:00<00:00, 138kB/s]
Downloading (…)/main/tokenizer.json: 100%|██████████| 2.73M/2.73M [00:00<00:00, 36.4MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 281/281 [00:00<00:00, 193kB/s]
Downloading (…)main/modelling_RW.py: 100%|██████████| 47.6k/47.6k [00:00<00:00, 26.2MB/s]
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/openllm/cli.py", line 1557, in download_models
_ref = bentoml.transformers.get(model.tag)
File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/frameworks/transformers.py", line 292, in get
model = bentoml.models.get(tag_like)
File "/usr/local/lib/python3.8/dist-packages/simple_di/__init__.py", line 139, in _
return func(*_inject_args(bind.args), **_inject_kwargs(bind.kwargs))
File "/usr/local/lib/python3.8/dist-packages/bentoml/models.py", line 42, in get
return _model_store.get(tag)
File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/store.py", line 146, in get
raise NotFound(
bentoml.exceptions.NotFound: Model 'pt-tiiuae-falcon-7b:2f5c3cd4eace6be6c0f12981f377fb35e5bf6ee5' is not found in BentoML store <osfs '/model_store/models'>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.8/dist-packages/openllm/__main__.py", line 26, in <module>
cli()
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/openllm/cli.py", line 380, in wrapper
return func(*args, **attrs)
File "/usr/local/lib/python3.8/dist-packages/openllm/cli.py", line 353, in wrapper
return_value = func(*args, **attrs)
File "/usr/local/lib/python3.8/dist-packages/openllm/cli.py", line 328, in wrapper
return f(*args, **attrs)
File "/usr/local/lib/python3.8/dist-packages/openllm/cli.py", line 1584, in download_models
_ref = model.import_model(
File "/usr/local/lib/python3.8/dist-packages/openllm/models/falcon/modeling_falcon.py", line 64, in import_model
model = transformers.AutoModelForCausalLM.from_pretrained(
File "/usr/local/lib/python3.8/dist-packages/transformers/models/auto/auto_factory.py", line 475, in from_pretrained
model_class = get_class_from_dynamic_module(
File "/usr/local/lib/python3.8/dist-packages/transformers/dynamic_module_utils.py", line 431, in get_class_from_dynamic_module
final_module = get_cached_module_file(
File "/usr/local/lib/python3.8/dist-packages/transformers/dynamic_module_utils.py", line 268, in get_cached_module_file
modules_needed = check_imports(resolved_module_file)
File "/usr/local/lib/python3.8/dist-packages/transformers/dynamic_module_utils.py", line 151, in check_imports
raise ImportError(
ImportError: This modeling file requires the following packages that were not found in your environment: einops. Run `pip install einops`
Traceback (most recent call last):
File "/usr/local/bin/openllm", line 8, in <module>
sys.exit(cli())
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1055, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/dist-packages/click/core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/openllm/cli.py", line 380, in wrapper
return func(*args, **attrs)
File "/usr/local/lib/python3.8/dist-packages/openllm/cli.py", line 353, in wrapper
return_value = func(*args, **attrs)
File "/usr/local/lib/python3.8/dist-packages/openllm/cli.py", line 328, in wrapper
return f(*args, **attrs)
File "/usr/local/lib/python3.8/dist-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/openllm/cli.py", line 935, in start_cmd
llm = t.cast(
File "/usr/local/lib/python3.8/dist-packages/openllm/models/auto/factory.py", line 135, in for_model
llm.ensure_model_id_exists()
File "/usr/local/lib/python3.8/dist-packages/openllm/_llm.py", line 943, in ensure_model_id_exists
output = subprocess.check_output(
File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/usr/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-m', 'openllm', 'download', 'falcon', '--model-id', 'tiiuae/falcon-7b', '--machine', '--implementation', 'pt']' returned non-zero exit status 1.
As the exception says, you need to install einops.
You need to do pip install "openllm[falcon]".
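If you are building from the Dockerfile above, that means changing the install line to something like this (the falcon extra should pull in einops and the other falcon-specific dependencies):

pip3 install "openllm[falcon]"  # replaces the plain `pip3 install openllm` in the Dockerfile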
Feel free to open a different issue if you are still running into this.