FLARE Benchmark in Google Colab: VLLM dependency and testing other models
Dear PIXIU team,
Thank you so much for your contribution to the open-source community, and congratulations on being accepted to the renowned NeurIPS conference. I am trying to follow the proposed steps to run the FLARE benchmark for the model on a Google Colab T4 instance. Here are the steps:
- !git clone https://github.com/chancefocus/PIXIU.git --recursive
- !pip install -r PIXIU/requirements.txt
- !pip install -e ./PIXIU/src/financial-evaluation[multilingual]
- !sh /content/PIXIU/scripts/run_evaluation.sh
where run_evaluation.sh is:
pixiu_path='/content/PIXIU'
export PYTHONPATH="$pixiu_path/src:$pixiu_path/src/financial-evaluation:$pixiu_path/src/metrics/BARTScore"
echo $PYTHONPATH
export CUDA_VISIBLE_DEVICES="0"
python ./PIXIU/src/eval.py \
--model hf-causal-llama \
--tasks flare_edtsum,flare_ectsum \
--model_args use_accelerate=True,pretrained=chancefocus/finma-7b-full,tokenizer=chancefocus/finma-7b-full,use_fast=False,max_gen_toks=1024,dtype=float16 \
--no_cache \
--batch_size 4 \
--model_prompt 'finma_prompt' \
--num_fewshot 0 \
--write_out
The output is:
/content/PIXIU/src:/content/PIXIU/src/financial-evaluation:/content/PIXIU/src/metrics/BARTScore
2024-01-21 09:50:38.484367: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-21 09:50:38.484426: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-21 09:50:38.485793: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-21 09:50:39.709652: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
File "/content/./PIXIU/src/eval.py", line 8, in <module>
import evaluator
File "/content/PIXIU/src/evaluator.py", line 8, in <module>
import lm_eval.models
File "/content/PIXIU/src/financial-evaluation/lm_eval/models/__init__.py", line 4, in <module>
from . import huggingface
File "/content/PIXIU/src/financial-evaluation/lm_eval/models/huggingface.py", line 12, in <module>
from vllm import LLM, SamplingParams
ModuleNotFoundError: No module named 'vllm'
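A quick preflight check along the following lines would surface the missing module before the evaluation script starts. This is only a minimal sketch: the function name `missing_modules` is hypothetical, and the module list passed to it is an assumption based on the traceback above, not something taken from PIXIU.

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import.

    find_spec() only locates the module; it does not import it, so the
    check stays cheap even for heavy packages.
    """
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # "vllm" is the module named in the ModuleNotFoundError above.
    missing = missing_modules(["vllm"])
    if missing:
        print("Install these before running eval.py:", ", ".join(missing))
```

Running this in a Colab cell before run_evaluation.sh reports the gap up front instead of failing inside the import chain of eval.py.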
There are several associated questions:
- The package versions in PIXIU/requirements.txt are not pinned, which will very probably lead to version incompatibilities over time. Moreover, "vllm" is not listed there. Is it possible to pin the versions there? That would improve reproducibility and robustness to future changes.
- I am trying to evaluate a simple TinyLlama model that does not require a large GPU instance. Even after installing vllm (which also changes the versions of some packages), the evaluation fails with an error:
!pip install vllm
!sh /content/PIXIU/scripts/run_evaluation.sh
with run_evaluation.sh:
pixiu_path='/content/PIXIU'
export PYTHONPATH="$pixiu_path/src:$pixiu_path/src/financial-evaluation:$pixiu_path/src/metrics/BARTScore"
echo $PYTHONPATH
export CUDA_VISIBLE_DEVICES="0"
python ./PIXIU/src/eval.py \
--model hf-causal \
--tasks flare_australian \
--model_args pretrained=PY007/TinyLlama-1.1B-Chat-v0.1,dtype="float32" \
--no_cache
results in:
/content/PIXIU/src:/content/PIXIU/src/financial-evaluation:/content/PIXIU/src/metrics/BARTScore
2024-01-21 10:10:46.733456: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-21 10:10:46.733512: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-21 10:10:46.735055: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-21 10:10:48.082461: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[dynet] random seed: 1234
[dynet] allocating memory: 32MB
[dynet] memory allocation done.
Selected Tasks: ['flare_australian']
Using device 'cuda'
config.json: 100% 652/652 [00:00<00:00, 2.81MB/s]
model.safetensors: 100% 4.40G/4.40G [00:34<00:00, 126MB/s]
generation_config.json: 100% 63.0/63.0 [00:00<00:00, 316kB/s]
tokenizer_config.json: 100% 762/762 [00:00<00:00, 3.84MB/s]
tokenizer.model: 100% 500k/500k [00:00<00:00, 402MB/s]
tokenizer.json: 100% 1.84M/1.84M [00:00<00:00, 3.73MB/s]
added_tokens.json: 100% 21.0/21.0 [00:00<00:00, 87.8kB/s]
special_tokens_map.json: 100% 438/438 [00:00<00:00, 1.78MB/s]
Downloading readme: 100% 641/641 [00:00<00:00, 4.14MB/s]
Downloading data: 100% 65.3k/65.3k [00:02<00:00, 31.2kB/s]
Downloading data: 100% 25.2k/25.2k [00:01<00:00, 14.0kB/s]
Downloading data: 100% 16.8k/16.8k [00:01<00:00, 10.3kB/s]
Generating train split: 100% 482/482 [00:00<00:00, 3767.86 examples/s]
Generating test split: 100% 139/139 [00:00<00:00, 57843.86 examples/s]
Generating valid split: 100% 69/69 [00:00<00:00, 32771.71 examples/s]
Task: flare_australian; number of docs: 139
Task: flare_australian; document 0; context prompt (starting on next line):
Assess the creditworthiness of a customer using the following table attributes for financial status. Respond with either 'good' or 'bad'. And all the table attribute names including 8 categorical attributes and 6 numerical attributes and values have been changed to meaningless symbols to protect confidentiality of the data. For instance, 'The client has attributes: A1: 0, A2: 21.67, A3: 11.5, A4: 1, A5: 5, A6: 3, A7: 0, A8: 1, A9: 1, A10: 11, A11: 1, A12: 2, A13: 0, A14: 1.', should be classified as 'good'.
Text: The client has attributes: A1: 1.0, A2: 18.67, A3: 5.0, A4: 2.0, A5: 11.0, A6: 4.0, A7: 0.375, A8: 1.0, A9: 1.0, A10: 2.0, A11: 0.0, A12: 2.0, A13: 0.0, A14: 39.0.
(end of prompt on previous line)
Requests: Req_greedy_until("Assess the creditworthiness of a customer using the following table attributes for financial status. Respond with either 'good' or 'bad'. And all the table attribute names including 8 categorical attributes and 6 numerical attributes and values have been changed to meaningless symbols to protect confidentiality of the data. For instance, 'The client has attributes: A1: 0, A2: 21.67, A3: 11.5, A4: 1, A5: 5, A6: 3, A7: 0, A8: 1, A9: 1, A10: 11, A11: 1, A12: 2, A13: 0, A14: 1.', should be classified as 'good'. \n Text: The client has attributes: A1: 1.0, A2: 18.67, A3: 5.0, A4: 2.0, A5: 11.0, A6: 4.0, A7: 0.375, A8: 1.0, A9: 1.0, A10: 2.0, A11: 0.0, A12: 2.0, A13: 0.0, A14: 39.0. \n", {'until': None})[None]
Running greedy_until requests
Maximum 0 turns
Running 0th turn
0% 0/139 [00:00<?, ?it/s]Both `max_new_tokens` (=32) and `max_length`(=575) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
0% 0/139 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/content/./PIXIU/src/eval.py", line 97, in <module>
main()
File "/content/./PIXIU/src/eval.py", line 62, in main
results = evaluator.simple_evaluate(
File "/content/PIXIU/src/financial-evaluation/lm_eval/utils.py", line 243, in _wrapper
return fn(*args, **kwargs)
File "/content/PIXIU/src/evaluator.py", line 102, in simple_evaluate
results = evaluate(
File "/content/PIXIU/src/financial-evaluation/lm_eval/utils.py", line 243, in _wrapper
return fn(*args, **kwargs)
File "/content/PIXIU/src/evaluator.py", line 327, in evaluate
resps = getattr(lm, reqtype)([req.args for req in reqs])
File "/content/PIXIU/src/financial-evaluation/lm_eval/base.py", line 459, in greedy_until
for term in until:
TypeError: 'NoneType' object is not iterable
Could you please help with debugging? A reproducible example of evaluating some other simple model would be helpful.
- Finally, I wonder why the financial evaluation of FLARE is implemented as a modified fork of https://github.com/EleutherAI/lm-evaluation-harness. It would be helpful to know what made the fork necessary. Would it be possible to integrate the FLARE tasks into the original evaluation framework so that all tests live in one framework?
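Regarding the TypeError in the second traceback: the request is created with {'until': None}, and the loop `for term in until:` in lm_eval/base.py then iterates over None. Below is a minimal sketch of a guard, assuming that a missing stop sequence should simply mean "do not truncate". The helper names `normalize_until` and `truncate_at_stop` are hypothetical and are not part of PIXIU or lm-evaluation-harness, and cutting the text at each stop term is my assumption about what the loop body does.

```python
from typing import List, Optional, Union

StopSpec = Optional[Union[str, List[str]]]


def normalize_until(until: StopSpec) -> List[str]:
    """Turn None, a single string, or a list into a list of stop strings."""
    if until is None:
        return []
    if isinstance(until, str):
        return [until]
    return list(until)


def truncate_at_stop(text: str, until: StopSpec) -> str:
    """Cut the generated text at the first occurrence of each stop string."""
    for term in normalize_until(until):
        if term:  # str.split("") raises ValueError, so skip empty terms
            text = text.split(term)[0]
    return text
```

With a guard like this, a task that supplies no stop sequences would return the full generation instead of raising.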
Thank you in advance for your help!
@ASCRX Could you please take a look at this issue?
Hello paveles,
- Yes. We do depend on a specific version of vllm, namely vllm 0.2.7, which supports most current models. Try the following steps in the Colab environment:
!pip install bert_score
!pip install vllm==0.2.7
- Please make sure you have downloaded the BART checkpoint, and check that all required arguments are correctly specified.
- @jiminHuang can help with this problem.
Please check our latest notebook https://colab.research.google.com/drive/1ogcCmhMc5lPhUamCk6512H3PJwPEaBZN?usp=sharing. All issues should be addressed.
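To confirm that the Colab runtime actually ended up with the version mentioned in this thread, a check along these lines may help. It is a sketch only: `installed_version` and `check_pins` are hypothetical helper names, and the single pin (vllm 0.2.7) is the one stated above; no other versions are implied.

```python
from importlib import metadata
from typing import Callable, Dict, Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, str]:
    """Compare wanted versions with installed ones and report mismatches."""
    problems = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            problems[name] = f"wanted {wanted}, found {found or 'nothing installed'}"
    return problems


if __name__ == "__main__":
    # An empty dict means the environment matches the pin from this thread.
    print(check_pins({"vllm": "0.2.7"}))
```

The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.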