PIXIU FLARE Benchmark in Google Colab: VLLM dependency and testing other models

Dear PIXIU team,

Thank you so much for your contribution to the open-source community, and congratulations on being accepted to the renowned NeurIPS conference. I am trying to follow the proposed steps to run the FLARE benchmark for the model on a Google Colab T4 instance. Here are the steps:

!git clone https://github.com/chancefocus/PIXIU.git --recursive
!pip install -r PIXIU/requirements.txt
!pip install -e ./PIXIU/src/financial-evaluation[multilingual]
!sh /content/PIXIU/scripts/run_evaluation.sh

where run_evaluation.sh is:

pixiu_path='/content/PIXIU'
export PYTHONPATH="$pixiu_path/src:$pixiu_path/src/financial-evaluation:$pixiu_path/src/metrics/BARTScore"
echo $PYTHONPATH
export CUDA_VISIBLE_DEVICES="0"

python ./PIXIU/src/eval.py \
    --model hf-causal-llama \
    --tasks flare_edtsum,flare_ectsum \
    --model_args use_accelerate=True,pretrained=chancefocus/finma-7b-full,tokenizer=chancefocus/finma-7b-full,use_fast=False,max_gen_toks=1024,dtype=float16 \
    --no_cache \
    --batch_size 4 \
    --model_prompt 'finma_prompt' \
    --num_fewshot 0 \
    --write_out

The output is:

/content/PIXIU/src:/content/PIXIU/src/financial-evaluation:/content/PIXIU/src/metrics/BARTScore
2024-01-21 09:50:38.484367: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-21 09:50:38.484426: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-21 09:50:38.485793: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-21 09:50:39.709652: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
  File "/content/./PIXIU/src/eval.py", line 8, in <module>
    import evaluator
  File "/content/PIXIU/src/evaluator.py", line 8, in <module>
    import lm_eval.models
  File "/content/PIXIU/src/financial-evaluation/lm_eval/models/__init__.py", line 4, in <module>
    from . import huggingface
  File "/content/PIXIU/src/financial-evaluation/lm_eval/models/huggingface.py", line 12, in <module>
    from vllm import LLM, SamplingParams
ModuleNotFoundError: No module named 'vllm'
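
A quick way to see which imports are unavailable before launching eval.py is sketched below. This is only an illustration: the helper name missing_modules and the list of module names are assumptions taken from the traceback above, not part of PIXIU.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found on the current path."""
    # find_spec returns None for a top-level module that is not installed;
    # it does not import the module itself.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names taken from the traceback above (an assumption, extend as needed).
required = ["vllm", "lm_eval"]
print("Missing modules:", missing_modules(required))
```

Running this in a Colab cell after setting PYTHONPATH as in run_evaluation.sh should list vllm as missing until it is installed.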

There are several associated questions:

The package versions in PIXIU/requirements.txt are not pinned, which will very probably lead to version incompatibilities over time. Moreover, "vllm" is not listed there. Is it possible to pin the versions there? That would improve reproducibility and robustness to future changes.
I am trying to evaluate a simple TinyLlama model that does not require a large GPU instance. Even after installing vllm (which also changes the versions of some packages), I get an error during evaluation:

!pip install vllm
!sh /content/PIXIU/scripts/run_evaluation.sh

with run_evaluation.sh:

pixiu_path='/content/PIXIU'
export PYTHONPATH="$pixiu_path/src:$pixiu_path/src/financial-evaluation:$pixiu_path/src/metrics/BARTScore"
echo $PYTHONPATH
export CUDA_VISIBLE_DEVICES="0"

python ./PIXIU/src/eval.py \
    --model hf-causal \
    --tasks flare_australian \
    --model_args pretrained=PY007/TinyLlama-1.1B-Chat-v0.1,dtype="float32" \
    --no_cache

results in:

/content/PIXIU/src:/content/PIXIU/src/financial-evaluation:/content/PIXIU/src/metrics/BARTScore
2024-01-21 10:10:46.733456: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-21 10:10:46.733512: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-21 10:10:46.735055: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-21 10:10:48.082461: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[dynet] random seed: 1234
[dynet] allocating memory: 32MB
[dynet] memory allocation done.
Selected Tasks: ['flare_australian']
Using device 'cuda'
config.json: 100% 652/652 [00:00<00:00, 2.81MB/s]
model.safetensors: 100% 4.40G/4.40G [00:34<00:00, 126MB/s]
generation_config.json: 100% 63.0/63.0 [00:00<00:00, 316kB/s]
tokenizer_config.json: 100% 762/762 [00:00<00:00, 3.84MB/s]
tokenizer.model: 100% 500k/500k [00:00<00:00, 402MB/s]
tokenizer.json: 100% 1.84M/1.84M [00:00<00:00, 3.73MB/s]
added_tokens.json: 100% 21.0/21.0 [00:00<00:00, 87.8kB/s]
special_tokens_map.json: 100% 438/438 [00:00<00:00, 1.78MB/s]
Downloading readme: 100% 641/641 [00:00<00:00, 4.14MB/s]
Downloading data: 100% 65.3k/65.3k [00:02<00:00, 31.2kB/s]
Downloading data: 100% 25.2k/25.2k [00:01<00:00, 14.0kB/s]
Downloading data: 100% 16.8k/16.8k [00:01<00:00, 10.3kB/s]
Generating train split: 100% 482/482 [00:00<00:00, 3767.86 examples/s]
Generating test split: 100% 139/139 [00:00<00:00, 57843.86 examples/s]
Generating valid split: 100% 69/69 [00:00<00:00, 32771.71 examples/s]
Task: flare_australian; number of docs: 139
Task: flare_australian; document 0; context prompt (starting on next line):
Assess the creditworthiness of a customer using the following table attributes for financial status. Respond with either 'good' or 'bad'. And all the table attribute names including 8 categorical attributes and 6 numerical attributes and values have been changed to meaningless symbols to protect confidentiality of the data. For instance, 'The client has attributes: A1: 0, A2: 21.67, A3: 11.5, A4: 1, A5: 5, A6: 3, A7: 0, A8: 1, A9: 1, A10: 11, A11: 1, A12: 2, A13: 0, A14: 1.', should be classified as 'good'. 
 Text: The client has attributes: A1: 1.0, A2: 18.67, A3: 5.0, A4: 2.0, A5: 11.0, A6: 4.0, A7: 0.375, A8: 1.0, A9: 1.0, A10: 2.0, A11: 0.0, A12: 2.0, A13: 0.0, A14: 39.0. 

(end of prompt on previous line)
Requests: Req_greedy_until("Assess the creditworthiness of a customer using the following table attributes for financial status. Respond with either 'good' or 'bad'. And all the table attribute names including 8 categorical attributes and 6 numerical attributes and values have been changed to meaningless symbols to protect confidentiality of the data. For instance, 'The client has attributes: A1: 0, A2: 21.67, A3: 11.5, A4: 1, A5: 5, A6: 3, A7: 0, A8: 1, A9: 1, A10: 11, A11: 1, A12: 2, A13: 0, A14: 1.', should be classified as 'good'. \n Text: The client has attributes: A1: 1.0, A2: 18.67, A3: 5.0, A4: 2.0, A5: 11.0, A6: 4.0, A7: 0.375, A8: 1.0, A9: 1.0, A10: 2.0, A11: 0.0, A12: 2.0, A13: 0.0, A14: 39.0. \n", {'until': None})[None]

Running greedy_until requests
Maximum 0 turns
Running 0th turn
  0% 0/139 [00:00<?, ?it/s]Both `max_new_tokens` (=32) and `max_length`(=575) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
  0% 0/139 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/content/./PIXIU/src/eval.py", line 97, in <module>
    main()
  File "/content/./PIXIU/src/eval.py", line 62, in main
    results = evaluator.simple_evaluate(
  File "/content/PIXIU/src/financial-evaluation/lm_eval/utils.py", line 243, in _wrapper
    return fn(*args, **kwargs)
  File "/content/PIXIU/src/evaluator.py", line 102, in simple_evaluate
    results = evaluate(
  File "/content/PIXIU/src/financial-evaluation/lm_eval/utils.py", line 243, in _wrapper
    return fn(*args, **kwargs)
  File "/content/PIXIU/src/evaluator.py", line 327, in evaluate
    resps = getattr(lm, reqtype)([req.args for req in reqs])
  File "/content/PIXIU/src/financial-evaluation/lm_eval/base.py", line 459, in greedy_until
    for term in until:
TypeError: 'NoneType' object is not iterable
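
The last frame iterates over `until`, which the logged request shows as {'until': None}. A minimal sketch of a guard follows. The helpers normalize_until and truncate_at_stops are hypothetical and not taken from PIXIU, and the assumption that the loop truncates the generated text at each stop sequence is mine, not confirmed by the traceback.

```python
def normalize_until(until):
    """Turn the `until` stop-sequence argument into a list that is safe to iterate."""
    if until is None:
        return []          # no stop sequences requested
    if isinstance(until, str):
        return [until]     # a single stop sequence passed as a bare string
    return list(until)


def truncate_at_stops(text, until):
    """Cut generated text at the first occurrence of each stop sequence."""
    for term in normalize_until(until):
        if term:  # skip empty strings, which str.split rejects
            text = text.split(term)[0]
    return text


print(truncate_at_stops("good\nextra", None))
print(truncate_at_stops("good\nextra", "\n"))
```

With a guard of this kind, a task that sets no stop sequence would leave the generation untouched instead of raising the TypeError.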

Could you please help with debugging? A reproducible example of evaluating some other small model would be helpful.

Finally, I wonder why the FLARE financial evaluation is implemented as a modified fork of https://github.com/EleutherAI/lm-evaluation-harness. It would be helpful to know what motivated the fork. Would it be feasible to integrate the FLARE tasks into the original evaluation framework, so that all tests live in one framework?

Thank you in advance for your help!

Jan 21 '24 10:01 paveles

@ASCRX Could you please take a look at this issue?

Feb 22 '24 18:02 jiminHuang

Hello paveles:

Yes. We indeed use a specific version of vllm, namely 0.2.7. vLLM supports most current models. Try the following steps in the Colab environment:

!pip install bert_score
!pip install vllm==0.2.7

Please make sure you have downloaded the BART checkpoint, and check that all required arguments are correctly specified.
@jiminHuang can help with this problem.
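
To confirm that the environment matches the version mentioned above, something like the following sketch may help. The helper check_pins is hypothetical, and the only pin taken from this thread is vllm==0.2.7; any other entries would be your own assumptions.

```python
from importlib import metadata


def check_pins(pins, get_version=metadata.version):
    """Return {package: installed_version_or_None} for every package that does not match its pin."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            mismatches[name] = found
    return mismatches


# The pin below comes from the reply above; add further pins as needed.
print(check_pins({"vllm": "0.2.7"}))
```

An empty dictionary means every listed package is installed at the pinned version.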

Feb 22 '24 23:02 ASCRX

Please check our latest notebook https://colab.research.google.com/drive/1ogcCmhMc5lPhUamCk6512H3PJwPEaBZN?usp=sharing. All issues should be addressed.

Jun 17 '24 10:06 jiminHuang