
Eval script fails on CPU on model generated by ExecuTorch


🐛 Describe the bug

I am using ExecuTorch (ET) to generate the quantized version of the model, as shown in the README.

python torchchat.py export llama3.1 --quantize config/data/mobile.json --output-pte-path llama3.1.pte

Then, when I tried to evaluate the model using the Python runtime on desktop, it failed:

python torchchat.py eval llama3.1 --pte-path llama3.1.pte --limit 5
NumExpr defaulting to 16 threads.
PyTorch version 2.5.0.dev20240716+cpu available.
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Using device=cpu
Loading model...
Time to load model: 0.05 seconds
Loading custom ops library: /home/ubuntu/anaconda3/envs/torchchat/lib/python3.10/site-packages/executorch/examples/models/llama2/custom_ops/libcustom_ops_aot_lib.so
I 00:00:00.004209 executorch:program.cpp:133] InternalConsistency verification requested but not available
-----------------------------------------------------------
Using device 'cpu'
/home/ubuntu/anaconda3/envs/torchchat/lib/python3.10/site-packages/huggingface_hub/file_download.py:1150: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
[Task: wikitext] metric word_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
[Task: wikitext] metric word_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
[Task: wikitext] metric byte_perplexity is defined, but aggregation is not. using default aggregation=weighted_perplexity
[Task: wikitext] metric byte_perplexity is defined, but higher_is_better is not. using default higher_is_better=False
[Task: wikitext] metric bits_per_byte is defined, but aggregation is not. using default aggregation=bits_per_byte
[Task: wikitext] metric bits_per_byte is defined, but higher_is_better is not. using default higher_is_better=False
Downloading builder script: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10.7k/10.7k [00:00<00:00, 46.8MB/s]
Downloading readme: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.78k/7.78k [00:00<00:00, 39.3MB/s]
Repo card metadata block was not found. Setting CardData to empty.
Repo card metadata block was not found. Setting CardData to empty.
Downloading data: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4.72M/4.72M [00:00<00:00, 64.4MB/s]
Generating test split: 62 examples [00:00, 1903.53 examples/s]
Generating train split: 629 examples [00:00, 5131.04 examples/s]
Generating validation split: 60 examples [00:00, 7172.82 examples/s]
Building contexts for wikitext on rank 0...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 796.73it/s]
Running loglikelihood_rolling requests
  0%|                                                                                                                                                               | 0/5 [00:00<?, ?it/s]
E 00:00:31.679303 executorch:tensor_impl.cpp:86] Attempted to resize a static tensor to a new shape at dimension 1 old_size: 1 new_size: 1263
E 00:00:31.679320 executorch:method.cpp:824] Error setting input 0: 0x10
  0%|                                                                                                                                                               | 0/5 [00:00<?, ?it/s]
Time to run eval: 6.75s.
Traceback (most recent call last):
  File "/home/ubuntu/torchchat/torchchat.py", line 92, in <module>
    eval_main(args)
  File "/home/ubuntu/torchchat/eval.py", line 252, in main
    result = eval(
  File "/home/ubuntu/anaconda3/envs/torchchat/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/torchchat/eval.py", line 198, in eval
    eval_results = evaluate(
  File "/home/ubuntu/anaconda3/envs/torchchat/lib/python3.10/site-packages/lm_eval/utils.py", line 288, in _wrapper
    return fn(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/torchchat/lib/python3.10/site-packages/lm_eval/evaluator.py", line 373, in evaluate
    resps = getattr(lm, reqtype)(cloned_reqs)
  File "/home/ubuntu/anaconda3/envs/torchchat/lib/python3.10/site-packages/lm_eval/models/huggingface.py", line 840, in loglikelihood_rolling
    string_nll = self._loglikelihood_tokens(
  File "/home/ubuntu/anaconda3/envs/torchchat/lib/python3.10/site-packages/lm_eval/models/huggingface.py", line 1033, in _loglikelihood_tokens
    self._model_call(batched_inps, **call_kwargs), dim=-1
  File "/home/ubuntu/torchchat/eval.py", line 146, in _model_call
    logits = self._model_forward(x, input_pos)
  File "/home/ubuntu/torchchat/eval.py", line 240, in <lambda>
    model_forward = lambda x, input_pos: model(x, input_pos)  # noqa
  File "/home/ubuntu/anaconda3/envs/torchchat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1716, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/torchchat/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1727, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/torchchat/build/model_et.py", line 23, in forward
    logits = self.model_.forward(forward_inputs)
RuntimeError: method->set_inputs() for method 'forward' failed with error 0x12
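
For context, the pair of ExecuTorch errors above points to a static-shape mismatch: the PTE appears to be exported for token-by-token decoding (input 0 of the 'forward' method has size 1 at dimension 1), while the eval harness hands the model the whole rolling context in one call. A minimal sketch of the two call patterns (shapes taken from the log above; the dtypes are illustrative assumptions, not confirmed from the export):

import torch

# What the exported 'forward' method accepts: one token per call.
tokens = torch.zeros(1, 1, dtype=torch.int)       # shape [1, 1] -> accepted
input_pos = torch.tensor([0], dtype=torch.int)    # current decode position

# What eval's _model_call passes: the full rolling context at once.
batched_inps = torch.zeros(1, 1263, dtype=torch.int)  # shape [1, 1263]
# set_inputs() then tries to resize the static input tensor at
# dimension 1 from size 1 to 1263 and fails with error 0x12.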

Versions

Collecting environment information...
PyTorch version: 2.5.0.dev20240716+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.35

Python version: 3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.5.0-1014-aws-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      46 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             16
On-line CPU(s) list:                0-15
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
CPU family:                         6
Model:                              106
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          1
Stepping:                           6
BogoMIPS:                           5799.93
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd ida arat avx512vbmi pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear flush_l1d arch_capabilities
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          384 KiB (8 instances)
L1i cache:                          256 KiB (8 instances)
L2 cache:                           10 MiB (8 instances)
L3 cache:                           54 MiB (1 instance)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-15
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] executorch==0.4.0a0+c757499
[pip3] numpy==1.26.4
[pip3] torch==2.5.0.dev20240716+cpu
[pip3] torchao==0.3.1
[pip3] torchaudio==2.4.0.dev20240716+cpu
[pip3] torchsr==1.0.4
[pip3] torchvision==0.20.0.dev20240716+cpu
[conda] executorch                0.4.0a0+c757499          pypi_0    pypi
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] torch                     2.5.0.dev20240716+cpu          pypi_0    pypi
[conda] torchao                   0.3.1                    pypi_0    pypi
[conda] torchaudio                2.4.0.dev20240716+cpu          pypi_0    pypi
[conda] torchsr                   1.0.4                    pypi_0    pypi
[conda] torchvision               0.20.0.dev20240716+cpu          pypi_0    pypi

agunapal avatar Aug 08 '24 02:08 agunapal

Thanks for flagging, @agunapal

This is a known issue that we'll be looking at: https://github.com/pytorch/torchchat/issues/938. Since eval uses the same machinery as generate for evaluation, most likely there's a wiring bug.
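
Until that lands, one possible workaround (a hedged sketch, not the actual fix; it assumes the exported method only accepts a [batch, 1] token tensor and returns [batch, 1, vocab] logits, per the traceback above) would be for eval's _model_call to loop over the sequence instead of passing the whole batch at once:

import torch

def model_call_one_token_at_a_time(model_forward, x):
    """x: [batch, seq_len] token ids. Feed tokens one at a time,
    since the static PTE input only accepts seq_len == 1, and
    stack the per-position logits back into a single tensor."""
    logits = []
    for pos in range(x.shape[1]):
        tok = x[:, pos : pos + 1]                     # shape [batch, 1]
        input_pos = torch.tensor([pos], dtype=torch.int)
        logits.append(model_forward(tok, input_pos))  # [batch, 1, vocab]
    return torch.cat(logits, dim=1)                   # [batch, seq_len, vocab]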

Jack-Khuu avatar Aug 08 '24 15:08 Jack-Khuu

Interestingly, generate works for me:

 python torchchat.py generate llama3.1 --device cpu --pte-path llama3.1.pte --prompt "Hello my name is"
NumExpr defaulting to 16 threads.
PyTorch version 2.5.0.dev20240716+cpu available.
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Warning: checkpoint path ignored because an exported DSO or PTE path specified
Using device=cpu Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Loading model...
Time to load model: 0.05 seconds
Loading custom ops library: /home/ubuntu/anaconda3/envs/torchchat/lib/python3.10/site-packages/executorch/examples/models/llama2/custom_ops/libcustom_ops_aot_lib.so
I 00:00:00.006447 executorch:program.cpp:133] InternalConsistency verification requested but not available
-----------------------------------------------------------
Hello my name is Mark, I am a retired U.S. Marine, live in Ocean City, Maryland. I am a Licensed Real Estate Salesperson. My specialty is assisting buyers, sellers, and investors of residential, commercial and waterfront properties. My goal is to provide you with exceptional service, expertise, and understanding of local and state real estate market trends.
The beauty of the Delaware Coast, Ocean City, Maryland and surrounding areas offer a variety of properties and opportunities for investors, buyers, and real estate professionals. If you are searching for your dream home, wish to invest in a rental property, or seek assistance with buying, selling, or renting a property, I am here to help.
My commitment is to treat each client with professionalism, honesty, integrity, and respect. I promise to be available at your convenience, work diligently on your behalf, and continually inform you regarding market conditions, and information that may be beneficial in making informed decisions.
My business hours are Monday through Friday, from 8:00
Time for inference 1: 118.21 sec total, time to first token 2.92 sec with sequential prefill, 199 tokens, 1.68 tokens/sec, 594.00 ms/token
Bandwidth achieved: 0.00 GB/s
*** This first iteration will include cold start effects for dynamic import, hardware caches. ***

========================================

Average tokens/sec: 1.68
Memory used: 0.00 GB
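
This seems consistent with the static-shape theory above: with sequential prefill (visible in the timing line), generate calls the exported method one token at a time, i.e. with an input of shape [1, 1], which matches the PTE's static input, while eval passes the full 1263-token context in a single call.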

agunapal avatar Aug 08 '24 17:08 agunapal

Thanks again for flagging. We have @vmpuri looking at this

Tracking in the umbrella issue.

Jack-Khuu avatar Aug 19 '24 18:08 Jack-Khuu