llama-cpp-python
Getting seg faults intermittently prior to streaming generation
- Physical (or virtual) hardware you are using, e.g. for Linux:
user@dev0:~$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Vendor ID: ARM
Model name: -
Model: Rockchip RK3588S
Thread(s) per core: 0
Core(s) per socket: 0
Socket(s): 0
Stepping: 0x2
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
- Operating System, e.g. for Linux:
user@dev0:~$ uname -a
Linux rock-5c 6.1.43-20-rk2312 #3e26818dc SMP Tue Nov 19 07:21:24 UTC 2024 aarch64 GNU/Linux
- SDK version, e.g. for Linux:
user@dev0:~$ python3 --version
Python 3.11.2
user@dev0:~$ make --version
GNU Make 4.3
Built for aarch64-unknown-linux-gnu
user@dev0:~$ g++ --version
g++ (Debian 12.2.0-14) 12.2.0
Failure Information (for bugs)
- In my app I stream user input to the model and call generation with max_tokens=1 to load the prompt into the cache (I use this instead of Llama.eval because the input can be edited, and eval is strictly additive as far as I understand). Once the user finishes their input, the actual inference runs. Interestingly, the seg faults only happen during the full generation, never during the max_tokens=1 caching calls. Below is output where the user edited their input and then, after submitting it for full generation, a seg fault occurred; a short sketch of this priming pattern follows the log. The issue is intermittent, happening roughly every 10-20 generations.
Llama.generate: 6573 prefix-match hit, remaining 15 prompt tokens to eval
llama_perf_context_print: load time = 2135.45 ms
llama_perf_context_print: prompt eval time = 1698.20 ms / 15 tokens ( 113.21 ms per token, 8.83 tokens per second)
llama_perf_context_print: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_perf_context_print: total time = 1714.09 ms / 16 tokens
Llama.generate: 6572 prefix-match hit, remaining 22 prompt tokens to eval
Llama.generate: 6572 prefix-match hit, remaining 38 prompt tokens to eval
Segmentation fault
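For context, this is roughly the priming pattern described above (prime_cache and current_input are illustrative placeholder names, not my real code): every time the user edits the text, the edited prompt is resubmitted with max_tokens=1 so that llama-cpp-python's prefix matching keeps the KV cache in sync with the current prompt.

    # Illustrative sketch of the cache-priming pattern, assuming `llm` is the
    # Llama instance from the instantiation step below; names are placeholders.
    def prime_cache(llm, current_input):
        # max_tokens=1 forces a prompt eval so the edited input is loaded into
        # the KV cache via the prefix-match path shown in the log above.
        for _ in llm(current_input, max_tokens=1, stream=True, temperature=1):
            pass  # the single generated token is discarded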
Steps to Reproduce
- Instantiate model
self.llm = Llama(
    model_path=model_path,  # Llama 3.2 3B Q4_0
    seed=1,
    n_threads=4,
    n_ctx=10000,
    temperature=1,
    verbose=True,
)
- Generate with max_tokens=1 until user is finished inputting
for output in self.llm(
    my_context,
    max_tokens=1,
    stream=True,
    temperature=1,
):
    log.debug("generated single token to cache input")
- Attempt full generation, segmentation fault occurs before output is generated
def output_generator():
    log.info("Starting output generation : def output_generator()")
    for output in self.llm(
        my_context,
        seed=random.randint(1, 1000000),
        stream=True,
        max_tokens=650,
        temperature=1,
        stop=stop_generation_strings,
    ):
        token = output['choices'][0]['text']
        yield token
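Since the crash only shows up every 10-20 generations, here is a self-contained sketch that approximates the flow above in a loop. The model path, prompt text, and simulated edits are placeholders, not my production values.

    # Minimal standalone repro sketch; paths and prompts are placeholders.
    import random
    from llama_cpp import Llama

    llm = Llama(
        model_path="llama-3.2-3b-q4_0.gguf",  # placeholder path
        seed=1,
        n_threads=4,
        n_ctx=10000,
        verbose=True,
    )

    prompt = "Some long conversation context..."  # placeholder

    for i in range(50):  # crash typically appears within 10-20 iterations
        # Simulate the user editing their input: prime the cache with max_tokens=1.
        edited = prompt + f" user edit {i}"
        for _ in llm(edited, max_tokens=1, stream=True, temperature=1):
            pass

        # Full streaming generation; this is where the seg fault occurs.
        for output in llm(
            edited,
            seed=random.randint(1, 1000000),
            stream=True,
            max_tokens=650,
            temperature=1,
        ):
            print(output['choices'][0]['text'], end="", flush=True)
        print()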
Any ideas for potential workarounds or solutions are welcome. Thanks!