Context retrieval only works for first user message
llama-stack installed from source: https://github.com/meta-llama/llama-stack/tree/cherrypick-working
System Info
python -m "torch.utils.collect_env" /home/kaiwu/miniconda3/envs/llama/lib/python3.10/runpy.py:126: RuntimeWarning: 'torch.utils.collect_env' found in sys.modules after import of package 'torch.utils', but prior to execution of 'torch.utils.collect_env'; this may result in unpredictable behaviour warn(RuntimeWarning(msg)) Collecting environment information... PyTorch version: 2.4.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A
OS: CentOS Stream 9 (x86_64) GCC version: (GCC) 11.5.0 20240719 (Red Hat 11.5.0-2) Clang version: Could not collect CMake version: version 3.30.2 Libc version: glibc-2.34
Python version: 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-6.4.3-0_fbk14_zion_2601_gcd42476b84e9-x86_64-with-glibc2.34 Is CUDA available: True CUDA runtime version: 12.1.105 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA H100 GPU 1: NVIDIA H100 GPU 2: NVIDIA H100 GPU 3: NVIDIA H100 GPU 4: NVIDIA H100 GPU 5: NVIDIA H100 GPU 6: NVIDIA H100 GPU 7: NVIDIA H100
Nvidia driver version: 535.154.05 cuDNN version: Probably one of the following: /usr/lib64/libcudnn.so.8.9.2 /usr/lib64/libcudnn_adv_infer.so.8.9.2 /usr/lib64/libcudnn_adv_train.so.8.9.2 /usr/lib64/libcudnn_cnn_infer.so.8.9.2 /usr/lib64/libcudnn_cnn_train.so.8.9.2 /usr/lib64/libcudnn_ops_infer.so.8.9.2 /usr/lib64/libcudnn_ops_train.so.8.9.2 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True
CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 52 bits physical, 57 bits virtual Byte Order: Little Endian CPU(s): 384 On-line CPU(s) list: 0-383 Vendor ID: AuthenticAMD Model name: AMD EPYC 9654 96-Core Processor CPU family: 25 Model: 17 Thread(s) per core: 2 Core(s) per socket: 96 Socket(s): 2 Stepping: 1 Frequency boost: enabled CPU(s) scaling MHz: 82% CPU max MHz: 3707.8120 CPU min MHz: 1500.0000 BogoMIPS: 4792.80 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d Virtualization: AMD-V L1d cache: 6 MiB (192 instances) L1i cache: 6 MiB (192 instances) L2 cache: 192 MiB (192 instances) L3 cache: 768 MiB (24 instances) NUMA node(s): 2 NUMA node0 CPU(s): 0-95,192-287 NUMA node1 CPU(s): 96-191,288-383 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not 
affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Vulnerable: eIBRS with unprivileged eBPF Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected
Versions of relevant libraries: [pip3] mypy-extensions==1.0.0 [pip3] numpy==1.26.4 [pip3] onnx==1.16.2 [pip3] onnxruntime==1.19.2 [pip3] torch==2.4.0 [pip3] torchvision==0.19.0 [pip3] triton==3.0.0 [conda] numpy 1.26.4 pypi_0 pypi [conda] torch 2.4.0 pypi_0 pypi [conda] torchvision 0.19.0 pypi_0 pypi [conda] triton 3.0.0 pypi_0 pypi
Information
- [ ] The official example scripts
- [X] My own modified scripts
🐛 Describe the bug
There are a Llama 3.1 model card and a Llama 3.2 model card in the database, and I tried to ask:
```python
user_prompts = [
    "What is the name of the llama model released on October 24, 2024?",
    "What about Llama 3.1 model, what is the release date for it?",
]
```
The RAG pipeline only retrieved context from the Llama 3.2 model card for the first message; it did not do retrieval for the second message, so the context was still the Llama 3.2 model card from the first message. It would be great if we could have context retrieval for every user message.
My code is here; use `python rag_main.py localhost 5000 ./example_data/` to start this example.
Error logs
Inserted 3 documents into bank: rag_agent_docs Created bank: rag_agent_docs Found 2 models [ModelDefWithProvider(identifier='Llama3.2-11B-Vision-Instruct', llama_model='Llama3.2-11B-Vision-Instruct', metadata={}, provider_id='meta-reference', type='model'), ModelDefWithProvider(identifier='Llama-Guard-3-1B', llama_model='Llama-Guard-3-1B', metadata={}, provider_id='meta1', type='model')] Use model: Llama3.2-11B-Vision-Instruct Generating response for: What is the name of the llama model released on October 24, 2024? messages [{'role': 'user', 'content': 'What is the name of the llama model released on October 24, 2024?'}] ----input_query------- What is the name of the llama model released on October 24, 2024? Turn(input_messages=[UserMessage(content='What is the name of the llama model released on October 24, 2024?', role='user', context="Here are the retrieved documents for relevant context:\n=== START-RETRIEVED-CONTEXT ===\n\nid:llama_3.2.md; content:. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\nLlama 3.2 Model Family: Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\nModel Release Date: Oct 24, 2024\n\nStatus: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety.\n\nLicense: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. 
For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.\n\n## Intended Use\n\nIntended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.\n\nOut of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Hardware and Software\n\nTraining Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.\n\nTraining Energy Use\nid:llama_3.2.md; content:. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\nLlama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\nModel Release Date: Oct 24, 2024\n\nStatus: This is a static model trained on an offline dataset. 
Future versions may be released that improve model capabilities and safety.\n\nLicense: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.\n\n## Intended Use\n\nIntended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.\n\nOut of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Hardware and Software\n\nTraining Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.\n\nTraining Energy Use\n\n=== END-RETRIEVED-CONTEXT ===\n")], output_attachments=[], output_message=CompletionMessage(content='The name of the llama model released on October 24, 2024, is not explicitly mentioned in the provided documents. 
However, the document mentions that the model is "Llama 3.2", but it does not indicate if "Llama 3.2" is the name of the specific model released on October 24, 2024, or if it is a version or variant of the model.\n\nIt does mention the Model Release Date as Oct 24, 2024, but this refers to the release of Llama 3.2, not the name of the specific model.\n\nTo answer your question accurately, I don't know the name of the llama model released on October 24, 2024, as this information is not explicitly mentioned in the provided documents.', role='assistant', stop_reason='end_of_turn', tool_calls=[]), session_id='de83a6c2-5643-42b0-9c89-01640439b524', started_at=datetime.datetime(2024, 11, 13, 9, 48, 44, 297982), steps=[MemoryRetrievalStep(inserted_context=['Here are the retrieved documents for relevant context:\n=== START-RETRIEVED-CONTEXT ===\n', "id:llama_3.2.md; content:. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\nLlama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\nModel Release Date: Oct 24, 2024\n\nStatus: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety.\n\nLicense: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. 
For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.\n\n## Intended Use\n\nIntended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.\n\nOut of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Hardware and Software\n\nTraining Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.\n\nTraining Energy Use", "id:llama_3.2.md; content:. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\nLlama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\nModel Release Date: Oct 24, 2024\n\nStatus: This is a static model trained on an offline dataset. 
Future versions may be released that improve model capabilities and safety.\n\nLicense: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.\n\n## Intended Use\n\nIntended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.\n\nOut of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Hardware and Software\n\nTraining Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.\n\nTraining Energy Use", '\n=== END-RETRIEVED-CONTEXT ===\n'], memory_bank_ids=['rag_agent_docs'], step_id='d916a947-4dee-42e2-ac1a-410d54c7da3d', step_type='memory_retrieval', turn_id='4efeaab0-d7f1-495f-b653-3fd173a59db3', completed_at=None, started_at=None), InferenceStep(inference_model_response=CompletionMessage(content='The name of the llama model released on October 24, 2024, is not explicitly mentioned in the provided documents. 
However, the document mentions that the model is "Llama 3.2", but it does not indicate if "Llama 3.2" is the name of the specific model released on October 24, 2024, or if it is a version or variant of the model.\n\nIt does mention the Model Release Date as Oct 24, 2024, but this refers to the release of Llama 3.2, not the name of the specific model.\n\nTo answer your question accurately, I don't know the name of the llama model released on October 24, 2024, as this information is not explicitly mentioned in the provided documents.', role='assistant', stop_reason='end_of_turn', tool_calls=[]), step_id='603d12ab-f127-46de-9ccb-4e07bdccc7e3', step_type='inference', turn_id='4efeaab0-d7f1-495f-b653-3fd173a59db3', completed_at=None, started_at=None)], turn_id='4efeaab0-d7f1-495f-b653-3fd173a59db3', completed_at=datetime.datetime(2024, 11, 13, 9, 48, 50, 996089)) Generating response for: What about Llama 3.1 model, what is the release date for it? messages [{'role': 'user', 'content': 'What about Llama 3.1 model, what is the release date for it?'}] ----input_query------- What about Llama 3.1 model, what is the release date for it? Turn(input_messages=[UserMessage(content='What about Llama 3.1 model, what is the release date for it?', role='user', context="Here are the retrieved documents for relevant context:\n=== START-RETRIEVED-CONTEXT ===\n\nid:llama_3.2.md; content:. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\nLlama 3.2 Model Family:** Token counts refer to pretraining data only. 
All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\nModel Release Date: Oct 24, 2024\n\nStatus: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety.\n\nLicense: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.\n\n## Intended Use\n\nIntended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.\n\nOut of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Hardware and Software\n\nTraining Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.\n\nTraining Energy Use\nid:llama_3.2.md; content:. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. 
Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\nLlama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\nModel Release Date: Oct 24, 2024\n\nStatus: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety.\n\nLicense: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.\n\n## Intended Use\n\nIntended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.\n\nOut of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. 
Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Hardware and Software\n\nTraining Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.\n\nTraining Energy Use\n\n=== END-RETRIEVED-CONTEXT ===\n")], output_attachments=[], output_message=CompletionMessage(content="The release date for Llama 3.1 model is not mentioned in the provided documents. However, there is information about Llama 3.2 model's release date, which is October 24, 2024.\n\nIt appears that there is no information about the Llama 3.1 model in the provided documents.", role='assistant', stop_reason='end_of_turn', tool_calls=[]), session_id='de83a6c2-5643-42b0-9c89-01640439b524', started_at=datetime.datetime(2024, 11, 13, 9, 48, 51, 113170), steps=[MemoryRetrievalStep(inserted_context=['Here are the retrieved documents for relevant context:\n=== START-RETRIEVED-CONTEXT ===\n', "id:llama_3.2.md; content:. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\nLlama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\nModel Release Date: Oct 24, 2024\n\nStatus: This is a static model trained on an offline dataset. 
Future versions may be released that improve model capabilities and safety.\n\nLicense: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.\n\n## Intended Use\n\nIntended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.\n\nOut of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Hardware and Software\n\nTraining Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.\n\nTraining Energy Use", "id:llama_3.2.md; content:. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. 
Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\nLlama 3.2 Model Family:** Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\nModel Release Date: Oct 24, 2024\n\nStatus: This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety.\n\nLicense: Use of Llama 3.2 is governed by the Llama 3.2 Community License (a custom, commercial license agreement).\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models README. For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go here.\n\n## Intended Use\n\nIntended Use Cases: Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pretrained models can be adapted for a variety of additional natural language generation tasks. Similarly, quantized models can be adapted for a variety of on-device use-cases with limited compute resources.\n\nOut of Scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Hardware and Software\n\nTraining Factors: We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pretraining. 
Fine-tuning, quantization, annotation, and evaluation were also performed on production infrastructure.\n\n**Training Energy Use", '\n=== END-RETRIEVED-CONTEXT ===\n'], memory_bank_ids=['rag_agent_docs'], step_id='e41a178b-182c-444c-8cb6-544979d75a17', step_type='memory_retrieval', turn_id='5b91a548-219f-4805-833f-5535b84abe29', completed_at=None, started_at=None), InferenceStep(inference_model_response=CompletionMessage(content="The release date for Llama 3.1 model is not mentioned in the provided documents. However, there is information about Llama 3.2 model's release date, which is October 24, 2024.\n\nIt appears that there is no information about the Llama 3.1 model in the provided documents.", role='assistant', stop_reason='end_of_turn', tool_calls=[]), step_id='dc72b93c-8f17-44e4-b50f-5f272b11327a', step_type='inference', turn_id='5b91a548-219f-4805-833f-5535b84abe29', completed_at=None, started_at=None)], turn_id='5b91a548-219f-4805-833f-5535b84abe29', completed_at=datetime.datetime(2024, 11, 13, 9, 48, 54, 441075)) The name of the llama model released on October 24, 2024, is not explicitly mentioned in the provided documents. However, the document mentions that the model is "Llama 3.2", but it does not indicate if "Llama 3.2" is the name of the specific model released on October 24, 2024, or if it is a version or variant of the model.
It does mention the Model Release Date as Oct 24, 2024, but this refers to the release of Llama 3.2, not the name of the specific model.
To answer your question accurately, I don't know the name of the llama model released on October 24, 2024, as this information is not explicitly mentioned in the provided documents. The release date for Llama 3.1 model is not mentioned in the provided documents. However, there is information about Llama 3.2 model's release date, which is October 24, 2024.
It appears that there is no information about the Llama 3.1 model in the provided documents.
Expected behavior
It would be great if we could have context retrieval for every user message.
@dineshyv this is the RAG issue @init27 was mentioning earlier
@wukaixingxp, @ashwinb I've just had a look at this.
tl;dr: if I understand correctly, the problem isn't that context retrieval only works for the first user message, but that the search results are poor.
I've done a bit of testing, and the RAG query that is generated actually joins together all the messages.
In your case, for the messages:
```python
user_prompts = [
    "What is the name of the llama model released on October 24, 2024?",
    "What about Llama 3.1 model, what is the release date for it?",
]
```
it generates:
query: You are a helpful assistant that can answer questions based on provided documents. Return your answer short and concise, less than 50 words. What is the name of the llama model released on October 24, 2024? What about Llama 3.1 model, what is the release date for it?
(I added this print statement. My llama-stack-apps code is here)
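For illustration, the concatenation behavior can be sketched roughly like this (the function and constant names below are mine, not the actual llama-stack internals):

```python
# Hypothetical sketch: the agent appears to build ONE retrieval query by
# joining the instructions and every user message so far, rather than
# issuing a fresh query per turn. Names here are illustrative only.

INSTRUCTIONS = (
    "You are a helpful assistant that can answer questions based on "
    "provided documents. Return your answer short and concise, less "
    "than 50 words."
)

def build_rag_query(instructions: str, messages: list[dict]) -> str:
    """Join the agent instructions and all user message contents into one query."""
    parts = [instructions] + [m["content"] for m in messages if m["role"] == "user"]
    return " ".join(parts)

messages = [
    {"role": "user", "content": "What is the name of the llama model released on October 24, 2024?"},
    {"role": "user", "content": "What about Llama 3.1 model, what is the release date for it?"},
]
query = build_rag_query(INSTRUCTIONS, messages)
```

Because the second turn's query still contains the first question's text, retrieval for turn two is dominated by the same Llama 3.2 chunks, which matches the log output above.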
In some cases your example works for me:
query: You are a helpful assistant that can answer questions based on provided documents. Return your answer short and concise, less than 50 words. What is the name of the llama model released on October 24, 2024? What about Llama 3.1 model, what is the release date for it?
Batches: 100% 1/1 [00:00<00:00, 180.31it/s]
05:19:03.638 [ERROR] [/alpha/agents/turn/create.retrieve_rag_context] Using 3 chunks; reached max tokens in context: 400
05:19:03.649 [INFO] [/alpha/agents/turn/create] role='user' content='What about Llama 3.1 model, what is the release date for it?' context='Here are the retrieved documents for relevant context:\n=== START-RETRIEVED-CONTEXT ===\n\nid:llama_3.1.md; content:_1/LICENSE](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE)\n\nFeedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models [README](https://github.com/meta-llama/llama-models/blob/main/README.md). For more technica...<more>...id:llama_3.1.md; content: for improved inference scalability.\n\nModel Release Date: July 23, 2024.\n\nStatus: This is a static model trained on an offline dataset. Future versions of the tuned models will be released as we improve model safety with community feedback.\n\nLicense: A custom commercial license, the Llama 3.1 Community License, is available at: [https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE](https://github.com/meta-\n\n=== END-RETRIEVED-CONTEXT ===\n'
05:19:05.048 [INFO] [/alpha/agents/turn/create] Assistant: According to the documents, Llama 3.1 model was released on July 23, 2024.
(branched off of your branch here. Print statement in llama-stack here)
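As an aside, the "Using 3 chunks; reached max tokens in context: 400" line suggests ranked chunks are appended in order until a token budget is exhausted. A rough sketch of that behavior (whitespace splitting is a stand-in for a real tokenizer, and the function name is illustrative, not the actual llama-stack code):

```python
def pack_chunks(ranked_chunks, max_tokens=400):
    # Greedily append chunks in ranked order until the token budget
    # would be exceeded; remaining chunks are dropped.
    picked, used = [], 0
    for chunk in ranked_chunks:
        n = len(chunk.split())  # crude token-count stand-in
        if used + n > max_tokens:
            break
        picked.append(chunk)
        used += n
    return picked
```

This means that when the top-ranked chunks are irrelevant (as in the search results below), the budget can be spent before a relevant chunk is ever included.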
If I understand correctly, the problem here is that the search results are inconsistent and somewhat poor.
Testing Search Results
I've run some of my own queries against the faiss index and the results are a bit inconsistent:
Query: "Llama 3.2 3B Instruct"
Top 2 results are:
Index 8:
Content: _1/LICENSE](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE)
Feedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models [README](https://github.com/meta-llama/llama-models/blob/main/README.md). For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go [here](https://
Index 97:
Content: .com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) (a custom, commercial license agreement).
Feedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models [README](https://github.com/meta-llama/llama-models/blob/main/README.md). For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go [here](https
Query: "What are some small Llama models I can run on small devices like my phone?"
Index 175:
Content: the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) (a custom, commercial license agreement).
Feedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models [README](https://github.com/meta-llama/llama-models/blob/main/README.md). For more technical information about generation parameters and recipes for how to use L
Index 8:
Content: _1/LICENSE](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE)
Feedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models [README](https://github.com/meta-llama/llama-models/blob/main/README.md). For more technical information about generation parameters and recipes for how to use Llama 3.1 in applications, please go [here](https://
Index 97:
Content: .com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) (a custom, commercial license agreement).
Feedback: Instructions on how to provide feedback or comments on the model can be found in the Llama Models [README](https://github.com/meta-llama/llama-models/blob/main/README.md). For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go [here](https
Index 152:
Content: B and 3B models are expected to be deployed in highly constrained environments, such as mobile devices. LLM Systems using smaller models will have a different alignment profile and safety/helpfulness tradeoff than more complex, larger systems. Developers should ensure the safety of their system meets the requirements of their use case. We recommend using lighter system safeguards for such use cases, like Llama Guard 3-1B or its mobile-optimized version.
(The last result is relevant, but the first three aren't that useful.)
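For reference, these ad-hoc queries were plain nearest-neighbour lookups over the chunk embeddings. A toy stand-in for that (random vectors instead of real embeddings; an inner-product FAISS index over normalized vectors ranks the same way as this cosine-similarity loop):

```python
import numpy as np

# Made-up embeddings purely for illustration; in practice these come
# from the embedding model behind the FAISS index.
rng = np.random.default_rng(0)
chunk_vecs = rng.normal(size=(5, 8)).astype("float32")

def top_k(query_vec, vecs, k=2):
    # Normalize, then rank by inner product == cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    v = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    scores = v @ q
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]

query = rng.normal(size=8).astype("float32")
idx, scores = top_k(query, chunk_vecs)
```

Pure embedding similarity has no notion of exact-match terms like "3B" or "Instruct", which is one plausible reason license boilerplate outranks the relevant chunks here.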
Next Steps
If I have a bit of time I might see how we could improve them. Maybe adding keyword search [1], trying different/larger embedding models [2], or different chunking schemes [3] might help here?
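To make the keyword-search idea concrete, one common approach is to run a lexical ranking alongside the vector ranking and merge them with reciprocal-rank fusion. A hedged sketch (term overlap as a crude BM25 stand-in, and a hard-coded vector ranking mimicking the poor results above; none of this is llama-stack API):

```python
def keyword_rank(query, docs):
    # Crude lexical score: count of shared lowercase terms. A real
    # system would use BM25, e.g. via the rank_bm25 package.
    q_terms = set(query.lower().split())
    scores = [len(q_terms & set(d.lower().split())) for d in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])

def rrf(rankings, k=60):
    # Reciprocal-rank fusion: a document ranked highly by either
    # retriever floats to the top of the fused list.
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)

docs = [
    "Llama 3.1 Community License agreement",
    "Llama 3.2 1B and 3B are designed for mobile devices",
    "Feedback instructions for Llama models",
]
kw = keyword_rank("small models for mobile devices", docs)
vec_rank = [0, 1, 2]  # stand-in: vector search puts license text first
fused = rrf([kw, vec_rank])
```

Here the lexical signal rescues the mobile-devices chunk even though the vector ranking preferred license boilerplate, which is exactly the failure mode seen in the queries above.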
@aidando73 Thanks for the detailed feedback! Yes, it is a general issue with the gap in how we perform retrieval. We are working on making improvements to RAG performance and addressing gaps in indexing / retrieval, and will be open about our plans in the upcoming weeks. Let's sync up on Discord on how we can collaborate.
Has there been any progress on this issue? I see it has been open for a while now. I would like to better understand the status and the plans in this area if possible.
This has now been fixed in this PR.