---
title: "LlamaCppChatGenerator"
id: llamacppchatgenerator
slug: "/llamacppchatgenerator"
description: "`LlamaCppChatGenerator` enables chat completion using an LLM running on Llama.cpp."
---

# LlamaCppChatGenerator

`LlamaCppChatGenerator` enables chat completion using an LLM running on Llama.cpp.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`ChatPromptBuilder`](../builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | `model`: The path of the model to use |
| **Mandatory run variables** | `messages`: A list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) instances representing the input messages |
| **Output variables** | `replies`: A list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) instances with all the replies generated by the LLM |
| **API reference** | [Llama.cpp](/reference/integrations-llama-cpp) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp |

</div>

## Overview

[Llama.cpp](https://github.com/ggml-org/llama.cpp) is a library written in C/C++ for efficient inference of Large Language Models. It leverages the quantized GGUF format, dramatically reducing memory requirements and accelerating inference. This makes it possible to run LLMs efficiently on standard machines, even without GPUs.

Llama.cpp uses the quantized binary file of the LLM in GGUF format, which can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf). `LlamaCppChatGenerator` supports models running on Llama.cpp by taking the path to the locally saved GGUF file as the `model` parameter at initialization.

### Tool Support

`LlamaCppChatGenerator` supports function calling through the `tools` parameter, which accepts flexible tool configurations:

- **A list of Tool objects**: Pass individual tools as a list
- **A single Toolset**: Pass an entire Toolset directly
- **Mixed Tools and Toolsets**: Combine multiple Toolsets with standalone tools in a single list

This allows you to organize related tools into logical groups while also including standalone tools as needed.

```python
from haystack.tools import Tool, Toolset
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Create individual tools
weather_tool = Tool(name="weather", description="Get weather info", ...)
news_tool = Tool(name="news", description="Get latest news", ...)

# Group related tools into a toolset
# (add_tool, subtract_tool, and multiply_tool are defined the same way)
math_toolset = Toolset([add_tool, subtract_tool, multiply_tool])

# Pass mixed tools and toolsets to the generator
generator = LlamaCppChatGenerator(
    model="/path/to/model.gguf",
    tools=[math_toolset, weather_tool, news_tool]  # Mix of Toolset and Tool objects
)
```

For more details on working with tools, see the [Tool](../../tools/tool.mdx) and [Toolset](../../tools/toolset.mdx) documentation.
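The example above elides the tool definitions. For completeness, here is a minimal, self-contained sketch of a full tool-calling round trip. The `get_current_weather` function, its schema, and the model path are illustrative placeholders, and you need a model whose chat template supports function calling:

```python
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Illustrative tool function (placeholder)
def get_current_weather(city: str) -> str:
    return f"Sunny, 22 °C in {city}"

weather_tool = Tool(
    name="get_current_weather",
    description="Get the current weather for a city",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string", "description": "Name of the city"}},
        "required": ["city"],
    },
    function=get_current_weather,
)

generator = LlamaCppChatGenerator(
    model="/path/to/model.gguf",  # placeholder: a function-calling-capable GGUF model
    tools=[weather_tool],
)
generator.warm_up()

result = generator.run([ChatMessage.from_user("What is the weather in Berlin?")])
reply = result["replies"][0]

# If the model decided to call a tool, the reply carries tool calls
# instead of (or in addition to) plain text
for tool_call in reply.tool_calls or []:
    print(tool_call.tool_name, tool_call.arguments)
```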
## Installation

Install the `llama-cpp-haystack` package to use this integration:

```shell
pip install llama-cpp-haystack
```

### Using a different compute backend

By default, the installation builds `llama.cpp` for CPU on Linux and Windows, and uses Metal on macOS. To use another compute backend:

1. Follow the instructions on the [llama.cpp installation page](https://github.com/abetlen/llama-cpp-python#installation) to install [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for your preferred compute backend.
2. Install [llama-cpp-haystack](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp) using the command above.

For example, to use `llama-cpp-haystack` with the **CUDA (cuBLAS) backend**, run the following commands:

```shell
export GGML_CUDA=1
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
pip install llama-cpp-haystack
```

## Usage

1. Download the GGUF version of the desired LLM. The GGUF versions of popular models can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf).
2. Initialize `LlamaCppChatGenerator` with the path to the GGUF file and specify the required model and text generation parameters:

```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)
```

### Passing additional model parameters

The `model`, `n_ctx`, and `n_batch` arguments are exposed for convenience and can be passed directly to the generator during initialization as keyword arguments. Note that `model` translates to `llama.cpp`'s `model_path` parameter.

The `model_kwargs` parameter can be used to pass additional arguments when initializing the model. In case of duplication, these parameters override the `model`, `n_ctx`, and `n_batch` initialization parameters.

See [Llama.cpp's LLM documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__) for more information on the available model arguments.

**Note**: Llama.cpp automatically extracts the `chat_template` from the model metadata to format `ChatMessage`s. You can override the `chat_template` used by passing a custom `chat_handler` or `chat_format` as a model parameter.

For example, to offload the model to GPU during initialization:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages, generation_kwargs={"max_tokens": 128})
generated_reply = result["replies"][0].text
print(generated_reply)
```
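And a minimal sketch of overriding the chat template through `chat_format`, as mentioned in the note above (`"chatml"` is just an example; use the format that matches your model):

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Force llama-cpp-python to use the ChatML template instead of the
# chat template stored in the model's metadata
generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    model_kwargs={"chat_format": "chatml"},
)
generator.warm_up()
```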
### Passing text generation parameters

The `generation_kwargs` parameter can be used to pass additional generation arguments, such as `max_tokens`, `temperature`, `top_k`, and `top_p`, to the model during inference.

See [Llama.cpp's Chat Completion API documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion) for more information on the available generation arguments.

For example, to set `max_tokens` and `temperature`:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)
```

The `generation_kwargs` can also be passed to the `run` method of the generator directly:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(
    messages,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
```

**Note**: JSON mode, Function Calling, and Tools are all supported as `generation_kwargs`. See the [llama-cpp-python GitHub README](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#json-and-json-schema-mode) for more information on how to use them.
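For instance, here is a minimal sketch of JSON mode, which constrains the model to emit valid JSON. The `response_format` value is passed through to llama-cpp-python's `create_chat_completion`; the prompt and model path are carried over from the examples above:

```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()

messages = [ChatMessage.from_user("List three American actors as JSON with an 'actors' array.")]
result = generator.run(
    messages,
    # Constrain the output to valid JSON
    generation_kwargs={"response_format": {"type": "json_object"}},
)
print(result["replies"][0].text)
```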
### With multimodal (image + text) inputs

```python
from haystack.dataclasses import ChatMessage, ImageContent
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Initialize with multimodal support
llm = LlamaCppChatGenerator(
    model="llava-v1.5-7b-q4_0.gguf",
    chat_handler_name="Llava15ChatHandler",  # Use the llava-1-5 handler
    model_clip_path="mmproj-model-f16.gguf",  # CLIP model
    n_ctx=4096,  # Larger context for image processing
)
llm.warm_up()

image = ImageContent.from_file_path("apple.jpg")
user_message = ChatMessage.from_user(
    content_parts=["What does the image show? Max 5 words.", image],
)

response = llm.run([user_message])["replies"][0].text
print(response)

# Red apple on straw.
```

### In a pipeline

Below, we use `LlamaCppChatGenerator` in a Retrieval-Augmented Generation pipeline on the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) dataset from Hugging Face and generate answers using the [OpenChat-3.5](https://huggingface.co/openchat/openchat-3.5-1210) LLM.

Load the dataset:

```python
## Install HuggingFace Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders import ChatPromptBuilder
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import ChatMessage

## Import LlamaCppChatGenerator
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

## Load first 100 rows of the Simple Wikipedia Dataset from HuggingFace
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]
```

Index the documents into the `InMemoryDocumentStore` using the `SentenceTransformersDocumentEmbedder` and `DocumentWriter`:

```python
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
## Install sentence transformers using "pip install sentence-transformers"
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

## Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(
    instance=DocumentWriter(document_store=doc_store),
    name="DocWriter",
)
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
```
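As an optional sanity check, you can verify that all 100 documents were embedded and written to the store:

```python
print(doc_store.count_documents())
## 100
```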
Create the RAG pipeline and add `LlamaCppChatGenerator` to it:

```python
system_message = ChatMessage.from_system(
    """
    Answer the question using the provided context.
    Context:
    {% for doc in documents %}
    {{ doc.content }}
    {% endfor %}
    """,
)
user_message = ChatMessage.from_user("Question: {{question}}")
assistant_message = ChatMessage.from_assistant("Answer: ")

chat_template = [system_message, user_message, assistant_message]

rag_pipeline = Pipeline()

text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

## Load the LLM using LlamaCppChatGenerator
model_path = "openchat-3.5-1210.Q3_K_S.gguf"
generator = LlamaCppChatGenerator(model=model_path, n_ctx=4096, n_batch=128)

rag_pipeline.add_component(
    instance=text_embedder,
    name="text_embedder",
)
rag_pipeline.add_component(
    instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3),
    name="retriever",
)
rag_pipeline.add_component(
    instance=ChatPromptBuilder(template=chat_template),
    name="prompt_builder",
)
rag_pipeline.add_component(instance=generator, name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm", "answer_builder")
rag_pipeline.connect("retriever", "answer_builder.documents")
```

Run the pipeline:

```python
question = "Which year did the Joker movie release?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
        "answer_builder": {"query": question},
    },
)

generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
## The Joker movie was released on October 4, 2019.
```
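Because the retriever is also connected to `answer_builder.documents`, each `GeneratedAnswer` carries the source documents it was grounded on. You can use the metadata stored during indexing to print the sources:

```python
for doc in generated_answer.documents:
    print(doc.meta["title"], "-", doc.meta["url"])
```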