---
title: "LlamaCppChatGenerator"
id: llamacppchatgenerator
slug: "/llamacppchatgenerator"
description: "`LlamaCppChatGenerator` enables chat completion using an LLM running on Llama.cpp."
---

# LlamaCppChatGenerator

`LlamaCppChatGenerator` enables chat completion using an LLM running on Llama.cpp.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`ChatPromptBuilder`](../builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | `model`: The path of the model to use |
| **Mandatory run variables** | `messages`: A list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) instances representing the input messages |
| **Output variables** | `replies`: A list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) instances with all the replies generated by the LLM |
| **API reference** | [Llama.cpp](/reference/integrations-llama-cpp) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp |

</div>

## Overview

[Llama.cpp](https://github.com/ggml-org/llama.cpp) is a library written in C/C++ for efficient inference of Large Language Models. It leverages the quantized GGUF format, dramatically reducing memory requirements and accelerating inference. This makes it possible to run LLMs efficiently on standard machines, even without GPUs.

`Llama.cpp` uses the quantized binary file of the LLM in GGUF format, which can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf). `LlamaCppChatGenerator` supports models running on `Llama.cpp` by taking the path to the locally saved GGUF file as the `model` parameter at initialization.

### Tool Support

`LlamaCppChatGenerator` supports function calling through the `tools` parameter, which accepts flexible tool configurations:

- **A list of Tool objects**: Pass individual tools as a list
- **A single Toolset**: Pass an entire Toolset directly
- **Mixed Tools and Toolsets**: Combine multiple Toolsets with standalone tools in a single list

This allows you to organize related tools into logical groups while also including standalone tools as needed.

```python
from haystack.tools import Tool, Toolset
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Create individual tools
weather_tool = Tool(name="weather", description="Get weather info", ...)
news_tool = Tool(name="news", description="Get latest news", ...)

# Group related tools into a toolset
math_toolset = Toolset([add_tool, subtract_tool, multiply_tool])

# Pass mixed tools and toolsets to the generator
generator = LlamaCppChatGenerator(
    model="/path/to/model.gguf",
    tools=[math_toolset, weather_tool, news_tool]  # Mix of Toolset and Tool objects
)
```

For more details on working with tools, see the [Tool](../../tools/tool.mdx) and [Toolset](../../tools/toolset.mdx) documentation.
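The snippet above elides the actual tool definitions. As a minimal, illustrative sketch (the `get_weather` function and its JSON schema are made up for this example and are not part of the integration), a fully defined tool could look like this:

```python
from haystack.tools import Tool
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Illustrative function the LLM can call; not part of the integration
def get_weather(city: str) -> str:
    return f"The weather in {city} is sunny."

# Wrap the function in a Tool, describing its inputs with a JSON schema
weather_tool = Tool(
    name="weather",
    description="Get weather info for a city",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string", "description": "Name of the city"}},
        "required": ["city"],
    },
    function=get_weather,
)

generator = LlamaCppChatGenerator(
    model="/path/to/model.gguf",  # path to a locally saved GGUF file
    tools=[weather_tool],
)
```

Replies whose `ChatMessage` contains tool calls can then be executed by your own code or routed to Haystack's `ToolInvoker` component.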
## Installation

Install the `llama-cpp-haystack` package to use this integration:

```shell
pip install llama-cpp-haystack
```

### Using a different compute backend

The default installation builds `llama.cpp` for CPU on Linux and Windows and uses Metal on macOS. To use another compute backend:

1. Follow the instructions on the [llama.cpp installation page](https://github.com/abetlen/llama-cpp-python#installation) to install [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for your preferred compute backend.
2. Install [llama-cpp-haystack](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp) using the command above.

For example, to use `llama-cpp-haystack` with the **CUDA backend**, run the following commands:

```shell
export GGML_CUDA=1
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
pip install llama-cpp-haystack
```

## Usage

1. Download the GGUF version of the desired LLM. The GGUF versions of popular models can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf).
2. Initialize `LlamaCppChatGenerator` with the path to the GGUF file and the required model and text generation parameters:

```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)
```

### Passing additional model parameters

The `model`, `n_ctx`, and `n_batch` arguments are exposed for convenience and can be passed directly to the generator during initialization as keyword arguments. Note that `model` translates to `llama.cpp`'s `model_path` parameter.

The `model_kwargs` parameter can pass additional arguments when initializing the model. In case of duplication, these parameters override the `model`, `n_ctx`, and `n_batch` initialization parameters.

See [Llama.cpp's LLM documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__) for more information on the available model arguments.

**Note**: Llama.cpp automatically extracts the `chat_template` from the model metadata and uses it to format the `ChatMessage`s. You can override this by passing a custom `chat_handler` or `chat_format` as a model parameter.
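For illustration, such an override can go through `model_kwargs`. Here is a minimal sketch that assumes the built-in `chatml` prompt format suits your model:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Sketch: ignore the template stored in the GGUF metadata and use the
# built-in "chatml" prompt format instead
generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    model_kwargs={"chat_format": "chatml"},
)
```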
For example, to offload the model to GPU during initialization:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages, generation_kwargs={"max_tokens": 128})
generated_reply = result["replies"][0].text
print(generated_reply)
```

### Passing text generation parameters

The `generation_kwargs` parameter can pass additional generation arguments like `max_tokens`, `temperature`, `top_k`, `top_p`, and others to the model during inference.

See [Llama.cpp's Chat Completion API documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion) for more information on the available generation arguments.

**Note**: JSON mode, Function Calling, and Tools are all supported as `generation_kwargs`. See the [llama-cpp-python GitHub README](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#json-and-json-schema-mode) for more information on how to use them.
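As a minimal sketch of JSON mode (the prompt is illustrative, and the exact output depends on the model), the `response_format` option can be forwarded through `generation_kwargs`:

```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Sketch: ask llama.cpp to constrain the output to valid JSON
generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    generation_kwargs={"response_format": {"type": "json_object"}},
)
generator.warm_up()
messages = [ChatMessage.from_user("List three US states as JSON under the key 'states'.")]
result = generator.run(messages)
print(result["replies"][0].text)
```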
For example, to set `max_tokens` and `temperature`:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)
```

The `generation_kwargs` can also be passed to the `run` method of the generator directly:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(
    messages,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
```

### With multimodal (image + text) inputs

`LlamaCppChatGenerator` can also process images together with text when initialized with a multimodal model and its CLIP projector file:

```python
from haystack.dataclasses import ChatMessage, ImageContent
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Initialize with multimodal support
llm = LlamaCppChatGenerator(
    model="llava-v1.5-7b-q4_0.gguf",
    chat_handler_name="Llava15ChatHandler",  # Use the llava-1-5 chat handler
    model_clip_path="mmproj-model-f16.gguf",  # CLIP model
    n_ctx=4096,  # Larger context for image processing
)
llm.warm_up()

image = ImageContent.from_file_path("apple.jpg")
user_message = ChatMessage.from_user(
    content_parts=["What does the image show? Max 5 words.", image],
)

response = llm.run([user_message])["replies"][0].text
print(response)

# Red apple on straw.
```

### In a pipeline

We use `LlamaCppChatGenerator` in a Retrieval-Augmented Generation pipeline on the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) dataset from Hugging Face and generate answers using the [OpenChat-3.5](https://huggingface.co/openchat/openchat-3.5-1210) LLM.
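Besides Haystack and the Llama.cpp integration, the example below also needs the `datasets` and `sentence-transformers` packages mentioned in the code comments:

```shell
pip install datasets sentence-transformers
```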
Load the dataset:

```python
## Install Hugging Face Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders import ChatPromptBuilder
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import ChatMessage

## Import LlamaCppChatGenerator
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

## Load the first 100 rows of the Simple Wikipedia dataset from Hugging Face
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]
```

Index the documents into the `InMemoryDocumentStore` using the `SentenceTransformersDocumentEmbedder` and `DocumentWriter`:

```python
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
## Install sentence transformers using "pip install sentence-transformers"
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

## Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(
    instance=DocumentWriter(document_store=doc_store),
    name="DocWriter",
)
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
```

Create the RAG pipeline and add `LlamaCppChatGenerator` to it:

```python
system_message = ChatMessage.from_system(
    """
Answer the question using the provided context.
Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
""",
)
user_message = ChatMessage.from_user("Question: {{question}}")
assistant_message = ChatMessage.from_assistant("Answer: ")

chat_template = [system_message, user_message, assistant_message]

rag_pipeline = Pipeline()

text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

## Load the LLM using LlamaCppChatGenerator
model_path = "openchat-3.5-1210.Q3_K_S.gguf"
generator = LlamaCppChatGenerator(model=model_path, n_ctx=4096, n_batch=128)

rag_pipeline.add_component(
    instance=text_embedder,
    name="text_embedder",
)
rag_pipeline.add_component(
    instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3),
    name="retriever",
)
rag_pipeline.add_component(
    instance=ChatPromptBuilder(template=chat_template),
    name="prompt_builder",
)
rag_pipeline.add_component(instance=generator, name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm", "answer_builder")
rag_pipeline.connect("retriever", "answer_builder.documents")
```

Run the pipeline:

```python
question = "Which year did the Joker movie release?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
        "answer_builder": {"query": question},
    },
)

generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
## The Joker movie was released on October 4, 2019.
```