---
title: "LlamaCppChatGenerator"
id: llamacppchatgenerator
slug: "/llamacppchatgenerator"
description: "`LlamaCppChatGenerator` enables chat completion using an LLM running on Llama.cpp."
---

# LlamaCppChatGenerator

`LlamaCppChatGenerator` enables chat completion using an LLM running on Llama.cpp.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | After a [`ChatPromptBuilder`](../builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | `model`: The path of the model to use |
| **Mandatory run variables** | `messages`: A list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) instances representing the input messages |
| **Output variables** | `replies`: A list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) instances with all the replies generated by the LLM |
| **API reference** | [Llama.cpp](/reference/integrations-llama-cpp) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp |

</div>

## Overview

[Llama.cpp](https://github.com/ggml-org/llama.cpp) is a library written in C/C++ for efficient inference of Large Language Models. It leverages the quantized GGUF format, dramatically reducing memory requirements and accelerating inference. This makes it possible to run LLMs efficiently on standard machines, even without GPUs.

`Llama.cpp` uses the quantized binary file of the LLM in GGUF format, which can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf). `LlamaCppChatGenerator` supports models running on `Llama.cpp` by taking the path to the locally saved GGUF file as the `model` parameter at initialization.
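
For instance, you can fetch a GGUF file programmatically with the `huggingface_hub` library. This is a minimal sketch; the repository and file name below are illustrative, so substitute the model you actually want to use:

```python
from huggingface_hub import hf_hub_download

# Download an example GGUF checkpoint from the Hugging Face Hub
# (repo_id and filename are illustrative)
model_path = hf_hub_download(
    repo_id="TheBloke/openchat-3.5-1210-GGUF",
    filename="openchat-3.5-1210.Q3_K_S.gguf",
)
print(model_path)  # local path to pass as the `model` parameter
```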

### Tool Support

`LlamaCppChatGenerator` supports function calling through the `tools` parameter, which accepts flexible tool configurations:

- **A list of Tool objects**: Pass individual tools as a list
- **A single Toolset**: Pass an entire Toolset directly
- **Mixed Tools and Toolsets**: Combine multiple Toolsets with standalone tools in a single list

This allows you to organize related tools into logical groups while also including standalone tools as needed:

```python
from haystack.tools import Tool, Toolset
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Create individual tools (parameters and functions omitted for brevity)
weather_tool = Tool(name="weather", description="Get weather info", ...)
news_tool = Tool(name="news", description="Get latest news", ...)

# Group related tools into a toolset
# (add_tool, subtract_tool, and multiply_tool are Tool objects defined elsewhere)
math_toolset = Toolset([add_tool, subtract_tool, multiply_tool])

# Pass mixed tools and toolsets to the generator
generator = LlamaCppChatGenerator(
    model="/path/to/model.gguf",
    tools=[math_toolset, weather_tool, news_tool]  # Mix of Toolset and Tool objects
)
```
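
For reference, here is a minimal end-to-end sketch with a single, fully defined tool. The backing function, parameter schema, and model path are illustrative:

```python
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator


def get_weather(city: str) -> str:
    # Toy function backing the tool (illustrative only)
    return f"Sunny in {city}"


weather_tool = Tool(
    name="weather",
    description="Get weather info for a city",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
    function=get_weather,
)

generator = LlamaCppChatGenerator(model="/path/to/model.gguf", tools=[weather_tool])
generator.warm_up()
result = generator.run([ChatMessage.from_user("What is the weather in Paris?")])
# Inspect any tool calls produced by the model (may be empty, depending on the model)
print(result["replies"][0].tool_calls)
```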

For more details on working with tools, see the [Tool](../../tools/tool.mdx) and [Toolset](../../tools/toolset.mdx) documentation.

## Installation

Install the `llama-cpp-haystack` package to use this integration:

```shell
pip install llama-cpp-haystack
```

### Using a different compute backend

By default, the installation builds `llama.cpp` for CPU on Linux and Windows and uses Metal on macOS. To use other compute backends:

1. Follow the instructions on the [llama.cpp installation page](https://github.com/abetlen/llama-cpp-python#installation) to install [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for your preferred compute backend.
2. Install [llama-cpp-haystack](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp) using the command above.

For example, to use `llama-cpp-haystack` with the **CUDA backend**, run the following commands:

```shell
export GGML_CUDA=1
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
pip install llama-cpp-haystack
```

## Usage

1. Download the GGUF version of the desired LLM. The GGUF versions of popular models can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf).
2. Initialize `LlamaCppChatGenerator` with the path to the GGUF file and specify the required model and text generation parameters:

```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)
```

### Passing additional model parameters

The `model`, `n_ctx`, and `n_batch` arguments are exposed for convenience and can be passed directly to the generator as keyword arguments at initialization. Note that `model` translates to `llama.cpp`'s `model_path` parameter.

The `model_kwargs` parameter can be used to pass additional arguments when initializing the model. In case of duplication, these kwargs override the `model`, `n_ctx`, and `n_batch` initialization parameters.

See [Llama.cpp's LLM documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__) for more information on the available model arguments.

**Note**: Llama.cpp automatically extracts the `chat_template` from the model metadata and uses it to format ChatMessages. You can override the chat template by passing a custom `chat_handler` or `chat_format` as a model parameter.

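For example, a minimal sketch of forcing a specific chat format through `model_kwargs` (`chatml` is one of the format names supported by `llama-cpp-python`; pick the one matching your model):

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Override the auto-detected chat template by forcing a specific chat_format
# ("chatml" is an illustrative choice)
generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    model_kwargs={"chat_format": "chatml"},
)
```
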
For example, to offload the model to GPU during initialization:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages, generation_kwargs={"max_tokens": 128})
generated_reply = result["replies"][0].text
print(generated_reply)
```

### Passing text generation parameters

The `generation_kwargs` parameter can be used to pass additional generation arguments, such as `max_tokens`, `temperature`, `top_k`, and `top_p`, to the model during inference.

See [Llama.cpp's Chat Completion API documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion) for more information on the available generation arguments.

**Note**: JSON mode, Function Calling, and Tools are all supported through `generation_kwargs`. See the [llama-cpp-python GitHub README](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#json-and-json-schema-mode) for more information on how to use them.

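For instance, a minimal sketch of enabling JSON mode, where `response_format` follows `llama-cpp-python`'s `create_chat_completion` API:

```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    # Constrain the model to emit valid JSON
    generation_kwargs={"response_format": {"type": "json_object"}},
)
generator.warm_up()
messages = [ChatMessage.from_user("List two US states as a JSON array under the key 'states'.")]
result = generator.run(messages)
```
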
For example, to set the `max_tokens` and `temperature`:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)
```

The `generation_kwargs` can also be passed directly to the `run` method of the generator:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(
    messages,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
```

### With multimodal (image + text) inputs

`LlamaCppChatGenerator` can also process images alongside text when initialized with a multimodal model and its accompanying CLIP projection file:

```python
from haystack.dataclasses import ChatMessage, ImageContent
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Initialize with multimodal support
llm = LlamaCppChatGenerator(
    model="llava-v1.5-7b-q4_0.gguf",
    chat_handler_name="Llava15ChatHandler",  # Use the LLaVA 1.5 chat handler
    model_clip_path="mmproj-model-f16.gguf",  # CLIP model
    n_ctx=4096,  # Larger context for image processing
)
llm.warm_up()

image = ImageContent.from_file_path("apple.jpg")
user_message = ChatMessage.from_user(
    content_parts=["What does the image show? Max 5 words.", image],
)

response = llm.run([user_message])["replies"][0].text
print(response)

# Red apple on straw.
```

### In a pipeline

We use the `LlamaCppChatGenerator` in a Retrieval Augmented Generation pipeline on the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) dataset from Hugging Face and generate answers using the [OpenChat-3.5](https://huggingface.co/openchat/openchat-3.5-1210) LLM.

Load the dataset:

```python
## Install HuggingFace Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders import ChatPromptBuilder
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import ChatMessage

## Import LlamaCppChatGenerator
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

## Load first 100 rows of the Simple Wikipedia Dataset from HuggingFace
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]
```

Index the documents into the `InMemoryDocumentStore` using the `SentenceTransformersDocumentEmbedder` and `DocumentWriter`:

```python
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
## Install sentence transformers using "pip install sentence-transformers"
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

## Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(
    instance=DocumentWriter(document_store=doc_store),
    name="DocWriter",
)
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
```

Create the RAG pipeline and add the `LlamaCppChatGenerator` to it:

```python
system_message = ChatMessage.from_system(
    """
    Answer the question using the provided context.
    Context:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}
    """,
)
user_message = ChatMessage.from_user("Question: {{question}}")
assistant_message = ChatMessage.from_assistant("Answer: ")

chat_template = [system_message, user_message, assistant_message]

rag_pipeline = Pipeline()

text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

## Load the LLM using LlamaCppChatGenerator
model_path = "openchat-3.5-1210.Q3_K_S.gguf"
generator = LlamaCppChatGenerator(model=model_path, n_ctx=4096, n_batch=128)

rag_pipeline.add_component(
    instance=text_embedder,
    name="text_embedder",
)
rag_pipeline.add_component(
    instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3),
    name="retriever",
)
rag_pipeline.add_component(
    instance=ChatPromptBuilder(template=chat_template),
    name="prompt_builder",
)
rag_pipeline.add_component(instance=generator, name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm", "answer_builder")
rag_pipeline.connect("retriever", "answer_builder.documents")
```

Run the pipeline:

```python
question = "Which year did the Joker movie release?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
        "answer_builder": {"query": question},
    },
)

generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
## The Joker movie was released on October 4, 2019.
```