---
title: "LlamaCppChatGenerator"
id: llamacppchatgenerator
slug: "/llamacppchatgenerator"
description: "`LlamaCppChatGenerator` enables chat completion using an LLM running on Llama.cpp."
---

# LlamaCppChatGenerator

`LlamaCppChatGenerator` enables chat completion using an LLM running on Llama.cpp.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | After a [`ChatPromptBuilder`](../builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | `model`: The path of the model to use |
| **Mandatory run variables** | `messages`: A list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) instances representing the input messages |
| **Output variables** | `replies`: A list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) instances with all the replies generated by the LLM |
| **API reference** | [Llama.cpp](/reference/integrations-llama-cpp) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp |

</div>

## Overview

[Llama.cpp](https://github.com/ggml-org/llama.cpp) is a library written in C/C++ for efficient inference of Large Language Models. It leverages the quantized GGUF format, which dramatically reduces memory requirements and accelerates inference. This makes it possible to run LLMs efficiently on standard machines, even without GPUs.

Llama.cpp consumes the LLM as a quantized binary file in GGUF format, which can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf). `LlamaCppChatGenerator` supports models running on Llama.cpp by taking the path to the locally saved GGUF file as the `model` parameter at initialization.
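
For instance, here is a minimal sketch of fetching a GGUF file programmatically with the `huggingface_hub` library and pointing the generator at it (the `repo_id` and `filename` below are only illustrative):

```python
# pip install huggingface_hub
from huggingface_hub import hf_hub_download

from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Download a GGUF file from Hugging Face; hf_hub_download returns the local path
model_path = hf_hub_download(
    repo_id="TheBloke/openchat-3.5-1210-GGUF",  # example repository
    filename="openchat-3.5-1210.Q3_K_S.gguf",  # example quantized file
)

# Point the generator at the locally saved GGUF file
generator = LlamaCppChatGenerator(model=model_path)
```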

### Tool Support

`LlamaCppChatGenerator` supports function calling through the `tools` parameter, which accepts flexible tool configurations:

- **A list of Tool objects**: Pass individual tools as a list
- **A single Toolset**: Pass an entire Toolset directly
- **Mixed Tools and Toolsets**: Combine multiple Toolsets with standalone tools in a single list

This allows you to organize related tools into logical groups while also including standalone tools as needed.

```python
from haystack.tools import Tool, Toolset
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Create individual tools
weather_tool = Tool(name="weather", description="Get weather info", ...)
news_tool = Tool(name="news", description="Get latest news", ...)

# Group related tools into a toolset
# (add_tool, subtract_tool, and multiply_tool are Tool objects defined like the ones above)
math_toolset = Toolset([add_tool, subtract_tool, multiply_tool])

# Pass mixed tools and toolsets to the generator
generator = LlamaCppChatGenerator(
    model="/path/to/model.gguf",
    tools=[math_toolset, weather_tool, news_tool],  # Mix of Toolset and Tool objects
)
```
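
As a fully worked illustration, here is a minimal sketch of a complete `Tool` definition passed to the generator; the `get_weather` function and its JSON schema are invented for this example. Tool invocations requested by the model appear as `tool_calls` on the returned replies:

```python
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# A toy function the model can call; invented for this example
def get_weather(city: str) -> str:
    return f"Sunny in {city}"

weather_tool = Tool(
    name="weather",
    description="Get weather info for a city",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
    function=get_weather,
)

generator = LlamaCppChatGenerator(model="/path/to/model.gguf", tools=[weather_tool])
generator.warm_up()

result = generator.run([ChatMessage.from_user("What is the weather in Paris?")])
# If the model decided to call the tool, the reply carries tool calls
print(result["replies"][0].tool_calls)
```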

For more details on working with tools, see the [Tool](../../tools/tool.mdx) and [Toolset](../../tools/toolset.mdx) documentation.

## Installation

Install the `llama-cpp-haystack` package to use this integration:

```shell
pip install llama-cpp-haystack
```

### Using a different compute backend

By default, the installation builds `llama.cpp` for CPU on Linux and Windows and uses Metal on macOS. To use another compute backend:

1. Follow the instructions on the [llama.cpp installation page](https://github.com/abetlen/llama-cpp-python#installation) to install [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for your preferred compute backend.
2. Install [llama-cpp-haystack](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp) using the command above.

For example, to use `llama-cpp-haystack` with the **cuBLAS backend**, run the following commands:

```shell
export GGML_CUDA=1
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
pip install llama-cpp-haystack
```
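
Similarly, a sketch of building with Metal GPU acceleration on macOS, assuming the `GGML_METAL` CMake flag used by recent llama-cpp-python releases:

```shell
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
pip install llama-cpp-haystack
```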

## Usage

1. Download the GGUF version of the desired LLM. The GGUF versions of popular models can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf).
2. Initialize `LlamaCppChatGenerator` with the path to the GGUF file and specify the required model and text generation parameters:

```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)
```

### Passing additional model parameters

The `model`, `n_ctx`, and `n_batch` arguments are exposed for convenience and can be passed directly to the generator as keyword arguments during initialization. Note that `model` translates to `llama.cpp`'s `model_path` parameter.

The `model_kwargs` parameter can be used to pass additional arguments when initializing the model. In case of duplication, these kwargs override the `model`, `n_ctx`, and `n_batch` initialization parameters.

See [Llama.cpp's LLM documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__) for more information on the available model arguments.

**Note**: Llama.cpp automatically extracts the `chat_template` from the model metadata and uses it to format the `ChatMessage` objects. You can override this by passing a custom `chat_handler` or `chat_format` as a model parameter.
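
For instance, a minimal sketch of forcing the ChatML prompt format through llama-cpp-python's `chat_format` option instead of the template stored in the GGUF metadata (this assumes the model was trained on ChatML-style prompts):

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    # Override the chat template extracted from the model metadata
    model_kwargs={"chat_format": "chatml"},
)
```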

For example, to offload the model to GPU during initialization:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages, generation_kwargs={"max_tokens": 128})
generated_reply = result["replies"][0].text
print(generated_reply)
```

### Passing text generation parameters

The `generation_kwargs` parameter can be used to pass additional generation arguments like `max_tokens`, `temperature`, `top_k`, `top_p`, and others to the model during inference.

See [Llama.cpp's Chat Completion API documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion) for more information on the available generation arguments.

**Note**: JSON mode, function calling, and tools are all supported through `generation_kwargs`. See the [llama-cpp-python GitHub README](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#json-and-json-schema-mode) for more information on how to use them.
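
For instance, a minimal sketch of constraining the output to valid JSON via llama-cpp-python's `response_format` option (this assumes the model's chat format supports JSON mode):

```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    # Constrain generation to valid JSON (JSON mode)
    generation_kwargs={"response_format": {"type": "json_object"}},
)
generator.warm_up()
messages = [ChatMessage.from_user("List three US states as a JSON array under the key 'states'.")]
result = generator.run(messages)
print(result["replies"][0].text)
```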

For example, to set the `max_tokens` and `temperature`:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)
```

The `generation_kwargs` can also be passed to the `run` method of the generator directly:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(
    messages,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
```

### With multimodal (image + text) inputs

Vision models such as LLaVA need both the GGUF model and its companion CLIP projection file. Pass images to the generator as `ImageContent` parts of a `ChatMessage`:

```python
from haystack.dataclasses import ChatMessage, ImageContent
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Initialize with multimodal support
llm = LlamaCppChatGenerator(
    model="llava-v1.5-7b-q4_0.gguf",
    chat_handler_name="Llava15ChatHandler",  # Use the LLaVA 1.5 chat handler
    model_clip_path="mmproj-model-f16.gguf",  # CLIP projection model
    n_ctx=4096,  # Larger context for image processing
)
llm.warm_up()

image = ImageContent.from_file_path("apple.jpg")
user_message = ChatMessage.from_user(
    content_parts=["What does the image show? Max 5 words.", image],
)

response = llm.run([user_message])["replies"][0].text
print(response)

# Red apple on straw.
```

### In a pipeline

We use the `LlamaCppChatGenerator` in a Retrieval-Augmented Generation (RAG) pipeline on the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) dataset from Hugging Face and generate answers using the [OpenChat-3.5](https://huggingface.co/openchat/openchat-3.5-1210) LLM.

Load the dataset:

```python
## Install HuggingFace Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders import ChatPromptBuilder
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import ChatMessage

## Import LlamaCppChatGenerator
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

## Load the first 100 rows of the Simple Wikipedia dataset from HuggingFace
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]
```

Index the documents into the `InMemoryDocumentStore` using the `SentenceTransformersDocumentEmbedder` and `DocumentWriter`:

```python
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
## Install sentence transformers using "pip install sentence-transformers"
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

## Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(
    instance=DocumentWriter(document_store=doc_store),
    name="DocWriter",
)
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
```

Create the RAG pipeline and add the `LlamaCppChatGenerator` to it:

```python
system_message = ChatMessage.from_system(
    """
    Answer the question using the provided context.
    Context:
    {% for doc in documents %}
        {{ doc.content }}
    {% endfor %}
    """,
)
user_message = ChatMessage.from_user("Question: {{question}}")
assistant_message = ChatMessage.from_assistant("Answer: ")

chat_template = [system_message, user_message, assistant_message]

rag_pipeline = Pipeline()

text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

## Load the LLM using LlamaCppChatGenerator
model_path = "openchat-3.5-1210.Q3_K_S.gguf"
generator = LlamaCppChatGenerator(model=model_path, n_ctx=4096, n_batch=128)

rag_pipeline.add_component(
    instance=text_embedder,
    name="text_embedder",
)
rag_pipeline.add_component(
    instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3),
    name="retriever",
)
rag_pipeline.add_component(
    instance=ChatPromptBuilder(template=chat_template),
    name="prompt_builder",
)
rag_pipeline.add_component(instance=generator, name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm", "answer_builder")
rag_pipeline.connect("retriever", "answer_builder.documents")
```

Run the pipeline:

```python
question = "Which year was the Joker movie released?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
        "answer_builder": {"query": question},
    },
)

generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
## The Joker movie was released on October 4, 2019.
```