llamacppchatgenerator.mdx
  1  ---
  2  title: "LlamaCppChatGenerator"
  3  id: llamacppchatgenerator
  4  slug: "/llamacppchatgenerator"
description: "`LlamaCppChatGenerator` enables chat completion using an LLM running on Llama.cpp."
  6  ---
  7  
  8  # LlamaCppChatGenerator
  9  
`LlamaCppChatGenerator` enables chat completion using an LLM running on Llama.cpp.
 11  
 12  |                                        |                                                                                                                           |
 13  | :------------------------------------- | :------------------------------------------------------------------------------------------------------------------------ |
 14  | **Most common position in a pipeline** | After a [`ChatPromptBuilder`](../builders/chatpromptbuilder.mdx)                                                                    |
| **Mandatory init variables**           | "model": The path of the GGUF model file to use                                                                          |
| **Mandatory run variables**            | "messages": A list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) instances representing the input messages          |
| **Output variables**                   | "replies": A list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) instances with all the replies generated by the LLM |
 18  | **API reference**                      | [Llama.cpp](/reference/integrations-llama-cpp)                                                                                   |
 19  | **GitHub link**                        | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp                               |
 20  
 21  ## Overview
 22  
[Llama.cpp](https://github.com/ggml-org/llama.cpp) is a library written in C/C++ for efficient inference of Large Language Models. It leverages the quantized GGUF format, dramatically reducing memory requirements and accelerating inference. This makes it possible to run LLMs efficiently on standard machines, even without GPUs.
 24  
`Llama.cpp` uses the quantized binary file of the LLM in GGUF format, which can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf). `LlamaCppChatGenerator` supports models running on `Llama.cpp` by taking the path to the locally saved GGUF file as the `model` parameter at initialization.
 26  
 27  ## Installation
 28  
 29  Install the `llama-cpp-haystack` package to use this integration:
 30  
 31  ```shell
 32  pip install llama-cpp-haystack
 33  ```
 34  
 35  ### Using a different compute backend
 36  
The default installation behavior is to build `llama.cpp` for CPU on Linux and Windows and to use Metal on macOS. To use other compute backends:
 38  
 39  1. Follow instructions on the [llama.cpp installation page](https://github.com/abetlen/llama-cpp-python#installation) to install [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for your preferred compute backend.
 40  2. Install [llama-cpp-haystack](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp) using the command above.
 41  
For example, to use `llama-cpp-haystack` with the **cuBLAS backend**, run the following commands:
 43  
 44  ```shell
 45  export GGML_CUDA=1
 46  CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
 47  pip install llama-cpp-haystack
 48  ```
 49  
 50  ## Usage
 51  
1. Download the GGUF version of the desired LLM. The GGUF versions of popular models can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf); a programmatic download sketch follows the initialization example below.
 53  2. Initialize `LlamaCppChatGenerator` with the path to the GGUF file and specify the required model and text generation parameters:
 54  
 55  ```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage
 57  
 58  generator = LlamaCppChatGenerator(
 59      model="/content/openchat-3.5-1210.Q3_K_S.gguf",
 60      n_ctx=512,
 61      n_batch=128,
 62      model_kwargs={"n_gpu_layers": -1},
 63      generation_kwargs={"max_tokens": 128, "temperature": 0.1},
 64  )
 65  generator.warm_up()
 66  messages = [ChatMessage.from_user("Who is the best American actor?")]
 67  result = generator.run(messages)
 68  ```
 69  
 70  ### Passing additional model parameters
 71  
The `model`, `n_ctx`, and `n_batch` arguments are exposed for convenience and can be passed directly to the generator as keyword arguments during initialization. Note that `model` translates to `llama.cpp`'s `model_path` parameter.
 73  
The `model_kwargs` parameter can be used to pass additional arguments when initializing the model. In case of duplication, these kwargs override the `model`, `n_ctx`, and `n_batch` initialization parameters.
 75  
 76  See [Llama.cpp's LLM documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__) for more information on the available model arguments.
 77  
**Note**: Llama.cpp automatically extracts the `chat_template` from the model metadata and uses it to format the `ChatMessage`s. You can override this by passing a custom `chat_handler` or `chat_format` as a model parameter.
 79  
 80  For example, to offload the model to GPU during initialization:
 81  
 82  ```python
 83  from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
 84  from haystack.dataclasses import ChatMessage
 85  
 86  generator = LlamaCppChatGenerator(
 87      model="/content/openchat-3.5-1210.Q3_K_S.gguf",
 88      n_ctx=512,
 89      n_batch=128,
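    ## n_gpu_layers=-1 offloads all model layers to the GPU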
 90      model_kwargs={"n_gpu_layers": -1},
 91  )
 92  generator.warm_up()
 93  messages = [ChatMessage.from_user("Who is the best American actor?")]
 94  result = generator.run(messages, generation_kwargs={"max_tokens": 128})
 95  generated_reply = result["replies"][0].content
 96  print(generated_reply)
 97  ```
 98  
 99  ### Passing text generation parameters
100  
The `generation_kwargs` parameter can be used to pass additional generation arguments like `max_tokens`, `temperature`, `top_k`, `top_p`, and others to the model during inference.
102  
103  See [Llama.cpp's Chat Completion API documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion) for more information on the available generation arguments.
104  
105  **Note**: JSON mode, Function Calling, and Tools are all supported as `generation_kwargs`. Please see the [llama-cpp-python GitHub README](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#json-and-json-schema-mode) for more information on how to use them.
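
For instance, JSON mode can be requested by adding a `response_format` entry to `generation_kwargs`. This is a sketch; whether the reply is well-formed JSON also depends on the model and the prompt:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()
messages = [ChatMessage.from_user("List the three largest cities in Europe as JSON.")]

## "response_format" is forwarded to llama.cpp's create_chat_completion
result = generator.run(
    messages,
    generation_kwargs={
        "response_format": {"type": "json_object"},
        "max_tokens": 128,
    },
)
print(result["replies"][0].content)
```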
106  
107  For example, to set the `max_tokens` and `temperature`:
108  
109  ```python
110  from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
111  from haystack.dataclasses import ChatMessage
112  
113  generator = LlamaCppChatGenerator(
114      model="/content/openchat-3.5-1210.Q3_K_S.gguf",
115      n_ctx=512,
116      n_batch=128,
117      generation_kwargs={"max_tokens": 128, "temperature": 0.1},
118  )
119  generator.warm_up()
120  messages = [ChatMessage.from_user("Who is the best American actor?")]
121  result = generator.run(messages)
122  ```
123  
124  The `generation_kwargs` can also be passed to the `run` method of the generator directly:
125  
126  ```python
127  from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
128  from haystack.dataclasses import ChatMessage
129  
130  generator = LlamaCppChatGenerator(
131      model="/content/openchat-3.5-1210.Q3_K_S.gguf",
132      n_ctx=512,
133      n_batch=128,
134  )
135  generator.warm_up()
136  messages = [ChatMessage.from_user("Who is the best American actor?")]
137  result = generator.run(
138      messages,
139      generation_kwargs={"max_tokens": 128, "temperature": 0.1},
140  )
141  ```
142  
143  ### In a pipeline
144  
We use `LlamaCppChatGenerator` in a Retrieval-Augmented Generation (RAG) pipeline on the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) dataset from Hugging Face and generate answers using the [OpenChat-3.5](https://huggingface.co/openchat/openchat-3.5-1210) LLM.
146  
147  Load the dataset:
148  
149  ```python
150  ## Install HuggingFace Datasets using "pip install datasets"
151  from datasets import load_dataset
152  from haystack import Document, Pipeline
153  from haystack.components.builders.answer_builder import AnswerBuilder
154  from haystack.components.builders import ChatPromptBuilder
155  from haystack.components.embedders import (
156      SentenceTransformersDocumentEmbedder,
157      SentenceTransformersTextEmbedder,
158  )
159  from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
160  from haystack.components.writers import DocumentWriter
161  from haystack.document_stores.in_memory import InMemoryDocumentStore
162  from haystack.dataclasses import ChatMessage
163  
164  ## Import LlamaCppChatGenerator
165  from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
166  
167  ## Load first 100 rows of the Simple Wikipedia Dataset from HuggingFace
168  dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")
169  
170  docs = [
171      Document(
172          content=doc["text"],
173          meta={
174              "title": doc["title"],
175              "url": doc["url"],
176          },
177      )
178      for doc in dataset
179  ]
180  ```
181  
Index the documents into the `InMemoryDocumentStore` using the `SentenceTransformersDocumentEmbedder` and `DocumentWriter`:
183  
184  ```python
185  doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
186  ## Install sentence transformers using "pip install sentence-transformers"
187  doc_embedder = SentenceTransformersDocumentEmbedder(
188      model="sentence-transformers/all-MiniLM-L6-v2",
189  )
190  
191  ## Indexing Pipeline
192  indexing_pipeline = Pipeline()
193  indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
194  indexing_pipeline.add_component(
195      instance=DocumentWriter(document_store=doc_store),
196      name="DocWriter",
197  )
198  indexing_pipeline.connect("DocEmbedder", "DocWriter")
199  
200  indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
201  ```
202  
203  Create the RAG pipeline and add the `LlamaCppChatGenerator` to it:
204  
205  ```python
206  system_message = ChatMessage.from_system(
207      """
208      Answer the question using the provided context.
209      Context:
210      {% for doc in documents %}
211          {{ doc.content }}
212      {% endfor %}
213      """,
214  )
215  user_message = ChatMessage.from_user("Question: {{question}}")
assistant_message = ChatMessage.from_assistant("Answer: ")

chat_template = [system_message, user_message, assistant_message]
219  
220  rag_pipeline = Pipeline()
221  
222  text_embedder = SentenceTransformersTextEmbedder(
223      model="sentence-transformers/all-MiniLM-L6-v2",
224  )
225  
226  ## Load the LLM using LlamaCppChatGenerator
227  model_path = "openchat-3.5-1210.Q3_K_S.gguf"
228  generator = LlamaCppChatGenerator(model=model_path, n_ctx=4096, n_batch=128)
229  
230  rag_pipeline.add_component(
231      instance=text_embedder,
232      name="text_embedder",
233  )
234  rag_pipeline.add_component(
235      instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3),
236      name="retriever",
237  )
238  rag_pipeline.add_component(
239      instance=ChatPromptBuilder(template=chat_template),
240      name="prompt_builder",
241  )
242  rag_pipeline.add_component(instance=generator, name="llm")
243  rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")
244  
245  rag_pipeline.connect("text_embedder", "retriever")
246  rag_pipeline.connect("retriever", "prompt_builder.documents")
247  rag_pipeline.connect("prompt_builder", "llm")
248  rag_pipeline.connect("llm", "answer_builder")
249  rag_pipeline.connect("retriever", "answer_builder.documents")
250  ```
251  
252  Run the pipeline:
253  
254  ```python
255  question = "Which year did the Joker movie release?"
256  result = rag_pipeline.run(
257      {
258          "text_embedder": {"text": question},
259          "prompt_builder": {"question": question},
260          "llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
261          "answer_builder": {"query": question},
262      },
263  )
264  
265  generated_answer = result["answer_builder"]["answers"][0]
266  print(generated_answer.data)
267  ## The Joker movie was released on October 4, 2019.
268  ```