---
title: "LlamaCppGenerator"
id: llamacppgenerator
slug: "/llamacppgenerator"
description: "`LlamaCppGenerator` provides an interface to generate text using an LLM running on Llama.cpp."
---

# LlamaCppGenerator

`LlamaCppGenerator` provides an interface to generate text using an LLM running on Llama.cpp.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | After a [`PromptBuilder`](../builders/promptbuilder.mdx) |
| **Mandatory init variables** | `model`: The path of the model to use |
| **Mandatory run variables** | `prompt`: A string containing the prompt for the LLM |
| **Output variables** | `replies`: A list of strings with all the replies generated by the LLM <br /> <br />`meta`: A list of dictionaries with the metadata associated with each reply, such as token count |
| **API reference** | [Llama.cpp](/reference/integrations-llama-cpp) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp |

</div>

## Overview

[Llama.cpp](https://github.com/ggml-org/llama.cpp) is a library written in C/C++ for efficient inference of Large Language Models. It uses the quantized GGUF format, which dramatically reduces memory requirements and accelerates inference. This makes it possible to run LLMs efficiently on standard machines, even without GPUs.

`Llama.cpp` uses the LLM's quantized binary file in GGUF format, which can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf). `LlamaCppGenerator` supports models running on `Llama.cpp` by taking the path to the locally saved GGUF file as the `model` parameter at initialization.

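For example, you can fetch a GGUF file programmatically with the [huggingface_hub](https://github.com/huggingface/huggingface_hub) library. This is a minimal sketch; the repository and file names below are illustrative, so substitute the model you want to use:

```python
## Install the Hugging Face Hub client using "pip install huggingface_hub"
from huggingface_hub import hf_hub_download

# Download a GGUF file from the Hugging Face Hub (repo_id and filename are illustrative)
model_path = hf_hub_download(
    repo_id="TheBloke/openchat-3.5-1210-GGUF",
    filename="openchat-3.5-1210.Q3_K_S.gguf",
)

# model_path now points to the locally cached GGUF file and can be passed
# to LlamaCppGenerator as the `model` parameter
print(model_path)
```
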
## Installation

Install the `llama-cpp-haystack` package:

```bash
pip install llama-cpp-haystack
```

### Using a different compute backend

By default, the installation builds `llama.cpp` for CPU on Linux and Windows and uses Metal on macOS. To use a different compute backend:

1. Follow the instructions on the [llama.cpp installation page](https://github.com/abetlen/llama-cpp-python#installation) to install [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for your preferred compute backend.
2. Install [llama-cpp-haystack](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp) using the command above.

For example, to use `llama-cpp-haystack` with the **cuBLAS backend**, run the following commands:

```bash
export GGML_CUDA=1
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
pip install llama-cpp-haystack
```

## Usage

1. Download the GGUF version of the desired LLM. The GGUF versions of popular models can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf).
2. Initialize a `LlamaCppGenerator` with the path to the GGUF file and the required model and text generation parameters:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(prompt)
```

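As listed in the component overview above, the result is a dictionary with `replies` and `meta` keys, so you can read the generated text and its metadata directly:

```python
# Access the generated text and the per-reply metadata returned by the generator
generated_text = result["replies"][0]
reply_metadata = result["meta"][0]
print(generated_text)
print(reply_metadata)
```
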
### Passing additional model parameters

The `model`, `n_ctx`, and `n_batch` arguments are exposed for convenience and can be passed directly to the generator as keyword arguments during initialization. Note that `model` translates to `llama.cpp`'s `model_path` parameter.

The `model_kwargs` parameter can be used to pass additional arguments when initializing the model. In case of duplication, these kwargs override the `model`, `n_ctx`, and `n_batch` initialization parameters.

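For instance, here is a minimal sketch of that precedence: `n_ctx` is set both directly and through `model_kwargs`, and the value in `model_kwargs` is the one passed to `llama.cpp`:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

# n_ctx is given twice; per the precedence described above, the value from
# model_kwargs (2048) overrides the init argument (512)
generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    model_kwargs={"n_ctx": 2048},
)
```
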
See [Llama.cpp's LLM documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__) for more information on the available model arguments.

For example, to offload the model to the GPU during initialization:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(prompt, generation_kwargs={"max_tokens": 128})
generated_text = result["replies"][0]
print(generated_text)
```

### Passing text generation parameters

The `generation_kwargs` parameter can be used to pass additional generation arguments, such as `max_tokens`, `temperature`, `top_k`, and `top_p`, to the model during inference.

See [Llama.cpp's Completion API documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion) for more information on the available generation arguments.

For example, to set `max_tokens` and `temperature`:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(prompt)
```

The `generation_kwargs` can also be passed directly to the `run` method of the generator:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(
    prompt,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
```

### Using in a Pipeline

We use the `LlamaCppGenerator` in a Retrieval-Augmented Generation (RAG) pipeline on the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) dataset from Hugging Face and generate answers using the [OpenChat-3.5](https://huggingface.co/openchat/openchat-3.5-1210) LLM.

Load the dataset:

```python
## Install HuggingFace Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

## Import LlamaCppGenerator
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

## Load first 100 rows of the Simple Wikipedia Dataset from HuggingFace
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]
```

Index the documents into the `InMemoryDocumentStore` using the `SentenceTransformersDocumentEmbedder` and `DocumentWriter`:

```python
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

## Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(
    instance=DocumentWriter(document_store=doc_store),
    name="DocWriter",
)
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
```

Create the Retrieval Augmented Generation (RAG) pipeline and add the `LlamaCppGenerator` to it:

```python
## Prompt Template for the https://huggingface.co/openchat/openchat-3.5-1210 LLM
prompt_template = """GPT4 Correct User: Answer the question using the provided context.
Question: {{question}}
Context:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}
<|end_of_turn|>
GPT4 Correct Assistant:
"""

rag_pipeline = Pipeline()

text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

## Load the LLM using LlamaCppGenerator
model_path = "openchat-3.5-1210.Q3_K_S.gguf"
generator = LlamaCppGenerator(model=model_path, n_ctx=4096, n_batch=128)

rag_pipeline.add_component(
    instance=text_embedder,
    name="text_embedder",
)
rag_pipeline.add_component(
    instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3),
    name="retriever",
)
rag_pipeline.add_component(
    instance=PromptBuilder(template=prompt_template),
    name="prompt_builder",
)
rag_pipeline.add_component(instance=generator, name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")
```

Run the pipeline:

```python
question = "Which year did the Joker movie release?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
        "answer_builder": {"query": question},
    },
)

generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
## The Joker movie was released on October 4, 2019.
```