---
title: "LlamaCppGenerator"
id: llamacppgenerator
slug: "/llamacppgenerator"
description: "`LlamaCppGenerator` provides an interface to generate text using an LLM running on Llama.cpp."
---

# LlamaCppGenerator

`LlamaCppGenerator` provides an interface to generate text using an LLM running on Llama.cpp.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | After a [`PromptBuilder`](../builders/promptbuilder.mdx) |
| **Mandatory init variables** | `model`: The path of the model to use |
| **Mandatory run variables** | `prompt`: A string containing the prompt for the LLM |
| **Output variables** | `replies`: A list of strings with all the replies generated by the LLM <br /> <br />`meta`: A list of dictionaries with the metadata associated with each reply, such as token count and others |
| **API reference** | [Llama.cpp](/reference/integrations-llama-cpp) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp |

</div>

## Overview

[Llama.cpp](https://github.com/ggml-org/llama.cpp) is a library written in C/C++ for efficient inference of Large Language Models. It leverages the quantized GGUF format, which dramatically reduces memory requirements and accelerates inference. This means it is possible to run LLMs efficiently on standard machines (even without GPUs).

`Llama.cpp` uses the quantized binary file of the LLM in GGUF format, which can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf). `LlamaCppGenerator` supports models running on `Llama.cpp` by taking the path to the locally saved GGUF file as the `model` parameter at initialization.
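
If you don't have a GGUF file saved locally yet, one way to fetch it is with the `huggingface_hub` library. This is a minimal sketch; the repository and filename below (the OpenChat model used in the examples on this page) are assumptions you should replace with your own model:

```python
# Sketch: fetch a GGUF file from Hugging Face; the returned local path is what you
# pass to LlamaCppGenerator as the `model` parameter.
# Assumes `huggingface_hub` is installed; the repo and filename are examples.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/openchat-3.5-1210-GGUF",
    filename="openchat-3.5-1210.Q3_K_S.gguf",
)
print(model_path)  # a path inside the local Hugging Face cache
```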

## Installation

Install the `llama-cpp-haystack` package:

```bash
pip install llama-cpp-haystack
```

### Using a different compute backend

The default installation behavior is to build `llama.cpp` for CPU on Linux and Windows and to use Metal on macOS. To use another compute backend:

1. Follow the instructions on the [llama-cpp-python installation page](https://github.com/abetlen/llama-cpp-python#installation) to install [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for your preferred compute backend.
2. Install [llama-cpp-haystack](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp) using the command above.

For example, to use `llama-cpp-haystack` with the **cuBLAS backend**, run the following commands:

```bash
export GGML_CUDA=1
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
pip install llama-cpp-haystack
```
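
If `llama-cpp-python` is already installed for a different backend, pip may reuse the cached build. A possible workaround (a sketch using standard pip flags) is to force a rebuild before reinstalling `llama-cpp-haystack`:

```bash
# Rebuild llama-cpp-python from source for the new backend, bypassing the pip cache
CMAKE_ARGS="-DGGML_CUDA=on" pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python
```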

## Usage

1. Download the GGUF version of the desired LLM. The GGUF versions of popular models can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf).
2. Initialize a `LlamaCppGenerator` with the path to the GGUF file and specify the required model and text generation parameters:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(prompt)
```
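
The `run` method returns a dictionary with the `replies` and `meta` output variables described above. For example, to inspect the first reply and its metadata:

```python
print(result["replies"][0])  # the generated text
print(result["meta"][0])     # metadata for that reply, such as token counts
```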

### Passing additional model parameters

The `model`, `n_ctx`, and `n_batch` arguments have been exposed for convenience and can be passed directly to the Generator during initialization as keyword arguments. Note that `model` translates to `llama.cpp`'s `model_path` parameter.

The `model_kwargs` parameter can be used to pass additional arguments when initializing the model. In case of duplication, these parameters override the `model`, `n_ctx`, and `n_batch` initialization parameters.

See [Llama.cpp's LLM documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__) for more information on the available model arguments.
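
As a minimal sketch of the override behavior described above (assuming, as stated, that a duplicated key in `model_kwargs` takes precedence):

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

# The duplicated key in model_kwargs wins, so the model is loaded
# with a 2048-token context window rather than 512.
generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    model_kwargs={"n_ctx": 2048},
)
```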

For example, to offload the model to GPU during initialization:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(prompt, generation_kwargs={"max_tokens": 128})
generated_text = result["replies"][0]
print(generated_text)
```

### Passing text generation parameters

The `generation_kwargs` parameter can be used to pass additional generation arguments, such as `max_tokens`, `temperature`, `top_k`, and `top_p`, to the model during inference.

See [Llama.cpp's Completion API documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion) for more information on the available generation arguments.

For example, to set `max_tokens` and `temperature`:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(prompt)
```

The `generation_kwargs` can also be passed directly to the generator's `run` method:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(
    prompt,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
```

### Using in a Pipeline

We use the `LlamaCppGenerator` in a Retrieval Augmented Generation pipeline on the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) dataset from Hugging Face and generate answers using the [OpenChat-3.5](https://huggingface.co/openchat/openchat-3.5-1210) LLM.

Load the dataset:

```python
# Install Hugging Face Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Import LlamaCppGenerator
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

# Load the first 100 rows of the Simple Wikipedia dataset from Hugging Face
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]
```

Index the documents into the `InMemoryDocumentStore` using the `SentenceTransformersDocumentEmbedder` and `DocumentWriter`:

```python
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

# Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(
    instance=DocumentWriter(document_store=doc_store),
    name="DocWriter",
)
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
```
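
To verify that indexing worked, you can check how many documents ended up in the store (a quick sanity check using the document store's `count_documents` method):

```python
# The store should now contain the 100 embedded documents
print(doc_store.count_documents())
```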

Create the Retrieval Augmented Generation (RAG) pipeline and add the `LlamaCppGenerator` to it:

```python
# Prompt template for the https://huggingface.co/openchat/openchat-3.5-1210 LLM
prompt_template = """GPT4 Correct User: Answer the question using the provided context.
Question: {{question}}
Context:
{% for doc in documents %}
    {{ doc.content }}
{% endfor %}
<|end_of_turn|>
GPT4 Correct Assistant:
"""

rag_pipeline = Pipeline()

text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

# Load the LLM using LlamaCppGenerator
model_path = "openchat-3.5-1210.Q3_K_S.gguf"
generator = LlamaCppGenerator(model=model_path, n_ctx=4096, n_batch=128)

rag_pipeline.add_component(
    instance=text_embedder,
    name="text_embedder",
)
rag_pipeline.add_component(
    instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3),
    name="retriever",
)
rag_pipeline.add_component(
    instance=PromptBuilder(template=prompt_template),
    name="prompt_builder",
)
rag_pipeline.add_component(instance=generator, name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")
```
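
Optionally, you can render the assembled graph to an image to double-check the connections. This is a sketch that assumes Haystack's `Pipeline.draw` method is available in your version (rendering the diagram may require an internet connection):

```python
from pathlib import Path

# Optional: write a visualization of the RAG pipeline graph to disk
rag_pipeline.draw(Path("rag_pipeline.png"))
```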

Run the pipeline:

```python
question = "Which year did the Joker movie release?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
        "answer_builder": {"query": question},
    },
)

generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
# The Joker movie was released on October 4, 2019.
```
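
The `GeneratedAnswer` objects produced by `AnswerBuilder` also carry the retrieved documents, so you can inspect which sources supported the answer:

```python
# Print the title and URL of the documents the answer was grounded on
for doc in generated_answer.documents:
    print(doc.meta["title"], "-", doc.meta["url"])
```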