---
title: "LlamaCppGenerator"
id: llamacppgenerator
slug: "/llamacppgenerator"
description: "`LlamaCppGenerator` provides an interface to generate text using an LLM running on Llama.cpp."
---

# LlamaCppGenerator

`LlamaCppGenerator` provides an interface to generate text using an LLM running on Llama.cpp.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`PromptBuilder`](../builders/promptbuilder.mdx) |
| **Mandatory init variables** | `model`: The path of the model to use |
| **Mandatory run variables** | `prompt`: A string containing the prompt for the LLM |
| **Output variables** | `replies`: A list of strings with all the replies generated by the LLM <br /> <br />`meta`: A list of dictionaries with the metadata associated with each reply, such as token count and others |
| **API reference** | [Llama.cpp](/reference/integrations-llama-cpp) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp |

</div>

## Overview

[Llama.cpp](https://github.com/ggml-org/llama.cpp) is a library written in C/C++ for efficient inference of Large Language Models. It leverages the quantized GGUF format, which dramatically reduces memory requirements and accelerates inference. This makes it possible to run LLMs efficiently on standard machines, even without GPUs.

`Llama.cpp` loads the LLM from a quantized binary file in GGUF format, which can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf). `LlamaCppGenerator` supports models running on `Llama.cpp`: pass the path to the locally saved GGUF file as the `model` parameter at initialization.

## Installation

Install the `llama-cpp-haystack` package:

```bash
pip install llama-cpp-haystack
```

### Using a different compute backend

By default, the installation builds `llama.cpp` for CPU on Linux and Windows and uses Metal on macOS. To use another compute backend:

1. Follow the instructions on the [llama.cpp installation page](https://github.com/abetlen/llama-cpp-python#installation) to install [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for your preferred compute backend.
2. Install [llama-cpp-haystack](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp) using the command above.

For example, to use `llama-cpp-haystack` with the **CUDA (cuBLAS) backend**, run the following commands:

```bash
export GGML_CUDA=1
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
pip install llama-cpp-haystack
```

## Usage

1. Download the GGUF version of the desired LLM. The GGUF versions of popular models are available on [Hugging Face](https://huggingface.co/models?library=gguf) (see the download sketch after the example below).
2. Initialize a `LlamaCppGenerator` with the path to the GGUF file and specify the required model and text generation parameters:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(prompt)
```
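If you prefer to fetch the GGUF file programmatically, the separate `huggingface_hub` library (`pip install huggingface_hub`) can download it for you. This is a minimal sketch, not part of the integration's API; the repository id and filename are assumptions matching the OpenChat model used in these examples:

```python
## Hypothetical download sketch: repo_id and filename are assumptions;
## substitute the GGUF repository and quantization you actually want.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/openchat-3.5-1210-GGUF",
    filename="openchat-3.5-1210.Q3_K_S.gguf",
)
print(model_path)  # local path to pass as the generator's `model` parameter
```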
### Passing additional model parameters

The `model`, `n_ctx`, and `n_batch` arguments are exposed for convenience and can be passed to the generator directly as keyword arguments during initialization. Note that `model` translates to `llama.cpp`'s `model_path` parameter.

The `model_kwargs` parameter can be used to pass additional arguments when initializing the model. In case of duplication, these kwargs override the `model`, `n_ctx`, and `n_batch` initialization parameters.

See [Llama.cpp's LLM documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__) for more information on the available model arguments.

For example, to offload the model to the GPU during initialization:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(prompt, generation_kwargs={"max_tokens": 128})
generated_text = result["replies"][0]
print(generated_text)
```
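Alongside `replies`, `run` returns a `meta` list with one dictionary per reply. The exact keys come from llama.cpp's completion response, so treat the `usage` field below as an assumption; a quick way to inspect what your setup actually returns:

```python
## Inspect the metadata returned with each reply. The "usage" key is an
## assumption based on llama.cpp's completion response format; print the
## whole dictionary to see what your backend actually provides.
print(result["meta"][0].keys())
print(result["meta"][0].get("usage"))  # token counts, if provided
```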
### Passing text generation parameters

The `generation_kwargs` parameter can be used to pass additional generation arguments, such as `max_tokens`, `temperature`, `top_k`, and `top_p`, to the model during inference.

See [Llama.cpp's Completion API documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion) for more information on the available generation arguments.

For example, to set `max_tokens` and `temperature`:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(prompt)
```

The `generation_kwargs` can also be passed to the `run` method directly:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(
    prompt,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
```

### Using in a Pipeline

We use the `LlamaCppGenerator` in a Retrieval-Augmented Generation (RAG) pipeline on the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) dataset from Hugging Face and generate answers using the [OpenChat-3.5](https://huggingface.co/openchat/openchat-3.5-1210) LLM.

Load the dataset:

```python
## Install Hugging Face Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

## Import LlamaCppGenerator
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

## Load the first 100 rows of the Simple Wikipedia dataset from Hugging Face
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]
```

Index the documents into the `InMemoryDocumentStore` using the `SentenceTransformersDocumentEmbedder` and `DocumentWriter`:

```python
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

## Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(
    instance=DocumentWriter(document_store=doc_store),
    name="DocWriter",
)
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
```
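To verify the indexing step, you can query the document store directly; `count_documents` and `filter_documents` are standard `InMemoryDocumentStore` methods:

```python
## Quick sanity check: 100 documents should now be stored, each with
## an embedding computed by the SentenceTransformersDocumentEmbedder.
print(doc_store.count_documents())  # 100

sample = doc_store.filter_documents()[0]
print(sample.meta["title"], len(sample.embedding))
```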
Create the Retrieval-Augmented Generation (RAG) pipeline and add the `LlamaCppGenerator` to it:

```python
## Prompt template for the https://huggingface.co/openchat/openchat-3.5-1210 LLM
prompt_template = """GPT4 Correct User: Answer the question using the provided context.
Question: {{question}}
Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
<|end_of_turn|>
GPT4 Correct Assistant:
"""

rag_pipeline = Pipeline()

text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

## Load the LLM using LlamaCppGenerator
model_path = "openchat-3.5-1210.Q3_K_S.gguf"
generator = LlamaCppGenerator(model=model_path, n_ctx=4096, n_batch=128)

rag_pipeline.add_component(
    instance=text_embedder,
    name="text_embedder",
)
rag_pipeline.add_component(
    instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3),
    name="retriever",
)
rag_pipeline.add_component(
    instance=PromptBuilder(template=prompt_template),
    name="prompt_builder",
)
rag_pipeline.add_component(instance=generator, name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")
```

Run the pipeline:

```python
question = "Which year did the Joker movie release?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
        "answer_builder": {"query": question},
    },
)

generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
## The Joker movie was released on October 4, 2019.
```
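`AnswerBuilder` returns `GeneratedAnswer` objects that keep the retrieved documents alongside the generated text, so you can show the sources that backed the answer. A small sketch using the metadata indexed earlier:

```python
## The answer keeps the documents the retriever selected, so the sources
## indexed earlier (title and url metadata) can be shown with the reply.
for doc in generated_answer.documents:
    print(doc.meta["title"], "-", doc.meta["url"])
```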