---
title: "LlamaCppGenerator"
id: llamacppgenerator
slug: "/llamacppgenerator"
description: "`LlamaCppGenerator` provides an interface to generate text using an LLM running on Llama.cpp."
---

# LlamaCppGenerator

`LlamaCppGenerator` provides an interface to generate text using an LLM running on Llama.cpp.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | After a [`PromptBuilder`](../builders/promptbuilder.mdx) |
| **Mandatory init variables** | `model`: The path of the model to use |
| **Mandatory run variables** | `prompt`: A string containing the prompt for the LLM |
| **Output variables** | `replies`: A list of strings with all the replies generated by the LLM <br /> <br />`meta`: A list of dictionaries with the metadata associated with each reply, such as token count and others |
| **API reference** | [Llama.cpp](/reference/integrations-llama-cpp) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp |

</div>

## Overview

[Llama.cpp](https://github.com/ggml-org/llama.cpp) is a library written in C/C++ for efficient inference of Large Language Models. It leverages the quantized GGUF format, which dramatically reduces memory requirements and accelerates inference. This makes it possible to run LLMs efficiently on standard machines, even without GPUs.

Llama.cpp runs the LLM from a quantized binary file in the GGUF format, which can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf). `LlamaCppGenerator` supports models running on Llama.cpp by taking the path to the locally saved GGUF file as the `model` parameter at initialization.

## Installation

Install the `llama-cpp-haystack` package:

```bash
pip install llama-cpp-haystack
```

### Using a different compute backend

By default, the installation builds `llama.cpp` for CPU on Linux and Windows and uses Metal on macOS. To use a different compute backend:

1. Follow the instructions on the [llama.cpp installation page](https://github.com/abetlen/llama-cpp-python#installation) to install [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for your preferred compute backend.
2. Install [llama-cpp-haystack](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp) using the command above.

For example, to use `llama-cpp-haystack` with the **cuBLAS backend**, run the following commands:

```bash
export GGML_CUDA=1
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
pip install llama-cpp-haystack
```

## Usage

1. Download the GGUF version of the desired LLM. The GGUF versions of popular models can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf).
2. Initialize a `LlamaCppGenerator` with the path to the GGUF file and specify the required model and text generation parameters:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(prompt)
```
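The generator returns a dictionary with the `replies` and `meta` output variables described in the table above. A minimal sketch of reading them (the exact metadata keys depend on the model and the llama.cpp version):

```python
## `replies` is a list of strings, one per generated reply;
## `meta` is a list of dictionaries with per-reply details such as token usage.
print(result["replies"][0])
print(result["meta"][0])
```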
### Passing additional model parameters

The `model`, `n_ctx`, and `n_batch` arguments are exposed for convenience and can be passed directly to the generator as keyword arguments during initialization. Note that `model` translates to `llama.cpp`'s `model_path` parameter.

The `model_kwargs` parameter can be used to pass additional arguments when initializing the model. In case of duplication, these parameters override the `model`, `n_ctx`, and `n_batch` initialization parameters.

See [Llama.cpp's LLM documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__) for more information on the available model arguments.

For example, to offload the model to GPU during initialization:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(prompt, generation_kwargs={"max_tokens": 128})
generated_text = result["replies"][0]
print(generated_text)
```

### Passing text generation parameters

The `generation_kwargs` parameter can be used to pass additional generation arguments, such as `max_tokens`, `temperature`, `top_k`, and `top_p`, to the model during inference.

See [Llama.cpp's Completion API documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion) for more information on the available generation arguments.

For example, to set `max_tokens` and `temperature`:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(prompt)
```

The `generation_kwargs` can also be passed directly to the generator's `run` method:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()
prompt = "Who is the best American actor?"
result = generator.run(
    prompt,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
```

### Using in a Pipeline

We use the `LlamaCppGenerator` in a Retrieval-Augmented Generation (RAG) pipeline on the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) dataset from Hugging Face and generate answers using the [OpenChat-3.5](https://huggingface.co/openchat/openchat-3.5-1210) LLM.
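The pipeline assumes the OpenChat GGUF file is saved locally. A minimal download sketch using `huggingface_hub` (the repository ID and filename below are assumptions; pick the quantization you need):

```python
## Install with "pip install huggingface_hub"
from huggingface_hub import hf_hub_download

## Hypothetical repository ID and filename; the function returns the local path
## of the downloaded file, which can be passed to LlamaCppGenerator as `model`.
model_path = hf_hub_download(
    repo_id="TheBloke/openchat-3.5-1210-GGUF",
    filename="openchat-3.5-1210.Q3_K_S.gguf",
)
```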
Load the dataset:

```python
## Install Hugging Face Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore

## Import LlamaCppGenerator
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

## Load the first 100 rows of the Simple Wikipedia dataset from Hugging Face
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]
```

Index the documents into the `InMemoryDocumentStore` using the `SentenceTransformersDocumentEmbedder` and `DocumentWriter`:

```python
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

## Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(
    instance=DocumentWriter(document_store=doc_store),
    name="DocWriter",
)
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
```
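To verify that indexing worked, you can check how many documents the store now holds; `count_documents` is part of the document store interface:

```python
## Expect one document per indexed dataset row (100 for the split above)
print(doc_store.count_documents())
```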
Create the Retrieval-Augmented Generation (RAG) pipeline and add the `LlamaCppGenerator` to it:

```python
## Prompt template for the https://huggingface.co/openchat/openchat-3.5-1210 LLM
prompt_template = """GPT4 Correct User: Answer the question using the provided context.
Question: {{question}}
Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}
<|end_of_turn|>
GPT4 Correct Assistant:
"""

rag_pipeline = Pipeline()

text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

## Load the LLM using LlamaCppGenerator
model_path = "openchat-3.5-1210.Q3_K_S.gguf"
generator = LlamaCppGenerator(model=model_path, n_ctx=4096, n_batch=128)

rag_pipeline.add_component(
    instance=text_embedder,
    name="text_embedder",
)
rag_pipeline.add_component(
    instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3),
    name="retriever",
)
rag_pipeline.add_component(
    instance=PromptBuilder(template=prompt_template),
    name="prompt_builder",
)
rag_pipeline.add_component(instance=generator, name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("retriever", "answer_builder.documents")
```

Run the pipeline:

```python
question = "Which year did the Joker movie release?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
        "answer_builder": {"query": question},
    },
)

generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
## The Joker movie was released on October 4, 2019.
```
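The `AnswerBuilder` returns `GeneratedAnswer` objects, which carry the retrieved documents alongside the generated text. A short sketch of inspecting the sources behind the answer:

```python
## Each GeneratedAnswer keeps the documents that were passed to answer_builder,
## so the sources can be shown next to the generated text.
for doc in generated_answer.documents:
    print(doc.meta["title"], "-", doc.meta["url"])
```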