---
title: "LlamaCppChatGenerator"
id: llamacppchatgenerator
slug: "/llamacppchatgenerator"
description: "`LlamaCppChatGenerator` enables chat completion using an LLM running on Llama.cpp."
---

# LlamaCppChatGenerator

`LlamaCppChatGenerator` enables chat completion using an LLM running on Llama.cpp.

| | |
| :------------------------------------- | :------------------------------------------------------------------------------------------------------------------------ |
| **Most common position in a pipeline** | After a [`ChatPromptBuilder`](../builders/chatpromptbuilder.mdx) |
| **Mandatory init variables** | "model": The path of the model to use |
| **Mandatory run variables** | "messages": A list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) instances representing the input messages |
| **Output variables** | "replies": A list of [`ChatMessage`](../../concepts/data-classes/chatmessage.mdx) instances with all the replies generated by the LLM |
| **API reference** | [Llama.cpp](/reference/integrations-llama-cpp) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp |

## Overview

[Llama.cpp](https://github.com/ggml-org/llama.cpp) is a library written in C/C++ for efficient inference of Large Language Models. It leverages the quantized GGUF format, which dramatically reduces memory requirements and accelerates inference. This makes it possible to run LLMs efficiently on standard machines, even without GPUs.

Llama.cpp loads the LLM from a quantized binary file in GGUF format, which can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf). `LlamaCppChatGenerator` supports models running on Llama.cpp by taking the path to the locally saved GGUF file as the `model` parameter at initialization.

## Installation

Install the `llama-cpp-haystack` package to use this integration:

```shell
pip install llama-cpp-haystack
```

### Using a different compute backend

By default, the installation builds `llama.cpp` for CPU on Linux and Windows and uses Metal on macOS. To use another compute backend:

1. Follow the instructions on the [llama.cpp installation page](https://github.com/abetlen/llama-cpp-python#installation) to install [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) for your preferred compute backend.
2. Install [llama-cpp-haystack](https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/llama_cpp) using the command above.

For example, to use `llama-cpp-haystack` with the **cuBLAS backend**, run the following commands:

```shell
export GGML_CUDA=1
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python
pip install llama-cpp-haystack
```
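To check whether the installed build can actually offload layers to your GPU, you can query the low-level bindings. This is a minimal sketch; it assumes your installed `llama-cpp-python` version exposes `llama_supports_gpu_offload`:

```python
import llama_cpp

## True if this build of llama.cpp can offload layers to a GPU backend
## (for example, CUDA or Metal); False for a CPU-only build.
print(llama_cpp.llama_supports_gpu_offload())
```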
## Usage

1. Download the GGUF version of the desired LLM. The GGUF versions of popular models can be downloaded from [Hugging Face](https://huggingface.co/models?library=gguf).
2. Initialize `LlamaCppChatGenerator` with the path to the GGUF file and specify the required model and text generation parameters:

```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)
```

### Passing additional model parameters

The `model`, `n_ctx`, and `n_batch` arguments are exposed for convenience and can be passed directly to the generator as keyword arguments during initialization. Note that `model` translates to `llama.cpp`'s `model_path` parameter.

The `model_kwargs` parameter can pass additional arguments when initializing the model. In case of duplication, these parameters override the `model`, `n_ctx`, and `n_batch` initialization parameters.

See [Llama.cpp's LLM documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__) for more information on the available model arguments.

**Note**: Llama.cpp automatically extracts the `chat_template` from the model metadata and uses it to format `ChatMessage` instances. You can override the `chat_template` used by passing in a custom `chat_handler` or `chat_format` as a model parameter.

For example, to offload the model to GPU during initialization:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    model_kwargs={"n_gpu_layers": -1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages, generation_kwargs={"max_tokens": 128})
generated_reply = result["replies"][0].text
print(generated_reply)
```

### Passing text generation parameters

The `generation_kwargs` parameter can pass additional generation arguments like `max_tokens`, `temperature`, `top_k`, `top_p`, and others to the model during inference.

See [Llama.cpp's Chat Completion API documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion) for more information on the available generation arguments.

**Note**: JSON mode, Function Calling, and Tools are all supported as `generation_kwargs`. See the [llama-cpp-python GitHub README](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#json-and-json-schema-mode) for more information on how to use them.
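For instance, JSON mode constrains the model's reply to valid JSON by forwarding llama.cpp's `response_format` argument through `generation_kwargs`. This is a minimal sketch, assuming the chat format of the loaded model supports JSON mode:

```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()
messages = [ChatMessage.from_user("Name three American actors. Reply as a JSON object.")]

## response_format is forwarded to llama.cpp's create_chat_completion
result = generator.run(
    messages,
    generation_kwargs={"response_format": {"type": "json_object"}},
)
print(result["replies"][0].text)
```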
To set `max_tokens` and `temperature` at initialization:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(messages)
```

The `generation_kwargs` can also be passed directly to the generator's `run` method:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator
from haystack.dataclasses import ChatMessage

generator = LlamaCppChatGenerator(
    model="/content/openchat-3.5-1210.Q3_K_S.gguf",
    n_ctx=512,
    n_batch=128,
)
generator.warm_up()
messages = [ChatMessage.from_user("Who is the best American actor?")]
result = generator.run(
    messages,
    generation_kwargs={"max_tokens": 128, "temperature": 0.1},
)
```

### In a pipeline

We use `LlamaCppChatGenerator` in a Retrieval-Augmented Generation (RAG) pipeline on the [Simple Wikipedia](https://huggingface.co/datasets/pszemraj/simple_wikipedia) dataset from Hugging Face and generate answers using the [OpenChat-3.5](https://huggingface.co/openchat/openchat-3.5-1210) LLM.

Load the dataset:

```python
## Install Hugging Face Datasets using "pip install datasets"
from datasets import load_dataset
from haystack import Document, Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder
from haystack.components.builders import ChatPromptBuilder
from haystack.components.embedders import (
    SentenceTransformersDocumentEmbedder,
    SentenceTransformersTextEmbedder,
)
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.dataclasses import ChatMessage

## Import LlamaCppChatGenerator
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

## Load the first 100 rows of the Simple Wikipedia dataset from Hugging Face
dataset = load_dataset("pszemraj/simple_wikipedia", split="validation[:100]")

docs = [
    Document(
        content=doc["text"],
        meta={
            "title": doc["title"],
            "url": doc["url"],
        },
    )
    for doc in dataset
]
```

Index the documents into the `InMemoryDocumentStore` using the `SentenceTransformersDocumentEmbedder` and `DocumentWriter`:

```python
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")
## Install sentence transformers using "pip install sentence-transformers"
doc_embedder = SentenceTransformersDocumentEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

## Indexing Pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component(instance=doc_embedder, name="DocEmbedder")
indexing_pipeline.add_component(
    instance=DocumentWriter(document_store=doc_store),
    name="DocWriter",
)
indexing_pipeline.connect("DocEmbedder", "DocWriter")

indexing_pipeline.run({"DocEmbedder": {"documents": docs}})
```
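To confirm that the indexing pipeline wrote the embedded documents, you can count the documents in the store:

```python
## The store should now contain the 100 embedded documents
print(doc_store.count_documents())
## 100
```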
Create the RAG pipeline and add `LlamaCppChatGenerator` to it:

```python
system_message = ChatMessage.from_system(
    """
    Answer the question using the provided context.
    Context:
    {% for doc in documents %}
    {{ doc.content }}
    {% endfor %}
    """,
)
user_message = ChatMessage.from_user("Question: {{question}}")
assistant_message = ChatMessage.from_assistant("Answer: ")

chat_template = [system_message, user_message, assistant_message]

rag_pipeline = Pipeline()

text_embedder = SentenceTransformersTextEmbedder(
    model="sentence-transformers/all-MiniLM-L6-v2",
)

## Load the LLM using LlamaCppChatGenerator
model_path = "openchat-3.5-1210.Q3_K_S.gguf"
generator = LlamaCppChatGenerator(model=model_path, n_ctx=4096, n_batch=128)

rag_pipeline.add_component(
    instance=text_embedder,
    name="text_embedder",
)
rag_pipeline.add_component(
    instance=InMemoryEmbeddingRetriever(document_store=doc_store, top_k=3),
    name="retriever",
)
rag_pipeline.add_component(
    instance=ChatPromptBuilder(template=chat_template),
    name="prompt_builder",
)
rag_pipeline.add_component(instance=generator, name="llm")
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

rag_pipeline.connect("text_embedder", "retriever")
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm", "answer_builder")
rag_pipeline.connect("retriever", "answer_builder.documents")
```

Run the pipeline:

```python
question = "Which year did the Joker movie release?"
result = rag_pipeline.run(
    {
        "text_embedder": {"text": question},
        "prompt_builder": {"question": question},
        "llm": {"generation_kwargs": {"max_tokens": 128, "temperature": 0.1}},
        "answer_builder": {"query": question},
    },
)

generated_answer = result["answer_builder"]["answers"][0]
print(generated_answer.data)
## The Joker movie was released on October 4, 2019.
```
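Since the retriever is also connected to `answer_builder.documents`, the returned `GeneratedAnswer` carries the source documents behind the answer. For example, to print the `title` and `url` metadata attached during indexing:

```python
## Inspect the retrieved documents that grounded the answer
for doc in generated_answer.documents:
    print(doc.meta["title"], "-", doc.meta["url"])
```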