---
title: "Llama.cpp"
id: integrations-llama-cpp
description: "Llama.cpp integration for Haystack"
slug: "/integrations-llama-cpp"
---

<a id="haystack_integrations.components.generators.llama_cpp.chat.chat_generator"></a>

## Module haystack\_integrations.components.generators.llama\_cpp.chat.chat\_generator

<a id="haystack_integrations.components.generators.llama_cpp.chat.chat_generator.LlamaCppChatGenerator"></a>

### LlamaCppChatGenerator

Provides an interface to generate text using an LLM via llama.cpp.

[llama.cpp](https://github.com/ggml-org/llama.cpp) is a project written in C/C++ for efficient inference of LLMs.
It employs the quantized GGUF format, suitable for running these models on standard machines (even without GPUs).
Supports both text-only and multimodal (text + image) models such as LLaVA.

Usage example:
```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

user_message = [ChatMessage.from_user("Who is the best American actor?")]
generator = LlamaCppChatGenerator(model="zephyr-7b-beta.Q4_0.gguf", n_ctx=2048, n_batch=512)
generator.warm_up()

print(generator.run(user_message, generation_kwargs={"max_tokens": 128}))
# {"replies": [ChatMessage(content="John Cusack", role=<ChatRole.ASSISTANT: "assistant">, name=None, meta={...})]}
```

Usage example with a multimodal model (image + text):
```python
from haystack.dataclasses import ChatMessage, ImageContent
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Create an image from a file path or base64
image_content = ImageContent.from_file_path("path/to/your/image.jpg")

# Create a multimodal message with both text and image
messages = [ChatMessage.from_user(content_parts=["What's in this image?", image_content])]

# Initialize with multimodal support
generator = LlamaCppChatGenerator(
    model="llava-v1.5-7b-q4_0.gguf",
    chat_handler_name="Llava15ChatHandler",  # use the LLaVA 1.5 handler
    model_clip_path="mmproj-model-f16.gguf",  # CLIP model for vision
    n_ctx=4096,  # larger context for image processing
)
generator.warm_up()

result = generator.run(messages)
print(result)
```
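Both examples assume the GGUF files are already on disk. As a minimal sketch, one common way to fetch them is `hf_hub_download` from the `huggingface_hub` package; the repository and filename below are illustrative assumptions, not requirements of this integration:

```python
from huggingface_hub import hf_hub_download

from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Download a quantized GGUF file from the Hugging Face Hub and get its local path.
# Repo and filename are examples; any GGUF model (or a local file) works the same way.
model_path = hf_hub_download(
    repo_id="TheBloke/zephyr-7B-beta-GGUF",
    filename="zephyr-7b-beta.Q4_0.gguf",
)

generator = LlamaCppChatGenerator(model=model_path, n_ctx=2048, n_batch=512)
generator.warm_up()
```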
<a id="haystack_integrations.components.generators.llama_cpp.chat.chat_generator.LlamaCppChatGenerator.__init__"></a>

#### LlamaCppChatGenerator.\_\_init\_\_

```python
def __init__(model: str,
             n_ctx: int | None = 0,
             n_batch: int | None = 512,
             model_kwargs: dict[str, Any] | None = None,
             generation_kwargs: dict[str, Any] | None = None,
             *,
             tools: ToolsType | None = None,
             streaming_callback: StreamingCallbackT | None = None,
             chat_handler_name: str | None = None,
             model_clip_path: str | None = None) -> None
```

**Arguments**:

- `model`: The path of a quantized model for text generation, for example, "zephyr-7b-beta.Q4_0.gguf".
If the model path is also specified in `model_kwargs`, this parameter is ignored.
- `n_ctx`: The number of tokens in the context. When set to 0, the context size is taken from the model.
- `n_batch`: Maximum batch size for prompt processing.
- `model_kwargs`: Dictionary of keyword arguments used to initialize the LLM for text generation.
These keyword arguments provide fine-grained control over model loading.
In case of duplication, these kwargs override the `model`, `n_ctx`, and `n_batch` init parameters.
For more information on the available kwargs, see the
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__).
- `generation_kwargs`: A dictionary of keyword arguments to customize text generation.
For more information on the available kwargs, see the
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion).
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset, for which the model can prepare calls.
Each tool must have a unique name. See the tool-calling sketch after this list.
- `streaming_callback`: A callback function called whenever a new token is received from the stream.
See the streaming sketch after this list.
- `chat_handler_name`: Name of the chat handler for multimodal models.
Common options include "Llava16ChatHandler", "MoondreamChatHandler", and "Qwen25VLChatHandler".
For other handlers, check the
[llama-cpp-python documentation](https://llama-cpp-python.readthedocs.io/en/latest/#multi-modal-models).
- `model_clip_path`: Path to the CLIP model for vision processing (e.g., "mmproj.bin").
Required when `chat_handler_name` is provided for multimodal models.
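Because `tools` accepts Haystack Tool objects, tool calling can be sketched as follows. The `get_weather` function and its schema are hypothetical, and whether the model actually emits tool calls depends on the model and chat format in use:

```python
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool

from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator


def get_weather(city: str) -> str:
    # Hypothetical helper; replace with a real lookup
    return f"Sunny in {city}"


weather_tool = Tool(
    name="get_weather",
    description="Get the current weather for a city.",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
    function=get_weather,
)

generator = LlamaCppChatGenerator(model="zephyr-7b-beta.Q4_0.gguf", tools=[weather_tool])
generator.warm_up()

reply = generator.run([ChatMessage.from_user("What's the weather in Paris?")])["replies"][0]
for tool_call in reply.tool_calls or []:
    print(tool_call.tool_name, tool_call.arguments)
```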
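For `streaming_callback`, a minimal sketch using Haystack's ready-made `print_streaming_chunk` callback (any callable accepting a `StreamingChunk` works as well):

```python
from haystack.components.generators.utils import print_streaming_chunk
from haystack.dataclasses import ChatMessage

from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Tokens are printed to stdout as they arrive instead of only in the final reply
generator = LlamaCppChatGenerator(
    model="zephyr-7b-beta.Q4_0.gguf",
    streaming_callback=print_streaming_chunk,
)
generator.warm_up()

result = generator.run([ChatMessage.from_user("Write a haiku about rivers.")])
```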
<a id="haystack_integrations.components.generators.llama_cpp.chat.chat_generator.LlamaCppChatGenerator.to_dict"></a>

#### LlamaCppChatGenerator.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="haystack_integrations.components.generators.llama_cpp.chat.chat_generator.LlamaCppChatGenerator.from_dict"></a>

#### LlamaCppChatGenerator.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LlamaCppChatGenerator"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="haystack_integrations.components.generators.llama_cpp.chat.chat_generator.LlamaCppChatGenerator.run"></a>

#### LlamaCppChatGenerator.run

```python
@component.output_types(replies=list[ChatMessage])
def run(
    messages: list[ChatMessage],
    generation_kwargs: dict[str, Any] | None = None,
    *,
    tools: ToolsType | None = None,
    streaming_callback: StreamingCallbackT | None = None
) -> dict[str, list[ChatMessage]]
```

Run the text generation model on the given list of ChatMessages.

**Arguments**:

- `messages`: A list of ChatMessage instances representing the input messages.
- `generation_kwargs`: A dictionary of keyword arguments to customize text generation.
For more information on the available kwargs, see the
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion).
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset, for which the model can prepare calls.
Each tool must have a unique name. If set, it overrides the `tools` parameter set during
component initialization.
- `streaming_callback`: A callback function called whenever a new token is received from the stream.
If set, it overrides the `streaming_callback` parameter set during component initialization.

**Returns**:

A dictionary with the following keys:
- `replies`: The responses from the model

<a id="haystack_integrations.components.generators.llama_cpp.chat.chat_generator.LlamaCppChatGenerator.run_async"></a>

#### LlamaCppChatGenerator.run\_async

```python
@component.output_types(replies=list[ChatMessage])
async def run_async(
    messages: list[ChatMessage],
    generation_kwargs: dict[str, Any] | None = None,
    *,
    tools: ToolsType | None = None,
    streaming_callback: StreamingCallbackT | None = None
) -> dict[str, list[ChatMessage]]
```

Async version of `run`. Runs the text generation model on the given list of ChatMessages.

Uses a thread pool to avoid blocking the event loop, since llama-cpp-python provides
only synchronous inference.

**Arguments**:

- `messages`: A list of ChatMessage instances representing the input messages.
- `generation_kwargs`: A dictionary of keyword arguments to customize text generation.
For more information on the available kwargs, see the
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion).
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset, for which the model can prepare calls.
Each tool must have a unique name. If set, it overrides the `tools` parameter set during
component initialization.
- `streaming_callback`: A callback function called whenever a new token is received from the stream.
If set, it overrides the `streaming_callback` parameter set during component initialization.

**Returns**:

A dictionary with the following keys:
- `replies`: The responses from the model
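A minimal sketch of calling `run_async` from an asyncio program (model file and prompt are illustrative):

```python
import asyncio

from haystack.dataclasses import ChatMessage

from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator


async def main() -> None:
    generator = LlamaCppChatGenerator(model="zephyr-7b-beta.Q4_0.gguf")
    generator.warm_up()
    # Awaiting run_async keeps the event loop free while inference
    # runs in a worker thread
    result = await generator.run_async([ChatMessage.from_user("Who is the best American actor?")])
    print(result["replies"][0].text)


asyncio.run(main())
```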
<a id="haystack_integrations.components.generators.llama_cpp.generator"></a>

## Module haystack\_integrations.components.generators.llama\_cpp.generator

<a id="haystack_integrations.components.generators.llama_cpp.generator.LlamaCppGenerator"></a>

### LlamaCppGenerator

Provides an interface to generate text using an LLM via llama.cpp.

[llama.cpp](https://github.com/ggml-org/llama.cpp) is a project written in C/C++ for efficient inference of LLMs.
It employs the quantized GGUF format, suitable for running these models on standard machines (even without GPUs).

Usage example:
```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(model="zephyr-7b-beta.Q4_0.gguf", n_ctx=2048, n_batch=512)
generator.warm_up()

print(generator.run("Who is the best American actor?", generation_kwargs={"max_tokens": 128}))
# {'replies': ['John Cusack'], 'meta': [{"object": "text_completion", ...}]}
```

<a id="haystack_integrations.components.generators.llama_cpp.generator.LlamaCppGenerator.__init__"></a>

#### LlamaCppGenerator.\_\_init\_\_

```python
def __init__(model: str,
             n_ctx: int | None = 0,
             n_batch: int | None = 512,
             model_kwargs: dict[str, Any] | None = None,
             generation_kwargs: dict[str, Any] | None = None) -> None
```

**Arguments**:

- `model`: The path of a quantized model for text generation, for example, "zephyr-7b-beta.Q4_0.gguf".
If the model path is also specified in `model_kwargs`, this parameter is ignored.
- `n_ctx`: The number of tokens in the context. When set to 0, the context size is taken from the model.
- `n_batch`: Maximum batch size for prompt processing.
- `model_kwargs`: Dictionary of keyword arguments used to initialize the LLM for text generation.
These keyword arguments provide fine-grained control over model loading.
In case of duplication, these kwargs override the `model`, `n_ctx`, and `n_batch` init parameters.
For more information on the available kwargs, see the
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__).
- `generation_kwargs`: A dictionary of keyword arguments to customize text generation.
For more information on the available kwargs, see the
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion).

<a id="haystack_integrations.components.generators.llama_cpp.generator.LlamaCppGenerator.run"></a>

#### LlamaCppGenerator.run

```python
@component.output_types(replies=list[str], meta=list[dict[str, Any]])
def run(
    prompt: str,
    generation_kwargs: dict[str, Any] | None = None
) -> dict[str, list[str] | list[dict[str, Any]]]
```

Run the text generation model on the given prompt.

**Arguments**:

- `prompt`: The prompt to send to the generative model.
- `generation_kwargs`: A dictionary of keyword arguments to customize text generation.
For more information on the available kwargs, see the
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion).

**Returns**:

A dictionary with the following keys:
- `replies`: The list of replies generated by the model.
- `meta`: Metadata about the request.
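A minimal sketch of customizing generation and inspecting the returned metadata. The kwargs shown are standard `create_completion` options; the exact layout of each `meta` entry follows llama-cpp-python's OpenAI-style completion response:

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(model="zephyr-7b-beta.Q4_0.gguf", n_ctx=2048)
generator.warm_up()

result = generator.run(
    "Summarize llama.cpp in one sentence.",
    generation_kwargs={"max_tokens": 64, "temperature": 0.2, "top_p": 0.9},
)
print(result["replies"][0])  # the generated text
print(result["meta"][0])     # completion metadata, e.g. model name and token usage
```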