---
title: "Llama.cpp"
id: integrations-llama-cpp
description: "Llama.cpp integration for Haystack"
slug: "/integrations-llama-cpp"
---

<a id="haystack_integrations.components.generators.llama_cpp.chat.chat_generator"></a>

## Module haystack\_integrations.components.generators.llama\_cpp.chat.chat\_generator

<a id="haystack_integrations.components.generators.llama_cpp.chat.chat_generator.LlamaCppChatGenerator"></a>

### LlamaCppChatGenerator

Provides an interface to generate text using an LLM via llama.cpp.

[llama.cpp](https://github.com/ggml-org/llama.cpp) is a project written in C/C++ for efficient inference of LLMs.
It uses the quantized GGUF format, which makes these models suitable for running on standard machines (even without GPUs).
It supports both text-only and multimodal (text + image) models such as LLaVA.

Usage example:
```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(model="zephyr-7b-beta.Q4_0.gguf", n_ctx=2048, n_batch=512)
generator.warm_up()

user_message = [ChatMessage.from_user("Who is the best American actor?")]
print(generator.run(user_message, generation_kwargs={"max_tokens": 128}))
# {"replies": [ChatMessage(content="John Cusack", role=<ChatRole.ASSISTANT: "assistant">, name=None, meta={...})]}
```

Usage example with multimodal (image + text):
```python
from haystack.dataclasses import ChatMessage, ImageContent
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

# Create an image from a file path or a base64 string
image_content = ImageContent.from_file_path("path/to/your/image.jpg")

# Create a multimodal message with both text and image
messages = [ChatMessage.from_user(content_parts=["What's in this image?", image_content])]

# Initialize with multimodal support
generator = LlamaCppChatGenerator(
    model="llava-v1.5-7b-q4_0.gguf",
    chat_handler_name="Llava15ChatHandler",  # use the LLaVA 1.5 chat handler
    model_clip_path="mmproj-model-f16.gguf",  # CLIP model for vision processing
    n_ctx=4096,  # larger context for image processing
)
generator.warm_up()

result = generator.run(messages)
print(result)
```

<a id="haystack_integrations.components.generators.llama_cpp.chat.chat_generator.LlamaCppChatGenerator.__init__"></a>

#### LlamaCppChatGenerator.\_\_init\_\_

```python
def __init__(model: str,
             n_ctx: int | None = 0,
             n_batch: int | None = 512,
             model_kwargs: dict[str, Any] | None = None,
             generation_kwargs: dict[str, Any] | None = None,
             *,
             tools: ToolsType | None = None,
             streaming_callback: StreamingCallbackT | None = None,
             chat_handler_name: str | None = None,
             model_clip_path: str | None = None) -> None
```

**Arguments**:

- `model`: The path of a quantized model for text generation, for example, "zephyr-7b-beta.Q4_0.gguf".
If the model path is also specified in the `model_kwargs`, this parameter will be ignored.
- `n_ctx`: The number of tokens in the context. When set to 0, the context is taken from the model.
- `n_batch`: Maximum batch size for prompt processing.
- `model_kwargs`: Dictionary containing keyword arguments used to initialize the LLM for text generation.
These keyword arguments provide fine-grained control over model loading.
In case of duplication, these kwargs override the `model`, `n_ctx`, and `n_batch` init parameters.
For more information on the available kwargs, see
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__).
- `generation_kwargs`: A dictionary containing keyword arguments to customize text generation.
For more information on the available kwargs, see
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion).
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls.
Each tool should have a unique name.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
- `chat_handler_name`: Name of the chat handler for multimodal models.
Common options include "Llava15ChatHandler", "Llava16ChatHandler", "MoondreamChatHandler", and "Qwen25VLChatHandler".
For other handlers, check
[llama-cpp-python documentation](https://llama-cpp-python.readthedocs.io/en/latest/#multi-modal-models).
- `model_clip_path`: Path to the CLIP model for vision processing (for example, "mmproj.bin").
Required when `chat_handler_name` is provided for multimodal models.
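
As a minimal sketch of how these parameters combine (the model path is a placeholder; `n_gpu_layers` and `verbose` are standard `llama_cpp.Llama` options, and `print_streaming_chunk` is Haystack's built-in streaming helper):

```python
from haystack.components.generators.utils import print_streaming_chunk
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(
    model="zephyr-7b-beta.Q4_0.gguf",
    n_ctx=2048,
    # Passed through to llama_cpp.Llama.__init__; on duplication these
    # override the model, n_ctx, and n_batch init parameters.
    model_kwargs={"n_gpu_layers": -1, "verbose": False},
    # Default sampling settings that run() can override per call.
    generation_kwargs={"max_tokens": 128, "temperature": 0.7},
    # Prints each chunk to stdout as it arrives from the stream.
    streaming_callback=print_streaming_chunk,
)
generator.warm_up()  # loads the model; required before run()
```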

<a id="haystack_integrations.components.generators.llama_cpp.chat.chat_generator.LlamaCppChatGenerator.to_dict"></a>

#### LlamaCppChatGenerator.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="haystack_integrations.components.generators.llama_cpp.chat.chat_generator.LlamaCppChatGenerator.from_dict"></a>

#### LlamaCppChatGenerator.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LlamaCppChatGenerator"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.
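
A sketch of the usual Haystack serialization round trip (the model path is a placeholder):

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppChatGenerator

generator = LlamaCppChatGenerator(model="zephyr-7b-beta.Q4_0.gguf", n_ctx=2048)

# to_dict() returns a plain dict that can be stored as JSON or YAML;
# from_dict() rebuilds an equivalent, not-yet-loaded component.
data = generator.to_dict()
restored = LlamaCppChatGenerator.from_dict(data)
restored.warm_up()  # the model weights are loaded here, not serialized
```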

<a id="haystack_integrations.components.generators.llama_cpp.chat.chat_generator.LlamaCppChatGenerator.run"></a>

#### LlamaCppChatGenerator.run

```python
@component.output_types(replies=list[ChatMessage])
def run(
    messages: list[ChatMessage],
    generation_kwargs: dict[str, Any] | None = None,
    *,
    tools: ToolsType | None = None,
    streaming_callback: StreamingCallbackT | None = None
) -> dict[str, list[ChatMessage]]
```

Run the text generation model on the given list of ChatMessages.

**Arguments**:

- `messages`: A list of ChatMessage instances representing the input messages.
- `generation_kwargs`: A dictionary containing keyword arguments to customize text generation.
For more information on the available kwargs, see
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion).
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls.
Each tool should have a unique name. If set, it will override the `tools` parameter set during
component initialization.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
If set, it will override the `streaming_callback` parameter set during component initialization.

**Returns**:

A dictionary with the following keys:
- `replies`: The responses from the model.
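
A hedged sketch of passing tools at run time (the `get_weather` tool is hypothetical, `generator` is assumed initialized and warmed up as above, and whether the model actually emits a tool call depends on the loaded GGUF model):

```python
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool

def get_weather(city: str) -> str:
    # Hypothetical tool implementation for illustration only.
    return f"Sunny in {city}"

weather_tool = Tool(
    name="get_weather",
    description="Get the current weather for a city.",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
    function=get_weather,
)

# tools passed here override any tools set at initialization
result = generator.run(
    [ChatMessage.from_user("What is the weather in Berlin?")],
    tools=[weather_tool],
)
reply = result["replies"][0]
if reply.tool_calls:
    print(reply.tool_calls[0].tool_name, reply.tool_calls[0].arguments)
```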

<a id="haystack_integrations.components.generators.llama_cpp.chat.chat_generator.LlamaCppChatGenerator.run_async"></a>

#### LlamaCppChatGenerator.run\_async

```python
@component.output_types(replies=list[ChatMessage])
async def run_async(
    messages: list[ChatMessage],
    generation_kwargs: dict[str, Any] | None = None,
    *,
    tools: ToolsType | None = None,
    streaming_callback: StreamingCallbackT | None = None
) -> dict[str, list[ChatMessage]]
```

Async version of run. Runs the text generation model on the given list of ChatMessages.

Uses a thread pool to avoid blocking the event loop, since llama-cpp-python provides
only synchronous inference.

**Arguments**:

- `messages`: A list of ChatMessage instances representing the input messages.
- `generation_kwargs`: A dictionary containing keyword arguments to customize text generation.
For more information on the available kwargs, see
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_chat_completion).
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset for which the model can prepare calls.
Each tool should have a unique name. If set, it will override the `tools` parameter set during
component initialization.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
If set, it will override the `streaming_callback` parameter set during component initialization.

**Returns**:

A dictionary with the following keys:
- `replies`: The responses from the model.
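
A sketch of calling it from an event loop (same placeholder `generator`, initialized and warmed up as above):

```python
import asyncio

from haystack.dataclasses import ChatMessage

async def main() -> None:
    # The synchronous llama-cpp-python call runs in a thread pool,
    # so the event loop stays responsive while tokens are generated.
    result = await generator.run_async(
        [ChatMessage.from_user("Name one GGUF quantization type.")]
    )
    print(result["replies"][0].text)

asyncio.run(main())
```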

<a id="haystack_integrations.components.generators.llama_cpp.generator"></a>

## Module haystack\_integrations.components.generators.llama\_cpp.generator

<a id="haystack_integrations.components.generators.llama_cpp.generator.LlamaCppGenerator"></a>

### LlamaCppGenerator

Provides an interface to generate text using an LLM via llama.cpp.

[llama.cpp](https://github.com/ggml-org/llama.cpp) is a project written in C/C++ for efficient inference of LLMs.
It uses the quantized GGUF format, which makes these models suitable for running on standard machines (even without GPUs).

Usage example:
```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(model="zephyr-7b-beta.Q4_0.gguf", n_ctx=2048, n_batch=512)
generator.warm_up()

print(generator.run("Who is the best American actor?", generation_kwargs={"max_tokens": 128}))
# {'replies': ['John Cusack'], 'meta': [{"object": "text_completion", ...}]}
```

<a id="haystack_integrations.components.generators.llama_cpp.generator.LlamaCppGenerator.__init__"></a>

#### LlamaCppGenerator.\_\_init\_\_

```python
def __init__(model: str,
             n_ctx: int | None = 0,
             n_batch: int | None = 512,
             model_kwargs: dict[str, Any] | None = None,
             generation_kwargs: dict[str, Any] | None = None) -> None
```

**Arguments**:

- `model`: The path of a quantized model for text generation, for example, "zephyr-7b-beta.Q4_0.gguf".
If the model path is also specified in the `model_kwargs`, this parameter will be ignored.
- `n_ctx`: The number of tokens in the context. When set to 0, the context is taken from the model.
- `n_batch`: Maximum batch size for prompt processing.
- `model_kwargs`: Dictionary containing keyword arguments used to initialize the LLM for text generation.
These keyword arguments provide fine-grained control over model loading.
In case of duplication, these kwargs override the `model`, `n_ctx`, and `n_batch` init parameters.
For more information on the available kwargs, see
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.__init__).
- `generation_kwargs`: A dictionary containing keyword arguments to customize text generation.
For more information on the available kwargs, see
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion).
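
For instance, a minimal sketch of initialization with GPU offload and default sampling settings (the path is a placeholder; `n_gpu_layers` is a standard `llama_cpp.Llama` option):

```python
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(
    model="zephyr-7b-beta.Q4_0.gguf",
    n_ctx=2048,
    # Passed through to llama_cpp.Llama.__init__ (-1 offloads all layers to GPU).
    model_kwargs={"n_gpu_layers": -1},
    # Default sampling settings; run() can pass its own generation_kwargs.
    generation_kwargs={"max_tokens": 128, "temperature": 0.2},
)
generator.warm_up()  # loads the model; required before run()
```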

<a id="haystack_integrations.components.generators.llama_cpp.generator.LlamaCppGenerator.run"></a>

#### LlamaCppGenerator.run

```python
@component.output_types(replies=list[str], meta=list[dict[str, Any]])
def run(
    prompt: str,
    generation_kwargs: dict[str, Any] | None = None
) -> dict[str, list[str] | list[dict[str, Any]]]
```

Run the text generation model on the given prompt.

**Arguments**:

- `prompt`: The prompt to be sent to the generative model.
- `generation_kwargs`: A dictionary containing keyword arguments to customize text generation.
For more information on the available kwargs, see
[llama.cpp documentation](https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion).

**Returns**:

A dictionary with the following keys:
- `replies`: The list of replies generated by the model.
- `meta`: Metadata about the request.
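
As an illustrative sketch, the generator can also sit behind a `PromptBuilder` in a Haystack `Pipeline` (the template and query are placeholders):

```python
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.generators.llama_cpp import LlamaCppGenerator

generator = LlamaCppGenerator(model="zephyr-7b-beta.Q4_0.gguf", n_ctx=2048)

pipe = Pipeline()
# PromptBuilder renders a Jinja2 template into the prompt string run() expects.
pipe.add_component("prompt_builder", PromptBuilder(template="Answer briefly: {{ query }}"))
pipe.add_component("llm", generator)
pipe.connect("prompt_builder.prompt", "llm.prompt")

result = pipe.run({"prompt_builder": {"query": "What is the GGUF format?"}})
print(result["llm"]["replies"])
```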