  1  ---
  2  title: "Llama Stack"
  3  id: integrations-llama-stack
  4  description: "Llama Stack integration for Haystack"
  5  slug: "/integrations-llama-stack"
  6  ---
  7  
  8  <a id="haystack_integrations.components.generators.llama_stack.chat.chat_generator"></a>
  9  
 10  ## Module haystack\_integrations.components.generators.llama\_stack.chat.chat\_generator
 11  
 12  <a id="haystack_integrations.components.generators.llama_stack.chat.chat_generator.LlamaStackChatGenerator"></a>
 13  
 14  ### LlamaStackChatGenerator
 15  
Enables text generation using the Llama Stack framework.
The Llama Stack Server supports multiple inference providers, including Ollama, Together,
and vLLM, among other cloud and local providers.
For a complete list of inference providers, see the [Llama Stack docs](https://llama-stack.readthedocs.io/en/latest/providers/inference/index.html).
 20  
Users can pass any text generation parameters valid for the OpenAI chat completion API
directly to this component through the `generation_kwargs` parameter, either in `__init__`
or in the `run` method.
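
For example, defaults set in `__init__` can be overridden per call (a minimal sketch; the model name assumes the quick-start Ollama setup described below):

```python
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator

# Defaults applied to every call.
client = LlamaStackChatGenerator(
    model="ollama/llama3.2:3b",
    generation_kwargs={"max_tokens": 512, "temperature": 0.2},
)

# Per-call kwargs take precedence over the defaults set in __init__.
response = client.run(
    [ChatMessage.from_user("What's Natural Language Processing?")],
    generation_kwargs={"temperature": 0.9},
)
```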
 24  
 25  This component uses the `ChatMessage` format for structuring both input and output,
 26  ensuring coherent and contextually relevant responses in chat-based text generation scenarios.
 27  Details on the `ChatMessage` format can be found in the
[Haystack docs](https://docs.haystack.deepset.ai/docs/chatmessage).
 29  
Usage example:
You need to set up a Llama Stack Server and have a model available before running this example. For a quick start on
how to set up the server with Ollama, see the [Llama Stack docs](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html).
 33  
 34  ```python
 35  from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator
 36  from haystack.dataclasses import ChatMessage
 37  
 38  messages = [ChatMessage.from_user("What's Natural Language Processing?")]
 39  
 40  client = LlamaStackChatGenerator(model="ollama/llama3.2:3b")
 41  response = client.run(messages)
 42  print(response)
 43  
>>{'replies': [ChatMessage(_content=[TextContent(text='Natural Language Processing (NLP) is a branch of
>>artificial intelligence that focuses on enabling computers to understand, interpret, and generate human
>>language in a way that is meaningful and useful.')], _role=<ChatRole.ASSISTANT: 'assistant'>, _name=None,
>>_meta={'model': 'ollama/llama3.2:3b', 'index': 0, 'finish_reason': 'stop',
>>'usage': {'prompt_tokens': 15, 'completion_tokens': 36, 'total_tokens': 51}})]}
```

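To stream tokens as they are generated, pass a `streaming_callback` (a minimal sketch using Haystack's built-in `print_streaming_chunk` helper; the model name matches the example above):

```python
from haystack.components.generators.utils import print_streaming_chunk
from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator

client = LlamaStackChatGenerator(
    model="ollama/llama3.2:3b",
    streaming_callback=print_streaming_chunk,
)
# Tokens are printed to stdout as they arrive; the full reply is still returned.
response = client.run([ChatMessage.from_user("Summarize NLP in one sentence.")])
```
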
 51  <a id="haystack_integrations.components.generators.llama_stack.chat.chat_generator.LlamaStackChatGenerator.__init__"></a>
 52  
 53  #### LlamaStackChatGenerator.\_\_init\_\_
 54  
 55  ```python
 56  def __init__(*,
 57               model: str,
 58               api_base_url: str = "http://localhost:8321/v1",
 59               organization: str | None = None,
 60               streaming_callback: StreamingCallbackT | None = None,
 61               generation_kwargs: dict[str, Any] | None = None,
 62               timeout: int | None = None,
 63               tools: ToolsType | None = None,
 64               tools_strict: bool = False,
 65               max_retries: int | None = None,
 66               http_client_kwargs: dict[str, Any] | None = None)
 67  ```
 68  
Creates an instance of LlamaStackChatGenerator. To use this chat generator,
you need to set up a Llama Stack Server with an inference provider and have a model available.
 72  
 73  **Arguments**:
 74  
 75  - `model`: The name of the model to use for chat completion.
 76  This depends on the inference provider used for the Llama Stack Server.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
The callback function accepts a `StreamingChunk` as an argument.
- `api_base_url`: The Llama Stack API base URL. If not specified, localhost is used with the default port 8321.
 80  - `organization`: Your organization ID, defaults to `None`.
 81  - `generation_kwargs`: Other parameters to use for the model. These parameters are all sent directly to
 82  the Llama Stack endpoint. See [Llama Stack API docs](https://llama-stack.readthedocs.io/) for more details.
 83  Some of the supported parameters:
 84  - `max_tokens`: The maximum number of tokens the output text can have.
 85  - `temperature`: What sampling temperature to use. Higher values mean the model will take more risks.
 86      Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
 87  - `top_p`: An alternative to sampling with temperature, called nucleus sampling, where the model
 88      considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens
 89      comprising the top 10% probability mass are considered.
- `stream`: Whether to stream back partial progress. If set, tokens will be sent as data-only server-sent
    events as they become available, with the stream terminated by a `data: [DONE]` message.
 92  - `safe_prompt`: Whether to inject a safety prompt before all conversations.
 93  - `random_seed`: The seed to use for random sampling.
- `response_format`: A JSON schema or a Pydantic model that enforces the structure of the model's response
    (see the structured-output sketch after this argument list). If provided, the output will always be
    validated against this format (unless the model returns a tool call).
 97      For details, see the [OpenAI Structured Outputs documentation](https://platform.openai.com/docs/guides/structured-outputs).
 98      Notes:
 99      - For structured outputs with streaming,
100        the `response_format` must be a JSON schema and not a Pydantic model.
- `timeout`: Timeout for client calls to the Llama Stack server. If not set, it defaults to the
`OPENAI_TIMEOUT` environment variable, or 30 seconds.
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset, for which the model can prepare calls
(see the tool-use sketch after this list). Each tool should have a unique name.
105  - `tools_strict`: Whether to enable strict schema adherence for tool calls. If set to `True`, the model will follow exactly
106  the schema provided in the `parameters` field of the tool definition, but this may increase latency.
- `max_retries`: Maximum number of retries to contact the server after an internal error.
If not set, it defaults to the `OPENAI_MAX_RETRIES` environment variable, or 5.
- `http_client_kwargs`: A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`.
For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
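
As an illustration of the `response_format` option under `generation_kwargs`, here is a minimal structured-output sketch (the `CityInfo` schema is hypothetical, and the model name assumes the quick-start Ollama setup):

```python
from pydantic import BaseModel

from haystack.dataclasses import ChatMessage
from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator


class CityInfo(BaseModel):
    # Hypothetical schema used only to illustrate structured outputs.
    city: str
    country: str


client = LlamaStackChatGenerator(
    model="ollama/llama3.2:3b",
    generation_kwargs={"response_format": CityInfo},
)
response = client.run([ChatMessage.from_user("Where is the Eiffel Tower?")])
print(response["replies"][0].text)  # JSON text matching the CityInfo schema
```

And a tool-use sketch with Haystack's `Tool` class (the `get_weather` function is hypothetical; whether the model actually emits a tool call depends on the served model):

```python
from haystack.dataclasses import ChatMessage
from haystack.tools import Tool
from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator


def get_weather(city: str) -> str:
    # Hypothetical helper used only for illustration.
    return f"Sunny in {city}"


weather_tool = Tool(
    name="get_weather",
    description="Get the current weather for a city.",
    parameters={
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
    function=get_weather,
)

client = LlamaStackChatGenerator(model="ollama/llama3.2:3b", tools=[weather_tool])
response = client.run([ChatMessage.from_user("What's the weather in Paris?")])
print(response["replies"][0].tool_calls)
```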
111  
112  <a id="haystack_integrations.components.generators.llama_stack.chat.chat_generator.LlamaStackChatGenerator.to_dict"></a>
113  
114  #### LlamaStackChatGenerator.to\_dict
115  
116  ```python
117  def to_dict() -> dict[str, Any]
118  ```
119  
120  Serialize this component to a dictionary.
121  
122  **Returns**:
123  
124  The serialized component as a dictionary.
125  
126  <a id="haystack_integrations.components.generators.llama_stack.chat.chat_generator.LlamaStackChatGenerator.from_dict"></a>
127  
128  #### LlamaStackChatGenerator.from\_dict
129  
130  ```python
131  @classmethod
132  def from_dict(cls, data: dict[str, Any]) -> "LlamaStackChatGenerator"
133  ```
134  
135  Deserialize this component from a dictionary.
136  
137  **Arguments**:
138  
139  - `data`: The dictionary representation of this component.
140  
141  **Returns**:
142  
143  The deserialized component instance.
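
A minimal round-trip sketch (this is the same mechanism Haystack pipelines use when serializing components; the model name is illustrative):

```python
from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator

client = LlamaStackChatGenerator(model="ollama/llama3.2:3b")

# Serialize the component's configuration to a plain dict...
data = client.to_dict()

# ...and rebuild an equivalent component from it.
restored = LlamaStackChatGenerator.from_dict(data)
```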
144