---
title: "Llama Stack"
id: integrations-llama-stack
description: "Llama Stack integration for Haystack"
slug: "/integrations-llama-stack"
---

<a id="haystack_integrations.components.generators.llama_stack.chat.chat_generator"></a>

## Module haystack\_integrations.components.generators.llama\_stack.chat.chat\_generator

<a id="haystack_integrations.components.generators.llama_stack.chat.chat_generator.LlamaStackChatGenerator"></a>

### LlamaStackChatGenerator

Enables text generation using the Llama Stack framework.
Llama Stack Server supports multiple inference providers, including Ollama, Together,
vLLM, and other cloud providers.
For a complete list of inference providers, see the [Llama Stack docs](https://llama-stack.readthedocs.io/en/latest/providers/inference/index.html).

Users can pass any text generation parameters valid for the OpenAI chat completion API
directly to this component, using the `generation_kwargs` parameter in `__init__`
or the `generation_kwargs` parameter in the `run` method.

This component uses the `ChatMessage` format for structuring both input and output,
ensuring coherent and contextually relevant responses in chat-based text generation scenarios.
Details on the `ChatMessage` format can be found in the
[Haystack docs](https://docs.haystack.deepset.ai/docs/chatmessage).

Usage example:
You need to set up a Llama Stack Server with a model available before running this example. For a quick start on
how to set up the server with Ollama, see the [Llama Stack docs](https://llama-stack.readthedocs.io/en/latest/getting_started/index.html).

```python
from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator
from haystack.dataclasses import ChatMessage

messages = [ChatMessage.from_user("What's Natural Language Processing?")]

client = LlamaStackChatGenerator(model="ollama/llama3.2:3b")
response = client.run(messages)
print(response)

>>{'replies': [ChatMessage(_content=[TextContent(text='Natural Language Processing (NLP) is a branch of
>>artificial intelligence that focuses on enabling computers to understand, interpret, and generate human
>>language in a way that is meaningful and useful.')], _role=<ChatRole.ASSISTANT: 'assistant'>, _name=None,
>>_meta={'model': 'ollama/llama3.2:3b', 'index': 0, 'finish_reason': 'stop',
>>'usage': {'prompt_tokens': 15, 'completion_tokens': 36, 'total_tokens': 51}})]}
```
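As noted above, generation parameters can also be supplied per call through `run`. A minimal sketch, reusing `client` and `messages` from the example above (the parameter values are illustrative):

```python
# Override sampling parameters for this call only; they are forwarded to the
# OpenAI-compatible chat completion endpoint of the Llama Stack Server.
response = client.run(
    messages,
    generation_kwargs={"temperature": 0.2, "max_tokens": 128},
)
print(response["replies"][0].text)
```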
<a id="haystack_integrations.components.generators.llama_stack.chat.chat_generator.LlamaStackChatGenerator.__init__"></a>

#### LlamaStackChatGenerator.\_\_init\_\_

```python
def __init__(*,
             model: str,
             api_base_url: str = "http://localhost:8321/v1",
             organization: str | None = None,
             streaming_callback: StreamingCallbackT | None = None,
             generation_kwargs: dict[str, Any] | None = None,
             timeout: int | None = None,
             tools: ToolsType | None = None,
             tools_strict: bool = False,
             max_retries: int | None = None,
             http_client_kwargs: dict[str, Any] | None = None)
```

Creates an instance of LlamaStackChatGenerator.

To use this chat generator, you need to set up a Llama Stack Server with an inference provider and have a model available.

**Arguments**:

- `model`: The name of the model to use for chat completion.
This depends on the inference provider used for the Llama Stack Server.
- `streaming_callback`: A callback function that is called when a new token is received from the stream.
The callback function accepts a `StreamingChunk` as an argument.
- `api_base_url`: The Llama Stack API base URL. If not specified, localhost is used with the default port 8321.
- `organization`: Your organization ID. Defaults to `None`.
- `generation_kwargs`: Other parameters to use for the model. These parameters are all sent directly to
the Llama Stack endpoint. See the [Llama Stack API docs](https://llama-stack.readthedocs.io/) for more details.
Some of the supported parameters:
  - `max_tokens`: The maximum number of tokens the output text can have.
  - `temperature`: What sampling temperature to use. Higher values mean the model will take more risks.
    Try 0.9 for more creative applications and 0 (argmax sampling) for ones with a well-defined answer.
  - `top_p`: An alternative to sampling with temperature, called nucleus sampling, where the model
    considers the results of the tokens with `top_p` probability mass. So 0.1 means only the tokens
    comprising the top 10% probability mass are considered.
  - `stream`: Whether to stream back partial progress. If set, tokens will be sent as data-only server-sent
    events as they become available, with the stream terminated by a `data: [DONE]` message.
  - `safe_prompt`: Whether to inject a safety prompt before all conversations.
  - `random_seed`: The seed to use for random sampling.
  - `response_format`: A JSON schema or a Pydantic model that enforces the structure of the model's response.
    If provided, the output will always be validated against this format (unless the model returns a tool call).
    For details, see the [OpenAI Structured Outputs documentation](https://platform.openai.com/docs/guides/structured-outputs).
    Note: for structured outputs with streaming, the `response_format` must be a JSON schema and not a Pydantic model.
- `timeout`: Timeout for client calls to the OpenAI-compatible API. If not set, it defaults to either the
`OPENAI_TIMEOUT` environment variable or 30 seconds.
- `tools`: A list of Tool and/or Toolset objects, or a single Toolset, for which the model can prepare calls.
Each tool should have a unique name.
- `tools_strict`: Whether to enable strict schema adherence for tool calls. If set to `True`, the model will follow
exactly the schema provided in the `parameters` field of the tool definition, but this may increase latency.
- `max_retries`: Maximum number of retries to contact OpenAI after an internal error.
If not set, it defaults to either the `OPENAI_MAX_RETRIES` environment variable or 5.
- `http_client_kwargs`: A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`.
For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
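A minimal construction sketch that combines several of the parameters above (the model name, URL, and values are illustrative and assume a locally running Llama Stack Server):

```python
from haystack.components.generators.utils import print_streaming_chunk
from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator

client = LlamaStackChatGenerator(
    model="ollama/llama3.2:3b",               # model served by your inference provider
    api_base_url="http://localhost:8321/v1",  # default local Llama Stack endpoint
    generation_kwargs={"max_tokens": 512, "temperature": 0.2},
    streaming_callback=print_streaming_chunk,  # print tokens as they arrive
    timeout=60,
)
```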
<a id="haystack_integrations.components.generators.llama_stack.chat.chat_generator.LlamaStackChatGenerator.to_dict"></a>

#### LlamaStackChatGenerator.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serialize this component to a dictionary.

**Returns**:

The serialized component as a dictionary.

<a id="haystack_integrations.components.generators.llama_stack.chat.chat_generator.LlamaStackChatGenerator.from_dict"></a>

#### LlamaStackChatGenerator.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "LlamaStackChatGenerator"
```

Deserialize this component from a dictionary.

**Arguments**:

- `data`: The dictionary representation of this component.

**Returns**:

The deserialized component instance.
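A short sketch of the serialization round trip using `to_dict` and `from_dict` (the model name is illustrative):

```python
from haystack_integrations.components.generators.llama_stack import LlamaStackChatGenerator

client = LlamaStackChatGenerator(model="ollama/llama3.2:3b")

# Serialize the component configuration to a plain dictionary ...
data = client.to_dict()

# ... and rebuild an equivalent component from it later.
restored = LlamaStackChatGenerator.from_dict(data)
```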