---
title: "Embedders"
id: embedders-api
description: "Transforms queries into vectors to look for similar or relevant Documents."
slug: "/embedders-api"
---

## azure_document_embedder

### AzureOpenAIDocumentEmbedder

Bases: <code>OpenAIDocumentEmbedder</code>

Calculates document embeddings using OpenAI models deployed on Azure.

### Usage example

```python
from haystack import Document
from haystack.components.embedders import AzureOpenAIDocumentEmbedder

doc = Document(content="I love pizza!")

document_embedder = AzureOpenAIDocumentEmbedder()

result = document_embedder.run([doc])
print(result['documents'][0].embedding)

# [0.017020374536514282, -0.023255806416273117, ...]
```

#### __init__

```python
__init__(
    azure_endpoint: str | None = None,
    api_version: str | None = "2023-05-15",
    azure_deployment: str = "text-embedding-ada-002",
    dimensions: int | None = None,
    api_key: Secret | None = Secret.from_env_var(
        "AZURE_OPENAI_API_KEY", strict=False
    ),
    azure_ad_token: Secret | None = Secret.from_env_var(
        "AZURE_OPENAI_AD_TOKEN", strict=False
    ),
    organization: str | None = None,
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    timeout: float | None = None,
    max_retries: int | None = None,
    *,
    default_headers: dict[str, str] | None = None,
    azure_ad_token_provider: AzureADTokenProvider | None = None,
    http_client_kwargs: dict[str, Any] | None = None,
    raise_on_failure: bool = False
) -> None
```

Creates an AzureOpenAIDocumentEmbedder component.

**Parameters:**

- **azure_endpoint** (<code>str | None</code>) – The endpoint of the model deployed on Azure.
- **api_version** (<code>str | None</code>) – The version of the API to use.
- **azure_deployment** (<code>str</code>) – The name of the model deployed on Azure. The default model is text-embedding-ada-002.
- **dimensions** (<code>int | None</code>) – The number of dimensions of the resulting embeddings. Only supported in text-embedding-3
  and later models.
- **api_key** (<code>Secret | None</code>) – The Azure OpenAI API key.
  You can set it with an environment variable `AZURE_OPENAI_API_KEY`, or pass with this
  parameter during initialization.
- **azure_ad_token** (<code>Secret | None</code>) – Microsoft Entra ID token, see Microsoft's
  [Entra ID](https://www.microsoft.com/en-us/security/business/identity-access/microsoft-entra-id)
  documentation for more information. You can set it with an environment variable
  `AZURE_OPENAI_AD_TOKEN`, or pass with this parameter during initialization.
  Previously called Azure Active Directory.
- **organization** (<code>str | None</code>) – Your organization ID. See OpenAI's
  [Setting Up Your Organization](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization)
  for more information.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text.
- **suffix** (<code>str</code>) – A string to add at the end of each text.
- **batch_size** (<code>int</code>) – Number of documents to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when running.
- **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
- **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
- **timeout** (<code>float | None</code>) – The timeout for `AzureOpenAI` client calls, in seconds.
  If not set, defaults to either the `OPENAI_TIMEOUT` environment variable, or 30 seconds.
- **max_retries** (<code>int | None</code>) – Maximum number of retries to contact AzureOpenAI after an internal error.
  If not set, defaults to either the `OPENAI_MAX_RETRIES` environment variable, or to 5 retries.
- **default_headers** (<code>dict\[str, str\] | None</code>) – Default headers to send to the AzureOpenAI client.
- **azure_ad_token_provider** (<code>AzureADTokenProvider | None</code>) – A function that returns an Azure Active Directory token; it will be invoked on
  every request.
- **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`.
  For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
- **raise_on_failure** (<code>bool</code>) – Whether to raise an exception if the embedding request fails. If `False`, the component logs the error
  and continues processing the remaining documents. If `True`, it raises an exception on failure.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> AzureOpenAIDocumentEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>AzureOpenAIDocumentEmbedder</code> – Deserialized component.

## azure_text_embedder

### AzureOpenAITextEmbedder

Bases: <code>OpenAITextEmbedder</code>

Embeds strings using OpenAI models deployed on Azure.

### Usage example

```python
from haystack.components.embedders import AzureOpenAITextEmbedder

text_to_embed = "I love pizza!"

text_embedder = AzureOpenAITextEmbedder()

print(text_embedder.run(text_to_embed))

# {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
# 'meta': {'model': 'text-embedding-ada-002-v2',
#          'usage': {'prompt_tokens': 4, 'total_tokens': 4}}}
```

#### __init__

```python
__init__(
    azure_endpoint: str | None = None,
    api_version: str | None = "2023-05-15",
    azure_deployment: str = "text-embedding-ada-002",
    dimensions: int | None = None,
    api_key: Secret | None = Secret.from_env_var(
        "AZURE_OPENAI_API_KEY", strict=False
    ),
    azure_ad_token: Secret | None = Secret.from_env_var(
        "AZURE_OPENAI_AD_TOKEN", strict=False
    ),
    organization: str | None = None,
    timeout: float | None = None,
    max_retries: int | None = None,
    prefix: str = "",
    suffix: str = "",
    *,
    default_headers: dict[str, str] | None = None,
    azure_ad_token_provider: AzureADTokenProvider | None = None,
    http_client_kwargs: dict[str, Any] | None = None
) -> None
```

Creates an AzureOpenAITextEmbedder component.

**Parameters:**

- **azure_endpoint** (<code>str | None</code>) – The endpoint of the model deployed on Azure.
- **api_version** (<code>str | None</code>) – The version of the API to use.
- **azure_deployment** (<code>str</code>) – The name of the model deployed on Azure. The default model is text-embedding-ada-002.
- **dimensions** (<code>int | None</code>) – The number of dimensions the resulting output embeddings should have. Only supported in text-embedding-3
  and later models.
- **api_key** (<code>Secret | None</code>) – The Azure OpenAI API key.
  You can set it with an environment variable `AZURE_OPENAI_API_KEY`, or pass with this
  parameter during initialization.
- **azure_ad_token** (<code>Secret | None</code>) – Microsoft Entra ID token, see Microsoft's
  [Entra ID](https://www.microsoft.com/en-us/security/business/identity-access/microsoft-entra-id)
  documentation for more information. You can set it with an environment variable
  `AZURE_OPENAI_AD_TOKEN`, or pass with this parameter during initialization.
  Previously called Azure Active Directory.
- **organization** (<code>str | None</code>) – Your organization ID. See OpenAI's
  [Setting Up Your Organization](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization)
  for more information.
- **timeout** (<code>float | None</code>) – The timeout for `AzureOpenAI` client calls, in seconds.
  If not set, defaults to either the `OPENAI_TIMEOUT` environment variable, or 30 seconds.
- **max_retries** (<code>int | None</code>) – Maximum number of retries to contact AzureOpenAI after an internal error.
  If not set, defaults to either the `OPENAI_MAX_RETRIES` environment variable, or to 5 retries.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text.
- **suffix** (<code>str</code>) – A string to add at the end of each text.
- **default_headers** (<code>dict\[str, str\] | None</code>) – Default headers to send to the AzureOpenAI client.
- **azure_ad_token_provider** (<code>AzureADTokenProvider | None</code>) – A function that returns an Azure Active Directory token; it will be invoked on
  every request.
- **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`.
  For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> AzureOpenAITextEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>AzureOpenAITextEmbedder</code> – Deserialized component.

## hugging_face_api_document_embedder

### HuggingFaceAPIDocumentEmbedder

Embeds documents using Hugging Face APIs.

Use it with the following Hugging Face APIs:

- [Free Serverless Inference API](https://huggingface.co/inference-api)
- [Paid Inference Endpoints](https://huggingface.co/inference-endpoints)
- [Self-hosted Text Embeddings Inference](https://github.com/huggingface/text-embeddings-inference)

### Usage examples

#### With free serverless inference API

```python
from haystack.components.embedders import HuggingFaceAPIDocumentEmbedder
from haystack.utils import Secret
from haystack.dataclasses import Document

doc = Document(content="I love pizza!")

doc_embedder = HuggingFaceAPIDocumentEmbedder(api_type="serverless_inference_api",
                                              api_params={"model": "BAAI/bge-small-en-v1.5"},
                                              token=Secret.from_token("<your-api-key>"))

result = doc_embedder.run([doc])
print(result["documents"][0].embedding)

# [0.017020374536514282, -0.023255806416273117, ...]
```

#### With paid inference endpoints

```python
from haystack.components.embedders import HuggingFaceAPIDocumentEmbedder
from haystack.utils import Secret
from haystack.dataclasses import Document

doc = Document(content="I love pizza!")

doc_embedder = HuggingFaceAPIDocumentEmbedder(api_type="inference_endpoints",
                                              api_params={"url": "<your-inference-endpoint-url>"},
                                              token=Secret.from_token("<your-api-key>"))

result = doc_embedder.run([doc])
print(result["documents"][0].embedding)

# [0.017020374536514282, -0.023255806416273117, ...]
```

#### With self-hosted text embeddings inference

```python
from haystack.components.embedders import HuggingFaceAPIDocumentEmbedder
from haystack.dataclasses import Document

doc = Document(content="I love pizza!")

doc_embedder = HuggingFaceAPIDocumentEmbedder(api_type="text_embeddings_inference",
                                              api_params={"url": "http://localhost:8080"})

result = doc_embedder.run([doc])
print(result["documents"][0].embedding)

# [0.017020374536514282, -0.023255806416273117, ...]
```

#### __init__

```python
__init__(
    api_type: HFEmbeddingAPIType | str,
    api_params: dict[str, str],
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    truncate: bool | None = True,
    normalize: bool | None = False,
    batch_size: int = 32,
    progress_bar: bool = True,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    concurrency_limit: int = 4,
) -> None
```

Creates a HuggingFaceAPIDocumentEmbedder component.

**Parameters:**

- **api_type** (<code>HFEmbeddingAPIType | str</code>) – The type of Hugging Face API to use.
- **api_params** (<code>dict\[str, str\]</code>) – A dictionary with the following keys:
  - `model`: Hugging Face model ID. Required when `api_type` is `SERVERLESS_INFERENCE_API`.
  - `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or
    `TEXT_EMBEDDINGS_INFERENCE`.
- **token** (<code>Secret | None</code>) – The Hugging Face token to use as HTTP bearer authorization.
  Check your HF token in your [account settings](https://huggingface.co/settings/tokens).
- **prefix** (<code>str</code>) – A string to add at the beginning of each text.
- **suffix** (<code>str</code>) – A string to add at the end of each text.
- **truncate** (<code>bool | None</code>) – Truncates the input text to the maximum length supported by the model.
  Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS`
  if the backend uses Text Embeddings Inference.
  If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.
- **normalize** (<code>bool | None</code>) – Normalizes the embeddings to unit length.
  Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS`
  if the backend uses Text Embeddings Inference.
  If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.
- **batch_size** (<code>int</code>) – Number of documents to process at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when running.
- **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
- **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
- **concurrency_limit** (<code>int</code>) – The maximum number of requests that should be allowed to run concurrently.
  This parameter is only used in the `run_async` method.
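As an illustration of how `meta_fields_to_embed`, `embedding_separator`, `prefix`, and `suffix` interact, the sketch below mirrors the documented behavior with a hypothetical helper (this is not the component's internal code): the selected metadata values are joined to the document content with the separator, and the result is wrapped with the prefix and suffix.

```python
# Hypothetical helper mirroring the documented behavior of meta_fields_to_embed
# and embedding_separator (illustrative only, not the component's internals).
def build_text_to_embed(content, meta, meta_fields_to_embed,
                        embedding_separator="\n", prefix="", suffix=""):
    # Keep only the requested metadata fields that are actually present.
    meta_values = [str(meta[field]) for field in meta_fields_to_embed
                   if meta.get(field) is not None]
    # Metadata values come first, joined to the content by the separator.
    return prefix + embedding_separator.join(meta_values + [content]) + suffix

print(build_text_to_embed("I love pizza!", {"title": "Food"}, ["title"]))
# Food
# I love pizza!
```

The resulting string, not the raw document content, is what gets sent to the embedding backend.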

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> HuggingFaceAPIDocumentEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>HuggingFaceAPIDocumentEmbedder</code> – Deserialized component.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Embeds a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to embed.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: A list of documents with embeddings.

#### run_async

```python
run_async(documents: list[Document]) -> dict[str, list[Document]]
```

Embeds a list of documents asynchronously.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to embed.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: A list of documents with embeddings.

## hugging_face_api_text_embedder

### HuggingFaceAPITextEmbedder

Embeds strings using Hugging Face APIs.

Use it with the following Hugging Face APIs:

- [Free Serverless Inference API](https://huggingface.co/inference-api)
- [Paid Inference Endpoints](https://huggingface.co/inference-endpoints)
- [Self-hosted Text Embeddings Inference](https://github.com/huggingface/text-embeddings-inference)

### Usage examples

#### With free serverless inference API

```python
from haystack.components.embedders import HuggingFaceAPITextEmbedder
from haystack.utils import Secret

text_embedder = HuggingFaceAPITextEmbedder(api_type="serverless_inference_api",
                                           api_params={"model": "BAAI/bge-small-en-v1.5"},
                                           token=Secret.from_token("<your-api-key>"))

print(text_embedder.run("I love pizza!"))

# {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
```

#### With paid inference endpoints

```python
from haystack.components.embedders import HuggingFaceAPITextEmbedder
from haystack.utils import Secret

text_embedder = HuggingFaceAPITextEmbedder(api_type="inference_endpoints",
                                           api_params={"url": "<your-inference-endpoint-url>"},
                                           token=Secret.from_token("<your-api-key>"))

print(text_embedder.run("I love pizza!"))

# {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
```

#### With self-hosted text embeddings inference

```python
from haystack.components.embedders import HuggingFaceAPITextEmbedder

text_embedder = HuggingFaceAPITextEmbedder(api_type="text_embeddings_inference",
                                           api_params={"url": "http://localhost:8080"})

print(text_embedder.run("I love pizza!"))

# {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
```

#### __init__

```python
__init__(
    api_type: HFEmbeddingAPIType | str,
    api_params: dict[str, str],
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    truncate: bool | None = True,
    normalize: bool | None = False,
) -> None
```

Creates a HuggingFaceAPITextEmbedder component.

**Parameters:**

- **api_type** (<code>HFEmbeddingAPIType | str</code>) – The type of Hugging Face API to use.
- **api_params** (<code>dict\[str, str\]</code>) – A dictionary with the following keys:
  - `model`: Hugging Face model ID. Required when `api_type` is `SERVERLESS_INFERENCE_API`.
  - `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or
    `TEXT_EMBEDDINGS_INFERENCE`.
- **token** (<code>Secret | None</code>) – The Hugging Face token to use as HTTP bearer authorization.
  Check your HF token in your [account settings](https://huggingface.co/settings/tokens).
- **prefix** (<code>str</code>) – A string to add at the beginning of each text.
- **suffix** (<code>str</code>) – A string to add at the end of each text.
- **truncate** (<code>bool | None</code>) – Truncates the input text to the maximum length supported by the model.
  Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS`
  if the backend uses Text Embeddings Inference.
  If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.
- **normalize** (<code>bool | None</code>) – Normalizes the embeddings to unit length.
  Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS`
  if the backend uses Text Embeddings Inference.
  If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.
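The `normalize` option described above asks the backend to rescale each embedding to unit length (L2 normalization). As a math-only sketch of that operation, independent of any Hugging Face API:

```python
import math

def l2_normalize(vec):
    # Divide every component by the vector's Euclidean (L2) norm,
    # so the resulting vector has length 1.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

print(l2_normalize([3.0, 4.0]))
# [0.6, 0.8]
```

Unit-length embeddings make cosine similarity reduce to a plain dot product, which is why many vector stores prefer them.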

#### from_dict

```python
from_dict(data: dict[str, Any]) -> HuggingFaceAPITextEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>HuggingFaceAPITextEmbedder</code> – Deserialized component.

#### run

```python
run(text: str) -> dict[str, Any]
```

Embeds a single string.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `embedding`: The embedding of the input text.

#### run_async

```python
run_async(text: str) -> dict[str, Any]
```

Embeds a single string asynchronously.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `embedding`: The embedding of the input text.

## image/sentence_transformers_doc_image_embedder

### SentenceTransformersDocumentImageEmbedder

A component for computing Document embeddings based on images using Sentence Transformers models.

The embedding of each Document is stored in the `embedding` field of the Document.

### Usage example

```python
from haystack import Document
from haystack.components.embedders.image import SentenceTransformersDocumentImageEmbedder

embedder = SentenceTransformersDocumentImageEmbedder(model="sentence-transformers/clip-ViT-B-32")

documents = [
    Document(content="A photo of a cat", meta={"file_path": "cat.jpg"}),
    Document(content="A photo of a dog", meta={"file_path": "dog.jpg"}),
]

result = embedder.run(documents=documents)
documents_with_embeddings = result["documents"]
print(documents_with_embeddings)

# [Document(id=...,
#  content='A photo of a cat',
#  meta={'file_path': 'cat.jpg',
#        'embedding_source': {'type': 'image', 'file_path_meta_field': 'file_path'}},
#  embedding=vector of size 512),
# ...]
```

#### __init__

```python
__init__(
    *,
    file_path_meta_field: str = "file_path",
    root_path: str | None = None,
    model: str = "sentence-transformers/clip-ViT-B-32",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    batch_size: int = 32,
    progress_bar: bool = True,
    normalize_embeddings: bool = False,
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    precision: Literal[
        "float32", "int8", "uint8", "binary", "ubinary"
    ] = "float32",
    encode_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch"
) -> None
```

Creates a SentenceTransformersDocumentImageEmbedder component.

**Parameters:**

- **file_path_meta_field** (<code>str</code>) – The metadata field in the Document that contains the file path to the image or PDF.
- **root_path** (<code>str | None</code>) – The root directory path where document files are located. If provided, file paths in
  document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
- **model** (<code>str</code>) – The Sentence Transformers model to use for calculating embeddings. Pass a local path or ID of the model on
  Hugging Face. To be used with this component, the model must be able to embed images and text into the same
  vector space. Compatible models include:
  - "sentence-transformers/clip-ViT-B-32"
  - "sentence-transformers/clip-ViT-L-14"
  - "sentence-transformers/clip-ViT-B-16"
  - "sentence-transformers/clip-ViT-B-32-multilingual-v1"
  - "jinaai/jina-embeddings-v4"
  - "jinaai/jina-clip-v1"
  - "jinaai/jina-clip-v2".
- **device** (<code>ComponentDevice | None</code>) – The device to use for loading the model.
  Overrides the default device.
- **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.
- **batch_size** (<code>int</code>) – Number of documents to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when embedding documents.
- **normalize_embeddings** (<code>bool</code>) – If `True`, the embeddings are normalized using L2 normalization, so that each embedding has a norm of 1.
- **trust_remote_code** (<code>bool</code>) – If `False`, allows only Hugging Face verified model architectures.
  If `True`, allows custom models and scripts.
- **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
  when loading the model. Refer to specific model documentation for available kwargs.
- **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
  Refer to specific model documentation for available kwargs.
- **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
- **precision** (<code>Literal['float32', 'int8', 'uint8', 'binary', 'ubinary']</code>) – The precision to use for the embeddings.
  All non-float32 precisions are quantized embeddings.
  Quantized embeddings are smaller and faster to compute, but may have a lower accuracy.
  They are useful for reducing the size of the embeddings of a corpus for semantic search, among other tasks.
- **encode_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `SentenceTransformer.encode` when embedding documents.
  This parameter is provided for fine customization. Be careful not to clash with already set parameters and
  avoid passing parameters that change the output type.
- **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
  Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
  for more information on acceleration and quantization options.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> SentenceTransformersDocumentImageEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>SentenceTransformersDocumentImageEmbedder</code> – Deserialized component.

#### warm_up

```python
warm_up() -> None
```

Initializes the component.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Embeds a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to embed.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: Documents with embeddings.

## openai_document_embedder

### OpenAIDocumentEmbedder

Computes document embeddings using OpenAI models.

### Usage example

```python
from haystack import Document
from haystack.components.embedders import OpenAIDocumentEmbedder

doc = Document(content="I love pizza!")

document_embedder = OpenAIDocumentEmbedder()

result = document_embedder.run([doc])
print(result['documents'][0].embedding)

# [0.017020374536514282, -0.023255806416273117, ...]
```

#### __init__

```python
__init__(
    api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
    model: str = "text-embedding-ada-002",
    dimensions: int | None = None,
    api_base_url: str | None = None,
    organization: str | None = None,
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    timeout: float | None = None,
    max_retries: int | None = None,
    http_client_kwargs: dict[str, Any] | None = None,
    *,
    raise_on_failure: bool = False
) -> None
```

Creates an OpenAIDocumentEmbedder component.

Before initializing the component, you can set the 'OPENAI_TIMEOUT' and 'OPENAI_MAX_RETRIES'
environment variables to override the `timeout` and `max_retries` parameters respectively
in the OpenAI client.

**Parameters:**

- **api_key** (<code>Secret</code>) – The OpenAI API key.
  You can set it with an environment variable `OPENAI_API_KEY`, or pass with this parameter
  during initialization.
- **model** (<code>str</code>) – The name of the model to use for calculating embeddings.
  The default model is `text-embedding-ada-002`.
- **dimensions** (<code>int | None</code>) – The number of dimensions of the resulting embeddings. Only `text-embedding-3` and
  later models support this parameter.
- **api_base_url** (<code>str | None</code>) – Overrides the default base URL for all HTTP requests.
- **organization** (<code>str | None</code>) – Your OpenAI organization ID. See OpenAI's
  [Setting Up Your Organization](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization)
  for more information.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text.
- **suffix** (<code>str</code>) – A string to add at the end of each text.
- **batch_size** (<code>int</code>) – Number of documents to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when running.
- **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
- **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
- **timeout** (<code>float | None</code>) – Timeout for OpenAI client calls. If not set, it defaults to either the
  `OPENAI_TIMEOUT` environment variable, or 30 seconds.
- **max_retries** (<code>int | None</code>) – Maximum number of retries to contact OpenAI after an internal error.
  If not set, it defaults to either the `OPENAI_MAX_RETRIES` environment variable, or 5 retries.
- **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`.
  For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
- **raise_on_failure** (<code>bool</code>) – Whether to raise an exception if the embedding request fails. If `False`, the component logs the error
  and continues processing the remaining documents. If `True`, it raises an exception on failure.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> OpenAIDocumentEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>OpenAIDocumentEmbedder</code> – Deserialized component.

#### run

```python
run(documents: list[Document]) -> dict[str, Any]
```

Embeds a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of documents to embed.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: A list of documents with embeddings.
  - `meta`: Information about the usage of the model.

#### run_async

```python
run_async(documents: list[Document]) -> dict[str, Any]
```

Embeds a list of documents asynchronously.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of documents to embed.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: A list of documents with embeddings.
    - `meta`: Information about the usage of the model.

## openai_text_embedder

### OpenAITextEmbedder

Embeds strings using OpenAI models.

You can use it to embed a user query and send it to an embedding retriever.

### Usage example

```python
from haystack.components.embedders import OpenAITextEmbedder

text_to_embed = "I love pizza!"

text_embedder = OpenAITextEmbedder()

print(text_embedder.run(text_to_embed))

# {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
# 'meta': {'model': 'text-embedding-ada-002-v2',
# 'usage': {'prompt_tokens': 4, 'total_tokens': 4}}}
```

#### __init__

```python
__init__(
    api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
    model: str = "text-embedding-ada-002",
    dimensions: int | None = None,
    api_base_url: str | None = None,
    organization: str | None = None,
    prefix: str = "",
    suffix: str = "",
    timeout: float | None = None,
    max_retries: int | None = None,
    http_client_kwargs: dict[str, Any] | None = None,
) -> None
```

Creates an OpenAITextEmbedder component.

Before initializing the component, you can set the `OPENAI_TIMEOUT` and `OPENAI_MAX_RETRIES`
environment variables to override the `timeout` and `max_retries` parameters respectively
in the OpenAI client.

**Parameters:**

- **api_key** (<code>Secret</code>) – The OpenAI API key.
You can set it with an environment variable `OPENAI_API_KEY`, or pass with this parameter
during initialization.
- **model** (<code>str</code>) – The name of the model to use for calculating embeddings.
The default model is `text-embedding-ada-002`.
- **dimensions** (<code>int | None</code>) – The number of dimensions of the resulting embeddings. Only `text-embedding-3` and
later models support this parameter.
- **api_base_url** (<code>str | None</code>) – Overrides the default base URL for all HTTP requests.
- **organization** (<code>str | None</code>) – Your organization ID. See OpenAI's
[production best practices](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization)
for more information.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text to embed.
- **suffix** (<code>str</code>) – A string to add at the end of each text to embed.
- **timeout** (<code>float | None</code>) – Timeout for OpenAI client calls. If not set, it defaults to either the
`OPENAI_TIMEOUT` environment variable, or 30 seconds.
- **max_retries** (<code>int | None</code>) – Maximum number of retries to contact OpenAI after an internal error.
If not set, it defaults to either the `OPENAI_MAX_RETRIES` environment variable, or 5 retries.
- **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`.
For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> OpenAITextEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>OpenAITextEmbedder</code> – Deserialized component.
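
The embedding that `run` returns is typically compared against precomputed document embeddings by an embedding retriever. As background, here is a minimal stdlib-only sketch of cosine-similarity ranking. The vectors and document names are invented for illustration; real pipelines delegate this step to a retriever and a Document Store.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Dot product of the two vectors divided by the product of their L2 norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical 3-dimensional embeddings; real OpenAI embeddings are much larger
# (e.g. 1536 dimensions for text-embedding-ada-002).
query_embedding = [0.1, 0.3, 0.5]
doc_embeddings = {
    "doc_a": [0.2, 0.6, 1.0],   # same direction as the query
    "doc_b": [0.9, -0.2, 0.1],  # mostly unrelated direction
}

ranked = sorted(
    doc_embeddings,
    key=lambda name: cosine_similarity(query_embedding, doc_embeddings[name]),
    reverse=True,
)
print(ranked)  # ['doc_a', 'doc_b']
```

Because `doc_a` points in the same direction as the query, its cosine similarity is 1.0 and it ranks first.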

#### run

```python
run(text: str) -> dict[str, Any]
```

Embeds a single string.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `embedding`: The embedding of the input text.
    - `meta`: Information about the usage of the model.

#### run_async

```python
run_async(text: str) -> dict[str, Any]
```

Embeds a single string asynchronously.

This is the asynchronous version of the `run` method. It has the same parameters and return values
but can be used with `await` in async code.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `embedding`: The embedding of the input text.
    - `meta`: Information about the usage of the model.

## sentence_transformers_document_embedder

### SentenceTransformersDocumentEmbedder

Calculates document embeddings using Sentence Transformers models.

It stores the embeddings in the `embedding` metadata field of each document.
You can also embed documents' metadata.
Use this component in indexing pipelines to embed input documents
and send them to DocumentWriter to write into a Document Store.

### Usage example

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

doc = Document(content="I love pizza!")
doc_embedder = SentenceTransformersDocumentEmbedder()

result = doc_embedder.run([doc])
print(result['documents'][0].embedding)

# [-0.07804739475250244, 0.1498992145061493, ...]
```

#### __init__

```python
__init__(
    model: str = "sentence-transformers/all-mpnet-base-v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    normalize_embeddings: bool = False,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    truncate_dim: int | None = None,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    precision: Literal[
        "float32", "int8", "uint8", "binary", "ubinary"
    ] = "float32",
    encode_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch",
    revision: str | None = None,
) -> None
```

Creates a SentenceTransformersDocumentEmbedder component.

**Parameters:**

- **model** (<code>str</code>) – The model to use for calculating embeddings.
Pass a local path or ID of the model on Hugging Face.
- **device** (<code>ComponentDevice | None</code>) – The device to use for loading the model.
Overrides the default device.
- **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.
- **prefix** (<code>str</code>) – A string to add at the beginning of each document text.
Can be used to prepend the text with an instruction, as required by some embedding models,
such as E5 and bge.
- **suffix** (<code>str</code>) – A string to add at the end of each document text.
- **batch_size** (<code>int</code>) – Number of documents to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when embedding documents.
- **normalize_embeddings** (<code>bool</code>) – If `True`, the embeddings are normalized using L2 normalization, so that each embedding has a norm of 1.
- **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
- **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
- **trust_remote_code** (<code>bool</code>) – If `False`, allows only Hugging Face verified model architectures.
If `True`, allows custom models and scripts.
- **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
- **truncate_dim** (<code>int | None</code>) – The dimension to truncate sentence embeddings to. `None` does no truncation.
If the model wasn't trained with Matryoshka Representation Learning,
truncating embeddings can significantly affect performance.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
when loading the model. Refer to specific model documentation for available kwargs.
- **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
Refer to specific model documentation for available kwargs.
- **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
- **precision** (<code>Literal['float32', 'int8', 'uint8', 'binary', 'ubinary']</code>) – The precision to use for the embeddings.
All non-float32 precisions are quantized embeddings.
Quantized embeddings are smaller and faster to compute, but may have a lower accuracy.
They are useful for reducing the size of the embeddings of a corpus for semantic search, among other tasks.
- **encode_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `SentenceTransformer.encode` when embedding documents.
This parameter is provided for fine customization. Be careful not to clash with already set parameters and
avoid passing parameters that change the output type.
- **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
for more information on acceleration and quantization options.
- **revision** (<code>str | None</code>) – The specific model version to use. It can be a branch name, a tag name, or a commit id,
for a stored model on Hugging Face.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> SentenceTransformersDocumentEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>SentenceTransformersDocumentEmbedder</code> – Deserialized component.

#### warm_up

```python
warm_up() -> None
```

Initializes the component.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Embeds a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to embed.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: Documents with embeddings.

## sentence_transformers_sparse_document_embedder

### SentenceTransformersSparseDocumentEmbedder

Calculates document sparse embeddings using sparse embedding models from Sentence Transformers.

It stores the sparse embeddings in the `sparse_embedding` metadata field of each document.
You can also embed documents' metadata.
Use this component in indexing pipelines to embed input documents
and send them to DocumentWriter to write into a Document Store.

### Usage example

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersSparseDocumentEmbedder

doc = Document(content="I love pizza!")
doc_embedder = SentenceTransformersSparseDocumentEmbedder()

result = doc_embedder.run([doc])
print(result['documents'][0].sparse_embedding)

# SparseEmbedding(indices=[999, 1045, ...], values=[0.918, 0.867, ...])
```

#### __init__

```python
__init__(
    *,
    model: str = "prithivida/Splade_PP_en_v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch",
    revision: str | None = None
) -> None
```

Creates a SentenceTransformersSparseDocumentEmbedder component.

**Parameters:**

- **model** (<code>str</code>) – The model to use for calculating sparse embeddings.
Pass a local path or ID of the model on Hugging Face.
- **device** (<code>ComponentDevice | None</code>) – The device to use for loading the model.
Overrides the default device.
- **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.
- **prefix** (<code>str</code>) – A string to add at the beginning of each document text.
- **suffix** (<code>str</code>) – A string to add at the end of each document text.
- **batch_size** (<code>int</code>) – Number of documents to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when embedding documents.
- **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
- **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
- **trust_remote_code** (<code>bool</code>) – If `False`, allows only Hugging Face verified model architectures.
If `True`, allows custom models and scripts.
- **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
when loading the model. Refer to specific model documentation for available kwargs.
- **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
Refer to specific model documentation for available kwargs.
- **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
- **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
for more information on acceleration and quantization options.
- **revision** (<code>str | None</code>) – The specific model version to use. It can be a branch name, a tag name, or a commit id,
for a stored model on Hugging Face.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> SentenceTransformersSparseDocumentEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>SentenceTransformersSparseDocumentEmbedder</code> – Deserialized component.

#### warm_up

```python
warm_up() -> None
```

Initializes the component.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Embeds a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to embed.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: Documents with sparse embeddings under the `sparse_embedding` field.
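
The `SparseEmbedding` index/value layout shown in the usage example above lends itself to dot-product scoring: only the vocabulary indices present in both vectors contribute to the score. A stdlib-only sketch follows; the indices and values are invented for illustration, and in practice a sparse embedding retriever performs this scoring.

```python
def sparse_dot(indices_a: list[int], values_a: list[float],
               indices_b: list[int], values_b: list[float]) -> float:
    # Dot product of two sparse vectors given as parallel index/value lists.
    # Only dimensions present in both vectors contribute.
    lookup = dict(zip(indices_b, values_b))
    return sum(v * lookup[i] for i, v in zip(indices_a, values_a) if i in lookup)

# Hypothetical query and document sparse embeddings.
query_indices, query_values = [999, 1045], [0.9, 0.8]
doc_indices, doc_values = [999, 2001], [0.5, 0.7]

score = sparse_dot(query_indices, query_values, doc_indices, doc_values)
print(score)  # 0.45 -- only the shared index 999 contributes (0.9 * 0.5)
```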

## sentence_transformers_sparse_text_embedder

### SentenceTransformersSparseTextEmbedder

Embeds strings using sparse embedding models from Sentence Transformers.

You can use it to embed a user query and send it to a sparse embedding retriever.

### Usage example

```python
from haystack.components.embedders import SentenceTransformersSparseTextEmbedder

text_to_embed = "I love pizza!"

text_embedder = SentenceTransformersSparseTextEmbedder()

print(text_embedder.run(text_to_embed))

# {'sparse_embedding': SparseEmbedding(indices=[999, 1045, ...], values=[0.918, 0.867, ...])}
```

#### __init__

```python
__init__(
    *,
    model: str = "prithivida/Splade_PP_en_v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch",
    revision: str | None = None
) -> None
```

Creates a SentenceTransformersSparseTextEmbedder component.

**Parameters:**

- **model** (<code>str</code>) – The model to use for calculating sparse embeddings.
Specify the path to a local model or the ID of the model on Hugging Face.
- **device** (<code>ComponentDevice | None</code>) – Overrides the default device used to load the model.
- **token** (<code>Secret | None</code>) – An API token to use private models from Hugging Face.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text to be embedded.
- **suffix** (<code>str</code>) – A string to add at the end of each text to embed.
- **trust_remote_code** (<code>bool</code>) – If `False`, permits only Hugging Face verified model architectures.
If `True`, permits custom models and scripts.
- **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
when loading the model. Refer to specific model documentation for available kwargs.
- **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
Refer to specific model documentation for available kwargs.
- **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
- **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
for more information on acceleration and quantization options.
- **revision** (<code>str | None</code>) – The specific model version to use. It can be a branch name, a tag name, or a commit id,
for a stored model on Hugging Face.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.
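
`to_dict` and its counterpart `from_dict` follow the usual Haystack serialization contract: the dictionary captures the component's init parameters, so the component can be reconstructed later, for example after a round trip through JSON or YAML pipeline files. A stdlib-only sketch of that contract with a toy stand-in class; this is not the real component and not its exact dictionary schema.

```python
import json

class ToyEmbedder:
    """Toy stand-in illustrating the to_dict / from_dict round-trip pattern."""

    def __init__(self, model: str = "toy-model", prefix: str = ""):
        self.model = model
        self.prefix = prefix

    def to_dict(self) -> dict:
        # Serialize the init parameters so the object can be reconstructed.
        return {"type": "ToyEmbedder",
                "init_parameters": {"model": self.model, "prefix": self.prefix}}

    @classmethod
    def from_dict(cls, data: dict) -> "ToyEmbedder":
        return cls(**data["init_parameters"])

embedder = ToyEmbedder(model="toy-model", prefix="query: ")
# Round trip through JSON, as a pipeline file would do.
restored = ToyEmbedder.from_dict(json.loads(json.dumps(embedder.to_dict())))
print(restored.prefix)  # query: 
```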

#### from_dict

```python
from_dict(data: dict[str, Any]) -> SentenceTransformersSparseTextEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>SentenceTransformersSparseTextEmbedder</code> – Deserialized component.

#### warm_up

```python
warm_up() -> None
```

Initializes the component.

#### run

```python
run(text: str) -> dict[str, Any]
```

Embeds a single string.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `sparse_embedding`: The sparse embedding of the input text.

## sentence_transformers_text_embedder

### SentenceTransformersTextEmbedder

Embeds strings using Sentence Transformers models.

You can use it to embed a user query and send it to an embedding retriever.

### Usage example

```python
from haystack.components.embedders import SentenceTransformersTextEmbedder

text_to_embed = "I love pizza!"

text_embedder = SentenceTransformersTextEmbedder()

print(text_embedder.run(text_to_embed))

# {'embedding': [-0.07804739475250244, 0.1498992145061493, ...]}
```

#### __init__

```python
__init__(
    model: str = "sentence-transformers/all-mpnet-base-v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    normalize_embeddings: bool = False,
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    truncate_dim: int | None = None,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    precision: Literal[
        "float32", "int8", "uint8", "binary", "ubinary"
    ] = "float32",
    encode_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch",
    revision: str | None = None,
) -> None
```

Creates a SentenceTransformersTextEmbedder component.

**Parameters:**

- **model** (<code>str</code>) – The model to use for calculating embeddings.
Specify the path to a local model or the ID of the model on Hugging Face.
- **device** (<code>ComponentDevice | None</code>) – Overrides the default device used to load the model.
- **token** (<code>Secret | None</code>) – An API token to use private models from Hugging Face.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text to be embedded.
You can use it to prepend the text with an instruction, as required by some embedding models,
such as E5 and bge.
- **suffix** (<code>str</code>) – A string to add at the end of each text to embed.
- **batch_size** (<code>int</code>) – Number of texts to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar for calculating embeddings.
If `False`, disables the progress bar.
- **normalize_embeddings** (<code>bool</code>) – If `True`, the embeddings are normalized using L2 normalization, so that the embeddings have a norm of 1.
- **trust_remote_code** (<code>bool</code>) – If `False`, permits only Hugging Face verified model architectures.
If `True`, permits custom models and scripts.
- **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
- **truncate_dim** (<code>int | None</code>) – The dimension to truncate sentence embeddings to. `None` does no truncation.
If the model has not been trained with Matryoshka Representation Learning,
truncation of embeddings can significantly affect performance.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
when loading the model. Refer to specific model documentation for available kwargs.
- **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
Refer to specific model documentation for available kwargs.
- **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
- **precision** (<code>Literal['float32', 'int8', 'uint8', 'binary', 'ubinary']</code>) – The precision to use for the embeddings.
All non-float32 precisions are quantized embeddings.
Quantized embeddings are smaller in size and faster to compute, but may have a lower accuracy.
They are useful for reducing the size of the embeddings of a corpus for semantic search, among other tasks.
- **encode_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `SentenceTransformer.encode` when embedding texts.
This parameter is provided for fine customization. Be careful not to clash with already set parameters and
avoid passing parameters that change the output type.
- **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
for more information on acceleration and quantization options.
- **revision** (<code>str | None</code>) – The specific model version to use. It can be a branch name, a tag name, or a commit id,
for a stored model on Hugging Face.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> SentenceTransformersTextEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>SentenceTransformersTextEmbedder</code> – Deserialized component.

#### warm_up

```python
warm_up() -> None
```

Initializes the component.

#### run

```python
run(text: str) -> dict[str, Any]
```

Embeds a single string.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `embedding`: The embedding of the input text.
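
As background on the `normalize_embeddings` option above: L2 normalization rescales a vector to unit length, so dot products between normalized embeddings equal cosine similarities. A stdlib-only sketch, not the library's implementation:

```python
import math

def l2_normalize(vector: list[float]) -> list[float]:
    # Divide every component by the vector's Euclidean (L2) norm,
    # giving a vector with norm 1.
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

unit = l2_normalize([3.0, 4.0])  # the norm of [3, 4] is 5
print(unit)  # [0.6, 0.8]
```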