---
title: "Embedders"
id: embedders-api
description: "Transforms queries into vectors to look for similar or relevant Documents."
slug: "/embedders-api"
---

## azure_document_embedder

### AzureOpenAIDocumentEmbedder

Bases: <code>OpenAIDocumentEmbedder</code>

Calculates document embeddings using OpenAI models deployed on Azure.

### Usage example

<!-- test-ignore -->

```python
from haystack import Document
from haystack.components.embedders import AzureOpenAIDocumentEmbedder

doc = Document(content="I love pizza!")
document_embedder = AzureOpenAIDocumentEmbedder()

result = document_embedder.run([doc])
print(result['documents'][0].embedding)

# [0.017020374536514282, -0.023255806416273117, ...]
```

#### __init__

```python
__init__(
    azure_endpoint: str | None = None,
    api_version: str | None = "2023-05-15",
    azure_deployment: str = "text-embedding-ada-002",
    dimensions: int | None = None,
    api_key: Secret | None = Secret.from_env_var(
        "AZURE_OPENAI_API_KEY", strict=False
    ),
    azure_ad_token: Secret | None = Secret.from_env_var(
        "AZURE_OPENAI_AD_TOKEN", strict=False
    ),
    organization: str | None = None,
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    timeout: float | None = None,
    max_retries: int | None = None,
    *,
    default_headers: dict[str, str] | None = None,
    azure_ad_token_provider: AzureADTokenProvider | None = None,
    http_client_kwargs: dict[str, Any] | None = None,
    raise_on_failure: bool = False
) -> None
```

Creates an AzureOpenAIDocumentEmbedder component.

**Parameters:**

- **azure_endpoint** (<code>str | None</code>) – The endpoint of the model deployed on Azure.
- **api_version** (<code>str | None</code>) – The version of the API to use.
- **azure_deployment** (<code>str</code>) – The name of the model deployed on Azure. The default model is text-embedding-ada-002.
- **dimensions** (<code>int | None</code>) – The number of dimensions of the resulting embeddings. Only supported in text-embedding-3 and later models.
- **api_key** (<code>Secret | None</code>) – The Azure OpenAI API key. You can set it with the environment variable `AZURE_OPENAI_API_KEY`, or pass it with this parameter during initialization.
- **azure_ad_token** (<code>Secret | None</code>) – Microsoft Entra ID token. See Microsoft's [Entra ID](https://www.microsoft.com/en-us/security/business/identity-access/microsoft-entra-id) documentation for more information. You can set it with the environment variable `AZURE_OPENAI_AD_TOKEN`, or pass it with this parameter during initialization. Previously called Azure Active Directory.
- **organization** (<code>str | None</code>) – Your organization ID. See OpenAI's [Setting Up Your Organization](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization) for more information.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text.
- **suffix** (<code>str</code>) – A string to add at the end of each text.
- **batch_size** (<code>int</code>) – Number of documents to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when running.
- **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
- **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
- **timeout** (<code>float | None</code>) – The timeout for `AzureOpenAI` client calls, in seconds. If not set, defaults to either the `OPENAI_TIMEOUT` environment variable or 30 seconds.
- **max_retries** (<code>int | None</code>) – Maximum number of retries to contact AzureOpenAI after an internal error. If not set, defaults to either the `OPENAI_MAX_RETRIES` environment variable or 5 retries.
- **default_headers** (<code>dict\[str, str\] | None</code>) – Default headers to send to the AzureOpenAI client.
- **azure_ad_token_provider** (<code>AzureADTokenProvider | None</code>) – A function that returns an Azure Active Directory token, invoked on every request.
- **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`. For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
- **raise_on_failure** (<code>bool</code>) – Whether to raise an exception if the embedding request fails. If `False`, the component logs the error and continues processing the remaining documents. If `True`, it raises an exception on failure.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> AzureOpenAIDocumentEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>AzureOpenAIDocumentEmbedder</code> – Deserialized component.

## azure_text_embedder

### AzureOpenAITextEmbedder

Bases: <code>OpenAITextEmbedder</code>

Embeds strings using OpenAI models deployed on Azure.

### Usage example

<!-- test-ignore -->

```python
from haystack.components.embedders import AzureOpenAITextEmbedder

text_to_embed = "I love pizza!"
text_embedder = AzureOpenAITextEmbedder()

print(text_embedder.run(text_to_embed))

# {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
# 'meta': {'model': 'text-embedding-ada-002-v2',
#          'usage': {'prompt_tokens': 4, 'total_tokens': 4}}}
```

#### __init__

```python
__init__(
    azure_endpoint: str | None = None,
    api_version: str | None = "2023-05-15",
    azure_deployment: str = "text-embedding-ada-002",
    dimensions: int | None = None,
    api_key: Secret | None = Secret.from_env_var(
        "AZURE_OPENAI_API_KEY", strict=False
    ),
    azure_ad_token: Secret | None = Secret.from_env_var(
        "AZURE_OPENAI_AD_TOKEN", strict=False
    ),
    organization: str | None = None,
    timeout: float | None = None,
    max_retries: int | None = None,
    prefix: str = "",
    suffix: str = "",
    *,
    default_headers: dict[str, str] | None = None,
    azure_ad_token_provider: AzureADTokenProvider | None = None,
    http_client_kwargs: dict[str, Any] | None = None
) -> None
```

Creates an AzureOpenAITextEmbedder component.

**Parameters:**

- **azure_endpoint** (<code>str | None</code>) – The endpoint of the model deployed on Azure.
- **api_version** (<code>str | None</code>) – The version of the API to use.
- **azure_deployment** (<code>str</code>) – The name of the model deployed on Azure. The default model is text-embedding-ada-002.
- **dimensions** (<code>int | None</code>) – The number of dimensions the resulting output embeddings should have. Only supported in text-embedding-3 and later models.
- **api_key** (<code>Secret | None</code>) – The Azure OpenAI API key. You can set it with the environment variable `AZURE_OPENAI_API_KEY`, or pass it with this parameter during initialization.
- **azure_ad_token** (<code>Secret | None</code>) – Microsoft Entra ID token. See Microsoft's [Entra ID](https://www.microsoft.com/en-us/security/business/identity-access/microsoft-entra-id) documentation for more information. You can set it with the environment variable `AZURE_OPENAI_AD_TOKEN`, or pass it with this parameter during initialization. Previously called Azure Active Directory.
- **organization** (<code>str | None</code>) – Your organization ID. See OpenAI's [Setting Up Your Organization](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization) for more information.
- **timeout** (<code>float | None</code>) – The timeout for `AzureOpenAI` client calls, in seconds. If not set, defaults to either the `OPENAI_TIMEOUT` environment variable or 30 seconds.
- **max_retries** (<code>int | None</code>) – Maximum number of retries to contact AzureOpenAI after an internal error. If not set, defaults to either the `OPENAI_MAX_RETRIES` environment variable or 5 retries.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text.
- **suffix** (<code>str</code>) – A string to add at the end of each text.
- **default_headers** (<code>dict\[str, str\] | None</code>) – Default headers to send to the AzureOpenAI client.
- **azure_ad_token_provider** (<code>AzureADTokenProvider | None</code>) – A function that returns an Azure Active Directory token, invoked on every request.
- **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`. For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.
**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> AzureOpenAITextEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>AzureOpenAITextEmbedder</code> – Deserialized component.

## hugging_face_api_document_embedder

### HuggingFaceAPIDocumentEmbedder

Embeds documents using Hugging Face APIs.

Use it with the following Hugging Face APIs:

- [Free Serverless Inference API](https://huggingface.co/inference-api)
- [Paid Inference Endpoints](https://huggingface.co/inference-endpoints)
- [Self-hosted Text Embeddings Inference](https://github.com/huggingface/text-embeddings-inference)

### Usage examples

#### With free serverless inference API

<!-- test-ignore -->

```python
from haystack.components.embedders import HuggingFaceAPIDocumentEmbedder
from haystack.utils import Secret
from haystack.dataclasses import Document

doc = Document(content="I love pizza!")

doc_embedder = HuggingFaceAPIDocumentEmbedder(api_type="serverless_inference_api",
                                              api_params={"model": "BAAI/bge-small-en-v1.5"},
                                              token=Secret.from_token("<your-api-key>"))

result = doc_embedder.run([doc])
print(result["documents"][0].embedding)

# [0.017020374536514282, -0.023255806416273117, ...]
```

#### With paid inference endpoints

<!-- test-ignore -->

```python
from haystack.components.embedders import HuggingFaceAPIDocumentEmbedder
from haystack.utils import Secret
from haystack.dataclasses import Document

doc = Document(content="I love pizza!")

doc_embedder = HuggingFaceAPIDocumentEmbedder(api_type="inference_endpoints",
                                              api_params={"url": "<your-inference-endpoint-url>"},
                                              token=Secret.from_token("<your-api-key>"))

result = doc_embedder.run([doc])
print(result["documents"][0].embedding)

# [0.017020374536514282, -0.023255806416273117, ...]
```

#### With self-hosted text embeddings inference

<!-- test-ignore -->

```python
from haystack.components.embedders import HuggingFaceAPIDocumentEmbedder
from haystack.dataclasses import Document

doc = Document(content="I love pizza!")

doc_embedder = HuggingFaceAPIDocumentEmbedder(api_type="text_embeddings_inference",
                                              api_params={"url": "http://localhost:8080"})

result = doc_embedder.run([doc])
print(result["documents"][0].embedding)

# [0.017020374536514282, -0.023255806416273117, ...]
```

#### __init__

```python
__init__(
    api_type: HFEmbeddingAPIType | str,
    api_params: dict[str, str],
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    truncate: bool | None = True,
    normalize: bool | None = False,
    batch_size: int = 32,
    progress_bar: bool = True,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    concurrency_limit: int = 4,
) -> None
```

Creates a HuggingFaceAPIDocumentEmbedder component.

**Parameters:**

- **api_type** (<code>HFEmbeddingAPIType | str</code>) – The type of Hugging Face API to use.
- **api_params** (<code>dict\[str, str\]</code>) – A dictionary with the following keys:
    - `model`: Hugging Face model ID. Required when `api_type` is `SERVERLESS_INFERENCE_API`.
    - `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or `TEXT_EMBEDDINGS_INFERENCE`.
- **token** (<code>Secret | None</code>) – The Hugging Face token to use as HTTP bearer authorization. Check your HF token in your [account settings](https://huggingface.co/settings/tokens).
- **prefix** (<code>str</code>) – A string to add at the beginning of each text.
- **suffix** (<code>str</code>) – A string to add at the end of each text.
- **truncate** (<code>bool | None</code>) – Truncates the input text to the maximum length supported by the model. Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS` if the backend uses Text Embeddings Inference. If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.
- **normalize** (<code>bool | None</code>) – Normalizes the embeddings to unit length. Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS` if the backend uses Text Embeddings Inference. If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.
- **batch_size** (<code>int</code>) – Number of documents to process at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when running.
- **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
- **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
- **concurrency_limit** (<code>int</code>) – The maximum number of requests allowed to run concurrently. This parameter is only used in the `run_async` method.
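To make the `prefix`, `suffix`, `meta_fields_to_embed`, and `embedding_separator` parameters concrete, here is an illustrative sketch (plain Python, not Haystack's internal code) of how a document embedder typically assembles the string that is actually sent to the embedding model:

```python
# Sketch of the text-preparation step of a document embedder.
# The function name `text_to_embed` is our own; Haystack does this internally.

def text_to_embed(content, meta, prefix="", suffix="",
                  meta_fields_to_embed=None, embedding_separator="\n"):
    """Concatenate selected metadata fields with the document content."""
    fields = [
        str(meta[key])
        for key in (meta_fields_to_embed or [])
        if meta.get(key) is not None
    ]
    return prefix + embedding_separator.join(fields + [content]) + suffix

doc_meta = {"title": "Pizza", "author": "Anna"}
print(text_to_embed("I love pizza!", doc_meta,
                    meta_fields_to_embed=["title"]))
# Pizza
# I love pizza!
```

Only the fields listed in `meta_fields_to_embed` are included, joined to the content with `embedding_separator`; `prefix` and `suffix` wrap the final string.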
#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> HuggingFaceAPIDocumentEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>HuggingFaceAPIDocumentEmbedder</code> – Deserialized component.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Embeds a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to embed.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: A list of documents with embeddings.

#### run_async

```python
run_async(documents: list[Document]) -> dict[str, list[Document]]
```

Embeds a list of documents asynchronously.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to embed.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: A list of documents with embeddings.

## hugging_face_api_text_embedder

### HuggingFaceAPITextEmbedder

Embeds strings using Hugging Face APIs.
Use it with the following Hugging Face APIs:

- [Free Serverless Inference API](https://huggingface.co/inference-api)
- [Paid Inference Endpoints](https://huggingface.co/inference-endpoints)
- [Self-hosted Text Embeddings Inference](https://github.com/huggingface/text-embeddings-inference)

### Usage examples

#### With free serverless inference API

<!-- test-ignore -->

```python
from haystack.components.embedders import HuggingFaceAPITextEmbedder
from haystack.utils import Secret

text_embedder = HuggingFaceAPITextEmbedder(api_type="serverless_inference_api",
                                           api_params={"model": "BAAI/bge-small-en-v1.5"},
                                           token=Secret.from_token("<your-api-key>"))

print(text_embedder.run("I love pizza!"))

# {'embedding': [0.017020374536514282, -0.023255806416273117, ...], ...}
```

#### With paid inference endpoints

<!-- test-ignore -->

```python
from haystack.components.embedders import HuggingFaceAPITextEmbedder
from haystack.utils import Secret

text_embedder = HuggingFaceAPITextEmbedder(api_type="inference_endpoints",
                                           api_params={"url": "<your-inference-endpoint-url>"},
                                           token=Secret.from_token("<your-api-key>"))

print(text_embedder.run("I love pizza!"))

# {'embedding': [0.017020374536514282, -0.023255806416273117, ...], ...}
```

#### With self-hosted text embeddings inference

<!-- test-ignore -->

```python
from haystack.components.embedders import HuggingFaceAPITextEmbedder

text_embedder = HuggingFaceAPITextEmbedder(api_type="text_embeddings_inference",
                                           api_params={"url": "http://localhost:8080"})

print(text_embedder.run("I love pizza!"))

# {'embedding': [0.017020374536514282, -0.023255806416273117, ...], ...}
```

#### __init__

```python
__init__(
    api_type: HFEmbeddingAPIType | str,
    api_params: dict[str, str],
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    truncate: bool | None = True,
    normalize: bool | None = False,
) -> None
```

Creates a HuggingFaceAPITextEmbedder component.

**Parameters:**

- **api_type** (<code>HFEmbeddingAPIType | str</code>) – The type of Hugging Face API to use.
- **api_params** (<code>dict\[str, str\]</code>) – A dictionary with the following keys:
    - `model`: Hugging Face model ID. Required when `api_type` is `SERVERLESS_INFERENCE_API`.
    - `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or `TEXT_EMBEDDINGS_INFERENCE`.
- **token** (<code>Secret | None</code>) – The Hugging Face token to use as HTTP bearer authorization. Check your HF token in your [account settings](https://huggingface.co/settings/tokens).
- **prefix** (<code>str</code>) – A string to add at the beginning of each text.
- **suffix** (<code>str</code>) – A string to add at the end of each text.
- **truncate** (<code>bool | None</code>) – Truncates the input text to the maximum length supported by the model. Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS` if the backend uses Text Embeddings Inference. If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.
- **normalize** (<code>bool | None</code>) – Normalizes the embeddings to unit length. Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS` if the backend uses Text Embeddings Inference. If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.
#### from_dict

```python
from_dict(data: dict[str, Any]) -> HuggingFaceAPITextEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>HuggingFaceAPITextEmbedder</code> – Deserialized component.

#### run

```python
run(text: str) -> dict[str, Any]
```

Embeds a single string.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `embedding`: The embedding of the input text.

#### run_async

```python
run_async(text: str) -> dict[str, Any]
```

Embeds a single string asynchronously.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `embedding`: The embedding of the input text.

## image/sentence_transformers_doc_image_embedder

### SentenceTransformersDocumentImageEmbedder

A component for computing Document embeddings based on images, using Sentence Transformers models.

The embedding of each Document is stored in the `embedding` field of the Document.
### Usage example

<!-- test-ignore -->

```python
from haystack import Document
from haystack.components.embedders.image import SentenceTransformersDocumentImageEmbedder

embedder = SentenceTransformersDocumentImageEmbedder(model="sentence-transformers/clip-ViT-B-32")

documents = [
    Document(content="A photo of a cat", meta={"file_path": "cat.jpg"}),
    Document(content="A photo of a dog", meta={"file_path": "dog.jpg"}),
]

result = embedder.run(documents=documents)
documents_with_embeddings = result["documents"]
print(documents_with_embeddings)

# [Document(id=...,
#  content='A photo of a cat',
#  meta={'file_path': 'cat.jpg',
#        'embedding_source': {'type': 'image', 'file_path_meta_field': 'file_path'}},
#  embedding=vector of size 512),
# ...]
```

#### __init__

```python
__init__(
    *,
    file_path_meta_field: str = "file_path",
    root_path: str | None = None,
    model: str = "sentence-transformers/clip-ViT-B-32",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    batch_size: int = 32,
    progress_bar: bool = True,
    normalize_embeddings: bool = False,
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    precision: Literal[
        "float32", "int8", "uint8", "binary", "ubinary"
    ] = "float32",
    encode_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch"
) -> None
```

Creates a SentenceTransformersDocumentImageEmbedder component.

**Parameters:**

- **file_path_meta_field** (<code>str</code>) – The metadata field in the Document that contains the file path to the image or PDF.
- **root_path** (<code>str | None</code>) – The root directory path where document files are located. If provided, file paths in document metadata are resolved relative to this path. If `None`, file paths are treated as absolute paths.
- **model** (<code>str</code>) – The Sentence Transformers model to use for calculating embeddings. Pass a local path or the ID of the model on Hugging Face. To be used with this component, the model must be able to embed images and text into the same vector space. Compatible models include:
    - "sentence-transformers/clip-ViT-B-32"
    - "sentence-transformers/clip-ViT-L-14"
    - "sentence-transformers/clip-ViT-B-16"
    - "sentence-transformers/clip-ViT-B-32-multilingual-v1"
    - "jinaai/jina-embeddings-v4"
    - "jinaai/jina-clip-v1"
    - "jinaai/jina-clip-v2"
- **device** (<code>ComponentDevice | None</code>) – The device to use for loading the model. Overrides the default device.
- **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.
- **batch_size** (<code>int</code>) – Number of documents to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when embedding documents.
- **normalize_embeddings** (<code>bool</code>) – If `True`, the embeddings are normalized using L2 normalization, so that each embedding has a norm of 1.
- **trust_remote_code** (<code>bool</code>) – If `False`, allows only Hugging Face verified model architectures. If `True`, allows custom models and scripts.
- **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained` when loading the model. Refer to specific model documentation for available kwargs.
- **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer. Refer to specific model documentation for available kwargs.
- **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
- **precision** (<code>Literal['float32', 'int8', 'uint8', 'binary', 'ubinary']</code>) – The precision to use for the embeddings. All non-float32 precisions are quantized embeddings. Quantized embeddings are smaller and faster to compute, but may have a lower accuracy. They are useful for reducing the size of the embeddings of a corpus for semantic search, among other tasks.
- **encode_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `SentenceTransformer.encode` when embedding documents. This parameter is provided for fine customization. Be careful not to clash with already set parameters and avoid passing parameters that change the output type.
- **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino". Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html) for more information on acceleration and quantization options.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> SentenceTransformersDocumentImageEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.
**Returns:**

- <code>SentenceTransformersDocumentImageEmbedder</code> – Deserialized component.

#### warm_up

```python
warm_up() -> None
```

Initializes the component.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Embeds a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to embed.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: Documents with embeddings.

## openai_document_embedder

### OpenAIDocumentEmbedder

Computes document embeddings using OpenAI models.

### Usage example

<!-- test-ignore -->

```python
from haystack import Document
from haystack.components.embedders import OpenAIDocumentEmbedder

doc = Document(content="I love pizza!")
document_embedder = OpenAIDocumentEmbedder()
result = document_embedder.run([doc])

print(result['documents'][0].embedding)

# [0.017020374536514282, -0.023255806416273117, ...]
```

#### __init__

```python
__init__(
    api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
    model: str = "text-embedding-ada-002",
    dimensions: int | None = None,
    api_base_url: str | None = None,
    organization: str | None = None,
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    timeout: float | None = None,
    max_retries: int | None = None,
    http_client_kwargs: dict[str, Any] | None = None,
    *,
    raise_on_failure: bool = False
) -> None
```

Creates an OpenAIDocumentEmbedder component.
Before initializing the component, you can set the `OPENAI_TIMEOUT` and `OPENAI_MAX_RETRIES` environment variables to override the `timeout` and `max_retries` parameters, respectively, in the OpenAI client.

**Parameters:**

- **api_key** (<code>Secret</code>) – The OpenAI API key. You can set it with the environment variable `OPENAI_API_KEY`, or pass it with this parameter during initialization.
- **model** (<code>str</code>) – The name of the model to use for calculating embeddings. The default model is `text-embedding-ada-002`.
- **dimensions** (<code>int | None</code>) – The number of dimensions of the resulting embeddings. Only `text-embedding-3` and later models support this parameter.
- **api_base_url** (<code>str | None</code>) – Overrides the default base URL for all HTTP requests.
- **organization** (<code>str | None</code>) – Your OpenAI organization ID. See OpenAI's [Setting Up Your Organization](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization) for more information.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text.
- **suffix** (<code>str</code>) – A string to add at the end of each text.
- **batch_size** (<code>int</code>) – Number of documents to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when running.
- **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
- **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
- **timeout** (<code>float | None</code>) – Timeout for OpenAI client calls. If not set, it defaults to either the `OPENAI_TIMEOUT` environment variable or 30 seconds.
- **max_retries** (<code>int | None</code>) – Maximum number of retries to contact OpenAI after an internal error.
If not set, it defaults to either the `OPENAI_MAX_RETRIES` environment variable or 5 retries.
- **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`. For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
- **raise_on_failure** (<code>bool</code>) – Whether to raise an exception if the embedding request fails. If `False`, the component logs the error and continues processing the remaining documents. If `True`, it raises an exception on failure.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> OpenAIDocumentEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>OpenAIDocumentEmbedder</code> – Deserialized component.

#### run

```python
run(documents: list[Document]) -> dict[str, Any]
```

Embeds a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of documents to embed.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: A list of documents with embeddings.
    - `meta`: Information about the usage of the model.

#### run_async

```python
run_async(documents: list[Document]) -> dict[str, Any]
```

Embeds a list of documents asynchronously.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of documents to embed.
**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `documents`: A list of documents with embeddings.
    - `meta`: Information about the usage of the model.

## openai_text_embedder

### OpenAITextEmbedder

Embeds strings using OpenAI models.

You can use it to embed a user query and send it to an embedding retriever.

### Usage example

<!-- test-ignore -->

```python
from haystack.components.embedders import OpenAITextEmbedder

text_to_embed = "I love pizza!"
text_embedder = OpenAITextEmbedder()

print(text_embedder.run(text_to_embed))

# {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
#  'meta': {'model': 'text-embedding-ada-002-v2',
#           'usage': {'prompt_tokens': 4, 'total_tokens': 4}}}
```

#### __init__

```python
__init__(
    api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
    model: str = "text-embedding-ada-002",
    dimensions: int | None = None,
    api_base_url: str | None = None,
    organization: str | None = None,
    prefix: str = "",
    suffix: str = "",
    timeout: float | None = None,
    max_retries: int | None = None,
    http_client_kwargs: dict[str, Any] | None = None,
) -> None
```

Creates an OpenAITextEmbedder component.

Before initializing the component, you can set the `OPENAI_TIMEOUT` and `OPENAI_MAX_RETRIES`
environment variables to override the `timeout` and `max_retries` parameters respectively
in the OpenAI client.

**Parameters:**

- **api_key** (<code>Secret</code>) – The OpenAI API key.
You can set it with an environment variable `OPENAI_API_KEY`, or pass with this parameter
during initialization.
- **model** (<code>str</code>) – The name of the model to use for calculating embeddings.
The default model is `text-embedding-ada-002`.
- **dimensions** (<code>int | None</code>) – The number of dimensions of the resulting embeddings. Only `text-embedding-3` and
later models support this parameter.
- **api_base_url** (<code>str | None</code>) – Overrides the default base URL for all HTTP requests.
- **organization** (<code>str | None</code>) – Your organization ID. See OpenAI's
[production best practices](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization)
for more information.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text to embed.
- **suffix** (<code>str</code>) – A string to add at the end of each text to embed.
- **timeout** (<code>float | None</code>) – Timeout for OpenAI client calls. If not set, it defaults to either the
`OPENAI_TIMEOUT` environment variable, or 30 seconds.
- **max_retries** (<code>int | None</code>) – Maximum number of retries to contact OpenAI after an internal error.
If not set, it defaults to either the `OPENAI_MAX_RETRIES` environment variable, or 5 retries.
- **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`.
For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> OpenAITextEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>OpenAITextEmbedder</code> – Deserialized component.
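
As a rough illustration of the `prefix` and `suffix` parameters described above, the text-preparation step can be sketched in plain Python. The `prepare_text` helper is hypothetical, not part of Haystack; it only mimics how the embedder wraps the input before sending it to the API.

```python
# Hypothetical stand-in for the prefix/suffix handling of OpenAITextEmbedder.
# The real component sends the wrapped text to the OpenAI embeddings API.
def prepare_text(text: str, prefix: str = "", suffix: str = "") -> str:
    return prefix + text + suffix

# Instruction-style models (such as E5) expect a query prefix:
print(prepare_text("I love pizza!", prefix="query: "))
# query: I love pizza!
```

With an empty `prefix` and `suffix` (the defaults), the text is embedded unchanged.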
#### run

```python
run(text: str) -> dict[str, Any]
```

Embeds a single string.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `embedding`: The embedding of the input text.
    - `meta`: Information about the usage of the model.

#### run_async

```python
run_async(text: str) -> dict[str, Any]
```

Embeds a single string asynchronously.

This is the asynchronous version of the `run` method. It has the same parameters and return values
but can be used with `await` in async code.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `embedding`: The embedding of the input text.
    - `meta`: Information about the usage of the model.

## sentence_transformers_document_embedder

### SentenceTransformersDocumentEmbedder

Calculates document embeddings using Sentence Transformers models.

It stores the embeddings in the `embedding` metadata field of each document.
You can also embed documents' metadata.
Use this component in indexing pipelines to embed input documents
and send them to DocumentWriter to write into a Document Store.

### Usage example

<!-- test-ignore -->

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

doc = Document(content="I love pizza!")
doc_embedder = SentenceTransformersDocumentEmbedder()

result = doc_embedder.run([doc])
print(result['documents'][0].embedding)

# [-0.07804739475250244, 0.1498992145061493, ...]
```

#### __init__

```python
__init__(
    model: str = "sentence-transformers/all-mpnet-base-v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    normalize_embeddings: bool = False,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    truncate_dim: int | None = None,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    precision: Literal[
        "float32", "int8", "uint8", "binary", "ubinary"
    ] = "float32",
    encode_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch",
    revision: str | None = None,
) -> None
```

Creates a SentenceTransformersDocumentEmbedder component.

**Parameters:**

- **model** (<code>str</code>) – The model to use for calculating embeddings.
Pass a local path or ID of the model on Hugging Face.
- **device** (<code>ComponentDevice | None</code>) – The device to use for loading the model.
Overrides the default device.
- **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.
- **prefix** (<code>str</code>) – A string to add at the beginning of each document text.
Can be used to prepend the text with an instruction, as required by some embedding models,
such as E5 and bge.
- **suffix** (<code>str</code>) – A string to add at the end of each document text.
- **batch_size** (<code>int</code>) – Number of documents to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when embedding documents.
- **normalize_embeddings** (<code>bool</code>) – If `True`, the embeddings are normalized using L2 normalization, so that each embedding has a norm of 1.
- **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
- **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
- **trust_remote_code** (<code>bool</code>) – If `False`, allows only Hugging Face verified model architectures.
If `True`, allows custom models and scripts.
- **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
- **truncate_dim** (<code>int | None</code>) – The dimension to truncate sentence embeddings to. `None` does no truncation.
If the model wasn't trained with Matryoshka Representation Learning,
truncating embeddings can significantly affect performance.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
when loading the model. Refer to specific model documentation for available kwargs.
- **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
Refer to specific model documentation for available kwargs.
- **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
- **precision** (<code>Literal['float32', 'int8', 'uint8', 'binary', 'ubinary']</code>) – The precision to use for the embeddings.
All non-float32 precisions are quantized embeddings.
Quantized embeddings are smaller and faster to compute, but may have a lower accuracy.
They are useful for reducing the size of the embeddings of a corpus for semantic search, among other tasks.
- **encode_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `SentenceTransformer.encode` when embedding documents.
This parameter is provided for fine customization. Be careful not to clash with already set parameters and
avoid passing parameters that change the output type.
- **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
for more information on acceleration and quantization options.
- **revision** (<code>str | None</code>) – The specific model version to use. It can be a branch name, a tag name, or a commit id,
for a stored model on Hugging Face.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> SentenceTransformersDocumentEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>SentenceTransformersDocumentEmbedder</code> – Deserialized component.

#### warm_up

```python
warm_up() -> None
```

Initializes the component.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Embeds a list of documents.
**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to embed.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: Documents with embeddings.

## sentence_transformers_sparse_document_embedder

### SentenceTransformersSparseDocumentEmbedder

Calculates sparse document embeddings using sparse embedding models from Sentence Transformers.

It stores the sparse embeddings in the `sparse_embedding` metadata field of each document.
You can also embed documents' metadata.
Use this component in indexing pipelines to embed input documents
and send them to DocumentWriter to write into a Document Store.

### Usage example

<!-- test-ignore -->

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersSparseDocumentEmbedder

doc = Document(content="I love pizza!")
doc_embedder = SentenceTransformersSparseDocumentEmbedder()

result = doc_embedder.run([doc])
print(result['documents'][0].sparse_embedding)

# SparseEmbedding(indices=[999, 1045, ...], values=[0.918, 0.867, ...])
```

#### __init__

```python
__init__(
    *,
    model: str = "prithivida/Splade_PP_en_v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    backend:
        Literal["torch", "onnx", "openvino"] = "torch",
    revision: str | None = None
) -> None
```

Creates a SentenceTransformersSparseDocumentEmbedder component.

**Parameters:**

- **model** (<code>str</code>) – The model to use for calculating sparse embeddings.
Pass a local path or ID of the model on Hugging Face.
- **device** (<code>ComponentDevice | None</code>) – The device to use for loading the model.
Overrides the default device.
- **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.
- **prefix** (<code>str</code>) – A string to add at the beginning of each document text.
- **suffix** (<code>str</code>) – A string to add at the end of each document text.
- **batch_size** (<code>int</code>) – Number of documents to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when embedding documents.
- **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
- **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
- **trust_remote_code** (<code>bool</code>) – If `False`, allows only Hugging Face verified model architectures.
If `True`, allows custom models and scripts.
- **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
when loading the model. Refer to specific model documentation for available kwargs.
- **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
Refer to specific model documentation for available kwargs.
- **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
- **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
for more information on acceleration and quantization options.
- **revision** (<code>str | None</code>) – The specific model version to use. It can be a branch name, a tag name, or a commit id,
for a stored model on Hugging Face.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> SentenceTransformersSparseDocumentEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>SentenceTransformersSparseDocumentEmbedder</code> – Deserialized component.

#### warm_up

```python
warm_up() -> None
```

Initializes the component.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Embeds a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to embed.
**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: Documents with sparse embeddings under the `sparse_embedding` field.

## sentence_transformers_sparse_text_embedder

### SentenceTransformersSparseTextEmbedder

Embeds strings using sparse embedding models from Sentence Transformers.

You can use it to embed a user query and send it to a sparse embedding retriever.

### Usage example

<!-- test-ignore -->

```python
from haystack.components.embedders import SentenceTransformersSparseTextEmbedder

text_to_embed = "I love pizza!"

text_embedder = SentenceTransformersSparseTextEmbedder()

print(text_embedder.run(text_to_embed))

# {'sparse_embedding': SparseEmbedding(indices=[999, 1045, ...], values=[0.918, 0.867, ...])}
```

#### __init__

```python
__init__(
    *,
    model: str = "prithivida/Splade_PP_en_v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch",
    revision: str | None = None
) -> None
```

Creates a SentenceTransformersSparseTextEmbedder component.

**Parameters:**

- **model** (<code>str</code>) – The model to use for calculating sparse embeddings.
Specify the path to a local model or the ID of the model on Hugging Face.
- **device** (<code>ComponentDevice | None</code>) – Overrides the default device used to load the model.
- **token** (<code>Secret | None</code>) – An API token to use private models from Hugging Face.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text to embed.
- **suffix** (<code>str</code>) – A string to add at the end of each text to embed.
- **trust_remote_code** (<code>bool</code>) – If `False`, permits only Hugging Face verified model architectures.
If `True`, permits custom models and scripts.
- **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
when loading the model. Refer to specific model documentation for available kwargs.
- **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
Refer to specific model documentation for available kwargs.
- **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
- **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
for more information on acceleration and quantization options.
- **revision** (<code>str | None</code>) – The specific model version to use. It can be a branch name, a tag name, or a commit id,
for a stored model on Hugging Face.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.
**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> SentenceTransformersSparseTextEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>SentenceTransformersSparseTextEmbedder</code> – Deserialized component.

#### warm_up

```python
warm_up() -> None
```

Initializes the component.

#### run

```python
run(text: str) -> dict[str, Any]
```

Embeds a single string.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `sparse_embedding`: The sparse embedding of the input text.

## sentence_transformers_text_embedder

### SentenceTransformersTextEmbedder

Embeds strings using Sentence Transformers models.

You can use it to embed a user query and send it to an embedding retriever.

### Usage example

<!-- test-ignore -->

```python
from haystack.components.embedders import SentenceTransformersTextEmbedder

text_to_embed = "I love pizza!"

text_embedder = SentenceTransformersTextEmbedder()

print(text_embedder.run(text_to_embed))

# {'embedding': [-0.07804739475250244, 0.1498992145061493, ...]}
```

#### __init__

```python
__init__(
    model: str = "sentence-transformers/all-mpnet-base-v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    normalize_embeddings: bool = False,
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    truncate_dim: int | None = None,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    precision: Literal[
        "float32", "int8", "uint8", "binary", "ubinary"
    ] = "float32",
    encode_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch",
    revision: str | None = None,
) -> None
```

Creates a SentenceTransformersTextEmbedder component.

**Parameters:**

- **model** (<code>str</code>) – The model to use for calculating embeddings.
Specify the path to a local model or the ID of the model on Hugging Face.
- **device** (<code>ComponentDevice | None</code>) – Overrides the default device used to load the model.
- **token** (<code>Secret | None</code>) – An API token to use private models from Hugging Face.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text to embed.
You can use it to prepend the text with an instruction, as required by some embedding models,
such as E5 and bge.
- **suffix** (<code>str</code>) – A string to add at the end of each text to embed.
- **batch_size** (<code>int</code>) – Number of texts to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar for calculating embeddings.
If `False`, disables the progress bar.
- **normalize_embeddings** (<code>bool</code>) – If `True`, the embeddings are normalized using L2 normalization, so that each embedding has a norm of 1.
- **trust_remote_code** (<code>bool</code>) – If `False`, permits only Hugging Face verified model architectures.
If `True`, permits custom models and scripts.
- **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
- **truncate_dim** (<code>int | None</code>) – The dimension to truncate sentence embeddings to. `None` does no truncation.
If the model has not been trained with Matryoshka Representation Learning,
truncating embeddings can significantly affect performance.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
when loading the model. Refer to specific model documentation for available kwargs.
- **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
Refer to specific model documentation for available kwargs.
- **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
- **precision** (<code>Literal['float32', 'int8', 'uint8', 'binary', 'ubinary']</code>) – The precision to use for the embeddings.
All non-float32 precisions are quantized embeddings.
Quantized embeddings are smaller in size and faster to compute, but may have a lower accuracy.
They are useful for reducing the size of the embeddings of a corpus for semantic search, among other tasks.
- **encode_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `SentenceTransformer.encode` when embedding texts.
This parameter is provided for fine customization. Be careful not to clash with already set parameters and
avoid passing parameters that change the output type.
- **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
for more information on acceleration and quantization options.
- **revision** (<code>str | None</code>) – The specific model version to use. It can be a branch name, a tag name, or a commit id,
for a stored model on Hugging Face.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> SentenceTransformersTextEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>SentenceTransformersTextEmbedder</code> – Deserialized component.

#### warm_up

```python
warm_up() -> None
```

Initializes the component.

#### run

```python
run(text: str) -> dict[str, Any]
```

Embeds a single string.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.
**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
    - `embedding`: The embedding of the input text.
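
For the document embedders above, the interaction of `meta_fields_to_embed` with `embedding_separator` can be sketched in plain Python. The `text_to_embed` helper below is hypothetical, not Haystack's actual implementation; it only illustrates how the selected metadata fields are joined with the document content before the embedding is computed.

```python
# Hypothetical sketch (not Haystack code): combine selected metadata fields
# with the document content using the embedding separator.
def text_to_embed(content: str, meta: dict[str, str],
                  meta_fields_to_embed: list[str],
                  embedding_separator: str = "\n") -> str:
    # Keep only the requested metadata fields that are actually present.
    fields = [meta[f] for f in meta_fields_to_embed if meta.get(f)]
    return embedding_separator.join(fields + [content])

meta = {"title": "Pizza facts", "language": "en"}
print(text_to_embed("I love pizza!", meta, ["title"]))
# Pizza facts
# I love pizza!
```

Fields listed in `meta_fields_to_embed` but missing from a document's metadata are simply skipped, so the document text is always embedded.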