---
title: "Embedders"
id: embedders-api
description: "Transforms queries into vectors to look for similar or relevant Documents."
slug: "/embedders-api"
---

## azure_document_embedder

### AzureOpenAIDocumentEmbedder

Bases: <code>OpenAIDocumentEmbedder</code>

Calculates document embeddings using OpenAI models deployed on Azure.

### Usage example

```python
from haystack import Document
from haystack.components.embedders import AzureOpenAIDocumentEmbedder

doc = Document(content="I love pizza!")

document_embedder = AzureOpenAIDocumentEmbedder()

result = document_embedder.run([doc])
print(result['documents'][0].embedding)

# [0.017020374536514282, -0.023255806416273117, ...]
```

#### __init__

```python
__init__(
    azure_endpoint: str | None = None,
    api_version: str | None = "2023-05-15",
    azure_deployment: str = "text-embedding-ada-002",
    dimensions: int | None = None,
    api_key: Secret | None = Secret.from_env_var(
        "AZURE_OPENAI_API_KEY", strict=False
    ),
    azure_ad_token: Secret | None = Secret.from_env_var(
        "AZURE_OPENAI_AD_TOKEN", strict=False
    ),
    organization: str | None = None,
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    timeout: float | None = None,
    max_retries: int | None = None,
    *,
    default_headers: dict[str, str] | None = None,
    azure_ad_token_provider: AzureADTokenProvider | None = None,
    http_client_kwargs: dict[str, Any] | None = None,
    raise_on_failure: bool = False
)
```

Creates an AzureOpenAIDocumentEmbedder component.

**Parameters:**

- **azure_endpoint** (<code>str | None</code>) – The endpoint of the model deployed on Azure.
- **api_version** (<code>str | None</code>) – The version of the API to use.
- **azure_deployment** (<code>str</code>) – The name of the model deployed on Azure. The default model is text-embedding-ada-002.
- **dimensions** (<code>int | None</code>) – The number of dimensions of the resulting embeddings. Only supported in text-embedding-3 and later models.
- **api_key** (<code>Secret | None</code>) – The Azure OpenAI API key. You can set it with an environment variable `AZURE_OPENAI_API_KEY`, or pass it with this parameter during initialization.
- **azure_ad_token** (<code>Secret | None</code>) – Microsoft Entra ID token, see Microsoft's [Entra ID](https://www.microsoft.com/en-us/security/business/identity-access/microsoft-entra-id) documentation for more information. You can set it with an environment variable `AZURE_OPENAI_AD_TOKEN`, or pass it with this parameter during initialization. Previously called Azure Active Directory.
- **organization** (<code>str | None</code>) – Your organization ID. See OpenAI's [Setting Up Your Organization](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization) for more information.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text.
- **suffix** (<code>str</code>) – A string to add at the end of each text.
- **batch_size** (<code>int</code>) – Number of documents to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when running.
- **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
- **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
- **timeout** (<code>float | None</code>) – The timeout for `AzureOpenAI` client calls, in seconds. If not set, defaults to either the `OPENAI_TIMEOUT` environment variable, or 30 seconds.
- **max_retries** (<code>int | None</code>) – Maximum number of retries to contact AzureOpenAI after an internal error. If not set, defaults to either the `OPENAI_MAX_RETRIES` environment variable, or 5 retries.
- **default_headers** (<code>dict\[str, str\] | None</code>) – Default headers to send to the AzureOpenAI client.
- **azure_ad_token_provider** (<code>AzureADTokenProvider | None</code>) – A function that returns an Azure Active Directory token. It is invoked on every request.
- **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`. For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
- **raise_on_failure** (<code>bool</code>) – Whether to raise an exception if the embedding request fails. If `False`, the component logs the error and continues processing the remaining documents. If `True`, it raises an exception on failure.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> AzureOpenAIDocumentEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>AzureOpenAIDocumentEmbedder</code> – Deserialized component.

## azure_text_embedder

### AzureOpenAITextEmbedder

Bases: <code>OpenAITextEmbedder</code>

Embeds strings using OpenAI models deployed on Azure.

### Usage example

```python
from haystack.components.embedders import AzureOpenAITextEmbedder

text_to_embed = "I love pizza!"

text_embedder = AzureOpenAITextEmbedder()

print(text_embedder.run(text_to_embed))

# {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
# 'meta': {'model': 'text-embedding-ada-002-v2',
#          'usage': {'prompt_tokens': 4, 'total_tokens': 4}}}
```

#### __init__

```python
__init__(
    azure_endpoint: str | None = None,
    api_version: str | None = "2023-05-15",
    azure_deployment: str = "text-embedding-ada-002",
    dimensions: int | None = None,
    api_key: Secret | None = Secret.from_env_var(
        "AZURE_OPENAI_API_KEY", strict=False
    ),
    azure_ad_token: Secret | None = Secret.from_env_var(
        "AZURE_OPENAI_AD_TOKEN", strict=False
    ),
    organization: str | None = None,
    timeout: float | None = None,
    max_retries: int | None = None,
    prefix: str = "",
    suffix: str = "",
    *,
    default_headers: dict[str, str] | None = None,
    azure_ad_token_provider: AzureADTokenProvider | None = None,
    http_client_kwargs: dict[str, Any] | None = None
)
```

Creates an AzureOpenAITextEmbedder component.

**Parameters:**

- **azure_endpoint** (<code>str | None</code>) – The endpoint of the model deployed on Azure.
- **api_version** (<code>str | None</code>) – The version of the API to use.
- **azure_deployment** (<code>str</code>) – The name of the model deployed on Azure. The default model is text-embedding-ada-002.
- **dimensions** (<code>int | None</code>) – The number of dimensions the resulting output embeddings should have. Only supported in text-embedding-3 and later models.
- **api_key** (<code>Secret | None</code>) – The Azure OpenAI API key. You can set it with an environment variable `AZURE_OPENAI_API_KEY`, or pass it with this parameter during initialization.
- **azure_ad_token** (<code>Secret | None</code>) – Microsoft Entra ID token, see Microsoft's [Entra ID](https://www.microsoft.com/en-us/security/business/identity-access/microsoft-entra-id) documentation for more information. You can set it with an environment variable `AZURE_OPENAI_AD_TOKEN`, or pass it with this parameter during initialization. Previously called Azure Active Directory.
- **organization** (<code>str | None</code>) – Your organization ID. See OpenAI's [Setting Up Your Organization](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization) for more information.
- **timeout** (<code>float | None</code>) – The timeout for `AzureOpenAI` client calls, in seconds. If not set, defaults to either the `OPENAI_TIMEOUT` environment variable, or 30 seconds.
- **max_retries** (<code>int | None</code>) – Maximum number of retries to contact AzureOpenAI after an internal error. If not set, defaults to either the `OPENAI_MAX_RETRIES` environment variable, or 5 retries.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text.
- **suffix** (<code>str</code>) – A string to add at the end of each text.
- **default_headers** (<code>dict\[str, str\] | None</code>) – Default headers to send to the AzureOpenAI client.
- **azure_ad_token_provider** (<code>AzureADTokenProvider | None</code>) – A function that returns an Azure Active Directory token. It is invoked on every request.
- **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`. For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> AzureOpenAITextEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>AzureOpenAITextEmbedder</code> – Deserialized component.

## hugging_face_api_document_embedder

### HuggingFaceAPIDocumentEmbedder

Embeds documents using Hugging Face APIs.

Use it with the following Hugging Face APIs:

- [Free Serverless Inference API](https://huggingface.co/inference-api)
- [Paid Inference Endpoints](https://huggingface.co/inference-endpoints)
- [Self-hosted Text Embeddings Inference](https://github.com/huggingface/text-embeddings-inference)

### Usage examples

#### With free serverless inference API

```python
from haystack.components.embedders import HuggingFaceAPIDocumentEmbedder
from haystack.utils import Secret
from haystack.dataclasses import Document

doc = Document(content="I love pizza!")

doc_embedder = HuggingFaceAPIDocumentEmbedder(
    api_type="serverless_inference_api",
    api_params={"model": "BAAI/bge-small-en-v1.5"},
    token=Secret.from_token("<your-api-key>"),
)

result = doc_embedder.run([doc])
print(result["documents"][0].embedding)

# [0.017020374536514282, -0.023255806416273117, ...]
```

#### With paid inference endpoints

```python
from haystack.components.embedders import HuggingFaceAPIDocumentEmbedder
from haystack.utils import Secret
from haystack.dataclasses import Document

doc = Document(content="I love pizza!")

doc_embedder = HuggingFaceAPIDocumentEmbedder(
    api_type="inference_endpoints",
    api_params={"url": "<your-inference-endpoint-url>"},
    token=Secret.from_token("<your-api-key>"),
)

result = doc_embedder.run([doc])
print(result["documents"][0].embedding)

# [0.017020374536514282, -0.023255806416273117, ...]
```

#### With self-hosted text embeddings inference

```python
from haystack.components.embedders import HuggingFaceAPIDocumentEmbedder
from haystack.dataclasses import Document

doc = Document(content="I love pizza!")

doc_embedder = HuggingFaceAPIDocumentEmbedder(
    api_type="text_embeddings_inference",
    api_params={"url": "http://localhost:8080"},
)

result = doc_embedder.run([doc])
print(result["documents"][0].embedding)

# [0.017020374536514282, -0.023255806416273117, ...]
```

#### __init__

```python
__init__(
    api_type: HFEmbeddingAPIType | str,
    api_params: dict[str, str],
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    truncate: bool | None = True,
    normalize: bool | None = False,
    batch_size: int = 32,
    progress_bar: bool = True,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
)
```

Creates a HuggingFaceAPIDocumentEmbedder component.

**Parameters:**

- **api_type** (<code>HFEmbeddingAPIType | str</code>) – The type of Hugging Face API to use.
- **api_params** (<code>dict\[str, str\]</code>) – A dictionary with the following keys:
  - `model`: Hugging Face model ID.
    Required when `api_type` is `SERVERLESS_INFERENCE_API`.
  - `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or `TEXT_EMBEDDINGS_INFERENCE`.
- **token** (<code>Secret | None</code>) – The Hugging Face token to use as HTTP bearer authorization. Check your HF token in your [account settings](https://huggingface.co/settings/tokens).
- **prefix** (<code>str</code>) – A string to add at the beginning of each text.
- **suffix** (<code>str</code>) – A string to add at the end of each text.
- **truncate** (<code>bool | None</code>) – Truncates the input text to the maximum length supported by the model. Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS` if the backend uses Text Embeddings Inference. If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.
- **normalize** (<code>bool | None</code>) – Normalizes the embeddings to unit length. Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS` if the backend uses Text Embeddings Inference. If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.
- **batch_size** (<code>int</code>) – Number of documents to process at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when running.
- **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
- **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.
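The `prefix`, `suffix`, `meta_fields_to_embed`, and `embedding_separator` parameters described above determine the exact text each document contributes to the embedding request. A rough, dependency-free sketch of that composition (an illustration of the documented behavior, not Haystack's actual implementation; `build_text_to_embed` is a hypothetical helper):

```python
# Hypothetical helper: metadata fields are concatenated in front of the
# document text, joined by the separator; prefix and suffix wrap the result.
# This mirrors the parameter docs above, not Haystack's internal code.

def build_text_to_embed(
    content: str,
    meta: dict[str, str],
    meta_fields_to_embed: list[str],
    embedding_separator: str = "\n",
    prefix: str = "",
    suffix: str = "",
) -> str:
    parts = [meta[f] for f in meta_fields_to_embed if f in meta] + [content]
    return prefix + embedding_separator.join(parts) + suffix

text = build_text_to_embed(
    content="I love pizza!",
    meta={"title": "Food review"},
    meta_fields_to_embed=["title"],
)
print(text)
# Food review
# I love pizza!
```

With an empty `meta_fields_to_embed`, only `prefix + content + suffix` is embedded.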

#### from_dict

```python
from_dict(data: dict[str, Any]) -> HuggingFaceAPIDocumentEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>HuggingFaceAPIDocumentEmbedder</code> – Deserialized component.

#### run

```python
run(documents: list[Document])
```

Embeds a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to embed.

**Returns:**

- – A dictionary with the following keys:
  - `documents`: A list of documents with embeddings.

#### run_async

```python
run_async(documents: list[Document])
```

Embeds a list of documents asynchronously.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to embed.

**Returns:**

- – A dictionary with the following keys:
  - `documents`: A list of documents with embeddings.

## hugging_face_api_text_embedder

### HuggingFaceAPITextEmbedder

Embeds strings using Hugging Face APIs.

Use it with the following Hugging Face APIs:

- [Free Serverless Inference API](https://huggingface.co/inference-api)
- [Paid Inference Endpoints](https://huggingface.co/inference-endpoints)
- [Self-hosted Text Embeddings Inference](https://github.com/huggingface/text-embeddings-inference)

### Usage examples

#### With free serverless inference API

```python
from haystack.components.embedders import HuggingFaceAPITextEmbedder
from haystack.utils import Secret

text_embedder = HuggingFaceAPITextEmbedder(
    api_type="serverless_inference_api",
    api_params={"model": "BAAI/bge-small-en-v1.5"},
    token=Secret.from_token("<your-api-key>"),
)

print(text_embedder.run("I love pizza!"))

# {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
```

#### With paid inference endpoints

```python
from haystack.components.embedders import HuggingFaceAPITextEmbedder
from haystack.utils import Secret

text_embedder = HuggingFaceAPITextEmbedder(
    api_type="inference_endpoints",
    api_params={"url": "<your-inference-endpoint-url>"},
    token=Secret.from_token("<your-api-key>"),
)

print(text_embedder.run("I love pizza!"))

# {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
```

#### With self-hosted text embeddings inference

```python
from haystack.components.embedders import HuggingFaceAPITextEmbedder

text_embedder = HuggingFaceAPITextEmbedder(
    api_type="text_embeddings_inference",
    api_params={"url": "http://localhost:8080"},
)

print(text_embedder.run("I love pizza!"))

# {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
```

#### __init__

```python
__init__(
    api_type: HFEmbeddingAPIType | str,
    api_params: dict[str, str],
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"],
        strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    truncate: bool | None = True,
    normalize: bool | None = False,
)
```

Creates a HuggingFaceAPITextEmbedder component.

**Parameters:**

- **api_type** (<code>HFEmbeddingAPIType | str</code>) – The type of Hugging Face API to use.
- **api_params** (<code>dict\[str, str\]</code>) – A dictionary with the following keys:
  - `model`: Hugging Face model ID. Required when `api_type` is `SERVERLESS_INFERENCE_API`.
  - `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or `TEXT_EMBEDDINGS_INFERENCE`.
- **token** (<code>Secret | None</code>) – The Hugging Face token to use as HTTP bearer authorization. Check your HF token in your [account settings](https://huggingface.co/settings/tokens).
- **prefix** (<code>str</code>) – A string to add at the beginning of each text.
- **suffix** (<code>str</code>) – A string to add at the end of each text.
- **truncate** (<code>bool | None</code>) – Truncates the input text to the maximum length supported by the model. Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS` if the backend uses Text Embeddings Inference. If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.
- **normalize** (<code>bool | None</code>) – Normalizes the embeddings to unit length. Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS` if the backend uses Text Embeddings Inference. If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.
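The `api_params` requirements above differ per `api_type`: `model` for the serverless API, `url` for inference endpoints and Text Embeddings Inference. A rough, dependency-free sketch of that validation rule (`check_api_params` is a hypothetical helper for illustration, not Haystack's actual code):

```python
# Hypothetical validation mirroring the api_params rules documented above.
# Uses the lowercase string aliases accepted by api_type.

def check_api_params(api_type: str, api_params: dict[str, str]) -> None:
    if api_type == "serverless_inference_api":
        if "model" not in api_params:
            raise ValueError("api_params must contain 'model' for the serverless API")
    elif api_type in ("inference_endpoints", "text_embeddings_inference"):
        if "url" not in api_params:
            raise ValueError(f"api_params must contain 'url' for {api_type}")
    else:
        raise ValueError(f"Unknown api_type: {api_type}")

check_api_params("serverless_inference_api", {"model": "BAAI/bge-small-en-v1.5"})  # ok
check_api_params("text_embeddings_inference", {"url": "http://localhost:8080"})    # ok
```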

#### from_dict

```python
from_dict(data: dict[str, Any]) -> HuggingFaceAPITextEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>HuggingFaceAPITextEmbedder</code> – Deserialized component.

#### run

```python
run(text: str)
```

Embeds a single string.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- – A dictionary with the following keys:
  - `embedding`: The embedding of the input text.

#### run_async

```python
run_async(text: str)
```

Embeds a single string asynchronously.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- – A dictionary with the following keys:
  - `embedding`: The embedding of the input text.

## image/sentence_transformers_doc_image_embedder

### SentenceTransformersDocumentImageEmbedder

A component for computing Document embeddings based on images using Sentence Transformers models.

The embedding of each Document is stored in the `embedding` field of the Document.

### Usage example

```python
from haystack import Document
from haystack.components.embedders.image import SentenceTransformersDocumentImageEmbedder

embedder = SentenceTransformersDocumentImageEmbedder(model="sentence-transformers/clip-ViT-B-32")

documents = [
    Document(content="A photo of a cat", meta={"file_path": "cat.jpg"}),
    Document(content="A photo of a dog", meta={"file_path": "dog.jpg"}),
]

result = embedder.run(documents=documents)
documents_with_embeddings = result["documents"]
print(documents_with_embeddings)

# [Document(id=...,
#  content='A photo of a cat',
#  meta={'file_path': 'cat.jpg',
#        'embedding_source': {'type': 'image', 'file_path_meta_field': 'file_path'}},
#  embedding=vector of size 512),
# ...]
```

#### __init__

```python
__init__(
    *,
    file_path_meta_field: str = "file_path",
    root_path: str | None = None,
    model: str = "sentence-transformers/clip-ViT-B-32",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    batch_size: int = 32,
    progress_bar: bool = True,
    normalize_embeddings: bool = False,
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    precision: Literal[
        "float32", "int8", "uint8", "binary", "ubinary"
    ] = "float32",
    encode_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch"
) -> None
```

Creates a SentenceTransformersDocumentImageEmbedder component.

**Parameters:**

- **file_path_meta_field** (<code>str</code>) – The metadata field in the Document that contains the file path to the image or PDF.
- **root_path** (<code>str | None</code>) – The root directory path where document files are located. If provided, file paths in document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
- **model** (<code>str</code>) – The Sentence Transformers model to use for calculating embeddings. Pass a local path or ID of the model on Hugging Face. To be used with this component, the model must be able to embed images and text into the same vector space. Compatible models include:
  - "sentence-transformers/clip-ViT-B-32"
  - "sentence-transformers/clip-ViT-L-14"
  - "sentence-transformers/clip-ViT-B-16"
  - "sentence-transformers/clip-ViT-B-32-multilingual-v1"
  - "jinaai/jina-embeddings-v4"
  - "jinaai/jina-clip-v1"
  - "jinaai/jina-clip-v2"
- **device** (<code>ComponentDevice | None</code>) – The device to use for loading the model. Overrides the default device.
- **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.
- **batch_size** (<code>int</code>) – Number of documents to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when embedding documents.
- **normalize_embeddings** (<code>bool</code>) – If `True`, the embeddings are normalized using L2 normalization, so that each embedding has a norm of 1.
- **trust_remote_code** (<code>bool</code>) – If `False`, allows only Hugging Face verified model architectures. If `True`, allows custom models and scripts.
- **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained` when loading the model. Refer to specific model documentation for available kwargs.
- **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer. Refer to specific model documentation for available kwargs.
- **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
- **precision** (<code>Literal['float32', 'int8', 'uint8', 'binary', 'ubinary']</code>) – The precision to use for the embeddings. All non-float32 precisions are quantized embeddings. Quantized embeddings are smaller and faster to compute, but may have a lower accuracy. They are useful for reducing the size of the embeddings of a corpus for semantic search, among other tasks.
- **encode_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `SentenceTransformer.encode` when embedding documents. This parameter is provided for fine customization. Be careful not to clash with already set parameters and avoid passing parameters that change the output type.
- **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino". Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html) for more information on acceleration and quantization options.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> SentenceTransformersDocumentImageEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>SentenceTransformersDocumentImageEmbedder</code> – Deserialized component.

#### warm_up

```python
warm_up() -> None
```

Initializes the component.

#### run

```python
run(documents: list[Document]) -> dict[str, list[Document]]
```

Embeds a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to embed.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: Documents with embeddings.

## openai_document_embedder

### OpenAIDocumentEmbedder

Computes document embeddings using OpenAI models.

### Usage example

```python
from haystack import Document
from haystack.components.embedders import OpenAIDocumentEmbedder

doc = Document(content="I love pizza!")

document_embedder = OpenAIDocumentEmbedder()

result = document_embedder.run([doc])
print(result['documents'][0].embedding)

# [0.017020374536514282, -0.023255806416273117, ...]
```

#### __init__

```python
__init__(
    api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
    model: str = "text-embedding-ada-002",
    dimensions: int | None = None,
    api_base_url: str | None = None,
    organization: str | None = None,
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    timeout: float | None = None,
    max_retries: int | None = None,
    http_client_kwargs: dict[str, Any] | None = None,
    *,
    raise_on_failure: bool = False
)
```

Creates an OpenAIDocumentEmbedder component.

Before initializing the component, you can set the `OPENAI_TIMEOUT` and `OPENAI_MAX_RETRIES`
environment variables to override the `timeout` and `max_retries` parameters respectively
in the OpenAI client.

**Parameters:**

- **api_key** (<code>Secret</code>) – The OpenAI API key. You can set it with an environment variable `OPENAI_API_KEY`, or pass it with this parameter during initialization.
- **model** (<code>str</code>) – The name of the model to use for calculating embeddings. The default model is `text-embedding-ada-002`.
- **dimensions** (<code>int | None</code>) – The number of dimensions of the resulting embeddings. Only `text-embedding-3` and later models support this parameter.
- **api_base_url** (<code>str | None</code>) – Overrides the default base URL for all HTTP requests.
- **organization** (<code>str | None</code>) – Your OpenAI organization ID. See OpenAI's [Setting Up Your Organization](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization) for more information.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text.
- **suffix** (<code>str</code>) – A string to add at the end of each text.
- **batch_size** (<code>int</code>) – Number of documents to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when running.
- **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
- **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
- **timeout** (<code>float | None</code>) – Timeout for OpenAI client calls. If not set, it defaults to either the `OPENAI_TIMEOUT` environment variable, or 30 seconds.
- **max_retries** (<code>int | None</code>) – Maximum number of retries to contact OpenAI after an internal error.
  If not set, it defaults to either the `OPENAI_MAX_RETRIES` environment variable, or 5 retries.
- **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`. For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
- **raise_on_failure** (<code>bool</code>) – Whether to raise an exception if the embedding request fails. If `False`, the component logs the error and continues processing the remaining documents. If `True`, it raises an exception on failure.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> OpenAIDocumentEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>OpenAIDocumentEmbedder</code> – Deserialized component.

#### run

```python
run(documents: list[Document])
```

Embeds a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of documents to embed.

**Returns:**

- – A dictionary with the following keys:
  - `documents`: A list of documents with embeddings.
  - `meta`: Information about the usage of the model.

#### run_async

```python
run_async(documents: list[Document])
```

Embeds a list of documents asynchronously.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of documents to embed.
**Returns:**

- – A dictionary with the following keys:
    - `documents`: A list of documents with embeddings.
    - `meta`: Information about the usage of the model.

## openai_text_embedder

### OpenAITextEmbedder

Embeds strings using OpenAI models.

You can use it to embed a user query and send it to an embedding retriever.

### Usage example

```python
from haystack.components.embedders import OpenAITextEmbedder

text_to_embed = "I love pizza!"

text_embedder = OpenAITextEmbedder()

print(text_embedder.run(text_to_embed))

# {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
# 'meta': {'model': 'text-embedding-ada-002-v2',
#          'usage': {'prompt_tokens': 4, 'total_tokens': 4}}}
```

#### __init__

```python
__init__(
    api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
    model: str = "text-embedding-ada-002",
    dimensions: int | None = None,
    api_base_url: str | None = None,
    organization: str | None = None,
    prefix: str = "",
    suffix: str = "",
    timeout: float | None = None,
    max_retries: int | None = None,
    http_client_kwargs: dict[str, Any] | None = None,
)
```

Creates an OpenAITextEmbedder component.

Before initializing the component, you can set the `OPENAI_TIMEOUT` and `OPENAI_MAX_RETRIES`
environment variables to override the `timeout` and `max_retries` parameters respectively
in the OpenAI client.

**Parameters:**

- **api_key** (<code>Secret</code>) – The OpenAI API key.
  You can set it with an environment variable `OPENAI_API_KEY`, or pass it with this parameter
  during initialization.
- **model** (<code>str</code>) – The name of the model to use for calculating embeddings.
  The default model is `text-embedding-ada-002`.
- **dimensions** (<code>int | None</code>) – The number of dimensions of the resulting embeddings. Only `text-embedding-3` and
  later models support this parameter.
- **api_base_url** (<code>str | None</code>) – Overrides the default base URL for all HTTP requests.
- **organization** (<code>str | None</code>) – Your organization ID. See OpenAI's
  [production best practices](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization)
  for more information.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text to embed.
- **suffix** (<code>str</code>) – A string to add at the end of each text to embed.
- **timeout** (<code>float | None</code>) – Timeout for OpenAI client calls. If not set, it defaults to either the
  `OPENAI_TIMEOUT` environment variable, or 30 seconds.
- **max_retries** (<code>int | None</code>) – Maximum number of retries to contact OpenAI after an internal error.
  If not set, it defaults to either the `OPENAI_MAX_RETRIES` environment variable, or 5 retries.
- **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client` or `httpx.AsyncClient`.
  For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> OpenAITextEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>OpenAITextEmbedder</code> – Deserialized component.

#### run

```python
run(text: str)
```

Embeds a single string.
**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- – A dictionary with the following keys:
    - `embedding`: The embedding of the input text.
    - `meta`: Information about the usage of the model.

#### run_async

```python
run_async(text: str)
```

Asynchronously embeds a single string.

This is the asynchronous version of the `run` method. It has the same parameters and return values
but can be used with `await` in async code.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- – A dictionary with the following keys:
    - `embedding`: The embedding of the input text.
    - `meta`: Information about the usage of the model.

## sentence_transformers_document_embedder

### SentenceTransformersDocumentEmbedder

Calculates document embeddings using Sentence Transformers models.

It stores the embeddings in the `embedding` metadata field of each document.
You can also embed documents' metadata.
Use this component in indexing pipelines to embed input documents
and send them to DocumentWriter to write into a Document Store.

### Usage example

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersDocumentEmbedder

doc = Document(content="I love pizza!")

doc_embedder = SentenceTransformersDocumentEmbedder()
doc_embedder.warm_up()

result = doc_embedder.run([doc])
print(result['documents'][0].embedding)

# [-0.07804739475250244, 0.1498992145061493, ...]
```

#### __init__

```python
__init__(
    model: str = "sentence-transformers/all-mpnet-base-v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    normalize_embeddings: bool = False,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    truncate_dim: int | None = None,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    precision: Literal[
        "float32", "int8", "uint8", "binary", "ubinary"
    ] = "float32",
    encode_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch",
    revision: str | None = None,
)
```

Creates a SentenceTransformersDocumentEmbedder component.

**Parameters:**

- **model** (<code>str</code>) – The model to use for calculating embeddings.
  Pass a local path or ID of the model on Hugging Face.
- **device** (<code>ComponentDevice | None</code>) – The device to use for loading the model.
  Overrides the default device.
- **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.
- **prefix** (<code>str</code>) – A string to add at the beginning of each document text.
  Can be used to prepend the text with an instruction, as required by some embedding models,
  such as E5 and bge.
- **suffix** (<code>str</code>) – A string to add at the end of each document text.
- **batch_size** (<code>int</code>) – Number of documents to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when embedding documents.
- **normalize_embeddings** (<code>bool</code>) – If `True`, the embeddings are normalized using L2 normalization, so that each embedding has a norm of 1.
- **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
- **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
- **trust_remote_code** (<code>bool</code>) – If `False`, allows only Hugging Face verified model architectures.
  If `True`, allows custom models and scripts.
- **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
- **truncate_dim** (<code>int | None</code>) – The dimension to truncate sentence embeddings to. `None` does no truncation.
  If the model wasn't trained with Matryoshka Representation Learning,
  truncating embeddings can significantly affect performance.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
  when loading the model. Refer to specific model documentation for available kwargs.
- **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
  Refer to specific model documentation for available kwargs.
- **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
- **precision** (<code>Literal['float32', 'int8', 'uint8', 'binary', 'ubinary']</code>) – The precision to use for the embeddings.
  All non-float32 precisions are quantized embeddings.
  Quantized embeddings are smaller and faster to compute, but may have a lower accuracy.
  They are useful for reducing the size of the embeddings of a corpus for semantic search, among other tasks.
- **encode_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `SentenceTransformer.encode` when embedding documents.
  This parameter is provided for fine customization. Be careful not to clash with already set parameters and
  avoid passing parameters that change the output type.
- **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
  Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
  for more information on acceleration and quantization options.
- **revision** (<code>str | None</code>) – The specific model version to use. It can be a branch name, a tag name, or a commit ID
  of a model stored on Hugging Face.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> SentenceTransformersDocumentEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>SentenceTransformersDocumentEmbedder</code> – Deserialized component.

#### warm_up

```python
warm_up()
```

Initializes the component.

#### run

```python
run(documents: list[Document])
```

Embeds a list of documents.
**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to embed.

**Returns:**

- – A dictionary with the following keys:
    - `documents`: Documents with embeddings.

## sentence_transformers_sparse_document_embedder

### SentenceTransformersSparseDocumentEmbedder

Calculates document sparse embeddings using sparse embedding models from Sentence Transformers.

It stores the sparse embeddings in the `sparse_embedding` metadata field of each document.
You can also embed documents' metadata.
Use this component in indexing pipelines to embed input documents
and send them to DocumentWriter to write into a Document Store.

### Usage example

```python
from haystack import Document
from haystack.components.embedders import SentenceTransformersSparseDocumentEmbedder

doc = Document(content="I love pizza!")

doc_embedder = SentenceTransformersSparseDocumentEmbedder()
doc_embedder.warm_up()

result = doc_embedder.run([doc])
print(result['documents'][0].sparse_embedding)

# SparseEmbedding(indices=[999, 1045, ...], values=[0.918, 0.867, ...])
```

#### __init__

```python
__init__(
    *,
    model: str = "prithivida/Splade_PP_en_v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    meta_fields_to_embed: list[str] | None = None,
    embedding_separator: str = "\n",
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch",
    revision: str | None = None
)
```

Creates a SentenceTransformersSparseDocumentEmbedder component.

**Parameters:**

- **model** (<code>str</code>) – The model to use for calculating sparse embeddings.
  Pass a local path or ID of the model on Hugging Face.
- **device** (<code>ComponentDevice | None</code>) – The device to use for loading the model.
  Overrides the default device.
- **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.
- **prefix** (<code>str</code>) – A string to add at the beginning of each document text.
- **suffix** (<code>str</code>) – A string to add at the end of each document text.
- **batch_size** (<code>int</code>) – Number of documents to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when embedding documents.
- **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
- **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
- **trust_remote_code** (<code>bool</code>) – If `False`, allows only Hugging Face verified model architectures.
  If `True`, allows custom models and scripts.
- **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
  when loading the model. Refer to specific model documentation for available kwargs.
- **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
  Refer to specific model documentation for available kwargs.
- **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
- **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
  Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
  for more information on acceleration and quantization options.
- **revision** (<code>str | None</code>) – The specific model version to use. It can be a branch name, a tag name, or a commit ID
  of a model stored on Hugging Face.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> SentenceTransformersSparseDocumentEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>SentenceTransformersSparseDocumentEmbedder</code> – Deserialized component.

#### warm_up

```python
warm_up()
```

Initializes the component.

#### run

```python
run(documents: list[Document])
```

Embeds a list of documents.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to embed.

**Returns:**

- – A dictionary with the following keys:
    - `documents`: Documents with sparse embeddings under the `sparse_embedding` field.
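The dense and sparse document embedders build the text to embed by joining the selected metadata fields with the document content, using `embedding_separator`. A minimal plain-Python sketch of that behavior, assuming metadata fields precede the content (`build_text_to_embed` is an illustrative helper, not part of the Haystack API):

```python
def build_text_to_embed(content: str, meta: dict, meta_fields_to_embed: list[str],
                        embedding_separator: str = "\n") -> str:
    # Collect the requested metadata fields that are present, then append the
    # document content, joining everything with the separator.
    fields = [str(meta[key]) for key in meta_fields_to_embed if meta.get(key) is not None]
    return embedding_separator.join(fields + [content])

text = build_text_to_embed(
    content="I love pizza!",
    meta={"title": "Food opinions", "author": "Jo"},
    meta_fields_to_embed=["title"],
)
print(text)
# Food opinions
# I love pizza!
```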
## sentence_transformers_sparse_text_embedder

### SentenceTransformersSparseTextEmbedder

Embeds strings using sparse embedding models from Sentence Transformers.

You can use it to embed a user query and send it to a sparse embedding retriever.

### Usage example

```python
from haystack.components.embedders import SentenceTransformersSparseTextEmbedder

text_to_embed = "I love pizza!"

text_embedder = SentenceTransformersSparseTextEmbedder()
text_embedder.warm_up()

print(text_embedder.run(text_to_embed))

# {'sparse_embedding': SparseEmbedding(indices=[999, 1045, ...], values=[0.918, 0.867, ...])}
```

#### __init__

```python
__init__(
    *,
    model: str = "prithivida/Splade_PP_en_v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch",
    revision: str | None = None
)
```

Creates a SentenceTransformersSparseTextEmbedder component.

**Parameters:**

- **model** (<code>str</code>) – The model to use for calculating sparse embeddings.
  Specify the path to a local model or the ID of the model on Hugging Face.
- **device** (<code>ComponentDevice | None</code>) – Overrides the default device used to load the model.
- **token** (<code>Secret | None</code>) – An API token to use private models from Hugging Face.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text to be embedded.
- **suffix** (<code>str</code>) – A string to add at the end of each text to embed.
- **trust_remote_code** (<code>bool</code>) – If `False`, permits only Hugging Face verified model architectures.
  If `True`, permits custom models and scripts.
- **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
  when loading the model. Refer to specific model documentation for available kwargs.
- **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
  Refer to specific model documentation for available kwargs.
- **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
- **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
  Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
  for more information on acceleration and quantization options.
- **revision** (<code>str | None</code>) – The specific model version to use. It can be a branch name, a tag name, or a commit ID
  of a model stored on Hugging Face.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.
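The class docstring above notes that the sparse embedding is typically sent to a sparse embedding retriever, which scores a query against a document via a dot product over the shared token indices of their `SparseEmbedding` objects (parallel `indices` and `values` lists). A hedged plain-Python sketch of that scoring step (`sparse_dot` is an illustrative helper, not Haystack's retriever code):

```python
def sparse_dot(q_indices, q_values, d_indices, d_values):
    # Dot product of two sparse vectors given as parallel (indices, values)
    # lists, matching the SparseEmbedding layout in the usage example above.
    doc = dict(zip(d_indices, d_values))
    return sum(v * doc.get(i, 0.0) for i, v in zip(q_indices, q_values))

# Only index 1045 is shared, so the score is 0.867 * 0.5.
score = sparse_dot([999, 1045], [0.918, 0.867], [1045, 2001], [0.5, 0.9])
print(score)  # 0.4335
```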
#### from_dict

```python
from_dict(data: dict[str, Any]) -> SentenceTransformersSparseTextEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>SentenceTransformersSparseTextEmbedder</code> – Deserialized component.

#### warm_up

```python
warm_up()
```

Initializes the component.

#### run

```python
run(text: str)
```

Embeds a single string.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- – A dictionary with the following keys:
    - `sparse_embedding`: The sparse embedding of the input text.

## sentence_transformers_text_embedder

### SentenceTransformersTextEmbedder

Embeds strings using Sentence Transformers models.

You can use it to embed a user query and send it to an embedding retriever.

### Usage example

```python
from haystack.components.embedders import SentenceTransformersTextEmbedder

text_to_embed = "I love pizza!"

text_embedder = SentenceTransformersTextEmbedder()
text_embedder.warm_up()

print(text_embedder.run(text_to_embed))

# {'embedding': [-0.07804739475250244, 0.1498992145061493, ...]}
```

#### __init__

```python
__init__(
    model: str = "sentence-transformers/all-mpnet-base-v2",
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    prefix: str = "",
    suffix: str = "",
    batch_size: int = 32,
    progress_bar: bool = True,
    normalize_embeddings: bool = False,
    trust_remote_code: bool = False,
    local_files_only: bool = False,
    truncate_dim: int | None = None,
    model_kwargs: dict[str, Any] | None = None,
    tokenizer_kwargs: dict[str, Any] | None = None,
    config_kwargs: dict[str, Any] | None = None,
    precision: Literal[
        "float32", "int8", "uint8", "binary", "ubinary"
    ] = "float32",
    encode_kwargs: dict[str, Any] | None = None,
    backend: Literal["torch", "onnx", "openvino"] = "torch",
    revision: str | None = None,
)
```

Creates a SentenceTransformersTextEmbedder component.

**Parameters:**

- **model** (<code>str</code>) – The model to use for calculating embeddings.
  Specify the path to a local model or the ID of the model on Hugging Face.
- **device** (<code>ComponentDevice | None</code>) – Overrides the default device used to load the model.
- **token** (<code>Secret | None</code>) – An API token to use private models from Hugging Face.
- **prefix** (<code>str</code>) – A string to add at the beginning of each text to be embedded.
  You can use it to prepend the text with an instruction, as required by some embedding models,
  such as E5 and bge.
- **suffix** (<code>str</code>) – A string to add at the end of each text to embed.
- **batch_size** (<code>int</code>) – Number of texts to embed at once.
- **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar for calculating embeddings.
  If `False`, disables the progress bar.
- **normalize_embeddings** (<code>bool</code>) – If `True`, the embeddings are normalized using L2 normalization, so that the embeddings have a norm of 1.
- **trust_remote_code** (<code>bool</code>) – If `False`, permits only Hugging Face verified model architectures.
  If `True`, permits custom models and scripts.
- **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
- **truncate_dim** (<code>int | None</code>) – The dimension to truncate sentence embeddings to. `None` does no truncation.
  If the model has not been trained with Matryoshka Representation Learning,
  truncation of embeddings can significantly affect performance.
- **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
  when loading the model. Refer to specific model documentation for available kwargs.
- **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
  Refer to specific model documentation for available kwargs.
- **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
- **precision** (<code>Literal['float32', 'int8', 'uint8', 'binary', 'ubinary']</code>) – The precision to use for the embeddings.
  All non-float32 precisions are quantized embeddings.
  Quantized embeddings are smaller in size and faster to compute, but may have a lower accuracy.
  They are useful for reducing the size of the embeddings of a corpus for semantic search, among other tasks.
- **encode_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `SentenceTransformer.encode` when embedding texts.
  This parameter is provided for fine customization. Be careful not to clash with already set parameters and
  avoid passing parameters that change the output type.
- **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
  Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
  for more information on acceleration and quantization options.
- **revision** (<code>str | None</code>) – The specific model version to use. It can be a branch name, a tag name, or a commit ID
  of a model stored on Hugging Face.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> SentenceTransformersTextEmbedder
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>SentenceTransformersTextEmbedder</code> – Deserialized component.

#### warm_up

```python
warm_up()
```

Initializes the component.

#### run

```python
run(text: str)
```

Embeds a single string.

**Parameters:**

- **text** (<code>str</code>) – Text to embed.

**Returns:**

- – A dictionary with the following keys:
    - `embedding`: The embedding of the input text.
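For reference, the `normalize_embeddings` parameter documented above applies L2 normalization: each embedding is scaled so its Euclidean norm is 1, which makes cosine similarity reduce to a plain dot product. A minimal sketch of the operation itself (`l2_normalize` is an illustrative helper, not part of the Haystack API):

```python
import math

def l2_normalize(vector: list[float]) -> list[float]:
    # Divide each component by the vector's Euclidean (L2) norm.
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

unit = l2_normalize([3.0, 4.0])  # norm is 5.0
print(unit)  # [0.6, 0.8]
```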