embedders_api.md
   1  ---
   2  title: "Embedders"
   3  id: embedders-api
   4  description: "Transforms queries into vectors to look for similar or relevant Documents."
   5  slug: "/embedders-api"
   6  ---
   7  
   8  
   9  ## azure_document_embedder
  10  
  11  ### AzureOpenAIDocumentEmbedder
  12  
  13  Bases: <code>OpenAIDocumentEmbedder</code>
  14  
  15  Calculates document embeddings using OpenAI models deployed on Azure.
  16  
  17  ### Usage example
  18  
  19  ```python
  20  from haystack import Document
  21  from haystack.components.embedders import AzureOpenAIDocumentEmbedder
  22  
  23  doc = Document(content="I love pizza!")
  24  
  25  document_embedder = AzureOpenAIDocumentEmbedder()
  26  
  27  result = document_embedder.run([doc])
  28  print(result['documents'][0].embedding)
  29  
  30  # [0.017020374536514282, -0.023255806416273117, ...]
  31  ```
  32  
  33  #### __init__
  34  
  35  ```python
  36  __init__(
  37      azure_endpoint: str | None = None,
  38      api_version: str | None = "2023-05-15",
  39      azure_deployment: str = "text-embedding-ada-002",
  40      dimensions: int | None = None,
  41      api_key: Secret | None = Secret.from_env_var(
  42          "AZURE_OPENAI_API_KEY", strict=False
  43      ),
  44      azure_ad_token: Secret | None = Secret.from_env_var(
  45          "AZURE_OPENAI_AD_TOKEN", strict=False
  46      ),
  47      organization: str | None = None,
  48      prefix: str = "",
  49      suffix: str = "",
  50      batch_size: int = 32,
  51      progress_bar: bool = True,
  52      meta_fields_to_embed: list[str] | None = None,
  53      embedding_separator: str = "\n",
  54      timeout: float | None = None,
  55      max_retries: int | None = None,
  56      *,
  57      default_headers: dict[str, str] | None = None,
  58      azure_ad_token_provider: AzureADTokenProvider | None = None,
  59      http_client_kwargs: dict[str, Any] | None = None,
  60      raise_on_failure: bool = False
  61  )
  62  ```
  63  
  64  Creates an AzureOpenAIDocumentEmbedder component.
  65  
  66  **Parameters:**
  67  
  68  - **azure_endpoint** (<code>str | None</code>) – The endpoint of the model deployed on Azure.
  69  - **api_version** (<code>str | None</code>) – The version of the API to use.
  70  - **azure_deployment** (<code>str</code>) – The name of the model deployed on Azure. The default model is text-embedding-ada-002.
  71  - **dimensions** (<code>int | None</code>) – The number of dimensions of the resulting embeddings. Only supported in text-embedding-3
  72    and later models.
  73  - **api_key** (<code>Secret | None</code>) – The Azure OpenAI API key.
  74    You can set it with an environment variable `AZURE_OPENAI_API_KEY`, or pass with this
  75    parameter during initialization.
  76  - **azure_ad_token** (<code>Secret | None</code>) – Microsoft Entra ID token, see Microsoft's
  77    [Entra ID](https://www.microsoft.com/en-us/security/business/identity-access/microsoft-entra-id)
  78    documentation for more information. You can set it with an environment variable
  79    `AZURE_OPENAI_AD_TOKEN`, or pass with this parameter during initialization.
  80    Previously called Azure Active Directory.
  81  - **organization** (<code>str | None</code>) – Your organization ID. See OpenAI's
  82    [Setting Up Your Organization](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization)
  83    for more information.
  84  - **prefix** (<code>str</code>) – A string to add at the beginning of each text.
  85  - **suffix** (<code>str</code>) – A string to add at the end of each text.
  86  - **batch_size** (<code>int</code>) – Number of documents to embed at once.
  87  - **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when running.
  88  - **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
  89  - **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
  90  - **timeout** (<code>float | None</code>) – The timeout for `AzureOpenAI` client calls, in seconds.
  91    If not set, defaults to either the
  92    `OPENAI_TIMEOUT` environment variable, or 30 seconds.
  93  - **max_retries** (<code>int | None</code>) – Maximum number of retries to contact AzureOpenAI after an internal error.
  94    If not set, defaults to either the `OPENAI_MAX_RETRIES` environment variable or to 5 retries.
  95  - **default_headers** (<code>dict\[str, str\] | None</code>) – Default headers to send to the AzureOpenAI client.
  96  - **azure_ad_token_provider** (<code>AzureADTokenProvider | None</code>) – A function that returns an Azure Active Directory token, will be invoked on
  97    every request.
  98  - **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client`or `httpx.AsyncClient`.
  99    For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
 100  - **raise_on_failure** (<code>bool</code>) – Whether to raise an exception if the embedding request fails. If `False`, the component will log the error
 101    and continue processing the remaining documents. If `True`, it will raise an exception on failure.
 102  
 103  #### to_dict
 104  
 105  ```python
 106  to_dict() -> dict[str, Any]
 107  ```
 108  
 109  Serializes the component to a dictionary.
 110  
 111  **Returns:**
 112  
 113  - <code>dict\[str, Any\]</code> – Dictionary with serialized data.
 114  
 115  #### from_dict
 116  
 117  ```python
 118  from_dict(data: dict[str, Any]) -> AzureOpenAIDocumentEmbedder
 119  ```
 120  
 121  Deserializes the component from a dictionary.
 122  
 123  **Parameters:**
 124  
 125  - **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.
 126  
 127  **Returns:**
 128  
 129  - <code>AzureOpenAIDocumentEmbedder</code> – Deserialized component.
 130  
 131  ## azure_text_embedder
 132  
 133  ### AzureOpenAITextEmbedder
 134  
 135  Bases: <code>OpenAITextEmbedder</code>
 136  
 137  Embeds strings using OpenAI models deployed on Azure.
 138  
 139  ### Usage example
 140  
 141  ```python
 142  from haystack.components.embedders import AzureOpenAITextEmbedder
 143  
 144  text_to_embed = "I love pizza!"
 145  
 146  text_embedder = AzureOpenAITextEmbedder()
 147  
 148  print(text_embedder.run(text_to_embed))
 149  
 150  # {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
 151  # 'meta': {'model': 'text-embedding-ada-002-v2',
 152  #          'usage': {'prompt_tokens': 4, 'total_tokens': 4}}}
 153  ```
 154  
 155  #### __init__
 156  
 157  ```python
 158  __init__(
 159      azure_endpoint: str | None = None,
 160      api_version: str | None = "2023-05-15",
 161      azure_deployment: str = "text-embedding-ada-002",
 162      dimensions: int | None = None,
 163      api_key: Secret | None = Secret.from_env_var(
 164          "AZURE_OPENAI_API_KEY", strict=False
 165      ),
 166      azure_ad_token: Secret | None = Secret.from_env_var(
 167          "AZURE_OPENAI_AD_TOKEN", strict=False
 168      ),
 169      organization: str | None = None,
 170      timeout: float | None = None,
 171      max_retries: int | None = None,
 172      prefix: str = "",
 173      suffix: str = "",
 174      *,
 175      default_headers: dict[str, str] | None = None,
 176      azure_ad_token_provider: AzureADTokenProvider | None = None,
 177      http_client_kwargs: dict[str, Any] | None = None
 178  )
 179  ```
 180  
 181  Creates an AzureOpenAITextEmbedder component.
 182  
 183  **Parameters:**
 184  
 185  - **azure_endpoint** (<code>str | None</code>) – The endpoint of the model deployed on Azure.
 186  - **api_version** (<code>str | None</code>) – The version of the API to use.
 187  - **azure_deployment** (<code>str</code>) – The name of the model deployed on Azure. The default model is text-embedding-ada-002.
 188  - **dimensions** (<code>int | None</code>) – The number of dimensions the resulting output embeddings should have. Only supported in text-embedding-3
 189    and later models.
 190  - **api_key** (<code>Secret | None</code>) – The Azure OpenAI API key.
 191    You can set it with an environment variable `AZURE_OPENAI_API_KEY`, or pass with this
 192    parameter during initialization.
 193  - **azure_ad_token** (<code>Secret | None</code>) – Microsoft Entra ID token, see Microsoft's
 194    [Entra ID](https://www.microsoft.com/en-us/security/business/identity-access/microsoft-entra-id)
 195    documentation for more information. You can set it with an environment variable
 196    `AZURE_OPENAI_AD_TOKEN`, or pass with this parameter during initialization.
 197    Previously called Azure Active Directory.
 198  - **organization** (<code>str | None</code>) – Your organization ID. See OpenAI's
 199    [Setting Up Your Organization](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization)
 200    for more information.
 201  - **timeout** (<code>float | None</code>) – The timeout for `AzureOpenAI` client calls, in seconds.
 202    If not set, defaults to either the
 203    `OPENAI_TIMEOUT` environment variable, or 30 seconds.
 204  - **max_retries** (<code>int | None</code>) – Maximum number of retries to contact AzureOpenAI after an internal error.
 205    If not set, defaults to either the `OPENAI_MAX_RETRIES` environment variable, or to 5 retries.
 206  - **prefix** (<code>str</code>) – A string to add at the beginning of each text.
 207  - **suffix** (<code>str</code>) – A string to add at the end of each text.
 208  - **default_headers** (<code>dict\[str, str\] | None</code>) – Default headers to send to the AzureOpenAI client.
 209  - **azure_ad_token_provider** (<code>AzureADTokenProvider | None</code>) – A function that returns an Azure Active Directory token, will be invoked on
 210    every request.
 211  - **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client`or `httpx.AsyncClient`.
 212    For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
 213  
 214  #### to_dict
 215  
 216  ```python
 217  to_dict() -> dict[str, Any]
 218  ```
 219  
 220  Serializes the component to a dictionary.
 221  
 222  **Returns:**
 223  
 224  - <code>dict\[str, Any\]</code> – Dictionary with serialized data.
 225  
 226  #### from_dict
 227  
 228  ```python
 229  from_dict(data: dict[str, Any]) -> AzureOpenAITextEmbedder
 230  ```
 231  
 232  Deserializes the component from a dictionary.
 233  
 234  **Parameters:**
 235  
 236  - **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.
 237  
 238  **Returns:**
 239  
 240  - <code>AzureOpenAITextEmbedder</code> – Deserialized component.
 241  
 242  ## hugging_face_api_document_embedder
 243  
 244  ### HuggingFaceAPIDocumentEmbedder
 245  
 246  Embeds documents using Hugging Face APIs.
 247  
 248  Use it with the following Hugging Face APIs:
 249  
 250  - [Free Serverless Inference API](https://huggingface.co/inference-api)
 251  - [Paid Inference Endpoints](https://huggingface.co/inference-endpoints)
 252  - [Self-hosted Text Embeddings Inference](https://github.com/huggingface/text-embeddings-inference)
 253  
 254  ### Usage examples
 255  
 256  #### With free serverless inference API
 257  
 258  ```python
 259  from haystack.components.embedders import HuggingFaceAPIDocumentEmbedder
 260  from haystack.utils import Secret
 261  from haystack.dataclasses import Document
 262  
 263  doc = Document(content="I love pizza!")
 264  
 265  doc_embedder = HuggingFaceAPIDocumentEmbedder(api_type="serverless_inference_api",
 266                                                api_params={"model": "BAAI/bge-small-en-v1.5"},
 267                                                token=Secret.from_token("<your-api-key>"))
 268  
 269  result = document_embedder.run([doc])
 270  print(result["documents"][0].embedding)
 271  
 272  # [0.017020374536514282, -0.023255806416273117, ...]
 273  ```
 274  
 275  #### With paid inference endpoints
 276  
 277  ```python
 278  from haystack.components.embedders import HuggingFaceAPIDocumentEmbedder
 279  from haystack.utils import Secret
 280  from haystack.dataclasses import Document
 281  
 282  doc = Document(content="I love pizza!")
 283  
 284  doc_embedder = HuggingFaceAPIDocumentEmbedder(api_type="inference_endpoints",
 285                                                api_params={"url": "<your-inference-endpoint-url>"},
 286                                                token=Secret.from_token("<your-api-key>"))
 287  
 288  result = document_embedder.run([doc])
 289  print(result["documents"][0].embedding)
 290  
 291  # [0.017020374536514282, -0.023255806416273117, ...]
 292  ```
 293  
 294  #### With self-hosted text embeddings inference
 295  
 296  ```python
 297  from haystack.components.embedders import HuggingFaceAPIDocumentEmbedder
 298  from haystack.dataclasses import Document
 299  
 300  doc = Document(content="I love pizza!")
 301  
 302  doc_embedder = HuggingFaceAPIDocumentEmbedder(api_type="text_embeddings_inference",
 303                                                api_params={"url": "http://localhost:8080"})
 304  
 305  result = document_embedder.run([doc])
 306  print(result["documents"][0].embedding)
 307  
 308  # [0.017020374536514282, -0.023255806416273117, ...]
 309  ```
 310  
 311  #### __init__
 312  
 313  ```python
 314  __init__(
 315      api_type: HFEmbeddingAPIType | str,
 316      api_params: dict[str, str],
 317      token: Secret | None = Secret.from_env_var(
 318          ["HF_API_TOKEN", "HF_TOKEN"], strict=False
 319      ),
 320      prefix: str = "",
 321      suffix: str = "",
 322      truncate: bool | None = True,
 323      normalize: bool | None = False,
 324      batch_size: int = 32,
 325      progress_bar: bool = True,
 326      meta_fields_to_embed: list[str] | None = None,
 327      embedding_separator: str = "\n",
 328  )
 329  ```
 330  
 331  Creates a HuggingFaceAPIDocumentEmbedder component.
 332  
 333  **Parameters:**
 334  
 335  - **api_type** (<code>HFEmbeddingAPIType | str</code>) – The type of Hugging Face API to use.
 336  - **api_params** (<code>dict\[str, str\]</code>) – A dictionary with the following keys:
 337  - `model`: Hugging Face model ID. Required when `api_type` is `SERVERLESS_INFERENCE_API`.
 338  - `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or
 339    `TEXT_EMBEDDINGS_INFERENCE`.
 340  - **token** (<code>Secret | None</code>) – The Hugging Face token to use as HTTP bearer authorization.
 341    Check your HF token in your [account settings](https://huggingface.co/settings/tokens).
 342  - **prefix** (<code>str</code>) – A string to add at the beginning of each text.
 343  - **suffix** (<code>str</code>) – A string to add at the end of each text.
 344  - **truncate** (<code>bool | None</code>) – Truncates the input text to the maximum length supported by the model.
 345    Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS`
 346    if the backend uses Text Embeddings Inference.
 347    If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.
 348  - **normalize** (<code>bool | None</code>) – Normalizes the embeddings to unit length.
 349    Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS`
 350    if the backend uses Text Embeddings Inference.
 351    If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.
 352  - **batch_size** (<code>int</code>) – Number of documents to process at once.
 353  - **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when running.
 354  - **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
 355  - **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
 356  
 357  #### to_dict
 358  
 359  ```python
 360  to_dict() -> dict[str, Any]
 361  ```
 362  
 363  Serializes the component to a dictionary.
 364  
 365  **Returns:**
 366  
 367  - <code>dict\[str, Any\]</code> – Dictionary with serialized data.
 368  
 369  #### from_dict
 370  
 371  ```python
 372  from_dict(data: dict[str, Any]) -> HuggingFaceAPIDocumentEmbedder
 373  ```
 374  
 375  Deserializes the component from a dictionary.
 376  
 377  **Parameters:**
 378  
 379  - **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.
 380  
 381  **Returns:**
 382  
 383  - <code>HuggingFaceAPIDocumentEmbedder</code> – Deserialized component.
 384  
 385  #### run
 386  
 387  ```python
 388  run(documents: list[Document])
 389  ```
 390  
 391  Embeds a list of documents.
 392  
 393  **Parameters:**
 394  
 395  - **documents** (<code>list\[Document\]</code>) – Documents to embed.
 396  
 397  **Returns:**
 398  
 399  - – A dictionary with the following keys:
 400  - `documents`: A list of documents with embeddings.
 401  
 402  #### run_async
 403  
 404  ```python
 405  run_async(documents: list[Document])
 406  ```
 407  
 408  Embeds a list of documents asynchronously.
 409  
 410  **Parameters:**
 411  
 412  - **documents** (<code>list\[Document\]</code>) – Documents to embed.
 413  
 414  **Returns:**
 415  
 416  - – A dictionary with the following keys:
 417  - `documents`: A list of documents with embeddings.
 418  
 419  ## hugging_face_api_text_embedder
 420  
 421  ### HuggingFaceAPITextEmbedder
 422  
 423  Embeds strings using Hugging Face APIs.
 424  
 425  Use it with the following Hugging Face APIs:
 426  
 427  - [Free Serverless Inference API](https://huggingface.co/inference-api)
 428  - [Paid Inference Endpoints](https://huggingface.co/inference-endpoints)
 429  - [Self-hosted Text Embeddings Inference](https://github.com/huggingface/text-embeddings-inference)
 430  
 431  ### Usage examples
 432  
 433  #### With free serverless inference API
 434  
 435  ```python
 436  from haystack.components.embedders import HuggingFaceAPITextEmbedder
 437  from haystack.utils import Secret
 438  
 439  text_embedder = HuggingFaceAPITextEmbedder(api_type="serverless_inference_api",
 440                                             api_params={"model": "BAAI/bge-small-en-v1.5"},
 441                                             token=Secret.from_token("<your-api-key>"))
 442  
 443  print(text_embedder.run("I love pizza!"))
 444  
 445  # {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
 446  ```
 447  
 448  #### With paid inference endpoints
 449  
 450  ```python
 451  from haystack.components.embedders import HuggingFaceAPITextEmbedder
 452  from haystack.utils import Secret
 453  text_embedder = HuggingFaceAPITextEmbedder(api_type="inference_endpoints",
 454                                             api_params={"model": "BAAI/bge-small-en-v1.5"},
 455                                             token=Secret.from_token("<your-api-key>"))
 456  
 457  print(text_embedder.run("I love pizza!"))
 458  
 459  # {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
 460  ```
 461  
 462  #### With self-hosted text embeddings inference
 463  
 464  ```python
 465  from haystack.components.embedders import HuggingFaceAPITextEmbedder
 466  from haystack.utils import Secret
 467  
 468  text_embedder = HuggingFaceAPITextEmbedder(api_type="text_embeddings_inference",
 469                                             api_params={"url": "http://localhost:8080"})
 470  
 471  print(text_embedder.run("I love pizza!"))
 472  
 473  # {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
 474  ```
 475  
 476  #### __init__
 477  
 478  ```python
 479  __init__(
 480      api_type: HFEmbeddingAPIType | str,
 481      api_params: dict[str, str],
 482      token: Secret | None = Secret.from_env_var(
 483          ["HF_API_TOKEN", "HF_TOKEN"], strict=False
 484      ),
 485      prefix: str = "",
 486      suffix: str = "",
 487      truncate: bool | None = True,
 488      normalize: bool | None = False,
 489  )
 490  ```
 491  
 492  Creates a HuggingFaceAPITextEmbedder component.
 493  
 494  **Parameters:**
 495  
 496  - **api_type** (<code>HFEmbeddingAPIType | str</code>) – The type of Hugging Face API to use.
 497  - **api_params** (<code>dict\[str, str\]</code>) – A dictionary with the following keys:
 498  - `model`: Hugging Face model ID. Required when `api_type` is `SERVERLESS_INFERENCE_API`.
 499  - `url`: URL of the inference endpoint. Required when `api_type` is `INFERENCE_ENDPOINTS` or
 500    `TEXT_EMBEDDINGS_INFERENCE`.
 501  - **token** (<code>Secret | None</code>) – The Hugging Face token to use as HTTP bearer authorization.
 502    Check your HF token in your [account settings](https://huggingface.co/settings/tokens).
 503  - **prefix** (<code>str</code>) – A string to add at the beginning of each text.
 504  - **suffix** (<code>str</code>) – A string to add at the end of each text.
 505  - **truncate** (<code>bool | None</code>) – Truncates the input text to the maximum length supported by the model.
 506    Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS`
 507    if the backend uses Text Embeddings Inference.
 508    If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.
 509  - **normalize** (<code>bool | None</code>) – Normalizes the embeddings to unit length.
 510    Applicable when `api_type` is `TEXT_EMBEDDINGS_INFERENCE`, or `INFERENCE_ENDPOINTS`
 511    if the backend uses Text Embeddings Inference.
 512    If `api_type` is `SERVERLESS_INFERENCE_API`, this parameter is ignored.
 513  
 514  #### to_dict
 515  
 516  ```python
 517  to_dict() -> dict[str, Any]
 518  ```
 519  
 520  Serializes the component to a dictionary.
 521  
 522  **Returns:**
 523  
 524  - <code>dict\[str, Any\]</code> – Dictionary with serialized data.
 525  
 526  #### from_dict
 527  
 528  ```python
 529  from_dict(data: dict[str, Any]) -> HuggingFaceAPITextEmbedder
 530  ```
 531  
 532  Deserializes the component from a dictionary.
 533  
 534  **Parameters:**
 535  
 536  - **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.
 537  
 538  **Returns:**
 539  
 540  - <code>HuggingFaceAPITextEmbedder</code> – Deserialized component.
 541  
 542  #### run
 543  
 544  ```python
 545  run(text: str)
 546  ```
 547  
 548  Embeds a single string.
 549  
 550  **Parameters:**
 551  
 552  - **text** (<code>str</code>) – Text to embed.
 553  
 554  **Returns:**
 555  
 556  - – A dictionary with the following keys:
 557  - `embedding`: The embedding of the input text.
 558  
 559  #### run_async
 560  
 561  ```python
 562  run_async(text: str)
 563  ```
 564  
 565  Embeds a single string asynchronously.
 566  
 567  **Parameters:**
 568  
 569  - **text** (<code>str</code>) – Text to embed.
 570  
 571  **Returns:**
 572  
 573  - – A dictionary with the following keys:
 574  - `embedding`: The embedding of the input text.
 575  
 576  ## image/sentence_transformers_doc_image_embedder
 577  
 578  ### SentenceTransformersDocumentImageEmbedder
 579  
 580  A component for computing Document embeddings based on images using Sentence Transformers models.
 581  
 582  The embedding of each Document is stored in the `embedding` field of the Document.
 583  
 584  ### Usage example
 585  
 586  ```python
 587  from haystack import Document
 588  from haystack.components.embedders.image import SentenceTransformersDocumentImageEmbedder
 589  
 590  embedder = SentenceTransformersDocumentImageEmbedder(model="sentence-transformers/clip-ViT-B-32")
 591  
 592  documents = [
 593      Document(content="A photo of a cat", meta={"file_path": "cat.jpg"}),
 594      Document(content="A photo of a dog", meta={"file_path": "dog.jpg"}),
 595  ]
 596  
 597  result = embedder.run(documents=documents)
 598  documents_with_embeddings = result["documents"]
 599  print(documents_with_embeddings)
 600  
 601  # [Document(id=...,
 602  #           content='A photo of a cat',
 603  #           meta={'file_path': 'cat.jpg',
 604  #                 'embedding_source': {'type': 'image', 'file_path_meta_field': 'file_path'}},
 605  #           embedding=vector of size 512),
 606  #  ...]
 607  ```
 608  
 609  #### __init__
 610  
 611  ```python
 612  __init__(
 613      *,
 614      file_path_meta_field: str = "file_path",
 615      root_path: str | None = None,
 616      model: str = "sentence-transformers/clip-ViT-B-32",
 617      device: ComponentDevice | None = None,
 618      token: Secret | None = Secret.from_env_var(
 619          ["HF_API_TOKEN", "HF_TOKEN"], strict=False
 620      ),
 621      batch_size: int = 32,
 622      progress_bar: bool = True,
 623      normalize_embeddings: bool = False,
 624      trust_remote_code: bool = False,
 625      local_files_only: bool = False,
 626      model_kwargs: dict[str, Any] | None = None,
 627      tokenizer_kwargs: dict[str, Any] | None = None,
 628      config_kwargs: dict[str, Any] | None = None,
 629      precision: Literal[
 630          "float32", "int8", "uint8", "binary", "ubinary"
 631      ] = "float32",
 632      encode_kwargs: dict[str, Any] | None = None,
 633      backend: Literal["torch", "onnx", "openvino"] = "torch"
 634  ) -> None
 635  ```
 636  
 637  Creates a SentenceTransformersDocumentEmbedder component.
 638  
 639  **Parameters:**
 640  
 641  - **file_path_meta_field** (<code>str</code>) – The metadata field in the Document that contains the file path to the image or PDF.
 642  - **root_path** (<code>str | None</code>) – The root directory path where document files are located. If provided, file paths in
 643    document metadata will be resolved relative to this path. If None, file paths are treated as absolute paths.
 644  - **model** (<code>str</code>) – The Sentence Transformers model to use for calculating embeddings. Pass a local path or ID of the model on
 645    Hugging Face. To be used with this component, the model must be able to embed images and text into the same
 646    vector space. Compatible models include:
 647  - "sentence-transformers/clip-ViT-B-32"
 648  - "sentence-transformers/clip-ViT-L-14"
 649  - "sentence-transformers/clip-ViT-B-16"
 650  - "sentence-transformers/clip-ViT-B-32-multilingual-v1"
 651  - "jinaai/jina-embeddings-v4"
 652  - "jinaai/jina-clip-v1"
 653  - "jinaai/jina-clip-v2".
 654  - **device** (<code>ComponentDevice | None</code>) – The device to use for loading the model.
 655    Overrides the default device.
 656  - **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.
 657  - **batch_size** (<code>int</code>) – Number of documents to embed at once.
 658  - **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when embedding documents.
 659  - **normalize_embeddings** (<code>bool</code>) – If `True`, the embeddings are normalized using L2 normalization, so that each embedding has a norm of 1.
 660  - **trust_remote_code** (<code>bool</code>) – If `False`, allows only Hugging Face verified model architectures.
 661    If `True`, allows custom models and scripts.
 662  - **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
 663  - **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
 664    when loading the model. Refer to specific model documentation for available kwargs.
 665  - **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
 666    Refer to specific model documentation for available kwargs.
 667  - **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
 668  - **precision** (<code>Literal['float32', 'int8', 'uint8', 'binary', 'ubinary']</code>) – The precision to use for the embeddings.
 669    All non-float32 precisions are quantized embeddings.
 670    Quantized embeddings are smaller and faster to compute, but may have a lower accuracy.
 671    They are useful for reducing the size of the embeddings of a corpus for semantic search, among other tasks.
 672  - **encode_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `SentenceTransformer.encode` when embedding documents.
 673    This parameter is provided for fine customization. Be careful not to clash with already set parameters and
 674    avoid passing parameters that change the output type.
 675  - **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
 676    Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
 677    for more information on acceleration and quantization options.
 678  
 679  #### to_dict
 680  
 681  ```python
 682  to_dict() -> dict[str, Any]
 683  ```
 684  
 685  Serializes the component to a dictionary.
 686  
 687  **Returns:**
 688  
 689  - <code>dict\[str, Any\]</code> – Dictionary with serialized data.
 690  
 691  #### from_dict
 692  
 693  ```python
 694  from_dict(data: dict[str, Any]) -> SentenceTransformersDocumentImageEmbedder
 695  ```
 696  
 697  Deserializes the component from a dictionary.
 698  
 699  **Parameters:**
 700  
 701  - **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.
 702  
 703  **Returns:**
 704  
 705  - <code>SentenceTransformersDocumentImageEmbedder</code> – Deserialized component.
 706  
 707  #### warm_up
 708  
 709  ```python
 710  warm_up() -> None
 711  ```
 712  
 713  Initializes the component.
 714  
 715  #### run
 716  
 717  ```python
 718  run(documents: list[Document]) -> dict[str, list[Document]]
 719  ```
 720  
 721  Embed a list of documents.
 722  
 723  **Parameters:**
 724  
 725  - **documents** (<code>list\[Document\]</code>) – Documents to embed.
 726  
 727  **Returns:**
 728  
 729  - <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
 730  - `documents`: Documents with embeddings.
 731  
 732  ## openai_document_embedder
 733  
 734  ### OpenAIDocumentEmbedder
 735  
 736  Computes document embeddings using OpenAI models.
 737  
 738  ### Usage example
 739  
 740  ```python
 741  from haystack import Document
 742  from haystack.components.embedders import OpenAIDocumentEmbedder
 743  
 744  doc = Document(content="I love pizza!")
 745  
 746  document_embedder = OpenAIDocumentEmbedder()
 747  
 748  result = document_embedder.run([doc])
 749  print(result['documents'][0].embedding)
 750  
 751  # [0.017020374536514282, -0.023255806416273117, ...]
 752  ```
 753  
 754  #### __init__
 755  
 756  ```python
 757  __init__(
 758      api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
 759      model: str = "text-embedding-ada-002",
 760      dimensions: int | None = None,
 761      api_base_url: str | None = None,
 762      organization: str | None = None,
 763      prefix: str = "",
 764      suffix: str = "",
 765      batch_size: int = 32,
 766      progress_bar: bool = True,
 767      meta_fields_to_embed: list[str] | None = None,
 768      embedding_separator: str = "\n",
 769      timeout: float | None = None,
 770      max_retries: int | None = None,
 771      http_client_kwargs: dict[str, Any] | None = None,
 772      *,
 773      raise_on_failure: bool = False
 774  )
 775  ```
 776  
 777  Creates an OpenAIDocumentEmbedder component.
 778  
 779  Before initializing the component, you can set the 'OPENAI_TIMEOUT' and 'OPENAI_MAX_RETRIES'
 780  environment variables to override the `timeout` and `max_retries` parameters respectively
 781  in the OpenAI client.
 782  
 783  **Parameters:**
 784  
 785  - **api_key** (<code>Secret</code>) – The OpenAI API key.
 786    You can set it with an environment variable `OPENAI_API_KEY`, or pass with this parameter
 787    during initialization.
 788  - **model** (<code>str</code>) – The name of the model to use for calculating embeddings.
 789    The default model is `text-embedding-ada-002`.
 790  - **dimensions** (<code>int | None</code>) – The number of dimensions of the resulting embeddings. Only `text-embedding-3` and
 791    later models support this parameter.
 792  - **api_base_url** (<code>str | None</code>) – Overrides the default base URL for all HTTP requests.
 793  - **organization** (<code>str | None</code>) – Your OpenAI organization ID. See OpenAI's
 794    [Setting Up Your Organization](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization)
 795    for more information.
 796  - **prefix** (<code>str</code>) – A string to add at the beginning of each text.
 797  - **suffix** (<code>str</code>) – A string to add at the end of each text.
 798  - **batch_size** (<code>int</code>) – Number of documents to embed at once.
 799  - **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when running.
 800  - **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
 801  - **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
 802  - **timeout** (<code>float | None</code>) – Timeout for OpenAI client calls. If not set, it defaults to either the
 803    `OPENAI_TIMEOUT` environment variable, or 30 seconds.
 804  - **max_retries** (<code>int | None</code>) – Maximum number of retries to contact OpenAI after an internal error.
 805    If not set, it defaults to either the `OPENAI_MAX_RETRIES` environment variable, or 5 retries.
 806  - **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client`or `httpx.AsyncClient`.
 807    For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
 808  - **raise_on_failure** (<code>bool</code>) – Whether to raise an exception if the embedding request fails. If `False`, the component will log the error
 809    and continue processing the remaining documents. If `True`, it will raise an exception on failure.
 810  
 811  #### to_dict
 812  
 813  ```python
 814  to_dict() -> dict[str, Any]
 815  ```
 816  
 817  Serializes the component to a dictionary.
 818  
 819  **Returns:**
 820  
 821  - <code>dict\[str, Any\]</code> – Dictionary with serialized data.
 822  
 823  #### from_dict
 824  
 825  ```python
 826  from_dict(data: dict[str, Any]) -> OpenAIDocumentEmbedder
 827  ```
 828  
 829  Deserializes the component from a dictionary.
 830  
 831  **Parameters:**
 832  
 833  - **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.
 834  
 835  **Returns:**
 836  
 837  - <code>OpenAIDocumentEmbedder</code> – Deserialized component.
 838  
 839  #### run
 840  
 841  ```python
 842  run(documents: list[Document])
 843  ```
 844  
 845  Embeds a list of documents.
 846  
 847  **Parameters:**
 848  
 849  - **documents** (<code>list\[Document\]</code>) – A list of documents to embed.
 850  
 851  **Returns:**
 852  
 853  - – A dictionary with the following keys:
 854  - `documents`: A list of documents with embeddings.
 855  - `meta`: Information about the usage of the model.
 856  
 857  #### run_async
 858  
 859  ```python
 860  run_async(documents: list[Document])
 861  ```
 862  
 863  Embeds a list of documents asynchronously.
 864  
 865  **Parameters:**
 866  
 867  - **documents** (<code>list\[Document\]</code>) – A list of documents to embed.
 868  
 869  **Returns:**
 870  
 871  - – A dictionary with the following keys:
 872  - `documents`: A list of documents with embeddings.
 873  - `meta`: Information about the usage of the model.
 874  
 875  ## openai_text_embedder
 876  
 877  ### OpenAITextEmbedder
 878  
 879  Embeds strings using OpenAI models.
 880  
 881  You can use it to embed user query and send it to an embedding Retriever.
 882  
 883  ### Usage example
 884  
 885  ```python
 886  from haystack.components.embedders import OpenAITextEmbedder
 887  
 888  text_to_embed = "I love pizza!"
 889  
 890  text_embedder = OpenAITextEmbedder()
 891  
 892  print(text_embedder.run(text_to_embed))
 893  
 894  # {'embedding': [0.017020374536514282, -0.023255806416273117, ...],
 895  # 'meta': {'model': 'text-embedding-ada-002-v2',
 896  #          'usage': {'prompt_tokens': 4, 'total_tokens': 4}}}
 897  ```
 898  
 899  #### __init__
 900  
 901  ```python
 902  __init__(
 903      api_key: Secret = Secret.from_env_var("OPENAI_API_KEY"),
 904      model: str = "text-embedding-ada-002",
 905      dimensions: int | None = None,
 906      api_base_url: str | None = None,
 907      organization: str | None = None,
 908      prefix: str = "",
 909      suffix: str = "",
 910      timeout: float | None = None,
 911      max_retries: int | None = None,
 912      http_client_kwargs: dict[str, Any] | None = None,
 913  )
 914  ```
 915  
 916  Creates an OpenAITextEmbedder component.
 917  
 918  Before initializing the component, you can set the 'OPENAI_TIMEOUT' and 'OPENAI_MAX_RETRIES'
 919  environment variables to override the `timeout` and `max_retries` parameters respectively
 920  in the OpenAI client.
 921  
 922  **Parameters:**
 923  
 924  - **api_key** (<code>Secret</code>) – The OpenAI API key.
 925    You can set it with an environment variable `OPENAI_API_KEY`, or pass with this parameter
 926    during initialization.
 927  - **model** (<code>str</code>) – The name of the model to use for calculating embeddings.
 928    The default model is `text-embedding-ada-002`.
 929  - **dimensions** (<code>int | None</code>) – The number of dimensions of the resulting embeddings. Only `text-embedding-3` and
 930    later models support this parameter.
 931  - **api_base_url** (<code>str | None</code>) – Overrides default base URL for all HTTP requests.
 932  - **organization** (<code>str | None</code>) – Your organization ID. See OpenAI's
 933    [production best practices](https://platform.openai.com/docs/guides/production-best-practices/setting-up-your-organization)
 934    for more information.
 935  - **prefix** (<code>str</code>) – A string to add at the beginning of each text to embed.
 936  - **suffix** (<code>str</code>) – A string to add at the end of each text to embed.
 937  - **timeout** (<code>float | None</code>) – Timeout for OpenAI client calls. If not set, it defaults to either the
 938    `OPENAI_TIMEOUT` environment variable, or 30 seconds.
 939  - **max_retries** (<code>int | None</code>) – Maximum number of retries to contact OpenAI after an internal error.
 940    If not set, it defaults to either the `OPENAI_MAX_RETRIES` environment variable, or set to 5.
 941  - **http_client_kwargs** (<code>dict\[str, Any\] | None</code>) – A dictionary of keyword arguments to configure a custom `httpx.Client`or `httpx.AsyncClient`.
 942    For more information, see the [HTTPX documentation](https://www.python-httpx.org/api/#client).
 943  
 944  #### to_dict
 945  
 946  ```python
 947  to_dict() -> dict[str, Any]
 948  ```
 949  
 950  Serializes the component to a dictionary.
 951  
 952  **Returns:**
 953  
 954  - <code>dict\[str, Any\]</code> – Dictionary with serialized data.
 955  
 956  #### from_dict
 957  
 958  ```python
 959  from_dict(data: dict[str, Any]) -> OpenAITextEmbedder
 960  ```
 961  
 962  Deserializes the component from a dictionary.
 963  
 964  **Parameters:**
 965  
 966  - **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.
 967  
 968  **Returns:**
 969  
 970  - <code>OpenAITextEmbedder</code> – Deserialized component.
 971  
 972  #### run
 973  
 974  ```python
 975  run(text: str)
 976  ```
 977  
 978  Embeds a single string.
 979  
 980  **Parameters:**
 981  
 982  - **text** (<code>str</code>) – Text to embed.
 983  
 984  **Returns:**
 985  
 986  - – A dictionary with the following keys:
 987  - `embedding`: The embedding of the input text.
 988  - `meta`: Information about the usage of the model.
 989  
 990  #### run_async
 991  
 992  ```python
 993  run_async(text: str)
 994  ```
 995  
 996  Asynchronously embed a single string.
 997  
 998  This is the asynchronous version of the `run` method. It has the same parameters and return values
 999  but can be used with `await` in async code.
1000  
1001  **Parameters:**
1002  
1003  - **text** (<code>str</code>) – Text to embed.
1004  
1005  **Returns:**
1006  
1007  - – A dictionary with the following keys:
1008  - `embedding`: The embedding of the input text.
1009  - `meta`: Information about the usage of the model.
1010  
1011  ## sentence_transformers_document_embedder
1012  
1013  ### SentenceTransformersDocumentEmbedder
1014  
1015  Calculates document embeddings using Sentence Transformers models.
1016  
1017  It stores the embeddings in the `embedding` metadata field of each document.
1018  You can also embed documents' metadata.
1019  Use this component in indexing pipelines to embed input documents
1020  and send them to DocumentWriter to write into a Document Store.
1021  
1022  ### Usage example:
1023  
1024  ```python
1025  from haystack import Document
1026  from haystack.components.embedders import SentenceTransformersDocumentEmbedder
1027  doc = Document(content="I love pizza!")
1028  doc_embedder = SentenceTransformersDocumentEmbedder()
1029  
1030  result = doc_embedder.run([doc])
1031  print(result['documents'][0].embedding)
1032  
1033  # [-0.07804739475250244, 0.1498992145061493, ...]
1034  ```
1035  
1036  #### __init__
1037  
1038  ```python
1039  __init__(
1040      model: str = "sentence-transformers/all-mpnet-base-v2",
1041      device: ComponentDevice | None = None,
1042      token: Secret | None = Secret.from_env_var(
1043          ["HF_API_TOKEN", "HF_TOKEN"], strict=False
1044      ),
1045      prefix: str = "",
1046      suffix: str = "",
1047      batch_size: int = 32,
1048      progress_bar: bool = True,
1049      normalize_embeddings: bool = False,
1050      meta_fields_to_embed: list[str] | None = None,
1051      embedding_separator: str = "\n",
1052      trust_remote_code: bool = False,
1053      local_files_only: bool = False,
1054      truncate_dim: int | None = None,
1055      model_kwargs: dict[str, Any] | None = None,
1056      tokenizer_kwargs: dict[str, Any] | None = None,
1057      config_kwargs: dict[str, Any] | None = None,
1058      precision: Literal[
1059          "float32", "int8", "uint8", "binary", "ubinary"
1060      ] = "float32",
1061      encode_kwargs: dict[str, Any] | None = None,
1062      backend: Literal["torch", "onnx", "openvino"] = "torch",
1063      revision: str | None = None,
1064  )
1065  ```
1066  
1067  Creates a SentenceTransformersDocumentEmbedder component.
1068  
1069  **Parameters:**
1070  
1071  - **model** (<code>str</code>) – The model to use for calculating embeddings.
1072    Pass a local path or ID of the model on Hugging Face.
1073  - **device** (<code>ComponentDevice | None</code>) – The device to use for loading the model.
1074    Overrides the default device.
1075  - **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.
1076  - **prefix** (<code>str</code>) – A string to add at the beginning of each document text.
1077    Can be used to prepend the text with an instruction, as required by some embedding models,
1078    such as E5 and bge.
1079  - **suffix** (<code>str</code>) – A string to add at the end of each document text.
1080  - **batch_size** (<code>int</code>) – Number of documents to embed at once.
1081  - **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when embedding documents.
1082  - **normalize_embeddings** (<code>bool</code>) – If `True`, the embeddings are normalized using L2 normalization, so that each embedding has a norm of 1.
1083  - **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
1084  - **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
1085  - **trust_remote_code** (<code>bool</code>) – If `False`, allows only Hugging Face verified model architectures.
1086    If `True`, allows custom models and scripts.
1087  - **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
1088  - **truncate_dim** (<code>int | None</code>) – The dimension to truncate sentence embeddings to. `None` does no truncation.
1089    If the model wasn't trained with Matryoshka Representation Learning,
1090    truncating embeddings can significantly affect performance.
1091  - **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
1092    when loading the model. Refer to specific model documentation for available kwargs.
1093  - **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
1094    Refer to specific model documentation for available kwargs.
1095  - **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
1096  - **precision** (<code>Literal['float32', 'int8', 'uint8', 'binary', 'ubinary']</code>) – The precision to use for the embeddings.
1097    All non-float32 precisions are quantized embeddings.
1098    Quantized embeddings are smaller and faster to compute, but may have a lower accuracy.
1099    They are useful for reducing the size of the embeddings of a corpus for semantic search, among other tasks.
1100  - **encode_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `SentenceTransformer.encode` when embedding documents.
1101    This parameter is provided for fine customization. Be careful not to clash with already set parameters and
1102    avoid passing parameters that change the output type.
1103  - **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
1104    Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
1105    for more information on acceleration and quantization options.
1106  - **revision** (<code>str | None</code>) – The specific model version to use. It can be a branch name, a tag name, or a commit id,
1107    for a stored model on Hugging Face.
1108  
1109  #### to_dict
1110  
1111  ```python
1112  to_dict() -> dict[str, Any]
1113  ```
1114  
1115  Serializes the component to a dictionary.
1116  
1117  **Returns:**
1118  
1119  - <code>dict\[str, Any\]</code> – Dictionary with serialized data.
1120  
1121  #### from_dict
1122  
1123  ```python
1124  from_dict(data: dict[str, Any]) -> SentenceTransformersDocumentEmbedder
1125  ```
1126  
1127  Deserializes the component from a dictionary.
1128  
1129  **Parameters:**
1130  
1131  - **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.
1132  
1133  **Returns:**
1134  
1135  - <code>SentenceTransformersDocumentEmbedder</code> – Deserialized component.
1136  
1137  #### warm_up
1138  
1139  ```python
1140  warm_up()
1141  ```
1142  
1143  Initializes the component.
1144  
1145  #### run
1146  
1147  ```python
1148  run(documents: list[Document])
1149  ```
1150  
1151  Embed a list of documents.
1152  
1153  **Parameters:**
1154  
1155  - **documents** (<code>list\[Document\]</code>) – Documents to embed.
1156  
1157  **Returns:**
1158  
1159  - – A dictionary with the following keys:
1160  - `documents`: Documents with embeddings.
1161  
1162  ## sentence_transformers_sparse_document_embedder
1163  
1164  ### SentenceTransformersSparseDocumentEmbedder
1165  
1166  Calculates document sparse embeddings using sparse embedding models from Sentence Transformers.
1167  
1168  It stores the sparse embeddings in the `sparse_embedding` metadata field of each document.
1169  You can also embed documents' metadata.
1170  Use this component in indexing pipelines to embed input documents
1171  and send them to DocumentWriter to write a into a Document Store.
1172  
1173  ### Usage example:
1174  
1175  ```python
1176  from haystack import Document
1177  from haystack.components.embedders import SentenceTransformersSparseDocumentEmbedder
1178  
1179  doc = Document(content="I love pizza!")
1180  doc_embedder = SentenceTransformersSparseDocumentEmbedder()
1181  
1182  result = doc_embedder.run([doc])
1183  print(result['documents'][0].sparse_embedding)
1184  
1185  # SparseEmbedding(indices=[999, 1045, ...], values=[0.918, 0.867, ...])
1186  ```
1187  
1188  #### __init__
1189  
1190  ```python
1191  __init__(
1192      *,
1193      model: str = "prithivida/Splade_PP_en_v2",
1194      device: ComponentDevice | None = None,
1195      token: Secret | None = Secret.from_env_var(
1196          ["HF_API_TOKEN", "HF_TOKEN"], strict=False
1197      ),
1198      prefix: str = "",
1199      suffix: str = "",
1200      batch_size: int = 32,
1201      progress_bar: bool = True,
1202      meta_fields_to_embed: list[str] | None = None,
1203      embedding_separator: str = "\n",
1204      trust_remote_code: bool = False,
1205      local_files_only: bool = False,
1206      model_kwargs: dict[str, Any] | None = None,
1207      tokenizer_kwargs: dict[str, Any] | None = None,
1208      config_kwargs: dict[str, Any] | None = None,
1209      backend: Literal["torch", "onnx", "openvino"] = "torch",
1210      revision: str | None = None
1211  )
1212  ```
1213  
1214  Creates a SentenceTransformersSparseDocumentEmbedder component.
1215  
1216  **Parameters:**
1217  
1218  - **model** (<code>str</code>) – The model to use for calculating sparse embeddings.
1219    Pass a local path or ID of the model on Hugging Face.
1220  - **device** (<code>ComponentDevice | None</code>) – The device to use for loading the model.
1221    Overrides the default device.
1222  - **token** (<code>Secret | None</code>) – The API token to download private models from Hugging Face.
1223  - **prefix** (<code>str</code>) – A string to add at the beginning of each document text.
1224  - **suffix** (<code>str</code>) – A string to add at the end of each document text.
1225  - **batch_size** (<code>int</code>) – Number of documents to embed at once.
1226  - **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar when embedding documents.
1227  - **meta_fields_to_embed** (<code>list\[str\] | None</code>) – List of metadata fields to embed along with the document text.
1228  - **embedding_separator** (<code>str</code>) – Separator used to concatenate the metadata fields to the document text.
1229  - **trust_remote_code** (<code>bool</code>) – If `False`, allows only Hugging Face verified model architectures.
1230    If `True`, allows custom models and scripts.
1231  - **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
1232  - **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
1233    when loading the model. Refer to specific model documentation for available kwargs.
1234  - **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
1235    Refer to specific model documentation for available kwargs.
1236  - **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
1237  - **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
1238    Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
1239    for more information on acceleration and quantization options.
1240  - **revision** (<code>str | None</code>) – The specific model version to use. It can be a branch name, a tag name, or a commit id,
1241    for a stored model on Hugging Face.
1242  
1243  #### to_dict
1244  
1245  ```python
1246  to_dict() -> dict[str, Any]
1247  ```
1248  
1249  Serializes the component to a dictionary.
1250  
1251  **Returns:**
1252  
1253  - <code>dict\[str, Any\]</code> – Dictionary with serialized data.
1254  
1255  #### from_dict
1256  
1257  ```python
1258  from_dict(data: dict[str, Any]) -> SentenceTransformersSparseDocumentEmbedder
1259  ```
1260  
1261  Deserializes the component from a dictionary.
1262  
1263  **Parameters:**
1264  
1265  - **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.
1266  
1267  **Returns:**
1268  
1269  - <code>SentenceTransformersSparseDocumentEmbedder</code> – Deserialized component.
1270  
1271  #### warm_up
1272  
1273  ```python
1274  warm_up()
1275  ```
1276  
1277  Initializes the component.
1278  
1279  #### run
1280  
1281  ```python
1282  run(documents: list[Document])
1283  ```
1284  
1285  Embed a list of documents.
1286  
1287  **Parameters:**
1288  
1289  - **documents** (<code>list\[Document\]</code>) – Documents to embed.
1290  
1291  **Returns:**
1292  
1293  - – A dictionary with the following keys:
1294  - `documents`: Documents with sparse embeddings under the `sparse_embedding` field.
1295  
1296  ## sentence_transformers_sparse_text_embedder
1297  
1298  ### SentenceTransformersSparseTextEmbedder
1299  
1300  Embeds strings using sparse embedding models from Sentence Transformers.
1301  
1302  You can use it to embed user query and send it to a sparse embedding retriever.
1303  
1304  Usage example:
1305  
1306  ```python
1307  from haystack.components.embedders import SentenceTransformersSparseTextEmbedder
1308  
1309  text_to_embed = "I love pizza!"
1310  
1311  text_embedder = SentenceTransformersSparseTextEmbedder()
1312  
1313  print(text_embedder.run(text_to_embed))
1314  
1315  # {'sparse_embedding': SparseEmbedding(indices=[999, 1045, ...], values=[0.918, 0.867, ...])}
1316  ```
1317  
1318  #### __init__
1319  
1320  ```python
1321  __init__(
1322      *,
1323      model: str = "prithivida/Splade_PP_en_v2",
1324      device: ComponentDevice | None = None,
1325      token: Secret | None = Secret.from_env_var(
1326          ["HF_API_TOKEN", "HF_TOKEN"], strict=False
1327      ),
1328      prefix: str = "",
1329      suffix: str = "",
1330      trust_remote_code: bool = False,
1331      local_files_only: bool = False,
1332      model_kwargs: dict[str, Any] | None = None,
1333      tokenizer_kwargs: dict[str, Any] | None = None,
1334      config_kwargs: dict[str, Any] | None = None,
1335      backend: Literal["torch", "onnx", "openvino"] = "torch",
1336      revision: str | None = None
1337  )
1338  ```
1339  
1340  Create a SentenceTransformersSparseTextEmbedder component.
1341  
1342  **Parameters:**
1343  
1344  - **model** (<code>str</code>) – The model to use for calculating sparse embeddings.
1345    Specify the path to a local model or the ID of the model on Hugging Face.
1346  - **device** (<code>ComponentDevice | None</code>) – Overrides the default device used to load the model.
1347  - **token** (<code>Secret | None</code>) – An API token to use private models from Hugging Face.
1348  - **prefix** (<code>str</code>) – A string to add at the beginning of each text to be embedded.
1349  - **suffix** (<code>str</code>) – A string to add at the end of each text to embed.
1350  - **trust_remote_code** (<code>bool</code>) – If `False`, permits only Hugging Face verified model architectures.
1351    If `True`, permits custom models and scripts.
1352  - **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
1353  - **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
1354    when loading the model. Refer to specific model documentation for available kwargs.
1355  - **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
1356    Refer to specific model documentation for available kwargs.
1357  - **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
1358  - **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
1359    Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
1360    for more information on acceleration and quantization options.
1361  - **revision** (<code>str | None</code>) – The specific model version to use. It can be a branch name, a tag name, or a commit id,
1362    for a stored model on Hugging Face.
1363  
1364  #### to_dict
1365  
1366  ```python
1367  to_dict() -> dict[str, Any]
1368  ```
1369  
1370  Serializes the component to a dictionary.
1371  
1372  **Returns:**
1373  
1374  - <code>dict\[str, Any\]</code> – Dictionary with serialized data.
1375  
1376  #### from_dict
1377  
1378  ```python
1379  from_dict(data: dict[str, Any]) -> SentenceTransformersSparseTextEmbedder
1380  ```
1381  
1382  Deserializes the component from a dictionary.
1383  
1384  **Parameters:**
1385  
1386  - **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.
1387  
1388  **Returns:**
1389  
1390  - <code>SentenceTransformersSparseTextEmbedder</code> – Deserialized component.
1391  
1392  #### warm_up
1393  
1394  ```python
1395  warm_up()
1396  ```
1397  
1398  Initializes the component.
1399  
1400  #### run
1401  
1402  ```python
1403  run(text: str)
1404  ```
1405  
1406  Embed a single string.
1407  
1408  **Parameters:**
1409  
1410  - **text** (<code>str</code>) – Text to embed.
1411  
1412  **Returns:**
1413  
1414  - – A dictionary with the following keys:
1415  - `sparse_embedding`: The sparse embedding of the input text.
1416  
1417  ## sentence_transformers_text_embedder
1418  
1419  ### SentenceTransformersTextEmbedder
1420  
1421  Embeds strings using Sentence Transformers models.
1422  
1423  You can use it to embed user query and send it to an embedding retriever.
1424  
1425  Usage example:
1426  
1427  ```python
1428  from haystack.components.embedders import SentenceTransformersTextEmbedder
1429  
1430  text_to_embed = "I love pizza!"
1431  
1432  text_embedder = SentenceTransformersTextEmbedder()
1433  
1434  print(text_embedder.run(text_to_embed))
1435  
1436  # {'embedding': [-0.07804739475250244, 0.1498992145061493,, ...]}
1437  ```
1438  
1439  #### __init__
1440  
1441  ```python
1442  __init__(
1443      model: str = "sentence-transformers/all-mpnet-base-v2",
1444      device: ComponentDevice | None = None,
1445      token: Secret | None = Secret.from_env_var(
1446          ["HF_API_TOKEN", "HF_TOKEN"], strict=False
1447      ),
1448      prefix: str = "",
1449      suffix: str = "",
1450      batch_size: int = 32,
1451      progress_bar: bool = True,
1452      normalize_embeddings: bool = False,
1453      trust_remote_code: bool = False,
1454      local_files_only: bool = False,
1455      truncate_dim: int | None = None,
1456      model_kwargs: dict[str, Any] | None = None,
1457      tokenizer_kwargs: dict[str, Any] | None = None,
1458      config_kwargs: dict[str, Any] | None = None,
1459      precision: Literal[
1460          "float32", "int8", "uint8", "binary", "ubinary"
1461      ] = "float32",
1462      encode_kwargs: dict[str, Any] | None = None,
1463      backend: Literal["torch", "onnx", "openvino"] = "torch",
1464      revision: str | None = None,
1465  )
1466  ```
1467  
1468  Create a SentenceTransformersTextEmbedder component.
1469  
1470  **Parameters:**
1471  
1472  - **model** (<code>str</code>) – The model to use for calculating embeddings.
1473    Specify the path to a local model or the ID of the model on Hugging Face.
1474  - **device** (<code>ComponentDevice | None</code>) – Overrides the default device used to load the model.
1475  - **token** (<code>Secret | None</code>) – An API token to use private models from Hugging Face.
1476  - **prefix** (<code>str</code>) – A string to add at the beginning of each text to be embedded.
1477    You can use it to prepend the text with an instruction, as required by some embedding models,
1478    such as E5 and bge.
1479  - **suffix** (<code>str</code>) – A string to add at the end of each text to embed.
1480  - **batch_size** (<code>int</code>) – Number of texts to embed at once.
1481  - **progress_bar** (<code>bool</code>) – If `True`, shows a progress bar for calculating embeddings.
1482    If `False`, disables the progress bar.
1483  - **normalize_embeddings** (<code>bool</code>) – If `True`, the embeddings are normalized using L2 normalization, so that the embeddings have a norm of 1.
1484  - **trust_remote_code** (<code>bool</code>) – If `False`, permits only Hugging Face verified model architectures.
1485    If `True`, permits custom models and scripts.
1486  - **local_files_only** (<code>bool</code>) – If `True`, does not attempt to download the model from Hugging Face Hub and only looks at local files.
1487  - **truncate_dim** (<code>int | None</code>) – The dimension to truncate sentence embeddings to. `None` does no truncation.
1488    If the model has not been trained with Matryoshka Representation Learning,
1489    truncation of embeddings can significantly affect performance.
1490  - **model_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoModelForSequenceClassification.from_pretrained`
1491    when loading the model. Refer to specific model documentation for available kwargs.
1492  - **tokenizer_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoTokenizer.from_pretrained` when loading the tokenizer.
1493    Refer to specific model documentation for available kwargs.
1494  - **config_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `AutoConfig.from_pretrained` when loading the model configuration.
1495  - **precision** (<code>Literal['float32', 'int8', 'uint8', 'binary', 'ubinary']</code>) – The precision to use for the embeddings.
1496    All non-float32 precisions are quantized embeddings.
1497    Quantized embeddings are smaller in size and faster to compute, but may have a lower accuracy.
1498    They are useful for reducing the size of the embeddings of a corpus for semantic search, among other tasks.
1499  - **encode_kwargs** (<code>dict\[str, Any\] | None</code>) – Additional keyword arguments for `SentenceTransformer.encode` when embedding texts.
1500    This parameter is provided for fine customization. Be careful not to clash with already set parameters and
1501    avoid passing parameters that change the output type.
1502  - **backend** (<code>Literal['torch', 'onnx', 'openvino']</code>) – The backend to use for the Sentence Transformers model. Choose from "torch", "onnx", or "openvino".
1503    Refer to the [Sentence Transformers documentation](https://sbert.net/docs/sentence_transformer/usage/efficiency.html)
1504    for more information on acceleration and quantization options.
1505  - **revision** (<code>str | None</code>) – The specific model version to use. It can be a branch name, a tag name, or a commit id,
1506    for a stored model on Hugging Face.
1507  
1508  #### to_dict
1509  
1510  ```python
1511  to_dict() -> dict[str, Any]
1512  ```
1513  
1514  Serializes the component to a dictionary.
1515  
1516  **Returns:**
1517  
1518  - <code>dict\[str, Any\]</code> – Dictionary with serialized data.
1519  
1520  #### from_dict
1521  
1522  ```python
1523  from_dict(data: dict[str, Any]) -> SentenceTransformersTextEmbedder
1524  ```
1525  
1526  Deserializes the component from a dictionary.
1527  
1528  **Parameters:**
1529  
1530  - **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.
1531  
1532  **Returns:**
1533  
1534  - <code>SentenceTransformersTextEmbedder</code> – Deserialized component.
1535  
1536  #### warm_up
1537  
1538  ```python
1539  warm_up()
1540  ```
1541  
1542  Initializes the component.
1543  
1544  #### run
1545  
1546  ```python
1547  run(text: str)
1548  ```
1549  
1550  Embed a single string.
1551  
1552  **Parameters:**
1553  
1554  - **text** (<code>str</code>) – Text to embed.
1555  
1556  **Returns:**
1557  
1558  - – A dictionary with the following keys:
1559  - `embedding`: The embedding of the input text.