---
title: "FastembedDocumentEmbedder"
id: fastembeddocumentembedder
slug: "/fastembeddocumentembedder"
description: "This component computes the embeddings of a list of documents using the models supported by FastEmbed."
---

# FastembedDocumentEmbedder

This component computes the embeddings of a list of documents using the models supported by FastEmbed.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory run variables** | `documents`: A list of documents |
| **Output variables** | `documents`: A list of documents (enriched with embeddings) |
| **API reference** | [FastEmbed](/reference/fastembed-embedders) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/fastembed |

</div>

This component should be used to embed a list of documents. To embed a string, use the [`FastembedTextEmbedder`](fastembedtextembedder.mdx) instead.

## Overview

`FastembedDocumentEmbedder` computes the embeddings of a list of documents and stores the resulting vectors in the `embedding` field of each document. It uses embedding [models supported by FastEmbed](https://qdrant.github.io/fastembed/examples/Supported_Models/).

The vectors computed by this component are necessary to perform embedding retrieval on a collection of documents. At retrieval time, the vector representing the query is compared with those of the documents to find the most similar or relevant ones.

### Compatible models

You can find the original models in the [FastEmbed documentation](https://qdrant.github.io/fastembed/).

Most of the models in the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) are compatible with FastEmbed.
You can check for compatibility in the [supported model list](https://qdrant.github.io/fastembed/examples/Supported_Models/).

### Installation

To start using this integration with Haystack, install the package with:

```shell
pip install fastembed-haystack
```

### Parameters

You can set the path to the cache directory where the model is stored, as well as the number of threads a single `onnxruntime` session can use:

```python
from haystack_integrations.components.embedders.fastembed import (
    FastembedDocumentEmbedder,
)

cache_dir = "/your_cacheDirectory"
embedder = FastembedDocumentEmbedder(
    model="BAAI/bge-large-en-v1.5",
    cache_dir=cache_dir,
    threads=2,
)
```

If you want to use data-parallel encoding, you can set the `parallel` and `batch_size` parameters:

- If `parallel` > 1, data-parallel encoding is used. This is recommended for offline encoding of large datasets.
- If `parallel` is 0, all available cores are used.
- If `parallel` is `None`, data-parallel processing is disabled and the default `onnxruntime` threading is used instead.

:::tip
If you create a Text Embedder and a Document Embedder based on the same model, Haystack uses the same resource behind the scenes to save resources.
:::

### Embedding Metadata

Text documents often come with a set of metadata. If they are distinctive and semantically meaningful, you can embed them along with the text of the document to improve retrieval.
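Conceptually, the embedder builds the text to encode by prepending the selected meta fields to the document's content. The helper below is a rough sketch of that behavior, not the integration's actual code; the joining logic and the newline separator are assumptions for illustration:

```python
# Hypothetical helper that mimics how selected meta fields are combined
# with a document's content before being passed to the embedding model.
def text_to_embed(content, meta, fields_to_embed, separator="\n"):
    values = [str(meta[field]) for field in fields_to_embed if meta.get(field) is not None]
    return separator.join(values + [content])


print(text_to_embed(
    content="some text",
    meta={"title": "relevant title", "page number": 18},
    fields_to_embed=["title"],
))
# relevant title
# some text
```

Note that only `"title"` contributes to the embedded text here; a field like `"page number"` is best left out, since numeric metadata is rarely semantically meaningful.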
You can do this easily by using the Document Embedder:

```python
from haystack import Document
from haystack_integrations.components.embedders.fastembed import (
    FastembedDocumentEmbedder,
)

doc = Document(
    content="some text",
    meta={"title": "relevant title", "page number": 18},
)

embedder = FastembedDocumentEmbedder(
    model="BAAI/bge-small-en-v1.5",
    batch_size=256,
    meta_fields_to_embed=["title"],
)

docs_w_embeddings = embedder.run(documents=[doc])["documents"]
```

## Usage

### On its own

```python
from haystack.dataclasses import Document
from haystack_integrations.components.embedders.fastembed import (
    FastembedDocumentEmbedder,
)

document_list = [
    Document(content="I love pizza!"),
    Document(content="I like spaghetti"),
]

doc_embedder = FastembedDocumentEmbedder()

result = doc_embedder.run(documents=document_list)
print(result["documents"][0].embedding)

## [-0.04235665127635002, 0.021791068837046623, ...]
```

### In a pipeline

```python
from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.components.embedders.fastembed import (
    FastembedDocumentEmbedder,
    FastembedTextEmbedder,
)

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

documents = [
    Document(content="My name is Wolfgang and I live in Berlin"),
    Document(content="I saw a black horse running"),
    Document(content="Germany has many big cities"),
    Document(content="fastembed is supported by and maintained by Qdrant."),
]

document_embedder = FastembedDocumentEmbedder()
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("document_embedder", document_embedder)
indexing_pipeline.add_component("writer", writer)
indexing_pipeline.connect("document_embedder", "writer")

indexing_pipeline.run({"document_embedder": {"documents": documents}})

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", FastembedTextEmbedder())
query_pipeline.add_component(
    "retriever",
    InMemoryEmbeddingRetriever(document_store=document_store),
)
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "Who supports fastembed?"

result = query_pipeline.run({"text_embedder": {"text": query}})

print(result["retriever"]["documents"][0])

## Document(id=...,
## content: 'fastembed is supported by and maintained by Qdrant.',
## score: 0.758..)
```

## Additional References

🧑‍🍳 Cookbook: [RAG Pipeline Using FastEmbed for Embeddings Generation](https://haystack.deepset.ai/cookbook/rag_fastembed)