---
title: "FastembedSparseDocumentEmbedder"
id: fastembedsparsedocumentembedder
slug: "/fastembedsparsedocumentembedder"
description: "Use this component to enrich a list of documents with their sparse embeddings."
---

# FastembedSparseDocumentEmbedder

Use this component to enrich a list of documents with their sparse embeddings.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory run variables** | `documents`: A list of documents |
| **Output variables** | `documents`: A list of documents (enriched with sparse embeddings) |
| **API reference** | [FastEmbed](/reference/fastembed-embedders) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/fastembed |

</div>

To compute a sparse embedding for a string, use the [`FastembedSparseTextEmbedder`](fastembedsparsetextembedder.mdx) instead.

## Overview

`FastembedSparseDocumentEmbedder` computes the sparse embeddings of a list of documents and stores the resulting vectors in the `sparse_embedding` field of each document. It uses sparse embedding [models](https://qdrant.github.io/fastembed/examples/Supported_Models/#supported-sparse-text-embedding-models) supported by FastEmbed.

The vectors calculated by this component are necessary for performing sparse embedding retrieval on a set of documents. During retrieval, the sparse vector representing the query is compared to those of the documents to identify the most similar or relevant ones.

### Compatible models

You can find the supported models in the [FastEmbed documentation](https://qdrant.github.io/fastembed/examples/Supported_Models/#supported-sparse-text-embedding-models).
Currently, supported models are based on SPLADE, a technique for producing sparse representations for text, where each non-zero value in the embedding is the importance weight of a term in the BERT WordPiece vocabulary. For more information, see [our docs](../retrievers.mdx#sparse-embedding-based-retrievers) that explain sparse embedding-based Retrievers further.

### Installation

To start using this integration with Haystack, install the package with:

```shell
pip install fastembed-haystack
```

### Parameters

You can set the directory where the model is cached, as well as the number of threads a single `onnxruntime` session can use:

```python
from haystack_integrations.components.embedders.fastembed import (
    FastembedSparseDocumentEmbedder,
)

cache_dir = "/your_cacheDirectory"

embedder = FastembedSparseDocumentEmbedder(
    model="prithivida/Splade_PP_en_v1",
    cache_dir=cache_dir,
    threads=2,
)
```

If you want to use data-parallel encoding, you can set the `parallel` and `batch_size` parameters:

- If `parallel` > 1, data-parallel encoding will be used. This is recommended for offline encoding of large datasets.
- If `parallel` is 0, all available cores are used.
- If `parallel` is `None`, data-parallel processing is not used; default `onnxruntime` threading is used instead.

:::tip
If you create both a Sparse Text Embedder and a Sparse Document Embedder based on the same model, Haystack utilizes a shared resource behind the scenes to conserve resources.
:::

### Embedding Metadata

Text documents often include metadata. If the metadata is distinctive and semantically meaningful, you can embed it along with the document's text to improve retrieval.
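Conceptually, the embedder prepends the selected metadata values to the document's text before computing the embedding. A pure-Python sketch of that text-preparation step (illustrative only; it mirrors, rather than calls, the component's internal logic, and assumes a newline separator like Haystack's default `embedding_separator`):

```python
def text_to_embed(
    content: str,
    meta: dict,
    meta_fields_to_embed: list[str],
    separator: str = "\n",
) -> str:
    """Join the selected meta values and the document content into one string."""
    meta_values = [
        str(meta[field]) for field in meta_fields_to_embed if meta.get(field) is not None
    ]
    return separator.join(meta_values + [content])

print(text_to_embed("some text", {"title": "relevant title", "page number": 18}, ["title"]))
# relevant title
# some text
```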
You can do this easily by using the sparse Document Embedder:

```python
from haystack import Document
from haystack_integrations.components.embedders.fastembed import (
    FastembedSparseDocumentEmbedder,
)

doc = Document(
    content="some text",
    meta={"title": "relevant title", "page number": 18},
)

embedder = FastembedSparseDocumentEmbedder(
    model="prithivida/Splade_PP_en_v1",
    meta_fields_to_embed=["title"],
)

docs_w_sparse_embeddings = embedder.run(documents=[doc])["documents"]
```

## Usage

### On its own

```python
from haystack.dataclasses import Document
from haystack_integrations.components.embedders.fastembed import (
    FastembedSparseDocumentEmbedder,
)

document_list = [
    Document(content="I love pizza!"),
    Document(content="I like spaghetti"),
]

doc_embedder = FastembedSparseDocumentEmbedder()

result = doc_embedder.run(documents=document_list)
print(result["documents"][0])

# Document(id=...,
#  content: 'I love pizza!',
#  sparse_embedding: vector with 24 non-zero elements)
```

### In a pipeline

Currently, sparse embedding retrieval is only supported by `QdrantDocumentStore`.
First, install the package with:

```shell
pip install qdrant-haystack
```

Then, try out this pipeline:

```python
from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.components.embedders.fastembed import (
    FastembedSparseDocumentEmbedder,
    FastembedSparseTextEmbedder,
)
from haystack_integrations.components.retrievers.qdrant import (
    QdrantSparseEmbeddingRetriever,
)
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

document_store = QdrantDocumentStore(
    ":memory:",
    recreate_index=True,
    use_sparse_embeddings=True,
)

documents = [
    Document(content="My name is Wolfgang and I live in Berlin"),
    Document(content="I saw a black horse running"),
    Document(content="Germany has many big cities"),
    Document(content="fastembed is supported by and maintained by Qdrant."),
]

sparse_document_embedder = FastembedSparseDocumentEmbedder()
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("sparse_document_embedder", sparse_document_embedder)
indexing_pipeline.add_component("writer", writer)
indexing_pipeline.connect("sparse_document_embedder", "writer")

indexing_pipeline.run({"sparse_document_embedder": {"documents": documents}})

query_pipeline = Pipeline()
query_pipeline.add_component("sparse_text_embedder", FastembedSparseTextEmbedder())
query_pipeline.add_component(
    "sparse_retriever",
    QdrantSparseEmbeddingRetriever(document_store=document_store),
)
query_pipeline.connect(
    "sparse_text_embedder.sparse_embedding",
    "sparse_retriever.query_sparse_embedding",
)

query = "Who supports fastembed?"

result = query_pipeline.run({"sparse_text_embedder": {"text": query}})

print(result["sparse_retriever"]["documents"][0])

# Document(id=...,
#  content: 'fastembed is supported by and maintained by Qdrant.',
#  score: 0.758..)
```

## Additional References

🧑‍🍳 Cookbook: [Sparse Embedding Retrieval with Qdrant and FastEmbed](https://haystack.deepset.ai/cookbook/sparse_embedding_retrieval)