---
title: "FastembedSparseDocumentEmbedder"
id: fastembedsparsedocumentembedder
slug: "/fastembedsparsedocumentembedder"
description: "Use this component to enrich a list of documents with their sparse embeddings."
---

# FastembedSparseDocumentEmbedder

Use this component to enrich a list of documents with their sparse embeddings.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory run variables** | `documents`: A list of documents |
| **Output variables** | `documents`: A list of documents (enriched with sparse embeddings) |
| **API reference** | [FastEmbed](/reference/fastembed-embedders) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/fastembed |

</div>

To compute a sparse embedding for a string, use the [`FastembedSparseTextEmbedder`](fastembedsparsetextembedder.mdx) instead.

## Overview

`FastembedSparseDocumentEmbedder` computes the sparse embeddings of a list of documents and stores the resulting vectors in the `sparse_embedding` field of each document. It uses sparse embedding [models](https://qdrant.github.io/fastembed/examples/Supported_Models/#supported-sparse-text-embedding-models) supported by FastEmbed.

The vectors calculated by this component are necessary for performing sparse embedding retrieval on a set of documents. During retrieval, the sparse vector representing the query is compared to those of the documents to identify the most similar or relevant ones.

### Compatible models

You can find the supported models in the [FastEmbed documentation](https://qdrant.github.io/fastembed/examples/Supported_Models/#supported-sparse-text-embedding-models).
Currently, supported models are based on SPLADE, a technique for producing sparse representations for text, where each non-zero value in the embedding is the importance weight of a term in the BERT WordPiece vocabulary. For more information, see [our docs](../retrievers.mdx#sparse-embedding-based-retrievers) that explain sparse embedding-based Retrievers further.

### Installation

To start using this integration with Haystack, install the package with:

```shell
pip install fastembed-haystack
```

### Parameters

You can set the directory where the model is cached, as well as the number of threads a single `onnxruntime` session can use:

```python
from haystack_integrations.components.embedders.fastembed import (
    FastembedSparseDocumentEmbedder,
)

cache_dir = "/your_cacheDirectory"

embedder = FastembedSparseDocumentEmbedder(
    model="prithivida/Splade_PP_en_v1",
    cache_dir=cache_dir,
    threads=2,
)
```

If you want to use data-parallel encoding, you can set the `parallel` and `batch_size` parameters:

- If `parallel` > 1, data-parallel encoding will be used. This is recommended for offline encoding of large datasets.
- If `parallel` is 0, all available cores are used.
- If `parallel` is `None`, data-parallel processing is not used; default `onnxruntime` threading is used instead.

:::tip
If you create both a Sparse Text Embedder and a Sparse Document Embedder based on the same model, Haystack utilizes a shared resource behind the scenes to conserve resources.
:::

### Embedding Metadata

Text documents often include metadata. If the metadata is distinctive and semantically meaningful, you can embed it along with the document's text to improve retrieval.
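Conceptually, the embedder prepends the selected metadata values to the document's text before computing the embedding. A pure-Python sketch of that text-preparation step (illustrative only; it mirrors, rather than calls, the component's internal logic, and assumes a newline separator like Haystack's default `embedding_separator`):

```python
def text_to_embed(
    content: str,
    meta: dict,
    meta_fields_to_embed: list[str],
    separator: str = "\n",
) -> str:
    """Join the selected meta values and the document content into one string."""
    meta_values = [
        str(meta[field]) for field in meta_fields_to_embed if meta.get(field) is not None
    ]
    return separator.join(meta_values + [content])

print(text_to_embed("some text", {"title": "relevant title", "page number": 18}, ["title"]))
# relevant title
# some text
```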
You can do this easily by using the sparse Document Embedder:

```python
from haystack import Document
from haystack_integrations.components.embedders.fastembed import (
    FastembedSparseDocumentEmbedder,
)

doc = Document(
    content="some text",
    meta={"title": "relevant title", "page number": 18},
)

embedder = FastembedSparseDocumentEmbedder(
    model="prithivida/Splade_PP_en_v1",
    meta_fields_to_embed=["title"],
)

docs_w_sparse_embeddings = embedder.run(documents=[doc])["documents"]
```

## Usage

### On its own

```python
from haystack.dataclasses import Document
from haystack_integrations.components.embedders.fastembed import (
    FastembedSparseDocumentEmbedder,
)

document_list = [
    Document(content="I love pizza!"),
    Document(content="I like spaghetti"),
]

doc_embedder = FastembedSparseDocumentEmbedder()

result = doc_embedder.run(documents=document_list)
print(result["documents"][0])

# Document(id=...,
#  content: 'I love pizza!',
#  sparse_embedding: vector with 24 non-zero elements)
```

### In a pipeline

Currently, sparse embedding retrieval is only supported by `QdrantDocumentStore`.
First, install the package with:

```shell
pip install qdrant-haystack
```

Then, try out this pipeline:

```python
from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.components.embedders.fastembed import (
    FastembedSparseDocumentEmbedder,
    FastembedSparseTextEmbedder,
)
from haystack_integrations.components.retrievers.qdrant import (
    QdrantSparseEmbeddingRetriever,
)
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

document_store = QdrantDocumentStore(
    ":memory:",
    recreate_index=True,
    use_sparse_embeddings=True,
)

documents = [
    Document(content="My name is Wolfgang and I live in Berlin"),
    Document(content="I saw a black horse running"),
    Document(content="Germany has many big cities"),
    Document(content="fastembed is supported by and maintained by Qdrant."),
]

sparse_document_embedder = FastembedSparseDocumentEmbedder()
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("sparse_document_embedder", sparse_document_embedder)
indexing_pipeline.add_component("writer", writer)
indexing_pipeline.connect("sparse_document_embedder", "writer")

indexing_pipeline.run({"sparse_document_embedder": {"documents": documents}})

query_pipeline = Pipeline()
query_pipeline.add_component("sparse_text_embedder", FastembedSparseTextEmbedder())
query_pipeline.add_component(
    "sparse_retriever",
    QdrantSparseEmbeddingRetriever(document_store=document_store),
)
query_pipeline.connect(
    "sparse_text_embedder.sparse_embedding",
    "sparse_retriever.query_sparse_embedding",
)

query = "Who supports fastembed?"

result = query_pipeline.run({"sparse_text_embedder": {"text": query}})

print(result["sparse_retriever"]["documents"][0])

# Document(id=...,
#  content: 'fastembed is supported by and maintained by Qdrant.',
#  score: 0.758..)
```

## Additional References

🧑‍🍳 Cookbook: [Sparse Embedding Retrieval with Qdrant and FastEmbed](https://haystack.deepset.ai/cookbook/sparse_embedding_retrieval)