---
title: "FastembedDocumentEmbedder"
id: fastembeddocumentembedder
slug: "/fastembeddocumentembedder"
description: "This component computes the embeddings of a list of documents using the models supported by FastEmbed."
---

# FastembedDocumentEmbedder

This component computes the embeddings of a list of documents using the models supported by FastEmbed.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory run variables** | `documents`: A list of documents |
| **Output variables** | `documents`: A list of documents (enriched with embeddings) |
| **API reference** | [FastEmbed](/reference/fastembed-embedders) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/fastembed |

</div>

This component should be used to embed a list of documents. To embed a string, use the [`FastembedTextEmbedder`](fastembedtextembedder.mdx) instead.

## Overview

`FastembedDocumentEmbedder` computes the embeddings of a list of documents and stores the resulting vectors in the `embedding` field of each document. It uses embedding [models supported by FastEmbed](https://qdrant.github.io/fastembed/examples/Supported_Models/).

The vectors computed by this component are necessary to perform embedding retrieval on a collection of documents. At retrieval time, the vector representing the query is compared with those of the documents to find the most similar or relevant ones.

### Compatible models

You can find the original models in the [FastEmbed documentation](https://qdrant.github.io/fastembed/).

Most of the models in the [Massive Text Embedding Benchmark (MTEB) Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) are compatible with FastEmbed.
You can check for compatibility in the [supported model list](https://qdrant.github.io/fastembed/examples/Supported_Models/).

### Installation

To start using this integration with Haystack, install the package with:

```shell
pip install fastembed-haystack
```

### Parameters

You can set the path to the cache directory where the model is stored, as well as the number of threads a single `onnxruntime` session can use:

```python
from haystack_integrations.components.embedders.fastembed import (
    FastembedDocumentEmbedder,
)

cache_dir = "/your_cacheDirectory"
embedder = FastembedDocumentEmbedder(
    model="BAAI/bge-large-en-v1.5",
    cache_dir=cache_dir,
    threads=2,
)
```

If you want to use data-parallel encoding, you can set the `parallel` and `batch_size` parameters:

- If `parallel` > 1, data-parallel encoding is used. This is recommended for offline encoding of large datasets.
- If `parallel` is 0, all available cores are used.
- If `parallel` is `None`, data-parallel processing is disabled and the default `onnxruntime` threading is used instead.

:::tip
If you create a Text Embedder and a Document Embedder based on the same model, Haystack uses the same resource behind the scenes to save resources.
:::

### Embedding Metadata

Text documents often come with a set of metadata. If they are distinctive and semantically meaningful, you can embed them along with the text of the document to improve retrieval.
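Conceptually, the embedder builds the text to encode by prepending the selected meta fields to the document's content. The helper below is a rough sketch of that behavior, not the integration's actual code; the joining logic and the newline separator are assumptions for illustration:

```python
# Hypothetical helper that mimics how selected meta fields are combined
# with a document's content before being passed to the embedding model.
def text_to_embed(content, meta, fields_to_embed, separator="\n"):
    values = [str(meta[field]) for field in fields_to_embed if meta.get(field) is not None]
    return separator.join(values + [content])


print(text_to_embed(
    content="some text",
    meta={"title": "relevant title", "page number": 18},
    fields_to_embed=["title"],
))
# relevant title
# some text
```

Note that only `"title"` contributes to the embedded text here; a field like `"page number"` is best left out, since numeric metadata is rarely semantically meaningful.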
You can do this easily by using the Document Embedder:

```python
from haystack import Document
from haystack_integrations.components.embedders.fastembed import (
    FastembedDocumentEmbedder,
)

doc = Document(
    content="some text",
    meta={"title": "relevant title", "page number": 18},
)

embedder = FastembedDocumentEmbedder(
    model="BAAI/bge-small-en-v1.5",
    batch_size=256,
    meta_fields_to_embed=["title"],
)

docs_w_embeddings = embedder.run(documents=[doc])["documents"]
```

## Usage

### On its own

```python
from haystack.dataclasses import Document
from haystack_integrations.components.embedders.fastembed import (
    FastembedDocumentEmbedder,
)

document_list = [
    Document(content="I love pizza!"),
    Document(content="I like spaghetti"),
]

doc_embedder = FastembedDocumentEmbedder()

result = doc_embedder.run(documents=document_list)
print(result["documents"][0].embedding)

## [-0.04235665127635002, 0.021791068837046623, ...]
```

### In a pipeline

```python
from haystack import Document, Pipeline
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.components.embedders.fastembed import (
    FastembedDocumentEmbedder,
    FastembedTextEmbedder,
)

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

documents = [
    Document(content="My name is Wolfgang and I live in Berlin"),
    Document(content="I saw a black horse running"),
    Document(content="Germany has many big cities"),
    Document(content="fastembed is supported by and maintained by Qdrant."),
]

document_embedder = FastembedDocumentEmbedder()
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("document_embedder", document_embedder)
indexing_pipeline.add_component("writer", writer)
indexing_pipeline.connect("document_embedder", "writer")

indexing_pipeline.run({"document_embedder": {"documents": documents}})

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", FastembedTextEmbedder())
query_pipeline.add_component(
    "retriever",
    InMemoryEmbeddingRetriever(document_store=document_store),
)
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "Who supports fastembed?"

result = query_pipeline.run({"text_embedder": {"text": query}})

print(result["retriever"]["documents"][0])

## Document(id=...,
## content: 'fastembed is supported by and maintained by Qdrant.',
## score: 0.758..)
```

## Additional References

🧑‍🍳 Cookbook: [RAG Pipeline Using FastEmbed for Embeddings Generation](https://haystack.deepset.ai/cookbook/rag_fastembed)