---
title: "VLLMDocumentEmbedder"
id: vllmdocumentembedder
slug: "/vllmdocumentembedder"
description: "This component computes the embeddings of a list of documents using models served with vLLM."
---

# VLLMDocumentEmbedder

This component computes the embeddings of a list of documents using models served with [vLLM](https://docs.vllm.ai/).

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | Before a [`DocumentWriter`](../writers/documentwriter.mdx) in an indexing pipeline |
| **Mandatory init variables** | `model`: The name of the model served by vLLM |
| **Mandatory run variables** | `documents`: A list of documents |
| **Output variables** | `documents`: A list of documents (enriched with embeddings) |
| **API reference** | [vLLM](/reference/integrations-vllm) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/vllm |

</div>

## Overview

[vLLM](https://docs.vllm.ai/) is a high-throughput and memory-efficient inference and serving engine for LLMs. It exposes an OpenAI-compatible HTTP server, which `VLLMDocumentEmbedder` uses to compute embeddings through the Embeddings API.

`VLLMDocumentEmbedder` computes the embeddings of a list of documents and stores the resulting vectors in the `embedding` field of each document. It expects a vLLM server to be running and reachable at the URL set by the `api_base_url` parameter (by default, `http://localhost:8000/v1`). To embed a single string, such as a query, use [`VLLMTextEmbedder`](vllmtextembedder.mdx) instead.

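For example, if your server listens at a different address (the host and port below are placeholders), point the component at it:

```python
from haystack_integrations.components.embedders.vllm import VLLMDocumentEmbedder

# hypothetical server address; replace with your own
embedder = VLLMDocumentEmbedder(
    model="google/embeddinggemma-300m",
    api_base_url="http://my-vllm-host:8000/v1",
)
```
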
The vectors computed by this component are necessary to perform embedding retrieval on a collection of documents. At retrieval time, the vector that represents the query is compared with those of the documents to find the most similar or relevant ones.

If the vLLM server was started with `--api-key`, provide the API key through the `VLLM_API_KEY` environment variable or the `api_key` init parameter using Haystack's [Secret](../../concepts/secret-management.mdx) API.

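For example, a minimal sketch that passes the key explicitly:

```python
from haystack.utils import Secret

from haystack_integrations.components.embedders.vllm import VLLMDocumentEmbedder

# read the key from the VLLM_API_KEY environment variable at runtime
embedder = VLLMDocumentEmbedder(
    model="google/embeddinggemma-300m",
    api_key=Secret.from_env_var("VLLM_API_KEY"),
)
```
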
### Compatible models

vLLM supports a range of embedding models. Check the [vLLM pooling models docs](https://docs.vllm.ai/en/stable/models/pooling_models) for the list of supported architectures and models.

### vLLM-specific parameters

You can pass vLLM-specific parameters through the `extra_parameters` dictionary. These are forwarded as `extra_body` to the OpenAI-compatible embeddings endpoint. Use this to pass parameters that are not part of the standard OpenAI Embeddings API, such as `truncate_prompt_tokens` or `truncation_side`. See the [vLLM Embeddings API docs](https://docs.vllm.ai/en/stable/models/pooling_models/embed/#openai-compatible-embeddings-api) for details.

```python
from haystack_integrations.components.embedders.vllm import VLLMDocumentEmbedder

embedder = VLLMDocumentEmbedder(
    model="google/embeddinggemma-300m",
    extra_parameters={"truncate_prompt_tokens": 256, "truncation_side": "right"},
)
```

### Matryoshka embeddings

If the model was trained with Matryoshka Representation Learning, you can reduce the dimensionality of the output vector through the `dimensions` parameter. See the [vLLM Matryoshka docs](https://docs.vllm.ai/en/stable/models/pooling_models/embed/#matryoshka-embeddings) for details.

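For example, a sketch that requests truncated vectors (assuming the model supports Matryoshka; check its model card):

```python
from haystack_integrations.components.embedders.vllm import VLLMDocumentEmbedder

# request 256-dimensional vectors instead of the model's full output size
embedder = VLLMDocumentEmbedder(
    model="google/embeddinggemma-300m",
    dimensions=256,
)
```
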
### Batching and failure handling

`VLLMDocumentEmbedder` encodes documents in batches. Use `batch_size` (default `32`) to control how many documents are sent in a single request to the vLLM server, and `progress_bar` to toggle the progress indicator.

By default (`raise_on_failure=False`), failed embedding requests are logged and processing continues with the remaining documents. Set `raise_on_failure=True` to raise an exception instead.

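A minimal sketch combining these options:

```python
from haystack_integrations.components.embedders.vllm import VLLMDocumentEmbedder

embedder = VLLMDocumentEmbedder(
    model="google/embeddinggemma-300m",
    batch_size=64,          # documents sent per request to the server
    progress_bar=False,     # hide the progress indicator
    raise_on_failure=True,  # raise on a failed request instead of logging it
)
```
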
### Instructions

Some embedding models work better for retrieval when an instruction is prepended to the document text. For example, if you use [intfloat/e5-large-v2](https://huggingface.co/intfloat/e5-large-v2), prefix each document with the instruction "passage: " (including the trailing space).

This is how it works with `VLLMDocumentEmbedder`:

```python
from haystack_integrations.components.embedders.vllm import VLLMDocumentEmbedder

instruction = "passage: "
embedder = VLLMDocumentEmbedder(
    model="intfloat/e5-large-v2",
    prefix=instruction,
)
```

### Embedding metadata

Documents often come with a set of metadata. If they are distinctive and semantically meaningful, you can embed them along with the text of the document to improve retrieval. Pass the relevant fields through `meta_fields_to_embed`; they are concatenated to the document text using `embedding_separator` (a newline by default):

```python
from haystack import Document
from haystack_integrations.components.embedders.vllm import VLLMDocumentEmbedder

doc = Document(content="some text", meta={"title": "relevant title", "page_number": 18})

embedder = VLLMDocumentEmbedder(
    model="google/embeddinggemma-300m",
    meta_fields_to_embed=["title"],
)

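# the text sent for embedding is "relevant title\nsome text"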
docs_with_embeddings = embedder.run(documents=[doc])["documents"]
```

## Usage

Install the `vllm-haystack` package to use the `VLLMDocumentEmbedder`:

```shell
pip install vllm-haystack
```

### Starting the vLLM server

Before using this component, start a vLLM server with an embedding model:

```bash
vllm serve google/embeddinggemma-300m
```

For details on server options, see the [vLLM CLI docs](https://docs.vllm.ai/en/stable/cli/serve/).

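For example, a sketch that sets a custom port and requires an API key (matching the `--api-key` note in the Overview; verify flags against the CLI docs):

```bash
vllm serve google/embeddinggemma-300m --port 8000 --api-key my-secret-key
```
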
### On its own

```python
from haystack import Document
from haystack_integrations.components.embedders.vllm import VLLMDocumentEmbedder

doc = Document(content="I love pizza!")

document_embedder = VLLMDocumentEmbedder(model="google/embeddinggemma-300m")

result = document_embedder.run([doc])
print(result["documents"][0].embedding)

## [-0.0215301513671875, 0.01499176025390625, ...]
```

### In a pipeline

```python
from haystack import Document, Pipeline
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.components.embedders.vllm import (
    VLLMDocumentEmbedder,
    VLLMTextEmbedder,
)

document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

documents = [
    Document(content="My name is Wolfgang and I live in Berlin"),
    Document(content="I saw a black horse running"),
    Document(content="Germany has many big cities"),
]

document_embedder = VLLMDocumentEmbedder(model="google/embeddinggemma-300m")
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("document_embedder", document_embedder)
indexing_pipeline.add_component("writer", writer)
indexing_pipeline.connect("document_embedder", "writer")

indexing_pipeline.run({"document_embedder": {"documents": documents}})

query_pipeline = Pipeline()
query_pipeline.add_component(
    "text_embedder",
    VLLMTextEmbedder(model="google/embeddinggemma-300m"),
)
query_pipeline.add_component(
    "retriever",
    InMemoryEmbeddingRetriever(document_store=document_store),
)
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "Who lives in Berlin?"

result = query_pipeline.run({"text_embedder": {"text": query}})

print(result["retriever"]["documents"][0])

## Document(id=..., content: 'My name is Wolfgang and I live in Berlin', score: ...)
```