---
title: "Document Stores"
id: document-stores-api
description: "Stores your texts and metadata and provides them to the Retriever at query time."
slug: "/document-stores-api"
---


## document_store

### BM25DocumentStats

A dataclass for managing document statistics for BM25 retrieval.

**Parameters:**

- **freq_token** (<code>dict\[str, int\]</code>) – A Counter of token frequencies in the document.
- **doc_len** (<code>int</code>) – Number of tokens in the document.
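
As a rough illustration of what these two fields hold, the sketch below builds them from a piece of text. `DocStatsSketch` and `build_stats` are hypothetical names used only here, and the lowercasing and tokenization steps are assumptions of the sketch rather than documented behavior.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class DocStatsSketch:
    """Hypothetical stand-in that mirrors the two documented fields."""

    freq_token: dict[str, int]
    doc_len: int


def build_stats(text: str) -> DocStatsSketch:
    # Assumption: lowercase the text, then keep word tokens of two or more characters.
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return DocStatsSketch(freq_token=dict(Counter(tokens)), doc_len=len(tokens))


stats = build_stats("the cat saw the other cat")
print(stats.freq_token)
print(stats.doc_len)
```

Keeping the per-document length next to the token counts lets a BM25 scorer apply length normalization without tokenizing the document again.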

### InMemoryDocumentStore

Stores data in memory. The store is ephemeral: its contents are lost when the process exits unless you write them out explicitly with `save_to_disk`.

#### __init__

```python
__init__(
    bm25_tokenization_regex: str = "(?u)\\b\\w\\w+\\b",
    bm25_algorithm: Literal["BM25Okapi", "BM25L", "BM25Plus"] = "BM25L",
    bm25_parameters: dict | None = None,
    embedding_similarity_function: Literal[
        "dot_product", "cosine"
    ] = "dot_product",
    index: str | None = None,
    async_executor: ThreadPoolExecutor | None = None,
    return_embedding: bool = True,
)
```

Initializes the DocumentStore.

**Parameters:**

- **bm25_tokenization_regex** (<code>str</code>) – The regular expression used to tokenize the text for BM25 retrieval.
- **bm25_algorithm** (<code>Literal['BM25Okapi', 'BM25L', 'BM25Plus']</code>) – The BM25 algorithm to use. One of "BM25Okapi", "BM25L", or "BM25Plus".
- **bm25_parameters** (<code>dict | None</code>) – Parameters for the BM25 implementation, given as a dictionary.
  For example: `{'k1':1.5, 'b':0.75, 'epsilon':0.25}`
  You can learn more about these parameters by visiting https://github.com/dorianbrown/rank_bm25.
- **embedding_similarity_function** (<code>Literal['dot_product', 'cosine']</code>) – The similarity function used to compare Document embeddings.
  One of "dot_product" (default) or "cosine". To choose the most appropriate function, look for information
  about your embedding model.
- **index** (<code>str | None</code>) – A specific index to store the documents. If not specified, a random UUID is used.
  Using the same index lets multiple InMemoryDocumentStore instances share the same documents.
- **async_executor** (<code>ThreadPoolExecutor | None</code>) – Optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded
  executor will be initialized and used.
- **return_embedding** (<code>bool</code>) – Whether to return the embedding of the retrieved Documents. Default is True.
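
To make the default `bm25_tokenization_regex` concrete, the sketch below applies the same pattern with Python's `re` module. The `tokenize` helper is hypothetical, and lowercasing the input is an assumption of this sketch, not something the reference above states.

```python
import re

# Default pattern from the signature above: runs of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"


def tokenize(text: str, pattern: str = DEFAULT_PATTERN) -> list[str]:
    # Lowercasing is an assumption of this sketch, not a documented behavior.
    return re.findall(pattern, text.lower())


default_tokens = tokenize("A quick brown fox, v2 is OK!")
print(default_tokens)

# A custom pattern that also keeps single-character tokens.
custom_tokens = tokenize("A quick fox", pattern=r"(?u)\b\w+\b")
print(custom_tokens)
```

With the default pattern, single-character tokens such as "A" are dropped, so a custom pattern is worth considering if one-letter terms matter for your queries.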

#### shutdown

```python
shutdown()
```

Explicitly shuts down the executor if this instance owns it.

#### storage

```python
storage: dict[str, Document]
```

Utility property that returns the storage used by this instance of InMemoryDocumentStore.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> InMemoryDocumentStore
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>InMemoryDocumentStore</code> – The deserialized component.

#### save_to_disk

```python
save_to_disk(path: str) -> None
```

Writes the database and its data to disk as a JSON file.

**Parameters:**

- **path** (<code>str</code>) – The path to the JSON file.

#### load_from_disk

```python
load_from_disk(path: str) -> InMemoryDocumentStore
```

Loads the database and its data from a JSON file on disk.

**Parameters:**

- **path** (<code>str</code>) – The path to the JSON file.

**Returns:**

- <code>InMemoryDocumentStore</code> – The loaded InMemoryDocumentStore.
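
The save and load pair amounts to a JSON round trip. The sketch below shows that idea with plain dictionaries standing in for Documents. `save_sketch`, `load_sketch`, and the `{"documents": [...]}` layout are hypothetical and are not the file format the library writes.

```python
import json
import tempfile
from pathlib import Path


def save_sketch(path: str, documents: dict[str, dict]) -> None:
    # Hypothetical layout: a single top-level "documents" list.
    payload = {"documents": list(documents.values())}
    Path(path).write_text(json.dumps(payload), encoding="utf-8")


def load_sketch(path: str) -> dict[str, dict]:
    payload = json.loads(Path(path).read_text(encoding="utf-8"))
    # Rebuild the id -> document mapping from the stored list.
    return {doc["id"]: doc for doc in payload["documents"]}


docs = {"1": {"id": "1", "content": "hello", "meta": {"lang": "en"}}}
with tempfile.TemporaryDirectory() as tmp:
    target = str(Path(tmp) / "store.json")
    save_sketch(target, docs)
    restored = load_sketch(target)

print(restored == docs)
```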

#### count_documents

```python
count_documents() -> int
```

Returns the number of documents present in the DocumentStore.

#### filter_documents

```python
filter_documents(filters: dict[str, Any] | None = None) -> list[Document]
```

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol
documentation.

**Parameters:**

- **filters** (<code>dict\[str, Any\] | None</code>) – The filters to apply to the document list.

**Returns:**

- <code>list\[Document\]</code> – A list of Documents that match the given filters.
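
The protocol documentation is the authority on filter syntax. As a rough orientation, the sketch below evaluates filters shaped the way Haystack 2.x filters are usually written: a comparison is a dict with `field`, `operator`, and `value`, and a logical node has an `operator` of `AND`, `OR`, or `NOT` plus a list of `conditions`. The `matches` and `lookup` helpers are hypothetical, operate on plain dicts, and are not the library's implementation. Treating `NOT` as "not all conditions hold" is an assumption.

```python
import operator
from typing import Any

COMPARATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def lookup(doc: dict[str, Any], field: str) -> Any:
    # Resolve dotted paths such as "meta.year" against nested dicts.
    current: Any = doc
    for part in field.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def matches(doc: dict[str, Any], flt: dict[str, Any]) -> bool:
    op = flt["operator"]
    if op in ("AND", "OR", "NOT"):
        results = [matches(doc, cond) for cond in flt["conditions"]]
        if op == "AND":
            return all(results)
        if op == "OR":
            return any(results)
        return not all(results)  # assumption: NOT negates the AND of its conditions
    if op not in COMPARATORS:
        raise ValueError(f"Unknown operator: {op}")
    value = lookup(doc, flt["field"])
    if value is None and op in (">", ">=", "<", "<="):
        return False  # a missing field never satisfies an ordering comparison
    return COMPARATORS[op](value, flt["value"])


docs = [
    {"id": "1", "content": "intro", "meta": {"type": "article", "year": 2021}},
    {"id": "2", "content": "notes", "meta": {"type": "note", "year": 2023}},
    {"id": "3", "content": "update", "meta": {"type": "article", "year": 2024}},
]
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.year", "operator": ">=", "value": 2022},
    ],
}
selected = [d["id"] for d in docs if matches(d, filters)]
print(selected)
```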

#### write_documents

```python
write_documents(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
```

Refer to the DocumentStore.write_documents() protocol documentation.

If `policy` is set to `DuplicatePolicy.NONE`, it defaults to `DuplicatePolicy.FAIL`.
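
The sketch below illustrates how the duplicate policies might play out on a plain dictionary. `PolicySketch` and `write_sketch` are hypothetical stand-ins. The `NONE` to `FAIL` fallback comes from the text above, while the skip and overwrite behavior, the returned count of written documents, and the use of `ValueError` for the duplicate error are assumptions of this sketch.

```python
from enum import Enum


class PolicySketch(Enum):
    # Hypothetical mirror of the policy names; not the library's enum.
    NONE = "none"
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


def write_sketch(
    storage: dict[str, str],
    docs: dict[str, str],
    policy: PolicySketch = PolicySketch.NONE,
) -> int:
    if policy is PolicySketch.NONE:
        policy = PolicySketch.FAIL  # the documented fallback
    written = 0
    for doc_id, content in docs.items():
        if doc_id in storage:
            if policy is PolicySketch.FAIL:
                raise ValueError(f"ID '{doc_id}' already exists")
            if policy is PolicySketch.SKIP:
                continue  # keep the stored version, do not count it
        storage[doc_id] = content
        written += 1
    return written


store: dict[str, str] = {}
first = write_sketch(store, {"a": "v1", "b": "v1"})
skipped = write_sketch(store, {"a": "v2", "c": "v1"}, PolicySketch.SKIP)
overwritten = write_sketch(store, {"a": "v3"}, PolicySketch.OVERWRITE)
print(first, skipped, overwritten, store["a"])
```

Passing an explicit policy is the safer habit when re-indexing, since the default behaves like `FAIL` and raises on the first duplicate ID.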

#### delete_documents

```python
delete_documents(document_ids: list[str]) -> None
```

Deletes all documents with matching document_ids from the DocumentStore.

**Parameters:**

- **document_ids** (<code>list\[str\]</code>) – The IDs of the documents to delete.

#### delete_all_documents

```python
delete_all_documents() -> None
```

Deletes all documents in the document store.

#### update_by_filter

```python
update_by_filter(filters: dict[str, Any], meta: dict[str, Any]) -> int
```

Updates the metadata of all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for updating.
  For filter syntax, see filter_documents.
- **meta** (<code>dict\[str, Any\]</code>) – The metadata fields to update. These will be merged with existing metadata.

**Returns:**

- <code>int</code> – The number of documents updated.

**Raises:**

- <code>ValueError</code> – If the filters have invalid syntax.
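
One way to picture the merge described for `meta` is a dictionary update applied to each selected document. The sketch below uses a hypothetical `update_meta_sketch` helper with a plain predicate in place of a filter. It assumes that keys in `meta` overwrite existing keys of the same name and that other keys are left alone.

```python
from typing import Any, Callable


def update_meta_sketch(
    docs: list[dict[str, Any]],
    predicate: Callable[[dict[str, Any]], bool],
    meta: dict[str, Any],
) -> int:
    updated = 0
    for doc in docs:
        if predicate(doc):
            # Merge: existing keys are kept, keys in `meta` are added or overwritten.
            doc["meta"] = {**doc.get("meta", {}), **meta}
            updated += 1
    return updated


docs = [
    {"id": "1", "meta": {"status": "draft", "lang": "en"}},
    {"id": "2", "meta": {"status": "final"}},
]
count = update_meta_sketch(
    docs,
    lambda d: d["meta"].get("status") == "draft",
    {"status": "reviewed", "reviewer": "sam"},
)
print(count, docs[0]["meta"])
```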

#### delete_by_filter

```python
delete_by_filter(filters: dict[str, Any]) -> int
```

Deletes all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for deletion.
  For filter syntax, see filter_documents.

**Returns:**

- <code>int</code> – The number of documents deleted.

**Raises:**

- <code>ValueError</code> – If the filters have invalid syntax.

#### bm25_retrieval

```python
bm25_retrieval(
    query: str,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
) -> list[Document]
```

Retrieves documents that are most relevant to the query using the BM25 algorithm.

**Parameters:**

- **query** (<code>str</code>) – The query string.
- **filters** (<code>dict\[str, Any\] | None</code>) – A dictionary with filters to narrow down the search space.
- **top_k** (<code>int</code>) – The number of top documents to retrieve. Default is 10.
- **scale_score** (<code>bool</code>) – Whether to scale the scores of the retrieved documents. Default is False.

**Returns:**

- <code>list\[Document\]</code> – A list of the top_k documents most relevant to the query.
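
For intuition about what BM25 ranking does, here is a minimal sketch of the classic Okapi variant over raw strings. It is not the store's implementation and does not reproduce the default `BM25L` variant; the `bm25_okapi_rank` helper, the smoothed IDF form, and the `k1` and `b` values are assumptions chosen for illustration. Score scaling is left out.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase, then keep word tokens of two or more characters.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())


def bm25_okapi_rank(
    query: str, docs: list[str], top_k: int = 10, k1: float = 1.5, b: float = 0.75
) -> list[tuple[str, float]]:
    tokenized = [tokenize(d) for d in docs]
    freqs = [Counter(tokens) for tokens in tokenized]
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs if n_docs else 0.0
    query_terms = set(tokenize(query))
    scored = []
    for idx, (tokens, freq) in enumerate(zip(tokenized, freqs)):
        score = 0.0
        for term in query_terms:
            tf = freq.get(term, 0)
            if tf == 0:
                continue
            df = sum(1 for f in freqs if term in f)  # documents containing the term
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            # Length normalization: longer-than-average documents are penalized.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scored.append((score, idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(docs[i], s) for s, i in scored[:top_k]]


docs = [
    "The cat sat on the mat",
    "Dogs chase the ball",
    "A cat and a dog share the old mat in the hall",
]
ranked = bm25_okapi_rank("cat mat", docs, top_k=2)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

Both matching documents contain each query term once, so the shorter one ranks first because of length normalization.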

#### embedding_retrieval

```python
embedding_retrieval(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
    return_embedding: bool | None = False,
) -> list[Document]
```

Retrieves documents that are most similar to the query embedding using a vector similarity metric.

**Parameters:**

- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – A dictionary with filters to narrow down the search space.
- **top_k** (<code>int</code>) – The number of top documents to retrieve. Default is 10.
- **scale_score** (<code>bool</code>) – Whether to scale the scores of the retrieved Documents. Default is False.
- **return_embedding** (<code>bool | None</code>) – Whether to return the embedding of the retrieved Documents.
  If set to `None`, the value of the `return_embedding` parameter set at component
  initialization is used. Default is False.

**Returns:**

- <code>list\[Document\]</code> – A list of the top_k documents most relevant to the query.

**Raises:**

- <code>ValueError</code> – If the filters have invalid syntax.
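
The choice between "dot_product" and "cosine" can change which documents come back first. The sketch below compares the two on toy two-dimensional vectors. The helper names are hypothetical and the code is a plain-Python illustration, not the store's implementation.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot_product(a, b) / (norm_a * norm_b)


def top_k_by(similarity, query, embeddings: dict[str, list[float]], top_k: int = 10):
    scored = [(similarity(query, emb), doc_id) for doc_id, emb in embeddings.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


embeddings = {"short": [1.0, 0.0], "long": [3.0, 3.0]}
query = [1.0, 0.0]
by_dot = top_k_by(dot_product, query, embeddings, top_k=1)
by_cos = top_k_by(cosine, query, embeddings, top_k=1)
print(by_dot, by_cos)
```

Dot product rewards vector magnitude, so the larger vector wins here, while cosine only compares direction. If your embedding model produces normalized vectors, the two rankings coincide.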

#### count_documents_async

```python
count_documents_async() -> int
```

Returns the number of documents present in the DocumentStore.
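
The async methods pair with the `async_executor` constructor parameter. A common way to expose a synchronous method to async callers is to run it on a thread pool, as sketched below. `AsyncCounterSketch` is a hypothetical class, and this is an assumed pattern rather than a description of the store's internals. It also shows the ownership check that `shutdown` refers to.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


class AsyncCounterSketch:
    """Hypothetical class showing a sync method wrapped for async callers."""

    def __init__(self, executor: Optional[ThreadPoolExecutor] = None):
        # Track ownership so shutdown() only closes an executor created here.
        self._owns_executor = executor is None
        self._executor = executor or ThreadPoolExecutor(max_workers=1)
        self._storage: dict[str, str] = {"a": "first", "b": "second"}

    def count_documents(self) -> int:
        return len(self._storage)

    async def count_documents_async(self) -> int:
        # Run the blocking call on the executor and await its result.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, self.count_documents)

    def shutdown(self) -> None:
        if self._owns_executor:
            self._executor.shutdown(wait=True)


store = AsyncCounterSketch()
count = asyncio.run(store.count_documents_async())
store.shutdown()
print(count)
```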

#### filter_documents_async

```python
filter_documents_async(filters: dict[str, Any] | None = None) -> list[Document]
```

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol
documentation.

**Parameters:**

- **filters** (<code>dict\[str, Any\] | None</code>) – The filters to apply to the document list.

**Returns:**

- <code>list\[Document\]</code> – A list of Documents that match the given filters.

#### write_documents_async

```python
write_documents_async(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
```

Refer to the DocumentStore.write_documents() protocol documentation.

If `policy` is set to `DuplicatePolicy.NONE`, it defaults to `DuplicatePolicy.FAIL`.

#### delete_documents_async

```python
delete_documents_async(document_ids: list[str]) -> None
```

Deletes all documents with matching document_ids from the DocumentStore.

**Parameters:**

- **document_ids** (<code>list\[str\]</code>) – The IDs of the documents to delete.

#### bm25_retrieval_async

```python
bm25_retrieval_async(
    query: str,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
) -> list[Document]
```

Retrieves documents that are most relevant to the query using the BM25 algorithm.

**Parameters:**

- **query** (<code>str</code>) – The query string.
- **filters** (<code>dict\[str, Any\] | None</code>) – A dictionary with filters to narrow down the search space.
- **top_k** (<code>int</code>) – The number of top documents to retrieve. Default is 10.
- **scale_score** (<code>bool</code>) – Whether to scale the scores of the retrieved documents. Default is False.

**Returns:**

- <code>list\[Document\]</code> – A list of the top_k documents most relevant to the query.

#### embedding_retrieval_async

```python
embedding_retrieval_async(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
    return_embedding: bool = False,
) -> list[Document]
```

Retrieves documents that are most similar to the query embedding using a vector similarity metric.

**Parameters:**

- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – A dictionary with filters to narrow down the search space.
- **top_k** (<code>int</code>) – The number of top documents to retrieve. Default is 10.
- **scale_score** (<code>bool</code>) – Whether to scale the scores of the retrieved Documents. Default is False.
- **return_embedding** (<code>bool</code>) – Whether to return the embedding of the retrieved Documents. Default is False.

**Returns:**

- <code>list\[Document\]</code> – A list of the top_k documents most relevant to the query.