---
title: "Document Stores"
id: document-stores-api
description: "Stores your texts and metadata and provides them to the Retriever at query time."
slug: "/document-stores-api"
---


## document_store

### BM25DocumentStats

A dataclass for managing document statistics for BM25 retrieval.

**Parameters:**

- **freq_token** (<code>dict\[str, int\]</code>) – A Counter of token frequencies in the document.
- **doc_len** (<code>int</code>) – Number of tokens in the document.

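To make the two fields concrete, here is a minimal sketch of how such statistics could be computed for one text. `DocStatsSketch` and `build_stats` are hypothetical stand-ins rather than Haystack code, and lowercasing the text before tokenizing is an assumption of this sketch. The regular expression is the default `bm25_tokenization_regex` from the `InMemoryDocumentStore` signature.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class DocStatsSketch:
    """Hypothetical stand-in mirroring the two documented fields."""

    freq_token: dict[str, int]
    doc_len: int


# Default value of `bm25_tokenization_regex` in the __init__ signature.
TOKEN_PATTERN = re.compile(r"(?u)\b\w+\b")


def build_stats(text: str) -> DocStatsSketch:
    # Lowercasing is an assumption of this sketch, not something the reference states.
    tokens = TOKEN_PATTERN.findall(text.lower())
    return DocStatsSketch(freq_token=dict(Counter(tokens)), doc_len=len(tokens))


stats = build_stats("The cat sat on the mat")
print(stats.freq_token["the"], stats.doc_len)  # 2 6
```
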
### InMemoryDocumentStore

Stores data in memory. The store is ephemeral: its contents are lost when the process ends unless you explicitly write them out with `save_to_disk`.

#### __init__

```python
__init__(
    bm25_tokenization_regex: str = "(?u)\\b\\w+\\b",
    bm25_algorithm: Literal["BM25Okapi", "BM25L", "BM25Plus"] = "BM25L",
    bm25_parameters: dict | None = None,
    embedding_similarity_function: Literal[
        "dot_product", "cosine"
    ] = "dot_product",
    index: str | None = None,
    async_executor: ThreadPoolExecutor | None = None,
    return_embedding: bool = True,
) -> None
```

Initializes the DocumentStore.

**Parameters:**

- **bm25_tokenization_regex** (<code>str</code>) – The regular expression used to tokenize the text for BM25 retrieval.
- **bm25_algorithm** (<code>Literal['BM25Okapi', 'BM25L', 'BM25Plus']</code>) – The BM25 algorithm to use. One of "BM25Okapi", "BM25L", or "BM25Plus".
- **bm25_parameters** (<code>dict | None</code>) – Parameters for the BM25 implementation, in dictionary format.
  For example: `{'k1': 1.5, 'b': 0.75, 'epsilon': 0.25}`.
  You can learn more about these parameters by visiting https://github.com/dorianbrown/rank_bm25.
- **embedding_similarity_function** (<code>Literal['dot_product', 'cosine']</code>) – The similarity function used to compare Document embeddings.
  One of "dot_product" (default) or "cosine". To choose the most appropriate function, look for information
  about your embedding model.
- **index** (<code>str | None</code>) – A specific index to store the documents. If not specified, a random UUID is used.
  Using the same index allows multiple InMemoryDocumentStore instances to share stored documents.
- **async_executor** (<code>ThreadPoolExecutor | None</code>) – Optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded
  executor is initialized and used.
- **return_embedding** (<code>bool</code>) – Whether to return the embedding of the retrieved Documents. Default is True.

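The choice between "dot_product" and "cosine" matters mostly when embeddings are not length-normalized. The following plain-Python sketch (independent of Haystack, with made-up vectors) shows the two functions ranking the same pair of vectors differently.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity is the dot product of the length-normalized vectors.
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
short_aligned = [1.0, 0.0]  # same direction as the query, small magnitude
long_off_axis = [3.0, 3.0]  # 45 degrees away from the query, but much longer

# Dot product rewards magnitude, so the longer vector wins.
print(dot_product(query, short_aligned), dot_product(query, long_off_axis))
# Cosine ignores magnitude, so the aligned vector wins.
print(cosine(query, short_aligned), cosine(query, long_off_axis))
```

If your embedding model already produces unit-length vectors, both functions give the same ranking.
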
#### shutdown

```python
shutdown() -> None
```

Explicitly shuts down the executor if this instance owns it.

#### storage

```python
storage: dict[str, Document]
```

Utility property that returns the storage used by this instance of InMemoryDocumentStore.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> InMemoryDocumentStore
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>InMemoryDocumentStore</code> – The deserialized component.

#### save_to_disk

```python
save_to_disk(path: str) -> None
```

Writes the database and its data to disk as a JSON file.

**Parameters:**

- **path** (<code>str</code>) – The path to the JSON file.

#### load_from_disk

```python
load_from_disk(path: str) -> InMemoryDocumentStore
```

Loads the database and its data from a JSON file on disk.

**Parameters:**

- **path** (<code>str</code>) – The path to the JSON file.

**Returns:**

- <code>InMemoryDocumentStore</code> – The loaded InMemoryDocumentStore.

#### count_documents

```python
count_documents() -> int
```

Returns the number of documents present in the DocumentStore.

#### filter_documents

```python
filter_documents(filters: dict[str, Any] | None = None) -> list[Document]
```

Returns the documents that match the filters provided.

**Parameters:**

- **filters** (<code>dict\[str, Any\] | None</code>) – The filters to apply. For a detailed specification of the filters, refer to the
  [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>list\[Document\]</code> – A list of Documents that match the given filters.

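As an illustration of what a `filters` dictionary can look like, the sketch below evaluates a small subset of the filter syntax (comparison nodes with `field`, `operator`, `value`, and `AND`/`OR` nodes with `conditions`) against plain dictionaries. The `matches` helper is hypothetical and only approximates the behavior; the linked documentation remains the authoritative specification.

```python
import operator
from typing import Any

COMPARATORS = {
    "==": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(doc: dict[str, Any], flt: dict[str, Any]) -> bool:
    """Evaluate a small subset of the filter syntax against a plain dict."""
    if "conditions" in flt:  # logical node
        results = [matches(doc, cond) for cond in flt["conditions"]]
        if flt["operator"] == "AND":
            return all(results)
        if flt["operator"] == "OR":
            return any(results)
        raise ValueError(f"Unsupported logical operator: {flt['operator']}")
    # Comparison node: "meta.year" looks up doc["meta"]["year"].
    value: Any = doc
    for key in flt["field"].split("."):
        value = value.get(key) if isinstance(value, dict) else None
    if value is None:
        return False  # a missing field never matches in this sketch
    return COMPARATORS[flt["operator"]](value, flt["value"])


docs = [
    {"content": "a", "meta": {"category": "news", "year": 2021}},
    {"content": "b", "meta": {"category": "news", "year": 2024}},
    {"content": "c", "meta": {"category": "blog", "year": 2024}},
]
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.category", "operator": "==", "value": "news"},
        {"field": "meta.year", "operator": ">=", "value": 2023},
    ],
}
matched = [d["content"] for d in docs if matches(d, filters)]
print(matched)  # ['b']
```
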
#### write_documents

```python
write_documents(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
```

Refer to the DocumentStore.write_documents() protocol documentation.

If `policy` is set to `DuplicatePolicy.NONE`, it defaults to `DuplicatePolicy.FAIL`.

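The fallback from `NONE` to `FAIL` can be pictured with the following sketch. `PolicySketch` and `write` are hypothetical stand-ins that operate on a plain dictionary; the return value counting only documents actually written is an assumption, and the sketch raises a plain `ValueError` as a placeholder for whatever error the real store raises on duplicates.

```python
from enum import Enum


class PolicySketch(Enum):
    """Hypothetical stand-in for DuplicatePolicy."""

    NONE = "none"
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


def write(
    storage: dict[str, str],
    docs: dict[str, str],
    policy: PolicySketch = PolicySketch.NONE,
) -> int:
    if policy is PolicySketch.NONE:
        policy = PolicySketch.FAIL  # the documented fallback
    written = 0
    for doc_id, content in docs.items():
        if doc_id in storage:
            if policy is PolicySketch.FAIL:
                raise ValueError(f"ID '{doc_id}' already exists")
            if policy is PolicySketch.SKIP:
                continue  # keep the stored version, do not count it
        storage[doc_id] = content
        written += 1
    return written


store = {"1": "old"}
print(write(store, {"1": "new", "2": "x"}, PolicySketch.SKIP))  # 1
print(write(store, {"1": "new"}, PolicySketch.OVERWRITE))  # 1
try:
    write(store, {"1": "again"})  # NONE resolves to FAIL
except ValueError as err:
    print("rejected:", err)
```
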
#### delete_documents

```python
delete_documents(document_ids: list[str]) -> None
```

Deletes all documents with matching document_ids from the DocumentStore.

**Parameters:**

- **document_ids** (<code>list\[str\]</code>) – The document_ids to delete.

#### delete_all_documents

```python
delete_all_documents() -> None
```

Deletes all documents in the document store.

#### update_by_filter

```python
update_by_filter(filters: dict[str, Any], meta: dict[str, Any]) -> int
```

Updates the metadata of all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for updating.
  For filter syntax, see filter_documents.
- **meta** (<code>dict\[str, Any\]</code>) – The metadata fields to update. These will be merged with existing metadata.

**Returns:**

- <code>int</code> – The number of documents updated.

**Raises:**

- <code>ValueError</code> – If the filters have invalid syntax.

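"Merged with existing metadata" can be read as a dictionary update: new keys are added and other existing keys are left alone. The sketch below shows that reading on plain dictionaries. The `update_meta` helper and its `predicate` argument (standing in for the filter) are hypothetical, and overwriting values on key conflicts is an assumption.

```python
from typing import Any, Callable


def update_meta(
    docs: list[dict[str, Any]],
    predicate: Callable[[dict[str, Any]], bool],
    meta: dict[str, Any],
) -> int:
    updated = 0
    for doc in docs:
        if predicate(doc):
            # Merge: new keys are added, conflicting keys are overwritten.
            doc["meta"].update(meta)
            updated += 1
    return updated


docs = [
    {"id": "1", "meta": {"status": "draft", "lang": "en"}},
    {"id": "2", "meta": {"status": "final", "lang": "en"}},
]
count = update_meta(
    docs,
    lambda d: d["meta"]["status"] == "draft",
    {"status": "review", "owner": "ana"},
)
print(count, docs[0]["meta"])
# 1 {'status': 'review', 'lang': 'en', 'owner': 'ana'}
```
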
#### delete_by_filter

```python
delete_by_filter(filters: dict[str, Any]) -> int
```

Deletes all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for deletion.
  For filter syntax, see filter_documents.

**Returns:**

- <code>int</code> – The number of documents deleted.

**Raises:**

- <code>ValueError</code> – If the filters have invalid syntax.

#### count_documents_by_filter

```python
count_documents_by_filter(filters: dict[str, Any]) -> int
```

Returns the number of documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply.
  For a detailed specification of the filters, refer to the
  [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>int</code> – The number of documents that match the filters.

#### count_unique_metadata_by_filter

```python
count_unique_metadata_by_filter(
    filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]
```

Returns the number of unique values for each specified metadata field from documents matching the filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply.
  For a detailed specification of the filters, refer to the
  [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering).
- **metadata_fields** (<code>list\[str\]</code>) – List of field names to count unique values for.
  Field names can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, int\]</code> – A dictionary mapping each metadata field name (without "meta." prefix)
  to the count of its unique values among the filtered documents.

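The shape of the returned dictionary, including the handling of the optional "meta." prefix, can be sketched on plain dictionaries as follows. `count_unique` is a hypothetical helper that skips the filtering step and assumes the metadata values are hashable.

```python
from typing import Any


def count_unique(docs: list[dict[str, Any]], metadata_fields: list[str]) -> dict[str, int]:
    counts: dict[str, int] = {}
    for field in metadata_fields:
        name = field.removeprefix("meta.")  # accept "meta.author" and "author" alike
        values = {doc["meta"][name] for doc in docs if name in doc["meta"]}
        counts[name] = len(values)
    return counts


docs = [
    {"meta": {"author": "ana", "year": 2023}},
    {"meta": {"author": "ben", "year": 2023}},
    {"meta": {"author": "ana"}},
]
result = count_unique(docs, ["meta.author", "year"])
print(result)  # {'author': 2, 'year': 1}
```
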
#### get_metadata_fields_info

```python
get_metadata_fields_info() -> dict[str, dict[str, str]]
```

Returns information about the metadata fields present in the stored documents.

Types are inferred from the stored values (keyword, int, float, boolean).

**Returns:**

- <code>dict\[str, dict\[str, str\]\]</code> – A dictionary mapping each metadata field name to a dict with a "type" key.

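One way such type inference could work is sketched below. The helpers are hypothetical; treating every non-numeric value as "keyword" and letting the first value seen decide a field's type are assumptions of the sketch. Note that `bool` has to be checked before `int`, because `bool` is a subclass of `int` in Python.

```python
from typing import Any


def infer_type(value: Any) -> str:
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    return "keyword"  # assumption: everything else is treated as a keyword


def fields_info(docs: list[dict[str, Any]]) -> dict[str, dict[str, str]]:
    info: dict[str, dict[str, str]] = {}
    for doc in docs:
        for name, value in doc["meta"].items():
            # Assumption: the first value seen for a field decides its type.
            info.setdefault(name, {"type": infer_type(value)})
    return info


docs = [{"meta": {"title": "A", "pages": 12, "score": 0.5, "public": True}}]
info = fields_info(docs)
print(info)
```
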
#### get_metadata_field_min_max

```python
get_metadata_field_min_max(metadata_field: str) -> dict[str, Any]
```

Returns the minimum and maximum values for the given metadata field across all documents.

**Parameters:**

- **metadata_field** (<code>str</code>) – The metadata field name. Can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with "min" and "max" keys. Returns `{"min": None, "max": None}`
  if the field is missing or has no values.

#### get_metadata_field_unique_values

```python
get_metadata_field_unique_values(
    metadata_field: str, search_term: str | None = None
) -> tuple[list[str], int]
```

Returns unique values for a metadata field, optionally filtered by a search term in content.

**Parameters:**

- **metadata_field** (<code>str</code>) – The metadata field name. Can include or omit the "meta." prefix.
- **search_term** (<code>str | None</code>) – If set, only documents whose content contains this term (case-insensitive)
  are considered.

**Returns:**

- <code>tuple\[list\[str\], int\]</code> – A tuple of (list of unique values, total count of unique values).

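The return shapes of the two metadata helpers above can be sketched on plain dictionaries. `field_min_max` and `field_unique_values` are hypothetical functions; converting values to strings and sorting them are assumptions made to match the `list[str]` return type.

```python
from typing import Any, Optional


def _name(metadata_field: str) -> str:
    return metadata_field.removeprefix("meta.")


def field_min_max(docs: list[dict[str, Any]], metadata_field: str) -> dict[str, Any]:
    name = _name(metadata_field)
    values = [d["meta"][name] for d in docs if name in d["meta"]]
    if not values:
        return {"min": None, "max": None}
    return {"min": min(values), "max": max(values)}


def field_unique_values(
    docs: list[dict[str, Any]], metadata_field: str, search_term: Optional[str] = None
) -> tuple[list[str], int]:
    name = _name(metadata_field)
    seen: set[str] = set()
    for d in docs:
        # Case-insensitive substring match on the document content.
        if search_term is not None and search_term.lower() not in d["content"].lower():
            continue
        if name in d["meta"]:
            seen.add(str(d["meta"][name]))
    values = sorted(seen)
    return values, len(values)


docs = [
    {"content": "Python tips", "meta": {"topic": "code", "year": 2021}},
    {"content": "python tricks", "meta": {"topic": "code", "year": 2024}},
    {"content": "Gardening", "meta": {"topic": "home"}},
]
print(field_min_max(docs, "meta.year"))  # {'min': 2021, 'max': 2024}
print(field_min_max(docs, "rating"))  # {'min': None, 'max': None}
print(field_unique_values(docs, "topic"))  # (['code', 'home'], 2)
print(field_unique_values(docs, "topic", search_term="PYTHON"))  # (['code'], 1)
```
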
#### bm25_retrieval

```python
bm25_retrieval(
    query: str,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
) -> list[Document]
```

Retrieves documents that are most relevant to the query using the BM25 algorithm.

**Parameters:**

- **query** (<code>str</code>) – The query string.
- **filters** (<code>dict\[str, Any\] | None</code>) – A dictionary with filters to narrow down the search space.
- **top_k** (<code>int</code>) – The number of top documents to retrieve. Default is 10.
- **scale_score** (<code>bool</code>) – Whether to scale the scores of the retrieved documents. Default is False.

**Returns:**

- <code>list\[Document\]</code> – A list of the top_k documents most relevant to the query.

#### embedding_retrieval

```python
embedding_retrieval(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
    return_embedding: bool | None = False,
) -> list[Document]
```

Retrieves documents that are most similar to the query embedding using a vector similarity metric.

**Parameters:**

- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – A dictionary with filters to narrow down the search space.
- **top_k** (<code>int</code>) – The number of top documents to retrieve. Default is 10.
- **scale_score** (<code>bool</code>) – Whether to scale the scores of the retrieved Documents. Default is False.
- **return_embedding** (<code>bool | None</code>) – Whether to return the embedding of the retrieved Documents.
  If set to `None`, the value of the `return_embedding` parameter set at component
  initialization is used. Default is False.

**Returns:**

- <code>list\[Document\]</code> – A list of the top_k documents most relevant to the query.

**Raises:**

- <code>ValueError</code> – If the filters have invalid syntax.

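The interplay of `top_k` and `return_embedding` can be sketched with plain dictionaries and a dot-product score. The `retrieve` function and its `store_default` argument (standing in for the value passed at initialization) are hypothetical, and reading "not provided" as `None` is an assumption based on the `bool | None` type in the signature.

```python
import heapq
from typing import Any, Optional


def retrieve(
    docs: list[dict[str, Any]],
    query_embedding: list[float],
    top_k: int = 10,
    return_embedding: Optional[bool] = False,
    store_default: bool = True,
) -> list[dict[str, Any]]:
    # None defers to the store-level setting; an explicit bool wins.
    keep_embedding = store_default if return_embedding is None else return_embedding

    def score(doc: dict[str, Any]) -> float:
        return sum(q * e for q, e in zip(query_embedding, doc["embedding"]))

    hits = []
    for doc in heapq.nlargest(top_k, docs, key=score):
        hit = {"id": doc["id"], "score": score(doc)}
        if keep_embedding:
            hit["embedding"] = doc["embedding"]
        hits.append(hit)
    return hits


docs = [
    {"id": "a", "embedding": [1.0, 0.0]},
    {"id": "b", "embedding": [0.0, 1.0]},
    {"id": "c", "embedding": [0.7, 0.7]},
]
query = [1.0, 0.2]
print([hit["id"] for hit in retrieve(docs, query, top_k=2)])  # ['a', 'c']
print("embedding" in retrieve(docs, query, top_k=1, return_embedding=None)[0])  # True
```
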
#### count_documents_async

```python
count_documents_async() -> int
```

Returns the number of documents present in the DocumentStore.

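The `*_async` methods mirror their synchronous counterparts. One common way to build such wrappers, and a plausible reading of the `async_executor` parameter and the `shutdown` method, is to run the synchronous method on a thread pool. The sketch below shows that pattern with a hypothetical `TinyStore` class; it is not the Haystack implementation.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


class TinyStore:
    """Hypothetical stand-in, not the Haystack class."""

    def __init__(self, async_executor: Optional[ThreadPoolExecutor] = None):
        self._docs = {"1": "hello", "2": "world"}
        # Remember whether we created the executor so shutdown() only closes our own.
        self._owns_executor = async_executor is None
        self.executor = async_executor or ThreadPoolExecutor(max_workers=1)

    def count_documents(self) -> int:
        return len(self._docs)

    async def count_documents_async(self) -> int:
        # Run the synchronous method on the executor without blocking the event loop.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self.executor, self.count_documents)

    def shutdown(self) -> None:
        if self._owns_executor:
            self.executor.shutdown(wait=True)


async def main() -> int:
    store = TinyStore()
    try:
        return await store.count_documents_async()
    finally:
        store.shutdown()


print(asyncio.run(main()))  # 2
```
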
#### filter_documents_async

```python
filter_documents_async(filters: dict[str, Any] | None = None) -> list[Document]
```

Returns the documents that match the filters provided.

**Parameters:**

- **filters** (<code>dict\[str, Any\] | None</code>) – The filters to apply. For a detailed specification of the filters, refer to the
  [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>list\[Document\]</code> – A list of Documents that match the given filters.

#### write_documents_async

```python
write_documents_async(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
```

Refer to the DocumentStore.write_documents() protocol documentation.

If `policy` is set to `DuplicatePolicy.NONE`, it defaults to `DuplicatePolicy.FAIL`.

#### delete_documents_async

```python
delete_documents_async(document_ids: list[str]) -> None
```

Deletes all documents with matching document_ids from the DocumentStore.

**Parameters:**

- **document_ids** (<code>list\[str\]</code>) – The document_ids to delete.

#### update_by_filter_async

```python
update_by_filter_async(filters: dict[str, Any], meta: dict[str, Any]) -> int
```

Updates the metadata of all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for updating.
  For filter syntax, see filter_documents.
- **meta** (<code>dict\[str, Any\]</code>) – The metadata fields to update. These will be merged with existing metadata.

**Returns:**

- <code>int</code> – The number of documents updated.

#### count_documents_by_filter_async

```python
count_documents_by_filter_async(filters: dict[str, Any]) -> int
```

Returns the number of documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply.
  For a detailed specification of the filters, refer to the
  [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>int</code> – The number of documents that match the filters.

#### count_unique_metadata_by_filter_async

```python
count_unique_metadata_by_filter_async(
    filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]
```

Returns the number of unique values for each specified metadata field from documents matching the filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply.
  For a detailed specification of the filters, refer to the
  [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering).
- **metadata_fields** (<code>list\[str\]</code>) – List of field names to count unique values for.
  Field names can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, int\]</code> – A dictionary mapping each metadata field name (without "meta." prefix)
  to the count of its unique values among the filtered documents.

#### get_metadata_fields_info_async

```python
get_metadata_fields_info_async() -> dict[str, dict[str, str]]
```

Returns information about the metadata fields present in the stored documents.

Types are inferred from the stored values (keyword, int, float, boolean).

**Returns:**

- <code>dict\[str, dict\[str, str\]\]</code> – A dictionary mapping each metadata field name to a dict with a "type" key.

#### get_metadata_field_min_max_async

```python
get_metadata_field_min_max_async(metadata_field: str) -> dict[str, Any]
```

Returns the minimum and maximum values for the given metadata field across all documents.

**Parameters:**

- **metadata_field** (<code>str</code>) – The metadata field name. Can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with "min" and "max" keys. Returns `{"min": None, "max": None}`
  if the field is missing or has no values.

#### get_metadata_field_unique_values_async

```python
get_metadata_field_unique_values_async(
    metadata_field: str, search_term: str | None = None
) -> tuple[list[str], int]
```

Returns unique values for a metadata field, optionally filtered by a search term in content.

**Parameters:**

- **metadata_field** (<code>str</code>) – The metadata field name. Can include or omit the "meta." prefix.
- **search_term** (<code>str | None</code>) – If set, only documents whose content contains this term (case-insensitive)
  are considered.

**Returns:**

- <code>tuple\[list\[str\], int\]</code> – A tuple of (list of unique values, total count of unique values).

#### delete_all_documents_async

```python
delete_all_documents_async() -> None
```

Deletes all documents in the document store.

#### bm25_retrieval_async

```python
bm25_retrieval_async(
    query: str,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
) -> list[Document]
```

Retrieves documents that are most relevant to the query using the BM25 algorithm.

**Parameters:**

- **query** (<code>str</code>) – The query string.
- **filters** (<code>dict\[str, Any\] | None</code>) – A dictionary with filters to narrow down the search space.
- **top_k** (<code>int</code>) – The number of top documents to retrieve. Default is 10.
- **scale_score** (<code>bool</code>) – Whether to scale the scores of the retrieved documents. Default is False.

**Returns:**

- <code>list\[Document\]</code> – A list of the top_k documents most relevant to the query.

#### embedding_retrieval_async

```python
embedding_retrieval_async(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
    return_embedding: bool = False,
) -> list[Document]
```

Retrieves documents that are most similar to the query embedding using a vector similarity metric.

**Parameters:**

- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – A dictionary with filters to narrow down the search space.
- **top_k** (<code>int</code>) – The number of top documents to retrieve. Default is 10.
- **scale_score** (<code>bool</code>) – Whether to scale the scores of the retrieved Documents. Default is False.
- **return_embedding** (<code>bool</code>) – Whether to return the embedding of the retrieved Documents. Default is False.

**Returns:**

- <code>list\[Document\]</code> – A list of the top_k documents most relevant to the query.