---
title: "Document Stores"
id: document-stores-api
description: "Stores your texts and metadata and provides them to the Retriever at query time."
slug: "/document-stores-api"
---

<a id="document_store"></a>

## Module document\_store

<a id="document_store.BM25DocumentStats"></a>

### BM25DocumentStats

A dataclass for managing document statistics for BM25 retrieval.

**Arguments**:

- `freq_token`: A Counter of token frequencies in the document.
- `doc_len`: Number of tokens in the document.

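As a rough illustration of what these two fields hold, the sketch below computes them for a single text. `DocStatsSketch` and `build_stats` are hypothetical names, not part of the library; the pattern is the documented default of `bm25_tokenization_regex`, and lowercasing the text is an assumption made for this example.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Documented default tokenization pattern: runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


@dataclass
class DocStatsSketch:
    """Hypothetical stand-in mirroring the two documented fields."""

    freq_token: Counter  # token -> number of occurrences in the document
    doc_len: int  # total number of tokens in the document


def build_stats(text: str) -> DocStatsSketch:
    # Lowercasing is an assumption made for this sketch.
    tokens = TOKEN_PATTERN.findall(text.lower())
    return DocStatsSketch(freq_token=Counter(tokens), doc_len=len(tokens))


stats = build_stats("The cat sat on the mat, a cat nap")
```

The one-letter word "a" does not match the pattern, so it contributes neither to `freq_token` nor to `doc_len`.
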
<a id="document_store.InMemoryDocumentStore"></a>

### InMemoryDocumentStore

Stores data in-memory. The data is ephemeral: it is lost when the process exits unless you explicitly write it to disk with `save_to_disk()`.

<a id="document_store.InMemoryDocumentStore.__init__"></a>

#### InMemoryDocumentStore.\_\_init\_\_

```python
def __init__(bm25_tokenization_regex: str = r"(?u)\b\w\w+\b",
             bm25_algorithm: Literal["BM25Okapi", "BM25L",
                                     "BM25Plus"] = "BM25L",
             bm25_parameters: Optional[dict] = None,
             embedding_similarity_function: Literal["dot_product",
                                                    "cosine"] = "dot_product",
             index: Optional[str] = None,
             async_executor: Optional[ThreadPoolExecutor] = None,
             return_embedding: bool = True)
```

Initializes the DocumentStore.

**Arguments**:

- `bm25_tokenization_regex`: The regular expression used to tokenize the text for BM25 retrieval.
- `bm25_algorithm`: The BM25 algorithm to use. One of "BM25Okapi", "BM25L" (default), or "BM25Plus".
- `bm25_parameters`: Parameters for the BM25 implementation, given as a dictionary.
For example: `{'k1': 1.5, 'b': 0.75, 'epsilon': 0.25}`.
You can learn more about these parameters by visiting https://github.com/dorianbrown/rank_bm25.
- `embedding_similarity_function`: The similarity function used to compare Document embeddings.
One of "dot_product" (default) or "cosine". To choose the most appropriate function, look for information
about your embedding model.
- `index`: A specific index to store the documents. If not specified, a random UUID is used.
Using the same index lets multiple InMemoryDocumentStore instances share the same documents.
- `async_executor`: Optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded
executor is created and used.
- `return_embedding`: Whether to return the embedding of the retrieved Documents. Default is True.

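The choice of `embedding_similarity_function` can change the ranking, because the dot product rewards long vectors while cosine similarity only looks at direction. The standalone sketch below illustrates this with plain Python lists; the helper functions and sample vectors are made up for the example and are not the library's internals.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    # Dot product of the two vectors after scaling each to unit length.
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
short_doc = [1.0, 0.0]  # same direction as the query, length 1
long_doc = [3.0, 4.0]   # different direction, length 5

dot_scores = (dot_product(query, short_doc), dot_product(query, long_doc))
cos_scores = (cosine(query, short_doc), cosine(query, long_doc))
```

Here the dot product prefers `long_doc` and cosine prefers `short_doc`. When the embedding model produces unit-length vectors, the two functions yield the same ranking.
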
<a id="document_store.InMemoryDocumentStore.__del__"></a>

#### InMemoryDocumentStore.\_\_del\_\_

```python
def __del__()
```

Cleans up when the instance is being destroyed.

<a id="document_store.InMemoryDocumentStore.shutdown"></a>

#### InMemoryDocumentStore.shutdown

```python
def shutdown()
```

Explicitly shuts down the executor if this instance owns it.

<a id="document_store.InMemoryDocumentStore.storage"></a>

#### InMemoryDocumentStore.storage

```python
@property
def storage() -> dict[str, Document]
```

Utility property that returns the storage used by this instance of InMemoryDocumentStore.

<a id="document_store.InMemoryDocumentStore.to_dict"></a>

#### InMemoryDocumentStore.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="document_store.InMemoryDocumentStore.from_dict"></a>

#### InMemoryDocumentStore.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "InMemoryDocumentStore"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.

<a id="document_store.InMemoryDocumentStore.save_to_disk"></a>

#### InMemoryDocumentStore.save\_to\_disk

```python
def save_to_disk(path: str) -> None
```

Writes the database and its data to disk as a JSON file.

**Arguments**:

- `path`: The path to the JSON file.

<a id="document_store.InMemoryDocumentStore.load_from_disk"></a>

#### InMemoryDocumentStore.load\_from\_disk

```python
@classmethod
def load_from_disk(cls, path: str) -> "InMemoryDocumentStore"
```

Loads the database and its data from a JSON file on disk.

**Arguments**:

- `path`: The path to the JSON file.

**Returns**:

The loaded InMemoryDocumentStore.

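The save and load pair amounts to a JSON round trip: serialize the constructor arguments together with the stored documents, then rebuild an instance from that file. The sketch below shows the general pattern with a hypothetical `TinyStore` class. The JSON layout (`init_parameters` plus `documents`) is an assumption chosen for illustration and may not match the file the library writes.

```python
import json
import os
import tempfile


class TinyStore:
    """Hypothetical in-memory store; not the Haystack implementation."""

    def __init__(self, index: str = "default"):
        self.index = index
        self.documents: dict[str, dict] = {}

    def to_dict(self) -> dict:
        # Only the constructor arguments are serialized here.
        return {"init_parameters": {"index": self.index}}

    def save_to_disk(self, path: str) -> None:
        data = self.to_dict()
        data["documents"] = list(self.documents.values())
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f)

    @classmethod
    def load_from_disk(cls, path: str) -> "TinyStore":
        with open(path, encoding="utf-8") as f:
            data = json.load(f)
        store = cls(**data["init_parameters"])
        for doc in data["documents"]:
            store.documents[doc["id"]] = doc
        return store


store = TinyStore(index="demo")
store.documents["1"] = {"id": "1", "content": "hello"}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "store.json")
    store.save_to_disk(path)
    restored = TinyStore.load_from_disk(path)
```
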
<a id="document_store.InMemoryDocumentStore.count_documents"></a>

#### InMemoryDocumentStore.count\_documents

```python
def count_documents() -> int
```

Returns the number of documents present in the DocumentStore.

<a id="document_store.InMemoryDocumentStore.filter_documents"></a>

#### InMemoryDocumentStore.filter\_documents

```python
def filter_documents(
        filters: Optional[dict[str, Any]] = None) -> list[Document]
```

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol
documentation.

**Arguments**:

- `filters`: The filters to apply to the document list.

**Returns**:

A list of Documents that match the given filters.

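To make the idea of a filter dictionary concrete, here is a minimal evaluator over plain dictionaries. It assumes the layout used by Haystack 2.x filters: comparison nodes with `field`, `operator`, and `value`, and logical nodes with `operator` and `conditions`. The functions `get_field` and `matches` are hypothetical helpers that cover only a subset of operators; the protocol documentation remains the authoritative specification.

```python
import operator
from typing import Any

COMPARATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def get_field(doc: dict[str, Any], dotted: str) -> Any:
    # Resolve a path such as "meta.lang" by walking nested dictionaries.
    value: Any = doc
    for key in dotted.split("."):
        value = value[key]
    return value


def matches(doc: dict[str, Any], flt: dict[str, Any]) -> bool:
    if "conditions" in flt:  # logical node
        results = [matches(doc, cond) for cond in flt["conditions"]]
        if flt["operator"] == "AND":
            return all(results)
        if flt["operator"] == "OR":
            return any(results)
        raise ValueError(f"Unsupported logical operator: {flt['operator']}")
    # Comparison node
    compare = COMPARATORS[flt["operator"]]
    return compare(get_field(doc, flt["field"]), flt["value"])


docs = [
    {"id": "1", "content": "Hallo", "meta": {"lang": "de", "year": 2021}},
    {"id": "2", "content": "Hello", "meta": {"lang": "en", "year": 2023}},
    {"id": "3", "content": "Howdy", "meta": {"lang": "en", "year": 2019}},
]
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.lang", "operator": "==", "value": "en"},
        {"field": "meta.year", "operator": ">=", "value": 2020},
    ],
}
selected = [d["id"] for d in docs if matches(d, filters)]
```
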
<a id="document_store.InMemoryDocumentStore.write_documents"></a>

#### InMemoryDocumentStore.write\_documents

```python
def write_documents(documents: list[Document],
                    policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int
```

Refer to the DocumentStore.write_documents() protocol documentation.

If `policy` is set to `DuplicatePolicy.NONE`, it defaults to `DuplicatePolicy.FAIL`.

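The effect of the duplicate policy can be sketched with a plain dictionary as storage. `Policy` and `write` below are hypothetical stand-ins for `DuplicatePolicy` and `write_documents`, a `ValueError` stands in for whatever error the library raises, and the rule that skipped documents are not counted in the return value is an assumption of this sketch.

```python
from enum import Enum


class Policy(Enum):  # hypothetical mirror of DuplicatePolicy
    NONE = "none"
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


def write(storage: dict[str, dict], documents: list[dict],
          policy: Policy = Policy.NONE) -> int:
    if policy is Policy.NONE:
        policy = Policy.FAIL  # documented fallback
    written = 0
    for doc in documents:
        if doc["id"] in storage:
            if policy is Policy.SKIP:
                continue  # keep the existing document
            if policy is Policy.FAIL:
                raise ValueError(f"ID '{doc['id']}' already exists.")
        # New ID, or an existing ID under OVERWRITE
        storage[doc["id"]] = doc
        written += 1
    return written


storage: dict[str, dict] = {}
first = write(storage, [{"id": "a", "content": "v1"}])
skipped = write(storage, [{"id": "a", "content": "v2"}], Policy.SKIP)
replaced = write(storage, [{"id": "a", "content": "v3"}], Policy.OVERWRITE)
```

Calling `write` again with the default policy and the same ID raises, because `NONE` falls back to `FAIL`.
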
<a id="document_store.InMemoryDocumentStore.delete_documents"></a>

#### InMemoryDocumentStore.delete\_documents

```python
def delete_documents(document_ids: list[str]) -> None
```

Deletes all documents with matching document_ids from the DocumentStore.

**Arguments**:

- `document_ids`: The IDs of the documents to delete.

<a id="document_store.InMemoryDocumentStore.bm25_retrieval"></a>

#### InMemoryDocumentStore.bm25\_retrieval

```python
def bm25_retrieval(query: str,
                   filters: Optional[dict[str, Any]] = None,
                   top_k: int = 10,
                   scale_score: bool = False) -> list[Document]
```

Retrieves the documents that are most relevant to the query using the BM25 algorithm.

**Arguments**:

- `query`: The query string.
- `filters`: A dictionary with filters to narrow down the search space.
- `top_k`: The number of top documents to retrieve. Default is 10.
- `scale_score`: Whether to scale the scores of the retrieved documents. Default is False.

**Returns**:

A list of the `top_k` documents most relevant to the query.

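For intuition about how BM25 ranking works, the sketch below scores a tiny corpus with the classic Okapi BM25 formula and a smoothed IDF. This is a simplified illustration, not the store's code: the default algorithm here is BM25L, which uses a different term-frequency normalization. The function name `bm25_top_k`, the sample corpus, and the lowercasing step are assumptions made for the example; `k1` and `b` play the role of the values passed through `bm25_parameters`.

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text: str) -> list[str]:
    return TOKEN_PATTERN.findall(text.lower())


def bm25_top_k(query: str, corpus: dict[str, str], top_k: int = 10,
               k1: float = 1.5, b: float = 0.75) -> list[tuple[str, float]]:
    stats = {doc_id: Counter(tokenize(text)) for doc_id, text in corpus.items()}
    lengths = {doc_id: sum(freqs.values()) for doc_id, freqs in stats.items()}
    avg_len = sum(lengths.values()) / len(lengths)
    n_docs = len(corpus)

    scores = dict.fromkeys(corpus, 0.0)
    for term in set(tokenize(query)):
        n_containing = sum(1 for freqs in stats.values() if term in freqs)
        # Smoothed IDF that stays non-negative for common terms.
        idf = math.log((n_docs - n_containing + 0.5) / (n_containing + 0.5) + 1)
        for doc_id, freqs in stats.items():
            tf = freqs[term]
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + norm)

    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


corpus = {
    "d1": "The cat sat on the mat",
    "d2": "Dogs chase the ball in the park",
    "d3": "A cat and another cat share the sofa",
}
hits = bm25_top_k("cat", corpus, top_k=2)
```

With this corpus, `d3` ranks above `d1` because it mentions the query term twice, and `d2` is cut off by `top_k=2`.
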
<a id="document_store.InMemoryDocumentStore.embedding_retrieval"></a>

#### InMemoryDocumentStore.embedding\_retrieval

```python
def embedding_retrieval(
        query_embedding: list[float],
        filters: Optional[dict[str, Any]] = None,
        top_k: int = 10,
        scale_score: bool = False,
        return_embedding: Optional[bool] = False) -> list[Document]
```

Retrieves the documents that are most similar to the query embedding using a vector similarity metric.

**Arguments**:

- `query_embedding`: Embedding of the query.
- `filters`: A dictionary with filters to narrow down the search space.
- `top_k`: The number of top documents to retrieve. Default is 10.
- `scale_score`: Whether to scale the scores of the retrieved Documents. Default is False.
- `return_embedding`: Whether to return the embedding of the retrieved Documents.
If `None`, the value of the `return_embedding` parameter set at component
initialization is used. Default is False.

**Returns**:

A list of the `top_k` documents most similar to the query embedding.

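The interplay of `top_k` and `return_embedding` can be sketched without the library. The hypothetical `embedding_top_k` below ranks plain dictionaries by dot product, the documented default similarity function, and blanks out the embedding on the returned copies when `return_embedding` is false. The dictionary layout is an assumption for the example.

```python
import heapq


def embedding_top_k(query_embedding: list[float], documents: list[dict],
                    top_k: int = 10, return_embedding: bool = False) -> list[dict]:
    def score(doc: dict) -> float:
        # Dot product between the query and the document embedding.
        return sum(q * d for q, d in zip(query_embedding, doc["embedding"]))

    best = heapq.nlargest(top_k, documents, key=score)
    results = []
    for doc in best:
        hit = dict(doc, score=score(doc))  # copy so the stored document is untouched
        if not return_embedding:
            hit["embedding"] = None
        results.append(hit)
    return results


documents = [
    {"id": "a", "embedding": [0.1, 0.9]},
    {"id": "b", "embedding": [0.8, 0.2]},
    {"id": "c", "embedding": [0.5, 0.5]},
]
hits = embedding_top_k([1.0, 0.0], documents, top_k=2)
```
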
<a id="document_store.InMemoryDocumentStore.count_documents_async"></a>

#### InMemoryDocumentStore.count\_documents\_async

```python
async def count_documents_async() -> int
```

Returns the number of documents present in the DocumentStore.

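One common way to offer async variants of synchronous methods is to run them on a `ThreadPoolExecutor`, which is also what the `async_executor` constructor argument and `shutdown()` suggest. The sketch below shows that pattern with a hypothetical `AsyncCounterSketch` class, including the ownership rule that only an executor created by the instance is shut down. It is an illustration under those assumptions, not the library's code.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


class AsyncCounterSketch:
    """Hypothetical class showing the executor-ownership pattern."""

    def __init__(self, async_executor: Optional[ThreadPoolExecutor] = None):
        self.storage: dict[str, str] = {"1": "first", "2": "second"}
        # Remember whether the executor was created here, so shutdown()
        # never closes an executor supplied by the caller.
        self._owns_executor = async_executor is None
        self.executor = async_executor or ThreadPoolExecutor(max_workers=1)

    def count_documents(self) -> int:
        return len(self.storage)

    async def count_documents_async(self) -> int:
        # Run the synchronous method on the executor without blocking the loop.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self.executor, self.count_documents)

    def shutdown(self) -> None:
        if self._owns_executor:
            self.executor.shutdown(wait=True)


store = AsyncCounterSketch()
count = asyncio.run(store.count_documents_async())
store.shutdown()
```
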
<a id="document_store.InMemoryDocumentStore.filter_documents_async"></a>

#### InMemoryDocumentStore.filter\_documents\_async

```python
async def filter_documents_async(
        filters: Optional[dict[str, Any]] = None) -> list[Document]
```

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol
documentation.

**Arguments**:

- `filters`: The filters to apply to the document list.

**Returns**:

A list of Documents that match the given filters.

<a id="document_store.InMemoryDocumentStore.write_documents_async"></a>

#### InMemoryDocumentStore.write\_documents\_async

```python
async def write_documents_async(
        documents: list[Document],
        policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int
```

Refer to the DocumentStore.write_documents() protocol documentation.

If `policy` is set to `DuplicatePolicy.NONE`, it defaults to `DuplicatePolicy.FAIL`.

<a id="document_store.InMemoryDocumentStore.delete_documents_async"></a>

#### InMemoryDocumentStore.delete\_documents\_async

```python
async def delete_documents_async(document_ids: list[str]) -> None
```

Deletes all documents with matching document_ids from the DocumentStore.

**Arguments**:

- `document_ids`: The IDs of the documents to delete.

<a id="document_store.InMemoryDocumentStore.bm25_retrieval_async"></a>

#### InMemoryDocumentStore.bm25\_retrieval\_async

```python
async def bm25_retrieval_async(query: str,
                               filters: Optional[dict[str, Any]] = None,
                               top_k: int = 10,
                               scale_score: bool = False) -> list[Document]
```

Retrieves the documents that are most relevant to the query using the BM25 algorithm.

**Arguments**:

- `query`: The query string.
- `filters`: A dictionary with filters to narrow down the search space.
- `top_k`: The number of top documents to retrieve. Default is 10.
- `scale_score`: Whether to scale the scores of the retrieved documents. Default is False.

**Returns**:

A list of the `top_k` documents most relevant to the query.

<a id="document_store.InMemoryDocumentStore.embedding_retrieval_async"></a>

#### InMemoryDocumentStore.embedding\_retrieval\_async

```python
async def embedding_retrieval_async(
        query_embedding: list[float],
        filters: Optional[dict[str, Any]] = None,
        top_k: int = 10,
        scale_score: bool = False,
        return_embedding: bool = False) -> list[Document]
```

Retrieves the documents that are most similar to the query embedding using a vector similarity metric.

**Arguments**:

- `query_embedding`: Embedding of the query.
- `filters`: A dictionary with filters to narrow down the search space.
- `top_k`: The number of top documents to retrieve. Default is 10.
- `scale_score`: Whether to scale the scores of the retrieved Documents. Default is False.
- `return_embedding`: Whether to return the embedding of the retrieved Documents. Default is False.

**Returns**:

A list of the `top_k` documents most similar to the query embedding.