---
title: "Document Stores"
id: document-stores-api
description: "Stores your texts and metadata and provides them to the Retriever at query time."
slug: "/document-stores-api"
---


## document_store

### BM25DocumentStats

A dataclass for managing document statistics for BM25 retrieval.

**Parameters:**

- **freq_token** (<code>dict\[str, int\]</code>) – A Counter of token frequencies in the document.
- **doc_len** (<code>int</code>) – Number of tokens in the document.

### InMemoryDocumentStore

Stores data in memory. The store is ephemeral: its contents are lost when the process ends unless you explicitly write them to a JSON file with `save_to_disk`.

#### __init__

```python
__init__(
    bm25_tokenization_regex: str = "(?u)\\b\\w\\w+\\b",
    bm25_algorithm: Literal["BM25Okapi", "BM25L", "BM25Plus"] = "BM25L",
    bm25_parameters: dict | None = None,
    embedding_similarity_function: Literal[
        "dot_product", "cosine"
    ] = "dot_product",
    index: str | None = None,
    async_executor: ThreadPoolExecutor | None = None,
    return_embedding: bool = True,
)
```

Initializes the DocumentStore.

**Parameters:**

- **bm25_tokenization_regex** (<code>str</code>) – The regular expression used to tokenize the text for BM25 retrieval.
- **bm25_algorithm** (<code>Literal['BM25Okapi', 'BM25L', 'BM25Plus']</code>) – The BM25 algorithm to use. One of "BM25Okapi", "BM25L", or "BM25Plus".
- **bm25_parameters** (<code>dict | None</code>) – Parameters for the BM25 implementation, passed as a dictionary.
For example: `{'k1':1.5, 'b':0.75, 'epsilon':0.25}`
You can learn more about these parameters by visiting https://github.com/dorianbrown/rank_bm25.
- **embedding_similarity_function** (<code>Literal['dot_product', 'cosine']</code>) – The similarity function used to compare Document embeddings.
One of "dot_product" (default) or "cosine". To choose the most appropriate function, look for information
about your embedding model.
- **index** (<code>str | None</code>) – A specific index to store the documents. If not specified, a random UUID is used.
Using the same index allows you to store documents across multiple InMemoryDocumentStore instances.
- **async_executor** (<code>ThreadPoolExecutor | None</code>) – Optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded
executor will be initialized and used.
- **return_embedding** (<code>bool</code>) – Whether to return the embedding of the retrieved Documents. Default is True.

#### shutdown

```python
shutdown()
```

Explicitly shuts down the executor if this instance owns it.

#### storage

```python
storage: dict[str, Document]
```

Utility property that returns the storage used by this instance of InMemoryDocumentStore.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> InMemoryDocumentStore
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>InMemoryDocumentStore</code> – The deserialized component.

#### save_to_disk

```python
save_to_disk(path: str) -> None
```

Writes the database and its data to disk as a JSON file.

**Parameters:**

- **path** (<code>str</code>) – The path to the JSON file.

#### load_from_disk

```python
load_from_disk(path: str) -> InMemoryDocumentStore
```

Loads the database and its data from a JSON file on disk.

**Parameters:**

- **path** (<code>str</code>) – The path to the JSON file.

**Returns:**

- <code>InMemoryDocumentStore</code> – The loaded InMemoryDocumentStore.

#### count_documents

```python
count_documents() -> int
```

Returns the number of documents present in the DocumentStore.

#### filter_documents

```python
filter_documents(filters: dict[str, Any] | None = None) -> list[Document]
```

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol
documentation.

**Parameters:**

- **filters** (<code>dict\[str, Any\] | None</code>) – The filters to apply to the document list.

**Returns:**

- <code>list\[Document\]</code> – A list of Documents that match the given filters.

#### write_documents

```python
write_documents(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
```

Refer to the DocumentStore.write_documents() protocol documentation.

If `policy` is set to `DuplicatePolicy.NONE`, it defaults to `DuplicatePolicy.FAIL`.

#### delete_documents

```python
delete_documents(document_ids: list[str]) -> None
```

Deletes all documents with matching document_ids from the DocumentStore.

**Parameters:**

- **document_ids** (<code>list\[str\]</code>) – The IDs of the documents to delete.

#### delete_all_documents

```python
delete_all_documents() -> None
```

Deletes all documents in the document store.

#### update_by_filter

```python
update_by_filter(filters: dict[str, Any], meta: dict[str, Any]) -> int
```

Updates the metadata of all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for updating.
For filter syntax, see filter_documents.
- **meta** (<code>dict\[str, Any\]</code>) – The metadata fields to update. These will be merged with existing metadata.

**Returns:**

- <code>int</code> – The number of documents updated.

**Raises:**

- <code>ValueError</code> – If the filters have invalid syntax.

#### delete_by_filter

```python
delete_by_filter(filters: dict[str, Any]) -> int
```

Deletes all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for deletion.
For filter syntax, see filter_documents.

**Returns:**

- <code>int</code> – The number of documents deleted.

**Raises:**

- <code>ValueError</code> – If the filters have invalid syntax.

#### bm25_retrieval

```python
bm25_retrieval(
    query: str,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
) -> list[Document]
```

Retrieves the documents that are most relevant to the query using the BM25 algorithm.

**Parameters:**

- **query** (<code>str</code>) – The query string.
- **filters** (<code>dict\[str, Any\] | None</code>) – A dictionary with filters to narrow down the search space.
- **top_k** (<code>int</code>) – The number of top documents to retrieve. Default is 10.
- **scale_score** (<code>bool</code>) – Whether to scale the scores of the retrieved documents. Default is False.

**Returns:**

- <code>list\[Document\]</code> – A list of the top_k documents most relevant to the query.
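
To make the BM25 inputs more concrete, here is a minimal, standalone sketch of how the default `bm25_tokenization_regex` splits text and how per-document statistics like those in `BM25DocumentStats` (`freq_token`, `doc_len`) could be derived. It uses only the Python standard library and is not the store's actual implementation: the `tokenize` and `document_stats` helpers are hypothetical, and lowercasing the text before tokenizing is an assumption made for illustration.

```python
import re
from collections import Counter

# Default value of `bm25_tokenization_regex` from the __init__ signature.
# It matches runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text: str) -> list[str]:
    """Hypothetical helper: lowercase the text (an assumption) and extract tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def document_stats(text: str) -> tuple[Counter, int]:
    """Hypothetical helper: return (token frequencies, number of tokens)."""
    tokens = tokenize(text)
    return Counter(tokens), len(tokens)


freq_token, doc_len = document_stats("A quick fox is a fox")
print(freq_token)  # single-character tokens such as "a" are not matched
print(doc_len)
```

Because the default pattern requires at least two word characters, one-letter words never reach the BM25 index. If that matters for your data, a different `bm25_tokenization_regex` can be passed at initialization.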

#### embedding_retrieval

```python
embedding_retrieval(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
    return_embedding: bool | None = False,
) -> list[Document]
```

Retrieves documents that are most similar to the query embedding using a vector similarity metric.

**Parameters:**

- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – A dictionary with filters to narrow down the search space.
- **top_k** (<code>int</code>) – The number of top documents to retrieve. Default is 10.
- **scale_score** (<code>bool</code>) – Whether to scale the scores of the retrieved Documents. Default is False.
- **return_embedding** (<code>bool | None</code>) – Whether to return the embedding of the retrieved Documents.
If not provided, the value of the `return_embedding` parameter set at component
initialization will be used. Default is False.

**Returns:**

- <code>list\[Document\]</code> – A list of the top_k documents most relevant to the query.

**Raises:**

- <code>ValueError</code> – If the filters have invalid syntax.

#### count_documents_async

```python
count_documents_async() -> int
```

Returns the number of documents present in the DocumentStore.

#### filter_documents_async

```python
filter_documents_async(filters: dict[str, Any] | None = None) -> list[Document]
```

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the DocumentStore.filter_documents() protocol
documentation.

**Parameters:**

- **filters** (<code>dict\[str, Any\] | None</code>) – The filters to apply to the document list.

**Returns:**

- <code>list\[Document\]</code> – A list of Documents that match the given filters.

#### write_documents_async

```python
write_documents_async(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
```

Refer to the DocumentStore.write_documents() protocol documentation.

If `policy` is set to `DuplicatePolicy.NONE`, it defaults to `DuplicatePolicy.FAIL`.

#### delete_documents_async

```python
delete_documents_async(document_ids: list[str]) -> None
```

Deletes all documents with matching document_ids from the DocumentStore.

**Parameters:**

- **document_ids** (<code>list\[str\]</code>) – The IDs of the documents to delete.

#### bm25_retrieval_async

```python
bm25_retrieval_async(
    query: str,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
) -> list[Document]
```

Retrieves the documents that are most relevant to the query using the BM25 algorithm.

**Parameters:**

- **query** (<code>str</code>) – The query string.
- **filters** (<code>dict\[str, Any\] | None</code>) – A dictionary with filters to narrow down the search space.
- **top_k** (<code>int</code>) – The number of top documents to retrieve. Default is 10.
- **scale_score** (<code>bool</code>) – Whether to scale the scores of the retrieved documents. Default is False.

**Returns:**

- <code>list\[Document\]</code> – A list of the top_k documents most relevant to the query.

#### embedding_retrieval_async

```python
embedding_retrieval_async(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
    return_embedding: bool = False,
) -> list[Document]
```

Retrieves documents that are most similar to the query embedding using a vector similarity metric.

**Parameters:**

- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – A dictionary with filters to narrow down the search space.
- **top_k** (<code>int</code>) – The number of top documents to retrieve. Default is 10.
- **scale_score** (<code>bool</code>) – Whether to scale the scores of the retrieved Documents. Default is False.
- **return_embedding** (<code>bool</code>) – Whether to return the embedding of the retrieved Documents. Default is False.

**Returns:**

- <code>list\[Document\]</code> – A list of the top_k documents most relevant to the query.
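
The choice between the two `embedding_similarity_function` values can change which documents rank highest. The following is a minimal sketch in plain Python, not the store's implementation, that ranks toy embeddings by dot product and by cosine similarity. The `rank` helper, the document IDs, and the vectors are all hypothetical and exist only for illustration.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    """Sum of element-wise products; sensitive to vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product of the vectors divided by the product of their norms."""
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    return dot_product(a, b) / norm if norm else 0.0


def rank(
    query_embedding: list[float],
    embeddings: dict[str, list[float]],
    similarity: str = "dot_product",
    top_k: int = 10,
) -> list[str]:
    """Hypothetical helper: return the IDs of the top_k most similar embeddings."""
    score = dot_product if similarity == "dot_product" else cosine
    ranked = sorted(
        embeddings,
        key=lambda doc_id: score(query_embedding, embeddings[doc_id]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy data: "long" has a large magnitude, "aligned" points exactly along the query.
query = [1.0, 0.0]
docs = {"long": [3.0, 4.0], "aligned": [1.0, 0.0]}

print(rank(query, docs, similarity="dot_product", top_k=1))  # → ['long']
print(rank(query, docs, similarity="cosine", top_k=1))       # → ['aligned']
```

Dot product rewards vector magnitude, while cosine similarity only considers direction, which is why the two settings disagree on this toy data. For embeddings that are already normalized to unit length, both produce the same ordering.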