---
title: "Document Stores"
id: document-stores-api
description: "Stores your texts and metadata and provides them to the Retriever at query time."
slug: "/document-stores-api"
---


## document_store

### BM25DocumentStats

A dataclass for managing document statistics for BM25 retrieval.

**Parameters:**

- **freq_token** (<code>dict\[str, int\]</code>) – A Counter of token frequencies in the document.
- **doc_len** (<code>int</code>) – Number of tokens in the document.

### InMemoryDocumentStore

Stores data in memory. The store is ephemeral: its contents are lost when the process exits unless you write them to a JSON file with `save_to_disk`.

#### __init__

```python
__init__(
    bm25_tokenization_regex: str = "(?u)\\b\\w+\\b",
    bm25_algorithm: Literal["BM25Okapi", "BM25L", "BM25Plus"] = "BM25L",
    bm25_parameters: dict | None = None,
    embedding_similarity_function: Literal[
        "dot_product", "cosine"
    ] = "dot_product",
    index: str | None = None,
    async_executor: ThreadPoolExecutor | None = None,
    return_embedding: bool = True,
) -> None
```

Initializes the DocumentStore.

**Parameters:**

- **bm25_tokenization_regex** (<code>str</code>) – The regular expression used to tokenize the text for BM25 retrieval.
- **bm25_algorithm** (<code>Literal['BM25Okapi', 'BM25L', 'BM25Plus']</code>) – The BM25 algorithm to use. One of "BM25Okapi", "BM25L", or "BM25Plus".
- **bm25_parameters** (<code>dict | None</code>) – Parameters for the BM25 implementation, given as a dictionary.
  For example: `{'k1':1.5, 'b':0.75, 'epsilon':0.25}`
  You can learn more about these parameters by visiting https://github.com/dorianbrown/rank_bm25.
- **embedding_similarity_function** (<code>Literal['dot_product', 'cosine']</code>) – The similarity function used to compare Document embeddings.
  One of "dot_product" (default) or "cosine". To choose the most appropriate function, look for information
  about your embedding model.
- **index** (<code>str | None</code>) – A specific index to store the documents. If not specified, a random UUID is used.
  Using the same index allows you to store documents across multiple InMemoryDocumentStore instances.
- **async_executor** (<code>ThreadPoolExecutor | None</code>) – Optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded
  executor will be initialized and used.
- **return_embedding** (<code>bool</code>) – Whether to return the embedding of the retrieved Documents. Default is True.

#### shutdown

```python
shutdown() -> None
```

Explicitly shuts down the executor if this instance owns it.

#### storage

```python
storage: dict[str, Document]
```

Utility property that returns the storage used by this instance of InMemoryDocumentStore.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> InMemoryDocumentStore
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>InMemoryDocumentStore</code> – The deserialized component.

#### save_to_disk

```python
save_to_disk(path: str) -> None
```

Writes the database and its data to disk as a JSON file.

**Parameters:**

- **path** (<code>str</code>) – The path to the JSON file.

#### load_from_disk

```python
load_from_disk(path: str) -> InMemoryDocumentStore
```

Loads the database and its data from a JSON file on disk.

**Parameters:**

- **path** (<code>str</code>) – The path to the JSON file.

**Returns:**

- <code>InMemoryDocumentStore</code> – The loaded InMemoryDocumentStore.

#### count_documents

```python
count_documents() -> int
```

Returns the number of documents present in the DocumentStore.

#### filter_documents

```python
filter_documents(filters: dict[str, Any] | None = None) -> list[Document]
```

Returns the documents that match the filters provided.

**Parameters:**

- **filters** (<code>dict\[str, Any\] | None</code>) – The filters to apply. For a detailed specification of the filters, refer to the
  [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>list\[Document\]</code> – A list of Documents that match the given filters.

#### write_documents

```python
write_documents(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
```

Refer to the DocumentStore.write_documents() protocol documentation.

If `policy` is set to `DuplicatePolicy.NONE`, it defaults to `DuplicatePolicy.FAIL`.

#### delete_documents

```python
delete_documents(document_ids: list[str]) -> None
```

Deletes all documents with matching document_ids from the DocumentStore.

**Parameters:**

- **document_ids** (<code>list\[str\]</code>) – The document_ids to delete.

#### delete_all_documents

```python
delete_all_documents() -> None
```

Deletes all documents in the document store.

#### update_by_filter

```python
update_by_filter(filters: dict[str, Any], meta: dict[str, Any]) -> int
```

Updates the metadata of all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for updating.
  For filter syntax, see filter_documents.
- **meta** (<code>dict\[str, Any\]</code>) – The metadata fields to update. These will be merged with existing metadata.

**Returns:**

- <code>int</code> – The number of documents updated.

**Raises:**

- <code>ValueError</code> – if filters have invalid syntax.

#### delete_by_filter

```python
delete_by_filter(filters: dict[str, Any]) -> int
```

Deletes all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for deletion.
  For filter syntax, see filter_documents.

**Returns:**

- <code>int</code> – The number of documents deleted.

**Raises:**

- <code>ValueError</code> – if filters have invalid syntax.

#### count_documents_by_filter

```python
count_documents_by_filter(filters: dict[str, Any]) -> int
```

Returns the number of documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply.
  For a detailed specification of the filters, refer to the
  [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>int</code> – The number of documents that match the filters.

#### count_unique_metadata_by_filter

```python
count_unique_metadata_by_filter(
    filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]
```

Returns the number of unique values for each specified metadata field from documents matching the filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply.
  For a detailed specification of the filters, refer to the
  [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering).
- **metadata_fields** (<code>list\[str\]</code>) – List of field names to count unique values for.
  Field names can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, int\]</code> – A dictionary mapping each metadata field name (without "meta." prefix)
  to the count of its unique values among the filtered documents.

#### get_metadata_fields_info

```python
get_metadata_fields_info() -> dict[str, dict[str, str]]
```

Returns information about the metadata fields present in the stored documents.

Types are inferred from the stored values (keyword, int, float, boolean).

**Returns:**

- <code>dict\[str, dict\[str, str\]\]</code> – A dictionary mapping each metadata field name to a dict with a "type" key.

#### get_metadata_field_min_max

```python
get_metadata_field_min_max(metadata_field: str) -> dict[str, Any]
```

Returns the minimum and maximum values for the given metadata field across all documents.

**Parameters:**

- **metadata_field** (<code>str</code>) – The metadata field name. Can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with "min" and "max" keys. Returns `{"min": None, "max": None}`
  if the field is missing or has no values.

#### get_metadata_field_unique_values

```python
get_metadata_field_unique_values(
    metadata_field: str, search_term: str | None = None
) -> tuple[list[str], int]
```

Returns unique values for a metadata field, optionally filtered by a search term in content.

**Parameters:**

- **metadata_field** (<code>str</code>) – The metadata field name. Can include or omit the "meta." prefix.
- **search_term** (<code>str | None</code>) – If set, only documents whose content contains this term (case-insensitive)
  are considered.

**Returns:**

- <code>tuple\[list\[str\], int\]</code> – A tuple of (list of unique values, total count of unique values).

#### bm25_retrieval

```python
bm25_retrieval(
    query: str,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
) -> list[Document]
```

Retrieves documents that are most relevant to the query using the BM25 algorithm.

**Parameters:**

- **query** (<code>str</code>) – The query string.
- **filters** (<code>dict\[str, Any\] | None</code>) – A dictionary with filters to narrow down the search space.
- **top_k** (<code>int</code>) – The number of top documents to retrieve. Default is 10.
- **scale_score** (<code>bool</code>) – Whether to scale the scores of the retrieved documents. Default is False.

**Returns:**

- <code>list\[Document\]</code> – A list of the top_k documents most relevant to the query.

#### embedding_retrieval

```python
embedding_retrieval(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
    return_embedding: bool | None = False,
) -> list[Document]
```

Retrieves documents that are most similar to the query embedding using a vector similarity metric.

**Parameters:**

- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – A dictionary with filters to narrow down the search space.
- **top_k** (<code>int</code>) – The number of top documents to retrieve. Default is 10.
- **scale_score** (<code>bool</code>) – Whether to scale the scores of the retrieved Documents. Default is False.
- **return_embedding** (<code>bool | None</code>) – Whether to return the embedding of the retrieved Documents.
  If not provided, the value of the `return_embedding` parameter set at component
  initialization will be used. Default is False.

**Returns:**

- <code>list\[Document\]</code> – A list of the top_k documents most relevant to the query.

**Raises:**

- <code>ValueError</code> – if filters have invalid syntax.

#### count_documents_async

```python
count_documents_async() -> int
```

Returns the number of documents present in the DocumentStore.

#### filter_documents_async

```python
filter_documents_async(filters: dict[str, Any] | None = None) -> list[Document]
```

Returns the documents that match the filters provided.

**Parameters:**

- **filters** (<code>dict\[str, Any\] | None</code>) – The filters to apply. For a detailed specification of the filters, refer to the
  [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>list\[Document\]</code> – A list of Documents that match the given filters.

#### write_documents_async

```python
write_documents_async(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
```

Refer to the DocumentStore.write_documents() protocol documentation.

If `policy` is set to `DuplicatePolicy.NONE`, it defaults to `DuplicatePolicy.FAIL`.

#### delete_documents_async

```python
delete_documents_async(document_ids: list[str]) -> None
```

Deletes all documents with matching document_ids from the DocumentStore.

**Parameters:**

- **document_ids** (<code>list\[str\]</code>) – The document_ids to delete.

#### update_by_filter_async

```python
update_by_filter_async(filters: dict[str, Any], meta: dict[str, Any]) -> int
```

Updates the metadata of all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for updating.
  For filter syntax, see filter_documents.
- **meta** (<code>dict\[str, Any\]</code>) – The metadata fields to update. These will be merged with existing metadata.

**Returns:**

- <code>int</code> – The number of documents updated.

#### count_documents_by_filter_async

```python
count_documents_by_filter_async(filters: dict[str, Any]) -> int
```

Returns the number of documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply.
  For a detailed specification of the filters, refer to the
  [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>int</code> – The number of documents that match the filters.

#### count_unique_metadata_by_filter_async

```python
count_unique_metadata_by_filter_async(
    filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]
```

Returns the number of unique values for each specified metadata field from documents matching the filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply.
  For a detailed specification of the filters, refer to the
  [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering).
- **metadata_fields** (<code>list\[str\]</code>) – List of field names to count unique values for.
  Field names can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, int\]</code> – A dictionary mapping each metadata field name (without "meta." prefix)
  to the count of its unique values among the filtered documents.
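
To make the return value of `count_unique_metadata_by_filter_async` concrete, here is a small library-independent sketch of the counting semantics. It is not Haystack's implementation: documents are plain dicts, the hypothetical `matches` predicate stands in for the filter dictionary, and `count_unique_metadata` is an illustrative name. It only shows how an optional "meta." prefix can be normalized and how unique values are counted per field.

```python
from typing import Any, Callable


def count_unique_metadata(
    documents: list[dict[str, Any]],
    matches: Callable[[dict[str, Any]], bool],
    metadata_fields: list[str],
) -> dict[str, int]:
    """Count distinct values per metadata field among documents accepted by `matches`."""
    # Accept both "meta.category" and "category"; result keys carry no prefix.
    names = [field.removeprefix("meta.") for field in metadata_fields]
    unique: dict[str, set] = {name: set() for name in names}
    for doc in documents:
        if not matches(doc):
            continue
        for name in names:
            if name in doc["meta"]:
                # Assumes metadata values are hashable (str, int, float, bool).
                unique[name].add(doc["meta"][name])
    return {name: len(values) for name, values in unique.items()}


docs = [
    {"content": "a", "meta": {"category": "news", "year": 2023}},
    {"content": "b", "meta": {"category": "blog", "year": 2023}},
    {"content": "c", "meta": {"category": "news", "year": 2024}},
    {"content": "d", "meta": {"category": "wiki", "year": 2021}},
]

counts = count_unique_metadata(
    docs,
    matches=lambda d: d["meta"]["year"] >= 2023,  # stand-in for a filter dict
    metadata_fields=["meta.category", "year"],
)
print(counts)  # {'category': 2, 'year': 2}
```

Three documents pass the predicate; they hold two distinct categories and two distinct years, and both keys are reported without the "meta." prefix.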

#### get_metadata_fields_info_async

```python
get_metadata_fields_info_async() -> dict[str, dict[str, str]]
```

Returns information about the metadata fields present in the stored documents.

Types are inferred from the stored values (keyword, int, float, boolean).

**Returns:**

- <code>dict\[str, dict\[str, str\]\]</code> – A dictionary mapping each metadata field name to a dict with a "type" key.

#### get_metadata_field_min_max_async

```python
get_metadata_field_min_max_async(metadata_field: str) -> dict[str, Any]
```

Returns the minimum and maximum values for the given metadata field across all documents.

**Parameters:**

- **metadata_field** (<code>str</code>) – The metadata field name. Can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with "min" and "max" keys. Returns `{"min": None, "max": None}`
  if the field is missing or has no values.

#### get_metadata_field_unique_values_async

```python
get_metadata_field_unique_values_async(
    metadata_field: str, search_term: str | None = None
) -> tuple[list[str], int]
```

Returns unique values for a metadata field, optionally filtered by a search term in content.

**Parameters:**

- **metadata_field** (<code>str</code>) – The metadata field name. Can include or omit the "meta." prefix.
- **search_term** (<code>str | None</code>) – If set, only documents whose content contains this term (case-insensitive)
  are considered.

**Returns:**

- <code>tuple\[list\[str\], int\]</code> – A tuple of (list of unique values, total count of unique values).

#### delete_all_documents_async

```python
delete_all_documents_async() -> None
```

Deletes all documents in the document store.
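
The `_async` methods mirror their synchronous counterparts, and the constructor accepts an `async_executor` for async calls. As a rough illustration of how a thread-pool-backed async wrapper can work, the sketch below uses a hypothetical `ToyStore` class. This is an assumption about the general pattern, not Haystack's code: only `asyncio` and `ThreadPoolExecutor` from the standard library are real here.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


class ToyStore:
    """Hypothetical stand-in for a document store; not the Haystack class."""

    def __init__(self, async_executor: Optional[ThreadPoolExecutor] = None) -> None:
        self._docs: dict[str, str] = {}
        # Create a single-threaded executor only when the caller supplied none.
        self._owns_executor = async_executor is None
        self._executor = async_executor or ThreadPoolExecutor(max_workers=1)

    def write(self, doc_id: str, content: str) -> None:
        self._docs[doc_id] = content

    def count_documents(self) -> int:
        return len(self._docs)

    def delete_all_documents(self) -> None:
        self._docs.clear()

    async def count_documents_async(self) -> int:
        # Run the blocking method on the executor so the event loop stays free.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(self._executor, self.count_documents)

    async def delete_all_documents_async(self) -> None:
        loop = asyncio.get_running_loop()
        await loop.run_in_executor(self._executor, self.delete_all_documents)

    def shutdown(self) -> None:
        # Only shut down an executor this instance created itself.
        if self._owns_executor:
            self._executor.shutdown(wait=True)


async def main() -> tuple[int, int]:
    store = ToyStore()
    store.write("1", "hello")
    store.write("2", "world")
    before = await store.count_documents_async()
    await store.delete_all_documents_async()
    after = await store.count_documents_async()
    store.shutdown()
    return before, after


before, after = asyncio.run(main())
print(before, after)  # 2 0
```

In this sketch, an executor passed in by the caller is left running by `shutdown()`, which matches the "if we own it" wording used for `shutdown` in this reference.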

#### bm25_retrieval_async

```python
bm25_retrieval_async(
    query: str,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
) -> list[Document]
```

Retrieves documents that are most relevant to the query using the BM25 algorithm.

**Parameters:**

- **query** (<code>str</code>) – The query string.
- **filters** (<code>dict\[str, Any\] | None</code>) – A dictionary with filters to narrow down the search space.
- **top_k** (<code>int</code>) – The number of top documents to retrieve. Default is 10.
- **scale_score** (<code>bool</code>) – Whether to scale the scores of the retrieved documents. Default is False.

**Returns:**

- <code>list\[Document\]</code> – A list of the top_k documents most relevant to the query.

#### embedding_retrieval_async

```python
embedding_retrieval_async(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    scale_score: bool = False,
    return_embedding: bool = False,
) -> list[Document]
```

Retrieves documents that are most similar to the query embedding using a vector similarity metric.

**Parameters:**

- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – A dictionary with filters to narrow down the search space.
- **top_k** (<code>int</code>) – The number of top documents to retrieve. Default is 10.
- **scale_score** (<code>bool</code>) – Whether to scale the scores of the retrieved Documents. Default is False.
- **return_embedding** (<code>bool</code>) – Whether to return the embedding of the retrieved Documents. Default is False.

**Returns:**

- <code>list\[Document\]</code> – A list of the top_k documents most relevant to the query.
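
The two values accepted by `embedding_similarity_function`, "dot_product" and "cosine", can rank the same documents differently. The following standard-library sketch illustrates that difference together with a `top_k` cut-off. The helper names (`dot_product`, `cosine`, `rank`) and the toy vectors are hypothetical, and the code makes no claim about how the store computes or scales scores internally.

```python
import math
from typing import Callable


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity is the dot product of the length-normalized vectors.
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    return dot_product(a, b) / norm if norm else 0.0


def rank(
    query: list[float],
    embeddings: dict[str, list[float]],
    similarity: Callable[[list[float], list[float]], float],
    top_k: int = 10,
) -> list[tuple[str, float]]:
    """Score every embedding against the query and keep the top_k best."""
    scored = [(doc_id, similarity(query, emb)) for doc_id, emb in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


embeddings = {
    "long": [3.0, 3.0],        # large magnitude, 45 degrees off the query
    "aligned": [1.0, 0.0],     # same direction as the query
    "orthogonal": [0.0, 1.0],  # unrelated direction
}
query = [1.0, 0.0]

by_dot = rank(query, embeddings, dot_product, top_k=2)
by_cos = rank(query, embeddings, cosine, top_k=2)

print([doc_id for doc_id, _ in by_dot])  # ['long', 'aligned']
print([doc_id for doc_id, _ in by_cos])  # ['aligned', 'long']
```

The dot product rewards vector magnitude, so the long vector wins, while cosine similarity compares direction only. This is why the constructor documentation suggests checking which function suits your embedding model.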