---
title: "Weaviate"
id: integrations-weaviate
description: "Weaviate integration for Haystack"
slug: "/integrations-weaviate"
---


## haystack_integrations.components.retrievers.weaviate.bm25_retriever

### WeaviateBM25Retriever

A component for retrieving documents from Weaviate using the BM25 algorithm.

Example usage:

```python
from haystack_integrations.document_stores.weaviate.document_store import (
    WeaviateDocumentStore,
)
from haystack_integrations.components.retrievers.weaviate.bm25_retriever import (
    WeaviateBM25Retriever,
)

document_store = WeaviateDocumentStore(url="http://localhost:8080")
retriever = WeaviateBM25Retriever(document_store=document_store)
retriever.run(query="How to make a pizza", top_k=3)
```

#### __init__

```python
__init__(
    *,
    document_store: WeaviateDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    filter_policy: str | FilterPolicy = FilterPolicy.REPLACE
)
```

Creates a new instance of WeaviateBM25Retriever.

**Parameters:**

- **document_store** (<code>WeaviateDocumentStore</code>) – Instance of WeaviateDocumentStore that will be used by this retriever.
- **filters** (<code>dict\[str, Any\] | None</code>) – Custom filters applied when running the retriever.
- **top_k** (<code>int</code>) – Maximum number of documents to return.
- **filter_policy** (<code>str | FilterPolicy</code>) – Policy to determine how filters are applied.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> WeaviateBM25Retriever
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>WeaviateBM25Retriever</code> – Deserialized component.

#### run

```python
run(
    query: str, filters: dict[str, Any] | None = None, top_k: int | None = None
) -> dict[str, list[Document]]
```

Retrieves documents from Weaviate using the BM25 algorithm.

**Parameters:**

- **query** (<code>str</code>) – The query text.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. The way runtime filters are applied depends on
  the `filter_policy` chosen at retriever initialization. See init method docstring for more
  details.
- **top_k** (<code>int | None</code>) – The maximum number of documents to return.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: List of documents returned by the search engine.

#### run_async

```python
run_async(
    query: str, filters: dict[str, Any] | None = None, top_k: int | None = None
) -> dict[str, list[Document]]
```

Asynchronously retrieves documents from Weaviate using the BM25 algorithm.

**Parameters:**

- **query** (<code>str</code>) – The query text.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. The way runtime filters are applied depends on
  the `filter_policy` chosen at retriever initialization. See init method docstring for more
  details.
- **top_k** (<code>int | None</code>) – The maximum number of documents to return.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: List of documents returned by the search engine.
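
The optional `filters` and `top_k` arguments of `run` interact with the values given to `__init__`. The following is a minimal, dependency-free sketch, not the integration's code. It assumes that a missing run-time `top_k` falls back to the init-time value and that, under a replace-style policy, run-time filters take the place of init-time filters when supplied. The helper name `resolve_run_arguments` and the sample filters are hypothetical.

```python
from typing import Any, Optional


def resolve_run_arguments(
    init_filters: Optional[dict[str, Any]],
    init_top_k: int,
    run_filters: Optional[dict[str, Any]] = None,
    run_top_k: Optional[int] = None,
) -> tuple[Optional[dict[str, Any]], int]:
    """Hypothetical helper: pick the filters and top_k a replace-style run would use."""
    # Assumption: run-time filters, when given, replace the init-time filters entirely.
    filters = run_filters if run_filters else init_filters
    # Assumption: a missing run-time top_k falls back to the init-time value.
    top_k = run_top_k if run_top_k is not None else init_top_k
    return filters, top_k


init_filters = {"field": "meta.lang", "operator": "==", "value": "en"}
run_filters = {"field": "meta.year", "operator": ">=", "value": 2020}

# No run-time arguments: the init-time values are used.
no_override = resolve_run_arguments(init_filters, 10)
# Run-time arguments supplied: they take precedence in this sketch.
with_override = resolve_run_arguments(init_filters, 10, run_filters, 3)
print(no_override)
print(with_override)
```
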

## haystack_integrations.components.retrievers.weaviate.embedding_retriever

### WeaviateEmbeddingRetriever

A retriever that uses Weaviate's vector search to find similar documents based on the embeddings of the query.

#### __init__

```python
__init__(
    *,
    document_store: WeaviateDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    distance: float | None = None,
    certainty: float | None = None,
    filter_policy: str | FilterPolicy = FilterPolicy.REPLACE
)
```

Creates a new instance of WeaviateEmbeddingRetriever.

**Parameters:**

- **document_store** (<code>WeaviateDocumentStore</code>) – Instance of WeaviateDocumentStore that will be used by this retriever.
- **filters** (<code>dict\[str, Any\] | None</code>) – Custom filters applied when running the retriever.
- **top_k** (<code>int</code>) – Maximum number of documents to return.
- **distance** (<code>float | None</code>) – The maximum allowed distance between Documents' embeddings.
- **certainty** (<code>float | None</code>) – Normalized distance between the result item and the search vector.
- **filter_policy** (<code>str | FilterPolicy</code>) – Policy to determine how filters are applied.

**Raises:**

- <code>ValueError</code> – If both `distance` and `certainty` are provided.
  See https://weaviate.io/developers/weaviate/api/graphql/search-operators#variables to learn more about
  `distance` and `certainty` parameters.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> WeaviateEmbeddingRetriever
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>WeaviateEmbeddingRetriever</code> – Deserialized component.

#### run

```python
run(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
    distance: float | None = None,
    certainty: float | None = None,
) -> dict[str, list[Document]]
```

Retrieves documents from Weaviate using the vector search.

**Parameters:**

- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. The way runtime filters are applied depends on
  the `filter_policy` chosen at retriever initialization. See init method docstring for more
  details.
- **top_k** (<code>int | None</code>) – The maximum number of documents to return.
- **distance** (<code>float | None</code>) – The maximum allowed distance between Documents' embeddings.
- **certainty** (<code>float | None</code>) – Normalized distance between the result item and the search vector.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: List of documents returned by the search engine.

**Raises:**

- <code>ValueError</code> – If both `distance` and `certainty` are provided.
  See https://weaviate.io/developers/weaviate/api/graphql/search-operators#variables to learn more about
  `distance` and `certainty` parameters.
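
Because `distance` and `certainty` are mutually exclusive, it can help to picture the check and how the two thresholds relate. The sketch below is illustrative only and does not reproduce the retriever's code. The helper names are hypothetical, and the conversion assumes the cosine metric, where distance ranges over `[0, 2]` and certainty is taken as `1 - distance / 2`.

```python
from typing import Optional


def check_thresholds(distance: Optional[float], certainty: Optional[float]) -> None:
    """Hypothetical check mirroring the documented rule: set at most one threshold."""
    if distance is not None and certainty is not None:
        raise ValueError("Provide either `distance` or `certainty`, not both.")


def certainty_from_cosine_distance(distance: float) -> float:
    """Assumption: cosine distance in [0, 2] maps to certainty as 1 - distance / 2."""
    return 1 - distance / 2


# Only one threshold is set, so the check passes silently.
check_thresholds(distance=0.3, certainty=None)

# Setting both thresholds is rejected.
try:
    check_thresholds(distance=0.3, certainty=0.8)
    both_rejected = False
except ValueError:
    both_rejected = True

print(both_rejected)
print(certainty_from_cosine_distance(0.5))
```
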

#### run_async

```python
run_async(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
    distance: float | None = None,
    certainty: float | None = None,
) -> dict[str, list[Document]]
```

Asynchronously retrieves documents from Weaviate using the vector search.

**Parameters:**

- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. The way runtime filters are applied depends on
  the `filter_policy` chosen at retriever initialization. See init method docstring for more
  details.
- **top_k** (<code>int | None</code>) – The maximum number of documents to return.
- **distance** (<code>float | None</code>) – The maximum allowed distance between Documents' embeddings.
- **certainty** (<code>float | None</code>) – Normalized distance between the result item and the search vector.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: List of documents returned by the search engine.

**Raises:**

- <code>ValueError</code> – If both `distance` and `certainty` are provided.
  See https://weaviate.io/developers/weaviate/api/graphql/search-operators#variables to learn more about
  `distance` and `certainty` parameters.

## haystack_integrations.components.retrievers.weaviate.hybrid_retriever

### WeaviateHybridRetriever

A retriever that uses Weaviate's hybrid search, which combines keyword (BM25) and vector search, to find documents matching the query text and its embedding.

#### __init__

```python
__init__(
    *,
    document_store: WeaviateDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    alpha: float = 0.7,
    max_vector_distance: float | None = None,
    filter_policy: str | FilterPolicy = FilterPolicy.REPLACE
)
```

Creates a new instance of WeaviateHybridRetriever.

**Parameters:**

- **document_store** (<code>WeaviateDocumentStore</code>) – Instance of WeaviateDocumentStore that will be used by this retriever.
- **filters** (<code>dict\[str, Any\] | None</code>) – Custom filters applied when running the retriever.
- **top_k** (<code>int</code>) – Maximum number of documents to return.
- **alpha** (<code>float</code>) – Blending factor for hybrid retrieval in Weaviate. Must be in the range `[0.0, 1.0]`.

  Weaviate hybrid search combines keyword (BM25) and vector scores into a single ranking. `alpha` controls
  how much each part contributes to the final score:

  - `alpha = 0.0`: only keyword (BM25) scoring is used.
  - `alpha = 1.0`: only vector similarity scoring is used.
  - Values in between blend the two; higher values favor the vector score, lower values favor BM25.

  The default is 0.7, which is the Weaviate server default.

  See the official Weaviate docs on Hybrid Search parameters for more details:

  - [Hybrid search parameters](https://weaviate.io/developers/weaviate/search/hybrid#parameters)
  - [Hybrid Search](https://docs.weaviate.io/weaviate/concepts/search/hybrid-search)
- **max_vector_distance** (<code>float | None</code>) – Optional threshold that restricts the vector part of the hybrid search to candidates within a maximum
  vector distance. Candidates with a distance larger than this threshold are excluded from the vector portion
  before blending.

  Use this to prune low-quality vector matches while still benefiting from keyword recall.
  Leave `None` to use Weaviate's default behavior without an explicit cutoff.

  See the official Weaviate docs on Hybrid Search parameters for more details:

  - [Hybrid search parameters](https://weaviate.io/developers/weaviate/search/hybrid#parameters)
  - [Hybrid Search](https://docs.weaviate.io/weaviate/concepts/search/hybrid-search)
- **filter_policy** (<code>str | FilterPolicy</code>) – Policy to determine how filters are applied.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> WeaviateHybridRetriever
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>WeaviateHybridRetriever</code> – Deserialized component.

#### run

```python
run(
    query: str,
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
    alpha: float | None = None,
    max_vector_distance: float | None = None,
) -> dict[str, list[Document]]
```

Retrieves documents from Weaviate using hybrid search.

**Parameters:**

- **query** (<code>str</code>) – The query text.
- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. The way runtime filters are applied depends on
  the `filter_policy` chosen at retriever initialization. See init method docstring for more
  details.
- **top_k** (<code>int | None</code>) – The maximum number of documents to return.
- **alpha** (<code>float | None</code>) – Blending factor for hybrid retrieval in Weaviate. Must be in the range `[0.0, 1.0]`.

  Weaviate hybrid search combines keyword (BM25) and vector scores into a single ranking. `alpha` controls
  how much each part contributes to the final score:

  - `alpha = 0.0`: only keyword (BM25) scoring is used.
  - `alpha = 1.0`: only vector similarity scoring is used.
  - Values in between blend the two; higher values favor the vector score, lower values favor BM25.

  If `None`, the Weaviate server default is used.

  See the official Weaviate docs on Hybrid Search parameters for more details:

  - [Hybrid search parameters](https://weaviate.io/developers/weaviate/search/hybrid#parameters)
  - [Hybrid Search](https://docs.weaviate.io/weaviate/concepts/search/hybrid-search)
- **max_vector_distance** (<code>float | None</code>) – Optional threshold that restricts the vector part of the hybrid search to candidates within a maximum
  vector distance. Candidates with a distance larger than this threshold are excluded from the vector portion
  before blending.

  Use this to prune low-quality vector matches while still benefiting from keyword recall. Leave `None` to
  use Weaviate's default behavior without an explicit cutoff.

  See the official Weaviate docs on Hybrid Search parameters for more details:

  - [Hybrid search parameters](https://weaviate.io/developers/weaviate/search/hybrid#parameters)
  - [Hybrid Search](https://docs.weaviate.io/weaviate/concepts/search/hybrid-search)

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: List of documents returned by the search engine.
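
The effect of `alpha` can be pictured as a weighted blend of the two scores. The sketch below is a deliberate simplification: Weaviate's actual fusion algorithms normalize and combine rankings in more involved ways, and the function name `blend_scores` and the sample scores are hypothetical. It only illustrates the endpoints described above, where `alpha = 0.0` keeps the keyword score and `alpha = 1.0` keeps the vector score.

```python
def blend_scores(keyword_score: float, vector_score: float, alpha: float) -> float:
    """Simplified weighted blend; not Weaviate's real fusion algorithm."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in the range [0.0, 1.0]")
    # alpha weights the vector score, (1 - alpha) weights the keyword score.
    return (1.0 - alpha) * keyword_score + alpha * vector_score


keyword_only = blend_scores(keyword_score=0.9, vector_score=0.2, alpha=0.0)
vector_only = blend_scores(keyword_score=0.9, vector_score=0.2, alpha=1.0)
balanced = blend_scores(keyword_score=0.9, vector_score=0.2, alpha=0.5)
print(keyword_only, vector_only, balanced)
```
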

#### run_async

```python
run_async(
    query: str,
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
    alpha: float | None = None,
    max_vector_distance: float | None = None,
) -> dict[str, list[Document]]
```

Asynchronously retrieves documents from Weaviate using hybrid search.

**Parameters:**

- **query** (<code>str</code>) – The query text.
- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. The way runtime filters are applied depends on
  the `filter_policy` chosen at retriever initialization. See init method docstring for more
  details.
- **top_k** (<code>int | None</code>) – The maximum number of documents to return.
- **alpha** (<code>float | None</code>) – Blending factor for hybrid retrieval in Weaviate. Must be in the range `[0.0, 1.0]`.

  Weaviate hybrid search combines keyword (BM25) and vector scores into a single ranking. `alpha` controls
  how much each part contributes to the final score:

  - `alpha = 0.0`: only keyword (BM25) scoring is used.
  - `alpha = 1.0`: only vector similarity scoring is used.
  - Values in between blend the two; higher values favor the vector score, lower values favor BM25.

  If `None`, the Weaviate server default is used.

  See the official Weaviate docs on Hybrid Search parameters for more details:

  - [Hybrid search parameters](https://weaviate.io/developers/weaviate/search/hybrid#parameters)
  - [Hybrid Search](https://docs.weaviate.io/weaviate/concepts/search/hybrid-search)
- **max_vector_distance** (<code>float | None</code>) – Optional threshold that restricts the vector part of the hybrid search to candidates within a maximum
  vector distance.
  Candidates with a distance larger than this threshold are excluded from the vector portion
  before blending.

  Use this to prune low-quality vector matches while still benefiting from keyword recall. Leave `None` to
  use Weaviate's default behavior without an explicit cutoff.

  See the official Weaviate docs on Hybrid Search parameters for more details:

  - [Hybrid search parameters](https://weaviate.io/developers/weaviate/search/hybrid#parameters)
  - [Hybrid Search](https://docs.weaviate.io/weaviate/concepts/search/hybrid-search)

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: List of documents returned by the search engine.

## haystack_integrations.document_stores.weaviate.auth

### SupportedAuthTypes

Bases: <code>Enum</code>

Supported auth credentials for WeaviateDocumentStore.

### AuthCredentials

Bases: <code>ABC</code>

Base class for all auth credentials supported by WeaviateDocumentStore.
Can be used to deserialize any of the supported auth credentials from a dictionary.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Converts the object to a dictionary representation for serialization.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> AuthCredentials
```

Converts a dictionary representation to an auth credentials object.

#### resolve_value

```python
resolve_value()
```

Resolves all the secrets in the auth credentials object and returns the corresponding Weaviate object.
All subclasses must implement this method.

### AuthApiKey

Bases: <code>AuthCredentials</code>

AuthCredentials for API key authentication.
By default it will load `api_key` from the environment variable `WEAVIATE_API_KEY`.

### AuthBearerToken

Bases: <code>AuthCredentials</code>

AuthCredentials for Bearer token authentication.
By default it will load `access_token` from the environment variable `WEAVIATE_ACCESS_TOKEN`,
and `refresh_token` from the environment variable
`WEAVIATE_REFRESH_TOKEN`.
The `WEAVIATE_REFRESH_TOKEN` environment variable is optional.

### AuthClientCredentials

Bases: <code>AuthCredentials</code>

AuthCredentials for client credentials authentication.
By default it will load `client_secret` from the environment variable `WEAVIATE_CLIENT_SECRET`, and
`scope` from the environment variable `WEAVIATE_SCOPE`.
The `WEAVIATE_SCOPE` environment variable is optional. If set, it can be either a single scope or a
space-separated list of scopes, for example "scope1" or "scope1 scope2".

### AuthClientPassword

Bases: <code>AuthCredentials</code>

AuthCredentials for username and password authentication.
By default it will load `username` from the environment variable `WEAVIATE_USERNAME`,
`password` from the environment variable `WEAVIATE_PASSWORD`, and
`scope` from the environment variable `WEAVIATE_SCOPE`.
The `WEAVIATE_SCOPE` environment variable is optional. If set, it can be either a single scope or a
space-separated list of scopes, for example "scope1" or "scope1 scope2".

## haystack_integrations.document_stores.weaviate.document_store

### WeaviateDocumentStore

A WeaviateDocumentStore instance you
can use with Weaviate Cloud Services or self-hosted instances.

Usage example with Weaviate Cloud Services:

```python
import os
from haystack_integrations.document_stores.weaviate.auth import AuthApiKey
from haystack_integrations.document_stores.weaviate.document_store import (
    WeaviateDocumentStore,
)

os.environ["WEAVIATE_API_KEY"] = "MY_API_KEY"

document_store = WeaviateDocumentStore(
    url="rAnD0mD1g1t5.something.weaviate.cloud",
    auth_client_secret=AuthApiKey(),
)
```

Usage example with self-hosted Weaviate:

```python
from haystack_integrations.document_stores.weaviate.document_store import (
    WeaviateDocumentStore,
)

document_store = WeaviateDocumentStore(url="http://localhost:8080")
```

#### __init__

```python
__init__(
    *,
    url: str | None = None,
    collection_settings: dict[str, Any] | None = None,
    auth_client_secret: AuthCredentials | None = None,
    additional_headers: dict | None = None,
    embedded_options: EmbeddedOptions | None = None,
    additional_config: AdditionalConfig | None = None,
    grpc_port: int = 50051,
    grpc_secure: bool = False
) -> None
```

Creates a new instance of WeaviateDocumentStore and connects to the Weaviate instance.

**Parameters:**

- **url** (<code>str | None</code>) – The URL of the Weaviate instance.
- **collection_settings** (<code>dict\[str, Any\] | None</code>) – The collection settings to use. If `None`, it will use a collection named `default` with the following
  properties:
  - \_original_id: text
  - content: text
  - blob_data: blob
  - blob_mime_type: text
  - score: number

  The Document `meta` fields are omitted in the default collection settings as we can't make assumptions
  on the structure of the meta field.
  We strongly recommend creating a custom collection with the correct meta properties
  for your use case.
  Another option is to rely on automatic schema generation, but that's not recommended for
  production use.
  See the official [Weaviate documentation](https://weaviate.io/developers/weaviate/manage-data/collections)
  for more information on collections and their properties.
- **auth_client_secret** (<code>AuthCredentials | None</code>) – Authentication credentials. Can be one of the following types depending on the authentication mode:
  - `AuthBearerToken` to use existing access and (optionally, but recommended) refresh tokens
  - `AuthClientPassword` to use username and password for the OIDC Resource Owner Password flow
  - `AuthClientCredentials` to use a client secret for the OIDC client credentials flow
  - `AuthApiKey` to use an API key
- **additional_headers** (<code>dict | None</code>) – Additional headers to include in the requests. Can be used to set OpenAI/HuggingFace keys.
  An OpenAI/HuggingFace key header looks like this:

```
{"X-OpenAI-Api-Key": "<THE-KEY>"}, {"X-HuggingFace-Api-Key": "<THE-KEY>"}
```

- **embedded_options** (<code>EmbeddedOptions | None</code>) – If set, create an embedded Weaviate cluster inside the client. For a full list of options see
  `weaviate.embedded.EmbeddedOptions`.
- **additional_config** (<code>AdditionalConfig | None</code>) – Additional and advanced configuration options for Weaviate.
- **grpc_port** (<code>int</code>) – The port to use for the gRPC connection.
- **grpc_secure** (<code>bool</code>) – Whether to use a secure channel for the underlying gRPC API.

#### close

```python
close() -> None
```

Close the synchronous Weaviate client connection.

#### close_async

```python
close_async() -> None
```

Close the asynchronous Weaviate client connection.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> WeaviateDocumentStore
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary to deserialize from.

**Returns:**

- <code>WeaviateDocumentStore</code> – The deserialized component.

#### count_documents

```python
count_documents() -> int
```

Returns the number of documents present in the DocumentStore.

#### count_documents_async

```python
count_documents_async() -> int
```

Asynchronously returns the number of documents present in the DocumentStore.

#### count_documents_by_filter

```python
count_documents_by_filter(filters: dict[str, Any]) -> int
```

Returns the number of documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to count documents.
  For filter syntax, see
  [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>int</code> – The number of documents that match the filters.

#### count_documents_by_filter_async

```python
count_documents_by_filter_async(filters: dict[str, Any]) -> int
```

Asynchronously returns the number of documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to count documents.
  For filter syntax, see
  [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>int</code> – The number of documents that match the filters.
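
As an illustration of the filter syntax these counting methods accept, the sketch below builds a filter dictionary in Haystack's metadata-filtering format and evaluates it against a few in-memory metadata records. The `matches` evaluator, the field names, and the sample data are hypothetical and cover only `AND`, `==`, and `>=`. With a running store, the same dictionary would be passed as the `filters` argument.

```python
from typing import Any

# A logical operator wrapping a list of field comparisons.
filters: dict[str, Any] = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.category", "operator": "==", "value": "news"},
        {"field": "meta.year", "operator": ">=", "value": 2020},
    ],
}


def matches(meta: dict[str, Any], flt: dict[str, Any]) -> bool:
    """Hypothetical local evaluator covering only AND, == and >=."""
    results = []
    for cond in flt["conditions"]:
        # Strip the optional "meta." prefix to look the field up in the record.
        value = meta.get(cond["field"].removeprefix("meta."))
        if cond["operator"] == "==":
            results.append(value == cond["value"])
        elif cond["operator"] == ">=":
            results.append(value is not None and value >= cond["value"])
        else:
            raise ValueError(f"Unsupported operator in this sketch: {cond['operator']}")
    return all(results)


sample_meta = [
    {"category": "news", "year": 2021},
    {"category": "news", "year": 2018},
    {"category": "blog", "year": 2022},
]
# Count the records that satisfy every condition.
local_count = sum(matches(meta, filters) for meta in sample_meta)
print(local_count)
```
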

#### get_metadata_fields_info

```python
get_metadata_fields_info() -> dict[str, dict[str, str]]
```

Returns metadata field names and their types, excluding special fields.

Special fields (content, blob_data, blob_mime_type, \_original_id, score) are excluded
as they are not user metadata fields.

**Returns:**

- <code>dict\[str, dict\[str, str\]\]</code> – A dictionary where keys are field names and values are dictionaries
  containing type information, e.g.:

```python
{
    'number': {'type': 'int'},
    'date': {'type': 'date'},
    'category': {'type': 'text'},
    'status': {'type': 'text'}
}
```

#### get_metadata_fields_info_async

```python
get_metadata_fields_info_async() -> dict[str, dict[str, str]]
```

Asynchronously returns metadata field names and their types, excluding special fields.

Special fields (content, blob_data, blob_mime_type, \_original_id, score) are excluded
as they are not user metadata fields.

**Returns:**

- <code>dict\[str, dict\[str, str\]\]</code> – A dictionary where keys are field names and values are dictionaries
  containing type information, e.g.:

```python
{
    'number': {'type': 'int'},
    'date': {'type': 'date'},
    'category': {'type': 'text'},
    'status': {'type': 'text'}
}
```

#### get_metadata_field_min_max

```python
get_metadata_field_min_max(metadata_field: str) -> dict[str, Any]
```

Returns the minimum and maximum values for a numeric or date metadata field.

**Parameters:**

- **metadata_field** (<code>str</code>) – The metadata field name to get min/max for.
  Can be prefixed with 'meta.' (e.g., 'meta.year' or 'year').

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with 'min' and 'max' keys containing the respective values.

**Raises:**

- <code>ValueError</code> – If the field is not found or doesn't support min/max operations.

#### get_metadata_field_min_max_async

```python
get_metadata_field_min_max_async(metadata_field: str) -> dict[str, Any]
```

Asynchronously returns the minimum and maximum values for a numeric or date metadata field.

**Parameters:**

- **metadata_field** (<code>str</code>) – The metadata field name to get min/max for.
  Can be prefixed with 'meta.' (e.g., 'meta.year' or 'year').

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with 'min' and 'max' keys containing the respective values.

**Raises:**

- <code>ValueError</code> – If the field is not found or doesn't support min/max operations.

#### count_unique_metadata_by_filter

```python
count_unique_metadata_by_filter(
    filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]
```

Returns the count of unique values for each specified metadata field.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply when counting unique values.
  For filter syntax, see
  [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).
- **metadata_fields** (<code>list\[str\]</code>) – List of metadata field names to count unique values for.
  Field names can be prefixed with 'meta.' (e.g., 'meta.category' or 'category').

**Returns:**

- <code>dict\[str, int\]</code> – A dictionary mapping field names to counts of unique values.

**Raises:**

- <code>ValueError</code> – If any of the requested fields don't exist in the collection schema.
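
To make the 'meta.' prefix handling and the shape of the result concrete, here is an in-memory stand-in that counts distinct values per field. It is a sketch under assumptions rather than the store's implementation: the helpers `normalize_field` and `count_unique` are hypothetical, filters are left out for brevity, and returning keys without the 'meta.' prefix is a choice made for this example only.

```python
from typing import Any


def normalize_field(name: str) -> str:
    """Hypothetical helper: treat 'meta.category' and 'category' as the same field."""
    return name[len("meta."):] if name.startswith("meta.") else name


def count_unique(metas: list[dict[str, Any]], metadata_fields: list[str]) -> dict[str, int]:
    """In-memory stand-in for counting distinct values per metadata field."""
    known = {key for meta in metas for key in meta}
    counts: dict[str, int] = {}
    for raw in metadata_fields:
        field = normalize_field(raw)
        if field not in known:
            # Mirrors the documented behaviour: unknown fields raise ValueError.
            raise ValueError(f"Field '{field}' not found")
        counts[field] = len({meta[field] for meta in metas if field in meta})
    return counts


metas = [
    {"category": "news", "status": "draft"},
    {"category": "blog", "status": "draft"},
    {"category": "news", "status": "draft"},
]
counts = count_unique(metas, ["meta.category", "status"])
print(counts)
```
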

#### count_unique_metadata_by_filter_async

```python
count_unique_metadata_by_filter_async(
    filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]
```

Asynchronously returns the count of unique values for each specified metadata field.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply when counting unique values.
  For filter syntax, see
  [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).
- **metadata_fields** (<code>list\[str\]</code>) – List of metadata field names to count unique values for.
  Field names can be prefixed with 'meta.' (e.g., 'meta.category' or 'category').

**Returns:**

- <code>dict\[str, int\]</code> – A dictionary mapping field names to counts of unique values.

**Raises:**

- <code>ValueError</code> – If any of the requested fields don't exist in the collection schema.

#### get_metadata_field_unique_values

```python
get_metadata_field_unique_values(
    metadata_field: str,
    search_term: str | None = None,
    from_: int = 0,
    size: int = 10000,
) -> tuple[list[str], int]
```

Returns unique values for a metadata field with pagination support.

**Parameters:**

- **metadata_field** (<code>str</code>) – The metadata field name to get unique values for.
  Can be prefixed with 'meta.' (e.g., 'meta.category' or 'category').
- **search_term** (<code>str | None</code>) – Optional term to filter documents by content before
  extracting unique values. If provided, only documents whose content
  contains this term will be considered.
  Note: Uses substring matching (case-sensitive, no stemming).
- **from\_** (<code>int</code>) – The starting offset for pagination (0-indexed). Defaults to 0.
- **size** (<code>int</code>) – The maximum number of unique values to return. Defaults to 10000.

**Returns:**

- <code>tuple\[list\[str\], int\]</code> – A tuple of (list of unique values, total count of unique values).

**Raises:**

- <code>ValueError</code> – If the field is not found in the collection schema.

#### get_metadata_field_unique_values_async

```python
get_metadata_field_unique_values_async(
    metadata_field: str,
    search_term: str | None = None,
    from_: int = 0,
    size: int = 10000,
) -> tuple[list[str], int]
```

Asynchronously returns unique values for a metadata field with pagination support.

**Parameters:**

- **metadata_field** (<code>str</code>) – The metadata field name to get unique values for.
  Can be prefixed with 'meta.' (e.g., 'meta.category' or 'category').
- **search_term** (<code>str | None</code>) – Optional term to filter documents by content before
  extracting unique values. If provided, only documents whose content
  contains this term will be considered.
  Note: Uses substring matching (case-sensitive, no stemming).
- **from\_** (<code>int</code>) – The starting offset for pagination (0-indexed). Defaults to 0.
- **size** (<code>int</code>) – The maximum number of unique values to return. Defaults to 10000.

**Returns:**

- <code>tuple\[list\[str\], int\]</code> – A tuple of (list of unique values, total count of unique values).

**Raises:**

- <code>ValueError</code> – If the field is not found in the collection schema.

#### filter_documents

```python
filter_documents(filters: dict[str, Any] | None = None) -> list[Document]
```

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the
DocumentStore.filter_documents() protocol documentation.

Note: The `contains` filter operator is case-sensitive (substring matching).
For case-insensitive matching, normalize the value before 936 building the filter. 937 938 **Parameters:** 939 940 - **filters** (<code>dict\[str, Any\] | None</code>) – The filters to apply to the document list. 941 942 **Returns:** 943 944 - <code>list\[Document\]</code> – A list of Documents that match the given filters. 945 946 #### filter_documents_async 947 948 ```python 949 filter_documents_async(filters: dict[str, Any] | None = None) -> list[Document] 950 ``` 951 952 Asynchronously returns the documents that match the filters provided. 953 954 For a detailed specification of the filters, refer to the 955 DocumentStore.filter_documents() protocol documentation. 956 957 Note: The `contains` filter operator is case-sensitive (substring 958 matching). For case-insensitive matching, normalize the value before 959 building the filter. 960 961 **Parameters:** 962 963 - **filters** (<code>dict\[str, Any\] | None</code>) – The filters to apply to the document list. 964 965 **Returns:** 966 967 - <code>list\[Document\]</code> – A list of Documents that match the given filters. 968 969 #### write_documents 970 971 ```python 972 write_documents( 973 documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE 974 ) -> int 975 ``` 976 977 Writes documents to Weaviate using the specified policy. 978 We recommend using a OVERWRITE policy as it's faster than other policies for Weaviate since it uses 979 the batch API. 980 We can't use the batch API for other policies as it doesn't return any information whether the document 981 already exists or not. That prevents us from returning errors when using the FAIL policy or skipping a 982 Document when using the SKIP policy. 983 984 **Parameters:** 985 986 - **documents** (<code>list\[Document\]</code>) – A list of documents to write into the document store. 987 - **policy** (<code>DuplicatePolicy</code>) – DuplicatePolicy to apply when a document with the same ID already exists in the document store. 
988 989 **Returns:** 990 991 - <code>int</code> – The number of documents written. 992 993 **Raises:** 994 995 - <code>ValueError</code> – When input is not valid. 996 - <code>DuplicateDocumentError</code> – When duplicate documents are found and using a FAIL policy. 997 - <code>DocumentStoreError</code> – When documents have failed to be batch written. 998 999 #### write_documents_async 1000 1001 ```python 1002 write_documents_async( 1003 documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE 1004 ) -> int 1005 ``` 1006 1007 Asynchronously writes documents to Weaviate using the specified policy. 1008 We recommend using a OVERWRITE policy as it's faster than other policies for Weaviate since it uses 1009 the batch API. 1010 We can't use the batch API for other policies as it doesn't return any information whether the document 1011 already exists or not. That prevents us from returning errors when using the FAIL policy or skipping a 1012 Document when using the SKIP policy. 1013 1014 **Parameters:** 1015 1016 - **documents** (<code>list\[Document\]</code>) – A list of documents to write into the document store. 1017 - **policy** (<code>DuplicatePolicy</code>) – DuplicatePolicy to apply when a document with the same ID already exists in the document store. 1018 1019 **Returns:** 1020 1021 - <code>int</code> – The number of documents written. 1022 1023 **Raises:** 1024 1025 - <code>ValueError</code> – When input is not valid. 1026 - <code>DuplicateDocumentError</code> – When duplicate documents are found and using a FAIL policy. 1027 - <code>DocumentStoreError</code> – When documents have failed to be batch written. 1028 1029 #### delete_documents 1030 1031 ```python 1032 delete_documents(document_ids: list[str]) -> None 1033 ``` 1034 1035 Deletes all documents with matching document_ids from the DocumentStore. 1036 1037 **Parameters:** 1038 1039 - **document_ids** (<code>list\[str\]</code>) – The object_ids to delete. 
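The `from_` and `size` parameters of `get_metadata_field_unique_values` documented earlier in this section lend themselves to a simple paging loop. The following is a minimal sketch, not part of the integration: `iter_unique_values` and `page_size` are hypothetical names, and the helper only assumes that `store` exposes the method with the documented signature and return value.

```python
from typing import Any, Iterator


def iter_unique_values(
    store: Any, metadata_field: str, page_size: int = 100
) -> Iterator[str]:
    """Yield every unique value of a metadata field, one page at a time.

    `store` is assumed to provide `get_metadata_field_unique_values` as
    documented above, returning a `(values, total)` tuple.
    """
    offset = 0
    while True:
        values, total = store.get_metadata_field_unique_values(
            metadata_field, from_=offset, size=page_size
        )
        yield from values
        offset += len(values)
        # Stop once every value was seen, or if a page comes back empty.
        if not values or offset >= total:
            break
```

With a configured store, `list(iter_unique_values(document_store, "meta.category"))` would collect all values; `meta.category` is only an illustrative field name.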
#### delete_documents_async

```python
delete_documents_async(document_ids: list[str]) -> None
```

Asynchronously deletes all documents with matching document_ids from the DocumentStore.

**Parameters:**

- **document_ids** (<code>list\[str\]</code>) – The object_ids to delete.

#### delete_all_documents

```python
delete_all_documents(
    *, recreate_index: bool = False, batch_size: int = 1000
) -> None
```

Deletes all documents in a collection.

If recreate_index is False, it keeps the collection but deletes documents iteratively.
If recreate_index is True, the collection is dropped and faithfully recreated.
Dropping and recreating is recommended for performance reasons.

**Parameters:**

- **recreate_index** (<code>bool</code>) – Use the drop-and-recreate strategy (recommended for performance).
- **batch_size** (<code>int</code>) – Only relevant if recreate_index is False. Defines the deletion batch size.
  Note that this parameter must be less than or equal to the `QUERY_MAXIMUM_RESULTS` value
  set for the Weaviate deployment (default is 10000).
  Reference: https://docs.weaviate.io/weaviate/manage-objects/delete#delete-all-objects

#### delete_all_documents_async

```python
delete_all_documents_async(
    *, recreate_index: bool = False, batch_size: int = 1000
) -> None
```

Asynchronously deletes all documents in a collection.

If recreate_index is False, it keeps the collection but deletes documents iteratively.
If recreate_index is True, the collection is dropped and faithfully recreated.
Dropping and recreating is recommended for performance reasons.

**Parameters:**

- **recreate_index** (<code>bool</code>) – Use the drop-and-recreate strategy (recommended for performance).
- **batch_size** (<code>int</code>) – Only relevant if recreate_index is False. Defines the deletion batch size.
  Note that this parameter must be less than or equal to the `QUERY_MAXIMUM_RESULTS` value
  set for the Weaviate deployment (default is 10000).
  Reference: https://docs.weaviate.io/weaviate/manage-objects/delete#delete-all-objects

#### delete_by_filter

```python
delete_by_filter(filters: dict[str, Any]) -> int
```

Deletes all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for deletion.
  For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>int</code> – The number of documents deleted.

#### delete_by_filter_async

```python
delete_by_filter_async(filters: dict[str, Any]) -> int
```

Asynchronously deletes all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for deletion.
  For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>int</code> – The number of documents deleted.

#### update_by_filter

```python
update_by_filter(filters: dict[str, Any], meta: dict[str, Any]) -> int
```

Updates the metadata of all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for updating.
  For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).
- **meta** (<code>dict\[str, Any\]</code>) – The metadata fields to update. These will be merged with existing metadata.

**Returns:**

- <code>int</code> – The number of documents updated.

#### update_by_filter_async

```python
update_by_filter_async(filters: dict[str, Any], meta: dict[str, Any]) -> int
```

Asynchronously updates the metadata of all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for updating.
  For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).
- **meta** (<code>dict\[str, Any\]</code>) – The metadata fields to update. These will be merged with existing metadata.

**Returns:**

- <code>int</code> – The number of documents updated.
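As an illustration of how a filter and a `meta` payload fit together in `update_by_filter`, here is a minimal sketch. The helper names (`year_range_filter`, `archive_by_year`) and the metadata fields `meta.year` and `archived` are hypothetical; the filter dictionary follows the general Haystack metadata filtering syntax, and `store` is only assumed to expose `update_by_filter` as documented above.

```python
from typing import Any


def year_range_filter(start: int, end: int) -> dict[str, Any]:
    """Build a Haystack filter matching start <= meta.year <= end.

    `meta.year` is a hypothetical metadata field used for illustration.
    """
    return {
        "operator": "AND",
        "conditions": [
            {"field": "meta.year", "operator": ">=", "value": start},
            {"field": "meta.year", "operator": "<=", "value": end},
        ],
    }


def archive_by_year(store: Any, start: int, end: int) -> int:
    """Flag matching documents as archived and return how many were updated."""
    # `meta` is merged with the existing metadata, so other fields are kept.
    return store.update_by_filter(
        filters=year_range_filter(start, end), meta={"archived": True}
    )
```

The same filter dictionary could be passed to `delete_by_filter` to remove the matching documents instead of updating them.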