---
title: "Pgvector"
id: integrations-pgvector
description: "Pgvector integration for Haystack"
slug: "/integrations-pgvector"
---

## haystack_integrations.components.retrievers.pgvector.embedding_retriever

### PgvectorEmbeddingRetriever

Retrieves documents from the `PgvectorDocumentStore`, based on their dense embeddings.

Example usage:

```python
from haystack.document_stores import DuplicatePolicy
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

# Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.
# e.g., "postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"

document_store = PgvectorDocumentStore(
    embedding_dimension=768,
    vector_function="cosine_similarity",
    recreate_table=True,
)

documents = [Document(content="There are over 7,000 languages spoken around the world today."),
             Document(content="Elephants have been observed to behave in a way that indicates..."),
             Document(content="In certain places, you can witness the phenomenon of bioluminescent waves.")]

document_embedder = SentenceTransformersDocumentEmbedder()
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)

document_store.write_documents(documents_with_embeddings.get("documents"), policy=DuplicatePolicy.OVERWRITE)

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
query_pipeline.add_component("retriever", PgvectorEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"

res = query_pipeline.run({"text_embedder": {"text": query}})

assert res['retriever']['documents'][0].content == "There are over 7,000 languages spoken around the world today."
```

#### __init__

```python
__init__(
    *,
    document_store: PgvectorDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    vector_function: (
        Literal["cosine_similarity", "inner_product", "l2_distance"] | None
    ) = None,
    filter_policy: str | FilterPolicy = FilterPolicy.REPLACE
)
```

**Parameters:**

- **document_store** (<code>PgvectorDocumentStore</code>) – An instance of `PgvectorDocumentStore`.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents.
- **top_k** (<code>int</code>) – Maximum number of Documents to return.
- **vector_function** (<code>Literal['cosine_similarity', 'inner_product', 'l2_distance'] | None</code>) – The similarity function to use when searching for similar embeddings.
  Defaults to the one set in the `document_store` instance.
  `"cosine_similarity"` and `"inner_product"` are similarity functions and
  higher scores indicate greater similarity between the documents.
  `"l2_distance"` returns the straight-line distance between vectors,
  and the most similar documents are the ones with the smallest score.
  **Important**: if the document store is using the `"hnsw"` search strategy, the vector function
  should match the one utilized during index creation to take advantage of the index.
- **filter_policy** (<code>str | FilterPolicy</code>) – Policy to determine how filters are applied.

**Raises:**

- <code>ValueError</code> – If `document_store` is not an instance of `PgvectorDocumentStore` or if `vector_function`
  is not one of the valid options.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.
**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> PgvectorEmbeddingRetriever
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>PgvectorEmbeddingRetriever</code> – Deserialized component.

#### run

```python
run(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
    vector_function: (
        Literal["cosine_similarity", "inner_product", "l2_distance"] | None
    ) = None,
) -> dict[str, list[Document]]
```

Retrieve documents from the `PgvectorDocumentStore`, based on their embeddings.

**Parameters:**

- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. The way runtime filters are applied depends on
  the `filter_policy` chosen at retriever initialization. See init method docstring for more
  details.
- **top_k** (<code>int | None</code>) – Maximum number of Documents to return.
- **vector_function** (<code>Literal['cosine_similarity', 'inner_product', 'l2_distance'] | None</code>) – The similarity function to use when searching for similar embeddings.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: List of `Document`s that are similar to `query_embedding`.
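The three `vector_function` options differ in how scores are ordered: `cosine_similarity` and `inner_product` are similarities (higher is better), while `l2_distance` is a distance (lower is better). The following pure-Python sketch only illustrates the three measures; the actual computation happens inside PostgreSQL via pgvector:

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|): higher means more similar
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def inner_product(a, b):
    # raw dot product: higher means more similar
    return sum(x * y for x, y in zip(a, b))

def l2_distance(a, b):
    # straight-line distance: lower means more similar
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0]
close, far = [0.9, 0.1], [-1.0, 0.2]

assert cosine_similarity(query, close) > cosine_similarity(query, far)
assert inner_product(query, close) > inner_product(query, far)
assert l2_distance(query, close) < l2_distance(query, far)  # note the flipped comparison
```

This flipped ordering for `"l2_distance"` is why the most similar documents are the ones with the smallest score, as noted in the parameter description above.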
#### run_async

```python
run_async(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
    vector_function: (
        Literal["cosine_similarity", "inner_product", "l2_distance"] | None
    ) = None,
) -> dict[str, list[Document]]
```

Asynchronously retrieve documents from the `PgvectorDocumentStore`, based on their embeddings.

**Parameters:**

- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. The way runtime filters are applied depends on
  the `filter_policy` chosen at retriever initialization. See init method docstring for more
  details.
- **top_k** (<code>int | None</code>) – Maximum number of Documents to return.
- **vector_function** (<code>Literal['cosine_similarity', 'inner_product', 'l2_distance'] | None</code>) – The similarity function to use when searching for similar embeddings.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: List of `Document`s that are similar to `query_embedding`.

## haystack_integrations.components.retrievers.pgvector.keyword_retriever

### PgvectorKeywordRetriever

Retrieve documents from the `PgvectorDocumentStore`, based on keywords.

To rank the documents, the `ts_rank_cd` function of PostgreSQL is used.
It considers how often the query terms appear in the document, how close together the terms are in the document,
and how important the part of the document where they occur is.
For more details, see
[Postgres documentation](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING).
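As a rough intuition for the two main factors `ts_rank_cd` combines (term frequency and term proximity), here is a toy scorer in pure Python. This only illustrates the idea; it is not PostgreSQL's actual cover-density algorithm, which also weights document regions:

```python
import itertools

def toy_rank(content, query_terms):
    """Toy score: query-term frequency divided by the smallest window covering all terms."""
    tokens = content.lower().split()
    positions = {t: [i for i, tok in enumerate(tokens) if tok == t] for t in query_terms}
    if any(not pos for pos in positions.values()):
        return 0.0  # a document missing any query term does not match at all
    frequency = sum(len(pos) for pos in positions.values())
    # Proximity: the smallest span containing one occurrence of each query term.
    smallest_window = min(
        max(combo) - min(combo) + 1
        for combo in itertools.product(*positions.values())
    )
    return frequency / smallest_window

close = "the world has many languages"
spread = "the world is big and its people speak many languages"
# Terms appearing close together score higher than the same terms far apart.
assert toy_rank(close, ["world", "languages"]) > toy_rank(spread, ["world", "languages"])
assert toy_rank("cats and dogs", ["languages"]) == 0.0
```

In the real retriever, this ranking happens entirely inside PostgreSQL over the `tsvector` built with the configured `language`.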
Usage example:

```python
from haystack.document_stores import DuplicatePolicy
from haystack import Document

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorKeywordRetriever

# Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.
# e.g., "postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"

document_store = PgvectorDocumentStore(language="english", recreate_table=True)

documents = [Document(content="There are over 7,000 languages spoken around the world today."),
             Document(content="Elephants have been observed to behave in a way that indicates..."),
             Document(content="In certain places, you can witness the phenomenon of bioluminescent waves.")]

document_store.write_documents(documents, policy=DuplicatePolicy.OVERWRITE)

retriever = PgvectorKeywordRetriever(document_store=document_store)

result = retriever.run(query="languages")

assert result['documents'][0].content == "There are over 7,000 languages spoken around the world today."
```

#### __init__

```python
__init__(
    *,
    document_store: PgvectorDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    filter_policy: str | FilterPolicy = FilterPolicy.REPLACE
)
```

**Parameters:**

- **document_store** (<code>PgvectorDocumentStore</code>) – An instance of `PgvectorDocumentStore`.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents.
- **top_k** (<code>int</code>) – Maximum number of Documents to return.
- **filter_policy** (<code>str | FilterPolicy</code>) – Policy to determine how filters are applied.
**Raises:**

- <code>ValueError</code> – If `document_store` is not an instance of `PgvectorDocumentStore`.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> PgvectorKeywordRetriever
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>PgvectorKeywordRetriever</code> – Deserialized component.

#### run

```python
run(
    query: str, filters: dict[str, Any] | None = None, top_k: int | None = None
) -> dict[str, list[Document]]
```

Retrieve documents from the `PgvectorDocumentStore`, based on keywords.

**Parameters:**

- **query** (<code>str</code>) – String to search in `Document`s' content.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. The way runtime filters are applied depends on
  the `filter_policy` chosen at retriever initialization. See init method docstring for more
  details.
- **top_k** (<code>int | None</code>) – Maximum number of Documents to return.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: List of `Document`s that match the query.

#### run_async

```python
run_async(
    query: str, filters: dict[str, Any] | None = None, top_k: int | None = None
) -> dict[str, list[Document]]
```

Asynchronously retrieve documents from the `PgvectorDocumentStore`, based on keywords.

**Parameters:**

- **query** (<code>str</code>) – String to search in `Document`s' content.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. The way runtime filters are applied depends on
  the `filter_policy` chosen at retriever initialization. See init method docstring for more
  details.
- **top_k** (<code>int | None</code>) – Maximum number of Documents to return.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
    - `documents`: List of `Document`s that match the query.

## haystack_integrations.document_stores.pgvector.document_store

### PgvectorDocumentStore

A Document Store using PostgreSQL with the [pgvector extension](https://github.com/pgvector/pgvector) installed.

#### __init__

```python
__init__(
    *,
    connection_string: Secret = Secret.from_env_var("PG_CONN_STR"),
    create_extension: bool = True,
    schema_name: str = "public",
    table_name: str = "haystack_documents",
    language: str = "english",
    embedding_dimension: int = 768,
    vector_type: Literal["vector", "halfvec"] = "vector",
    vector_function: Literal[
        "cosine_similarity", "inner_product", "l2_distance"
    ] = "cosine_similarity",
    recreate_table: bool = False,
    search_strategy: Literal[
        "exact_nearest_neighbor", "hnsw"
    ] = "exact_nearest_neighbor",
    hnsw_recreate_index_if_exists: bool = False,
    hnsw_index_creation_kwargs: dict[str, int] | None = None,
    hnsw_index_name: str = "haystack_hnsw_index",
    hnsw_ef_search: int | None = None,
    keyword_index_name: str = "haystack_keyword_index"
)
```

Creates a new PgvectorDocumentStore instance.
It is meant to be connected to a PostgreSQL database with the pgvector extension installed.
A specific table to store Haystack documents will be created if it doesn't exist yet.
**Parameters:**

- **connection_string** (<code>Secret</code>) – The connection string to use to connect to the PostgreSQL database, defined as an
  environment variable. Supported formats:
    - URI, e.g. `PG_CONN_STR="postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"` (use percent-encoding for special
      characters)
    - keyword/value format, e.g. `PG_CONN_STR="host=HOST port=PORT dbname=DBNAME user=USER password=PASSWORD"`

  See [PostgreSQL Documentation](https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING)
  for more details.
- **create_extension** (<code>bool</code>) – Whether to create the pgvector extension if it doesn't exist.
  Set this to `True` (default) to automatically create the extension if it is missing.
  Creating the extension may require superuser privileges.
  If set to `False`, ensure the extension is already installed; otherwise, an error will be raised.
- **schema_name** (<code>str</code>) – The name of the schema the table is created in. The schema must already exist.
- **table_name** (<code>str</code>) – The name of the table to use to store Haystack documents.
- **language** (<code>str</code>) – The language to be used to parse query and document content in keyword retrieval.
  To see the list of available languages, you can run the following SQL query in your PostgreSQL database:
  `SELECT cfgname FROM pg_ts_config;`.
  More information can be found in this [StackOverflow answer](https://stackoverflow.com/a/39752553).
- **embedding_dimension** (<code>int</code>) – The dimension of the embedding.
- **vector_type** (<code>Literal['vector', 'halfvec']</code>) – The type of vector used for embedding storage.
  `"vector"` is the default.
  `"halfvec"` stores embeddings in half-precision, which is particularly useful for high-dimensional embeddings
  (dimension greater than 2,000 and up to 4,000). Requires pgvector version 0.7.0 or later.
For more 384 information, see the [pgvector documentation](https://github.com/pgvector/pgvector?tab=readme-ov-file). 385 - **vector_function** (<code>Literal['cosine_similarity', 'inner_product', 'l2_distance']</code>) – The similarity function to use when searching for similar embeddings. 386 `"cosine_similarity"` and `"inner_product"` are similarity functions and 387 higher scores indicate greater similarity between the documents. 388 `"l2_distance"` returns the straight-line distance between vectors, 389 and the most similar documents are the ones with the smallest score. 390 **Important**: when using the `"hnsw"` search strategy, an index will be created that depends on the 391 `vector_function` passed here. Make sure subsequent queries will keep using the same 392 vector similarity function in order to take advantage of the index. 393 - **recreate_table** (<code>bool</code>) – Whether to recreate the table if it already exists. 394 - **search_strategy** (<code>Literal['exact_nearest_neighbor', 'hnsw']</code>) – The search strategy to use when searching for similar embeddings. 395 `"exact_nearest_neighbor"` provides perfect recall but can be slow for large numbers of documents. 396 `"hnsw"` is an approximate nearest neighbor search strategy, 397 which trades off some accuracy for speed; it is recommended for large numbers of documents. 398 **Important**: when using the `"hnsw"` search strategy, an index will be created that depends on the 399 `vector_function` passed here. Make sure subsequent queries will keep using the same 400 vector similarity function in order to take advantage of the index. 401 - **hnsw_recreate_index_if_exists** (<code>bool</code>) – Whether to recreate the HNSW index if it already exists. 402 Only used if search_strategy is set to `"hnsw"`. 403 - **hnsw_index_creation_kwargs** (<code>dict\[str, int\] | None</code>) – Additional keyword arguments to pass to the HNSW index creation. 404 Only used if search_strategy is set to `"hnsw"`. 
  You can find the list of valid arguments in the
  [pgvector documentation](https://github.com/pgvector/pgvector?tab=readme-ov-file#hnsw).
- **hnsw_index_name** (<code>str</code>) – Index name for the HNSW index.
- **hnsw_ef_search** (<code>int | None</code>) – The `ef_search` parameter to use at query time. Only used if search_strategy is set to
  `"hnsw"`. You can find more information about this parameter in the
  [pgvector documentation](https://github.com/pgvector/pgvector?tab=readme-ov-file#hnsw).
- **keyword_index_name** (<code>str</code>) – Index name for the Keyword index.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> PgvectorDocumentStore
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>PgvectorDocumentStore</code> – Deserialized component.

#### delete_table

```python
delete_table()
```

Deletes the table used to store Haystack documents.
The name of the schema (`schema_name`) and the name of the table (`table_name`)
are defined when initializing the `PgvectorDocumentStore`.

#### delete_table_async

```python
delete_table_async()
```

Async method to delete the table used to store Haystack documents.

#### count_documents

```python
count_documents() -> int
```

Returns how many documents are present in the document store.

**Returns:**

- <code>int</code> – Number of documents in the document store.
#### count_documents_async

```python
count_documents_async() -> int
```

Asynchronously returns how many documents are present in the document store.

**Returns:**

- <code>int</code> – Number of documents in the document store.

#### filter_documents

```python
filter_documents(filters: dict[str, Any] | None = None) -> list[Document]
```

Returns the documents that match the filters provided.

For a detailed specification of the filters,
refer to the [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering)

**Parameters:**

- **filters** (<code>dict\[str, Any\] | None</code>) – The filters to apply to the document list.

**Returns:**

- <code>list\[Document\]</code> – A list of Documents that match the given filters.

**Raises:**

- <code>TypeError</code> – If `filters` is not a dictionary.
- <code>ValueError</code> – If `filters` syntax is invalid.

#### filter_documents_async

```python
filter_documents_async(filters: dict[str, Any] | None = None) -> list[Document]
```

Asynchronously returns the documents that match the filters provided.

For a detailed specification of the filters,
refer to the [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering)

**Parameters:**

- **filters** (<code>dict\[str, Any\] | None</code>) – The filters to apply to the document list.

**Returns:**

- <code>list\[Document\]</code> – A list of Documents that match the given filters.

**Raises:**

- <code>TypeError</code> – If `filters` is not a dictionary.
- <code>ValueError</code> – If `filters` syntax is invalid.

#### write_documents

```python
write_documents(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
```

Writes documents to the document store.
**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of Documents to write to the document store.
- **policy** (<code>DuplicatePolicy</code>) – The duplicate policy to use when writing documents.

**Returns:**

- <code>int</code> – The number of documents written to the document store.

**Raises:**

- <code>ValueError</code> – If `documents` contains objects that are not of type `Document`.
- <code>DuplicateDocumentError</code> – If a document with the same id already exists in the document store
  and the policy is set to `DuplicatePolicy.FAIL` (or not specified).
- <code>DocumentStoreError</code> – If the write operation fails for any other reason.

#### write_documents_async

```python
write_documents_async(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
```

Asynchronously writes documents to the document store.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of Documents to write to the document store.
- **policy** (<code>DuplicatePolicy</code>) – The duplicate policy to use when writing documents.

**Returns:**

- <code>int</code> – The number of documents written to the document store.

**Raises:**

- <code>ValueError</code> – If `documents` contains objects that are not of type `Document`.
- <code>DuplicateDocumentError</code> – If a document with the same id already exists in the document store
  and the policy is set to `DuplicatePolicy.FAIL` (or not specified).
- <code>DocumentStoreError</code> – If the write operation fails for any other reason.

#### delete_documents

```python
delete_documents(document_ids: list[str]) -> None
```

Deletes documents that match the provided `document_ids` from the document store.
**Parameters:**

- **document_ids** (<code>list\[str\]</code>) – The document IDs to delete.

#### delete_documents_async

```python
delete_documents_async(document_ids: list[str]) -> None
```

Asynchronously deletes documents that match the provided `document_ids` from the document store.

**Parameters:**

- **document_ids** (<code>list\[str\]</code>) – The document IDs to delete.

#### delete_all_documents

```python
delete_all_documents() -> None
```

Deletes all documents in the document store.

#### delete_all_documents_async

```python
delete_all_documents_async() -> None
```

Asynchronously deletes all documents in the document store.

#### delete_by_filter

```python
delete_by_filter(filters: dict[str, Any]) -> int
```

Deletes all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for deletion.
  For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering)

**Returns:**

- <code>int</code> – The number of documents deleted.

#### delete_by_filter_async

```python
delete_by_filter_async(filters: dict[str, Any]) -> int
```

Asynchronously deletes all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for deletion.
  For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering)

**Returns:**

- <code>int</code> – The number of documents deleted.
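The `filters` dictionary accepted by the filter-based methods follows Haystack's metadata filtering syntax (linked above). A small sketch, using hypothetical metadata fields `category` and `year` for illustration:

```python
# Select documents whose category is "news" AND whose year is 2024 or later.
# "meta.category" / "meta.year" are hypothetical fields, chosen only for this example.
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.category", "operator": "==", "value": "news"},
        {"field": "meta.year", "operator": ">=", "value": 2024},
    ],
}

# With a connected store, this filter could then be passed to any filter-based method, e.g.:
# deleted_count = document_store.delete_by_filter(filters)
```

The same dictionary shape works for `filter_documents`, `update_by_filter`, and the counting methods below.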
#### update_by_filter

```python
update_by_filter(filters: dict[str, Any], meta: dict[str, Any]) -> int
```

Updates the metadata of all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for updating.
  For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering)
- **meta** (<code>dict\[str, Any\]</code>) – The metadata fields to update.

**Returns:**

- <code>int</code> – The number of documents updated.

#### update_by_filter_async

```python
update_by_filter_async(filters: dict[str, Any], meta: dict[str, Any]) -> int
```

Asynchronously updates the metadata of all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for updating.
  For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering)
- **meta** (<code>dict\[str, Any\]</code>) – The metadata fields to update.

**Returns:**

- <code>int</code> – The number of documents updated.

#### count_documents_by_filter

```python
count_documents_by_filter(filters: dict[str, Any]) -> int
```

Returns the number of documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to count documents.
  For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering)

**Returns:**

- <code>int</code> – The number of documents that match the filters.
#### count_documents_by_filter_async

```python
count_documents_by_filter_async(filters: dict[str, Any]) -> int
```

Asynchronously returns the number of documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to count documents.
  For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering)

**Returns:**

- <code>int</code> – The number of documents that match the filters.

#### count_unique_metadata_by_filter

```python
count_unique_metadata_by_filter(
    filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]
```

Returns the count of unique values for each specified metadata field,
considering only documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents.
  For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering)
- **metadata_fields** (<code>list\[str\]</code>) – List of metadata field names to count unique values for.
  Field names can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, int\]</code> – A dictionary mapping field names to their unique value counts.

#### count_unique_metadata_by_filter_async

```python
count_unique_metadata_by_filter_async(
    filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]
```

Asynchronously returns the count of unique values for each specified metadata field,
considering only documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents.
  For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering)
- **metadata_fields** (<code>list\[str\]</code>) – List of metadata field names to count unique values for.
  Field names can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, int\]</code> – A dictionary mapping field names to their unique value counts.

#### get_metadata_fields_info

```python
get_metadata_fields_info() -> dict[str, dict[str, str]]
```

Returns information about the metadata fields in the document store.

Since metadata is stored in a JSONB field, this method analyzes actual data
to infer field types.

Example return:

```python
{
    'content': {'type': 'text'},
    'category': {'type': 'text'},
    'status': {'type': 'text'},
    'priority': {'type': 'integer'},
}
```

**Returns:**

- <code>dict\[str, dict\[str, str\]\]</code> – A dictionary mapping field names to their type information.

#### get_metadata_fields_info_async

```python
get_metadata_fields_info_async() -> dict[str, dict[str, str]]
```

Asynchronously returns information about the metadata fields in the document store.

Since metadata is stored in a JSONB field, this method analyzes actual data
to infer field types.

**Returns:**

- <code>dict\[str, dict\[str, str\]\]</code> – A dictionary mapping field names to their type information.

#### get_metadata_field_min_max

```python
get_metadata_field_min_max(metadata_field: str) -> dict[str, Any]
```

Returns the minimum and maximum values for a given metadata field.

**Parameters:**

- **metadata_field** (<code>str</code>) – The name of the metadata field. Can include or omit the "meta." prefix.
**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with 'min' and 'max' keys containing the minimum and maximum values.
  For numeric fields (integer, real), returns numeric min/max.
  For text fields, returns lexicographic min/max based on database collation.

**Raises:**

- <code>ValueError</code> – If the field doesn't exist or has no values.

#### get_metadata_field_min_max_async

```python
get_metadata_field_min_max_async(metadata_field: str) -> dict[str, Any]
```

Asynchronously returns the minimum and maximum values for a given metadata field.

**Parameters:**

- **metadata_field** (<code>str</code>) – The name of the metadata field. Can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with 'min' and 'max' keys containing the minimum and maximum values.
  For numeric fields (integer, real), returns numeric min/max.
  For text fields, returns lexicographic min/max based on database collation.

**Raises:**

- <code>ValueError</code> – If the field doesn't exist or has no values.

#### get_metadata_field_unique_values

```python
get_metadata_field_unique_values(
    metadata_field: str, search_term: str | None, from_: int, size: int
) -> tuple[list[str], int]
```

Returns unique values for a given metadata field, optionally filtered by a search term.

**Parameters:**

- **metadata_field** (<code>str</code>) – The name of the metadata field. Can include or omit the "meta." prefix.
- **search_term** (<code>str | None</code>) – Optional search term to filter documents by content before extracting unique values.
  If None, all documents are considered.
- **from\_** (<code>int</code>) – The offset for pagination (0-based).
- **size** (<code>int</code>) – The number of unique values to return.
**Returns:**

- <code>tuple\[list\[str\], int\]</code> – A tuple containing:
    - A list of unique values (as strings)
    - The total count of unique values

#### get_metadata_field_unique_values_async

```python
get_metadata_field_unique_values_async(
    metadata_field: str, search_term: str | None, from_: int, size: int
) -> tuple[list[str], int]
```

Asynchronously returns unique values for a given metadata field, optionally filtered by a search term.

**Parameters:**

- **metadata_field** (<code>str</code>) – The name of the metadata field. Can include or omit the "meta." prefix.
- **search_term** (<code>str | None</code>) – Optional search term to filter documents by content before extracting unique values.
  If None, all documents are considered.
- **from\_** (<code>int</code>) – The offset for pagination (0-based).
- **size** (<code>int</code>) – The number of unique values to return.

**Returns:**

- <code>tuple\[list\[str\], int\]</code> – A tuple containing:
    - A list of unique values (as strings)
    - The total count of unique values
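The `from_` and `size` parameters implement offset pagination over the unique values, with the second tuple element giving the total so callers know when to stop. A toy sketch of the paging loop, using a stand-in function in place of a live document store:

```python
def fetch_page(values, from_, size):
    # Stand-in for get_metadata_field_unique_values: returns one page of
    # sorted unique values plus the total count of unique values.
    unique_sorted = sorted(set(values))
    return unique_sorted[from_: from_ + size], len(unique_sorted)

values = ["news", "blog", "news", "paper", "wiki", "blog"]

collected, offset = [], 0
while True:
    page, total = fetch_page(values, from_=offset, size=2)
    collected.extend(page)
    offset += len(page)
    if offset >= total or not page:
        break

assert collected == ["blog", "news", "paper", "wiki"]
```

Against a real store, the same loop shape applies with `document_store.get_metadata_field_unique_values(...)` in place of the stand-in.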