---
title: "Pgvector"
id: integrations-pgvector
description: "Pgvector integration for Haystack"
slug: "/integrations-pgvector"
---

## haystack_integrations.components.retrievers.pgvector.embedding_retriever

### PgvectorEmbeddingRetriever

Retrieves documents from the `PgvectorDocumentStore` based on their dense embeddings.

Example usage:

```python
from haystack.document_stores.types import DuplicatePolicy
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

# Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.
# e.g., "postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"

document_store = PgvectorDocumentStore(
    embedding_dimension=768,
    vector_function="cosine_similarity",
    recreate_table=True,
)

documents = [Document(content="There are over 7,000 languages spoken around the world today."),
             Document(content="Elephants have been observed to behave in a way that indicates..."),
             Document(content="In certain places, you can witness the phenomenon of bioluminescent waves.")]

document_embedder = SentenceTransformersDocumentEmbedder()
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)

document_store.write_documents(documents_with_embeddings.get("documents"), policy=DuplicatePolicy.OVERWRITE)

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
query_pipeline.add_component("retriever", PgvectorEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"

res = query_pipeline.run({"text_embedder": {"text": query}})

assert res['retriever']['documents'][0].content == "There are over 7,000 languages spoken around the world today."
```

#### __init__

```python
__init__(
    *,
    document_store: PgvectorDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    vector_function: (
        Literal["cosine_similarity", "inner_product", "l2_distance"] | None
    ) = None,
    filter_policy: str | FilterPolicy = FilterPolicy.REPLACE
)
```

**Parameters:**

- **document_store** (<code>PgvectorDocumentStore</code>) – An instance of `PgvectorDocumentStore`.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents.
- **top_k** (<code>int</code>) – Maximum number of Documents to return.
- **vector_function** (<code>Literal['cosine_similarity', 'inner_product', 'l2_distance'] | None</code>) – The similarity function to use when searching for similar embeddings.
  Defaults to the one set in the `document_store` instance.
  `"cosine_similarity"` and `"inner_product"` are similarity functions; higher scores indicate greater similarity between the documents.
  `"l2_distance"` returns the straight-line distance between vectors; the most similar documents are the ones with the smallest score.
  **Important**: if the document store uses the `"hnsw"` search strategy, the vector function should match the one used during index creation to take advantage of the index.
- **filter_policy** (<code>str | FilterPolicy</code>) – Policy to determine how filters are applied.

**Raises:**

- <code>ValueError</code> – If `document_store` is not an instance of `PgvectorDocumentStore` or if `vector_function` is not one of the valid options.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.
**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> PgvectorEmbeddingRetriever
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>PgvectorEmbeddingRetriever</code> – Deserialized component.

#### run

```python
run(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
    vector_function: (
        Literal["cosine_similarity", "inner_product", "l2_distance"] | None
    ) = None,
) -> dict[str, list[Document]]
```

Retrieves documents from the `PgvectorDocumentStore` based on their embeddings.

**Parameters:**

- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. How runtime filters are applied depends on the `filter_policy` chosen at retriever initialization. See the `__init__` docstring for more details.
- **top_k** (<code>int | None</code>) – Maximum number of Documents to return.
- **vector_function** (<code>Literal['cosine_similarity', 'inner_product', 'l2_distance'] | None</code>) – The similarity function to use when searching for similar embeddings.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: List of `Document`s that are similar to `query_embedding`.
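The interaction between init-time and runtime filters is governed by `filter_policy`. The following pure-Python sketch is only an illustration of the two policies; `FilterPolicy` and `resolve_filters` below are minimal stand-ins, not the library's implementation, and it assumes `MERGE` combines both filter trees under a logical `AND` (the real helper's behavior may differ in detail):

```python
from enum import Enum


class FilterPolicy(Enum):
    # Minimal stand-in mirroring the two policies described in this document.
    REPLACE = "replace"
    MERGE = "merge"


def resolve_filters(policy, init_filters, runtime_filters):
    """Illustrative helper: pick the effective filters for a retrieval call."""
    if runtime_filters is None:
        return init_filters
    if policy is FilterPolicy.REPLACE or init_filters is None:
        # REPLACE: runtime filters fully replace the ones set at initialization.
        return runtime_filters
    # MERGE (assumed semantics): combine both filter trees under a logical AND.
    return {"operator": "AND", "conditions": [init_filters, runtime_filters]}


init_f = {"field": "meta.type", "operator": "==", "value": "article"}
run_f = {"field": "meta.year", "operator": ">=", "value": 2020}

print(resolve_filters(FilterPolicy.REPLACE, init_f, run_f))  # runtime filters only
print(resolve_filters(FilterPolicy.MERGE, init_f, run_f))    # AND of both
```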
#### run_async

```python
run_async(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
    vector_function: (
        Literal["cosine_similarity", "inner_product", "l2_distance"] | None
    ) = None,
) -> dict[str, list[Document]]
```

Asynchronously retrieves documents from the `PgvectorDocumentStore` based on their embeddings.

**Parameters:**

- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. How runtime filters are applied depends on the `filter_policy` chosen at retriever initialization. See the `__init__` docstring for more details.
- **top_k** (<code>int | None</code>) – Maximum number of Documents to return.
- **vector_function** (<code>Literal['cosine_similarity', 'inner_product', 'l2_distance'] | None</code>) – The similarity function to use when searching for similar embeddings.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: List of `Document`s that are similar to `query_embedding`.

## haystack_integrations.components.retrievers.pgvector.keyword_retriever

### PgvectorKeywordRetriever

Retrieves documents from the `PgvectorDocumentStore` based on keywords.

To rank the documents, the `ts_rank_cd` function of PostgreSQL is used. It considers how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document where they occur is. For more details, see the [Postgres documentation](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING).
Usage example:

```python
from haystack.document_stores.types import DuplicatePolicy
from haystack import Document

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorKeywordRetriever

# Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.
# e.g., "postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"

document_store = PgvectorDocumentStore(language="english", recreate_table=True)

documents = [Document(content="There are over 7,000 languages spoken around the world today."),
             Document(content="Elephants have been observed to behave in a way that indicates..."),
             Document(content="In certain places, you can witness the phenomenon of bioluminescent waves.")]

# Keyword retrieval does not require embeddings, so the documents can be written as-is.
document_store.write_documents(documents, policy=DuplicatePolicy.OVERWRITE)

retriever = PgvectorKeywordRetriever(document_store=document_store)

result = retriever.run(query="languages")

assert result['documents'][0].content == "There are over 7,000 languages spoken around the world today."
```

#### __init__

```python
__init__(
    *,
    document_store: PgvectorDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    filter_policy: str | FilterPolicy = FilterPolicy.REPLACE
)
```

**Parameters:**

- **document_store** (<code>PgvectorDocumentStore</code>) – An instance of `PgvectorDocumentStore`.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents.
- **top_k** (<code>int</code>) – Maximum number of Documents to return.
- **filter_policy** (<code>str | FilterPolicy</code>) – Policy to determine how filters are applied.
**Raises:**

- <code>ValueError</code> – If `document_store` is not an instance of `PgvectorDocumentStore`.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> PgvectorKeywordRetriever
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>PgvectorKeywordRetriever</code> – Deserialized component.

#### run

```python
run(
    query: str, filters: dict[str, Any] | None = None, top_k: int | None = None
) -> dict[str, list[Document]]
```

Retrieves documents from the `PgvectorDocumentStore` based on keywords.

**Parameters:**

- **query** (<code>str</code>) – String to search in `Document`s' content.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. How runtime filters are applied depends on the `filter_policy` chosen at retriever initialization. See the `__init__` docstring for more details.
- **top_k** (<code>int | None</code>) – Maximum number of Documents to return.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: List of `Document`s that match the query.

#### run_async

```python
run_async(
    query: str, filters: dict[str, Any] | None = None, top_k: int | None = None
) -> dict[str, list[Document]]
```

Asynchronously retrieves documents from the `PgvectorDocumentStore` based on keywords.

**Parameters:**

- **query** (<code>str</code>) – String to search in `Document`s' content.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. How runtime filters are applied depends on the `filter_policy` chosen at retriever initialization. See the `__init__` docstring for more details.
- **top_k** (<code>int | None</code>) – Maximum number of Documents to return.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: List of `Document`s that match the query.

## haystack_integrations.document_stores.pgvector.document_store

### PgvectorDocumentStore

A Document Store using PostgreSQL with the [pgvector extension](https://github.com/pgvector/pgvector) installed.

#### __init__

```python
__init__(
    *,
    connection_string: Secret = Secret.from_env_var("PG_CONN_STR"),
    create_extension: bool = True,
    schema_name: str = "public",
    table_name: str = "haystack_documents",
    language: str = "english",
    embedding_dimension: int = 768,
    vector_type: Literal["vector", "halfvec"] = "vector",
    vector_function: Literal[
        "cosine_similarity", "inner_product", "l2_distance"
    ] = "cosine_similarity",
    recreate_table: bool = False,
    search_strategy: Literal[
        "exact_nearest_neighbor", "hnsw"
    ] = "exact_nearest_neighbor",
    hnsw_recreate_index_if_exists: bool = False,
    hnsw_index_creation_kwargs: dict[str, int] | None = None,
    hnsw_index_name: str = "haystack_hnsw_index",
    hnsw_ef_search: int | None = None,
    keyword_index_name: str = "haystack_keyword_index"
)
```

Creates a new PgvectorDocumentStore instance.
It is meant to be connected to a PostgreSQL database with the pgvector extension installed.
A specific table to store Haystack documents will be created if it doesn't exist yet.
**Parameters:**

- **connection_string** (<code>Secret</code>) – The connection string to use to connect to the PostgreSQL database, defined as an environment variable. Supported formats:
  - URI, e.g. `PG_CONN_STR="postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"` (use percent-encoding for special characters)
  - keyword/value format, e.g. `PG_CONN_STR="host=HOST port=PORT dbname=DBNAME user=USER password=PASSWORD"`

  See the [PostgreSQL documentation](https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING) for more details.
- **create_extension** (<code>bool</code>) – Whether to create the pgvector extension if it doesn't exist. Set this to `True` (default) to automatically create the extension if it is missing. Creating the extension may require superuser privileges. If set to `False`, ensure the extension is already installed; otherwise, an error will be raised.
- **schema_name** (<code>str</code>) – The name of the schema the table is created in. The schema must already exist.
- **table_name** (<code>str</code>) – The name of the table used to store Haystack documents.
- **language** (<code>str</code>) – The language used to parse query and document content in keyword retrieval. To see the list of available languages, run the following SQL query in your PostgreSQL database: `SELECT cfgname FROM pg_ts_config;`. More information can be found in this [StackOverflow answer](https://stackoverflow.com/a/39752553).
- **embedding_dimension** (<code>int</code>) – The dimension of the embedding.
- **vector_type** (<code>Literal['vector', 'halfvec']</code>) – The type of vector used for embedding storage. `"vector"` is the default. `"halfvec"` stores embeddings in half precision, which is particularly useful for high-dimensional embeddings (dimension greater than 2,000 and up to 4,000). Requires pgvector version 0.7.0 or later. For more information, see the [pgvector documentation](https://github.com/pgvector/pgvector?tab=readme-ov-file).
- **vector_function** (<code>Literal['cosine_similarity', 'inner_product', 'l2_distance']</code>) – The similarity function to use when searching for similar embeddings. `"cosine_similarity"` and `"inner_product"` are similarity functions; higher scores indicate greater similarity between the documents. `"l2_distance"` returns the straight-line distance between vectors; the most similar documents are the ones with the smallest score. **Important**: when using the `"hnsw"` search strategy, an index will be created that depends on the `vector_function` passed here. Make sure subsequent queries keep using the same vector similarity function in order to take advantage of the index.
- **recreate_table** (<code>bool</code>) – Whether to recreate the table if it already exists.
- **search_strategy** (<code>Literal['exact_nearest_neighbor', 'hnsw']</code>) – The search strategy to use when searching for similar embeddings. `"exact_nearest_neighbor"` provides perfect recall but can be slow for large numbers of documents. `"hnsw"` is an approximate nearest neighbor search strategy, which trades off some accuracy for speed; it is recommended for large numbers of documents. **Important**: when using the `"hnsw"` search strategy, an index will be created that depends on the `vector_function` passed here. Make sure subsequent queries keep using the same vector similarity function in order to take advantage of the index.
- **hnsw_recreate_index_if_exists** (<code>bool</code>) – Whether to recreate the HNSW index if it already exists. Only used if `search_strategy` is set to `"hnsw"`.
- **hnsw_index_creation_kwargs** (<code>dict\[str, int\] | None</code>) – Additional keyword arguments to pass to the HNSW index creation. Only used if `search_strategy` is set to `"hnsw"`. You can find the list of valid arguments in the [pgvector documentation](https://github.com/pgvector/pgvector?tab=readme-ov-file#hnsw).
- **hnsw_index_name** (<code>str</code>) – Index name for the HNSW index.
- **hnsw_ef_search** (<code>int | None</code>) – The `ef_search` parameter to use at query time. Only used if `search_strategy` is set to `"hnsw"`. You can find more information about this parameter in the [pgvector documentation](https://github.com/pgvector/pgvector?tab=readme-ov-file#hnsw).
- **keyword_index_name** (<code>str</code>) – Index name for the keyword index.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> PgvectorDocumentStore
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>PgvectorDocumentStore</code> – Deserialized component.

#### delete_table

```python
delete_table()
```

Deletes the table used to store Haystack documents.
The name of the schema (`schema_name`) and the name of the table (`table_name`) are defined when initializing the `PgvectorDocumentStore`.

#### delete_table_async

```python
delete_table_async()
```

Asynchronously deletes the table used to store Haystack documents.

#### count_documents

```python
count_documents() -> int
```

Returns how many documents are present in the document store.

**Returns:**

- <code>int</code> – Number of documents in the document store.
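The scoring semantics of the three vector functions described in `__init__` above can be made concrete with a small pure-Python sketch. This is illustrative only; the actual computation happens inside pgvector:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Higher is more similar; 1.0 for vectors pointing in the same direction,
    # regardless of magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def inner_product(a: list[float], b: list[float]) -> float:
    # Higher is more similar; unlike cosine similarity, it is sensitive
    # to vector magnitude.
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: list[float], b: list[float]) -> float:
    # Lower is more similar; 0.0 for identical vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
doc = [2.0, 0.0]  # same direction as the query, larger magnitude

print(cosine_similarity(query, doc))  # 1.0
print(inner_product(query, doc))      # 2.0
print(l2_distance(query, doc))        # 1.0
```

This is also why the `vector_function` must match the one used to build an HNSW index: each function induces a different ordering over the same vectors, and the index is built for one specific ordering.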
#### count_documents_async

```python
count_documents_async() -> int
```

Asynchronously returns how many documents are present in the document store.

**Returns:**

- <code>int</code> – Number of documents in the document store.

#### filter_documents

```python
filter_documents(filters: dict[str, Any] | None = None) -> list[Document]
```

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Parameters:**

- **filters** (<code>dict\[str, Any\] | None</code>) – The filters to apply to the document list.

**Returns:**

- <code>list\[Document\]</code> – A list of Documents that match the given filters.

**Raises:**

- <code>TypeError</code> – If `filters` is not a dictionary.
- <code>ValueError</code> – If the `filters` syntax is invalid.

#### filter_documents_async

```python
filter_documents_async(filters: dict[str, Any] | None = None) -> list[Document]
```

Asynchronously returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Parameters:**

- **filters** (<code>dict\[str, Any\] | None</code>) – The filters to apply to the document list.

**Returns:**

- <code>list\[Document\]</code> – A list of Documents that match the given filters.

**Raises:**

- <code>TypeError</code> – If `filters` is not a dictionary.
- <code>ValueError</code> – If the `filters` syntax is invalid.

#### write_documents

```python
write_documents(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
```

Writes documents to the document store.
**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of Documents to write to the document store.
- **policy** (<code>DuplicatePolicy</code>) – The duplicate policy to use when writing documents.

**Returns:**

- <code>int</code> – The number of documents written to the document store.

**Raises:**

- <code>ValueError</code> – If `documents` contains objects that are not of type `Document`.
- <code>DuplicateDocumentError</code> – If a document with the same id already exists in the document store and the policy is set to `DuplicatePolicy.FAIL` (or not specified).
- <code>DocumentStoreError</code> – If the write operation fails for any other reason.

#### write_documents_async

```python
write_documents_async(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
```

Asynchronously writes documents to the document store.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of Documents to write to the document store.
- **policy** (<code>DuplicatePolicy</code>) – The duplicate policy to use when writing documents.

**Returns:**

- <code>int</code> – The number of documents written to the document store.

**Raises:**

- <code>ValueError</code> – If `documents` contains objects that are not of type `Document`.
- <code>DuplicateDocumentError</code> – If a document with the same id already exists in the document store and the policy is set to `DuplicatePolicy.FAIL` (or not specified).
- <code>DocumentStoreError</code> – If the write operation fails for any other reason.

#### delete_documents

```python
delete_documents(document_ids: list[str]) -> None
```

Deletes documents that match the provided `document_ids` from the document store.
**Parameters:**

- **document_ids** (<code>list\[str\]</code>) – The document IDs to delete.

#### delete_documents_async

```python
delete_documents_async(document_ids: list[str]) -> None
```

Asynchronously deletes documents that match the provided `document_ids` from the document store.

**Parameters:**

- **document_ids** (<code>list\[str\]</code>) – The document IDs to delete.

#### delete_all_documents

```python
delete_all_documents() -> None
```

Deletes all documents in the document store.

#### delete_all_documents_async

```python
delete_all_documents_async() -> None
```

Asynchronously deletes all documents in the document store.

#### delete_by_filter

```python
delete_by_filter(filters: dict[str, Any]) -> int
```

Deletes all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for deletion. For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>int</code> – The number of documents deleted.

#### delete_by_filter_async

```python
delete_by_filter_async(filters: dict[str, Any]) -> int
```

Asynchronously deletes all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for deletion. For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>int</code> – The number of documents deleted.
#### update_by_filter

```python
update_by_filter(filters: dict[str, Any], meta: dict[str, Any]) -> int
```

Updates the metadata of all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for updating. For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).
- **meta** (<code>dict\[str, Any\]</code>) – The metadata fields to update.

**Returns:**

- <code>int</code> – The number of documents updated.

#### update_by_filter_async

```python
update_by_filter_async(filters: dict[str, Any], meta: dict[str, Any]) -> int
```

Asynchronously updates the metadata of all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for updating. For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).
- **meta** (<code>dict\[str, Any\]</code>) – The metadata fields to update.

**Returns:**

- <code>int</code> – The number of documents updated.

#### count_documents_by_filter

```python
count_documents_by_filter(filters: dict[str, Any]) -> int
```

Returns the number of documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply when counting documents. For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>int</code> – The number of documents that match the filters.
#### count_documents_by_filter_async

```python
count_documents_by_filter_async(filters: dict[str, Any]) -> int
```

Asynchronously returns the number of documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply when counting documents. For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>int</code> – The number of documents that match the filters.

#### count_unique_metadata_by_filter

```python
count_unique_metadata_by_filter(
    filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]
```

Returns the count of unique values for each specified metadata field, considering only documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents. For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).
- **metadata_fields** (<code>list\[str\]</code>) – List of metadata field names to count unique values for. Field names can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, int\]</code> – A dictionary mapping field names to their unique value counts.

#### count_unique_metadata_by_filter_async

```python
count_unique_metadata_by_filter_async(
    filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]
```

Asynchronously returns the count of unique values for each specified metadata field, considering only documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents. For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).
- **metadata_fields** (<code>list\[str\]</code>) – List of metadata field names to count unique values for. Field names can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, int\]</code> – A dictionary mapping field names to their unique value counts.

#### get_metadata_fields_info

```python
get_metadata_fields_info() -> dict[str, dict[str, str]]
```

Returns information about the metadata fields in the document store.

Since metadata is stored in a JSONB field, this method analyzes actual data to infer field types.

Example return:

```python
{
    'content': {'type': 'text'},
    'category': {'type': 'text'},
    'status': {'type': 'text'},
    'priority': {'type': 'integer'},
}
```

**Returns:**

- <code>dict\[str, dict\[str, str\]\]</code> – A dictionary mapping field names to their type information.

#### get_metadata_fields_info_async

```python
get_metadata_fields_info_async() -> dict[str, dict[str, str]]
```

Asynchronously returns information about the metadata fields in the document store.

Since metadata is stored in a JSONB field, this method analyzes actual data to infer field types.

**Returns:**

- <code>dict\[str, dict\[str, str\]\]</code> – A dictionary mapping field names to their type information.

#### get_metadata_field_min_max

```python
get_metadata_field_min_max(metadata_field: str) -> dict[str, Any]
```

Returns the minimum and maximum values for a given metadata field.

**Parameters:**

- **metadata_field** (<code>str</code>) – The name of the metadata field. Can include or omit the "meta." prefix.
**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with 'min' and 'max' keys containing the minimum and maximum values. For numeric fields (integer, real), returns numeric min/max. For text fields, returns lexicographic min/max based on database collation.

**Raises:**

- <code>ValueError</code> – If the field doesn't exist or has no values.

#### get_metadata_field_min_max_async

```python
get_metadata_field_min_max_async(metadata_field: str) -> dict[str, Any]
```

Asynchronously returns the minimum and maximum values for a given metadata field.

**Parameters:**

- **metadata_field** (<code>str</code>) – The name of the metadata field. Can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with 'min' and 'max' keys containing the minimum and maximum values. For numeric fields (integer, real), returns numeric min/max. For text fields, returns lexicographic min/max based on database collation.

**Raises:**

- <code>ValueError</code> – If the field doesn't exist or has no values.

#### get_metadata_field_unique_values

```python
get_metadata_field_unique_values(
    metadata_field: str, search_term: str | None, from_: int, size: int
) -> tuple[list[str], int]
```

Returns unique values for a given metadata field, optionally filtered by a search term.

**Parameters:**

- **metadata_field** (<code>str</code>) – The name of the metadata field. Can include or omit the "meta." prefix.
- **search_term** (<code>str | None</code>) – Optional search term to filter documents by content before extracting unique values. If `None`, all documents are considered.
- **from\_** (<code>int</code>) – The offset for pagination (0-based).
- **size** (<code>int</code>) – The number of unique values to return.
**Returns:**

- <code>tuple\[list\[str\], int\]</code> – A tuple containing:
  - A list of unique values (as strings)
  - The total count of unique values

#### get_metadata_field_unique_values_async

```python
get_metadata_field_unique_values_async(
    metadata_field: str, search_term: str | None, from_: int, size: int
) -> tuple[list[str], int]
```

Asynchronously returns unique values for a given metadata field, optionally filtered by a search term.

**Parameters:**

- **metadata_field** (<code>str</code>) – The name of the metadata field. Can include or omit the "meta." prefix.
- **search_term** (<code>str | None</code>) – Optional search term to filter documents by content before extracting unique values. If `None`, all documents are considered.
- **from\_** (<code>int</code>) – The offset for pagination (0-based).
- **size** (<code>int</code>) – The number of unique values to return.

**Returns:**

- <code>tuple\[list\[str\], int\]</code> – A tuple containing:
  - A list of unique values (as strings)
  - The total count of unique values
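The pagination contract of `get_metadata_field_unique_values` (`from_` offset, `size` page length, plus the total unique count) can be sketched over an in-memory list of metadata values. This is illustrative only; the store computes the result in SQL, and the actual ordering of values may differ:

```python
def unique_values_page(values: list[str], from_: int, size: int) -> tuple[list[str], int]:
    """Illustrative pagination over unique metadata values."""
    # Deduplicate, sort for a stable order, then slice the requested page.
    unique = sorted(set(values))
    return unique[from_:from_ + size], len(unique)


values = ["news", "article", "blog", "article", "news", "report"]

page, total = unique_values_page(values, from_=0, size=2)
print(page, total)  # ['article', 'blog'] 4

page2, _ = unique_values_page(values, from_=2, size=2)
print(page2)        # ['news', 'report']
```

The total count is returned alongside each page so a caller can compute the number of pages without fetching all values.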