---
title: "Pgvector"
id: integrations-pgvector
description: "Pgvector integration for Haystack"
slug: "/integrations-pgvector"
---

## haystack_integrations.components.retrievers.pgvector.embedding_retriever

### PgvectorEmbeddingRetriever

Retrieves documents from the `PgvectorDocumentStore` based on their dense embeddings.

Example usage:

```python
from haystack.document_stores.types import DuplicatePolicy
from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

# Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.
# e.g., "postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"

document_store = PgvectorDocumentStore(
    embedding_dimension=768,
    vector_function="cosine_similarity",
    recreate_table=True,
)

documents = [Document(content="There are over 7,000 languages spoken around the world today."),
             Document(content="Elephants have been observed to behave in a way that indicates..."),
             Document(content="In certain places, you can witness the phenomenon of bioluminescent waves.")]

document_embedder = SentenceTransformersDocumentEmbedder()
document_embedder.warm_up()
documents_with_embeddings = document_embedder.run(documents)

document_store.write_documents(documents_with_embeddings.get("documents"), policy=DuplicatePolicy.OVERWRITE)

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
query_pipeline.add_component("retriever", PgvectorEmbeddingRetriever(document_store=document_store))
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

query = "How many languages are there?"

res = query_pipeline.run({"text_embedder": {"text": query}})

assert res['retriever']['documents'][0].content == "There are over 7,000 languages spoken around the world today."
```

#### __init__

```python
__init__(
    *,
    document_store: PgvectorDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    vector_function: (
        Literal["cosine_similarity", "inner_product", "l2_distance"] | None
    ) = None,
    filter_policy: str | FilterPolicy = FilterPolicy.REPLACE
)
```

**Parameters:**

- **document_store** (<code>PgvectorDocumentStore</code>) – An instance of `PgvectorDocumentStore`.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents.
- **top_k** (<code>int</code>) – Maximum number of Documents to return.
- **vector_function** (<code>Literal['cosine_similarity', 'inner_product', 'l2_distance'] | None</code>) – The similarity function to use when searching for similar embeddings.
  Defaults to the one set in the `document_store` instance.
  `"cosine_similarity"` and `"inner_product"` are similarity functions; higher scores indicate greater similarity between the documents.
  `"l2_distance"` returns the straight-line distance between vectors; the most similar documents are the ones with the smallest score.
  **Important**: if the document store uses the `"hnsw"` search strategy, the vector function should match the one used during index creation to take advantage of the index.
- **filter_policy** (<code>str | FilterPolicy</code>) – Policy to determine how filters are applied.

**Raises:**

- <code>ValueError</code> – If `document_store` is not an instance of `PgvectorDocumentStore` or if `vector_function` is not one of the valid options.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.
**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> PgvectorEmbeddingRetriever
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>PgvectorEmbeddingRetriever</code> – Deserialized component.

#### run

```python
run(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
    vector_function: (
        Literal["cosine_similarity", "inner_product", "l2_distance"] | None
    ) = None,
) -> dict[str, list[Document]]
```

Retrieves documents from the `PgvectorDocumentStore` based on their embeddings.

**Parameters:**

- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. How runtime filters are applied depends on the `filter_policy` chosen at retriever initialization. See the `__init__` docstring for more details.
- **top_k** (<code>int | None</code>) – Maximum number of Documents to return.
- **vector_function** (<code>Literal['cosine_similarity', 'inner_product', 'l2_distance'] | None</code>) – The similarity function to use when searching for similar embeddings.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: List of `Document`s that are similar to `query_embedding`.
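The interaction between init-time and runtime filters is governed by `filter_policy`. The following pure-Python sketch is only an illustration of the two policies; `FilterPolicy` and `resolve_filters` below are minimal stand-ins, not the library's implementation, and it assumes `MERGE` combines both filter trees under a logical `AND` (the real helper's behavior may differ in detail):

```python
from enum import Enum


class FilterPolicy(Enum):
    # Minimal stand-in mirroring the two policies described in this document.
    REPLACE = "replace"
    MERGE = "merge"


def resolve_filters(policy, init_filters, runtime_filters):
    """Illustrative helper: pick the effective filters for a retrieval call."""
    if runtime_filters is None:
        return init_filters
    if policy is FilterPolicy.REPLACE or init_filters is None:
        # REPLACE: runtime filters fully replace the ones set at initialization.
        return runtime_filters
    # MERGE (assumed semantics): combine both filter trees under a logical AND.
    return {"operator": "AND", "conditions": [init_filters, runtime_filters]}


init_f = {"field": "meta.type", "operator": "==", "value": "article"}
run_f = {"field": "meta.year", "operator": ">=", "value": 2020}

print(resolve_filters(FilterPolicy.REPLACE, init_f, run_f))  # runtime filters only
print(resolve_filters(FilterPolicy.MERGE, init_f, run_f))    # AND of both
```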
#### run_async

```python
run_async(
    query_embedding: list[float],
    filters: dict[str, Any] | None = None,
    top_k: int | None = None,
    vector_function: (
        Literal["cosine_similarity", "inner_product", "l2_distance"] | None
    ) = None,
) -> dict[str, list[Document]]
```

Asynchronously retrieves documents from the `PgvectorDocumentStore` based on their embeddings.

**Parameters:**

- **query_embedding** (<code>list\[float\]</code>) – Embedding of the query.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. How runtime filters are applied depends on the `filter_policy` chosen at retriever initialization. See the `__init__` docstring for more details.
- **top_k** (<code>int | None</code>) – Maximum number of Documents to return.
- **vector_function** (<code>Literal['cosine_similarity', 'inner_product', 'l2_distance'] | None</code>) – The similarity function to use when searching for similar embeddings.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: List of `Document`s that are similar to `query_embedding`.

## haystack_integrations.components.retrievers.pgvector.keyword_retriever

### PgvectorKeywordRetriever

Retrieves documents from the `PgvectorDocumentStore` based on keywords.

To rank the documents, the `ts_rank_cd` function of PostgreSQL is used. It considers how often the query terms appear in the document, how close together the terms are in the document, and how important the part of the document where they occur is. For more details, see the [Postgres documentation](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING).
Usage example:

```python
from haystack.document_stores.types import DuplicatePolicy
from haystack import Document

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorKeywordRetriever

# Set an environment variable `PG_CONN_STR` with the connection string to your PostgreSQL database.
# e.g., "postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"

document_store = PgvectorDocumentStore(language="english", recreate_table=True)

documents = [Document(content="There are over 7,000 languages spoken around the world today."),
             Document(content="Elephants have been observed to behave in a way that indicates..."),
             Document(content="In certain places, you can witness the phenomenon of bioluminescent waves.")]

# Keyword retrieval does not require embeddings, so the documents can be written as-is.
document_store.write_documents(documents, policy=DuplicatePolicy.OVERWRITE)

retriever = PgvectorKeywordRetriever(document_store=document_store)

result = retriever.run(query="languages")

assert result['documents'][0].content == "There are over 7,000 languages spoken around the world today."
```

#### __init__

```python
__init__(
    *,
    document_store: PgvectorDocumentStore,
    filters: dict[str, Any] | None = None,
    top_k: int = 10,
    filter_policy: str | FilterPolicy = FilterPolicy.REPLACE
)
```

**Parameters:**

- **document_store** (<code>PgvectorDocumentStore</code>) – An instance of `PgvectorDocumentStore`.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents.
- **top_k** (<code>int</code>) – Maximum number of Documents to return.
- **filter_policy** (<code>str | FilterPolicy</code>) – Policy to determine how filters are applied.
**Raises:**

- <code>ValueError</code> – If `document_store` is not an instance of `PgvectorDocumentStore`.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> PgvectorKeywordRetriever
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>PgvectorKeywordRetriever</code> – Deserialized component.

#### run

```python
run(
    query: str, filters: dict[str, Any] | None = None, top_k: int | None = None
) -> dict[str, list[Document]]
```

Retrieves documents from the `PgvectorDocumentStore` based on keywords.

**Parameters:**

- **query** (<code>str</code>) – String to search in `Document`s' content.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. How runtime filters are applied depends on the `filter_policy` chosen at retriever initialization. See the `__init__` docstring for more details.
- **top_k** (<code>int | None</code>) – Maximum number of Documents to return.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: List of `Document`s that match the query.

#### run_async

```python
run_async(
    query: str, filters: dict[str, Any] | None = None, top_k: int | None = None
) -> dict[str, list[Document]]
```

Asynchronously retrieves documents from the `PgvectorDocumentStore` based on keywords.

**Parameters:**

- **query** (<code>str</code>) – String to search in `Document`s' content.
- **filters** (<code>dict\[str, Any\] | None</code>) – Filters applied to the retrieved Documents. How runtime filters are applied depends on the `filter_policy` chosen at retriever initialization. See the `__init__` docstring for more details.
- **top_k** (<code>int | None</code>) – Maximum number of Documents to return.

**Returns:**

- <code>dict\[str, list\[Document\]\]</code> – A dictionary with the following keys:
  - `documents`: List of `Document`s that match the query.

## haystack_integrations.document_stores.pgvector.document_store

### PgvectorDocumentStore

A Document Store using PostgreSQL with the [pgvector extension](https://github.com/pgvector/pgvector) installed.

#### __init__

```python
__init__(
    *,
    connection_string: Secret = Secret.from_env_var("PG_CONN_STR"),
    create_extension: bool = True,
    schema_name: str = "public",
    table_name: str = "haystack_documents",
    language: str = "english",
    embedding_dimension: int = 768,
    vector_type: Literal["vector", "halfvec"] = "vector",
    vector_function: Literal[
        "cosine_similarity", "inner_product", "l2_distance"
    ] = "cosine_similarity",
    recreate_table: bool = False,
    search_strategy: Literal[
        "exact_nearest_neighbor", "hnsw"
    ] = "exact_nearest_neighbor",
    hnsw_recreate_index_if_exists: bool = False,
    hnsw_index_creation_kwargs: dict[str, int] | None = None,
    hnsw_index_name: str = "haystack_hnsw_index",
    hnsw_ef_search: int | None = None,
    keyword_index_name: str = "haystack_keyword_index"
)
```

Creates a new PgvectorDocumentStore instance.
It is meant to be connected to a PostgreSQL database with the pgvector extension installed.
A specific table to store Haystack documents will be created if it doesn't exist yet.
**Parameters:**

- **connection_string** (<code>Secret</code>) – The connection string to use to connect to the PostgreSQL database, defined as an environment variable. Supported formats:
  - URI, e.g. `PG_CONN_STR="postgresql://USER:PASSWORD@HOST:PORT/DB_NAME"` (use percent-encoding for special characters)
  - keyword/value format, e.g. `PG_CONN_STR="host=HOST port=PORT dbname=DBNAME user=USER password=PASSWORD"`

  See the [PostgreSQL documentation](https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-CONNSTRING) for more details.
- **create_extension** (<code>bool</code>) – Whether to create the pgvector extension if it doesn't exist. Set this to `True` (default) to automatically create the extension if it is missing. Creating the extension may require superuser privileges. If set to `False`, ensure the extension is already installed; otherwise, an error will be raised.
- **schema_name** (<code>str</code>) – The name of the schema the table is created in. The schema must already exist.
- **table_name** (<code>str</code>) – The name of the table used to store Haystack documents.
- **language** (<code>str</code>) – The language used to parse query and document content in keyword retrieval. To see the list of available languages, run the following SQL query in your PostgreSQL database: `SELECT cfgname FROM pg_ts_config;`. More information can be found in this [StackOverflow answer](https://stackoverflow.com/a/39752553).
- **embedding_dimension** (<code>int</code>) – The dimension of the embedding.
- **vector_type** (<code>Literal['vector', 'halfvec']</code>) – The type of vector used for embedding storage. `"vector"` is the default. `"halfvec"` stores embeddings in half precision, which is particularly useful for high-dimensional embeddings (dimension greater than 2,000 and up to 4,000). Requires pgvector version 0.7.0 or later. For more information, see the [pgvector documentation](https://github.com/pgvector/pgvector?tab=readme-ov-file).
- **vector_function** (<code>Literal['cosine_similarity', 'inner_product', 'l2_distance']</code>) – The similarity function to use when searching for similar embeddings. `"cosine_similarity"` and `"inner_product"` are similarity functions; higher scores indicate greater similarity between the documents. `"l2_distance"` returns the straight-line distance between vectors; the most similar documents are the ones with the smallest score. **Important**: when using the `"hnsw"` search strategy, an index will be created that depends on the `vector_function` passed here. Make sure subsequent queries keep using the same vector similarity function in order to take advantage of the index.
- **recreate_table** (<code>bool</code>) – Whether to recreate the table if it already exists.
- **search_strategy** (<code>Literal['exact_nearest_neighbor', 'hnsw']</code>) – The search strategy to use when searching for similar embeddings. `"exact_nearest_neighbor"` provides perfect recall but can be slow for large numbers of documents. `"hnsw"` is an approximate nearest neighbor search strategy, which trades off some accuracy for speed; it is recommended for large numbers of documents. **Important**: when using the `"hnsw"` search strategy, an index will be created that depends on the `vector_function` passed here. Make sure subsequent queries keep using the same vector similarity function in order to take advantage of the index.
- **hnsw_recreate_index_if_exists** (<code>bool</code>) – Whether to recreate the HNSW index if it already exists. Only used if `search_strategy` is set to `"hnsw"`.
- **hnsw_index_creation_kwargs** (<code>dict\[str, int\] | None</code>) – Additional keyword arguments to pass to the HNSW index creation. Only used if `search_strategy` is set to `"hnsw"`. You can find the list of valid arguments in the [pgvector documentation](https://github.com/pgvector/pgvector?tab=readme-ov-file#hnsw).
- **hnsw_index_name** (<code>str</code>) – Index name for the HNSW index.
- **hnsw_ef_search** (<code>int | None</code>) – The `ef_search` parameter to use at query time. Only used if `search_strategy` is set to `"hnsw"`. You can find more information about this parameter in the [pgvector documentation](https://github.com/pgvector/pgvector?tab=readme-ov-file#hnsw).
- **keyword_index_name** (<code>str</code>) – Index name for the keyword index.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> PgvectorDocumentStore
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>PgvectorDocumentStore</code> – Deserialized component.

#### delete_table

```python
delete_table()
```

Deletes the table used to store Haystack documents.
The name of the schema (`schema_name`) and the name of the table (`table_name`) are defined when initializing the `PgvectorDocumentStore`.

#### delete_table_async

```python
delete_table_async()
```

Asynchronously deletes the table used to store Haystack documents.

#### count_documents

```python
count_documents() -> int
```

Returns how many documents are present in the document store.

**Returns:**

- <code>int</code> – Number of documents in the document store.
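The scoring semantics of the three vector functions described in `__init__` above can be made concrete with a small pure-Python sketch. This is illustrative only; the actual computation happens inside pgvector:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Higher is more similar; 1.0 for vectors pointing in the same direction,
    # regardless of magnitude.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def inner_product(a: list[float], b: list[float]) -> float:
    # Higher is more similar; unlike cosine similarity, it is sensitive
    # to vector magnitude.
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: list[float], b: list[float]) -> float:
    # Lower is more similar; 0.0 for identical vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
doc = [2.0, 0.0]  # same direction as the query, larger magnitude

print(cosine_similarity(query, doc))  # 1.0
print(inner_product(query, doc))      # 2.0
print(l2_distance(query, doc))        # 1.0
```

This is also why the `vector_function` must match the one used to build an HNSW index: each function induces a different ordering over the same vectors, and the index is built for one specific ordering.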
#### count_documents_async

```python
count_documents_async() -> int
```

Asynchronously returns how many documents are present in the document store.

**Returns:**

- <code>int</code> – Number of documents in the document store.

#### filter_documents

```python
filter_documents(filters: dict[str, Any] | None = None) -> list[Document]
```

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Parameters:**

- **filters** (<code>dict\[str, Any\] | None</code>) – The filters to apply to the document list.

**Returns:**

- <code>list\[Document\]</code> – A list of Documents that match the given filters.

**Raises:**

- <code>TypeError</code> – If `filters` is not a dictionary.
- <code>ValueError</code> – If the `filters` syntax is invalid.

#### filter_documents_async

```python
filter_documents_async(filters: dict[str, Any] | None = None) -> list[Document]
```

Asynchronously returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the [documentation](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Parameters:**

- **filters** (<code>dict\[str, Any\] | None</code>) – The filters to apply to the document list.

**Returns:**

- <code>list\[Document\]</code> – A list of Documents that match the given filters.

**Raises:**

- <code>TypeError</code> – If `filters` is not a dictionary.
- <code>ValueError</code> – If the `filters` syntax is invalid.

#### write_documents

```python
write_documents(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
```

Writes documents to the document store.
**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of Documents to write to the document store.
- **policy** (<code>DuplicatePolicy</code>) – The duplicate policy to use when writing documents.

**Returns:**

- <code>int</code> – The number of documents written to the document store.

**Raises:**

- <code>ValueError</code> – If `documents` contains objects that are not of type `Document`.
- <code>DuplicateDocumentError</code> – If a document with the same id already exists in the document store and the policy is set to `DuplicatePolicy.FAIL` (or not specified).
- <code>DocumentStoreError</code> – If the write operation fails for any other reason.

#### write_documents_async

```python
write_documents_async(
    documents: list[Document], policy: DuplicatePolicy = DuplicatePolicy.NONE
) -> int
```

Asynchronously writes documents to the document store.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of Documents to write to the document store.
- **policy** (<code>DuplicatePolicy</code>) – The duplicate policy to use when writing documents.

**Returns:**

- <code>int</code> – The number of documents written to the document store.

**Raises:**

- <code>ValueError</code> – If `documents` contains objects that are not of type `Document`.
- <code>DuplicateDocumentError</code> – If a document with the same id already exists in the document store and the policy is set to `DuplicatePolicy.FAIL` (or not specified).
- <code>DocumentStoreError</code> – If the write operation fails for any other reason.

#### delete_documents

```python
delete_documents(document_ids: list[str]) -> None
```

Deletes documents that match the provided `document_ids` from the document store.
**Parameters:**

- **document_ids** (<code>list\[str\]</code>) – The document IDs to delete.

#### delete_documents_async

```python
delete_documents_async(document_ids: list[str]) -> None
```

Asynchronously deletes documents that match the provided `document_ids` from the document store.

**Parameters:**

- **document_ids** (<code>list\[str\]</code>) – The document IDs to delete.

#### delete_all_documents

```python
delete_all_documents() -> None
```

Deletes all documents in the document store.

#### delete_all_documents_async

```python
delete_all_documents_async() -> None
```

Asynchronously deletes all documents in the document store.

#### delete_by_filter

```python
delete_by_filter(filters: dict[str, Any]) -> int
```

Deletes all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for deletion. For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>int</code> – The number of documents deleted.

#### delete_by_filter_async

```python
delete_by_filter_async(filters: dict[str, Any]) -> int
```

Asynchronously deletes all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for deletion. For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>int</code> – The number of documents deleted.
#### update_by_filter

```python
update_by_filter(filters: dict[str, Any], meta: dict[str, Any]) -> int
```

Updates the metadata of all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for updating. For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).
- **meta** (<code>dict\[str, Any\]</code>) – The metadata fields to update.

**Returns:**

- <code>int</code> – The number of documents updated.

#### update_by_filter_async

```python
update_by_filter_async(filters: dict[str, Any], meta: dict[str, Any]) -> int
```

Asynchronously updates the metadata of all documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents for updating. For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).
- **meta** (<code>dict\[str, Any\]</code>) – The metadata fields to update.

**Returns:**

- <code>int</code> – The number of documents updated.

#### count_documents_by_filter

```python
count_documents_by_filter(filters: dict[str, Any]) -> int
```

Returns the number of documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply when counting documents. For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>int</code> – The number of documents that match the filters.
#### count_documents_by_filter_async

```python
count_documents_by_filter_async(filters: dict[str, Any]) -> int
```

Asynchronously returns the number of documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply when counting documents. For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).

**Returns:**

- <code>int</code> – The number of documents that match the filters.

#### count_unique_metadata_by_filter

```python
count_unique_metadata_by_filter(
    filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]
```

Returns the count of unique values for each specified metadata field, considering only documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents. For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).
- **metadata_fields** (<code>list\[str\]</code>) – List of metadata field names to count unique values for. Field names can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, int\]</code> – A dictionary mapping field names to their unique value counts.

#### count_unique_metadata_by_filter_async

```python
count_unique_metadata_by_filter_async(
    filters: dict[str, Any], metadata_fields: list[str]
) -> dict[str, int]
```

Asynchronously returns the count of unique values for each specified metadata field, considering only documents that match the provided filters.

**Parameters:**

- **filters** (<code>dict\[str, Any\]</code>) – The filters to apply to select documents. For filter syntax, see [Haystack metadata filtering](https://docs.haystack.deepset.ai/docs/metadata-filtering).
- **metadata_fields** (<code>list\[str\]</code>) – List of metadata field names to count unique values for. Field names can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, int\]</code> – A dictionary mapping field names to their unique value counts.

#### get_metadata_fields_info

```python
get_metadata_fields_info() -> dict[str, dict[str, str]]
```

Returns information about the metadata fields in the document store.

Since metadata is stored in a JSONB field, this method analyzes actual data to infer field types.

Example return:

```python
{
    'content': {'type': 'text'},
    'category': {'type': 'text'},
    'status': {'type': 'text'},
    'priority': {'type': 'integer'},
}
```

**Returns:**

- <code>dict\[str, dict\[str, str\]\]</code> – A dictionary mapping field names to their type information.

#### get_metadata_fields_info_async

```python
get_metadata_fields_info_async() -> dict[str, dict[str, str]]
```

Asynchronously returns information about the metadata fields in the document store.

Since metadata is stored in a JSONB field, this method analyzes actual data to infer field types.

**Returns:**

- <code>dict\[str, dict\[str, str\]\]</code> – A dictionary mapping field names to their type information.

#### get_metadata_field_min_max

```python
get_metadata_field_min_max(metadata_field: str) -> dict[str, Any]
```

Returns the minimum and maximum values for a given metadata field.

**Parameters:**

- **metadata_field** (<code>str</code>) – The name of the metadata field. Can include or omit the "meta." prefix.
**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with 'min' and 'max' keys containing the minimum and maximum values. For numeric fields (integer, real), returns numeric min/max. For text fields, returns lexicographic min/max based on database collation.

**Raises:**

- <code>ValueError</code> – If the field doesn't exist or has no values.

#### get_metadata_field_min_max_async

```python
get_metadata_field_min_max_async(metadata_field: str) -> dict[str, Any]
```

Asynchronously returns the minimum and maximum values for a given metadata field.

**Parameters:**

- **metadata_field** (<code>str</code>) – The name of the metadata field. Can include or omit the "meta." prefix.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with 'min' and 'max' keys containing the minimum and maximum values. For numeric fields (integer, real), returns numeric min/max. For text fields, returns lexicographic min/max based on database collation.

**Raises:**

- <code>ValueError</code> – If the field doesn't exist or has no values.

#### get_metadata_field_unique_values

```python
get_metadata_field_unique_values(
    metadata_field: str, search_term: str | None, from_: int, size: int
) -> tuple[list[str], int]
```

Returns unique values for a given metadata field, optionally filtered by a search term.

**Parameters:**

- **metadata_field** (<code>str</code>) – The name of the metadata field. Can include or omit the "meta." prefix.
- **search_term** (<code>str | None</code>) – Optional search term to filter documents by content before extracting unique values. If `None`, all documents are considered.
- **from\_** (<code>int</code>) – The offset for pagination (0-based).
- **size** (<code>int</code>) – The number of unique values to return.
**Returns:**

- <code>tuple\[list\[str\], int\]</code> – A tuple containing:
  - A list of unique values (as strings)
  - The total count of unique values

#### get_metadata_field_unique_values_async

```python
get_metadata_field_unique_values_async(
    metadata_field: str, search_term: str | None, from_: int, size: int
) -> tuple[list[str], int]
```

Asynchronously returns unique values for a given metadata field, optionally filtered by a search term.

**Parameters:**

- **metadata_field** (<code>str</code>) – The name of the metadata field. Can include or omit the "meta." prefix.
- **search_term** (<code>str | None</code>) – Optional search term to filter documents by content before extracting unique values. If `None`, all documents are considered.
- **from\_** (<code>int</code>) – The offset for pagination (0-based).
- **size** (<code>int</code>) – The number of unique values to return.

**Returns:**

- <code>tuple\[list\[str\], int\]</code> – A tuple containing:
  - A list of unique values (as strings)
  - The total count of unique values
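The pagination contract of `get_metadata_field_unique_values` (`from_` offset, `size` page length, plus the total unique count) can be sketched over an in-memory list of metadata values. This is illustrative only; the store computes the result in SQL, and the actual ordering of values may differ:

```python
def unique_values_page(values: list[str], from_: int, size: int) -> tuple[list[str], int]:
    """Illustrative pagination over unique metadata values."""
    # Deduplicate, sort for a stable order, then slice the requested page.
    unique = sorted(set(values))
    return unique[from_:from_ + size], len(unique)


values = ["news", "article", "blog", "article", "news", "report"]

page, total = unique_values_page(values, from_=0, size=2)
print(page, total)  # ['article', 'blog'] 4

page2, _ = unique_values_page(values, from_=2, size=2)
print(page2)        # ['news', 'report']
```

The total count is returned alongside each page so a caller can compute the number of pages without fetching all values.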