document_stores_api.md
---
title: "Document Stores"
id: document-stores-api
description: "Stores your texts and metadata and provides them to the Retriever at query time."
slug: "/document-stores-api"
---

<a id="document_store"></a>

## Module document\_store

<a id="document_store.BM25DocumentStats"></a>

### BM25DocumentStats

A dataclass for managing document statistics for BM25 retrieval.

**Arguments**:

- `freq_token`: A Counter of token frequencies in the document.
- `doc_len`: Number of tokens in the document.

<a id="document_store.InMemoryDocumentStore"></a>

### InMemoryDocumentStore

Stores data in memory. The data is ephemeral: it is lost when the process exits unless you write it to disk with `save_to_disk`.

<a id="document_store.InMemoryDocumentStore.__init__"></a>

#### InMemoryDocumentStore.\_\_init\_\_

```python
def __init__(bm25_tokenization_regex: str = r"(?u)\b\w\w+\b",
             bm25_algorithm: Literal["BM25Okapi", "BM25L",
                                     "BM25Plus"] = "BM25L",
             bm25_parameters: Optional[dict] = None,
             embedding_similarity_function: Literal["dot_product",
                                                    "cosine"] = "dot_product",
             index: Optional[str] = None,
             async_executor: Optional[ThreadPoolExecutor] = None,
             return_embedding: bool = True)
```

Initializes the DocumentStore.

**Arguments**:

- `bm25_tokenization_regex`: The regular expression used to tokenize the text for BM25 retrieval.
- `bm25_algorithm`: The BM25 algorithm to use. One of "BM25Okapi", "BM25L", or "BM25Plus".
- `bm25_parameters`: Parameters for the BM25 implementation, given as a dictionary.
For example: `{'k1':1.5, 'b':0.75, 'epsilon':0.25}`
You can learn more about these parameters by visiting https://github.com/dorianbrown/rank_bm25.
- `embedding_similarity_function`: The similarity function used to compare Document embeddings.
One of "dot_product" (default) or "cosine". To choose the most appropriate function, check the documentation
of your embedding model.
- `index`: A specific index to store the documents. If not specified, a random UUID is used.
Using the same index allows you to share documents across multiple InMemoryDocumentStore instances.
- `async_executor`: Optional ThreadPoolExecutor to use for async calls. If not provided, a single-threaded
executor is created and used.
- `return_embedding`: Whether to return the embedding of the retrieved Documents. Default is True.

<a id="document_store.InMemoryDocumentStore.__del__"></a>

#### InMemoryDocumentStore.\_\_del\_\_

```python
def __del__()
```

Cleans up when the instance is being destroyed.

<a id="document_store.InMemoryDocumentStore.shutdown"></a>

#### InMemoryDocumentStore.shutdown

```python
def shutdown()
```

Explicitly shuts down the executor if this instance owns it.

<a id="document_store.InMemoryDocumentStore.storage"></a>

#### InMemoryDocumentStore.storage

```python
@property
def storage() -> dict[str, Document]
```

Utility property that returns the storage used by this instance of InMemoryDocumentStore.

<a id="document_store.InMemoryDocumentStore.to_dict"></a>

#### InMemoryDocumentStore.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="document_store.InMemoryDocumentStore.from_dict"></a>

#### InMemoryDocumentStore.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "InMemoryDocumentStore"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: The dictionary to deserialize from.

**Returns**:

The deserialized component.
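
The default `bm25_tokenization_regex` and the two fields of `BM25DocumentStats` described earlier can be illustrated with the standard library alone. The sketch below is not the library's implementation: `DocStats` and `compute_stats` are hypothetical names, and lowercasing the text before tokenizing is an assumption made for the example.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Default pattern from the __init__ signature: matches words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


@dataclass
class DocStats:
    """Hypothetical stand-in mirroring the two documented fields of BM25DocumentStats."""

    freq_token: Counter  # token -> number of occurrences in the document
    doc_len: int         # total number of tokens in the document


def compute_stats(text: str) -> DocStats:
    # Assumption of this sketch: text is lowercased before tokenization.
    tokens = TOKEN_PATTERN.findall(text.lower())
    return DocStats(freq_token=Counter(tokens), doc_len=len(tokens))


stats = compute_stats("A cat sat on the mat. The cat slept.")
print(stats.freq_token)
print(stats.doc_len)
```

Note that the single-character word "A" does not match the pattern, so it is not counted as a token. A different regular expression would be needed to keep one-character tokens.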

<a id="document_store.InMemoryDocumentStore.save_to_disk"></a>

#### InMemoryDocumentStore.save\_to\_disk

```python
def save_to_disk(path: str) -> None
```

Writes the database and its data to disk as a JSON file.

**Arguments**:

- `path`: The path to the JSON file.

<a id="document_store.InMemoryDocumentStore.load_from_disk"></a>

#### InMemoryDocumentStore.load\_from\_disk

```python
@classmethod
def load_from_disk(cls, path: str) -> "InMemoryDocumentStore"
```

Loads the database and its data from a JSON file on disk.

**Arguments**:

- `path`: The path to the JSON file.

**Returns**:

The loaded InMemoryDocumentStore.

<a id="document_store.InMemoryDocumentStore.count_documents"></a>

#### InMemoryDocumentStore.count\_documents

```python
def count_documents() -> int
```

Returns the number of documents present in the DocumentStore.

<a id="document_store.InMemoryDocumentStore.filter_documents"></a>

#### InMemoryDocumentStore.filter\_documents

```python
def filter_documents(
        filters: Optional[dict[str, Any]] = None) -> list[Document]
```

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the `DocumentStore.filter_documents()` protocol
documentation.

**Arguments**:

- `filters`: The filters to apply to the document list.

**Returns**:

A list of Documents that match the given filters.

<a id="document_store.InMemoryDocumentStore.write_documents"></a>

#### InMemoryDocumentStore.write\_documents

```python
def write_documents(documents: list[Document],
                    policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int
```

Refer to the `DocumentStore.write_documents()` protocol documentation.

If `policy` is set to `DuplicatePolicy.NONE`, it defaults to `DuplicatePolicy.FAIL`.

<a id="document_store.InMemoryDocumentStore.delete_documents"></a>

#### InMemoryDocumentStore.delete\_documents

```python
def delete_documents(document_ids: list[str]) -> None
```

Deletes all documents with matching `document_ids` from the DocumentStore.

**Arguments**:

- `document_ids`: The IDs of the documents to delete.

<a id="document_store.InMemoryDocumentStore.bm25_retrieval"></a>

#### InMemoryDocumentStore.bm25\_retrieval

```python
def bm25_retrieval(query: str,
                   filters: Optional[dict[str, Any]] = None,
                   top_k: int = 10,
                   scale_score: bool = False) -> list[Document]
```

Retrieves the documents that are most relevant to the query using the BM25 algorithm.

**Arguments**:

- `query`: The query string.
- `filters`: A dictionary with filters to narrow down the search space.
- `top_k`: The number of top documents to retrieve. Default is 10.
- `scale_score`: Whether to scale the scores of the retrieved documents. Default is False.

**Returns**:

A list of the top_k documents most relevant to the query.

<a id="document_store.InMemoryDocumentStore.embedding_retrieval"></a>

#### InMemoryDocumentStore.embedding\_retrieval

```python
def embedding_retrieval(
        query_embedding: list[float],
        filters: Optional[dict[str, Any]] = None,
        top_k: int = 10,
        scale_score: bool = False,
        return_embedding: Optional[bool] = False) -> list[Document]
```

Retrieves the documents that are most similar to the query embedding using a vector similarity metric.

**Arguments**:

- `query_embedding`: Embedding of the query.
- `filters`: A dictionary with filters to narrow down the search space.
- `top_k`: The number of top documents to retrieve. Default is 10.
- `scale_score`: Whether to scale the scores of the retrieved Documents. Default is False.
- `return_embedding`: Whether to return the embedding of the retrieved Documents.
If set to `None`, the value of the `return_embedding` parameter set at component
initialization is used. Default is False.

**Returns**:

A list of the top_k documents most similar to the query embedding.

<a id="document_store.InMemoryDocumentStore.count_documents_async"></a>

#### InMemoryDocumentStore.count\_documents\_async

```python
async def count_documents_async() -> int
```

Returns the number of documents present in the DocumentStore.

<a id="document_store.InMemoryDocumentStore.filter_documents_async"></a>

#### InMemoryDocumentStore.filter\_documents\_async

```python
async def filter_documents_async(
        filters: Optional[dict[str, Any]] = None) -> list[Document]
```

Returns the documents that match the filters provided.

For a detailed specification of the filters, refer to the `DocumentStore.filter_documents()` protocol
documentation.

**Arguments**:

- `filters`: The filters to apply to the document list.

**Returns**:

A list of Documents that match the given filters.

<a id="document_store.InMemoryDocumentStore.write_documents_async"></a>

#### InMemoryDocumentStore.write\_documents\_async

```python
async def write_documents_async(
        documents: list[Document],
        policy: DuplicatePolicy = DuplicatePolicy.NONE) -> int
```

Refer to the `DocumentStore.write_documents()` protocol documentation.

If `policy` is set to `DuplicatePolicy.NONE`, it defaults to `DuplicatePolicy.FAIL`.
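
The fallback from `DuplicatePolicy.NONE` to `DuplicatePolicy.FAIL` can be sketched on a plain dictionary. This is an illustration under stated assumptions, not the library's code: the `Policy` enum, the `write` function, the dictionary-based documents, the use of `ValueError`, and the rule that skipped documents are not counted are all hypothetical choices made for the example.

```python
from enum import Enum


class Policy(Enum):
    """Hypothetical stand-in for DuplicatePolicy."""

    NONE = "none"
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


def write(storage: dict, documents: list, policy: Policy = Policy.NONE) -> int:
    """Write documents keyed by their "id" and return how many were written."""
    # As documented above, NONE falls back to FAIL.
    if policy is Policy.NONE:
        policy = Policy.FAIL

    written = 0
    for doc in documents:
        doc_id = doc["id"]
        if doc_id in storage:
            if policy is Policy.SKIP:
                continue  # keep the existing document, do not count this one
            if policy is Policy.FAIL:
                # Assumption: a generic ValueError stands in for the real error type.
                raise ValueError(f"Document with id '{doc_id}' already exists.")
            # Policy.OVERWRITE falls through and replaces the stored document.
        storage[doc_id] = doc
        written += 1
    return written


store: dict = {}
docs = [{"id": "a", "content": "first"}, {"id": "b", "content": "second"}]
first_write = write(store, docs)                         # both documents are new
skipped = write(store, docs, policy=Policy.SKIP)         # both already exist
overwritten = write(store, docs[:1], policy=Policy.OVERWRITE)
print(first_write, skipped, overwritten)
```

Calling `write(store, docs)` a second time with the default policy raises the error in this sketch, because the default resolves to the fail-on-duplicate behavior.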

<a id="document_store.InMemoryDocumentStore.delete_documents_async"></a>

#### InMemoryDocumentStore.delete\_documents\_async

```python
async def delete_documents_async(document_ids: list[str]) -> None
```

Deletes all documents with matching `document_ids` from the DocumentStore.

**Arguments**:

- `document_ids`: The IDs of the documents to delete.

<a id="document_store.InMemoryDocumentStore.bm25_retrieval_async"></a>

#### InMemoryDocumentStore.bm25\_retrieval\_async

```python
async def bm25_retrieval_async(query: str,
                               filters: Optional[dict[str, Any]] = None,
                               top_k: int = 10,
                               scale_score: bool = False) -> list[Document]
```

Retrieves the documents that are most relevant to the query using the BM25 algorithm.

**Arguments**:

- `query`: The query string.
- `filters`: A dictionary with filters to narrow down the search space.
- `top_k`: The number of top documents to retrieve. Default is 10.
- `scale_score`: Whether to scale the scores of the retrieved documents. Default is False.

**Returns**:

A list of the top_k documents most relevant to the query.

<a id="document_store.InMemoryDocumentStore.embedding_retrieval_async"></a>

#### InMemoryDocumentStore.embedding\_retrieval\_async

```python
async def embedding_retrieval_async(
        query_embedding: list[float],
        filters: Optional[dict[str, Any]] = None,
        top_k: int = 10,
        scale_score: bool = False,
        return_embedding: bool = False) -> list[Document]
```

Retrieves the documents that are most similar to the query embedding using a vector similarity metric.

**Arguments**:

- `query_embedding`: Embedding of the query.
- `filters`: A dictionary with filters to narrow down the search space.
- `top_k`: The number of top documents to retrieve. Default is 10.
- `scale_score`: Whether to scale the scores of the retrieved Documents. Default is False.
- `return_embedding`: Whether to return the embedding of the retrieved Documents. Default is False.

**Returns**:

A list of the top_k documents most similar to the query embedding.
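
To make the difference between the two `embedding_similarity_function` options concrete, here is a minimal pure-Python sketch of top-k ranking by dot product and by cosine similarity. It is not the library's implementation: the function names and the dictionary of embeddings are hypothetical, and filtering and score scaling are left out.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    # Dot product of the two vectors after normalizing each to unit length.
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    return dot_product(a, b) / norm if norm else 0.0


def top_k_by_similarity(query_embedding, embeddings, top_k=10, similarity=dot_product):
    """Return up to top_k (doc_id, score) pairs, highest score first."""
    scored = [(doc_id, similarity(query_embedding, emb)) for doc_id, emb in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical two-dimensional embeddings keyed by document ID.
embeddings = {
    "long": [3.0, 3.0],        # large magnitude, 45 degrees away from the query
    "aligned": [1.0, 0.0],     # same direction as the query
    "orthogonal": [0.0, 1.0],  # unrelated direction
}
query = [1.0, 0.0]

by_dot = top_k_by_similarity(query, embeddings, top_k=2)
by_cos = top_k_by_similarity(query, embeddings, top_k=2, similarity=cosine)
print(by_dot)
print(by_cos)
```

With these vectors, the dot product ranks the high-magnitude document first, while cosine similarity ranks the document pointing in the same direction as the query first. This is why the right choice depends on whether the embedding model produces normalized vectors.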