---
title: "Choosing a Document Store"
id: choosing-a-document-store
slug: "/choosing-a-document-store"
description: "This article goes through different types of Document Stores and explains their advantages and disadvantages."
---

import ClickableImage from "@site/src/components/ClickableImage";

# Choosing a Document Store

This article goes through different types of Document Stores and explains their advantages and disadvantages.

### Introduction

Whether you are developing a chatbot, a RAG system, or an image captioner, at some point your AI application will likely need to compare the input it gets with the information it already knows. Most of the time, this comparison is performed through vector similarity search.

If you’re unfamiliar with vectors, think of them as a way to represent text, images, or audio/video in a numerical form called vector embeddings. Vector databases are specifically designed to store such vectors efficiently, providing all the functionalities an AI application needs to implement data retrieval and similarity search.

Document Stores are special objects in Haystack that abstract all the different vector databases into a common interface that can be easily integrated into a pipeline, most commonly through a Retriever component. Normally, you will find specialized Document Store and Retriever objects for each vector database Haystack supports.

### Types of vector databases

But why are vector databases so different, and which one should you use in your Haystack pipeline?

We can group vector databases into five categories, from most specialized to most general-purpose:

- Vector libraries
- Pure vector databases
- Vector-capable SQL databases
- Vector-capable NoSQL databases
- Full-text search databases

We are working on supporting all these types in Haystack.
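
As a rough illustration of the vector similarity search mentioned in the introduction, here is a minimal sketch in plain Python. It is not how Haystack or any vector database implements search: the three-dimensional "embeddings" are hand-made for the example, and the helper names `cosine_similarity` and `top_k` are hypothetical. Real systems use embeddings with hundreds or thousands of dimensions and specialized indexes instead of a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, embeddings, k=2):
    """Score every stored vector against the query and return the k best (id, score) pairs."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-made toy embeddings (assumption: real ones come from an embedding model).
EMBEDDINGS = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.3, 0.1],
    "stocks": [0.0, 0.2, 0.9],
}

results = top_k([1.0, 0.2, 0.0], EMBEDDINGS, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

The brute-force scan above compares the query with every stored vector, which is exactly the cost that the indexing techniques of dedicated vector databases are meant to avoid.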

In the meantime, here’s the most recent overview of available integrations:
<ClickableImage src="/img/2c188e9-2.0_Document_Stores_6.png" alt="Document store categories diagram showing four types: pure vector databases (Chroma, Milvus, Pinecone, Weaviate, Qdrant), full-text search databases (Elasticsearch, OpenSearch), vector-capable SQL databases (Pgvector for PostgreSQL), and vector-capable NoSQL databases (DataStax Astra, MongoDB, neo4j)" className="img-light-bg" />

#### Summary

Here is a quick summary of the different Document Stores available in Haystack.

Continue further down the article for a more detailed explanation of the strengths and weaknesses of each type.

<div className="key-value-table">

|  |  |
| --- | --- |
| Type | Best for |
| Vector libraries | Managing hardware resources effectively. |
| Pure vector DBs | Managing lots of high-dimensional data. |
| Vector-capable SQL DBs | Lower maintenance costs with focus on structured data and less on vectors. |
| Vector-capable NoSQL DBs | Combining vectors with structured data without the limitations of the traditional relational model. |
| Full-text search DBs | Superior full-text search, reliable for production. |
| In-memory | Fast, minimal prototypes on small datasets. |

</div>

#### Vector libraries

Vector libraries are often improperly included in the “vector database” category: they handle only vectors, are designed to work in memory, and normally don’t have a clean way to store data on disk. Still, they are the way to go whenever performance and speed are the top requirements for your AI application, as these libraries can use hardware resources very effectively.

:::warning[In progress]

We are currently developing support for vector libraries in Haystack.
:::

#### Pure vector databases

Pure vector databases, also known as just “vector databases”, offer efficient similarity search capabilities through advanced indexing techniques. Most of them support metadata, and despite a recent trend to add more text-search features on top of them, you should consider pure vector databases closer to vector libraries than to a regular database. Pick a pure vector database when your application needs to manage huge amounts of high-dimensional data effectively: they are designed to be highly scalable and highly available. Most are open source, but companies usually provide them “as a service” through paid subscriptions.

- [Chroma](../../document-stores/chromadocumentstore.mdx)
- [Pinecone](../../document-stores/pinecone-document-store.mdx)
- [Qdrant](../../document-stores/qdrant-document-store.mdx)
- [Weaviate](../../document-stores/weaviatedocumentstore.mdx)
- [Milvus](https://haystack.deepset.ai/integrations/milvus-document-store) (external integration)

#### Vector-capable SQL databases

This category is relatively small but growing fast and includes well-known relational databases where vector capabilities were added through plugins or extensions. They are not as performant as the previous categories, but their main advantage is that you can easily combine vectors with structured data, giving your application a one-stop data shop. Pick a vector-capable SQL database when the performance trade-off is offset by the lower cost of maintaining a single database instance for your application, or when structured data plays a more fundamental role in your business logic, with vectors being more of a nice-to-have.

- [Pgvector](../../document-stores/pgvectordocumentstore.mdx)

#### Vector-capable NoSQL databases

Historically, the killer features of NoSQL databases were the ability to scale horizontally and the adoption of a flexible data model to overcome certain limitations of the traditional relational model. This stays true for databases in this category, where the vector capabilities are added on top of the existing features. As with the previous category, vector support might not be as good as in pure vector databases, but once again, the trade-off may be worth accepting depending on the use case. For example, if a certain NoSQL database is already part of your application’s stack and lower performance is not a show-stopper, you might give it a shot.

- [Astra](../../document-stores/astradocumentstore.mdx)
- [MongoDB](../../document-stores/mongodbatlasdocumentstore.mdx)
- [Neo4j](https://haystack.deepset.ai/integrations/neo4j-document-store) (external integration)

#### Full-text search databases

The main advantage of full-text search databases is that they are already designed to work with text, so you can expect a high level of support for text data along with good performance and the opportunity to scale both horizontally and vertically. Initially, vector capabilities were subpar and provided through plugins or extensions, but this is rapidly changing: the market leaders in this category have recently added first-class support for vectors. Pick a full-text search database if text data plays a central role in your business logic, so that you can easily and effectively implement techniques like hybrid search, with a good level of support for similarity search and state-of-the-art support for full-text search.
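
To make the hybrid search mentioned above more concrete, here is a minimal sketch of one common way to merge a keyword ranking with a vector-similarity ranking: reciprocal rank fusion (RRF). The article does not prescribe a fusion method, so treat this as an assumption chosen for illustration. The function name, the document ids, and the two input rankings are hypothetical, and `k=60` is just a commonly used default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one list.

    Each document earns 1 / (k + rank) from every ranking it appears in
    (rank starts at 1); documents are returned by descending total score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results of a full-text query and a similarity query.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Documents that rank well in both lists (`doc_a` and `doc_b` here) rise to the top, while documents found by only one of the two searches still make it into the result.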

- [Elasticsearch](../../document-stores/elasticsearch-document-store.mdx)
- [OpenSearch](../../document-stores/opensearch-document-store.mdx)

#### The in-memory Document Store

Haystack ships with an ephemeral Document Store that relies on pure Python data structures stored in memory, so it doesn’t fall into any of the vector database categories above. This special Document Store is ideal for creating quick prototypes with small datasets. It doesn’t require any special setup, and it can be used right away without installing additional dependencies.

- [InMemory](../../document-stores/inmemorydocumentstore.mdx)

### Final considerations

It can be very challenging to pick one vector database over another by looking only at raw performance, as even the slightest difference in benchmark setup can produce a different leaderboard (for example, some benchmarks test the cloud services while others run on a reference machine). Deciding whether to account for features like filtering brings in a whole new set of complexities that make the comparison even harder.

What’s important for you to know is that the Document Store interface doesn’t add much overhead, and the relative performance of one vector database compared to another should stay the same when used within Haystack pipelines.
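
To picture a common Document Store interface sitting behind a Retriever, and an ephemeral store built on plain Python data structures, here is a toy sketch. It is a stand-in written for this article, not Haystack’s implementation or API: `Doc`, `DocumentStore`, `ToyInMemoryStore`, `ToyRetriever`, and all their methods are hypothetical names, and the word-overlap scoring is a deliberately naive placeholder for real keyword or vector search.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Protocol


@dataclass
class Doc:
    id: str
    content: str


class DocumentStore(Protocol):
    """Hypothetical minimal interface; real Document Stores expose more."""

    def write(self, docs: list[Doc]) -> int: ...

    def search(self, query: str, top_k: int) -> list[Doc]: ...


class ToyInMemoryStore:
    """Keeps documents in a dict; everything is lost when the process exits."""

    def __init__(self) -> None:
        self._docs: dict[str, Doc] = {}

    def write(self, docs: list[Doc]) -> int:
        for doc in docs:
            self._docs[doc.id] = doc  # Same id overwrites the old document.
        return len(docs)

    def search(self, query: str, top_k: int) -> list[Doc]:
        terms = set(query.lower().split())
        # Score = number of query words that also appear in the document.
        scored = [
            (len(terms & set(doc.content.lower().split())), doc)
            for doc in self._docs.values()
        ]
        matches = [(score, doc) for score, doc in scored if score > 0]
        matches.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in matches[:top_k]]


class ToyRetriever:
    """Depends only on the interface, so any conforming store can be swapped in."""

    def __init__(self, store: DocumentStore) -> None:
        self.store = store

    def run(self, query: str, top_k: int = 1) -> list[Doc]:
        return self.store.search(query, top_k)


store = ToyInMemoryStore()
store.write(
    [
        Doc("1", "Vector databases store embeddings"),
        Doc("2", "Full-text search engines index words"),
    ]
)
hits = ToyRetriever(store).run("vector embeddings")
print([doc.id for doc in hits])
```

In this sketch the retriever only forwards the call to the store, so the interface layer itself does very little work, and swapping `ToyInMemoryStore` for a store backed by a real database would not require changing `ToyRetriever`.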