Cradicle Explorer

/ docs-website / docs / concepts / document-store / choosing-a-document-store.mdx
choosing-a-document-store.mdx
  1  ---
  2  title: "Choosing a Document Store"
  3  id: choosing-a-document-store
  4  slug: "/choosing-a-document-store"
  5  description: "This article goes through different types of Document Stores and explains their advantages and disadvantages."
  6  ---
  7  
  8  import ClickableImage from "@site/src/components/ClickableImage";
  9  
 10  # Choosing a Document Store
 11  
 12  This article goes through different types of Document Stores and explains their advantages and disadvantages.
 13  
 14  ### Introduction
 15  
 16  Whether you are developing a chatbot, a RAG system, or an image captioner, at some point, it’ll be likely for your AI application to compare the input it gets with the information it already knows. Most of the time, this comparison is performed through vector similarity search.
 17  
 18  If you’re unfamiliar with vectors, think about them as a way to represent text, images, or audio/video in a numerical form called vector embeddings. Vector databases are specifically designed to store such vectors efficiently, providing all the functionalities an AI application needs to implement data retrieval and similarity search.
 19  
 20  Document Stores are special objects in Haystack that abstract all the different vector databases into a common interface that can be easily integrated into a pipeline, most commonly through a Retriever component. Normally, you will find specialized Document Store and Retriever objects for each vector database Haystack supports.
 21  
 22  ### Types of vector databases
 23  
 24  But why are vector databases so different, and which one should you use in your Haystack pipeline?
 25  
 26  We can group vector databases into five categories, from more specialized to general purpose:
 27  
 28  - Vector libraries
 29  - Pure vector databases
 30  - Vector-capable SQL databases
 31  - Vector-capable NoSQL databases
 32  - Full-text search databases
 33  
 34  We are working on supporting all these types in Haystack.
 35  
 36  In the meantime, here’s the most recent overview of available integrations:
 37  <ClickableImage src="/img/2c188e9-2.0_Document_Stores_6.png" alt="Document store categories diagram showing four types: pure vector databases (Chroma, Milvus, Pinecone, Weaviate, Qdrant), full-text search databases (Elasticsearch, OpenSearch), vector-capable SQL databases (Pgvector for PostgreSQL), and vector-capable NoSQL databases (DataStax Astra, MongoDB, neo4j)" className="img-light-bg" />
 38  
 39  #### Summary
 40  
 41  Here is a quick summary of different Document Stores available in Haystack.
 42  
 43  Continue further down the article for a more complex explanation of the strengths and disadvantages of each type.
 44  
 45  <div className="key-value-table">
 46  
 47  |  |  |
 48  | --- | --- |
 49  | Type                     | Best for                                                                                            |
 50  | Vector libraries         | Managing hardware resources effectively.                                                            |
 51  | Pure vector DBs          | Managing lots of high-dimensional data.                                                             |
 52  | Vector-capable SQL DBs   | Lower maintenance costs with focus on structured data and less on vectors.                          |
 53  | Vector-capable NoSQL DBs | Combining vectors with structured data without the limitations of the traditional relational model. |
 54  | Full-text search DBs     | Superior full-text search, reliable for production.                                                 |
 55  | In-memory                | Fast, minimal prototypes on small datasets.                                                         |
 56  
 57  </div>
 58  
 59  #### Vector libraries
 60  
 61  Vector libraries are often included in the “vector database” category improperly, as they are limited to handling only vectors, are designed to work in-memory, and normally don’t have a clean way to store data on disk. Still, they are the way to go every time performance and speed are the top requirements for your AI application, as these libraries can use hardware resources very effectively.
 62  
 63  :::warning[In progress]
 64  
 65  We are currently developing the support for vector libraries in Haystack.
 66  :::
 67  
 68  #### Pure vector databases
 69  
 70  Pure vector databases, also known as just “vector databases”, offer efficient similarity search capabilities through advanced indexing techniques. Most of them support metadata, and despite a recent trend to add more text-search features on top of it, you should consider pure vector databases closer to vector libraries than a regular database. Pick a pure vector database when your application needs to manage huge amounts of high-dimensional data effectively: they are designed to be highly scalable and highly available. Most are open source, but companies usually provide them “as a service” through paid subscriptions.
 71  
 72  - [Chroma](../../document-stores/chromadocumentstore.mdx)
 73  - [Pinecone](../../document-stores/pinecone-document-store.mdx)
 74  - [Qdrant](../../document-stores/qdrant-document-store.mdx)
 75  - [Weaviate](../../document-stores/weaviatedocumentstore.mdx)
 76  - [Milvus](https://haystack.deepset.ai/integrations/milvus-document-store) (external integration)
 77  
 78  #### Vector-capable SQL databases
 79  
 80  This category is relatively small but growing fast and includes well-known relational databases where vector capabilities were added through plugins or extensions. They are not as performant as the previous categories, but the main advantage of these databases is the opportunity to easily combine vectors with structured data, having a one-stop data shop for your application. You should pick a vector-capable SQL database when the performance trade-off is paid off by the lower cost of maintaining a single database instance for your application or when the structured data plays a more fundamental role in your business logic, with vectors being more of a nice-to-have.
 81  
 82  - [Pgvector](../../document-stores/pgvectordocumentstore.mdx)
 83  
 84  #### Vector-capable NoSQL databases
 85  
 86  Historically, the killer features of NoSQL databases were the ability to scale horizontally and the adoption of a flexible data model to overcome certain limitations of the traditional relational model. This stays true for databases in this category, where the vector capabilities are added on top of the existing features. Similarly to the previous category, vector support might not be as good as pure vector databases, but once again, there is a tradeoff that might be convenient to bear depending on the use case. For example, if a certain NoSQL database is already part of the stack of your application and a lower performance is not a show-stopper, you might give it a shot.
 87  
 88  - [Astra](../../document-stores/astradocumentstore.mdx)
 89  - [MongoDB](../../document-stores/mongodbatlasdocumentstore.mdx)
 90  - [Neo4j](https://haystack.deepset.ai/integrations/neo4j-document-store) (external)
 91  
 92  #### Full-text search databases
 93  
 94  The main advantage of full-text search databases is they are already designed to work with text, so you can expect a high level of support for text data along with good performance and the opportunity to scale both horizontally and vertically. Initially, vector capabilities were subpar and provided through plugins or extensions, but this is rapidly changing. You can see how the market leaders in this category have recently added first-class support for vectors. Pick a full-text search database if text data plays a central role in your business logic so that you can easily and effectively implement techniques like hybrid search with a good level of support for similarity search and state-of-the-art support for full-text search.
 95  
 96  - [Elasticsearch](../../document-stores/elasticsearch-document-store.mdx)
 97  - [OpenSearch](../../document-stores/opensearch-document-store.mdx)
 98  
 99  #### The in-memory Document Store
100  
101  Haystack ships with an ephemeral document store that relies on pure Python data structures stored in memory, so it doesn’t fall into any of the vector database categories above. This special Document Store is ideal for creating quick prototypes with small datasets. It doesn’t require any special setup, and it can be used right away without installing additional dependencies.
102  
103  - [InMemory](../../document-stores/inmemorydocumentstore.mdx)
104  
105  ### Final considerations
106  
107  It can be very challenging to pick one vector database over another by only looking at pure performance, as even the slightest difference in the benchmark can produce a different leaderboard (for example, some benchmarks test the cloud services while others work on a reference machine). Thinking about including features like filtering or not can bring in a whole new set of complexities that make the comparison even harder.
108  
109  What’s important for you to know is that the Document Store interface doesn’t add much to the costs, and the relative performance of one vector database over another should stay the same when used within Haystack pipelines.