# DriftKit RAG Module

High-level fluent API for building Retrieval-Augmented Generation (RAG) and Content-Augmented Generation (CAG) pipelines in DriftKit.

## Overview

The DriftKit RAG module provides a clean, fluent API for:
- **Document Ingestion** - ETL pipeline for loading, splitting, embedding, and storing documents
- **Document Retrieval** - Search pipeline for finding relevant documents, with optional reranking

## Key Features

- 🚀 **Fluent Builder Pattern** - Intuitive, chainable API
- 📁 **Multiple Document Sources** - File system, URLs, databases (extensible)
- ✂️ **Smart Text Splitting** - Character-based and semantic splitting strategies
- 🔍 **Flexible Retrieval** - Support for both text- and embedding-based search
- 🎯 **Reranking** - Model-based reranking for improved relevance
- 🔄 **Streaming Support** - Memory-efficient processing of large document sets
- ⚡ **Virtual Threads** - Efficient concurrent processing (Java 21+)
- 🔧 **Extensible** - All components use interfaces for easy customization

## Quick Start

### 1. Document Ingestion

```java
// Simple ingestion from the file system
IngestionPipeline pipeline = IngestionPipeline.builder()
    .documentLoader(FileSystemLoader.builder()
        .rootPath(Paths.get("./docs"))
        .parser(unifiedParser)
        .extensions(Set.of("txt", "md", "pdf", "jpg"))
        .build())
    .textSplitter(RecursiveCharacterTextSplitter.withChunkSize(512))
    .vectorStore(vectorStore)
    .indexName("knowledge-base")
    .build();

// Run ingestion with the acknowledgment pattern
pipeline.run(result -> {
    if (result.isSuccess()) {
        log.info("Ingested: {} ({} chunks)", result.documentId(), result.chunksStored());
        // Acknowledge successful processing
    } else {
        log.error("Failed: {}", result.errors());
        // Handle errors, potentially retry
    }
});
```

### 2. Document Retrieval

```java
// Simple retrieval
RetrievalPipeline pipeline = RetrievalPipeline.builder()
    .vectorStore(vectorStore)
    .indexName("knowledge-base")
    .topK(10)
    .build();

List<Document> results = pipeline.retrieve("How to implement RAG?");
```

### 3. Advanced Retrieval with Reranking

```java
// Retrieval with model-based reranking
RetrievalPipeline pipeline = RetrievalPipeline.builder()
    .vectorStore(vectorStore)
    .indexName("knowledge-base")
    .reranker(ModelBasedReranker.builder()
        .modelClient(openAIClient)
        .promptService(promptService)
        .model("gpt-4o")
        .build())
    .topK(20)       // Retrieve more candidates initially
    .minScore(0.5f)
    .build();

List<Document> results = pipeline.retrieve("Best practices for embeddings");
```

## Architecture

### Components

1. **DocumentLoader** - Loads documents from various sources
   - `FileSystemLoader` - Loads from local files (supports PDF and images via UnifiedParser)
   - `UrlLoader` - Loads from web URLs
   - Custom implementations for databases, S3, etc.

2. **TextSplitter** - Splits documents into chunks
   - `RecursiveCharacterTextSplitter` - Character-based splitting with overlap
   - `SemanticTextSplitter` - Groups semantically similar sentences

3. **Retriever** - Pluggable retrieval strategies
   - `VectorStoreRetriever` - Default vector-search implementation
   - Custom implementations for hybrid search, etc.

4. **Reranker** - Improves relevance of retrieved documents
   - `ModelBasedReranker` - Uses an LLM with structured output
   - Custom implementations for cross-encoders, etc.
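The overlap-based chunking that a character-based text splitter performs can be sketched in isolation. The `Chunker` class below is a hypothetical illustration of the chunk-size/overlap mechanics only, not the actual `RecursiveCharacterTextSplitter` API (which also splits on separators recursively):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: fixed-size character chunking with overlap.
// Each chunk starts (chunkSize - overlap) characters after the previous one,
// so consecutive chunks share `overlap` characters of context.
public class Chunker {
    public static List<String> split(String text, int chunkSize, int overlap) {
        if (overlap >= chunkSize) {
            throw new IllegalArgumentException("overlap must be smaller than chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // last chunk reached the end of the text
            }
        }
        return chunks;
    }
}
```

With `chunkSize = 512` and `overlap = 128` (the defaults shown in the Spring Boot configuration below), each chunk repeats the last 128 characters of its predecessor, which helps preserve context across chunk boundaries at the cost of some storage overhead.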
### Vector Store Support

The module works with both types of DriftKit vector stores:
- **TextVectorStore** - Handles embedding internally (simpler)
- **EmbeddingVectorStore** - Requires an external embedding model (more control)

## Advanced Features

### Progress Tracking

```java
IngestionPipeline.ProgressListener listener = new IngestionPipeline.ProgressListener() {
    @Override
    public void onDocumentLoaded(String documentId, String source) {
        log.info("Loading: {}", source);
    }

    @Override
    public void onDocumentProcessed(String documentId, int chunks) {
        log.info("Processed: {} -> {} chunks", documentId, chunks);
    }

    @Override
    public void onChunkStored(String chunkId) {
        // Track individual chunk storage
    }
};

pipeline.run(listener);
```

### Streaming Processing

```java
// Process documents as a stream (memory efficient)
try (Stream<IngestionPipeline.DocumentResult> results = pipeline.run()) {
    results
        .filter(result -> result.isSuccess())
        .forEach(result -> {
            // Process each result
        });
}
```

### Retry and Error Handling

```java
IngestionPipeline pipeline = IngestionPipeline.builder()
    // ... other configuration
    .maxRetries(3)
    .retryDelayMs(1000)
    .useVirtualThreads(true) // Efficient concurrency
    .build();
```

### Metadata Filtering

```java
RetrievalPipeline pipeline = RetrievalPipeline.builder()
    .vectorStore(vectorStore)
    .indexName("knowledge-base")
    .filters(Map.of(
        "contentType", "PDF",
        "department", "engineering"
    ))
    .build();
```

## Integration with the DriftKit Ecosystem

- Uses the existing `UnifiedParser` for multi-format document parsing
- Integrates with DriftKit embedding models
- Works with all DriftKit vector store implementations
- Compatible with Spring Boot auto-configuration
- Supports DriftKit's prompt management system

## Dependencies

```xml
<dependency>
    <groupId>ai.driftkit</groupId>
    <artifactId>driftkit-rag-core</artifactId>
    <version>${driftkit.version}</version>
</dependency>
```

For Spring Boot applications:

```xml
<dependency>
    <groupId>ai.driftkit</groupId>
    <artifactId>driftkit-rag-spring-boot-starter</artifactId>
    <version>${driftkit.version}</version>
</dependency>
```

## Requirements

- Java 21+ (for virtual thread support)
- DriftKit 0.8.8+

## Spring Boot Integration

The Spring Boot starter provides auto-configuration and convenient services.

### Configuration

```yaml
driftkit:
  rag:
    enabled: true
    default-index: knowledge-base

    splitter:
      type: recursive # or 'semantic'
      chunk-size: 512
      chunk-overlap: 128

    reranker:
      enabled: true
      model: gpt-4o
      temperature: 0.0

    retriever:
      default-top-k: 10
      default-min-score: 0.0
```

### Using RagService

```java
@Slf4j
@RestController
@RequiredArgsConstructor
public class DocumentController {

    private final RagService ragService;
    private final DocumentLoaderFactory loaderFactory;

    @PostMapping("/ingest/urls")
    public void ingestUrls(@RequestBody List<String> urls) {
        try (Stream<DocumentResult> results = ragService.ingestFromUrls(urls, "web-docs")) {
            results.forEach(result -> {
                if (result.isSuccess()) {
                    log.info("Ingested: {}", result.documentId());
                }
            });
        }
    }

    @GetMapping("/search")
    public List<Document> search(@RequestParam String query) {
        return ragService.retrieve(query);
    }
}
```

### Dynamic Document Loading

Create loaders at runtime with the factory:

```java
// Load from different sources dynamically
DocumentLoader urlLoader = loaderFactory.urlLoader(urls);
DocumentLoader fileLoader = loaderFactory.fileSystemLoader("/path/to/docs");

// Combine multiple loaders
DocumentLoader combined = loaderFactory.compositeLoader(urlLoader, fileLoader);

// Use the combined loader with a custom pipeline
ragService.ingest(combined, "my-index");
```

## Future Enhancements

- [ ] Hybrid search support (keyword + semantic)
- [ ] Incremental indexing
- [ ] Document update detection
- [ ] Caching layer for retrieval
- [ ] More reranker implementations
- [ ] Batch API for large-scale operations
- [ ] Streaming ingestion API
- [ ] WebSocket support for real-time progress