# DriftKit RAG Module

A high-level fluent API for building Retrieval-Augmented Generation (RAG) and Content-Augmented Generation (CAG) pipelines in DriftKit.

## Overview

The DriftKit RAG module provides a clean, fluent API for:
- **Document Ingestion** - ETL pipeline for loading, splitting, embedding, and storing documents
- **Document Retrieval** - Search pipeline for finding relevant documents with optional reranking

## Key Features

- 🚀 **Fluent Builder Pattern** - Intuitive, chainable API
- 📁 **Multiple Document Sources** - File system, URLs, databases (extensible)
- ✂️ **Smart Text Splitting** - Character-based and semantic splitting strategies
- 🔍 **Flexible Retrieval** - Support for both text and embedding-based search
- 🎯 **Reranking** - Model-based reranking for improved relevance
- 🔄 **Streaming Support** - Memory-efficient processing of large document sets
- ⚡ **Virtual Threads** - Efficient concurrent processing (Java 21+)
- 🔧 **Extensible** - All components use interfaces for easy customization

## Quick Start

### 1. Document Ingestion

```java
// Simple ingestion from file system
IngestionPipeline pipeline = IngestionPipeline.builder()
    .documentLoader(FileSystemLoader.builder()
        .rootPath(Paths.get("./docs"))
        .parser(unifiedParser)
        .extensions(Set.of("txt", "md", "pdf", "jpg"))
        .build())
    .textSplitter(RecursiveCharacterTextSplitter.withChunkSize(512))
    .vectorStore(vectorStore)
    .indexName("knowledge-base")
    .build();

// Run ingestion with acknowledgment pattern
pipeline.run(result -> {
    if (result.isSuccess()) {
        log.info("Ingested: {} ({} chunks)", result.documentId(), result.chunksStored());
        // Acknowledge successful processing
    } else {
        log.error("Failed: {}", result.errors());
        // Handle errors, potentially retry
    }
});
```
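
The chunking step above is driven by two numbers: the chunk size and the overlap between consecutive chunks. As a rough illustration of those semantics (this is a simplified sketch, not DriftKit's implementation — the real `RecursiveCharacterTextSplitter` additionally recurses over separators such as paragraphs, sentences, and words before falling back to fixed windows):

```java
import java.util.ArrayList;
import java.util.List;

// Minimal fixed-window splitter with overlap, illustrating the idea behind
// chunkSize/overlap. Each window advances by (chunkSize - overlap) characters,
// so adjacent chunks share `overlap` characters of context.
public class NaiveOverlapSplitter {
    public static List<String> split(String text, int chunkSize, int overlap) {
        if (overlap >= chunkSize) {
            throw new IllegalArgumentException("overlap must be smaller than chunkSize");
        }
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;            // how far the window advances
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(start + chunkSize, text.length());
            chunks.add(text.substring(start, end));
            if (end == text.length()) break;       // last chunk reached
        }
        return chunks;
    }
}
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of some duplicated storage.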

### 2. Document Retrieval

```java
// Simple retrieval
RetrievalPipeline pipeline = RetrievalPipeline.builder()
    .vectorStore(vectorStore)
    .indexName("knowledge-base")
    .topK(10)
    .build();

List<Document> results = pipeline.retrieve("How to implement RAG?");
```

### 3. Advanced Retrieval with Reranking

```java
// Retrieval with model-based reranking
RetrievalPipeline pipeline = RetrievalPipeline.builder()
    .vectorStore(vectorStore)
    .indexName("knowledge-base")
    .reranker(ModelBasedReranker.builder()
        .modelClient(openAIClient)
        .promptService(promptService)
        .model("gpt-4o")
        .build())
    .topK(20)  // Retrieve more initially
    .minScore(0.5f)
    .build();

List<Document> results = pipeline.retrieve("Best practices for embeddings");
```

## Architecture

### Components

1. **DocumentLoader** - Load documents from various sources
   - `FileSystemLoader` - Load from local files (supports PDF, images via UnifiedParser)
   - `UrlLoader` - Load from web URLs
   - Custom implementations for databases, S3, etc.

2. **TextSplitter** - Split documents into chunks
   - `RecursiveCharacterTextSplitter` - Character-based with overlap
   - `SemanticTextSplitter` - Groups semantically similar sentences

3. **Retriever** - Custom retrieval strategies
   - `VectorStoreRetriever` - Default vector search implementation
   - Custom implementations for hybrid search, etc.

4. **Reranker** - Improve relevance of retrieved documents
   - `ModelBasedReranker` - Uses LLM with structured output
   - Custom implementations for cross-encoders, etc.
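
Since every component is an interface, extending the pipeline means supplying your own implementation. The sketch below shows the general shape of a custom loader; the actual `ai.driftkit` `DocumentLoader` interface may differ, so a minimal stand-in interface is declared inline to keep the example self-contained:

```java
import java.util.List;
import java.util.stream.Stream;

// Hypothetical stand-in for the loader contract -- the real DriftKit
// DocumentLoader interface and document type may have a different shape.
interface DocumentLoader {
    Stream<LoadedDocument> load();
}

record LoadedDocument(String id, String content) {}

// Example custom loader serving documents from an in-memory list.
// The same pattern applies to a database- or S3-backed implementation:
// open the source lazily and expose results as a Stream.
class InMemoryLoader implements DocumentLoader {
    private final List<LoadedDocument> docs;

    InMemoryLoader(List<LoadedDocument> docs) {
        this.docs = docs;
    }

    @Override
    public Stream<LoadedDocument> load() {
        return docs.stream();
    }
}
```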

### Vector Store Support

The module works with both types of DriftKit vector stores:
- **TextVectorStore** - Handles embedding internally (simpler)
- **EmbeddingVectorStore** - Requires external embedding model (more control)
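
In either case, retrieval ultimately reduces to scoring stored embeddings against the query embedding and keeping the best matches. As a conceptual sketch of that step (cosine similarity plus top-k, not DriftKit's actual implementation):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Conceptual sketch of vector retrieval: score each stored embedding against
// the query by cosine similarity, then return the ids of the top-k matches.
public class CosineTopK {
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    static List<String> topK(Map<String, float[]> store, float[] query, int k) {
        return store.entrySet().stream()
            .sorted(Comparator.comparingDouble(
                (Map.Entry<String, float[]> e) -> -cosine(e.getValue(), query)))
            .limit(k)
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }
}
```

With a `TextVectorStore`, the embedding of both documents and queries happens inside the store; with an `EmbeddingVectorStore`, you run the embedding model yourself and hand the store raw vectors.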

## Advanced Features

### Progress Tracking

```java
IngestionPipeline.ProgressListener listener = new IngestionPipeline.ProgressListener() {
    @Override
    public void onDocumentLoaded(String documentId, String source) {
        log.info("Loading: {}", source);
    }

    @Override
    public void onDocumentProcessed(String documentId, int chunks) {
        log.info("Processed: {} -> {} chunks", documentId, chunks);
    }

    @Override
    public void onChunkStored(String chunkId) {
        // Track individual chunk storage
    }
};

pipeline.run(listener);
```

### Streaming Processing

```java
// Process documents as a stream (memory efficient)
try (Stream<IngestionPipeline.DocumentResult> results = pipeline.run()) {
    results
        .filter(IngestionPipeline.DocumentResult::isSuccess)
        .forEach(result -> {
            // Process each result
        });
}
```

### Retry and Error Handling

```java
IngestionPipeline pipeline = IngestionPipeline.builder()
    // ... other configuration
    .maxRetries(3)
    .retryDelayMs(1000)
    .useVirtualThreads(true)  // Efficient concurrency
    .build();
```
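
The intended semantics of `maxRetries` / `retryDelayMs` can be illustrated with a generic retry loop: attempt the operation, and on failure sleep for the configured delay before trying again, up to the retry limit. How the pipeline implements this internally is an assumption; this is just the general pattern:

```java
import java.util.concurrent.Callable;

// Illustrative retry loop mirroring maxRetries/retryDelayMs semantics:
// the task runs at most (1 + maxRetries) times, sleeping between attempts.
public class Retry {
    public static <T> T withRetry(Callable<T> task, int maxRetries, long delayMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxRetries) {
                    Thread.sleep(delayMs);   // back off before the next attempt
                }
            }
        }
        throw last;                          // all attempts exhausted
    }
}
```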

### Metadata Filtering

```java
RetrievalPipeline pipeline = RetrievalPipeline.builder()
    .vectorStore(vectorStore)
    .indexName("knowledge-base")
    .filters(Map.of(
        "contentType", "PDF",
        "department", "engineering"
    ))
    .build();
```
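
Assuming the filters act as exact-match constraints on document metadata (the precise matching semantics are DriftKit's, so treat this as an assumption), the logic amounts to: a document passes when every filter key appears in its metadata with an equal value.

```java
import java.util.Map;

// Sketch of exact-match metadata filtering (assumed semantics): a document
// matches when every filter entry equals the corresponding metadata entry.
public class MetadataFilter {
    public static boolean matches(Map<String, String> metadata,
                                  Map<String, String> filters) {
        return filters.entrySet().stream()
            .allMatch(f -> f.getValue().equals(metadata.get(f.getKey())));
    }
}
```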

## Integration with DriftKit Ecosystem

- Uses the existing `UnifiedParser` for multi-format document parsing
- Integrates with DriftKit embedding models
- Works with all DriftKit vector store implementations
- Compatible with Spring Boot auto-configuration
- Supports DriftKit's prompt management system

## Dependencies

```xml
<dependency>
    <groupId>ai.driftkit</groupId>
    <artifactId>driftkit-rag-core</artifactId>
    <version>${driftkit.version}</version>
</dependency>
```

For Spring Boot applications:
```xml
<dependency>
    <groupId>ai.driftkit</groupId>
    <artifactId>driftkit-rag-spring-boot-starter</artifactId>
    <version>${driftkit.version}</version>
</dependency>
```

## Requirements

- Java 21+ (for virtual threads support)
- DriftKit 0.8.8+

## Spring Boot Integration

The Spring Boot starter provides auto-configuration and convenient services:

### Configuration

```yaml
driftkit:
  rag:
    enabled: true
    default-index: knowledge-base

    splitter:
      type: recursive  # or 'semantic'
      chunk-size: 512
      chunk-overlap: 128

    reranker:
      enabled: true
      model: gpt-4o
      temperature: 0.0

    retriever:
      default-top-k: 10
      default-min-score: 0.0
```

### Using RagService

```java
@Slf4j
@RestController
@RequiredArgsConstructor
public class DocumentController {

    private final RagService ragService;
    private final DocumentLoaderFactory loaderFactory;

    @PostMapping("/ingest/urls")
    public void ingestUrls(@RequestBody List<String> urls) {
        try (Stream<DocumentResult> results = ragService.ingestFromUrls(urls, "web-docs")) {
            results.forEach(result -> {
                if (result.isSuccess()) {
                    log.info("Ingested: {}", result.documentId());
                }
            });
        }
    }

    @GetMapping("/search")
    public List<Document> search(@RequestParam String query) {
        return ragService.retrieve(query);
    }
}
```

### Dynamic Document Loading

Create loaders at runtime with the factory:

```java
// Load from different sources dynamically
DocumentLoader urlLoader = loaderFactory.urlLoader(urls);
DocumentLoader fileLoader = loaderFactory.fileSystemLoader("/path/to/docs");

// Combine multiple loaders
DocumentLoader combined = loaderFactory.compositeLoader(urlLoader, fileLoader);

// Use with custom pipeline
ragService.ingest(combined, "my-index");
```

## Future Enhancements

- [ ] Hybrid search support (keyword + semantic)
- [ ] Incremental indexing
- [ ] Document update detection
- [ ] Caching layer for retrieval
- [ ] More reranker implementations
- [ ] Batch API for large-scale operations
- [ ] Streaming ingestion API
- [ ] WebSocket support for real-time progress