20250127_backfill_strategy.md
# Vector Embeddings Backfill Strategy

**Target Execution:** After Phase 4 completion
**Estimated Duration:** 1-3 days (depending on data volume)
**Prerequisites:** Phases 3 & 4 complete, search functionality validated

## Overview

Currently, only new messages are embedded as they are indexed. All historical messages and notes in Deno KV need to be retroactively embedded to enable comprehensive semantic search across the entire message history.

## Scope Analysis

### Data Sources to Backfill
1. **Historical Messages**: All Discord messages already in Deno KV
2. **Existing Notes**: All user-created notes in the system
3. **Message Metadata**: Ensure all message fields are properly embedded

### Estimated Volumes
```typescript
// Analysis queries to run:
const analysisQueries = {
  totalMessages: `SELECT COUNT(*) FROM DiscordMessage`,
  messagesByGuild: `SELECT guildId, COUNT(*) FROM DiscordMessage GROUP BY guildId`,
  messagesByMonth: `SELECT DATE_TRUNC('month', timestamp), COUNT(*) FROM DiscordMessage GROUP BY 1`,
  totalNotes: `SELECT COUNT(*) FROM Note`,
  averageMessageLength: `SELECT AVG(LENGTH(content)) FROM DiscordMessage WHERE LENGTH(content) > 0`
};
```

### Processing Estimates
- **Embedding Rate**: ~500-1000 messages/hour (conservative; depends on LM Studio performance)
- **Batch Size**: 50-100 messages per batch (larger than real-time processing)
- **API Cooldown**: 500ms between batches (faster than real-time)
- **Error Rate**: Assume a 1-2% failure rate for retry handling

## Implementation Strategy

### Phase A: Analysis & Planning
```typescript
// utilities/backfill-analyzer.ts
export class BackfillAnalyzer {
  async analyzeBackfillNeeds(): Promise<BackfillAnalysis> {
    const session = await openWithEnv();

    // Count total messages in Deno KV
    const allMessages = await DiscordMessageManager.getAllMessages(session);

    // Count existing embeddings in LanceDB
    const vectorStore = getVectorStoreManager();
    const embeddedCount = await vectorStore.getStats();

    // Calculate missing embeddings. Note: Array.prototype.filter does not
    // await an async predicate, so the indexed checks must be resolved first.
    const indexedFlags = await Promise.all(
      allMessages.map(msg => vectorStore.isMessageIndexed(msg.id))
    );
    const missingMessages = allMessages.filter((_, i) => !indexedFlags[i]);

    return {
      totalMessages: allMessages.length,
      alreadyEmbedded: embeddedCount.messageCount,
      needsEmbedding: missingMessages.length,
      estimatedHours: Math.ceil(missingMessages.length / 750), // 750 msg/hour estimate
      estimatedBatches: Math.ceil(missingMessages.length / 100),
      storageEstimate: missingMessages.length * 384 * 4, // 384 dims * 4 bytes/float
    };
  }

  async generateBackfillPlan(): Promise<BackfillPlan> {
    const analysis = await this.analyzeBackfillNeeds();

    return {
      phases: [
        {
          name: "Recent Priority",
          description: "Last 30 days of messages",
          messageCount: await this.countMessagesSince(30),
          priority: 1,
          estimatedDuration: "2-4 hours"
        },
        {
          name: "Medium History",
          description: "30-90 days ago",
          messageCount: await this.countMessagesBetween(30, 90),
          priority: 2,
          estimatedDuration: "4-8 hours"
        },
        {
          name: "Full History",
          description: "90+ days ago",
          messageCount: await this.countMessagesOlderThan(90),
          priority: 3,
          estimatedDuration: "8-24 hours"
        }
      ],
      totalEstimate: analysis.estimatedHours,
      recommendedBatchSize: 100,
      recommendedCooldown: 500
    };
  }
}
```

### Phase B: Backfill Execution Engine
```typescript
// utilities/backfill-executor.ts
export class BackfillExecutor {
  private session: Session;
  private vectorStore: VectorStoreManager;
  private progressTracker: BackfillProgressTracker;

  async initialize() {
    this.session = await openWithEnv();
    this.vectorStore = getVectorStoreManager();
    this.progressTracker = new BackfillProgressTracker();

    await this.vectorStore.initialize();
  }

  async *
executeBackfill(options: BackfillOptions): AsyncGenerator<BackfillProgress> {
    const plan = await this.generateExecutionPlan(options);

    for (const phase of plan.phases) {
      yield* this.executePhase(phase);
    }
  }

  private async *executePhase(phase: BackfillPhase): AsyncGenerator<BackfillProgress> {
    console.log(`🚀 Starting backfill phase: ${phase.name}`);

    const messages = await this.getMessagesForPhase(phase);
    const totalBatches = Math.ceil(messages.length / phase.batchSize);

    for (let i = 0; i < messages.length; i += phase.batchSize) {
      const batch = messages.slice(i, i + phase.batchSize);
      const batchNumber = Math.floor(i / phase.batchSize) + 1;

      try {
        const startTime = Date.now();

        // Filter out already-embedded messages
        const needsEmbedding = await this.filterUnembedded(batch);

        if (needsEmbedding.length === 0) {
          console.log(`⏭️ Batch ${batchNumber}/${totalBatches}: all messages already embedded`);
          continue;
        }

        // Process the batch
        const result = await this.vectorStore.indexMessages(needsEmbedding);
        const duration = Date.now() - startTime;

        // Update progress
        const progress = await this.progressTracker.updateProgress({
          phase: phase.name,
          batchNumber,
          totalBatches,
          messagesProcessed: needsEmbedding.length,
          successCount: result.success,
          failureCount: result.failed,
          duration,
          estimatedTimeRemaining: this.calculateTimeRemaining(batchNumber, totalBatches, duration)
        });

        console.log(`✅ Batch ${batchNumber}/${totalBatches}: ${result.success} embedded, ${result.failed} failed (${duration}ms)`);

        yield progress;

        // Apply cooldown between batches
        if (i + phase.batchSize < messages.length) {
          await this.sleep(phase.cooldownMs);
        }

      } catch (error) {
        console.error(`❌ Batch ${batchNumber} failed:`, error);

        // Retry the failed batch one message at a time
        yield* this.handleFailedBatch(batch, batchNumber, totalBatches);
      }
    }

    console.log(`🎉 Phase ${phase.name} complete`);
  }

  private async *handleFailedBatch(
    batch: DiscordMessageData[],
    batchNumber: number,
    totalBatches: number
  ): AsyncGenerator<BackfillProgress> {
    console.log(`🔄 Attempting individual message processing for failed batch ${batchNumber}`);

    let successCount = 0;
    let failureCount = 0;

    for (const message of batch) {
      try {
        await this.vectorStore.indexMessage(message);
        successCount++;
      } catch (error) {
        console.error(`❌ Individual message ${message.id} failed:`, error);
        failureCount++;
      }
    }

    const progress = await this.progressTracker.updateProgress({
      phase: "Recovery",
      batchNumber,
      totalBatches,
      messagesProcessed: batch.length,
      successCount,
      failureCount,
      duration: 0,
      estimatedTimeRemaining: 0
    });

    console.log(`🔄 Recovery batch ${batchNumber}: ${successCount} recovered, ${failureCount} permanently failed`);
    yield progress;
  }
}
```

### Phase C: Progress Tracking & Monitoring
```typescript
// utilities/backfill-progress-tracker.ts
export class BackfillProgressTracker {
  private progressKey = "backfill:progress";

  async saveProgress(progress: BackfillProgress): Promise<void> {
    const session = await openWithEnv();

    // Store progress in Deno KV for persistence
    await session.transact([{
      assert: {
        the: "backfill/progress",
        of: entity(this.progressKey),
        is: JSON.stringify(progress)
      }
    }]);
  }

  async loadProgress(): Promise<BackfillProgress | null> {
    const session = await openWithEnv();

    const results = await session.select({
      the: "backfill/progress",
      of: entity(this.progressKey)
    });

    return results.length > 0 ?
      JSON.parse(results[0].is) : null;
  }

  async generateProgressReport(): Promise<BackfillReport | null> {
    const progress = await this.loadProgress();
    if (!progress) return null;

    return {
      summary: {
        totalMessages: progress.totalMessagesFound,
        processed: progress.totalProcessed,
        embedded: progress.totalEmbedded,
        failed: progress.totalFailed,
        percentComplete: (progress.totalProcessed / progress.totalMessagesFound) * 100
      },
      currentPhase: progress.currentPhase,
      estimatedTimeRemaining: progress.estimatedTimeRemaining,
      averageProcessingRate: progress.totalProcessed / progress.elapsedHours,
      errorRate: progress.totalFailed / progress.totalProcessed,
      phases: progress.phases.map(p => ({
        name: p.name,
        status: p.status,
        messagesProcessed: p.messagesProcessed,
        percentComplete: (p.messagesProcessed / p.totalMessages) * 100
      }))
    };
  }
}
```

## Execution Plan

### Pre-Execution Checklist
- [ ] Phases 3 & 4 complete and stable
- [ ] LM Studio running and stable
- [ ] Sufficient disk space for vector storage
- [ ] LanceDB performance optimized
- [ ] Backup of current vector store
- [ ] Monitoring and logging configured

### Execution Phases

#### Phase 1: Recent Messages (High Priority)
- **Scope**: Messages from the last 30 days
- **Rationale**: Most relevant for current conversations
- **Batch Size**: 100 messages
- **Cooldown**: 500ms
- **Expected Duration**: 2-4 hours

#### Phase 2: Medium History (Medium Priority)
- **Scope**: Messages 30-90 days old
- **Rationale**: Provides broader context
- **Batch Size**: 150 messages (larger batches for efficiency)
- **Cooldown**: 300ms
- **Expected Duration**: 4-8 hours

#### Phase 3: Full History (Lower Priority)
- **Scope**: Messages older than 90 days
- **Rationale**: Complete historical coverage
- **Batch Size**: 200 messages
  (maximum efficiency)
- **Cooldown**: 200ms
- **Expected Duration**: 8-24 hours

#### Phase 4: Notes Backfill
- **Scope**: All existing notes
- **Rationale**: Enable note-message cross-search
- **Batch Size**: 50 notes (typically longer content)
- **Cooldown**: 1000ms
- **Expected Duration**: 1-2 hours

### Monitoring & Health Checks

#### Real-time Monitoring
```typescript
const BackfillMonitor = {
  // Check LM Studio health
  async checkEmbeddingService(): Promise<boolean> {
    try {
      const testEmbedding = await embeddingService.embedText("test");
      return testEmbedding.length > 0;
    } catch {
      return false;
    }
  },

  // Check LanceDB health
  async checkVectorStore(): Promise<boolean> {
    try {
      const stats = await vectorStore.getStats();
      return stats.messageCount >= 0;
    } catch {
      return false;
    }
  },

  // Monitor system resources (the resource probes are assumed to be
  // mixed into this object elsewhere)
  async checkSystemHealth(): Promise<SystemHealth> {
    return {
      diskSpace: await this.getAvailableDiskSpace(),
      memoryUsage: await this.getMemoryUsage(),
      cpuLoad: await this.getCPULoad(),
      lmStudioResponsive: await this.checkEmbeddingService(),
      lancedbResponsive: await this.checkVectorStore()
    };
  }
};
```

#### Alert Conditions
- Embedding service response time > 5 seconds
- Batch failure rate > 5%
- Available disk space < 10GB
- Memory usage > 90%
- LM Studio becomes unresponsive

### Recovery & Resume Strategy

#### Checkpoint System
- Progress saved after each successful batch
- Failed batches logged with specific error details
- Resume capability from the last successful checkpoint
- Manual intervention points for troubleshooting

#### Error Handling
```typescript
// The recovery helpers (waitForServiceRecovery, pauseBackfill, etc.) are
// assumed to be provided by the executor. Arrow functions in an object
// literal do not bind `this`, so they are called as free functions here.
const RecoveryStrategies = {
  // LM Studio disconnection
  embeddingServiceDown: async () => {
    console.log("🔄 Embedding service down, waiting for recovery...");
    await waitForServiceRecovery();
    await resumeFromLastCheckpoint();
  },

  // Disk space issues
  diskSpaceLow: async () => {
    console.log("⚠️ Disk space low, pausing backfill");
    await pauseBackfill();
    await notifyAdministrator();
  },

  // Memory issues
  memoryPressure: async () => {
    console.log("🔄 Memory pressure detected, reducing batch size");
    await reduceBatchSize();
    await triggerGarbageCollection();
  }
};
```

## Post-Backfill Validation

### Data Integrity Checks
- Verify the embedding count matches the message count
- Spot-check search results for historical messages
- Validate that vector dimensions are consistent
- Check for orphaned or corrupted embeddings

### Performance Validation
- Benchmark search performance with the full dataset
- Validate that memory usage remains stable
- Confirm LanceDB index performance
- Test search across different time ranges

### User Acceptance Testing
- Test semantic search across historical messages
- Validate search relevance for older content
- Ensure no degradation in real-time embedding performance
- Confirm Discord commands work with the full dataset

## Success Criteria

### Quantitative Metrics
- **Coverage**: >95% of messages successfully embedded
- **Performance**: Search times remain <100ms with the full dataset
- **Reliability**: <1% failure rate during backfill execution
- **Completeness**: All historical timeframes represented in search results

### Qualitative Metrics
- Users can find relevant historical conversations
- Search quality is consistent across time periods
- No noticeable impact on system performance
- Confidence in semantic search capabilities

## Rollback Plan

### Emergency Rollback
If critical issues arise during backfill:
1. Stop backfill execution immediately
2. Restore the vector store from the pre-backfill backup
3. Revert to real-time embedding only
4. Investigate and resolve the issues
5. Plan a revised backfill approach

### Partial Rollback
If specific phases have issues:
1. Complete the current phase successfully
2. Skip problematic phases temporarily
3. Investigate the issues with the failed phase
4. Resume with a fixed approach

**The backfill operation will provide complete historical context for semantic search, enabling users to discover relevant conversations and patterns across the entire message history.**
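
## Appendix: Sizing Sanity Check

The processing estimates reduce to simple arithmetic (750 msg/hour midpoint of the 500-1000 range, 100-message batches, 384 float32 dimensions at 4 bytes each). This standalone helper is a hypothetical sketch mirroring `BackfillAnalyzer`'s math, not part of the codebase; it is handy for sanity-checking the analyzer's output before committing to a multi-day run:

```typescript
// Hypothetical sizing helper; defaults are the assumptions stated in
// "Processing Estimates" above.
interface BackfillEstimate {
  estimatedHours: number;
  estimatedBatches: number;
  storageBytes: number;
}

function estimateBackfill(
  missingCount: number,
  ratePerHour = 750, // midpoint of the 500-1000 msg/hour range
  batchSize = 100,
  dims = 384 // embedding dimensions; 4 bytes per float32
): BackfillEstimate {
  return {
    estimatedHours: Math.ceil(missingCount / ratePerHour),
    estimatedBatches: Math.ceil(missingCount / batchSize),
    storageBytes: missingCount * dims * 4,
  };
}

// 75,000 missing messages → 100 hours, 750 batches, 115,200,000 bytes
// (~115 MB) of raw vectors before index overhead.
console.log(estimateBackfill(75_000));
```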
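
Two helpers referenced but not defined in the sketches above — the executor's `calculateTimeRemaining` and the batch-size reduction behind the `memoryPressure` recovery strategy — could look like the following. Both are hypothetical sketches under the assumptions noted in the comments:

```typescript
// Naive remaining-time estimate: assumes every remaining batch takes as
// long as the batch just measured.
function calculateTimeRemaining(
  batchNumber: number,
  totalBatches: number,
  lastBatchMs: number
): number {
  return Math.max(0, totalBatches - batchNumber) * lastBatchMs;
}

// Memory-pressure response: halve the batch size, but never drop below a
// floor that keeps batched embedding calls worthwhile.
function reducedBatchSize(current: number, floor = 10): number {
  return Math.max(floor, Math.floor(current / 2));
}

// After batch 3 of 10 at 2000ms/batch, roughly 14,000ms remain.
console.log(calculateTimeRemaining(3, 10, 2000));
// Under memory pressure, a 100-message batch shrinks to 50.
console.log(reducedBatchSize(100));
```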