20250127_backfill_strategy.md
# Vector Embeddings Backfill Strategy

**Target Execution:** After Phase 4 completion
**Estimated Duration:** 1-3 days (depending on data volume)
**Prerequisites:** Phases 3 & 4 complete, search functionality validated

## Overview

Currently, only new messages are embedded as they are indexed. All historical messages and notes in Deno KV need to be retroactively embedded to enable comprehensive semantic search across the entire message history.

## Scope Analysis

### Data Sources to Backfill
1. **Historical Messages**: All Discord messages already in Deno KV
2. **Existing Notes**: All user-created notes in the system
3. **Message Metadata**: Ensure all message fields are properly embedded

### Estimated Volumes
```typescript
// Analysis queries to run:
const analysisQueries = {
  totalMessages: `SELECT COUNT(*) FROM DiscordMessage`,
  messagesByGuild: `SELECT guildId, COUNT(*) FROM DiscordMessage GROUP BY guildId`,
  messagesByMonth: `SELECT DATE_TRUNC('month', timestamp), COUNT(*) FROM DiscordMessage GROUP BY 1`,
  totalNotes: `SELECT COUNT(*) FROM Note`,
  averageMessageLength: `SELECT AVG(LENGTH(content)) FROM DiscordMessage WHERE LENGTH(content) > 0`
};
```

### Processing Estimates
- **Embedding Rate**: ~500-1000 messages/hour (conservative; depends on LM Studio performance)
- **Batch Size**: 50-100 messages per batch (larger than real-time processing)
- **API Cooldown**: 500ms between batches (faster than real-time)
- **Error Rate**: Assume a 1-2% failure rate for retry handling

## Implementation Strategy

### Phase A: Analysis & Planning
```typescript
// utilities/backfill-analyzer.ts
export class BackfillAnalyzer {
  async analyzeBackfillNeeds(): Promise<BackfillAnalysis> {
    const session = await openWithEnv();

    // Count total messages in Deno KV
    const allMessages = await DiscordMessageManager.getAllMessages(session);

    // Count existing embeddings in LanceDB
    const vectorStore = getVectorStoreManager();
    const embeddedCount = await vectorStore.getStats();

    // Calculate missing embeddings. Note: Array.prototype.filter does not
    // await an async predicate, so the indexed checks must be resolved first.
    const indexedFlags = await Promise.all(
      allMessages.map(msg => vectorStore.isMessageIndexed(msg.id))
    );
    const missingMessages = allMessages.filter((_, i) => !indexedFlags[i]);

    return {
      totalMessages: allMessages.length,
      alreadyEmbedded: embeddedCount.messageCount,
      needsEmbedding: missingMessages.length,
      estimatedHours: Math.ceil(missingMessages.length / 750), // 750 msg/hour estimate
      estimatedBatches: Math.ceil(missingMessages.length / 100),
      storageEstimate: missingMessages.length * 384 * 4, // 384 dims * 4 bytes/float
    };
  }

  async generateBackfillPlan(): Promise<BackfillPlan> {
    const analysis = await this.analyzeBackfillNeeds();

    return {
      phases: [
        {
          name: "Recent Priority",
          description: "Last 30 days of messages",
          messageCount: await this.countMessagesSince(30),
          priority: 1,
          estimatedDuration: "2-4 hours"
        },
        {
          name: "Medium History",
          description: "30-90 days ago",
          messageCount: await this.countMessagesBetween(30, 90),
          priority: 2,
          estimatedDuration: "4-8 hours"
        },
        {
          name: "Full History",
          description: "90+ days ago",
          messageCount: await this.countMessagesOlderThan(90),
          priority: 3,
          estimatedDuration: "8-24 hours"
        }
      ],
      totalEstimate: analysis.estimatedHours,
      recommendedBatchSize: 100,
      recommendedCooldown: 500
    };
  }
}
```

### Phase B: Backfill Execution Engine
```typescript
// utilities/backfill-executor.ts
export class BackfillExecutor {
  private session: Session;
  private vectorStore: VectorStoreManager;
  private progressTracker: BackfillProgressTracker;

  async initialize() {
    this.session = await openWithEnv();
    this.vectorStore = getVectorStoreManager();
    this.progressTracker = new BackfillProgressTracker();

    await this.vectorStore.initialize();
  }

  async *
executeBackfill(options: BackfillOptions): AsyncGenerator<BackfillProgress> {
    const plan = await this.generateExecutionPlan(options);

    for (const phase of plan.phases) {
      yield* this.executePhase(phase);
    }
  }

  private async *executePhase(phase: BackfillPhase): AsyncGenerator<BackfillProgress> {
    console.log(`🚀 Starting backfill phase: ${phase.name}`);

    const messages = await this.getMessagesForPhase(phase);
    const totalBatches = Math.ceil(messages.length / phase.batchSize);

    for (let i = 0; i < messages.length; i += phase.batchSize) {
      const batch = messages.slice(i, i + phase.batchSize);
      const batchNumber = Math.floor(i / phase.batchSize) + 1;

      try {
        const startTime = Date.now();

        // Filter out already-embedded messages
        const needsEmbedding = await this.filterUnembedded(batch);

        if (needsEmbedding.length === 0) {
          console.log(`⏭️ Batch ${batchNumber}/${totalBatches}: all messages already embedded`);
          continue;
        }

        // Process the batch
        const result = await this.vectorStore.indexMessages(needsEmbedding);
        const duration = Date.now() - startTime;

        // Update progress
        const progress = await this.progressTracker.updateProgress({
          phase: phase.name,
          batchNumber,
          totalBatches,
          messagesProcessed: needsEmbedding.length,
          successCount: result.success,
          failureCount: result.failed,
          duration,
          estimatedTimeRemaining: this.calculateTimeRemaining(batchNumber, totalBatches, duration)
        });

        console.log(`✅ Batch ${batchNumber}/${totalBatches}: ${result.success} embedded, ${result.failed} failed (${duration}ms)`);

        yield progress;

        // Apply cooldown between batches
        if (i + phase.batchSize < messages.length) {
          await this.sleep(phase.cooldownMs);
        }

      } catch (error) {
        console.error(`❌ Batch ${batchNumber} failed:`, error);

        // Retry the failed batch one message at a time
        yield* this.handleFailedBatch(batch, batchNumber, totalBatches);
      }
    }

    console.log(`🎉 Phase ${phase.name} complete`);
  }

  private async *handleFailedBatch(
    batch: DiscordMessageData[],
    batchNumber: number,
    totalBatches: number
  ): AsyncGenerator<BackfillProgress> {
    console.log(`🔄 Attempting individual message processing for failed batch ${batchNumber}`);

    let successCount = 0;
    let failureCount = 0;

    for (const message of batch) {
      try {
        await this.vectorStore.indexMessage(message);
        successCount++;
      } catch (error) {
        console.error(`❌ Individual message ${message.id} failed:`, error);
        failureCount++;
      }
    }

    const progress = await this.progressTracker.updateProgress({
      phase: "Recovery",
      batchNumber,
      totalBatches,
      messagesProcessed: batch.length,
      successCount,
      failureCount,
      duration: 0,
      estimatedTimeRemaining: 0
    });

    console.log(`🔄 Recovery batch ${batchNumber}: ${successCount} recovered, ${failureCount} permanently failed`);
    yield progress;
  }
}
```

### Phase C: Progress Tracking & Monitoring
```typescript
// utilities/backfill-progress-tracker.ts
export class BackfillProgressTracker {
  private progressKey = "backfill:progress";

  async saveProgress(progress: BackfillProgress): Promise<void> {
    const session = await openWithEnv();

    // Store progress in Deno KV for persistence
    await session.transact([{
      assert: {
        the: "backfill/progress",
        of: entity(this.progressKey),
        is: JSON.stringify(progress)
      }
    }]);
  }

  async loadProgress(): Promise<BackfillProgress | null> {
    const session = await openWithEnv();

    const results = await session.select({
      the: "backfill/progress",
      of: entity(this.progressKey)
    });

    return results.length > 0 ?
      JSON.parse(results[0].is) : null;
  }

  async generateProgressReport(): Promise<BackfillReport | null> {
    const progress = await this.loadProgress();
    if (!progress) return null;

    return {
      summary: {
        totalMessages: progress.totalMessagesFound,
        processed: progress.totalProcessed,
        embedded: progress.totalEmbedded,
        failed: progress.totalFailed,
        percentComplete: (progress.totalProcessed / progress.totalMessagesFound) * 100
      },
      currentPhase: progress.currentPhase,
      estimatedTimeRemaining: progress.estimatedTimeRemaining,
      averageProcessingRate: progress.totalProcessed / progress.elapsedHours,
      errorRate: progress.totalFailed / progress.totalProcessed,
      phases: progress.phases.map(p => ({
        name: p.name,
        status: p.status,
        messagesProcessed: p.messagesProcessed,
        percentComplete: (p.messagesProcessed / p.totalMessages) * 100
      }))
    };
  }
}
```

## Execution Plan

### Pre-Execution Checklist
- [ ] Phases 3 & 4 complete and stable
- [ ] LM Studio running and stable
- [ ] Sufficient disk space for vector storage
- [ ] LanceDB performance optimized
- [ ] Backup of current vector store
- [ ] Monitoring and logging configured

### Execution Phases

#### Phase 1: Recent Messages (High Priority)
- **Scope**: Messages from the last 30 days
- **Rationale**: Most relevant for current conversations
- **Batch Size**: 100 messages
- **Cooldown**: 500ms
- **Expected Duration**: 2-4 hours

#### Phase 2: Medium History (Medium Priority)
- **Scope**: Messages 30-90 days old
- **Rationale**: Provides broader context
- **Batch Size**: 150 messages (larger batches for efficiency)
- **Cooldown**: 300ms
- **Expected Duration**: 4-8 hours

#### Phase 3: Full History (Lower Priority)
- **Scope**: Messages older than 90 days
- **Rationale**: Complete historical coverage
- **Batch Size**: 200 messages
  (maximum efficiency)
- **Cooldown**: 200ms
- **Expected Duration**: 8-24 hours

#### Phase 4: Notes Backfill
- **Scope**: All existing notes
- **Rationale**: Enable note-message cross-search
- **Batch Size**: 50 notes (typically longer content)
- **Cooldown**: 1000ms
- **Expected Duration**: 1-2 hours

### Monitoring & Health Checks

#### Real-time Monitoring
```typescript
const BackfillMonitor = {
  // Check LM Studio health
  async checkEmbeddingService(): Promise<boolean> {
    try {
      const testEmbedding = await embeddingService.embedText("test");
      return testEmbedding.length > 0;
    } catch {
      return false;
    }
  },

  // Check LanceDB health
  async checkVectorStore(): Promise<boolean> {
    try {
      const stats = await vectorStore.getStats();
      return stats.messageCount >= 0;
    } catch {
      return false;
    }
  },

  // Monitor system resources (the resource probes are assumed to be
  // mixed into this object elsewhere)
  async checkSystemHealth(): Promise<SystemHealth> {
    return {
      diskSpace: await this.getAvailableDiskSpace(),
      memoryUsage: await this.getMemoryUsage(),
      cpuLoad: await this.getCPULoad(),
      lmStudioResponsive: await this.checkEmbeddingService(),
      lancedbResponsive: await this.checkVectorStore()
    };
  }
};
```

#### Alert Conditions
- Embedding service response time > 5 seconds
- Batch failure rate > 5%
- Available disk space < 10GB
- Memory usage > 90%
- LM Studio becomes unresponsive

### Recovery & Resume Strategy

#### Checkpoint System
- Progress saved after each successful batch
- Failed batches logged with specific error details
- Resume capability from the last successful checkpoint
- Manual intervention points for troubleshooting

#### Error Handling
```typescript
// The recovery helpers (waitForServiceRecovery, pauseBackfill, etc.) are
// assumed to be provided by the executor. Arrow functions in an object
// literal do not bind `this`, so they are called as free functions here.
const RecoveryStrategies = {
  // LM Studio disconnection
  embeddingServiceDown: async () => {
    console.log("🔄 Embedding service down, waiting for recovery...");
    await waitForServiceRecovery();
    await resumeFromLastCheckpoint();
  },

  // Disk space issues
  diskSpaceLow: async () => {
    console.log("⚠️ Disk space low, pausing backfill");
    await pauseBackfill();
    await notifyAdministrator();
  },

  // Memory issues
  memoryPressure: async () => {
    console.log("🔄 Memory pressure detected, reducing batch size");
    await reduceBatchSize();
    await triggerGarbageCollection();
  }
};
```

## Post-Backfill Validation

### Data Integrity Checks
- Verify the embedding count matches the message count
- Spot-check search results for historical messages
- Validate that vector dimensions are consistent
- Check for orphaned or corrupted embeddings

### Performance Validation
- Benchmark search performance with the full dataset
- Validate that memory usage remains stable
- Confirm LanceDB index performance
- Test search across different time ranges

### User Acceptance Testing
- Test semantic search across historical messages
- Validate search relevance for older content
- Ensure no degradation in real-time embedding performance
- Confirm Discord commands work with the full dataset

## Success Criteria

### Quantitative Metrics
- **Coverage**: >95% of messages successfully embedded
- **Performance**: Search times remain <100ms with the full dataset
- **Reliability**: <1% failure rate during backfill execution
- **Completeness**: All historical timeframes represented in search results

### Qualitative Metrics
- Users can find relevant historical conversations
- Search quality is consistent across time periods
- No noticeable impact on system performance
- Confidence in semantic search capabilities

## Rollback Plan

### Emergency Rollback
If critical issues arise during backfill:
1. Stop backfill execution immediately
2. Restore the vector store from the pre-backfill backup
3. Revert to real-time embedding only
4. Investigate and resolve the issues
5. Plan a revised backfill approach

### Partial Rollback
If specific phases have issues:
1. Complete the current phase successfully
2. Skip problematic phases temporarily
3. Investigate the issues with the failed phase
4. Resume with a fixed approach

**The backfill operation will provide complete historical context for semantic search, enabling users to discover relevant conversations and patterns across the entire message history.**
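
## Appendix: Sizing Sanity Check

The processing estimates reduce to simple arithmetic (750 msg/hour midpoint of the 500-1000 range, 100-message batches, 384 float32 dimensions at 4 bytes each). This standalone helper is a hypothetical sketch mirroring `BackfillAnalyzer`'s math, not part of the codebase; it is handy for sanity-checking the analyzer's output before committing to a multi-day run:

```typescript
// Hypothetical sizing helper; defaults are the assumptions stated in
// "Processing Estimates" above.
interface BackfillEstimate {
  estimatedHours: number;
  estimatedBatches: number;
  storageBytes: number;
}

function estimateBackfill(
  missingCount: number,
  ratePerHour = 750, // midpoint of the 500-1000 msg/hour range
  batchSize = 100,
  dims = 384 // embedding dimensions; 4 bytes per float32
): BackfillEstimate {
  return {
    estimatedHours: Math.ceil(missingCount / ratePerHour),
    estimatedBatches: Math.ceil(missingCount / batchSize),
    storageBytes: missingCount * dims * 4,
  };
}

// 75,000 missing messages → 100 hours, 750 batches, 115,200,000 bytes
// (~115 MB) of raw vectors before index overhead.
console.log(estimateBackfill(75_000));
```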
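
Two helpers referenced but not defined in the sketches above — the executor's `calculateTimeRemaining` and the batch-size reduction behind the `memoryPressure` recovery strategy — could look like the following. Both are hypothetical sketches under the assumptions noted in the comments:

```typescript
// Naive remaining-time estimate: assumes every remaining batch takes as
// long as the batch just measured.
function calculateTimeRemaining(
  batchNumber: number,
  totalBatches: number,
  lastBatchMs: number
): number {
  return Math.max(0, totalBatches - batchNumber) * lastBatchMs;
}

// Memory-pressure response: halve the batch size, but never drop below a
// floor that keeps batched embedding calls worthwhile.
function reducedBatchSize(current: number, floor = 10): number {
  return Math.max(floor, Math.floor(current / 2));
}

// After batch 3 of 10 at 2000ms/batch, roughly 14,000ms remain.
console.log(calculateTimeRemaining(3, 10, 2000));
// Under memory pressure, a 100-message batch shrinks to 50.
console.log(reducedBatchSize(100));
```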