cron-batch-strategy.md
1 --- 2 title: 'Cron Batch Strategy' 3 category: 'archive' 4 last_verified: '2026-02-15' 5 tags: ['cron', 'batch', 'strategy', 'scheduling', 'testing', 'api', 'ai', 'llm'] 6 status: 'archived' 7 --- 8 9 # Cron Job Batch Strategy 10 11 ## Problem: Long-Running Tasks Block Other Jobs 12 13 Cron jobs run **sequentially** (one at a time). If a task takes 30 minutes, nothing else runs during that time. 14 15 **Example of blocking:** 16 17 ``` 18 05:00 - Assets stage starts (takes 30 minutes) 19 05:05 - Email sync is due, but BLOCKED (assets still running) 20 05:10 - SMS sync is due, but BLOCKED 21 05:15 - Scoring is due, but BLOCKED 22 05:30 - Assets completes 23 05:35 - Email sync finally runs (30 minutes late!) 24 ``` 25 26 ## Solution: Time-Boxed Batches 27 28 Instead of processing 100 sites in one go, each stage processes a **small batch** and returns control within 1-2 minutes. 29 30 **Example with batches:** 31 32 ``` 33 05:00 - Assets stage: 5 sites (2 minutes) ✅ 34 05:02 - Returns control 35 05:05 - Email sync runs (on time!) ✅ 36 05:06 - Returns control 37 05:10 - Scoring: 10 sites (1 minute) ✅ 38 05:11 - Returns control 39 05:15 - Assets stage: next 5 sites ✅ 40 ``` 41 42 ## Current Batch Sizes 43 44 These are configured in the `cron_jobs` table: 45 46 | Stage | Batch Size | Why | 47 | -------------- | ---------- | ------------------------------------------ | 48 | **Assets** | 5 sites | Playwright is slow (~20-30s per site) | 49 | **Scoring** | 10 sites | Fast GPT-4o-mini calls (~5s per site) | 50 | **Rescoring** | 5 sites | Slower (below-fold screenshots + scoring) | 51 | **Enrichment** | 3 sites | Very slow (browse multiple pages per site) | 52 | **Proposals** | 10 sites | Fast AI generation | 53 54 ## How It Works 55 56 1. **Idempotent Stages**: Each stage queries `WHERE status = 'current_stage'` 57 2. **Processes Batch**: Uses `LIMIT N` to process small batch 58 3. **Updates Status**: Moves processed sites to next stage 59 4. **Returns Quickly**: Finishes within 1-2 minutes 60 5. **Next Run**: Picks up where it left off 61 62 **Example SQL from assets stage:** 63 64 ```sql 65 SELECT id FROM sites 66 WHERE status = 'found' 67 AND error_message IS NULL 68 LIMIT 5 -- Only process 5 sites 69 ``` 70 71 ## Adjusting Batch Sizes 72 73 ### Check Current Batch Size 74 75 ```bash 76 sqlite3 db/sites.db "SELECT task_key, handler_value FROM cron_jobs WHERE task_key LIKE 'run%';" 77 ``` 78 79 ### Update Batch Size 80 81 ```bash 82 sqlite3 db/sites.db "UPDATE cron_jobs SET handler_value = 'npm run assets -- --limit 10' WHERE task_key = 'runAssets';" 83 ``` 84 85 Or use the CLI: 86 87 ```bash 88 # Temporarily disable job 89 npm run cron:disable runAssets 90 91 # Update via SQL (no CLI command for editing yet) 92 sqlite3 db/sites.db "UPDATE cron_jobs SET handler_value = 'npm run assets -- --limit 10' WHERE task_key = 'runAssets';" 93 94 # Re-enable 95 npm run cron:enable runAssets 96 ``` 97 98 ## Batch Size Guidelines 99 100 ### Factors to Consider 101 102 1. **Stage Speed** 103 - Assets: ~20-30s per site (Playwright screenshots) 104 - Scoring: ~5s per site (GPT-4o-mini vision) 105 - Enrichment: ~60s per site (browse multiple pages) 106 - Proposals: ~3s per site (LLM generation) 107 108 2. **Cron Interval** 109 - Jobs run every **5 minutes** 110 - Each batch should complete in **1-2 minutes max** 111 - Leaves 3-4 minutes for other jobs 112 113 3. **API Rate Limits** 114 - ZenRows: 20 concurrent (configured in rate-limiter) 115 - OpenRouter: 5 concurrent 116 - Twilio: 10 concurrent 117 - Batch size should respect these limits 118 119 ### Recommended Ranges 120 121 | Stage | Min | Target | Max | Notes | 122 | ---------- | --- | ------ | --- | ---------------------------------- | 123 | Assets | 3 | 5 | 10 | Limited by Playwright speed | 124 | Scoring | 5 | 10 | 20 | Limited by OpenRouter concurrency | 125 | Rescoring | 3 | 5 | 10 | Slower than scoring | 126 | Enrichment | 1 | 3 | 5 | Very slow, browsing multiple pages | 127 | Proposals | 5 | 10 | 20 | Fast LLM generation | 128 | Outreach | 10 | 20 | 50 | Fast API calls | 129 130 ### When to Increase Batch Size 131 132 ✅ **Increase if:** 133 134 - Stage consistently finishes in <30 seconds 135 - Queue is backing up (100+ sites waiting) 136 - You have higher API rate limits 137 138 ### When to Decrease Batch Size 139 140 ❌ **Decrease if:** 141 142 - Stage takes >2 minutes to complete 143 - Other jobs are delayed (check logs) 144 - Getting API rate limit errors 145 - Memory usage is high 146 147 ## Monitoring Batch Performance 148 149 ### Check Execution Times 150 151 ```bash 152 # View recent logs for a stage 153 npm run cron:logs runAssets 154 155 # Check average duration 156 sqlite3 db/sites.db " 157 SELECT 158 job_name, 159 AVG((julianday(finished_at) - julianday(started_at)) * 24 * 60) as avg_minutes, 160 COUNT(*) as runs 161 FROM cron_job_logs 162 WHERE job_name LIKE '%Stage%' 163 AND started_at >= datetime('now', '-7 days') 164 AND status = 'success' 165 GROUP BY job_name 166 ORDER BY avg_minutes DESC; 167 " 168 ``` 169 170 ### Check Queue Depth 171 172 ```bash 173 sqlite3 db/sites.db " 174 SELECT 175 status, 176 COUNT(*) as count 177 FROM sites 178 GROUP BY status 179 ORDER BY count DESC; 180 " 181 ``` 182 183 ### Alert Thresholds 184 185 - ⚠️ **Warning**: Stage takes >90 seconds 186 - 🚨 **Critical**: Stage takes >2 minutes 187 - 📊 **Queue backing up**: >100 sites at same status 188 189 ## Alternative: SERP Stage Optimization 190 191 The SERP stage is different - it doesn't use batches because: 192 193 1. **Parallel Processing**: Uses `processBatch()` with concurrency 194 2. **Rate Limiter**: Bottleneck library queues excess requests 195 3. **Fast**: Completes 20 keywords in ~2 minutes 196 197 **Current config:** 198 199 ```javascript 200 ZENROWS_CONCURRENCY = 20; // Process 20 keywords in parallel 201 ``` 202 203 No batch size needed - it processes ALL pending keywords but limits concurrency. 204 205 ## Best Practices 206 207 1. **Start Conservative**: Begin with smaller batches, increase gradually 208 2. **Monitor First Week**: Check logs daily to tune batch sizes 209 3. **Leave Headroom**: Target 60-90 seconds, not 2+ minutes 210 4. **Test Changes**: Manually run stage with new limit before updating cron 211 5. **Document Changes**: Update this file when changing batch sizes 212 6. **Seasonal Adjustment**: Increase during low-traffic hours, decrease during peak 213 214 ## Testing Batch Sizes 215 216 ```bash 217 # Test assets stage with different limits 218 time npm run assets -- --limit 3 219 time npm run assets -- --limit 5 220 time npm run assets -- --limit 10 221 222 # Check how many sites are waiting 223 sqlite3 db/sites.db "SELECT COUNT(*) FROM sites WHERE status = 'found';" 224 225 # Calculate optimal batch size 226 # (Desired batch time in seconds) / (Avg time per site in seconds) = Batch size 227 # Example: 90s / 25s per site = 3.6 → use batch size of 3 228 ``` 229 230 ## Future Enhancements 231 232 - [ ] Auto-tuning: Adjust batch size based on average execution time 233 - [ ] Queue depth monitoring: Increase batch if queue backs up 234 - [ ] Time-of-day optimization: Larger batches during off-peak hours 235 - [ ] Per-stage timeout enforcement: Kill jobs that exceed 3 minutes 236 - [ ] Parallel job execution for independent stages (email sync + scoring) 237 238 ## Summary 239 240 ✅ **Do:** 241 242 - Use small batches (3-10 sites per run) 243 - Return control within 1-2 minutes 244 - Let cron orchestrate all jobs equally 245 - Monitor execution times weekly 246 247 ❌ **Don't:** 248 249 - Process 100+ sites in one run 250 - Block other jobs for >2 minutes 251 - Forget to adjust for API rate limits 252 - Set batch size without testing first 253 254 The goal is **steady progress** across all stages, not completing one stage as fast as possible.