storage-analysis.md
# Data Storage Analysis & Projections

## Current Data Volumes (Your System)

Based on one day of active use (2026-01-13):

| Data Type | Size | Count | Avg Item Size |
|-----------|------|-------|---------------|
| Graph atoms | 2.4 MB | 1,629 | ~1.5 KB |
| Graph edges | 652 KB | 3,370 | ~200 bytes |
| Monologue transcripts | 784 KB | ~10 files | ~78 KB/file |
| Pipeline data | 160 KB | varies | varies |
| **Daily Total (text)** | **~4 MB** | | |

## Projection: One Heavy User

Assuming 8 hours/day active work:
- ~2,000 atoms/day (conversations, sessions)
- ~4,000 edges/day
- ~100 KB transcripts/day

**Monthly (22 work days):**
- ~44,000 atoms (~66 MB)
- ~88,000 edges (~18 MB)
- ~2.2 MB transcripts
- **Total: ~90 MB/month text**

**Yearly:**
- ~1.1 GB text data
- Very manageable

## Adding Visual/Video Capture

This is where storage explodes:

### Photos
| Scenario | Size/Photo | Photos/Day | Daily MB |
|----------|------------|------------|----------|
| Quick snaps | 2 MB | 10 | 20 MB |
| Documentation | 3 MB | 30 | 90 MB |
| Heavy capture | 3 MB | 100 | 300 MB |

**With thumbnails only stored** (300 KB):
- Heavy capture: ~30 MB/day

### Video
| Length | Quality | Size | Processed Output |
|--------|---------|------|------------------|
| 1 min | 1080p | ~100 MB | ~5 MB (keyframes + transcript) |
| 5 min | 1080p | ~500 MB | ~15 MB |
| 30 min | 1080p | ~3 GB | ~50 MB |

**Technician scenario:**
- 5 x 5-minute videos/day = 2.5 GB raw
- Processed output: ~75 MB/day
- Keep raw for 30 days, then archive/delete

## Company-Wide Projections

### Scenario: 20 Technicians
Each technician:
- Text data: ~100 MB/month
- Photos: ~2 GB/month (thumbnails only)
- Video: ~1.5 GB/month (processed only)
- **Per tech: ~3.6 GB/month**

**Company total:**
- ~72 GB/month
- ~864 GB/year
- **Storage cost: ~$15/month** (cloud) or **~$50 one-time** (2TB drive)

### Scenario: 50 Technicians + 10 Office Staff

| Role | Count | Data/Month | Total |
|------|-------|------------|-------|
| Technicians | 50 | 3.6 GB | 180 GB |
| Office (light) | 10 | 500 MB | 5 GB |
| **Monthly** | | | **185 GB** |
| **Yearly** | | | **2.2 TB** |

**Storage costs:**
- AWS S3: ~$45/month
- Wasabi: ~$13/month
- Local NAS: ~$200 one-time for 4TB

## Storage Optimization Strategies

### 1. Tiered Retention
```
Hot (< 7 days):     Full resolution, instant access
Warm (7-30 days):   Compressed, fast access
Cold (> 30 days):   Archived, slow access
Delete (> 1 yr):    Auto-purge with summary retained
```

### 2. Store Processed, Not Raw
| Keep | Discard After Processing |
|------|-------------------------|
| Transcripts | Raw audio |
| Keyframes + thumbnails | Full video |
| OCR text | High-res images |
| Graph atoms | Duplicate content |

### 3. Deduplication
- Similar photos: Keep best, link others
- Repeated phrases: Store once, reference many
- Common patterns: Template + diff

### 4. Compression Estimates
| Data Type | Raw | Compressed | Ratio |
|-----------|-----|------------|-------|
| JSON (atoms) | 2.4 MB | 400 KB | 6:1 |
| Images | 3 MB | 300 KB (thumb) | 10:1 |
| Video | 500 MB | 50 MB (processed) | 10:1 |

## Graph-Specific Considerations

### Why Graph Data Stays Small
- Text only (no blobs)
- Structural (edges are tiny)
- Incremental (diffs, not snapshots)

### Edge Explosion Warning
With N atoms, the worst case is O(N²) edges. Mitigations:
- Only store strong edges (> threshold)
- Prune weak edges monthly
- Use edge types to limit scope

Edge-to-atom ratios:
- Current ratio: 2:1 edges:atoms (healthy)
- Warning threshold: 10:1 (needs pruning)

### Query Performance
Estimated query latency by graph size (current scale is ~5,000 items):
- 5,000 items: <10ms queries
- 50,000 items: <100ms queries
- 500,000 items: May need indexing

Solutions:
- SQLite FTS5 for text search
- In-memory for hot data
- PostgreSQL for scale

## Cost Projections by Scale

| Scale | Storage/Year | Cloud/Month | One-Time HW |
|-------|--------------|-------------|-------------|
| 1 user (you) | 10 GB | $2 | $50 (SSD) |
| 10 users | 100 GB | $10 | $100 (1TB) |
| 50 users | 500 GB | $40 | $200 (2TB) |
| 200 users | 2 TB | $100 | $500 (4TB RAID) |
| 1000 users | 10 TB | $400 | $2000 (NAS) |

## Recommendation

**For Sovereign Estate (20-50 techs):**

1. **Start with hybrid:**
   - Local NAS for video/image cache (4TB, ~$400)
   - Cloud (Wasabi) for graph/text data (~$20/month)
   - Auto-archive to cold storage after 30 days

2. **Budget:**
   - Year 1: $500 hardware + $300 cloud = $800
   - Year 2+: ~$400/year cloud

3. **Scale triggers:**
   - At 100 users: Move to dedicated server
   - At 500 users: Consider managed database
   - At 1000 users: Distributed architecture

## Conclusion

**The graph itself is NOT a storage problem.**

Text-based semantic graphs scale beautifully:
- 1,000 users × 365 days × 2,000 atoms = 730M atoms
- At 1.5 KB/atom = 1.1 TB
- Compressed: ~200 GB
- Cost: ~$50/year in cloud storage

**Video/photos ARE the storage challenge.**

But with smart retention (keep processed, delete raw):
- 90% storage reduction
- Searchability preserved
- Cost manageable ($400/year at 50 users)

**This is absolutely tractable.** The cost is negligible compared to the value of captured institutional knowledge.

---

## Related

- [[compute-scaling]] - resonance: 28%
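
The arithmetic behind the text and company-wide projections above can be reproduced with a short script. This is a minimal sketch, not part of any existing tooling: the function names (`monthly_text_mb`, `company_monthly_gb`, `edge_ratio_status`) are hypothetical, and the constants are the rounded figures quoted in this document (1.5 KB/atom, 200 bytes/edge, 22 work days, 3.6 GB/month per technician), so results land slightly below the rounded totals in the prose.

```python
from typing import Mapping, Tuple

# Figures quoted in "Projection: One Heavy User" (rounded estimates).
WORK_DAYS_PER_MONTH = 22
ATOM_BYTES = 1_500                  # ~1.5 KB per atom
EDGE_BYTES = 200                    # ~200 bytes per edge
ATOMS_PER_DAY = 2_000
EDGES_PER_DAY = 4_000
TRANSCRIPT_BYTES_PER_DAY = 100_000  # ~100 KB/day


def monthly_text_mb(work_days: int = WORK_DAYS_PER_MONTH) -> float:
    """Text volume for one heavy user over a month, in MB (1 MB = 10^6 bytes)."""
    daily_bytes = (
        ATOMS_PER_DAY * ATOM_BYTES
        + EDGES_PER_DAY * EDGE_BYTES
        + TRANSCRIPT_BYTES_PER_DAY
    )
    return daily_bytes * work_days / 1e6


def company_monthly_gb(roles: Mapping[str, Tuple[int, float]]) -> float:
    """Sum headcount * GB-per-person-per-month across roles."""
    return sum(count * gb_each for count, gb_each in roles.values())


def edge_ratio_status(atoms: int, edges: int, warn_ratio: float = 10.0) -> str:
    """Classify the edges:atoms ratio against the 10:1 warning threshold."""
    return "needs pruning" if edges / atoms >= warn_ratio else "healthy"


if __name__ == "__main__":
    print(f"One heavy user: {monthly_text_mb():.1f} MB/month of text")
    staff = {"technicians": (50, 3.6), "office": (10, 0.5)}
    monthly = company_monthly_gb(staff)
    print(f"50 techs + 10 office: {monthly:.0f} GB/month, "
          f"{monthly * 12 / 1000:.1f} TB/year")
    print(f"Edge ratio today: {edge_ratio_status(atoms=1_629, edges=3_370)}")
```

Changing the headcounts or per-role volumes in `staff` gives a quick way to test other scenarios, such as the 20-technician case, before committing to a storage tier.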