# Data Storage Analysis & Projections

## Current Data Volumes (Your System)

Based on one day of active use (2026-01-13):

| Data Type | Size | Count | Avg Item Size |
|-----------|------|-------|---------------|
| Graph atoms | 2.4 MB | 1,629 | ~1.5 KB |
| Graph edges | 652 KB | 3,370 | ~200 bytes |
| Monologue transcripts | 784 KB | ~10 files | ~78 KB/file |
| Pipeline data | 160 KB | varies | varies |
| **Daily Total (text)** | **~4 MB** | | |

## Projection: One Heavy User

Assuming 8 hours/day of active work:
- ~2,000 atoms/day (conversations, sessions)
- ~4,000 edges/day
- ~100 KB transcripts/day

**Monthly (22 work days):**
- ~44,000 atoms (~66 MB at ~1.5 KB each)
- ~88,000 edges (~18 MB at ~200 bytes each)
- ~2.2 MB transcripts
- **Total: ~90 MB/month text**

**Yearly:**
- ~1.1 GB text data
- Very manageable
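
The monthly and yearly figures follow from the per-item sizes measured in the one-day sample. A minimal sketch of that arithmetic, assuming decimal units (1 KB = 1,000 bytes); the constant and function names are hypothetical and chosen only for illustration:

```python
# Per-item sizes taken from the one-day sample in this document.
ATOM_BYTES = 1_500   # ~1.5 KB per graph atom
EDGE_BYTES = 200     # ~200 bytes per graph edge


def monthly_text_bytes(atoms_per_day=2_000, edges_per_day=4_000,
                       transcript_bytes_per_day=100_000, work_days=22):
    """Estimate one heavy user's text storage for a month, in bytes."""
    per_day = (atoms_per_day * ATOM_BYTES
               + edges_per_day * EDGE_BYTES
               + transcript_bytes_per_day)
    return per_day * work_days


monthly = monthly_text_bytes()
yearly = monthly * 12
print(f"monthly: {monthly / 1e6:.1f} MB")
print(f"yearly:  {yearly / 1e9:.2f} GB")
```

With the defaults this gives about 86 MB/month and 1.03 GB/year, which the estimates above round up to ~90 MB and ~1.1 GB.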

## Adding Visual/Video Capture

This is where storage explodes:

### Photos
| Scenario | Size/Photo | Photos/Day | Daily Total |
|----------|------------|------------|-------------|
| Quick snaps | 2 MB | 10 | 20 MB |
| Documentation | 3 MB | 30 | 90 MB |
| Heavy capture | 3 MB | 100 | 300 MB |

**With only thumbnails stored** (~300 KB each):
- Heavy capture: ~30 MB/day

### Video
| Length | Quality | Raw Size | Processed Output |
|--------|---------|----------|------------------|
| 1 min | 1080p | ~100 MB | ~5 MB (keyframes + transcript) |
| 5 min | 1080p | ~500 MB | ~15 MB |
| 30 min | 1080p | ~3 GB | ~50 MB |

**Technician scenario:**
- 5 × 5-minute videos/day = ~2.5 GB raw
- Processed output: ~75 MB/day
- Keep raw for 30 days, then archive/delete
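
The 30-day raw retention rule implies a temporary footprint that is much larger than the processed output. A rough sketch of that calculation, assuming about 22 recording days fall inside any 30-day window (mirroring the 22 work days/month used earlier); the names are hypothetical:

```python
RAW_MB_PER_VIDEO = 500        # 5-minute 1080p clip, from the video table
PROCESSED_MB_PER_VIDEO = 15   # keyframes + transcript for the same clip


def technician_video_footprint(videos_per_day=5, recording_days_in_window=22):
    """Return (raw MB/day, processed MB/day, peak raw MB held during retention)."""
    raw_daily = videos_per_day * RAW_MB_PER_VIDEO
    processed_daily = videos_per_day * PROCESSED_MB_PER_VIDEO
    # Assumption: ~22 work days fall inside the 30-day raw-retention window.
    peak_raw = raw_daily * recording_days_in_window
    return raw_daily, processed_daily, peak_raw


raw, processed, peak = technician_video_footprint()
print(f"raw/day: {raw / 1000:.1f} GB, processed/day: {processed} MB, "
      f"peak raw held: {peak / 1000:.0f} GB")
```

Under these assumptions each technician needs roughly 55 GB of temporary space for raw video, separate from the ~75 MB/day of processed output that is kept.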

## Company-Wide Projections

### Scenario: 20 Technicians
Each technician:
- Text data: ~100 MB/month
- Photos: ~2 GB/month (thumbnails only)
- Video: ~1.5 GB/month (processed only)
- **Per tech: ~3.6 GB/month**

**Company total:**
- ~72 GB/month
- ~864 GB/year
- **Storage cost: ~$15/month** (cloud) or **~$50 one-time** (2TB drive)

### Scenario: 50 Technicians + 10 Office Staff

| Role | Count | Data/Month (each) | Total |
|------|-------|-------------------|-------|
| Technicians | 50 | 3.6 GB | 180 GB |
| Office (light) | 10 | 500 MB | 5 GB |
| **Monthly** | | | **185 GB** |
| **Yearly** | | | **~2.2 TB** |
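
Both company scenarios use the same arithmetic: headcount times per-person monthly volume, summed across roles. A small sketch, with a hypothetical `Role` type and the per-person figures taken from the scenarios above:

```python
from dataclasses import dataclass


@dataclass
class Role:
    name: str
    headcount: int
    gb_per_month: float  # per person


def company_totals(roles):
    """Return (GB added per month, GB added per year) across all roles."""
    monthly = sum(r.headcount * r.gb_per_month for r in roles)
    return monthly, monthly * 12


# Per-technician volume: text + photos + video, from the 20-technician scenario.
TECH_GB = 0.1 + 2.0 + 1.5

small = company_totals([Role("Technicians", 20, TECH_GB)])
large = company_totals([
    Role("Technicians", 50, TECH_GB),
    Role("Office (light)", 10, 0.5),
])
print(f"20 techs:           {small[0]:.0f} GB/month, {small[1]:.0f} GB/year")
print(f"50 techs + 10 office: {large[0]:.0f} GB/month, {large[1] / 1000:.2f} TB/year")
```

This reproduces the 72 GB/month and 864 GB/year totals for 20 technicians, and 185 GB/month (about 2.2 TB/year) for the larger scenario.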

**Storage costs (for ~2.2 TB stored):**
- AWS S3: ~$45/month
- Wasabi: ~$13/month
- Local NAS: ~$200 one-time for 4TB
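
Cloud bills grow as data accumulates, so the monthly figure depends on how much is stored at that point. The sketch below is illustrative only: the per-GB prices are assumed values that should be checked against current provider pricing, and it ignores request fees, egress, and any minimum-charge terms.

```python
# Assumed prices in USD per GB-month; illustrative, not quoted from any provider.
PRICE_PER_GB_MONTH = {
    "s3_standard": 0.023,
    "wasabi": 0.0059,
}


def monthly_bill(stored_gb, provider):
    """Storage-only bill for one month at the given stored volume."""
    return stored_gb * PRICE_PER_GB_MONTH[provider]


def first_year_bills(gb_added_per_month, provider):
    """Bill for each of the first 12 months as data accumulates linearly."""
    return [monthly_bill(gb_added_per_month * month, provider)
            for month in range(1, 13)]


s3_bills = first_year_bills(185, "s3_standard")
wasabi_bills = first_year_bills(185, "wasabi")
print(f"S3 month 1: ${s3_bills[0]:.2f}, month 12: ${s3_bills[-1]:.2f}")
print(f"Wasabi month 12: ${wasabi_bills[-1]:.2f}")
```

At the assumed prices, month 12 of the 50-technician scenario comes out near $51 on the first tier and $13 on the second, in the same range as the estimates above; the bill in the early months is far lower.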

## Storage Optimization Strategies

### 1. Tiered Retention
```
Hot (< 7 days):   Full resolution, instant access
Warm (7-30 days): Compressed, fast access
Cold (> 30 days): Archived, slow access
Delete (> 1 yr):  Auto-purge with summary retained
```
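
The tier boundaries above can be expressed as a simple age check. A minimal sketch, assuming "1 yr" means 365 days and that age is measured in whole days from a creation date; the function name is hypothetical:

```python
from datetime import date


def retention_tier(created: date, today: date) -> str:
    """Map an item's age to a retention tier using the boundaries above."""
    age_days = (today - created).days
    if age_days < 7:
        return "hot"
    if age_days <= 30:
        return "warm"
    if age_days <= 365:   # assumption: "1 yr" is treated as 365 days
        return "cold"
    return "delete"


today = date(2026, 1, 13)
for created in (date(2026, 1, 10), date(2025, 12, 25), date(2025, 6, 1), date(2024, 12, 1)):
    print(created.isoformat(), "->", retention_tier(created, today))
```

A scheduled job could run this over stored items and move or purge anything whose tier has changed since the last pass.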

### 2. Store Processed, Not Raw
| Keep | Discard After Processing |
|------|-------------------------|
| Transcripts | Raw audio |
| Keyframes + thumbnails | Full video |
| OCR text | High-res images |
| Graph atoms | Duplicate content |

### 3. Deduplication
- Similar photos: Keep best, link others
- Repeated phrases: Store once, reference many
- Common patterns: Template + diff
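
"Store once, reference many" can be sketched as a content-addressed store keyed by a hash of the payload. This hypothetical `ContentStore` only catches byte-identical duplicates; detecting similar photos would need a different technique (such as perceptual hashing), which is not shown here.

```python
import hashlib


class ContentStore:
    """Stores each distinct payload once and hands out its hash as a reference."""

    def __init__(self):
        self._blobs = {}      # sha256 hex digest -> payload bytes
        self._refcount = {}   # sha256 hex digest -> number of references

    def put(self, payload: bytes) -> str:
        key = hashlib.sha256(payload).hexdigest()
        if key not in self._blobs:
            self._blobs[key] = payload          # first time we see this content
        self._refcount[key] = self._refcount.get(key, 0) + 1
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self._blobs.values())


store = ContentStore()
refs = [store.put(p) for p in (b"check valve ok", b"check valve ok", b"replace filter")]
print(len(set(refs)), "unique payloads,", store.stored_bytes(), "bytes stored")
```

Three writes produce two stored payloads here, because the repeated phrase resolves to the same key.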

### 4. Compression Estimates
| Data Type | Raw | Compressed | Ratio |
|-----------|-----|------------|-------|
| JSON (atoms) | 2.4 MB | 400 KB | 6:1 |
| Images | 3 MB | 300 KB (thumb) | 10:1 |
| Video | 500 MB | 15 MB (processed) | ~33:1 |
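
The JSON ratio is easy to measure on real data before relying on the estimate. A small sketch using Python's standard `zlib` module; the atom records below are made-up placeholders, so the printed ratio says nothing about the actual graph:

```python
import json
import zlib


def compression_ratio(records):
    """Return raw_size / compressed_size for a JSON-serialized list of records."""
    raw = json.dumps(records).encode("utf-8")
    packed = zlib.compress(raw, level=9)
    return len(raw) / len(packed)


# Hypothetical atom records; real atoms will compress differently.
atoms = [
    {"id": i, "type": "conversation",
     "text": f"note {i} about pump maintenance", "tags": ["field", "pump"]}
    for i in range(1000)
]
ratio = compression_ratio(atoms)
print(f"compression ratio: {ratio:.1f}:1")
```

Running the same function over an export of the real atom file would show whether ~6:1 holds for this data.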

## Graph-Specific Considerations

### Why Graph Data Stays Small
- Text only (no blobs)
- Structural (edges are tiny)
- Incremental (diffs, not snapshots)

### Edge Explosion Warning
With N atoms, the worst case is O(N²) edges. Mitigations:
- Only store strong edges (> threshold)
- Prune weak edges monthly
- Use edge types to limit scope

Current ratio: ~2:1 edges:atoms (healthy). Warning threshold: 10:1 (needs pruning).
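
The ratio check and the pruning step can be sketched together. This assumes each edge carries a numeric strength, represented here as a hypothetical `(source, target, weight)` tuple; the document does not specify the actual edge schema.

```python
def edge_ratio(num_edges, num_atoms):
    """Edges per atom; the document treats ~2 as healthy and 10 as a warning."""
    return num_edges / num_atoms


def needs_pruning(num_edges, num_atoms, warn_ratio=10.0):
    return edge_ratio(num_edges, num_atoms) >= warn_ratio


def prune_weak_edges(edges, threshold):
    """Keep only edges whose weight is strictly above the threshold."""
    return [edge for edge in edges if edge[2] > threshold]


# Counts from the one-day sample in this document.
print(f"current ratio: {edge_ratio(3370, 1629):.2f}:1")
print("needs pruning:", needs_pruning(3370, 1629))

# Hypothetical weighted edges, for illustration only.
edges = [("a1", "a2", 0.9), ("a1", "a3", 0.2), ("a2", "a3", 0.55)]
print(prune_weak_edges(edges, threshold=0.5))
```

With the sample counts the ratio is about 2.07:1, well under the warning threshold.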

### Query Performance
Query latency by graph size (the current graph holds ~5,000 atoms and edges):
- 5,000 items: <10ms queries
- 50,000 items: <100ms queries
- 500,000 items: May need indexing

Solutions:
- SQLite FTS5 for text search
- In-memory for hot data
- PostgreSQL for scale
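
As an illustration of the first option, the sketch below builds an in-memory SQLite FTS5 index with Python's standard `sqlite3` module. It assumes the local SQLite build includes FTS5 (most do, but not all); the table name, column names, and sample atoms are hypothetical.

```python
import sqlite3


def build_index(atoms):
    """Create an in-memory FTS5 index over (atom_id, body) pairs."""
    conn = sqlite3.connect(":memory:")
    # atom_id is stored but not tokenized; only body is searchable.
    conn.execute("CREATE VIRTUAL TABLE atom_index USING fts5(atom_id UNINDEXED, body)")
    conn.executemany("INSERT INTO atom_index (atom_id, body) VALUES (?, ?)", atoms)
    conn.commit()
    return conn


def search(conn, query):
    """Return matching atom ids, ordered by FTS5's built-in rank."""
    rows = conn.execute(
        "SELECT atom_id FROM atom_index WHERE atom_index MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


conn = build_index([
    ("a1", "Replaced the pump seal on unit 4"),
    ("a2", "Customer asked about invoice dates"),
    ("a3", "Pump pressure low after restart"),
])
print(search(conn, "pump"))
```

The default tokenizer is case-insensitive, so the query matches both atoms that mention the pump and skips the unrelated one.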

## Cost Projections by Scale

| Scale | Storage/Year | Cloud/Month | One-Time HW |
|-------|--------------|-------------|-------------|
| 1 user (you) | 10 GB | $2 | $50 (SSD) |
| 10 users | 100 GB | $10 | $100 (1TB) |
| 50 users | 500 GB | $40 | $200 (2TB) |
| 200 users | 2 TB | $100 | $500 (4TB RAID) |
| 1000 users | 10 TB | $400 | $2000 (NAS) |

## Recommendation

**For Sovereign Estate (20-50 techs):**

1. **Start with hybrid:**
   - Local NAS for video/image cache (4TB, ~$400)
   - Cloud (Wasabi) for graph/text data (~$20/month)
   - Auto-archive to cold storage after 30 days

2. **Budget:**
   - Year 1: $500 hardware + $300 cloud = $800
   - Year 2+: ~$400/year cloud

3. **Scale triggers:**
   - At 100 users: Move to dedicated server
   - At 500 users: Consider managed database
   - At 1000 users: Distributed architecture

## Conclusion

**The graph itself is NOT a storage problem.**

Text-based semantic graphs scale beautifully:
- 1,000 users × 365 days × 2,000 atoms/day = 730M atoms/year
- At 1.5 KB/atom = ~1.1 TB
- Compressed (~6:1): ~200 GB
- Cost: ~$50/year in cloud storage
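
The same chain of estimates, written out so each step can be checked or re-run with different inputs. The per-GB price is an assumed illustrative value, not a quoted rate, and decimal units are assumed throughout:

```python
users, days, atoms_per_day = 1_000, 365, 2_000
atom_bytes = 1_500           # ~1.5 KB per atom
compression = 6              # ~6:1 for JSON atoms, from the compression table
price_per_gb_month = 0.023   # assumed USD price; verify before budgeting

atoms = users * days * atoms_per_day
raw_tb = atoms * atom_bytes / 1e12
compressed_gb = atoms * atom_bytes / compression / 1e9
yearly_cost = compressed_gb * price_per_gb_month * 12

print(f"{atoms / 1e6:.0f}M atoms, {raw_tb:.2f} TB raw, "
      f"{compressed_gb:.0f} GB compressed, ~${yearly_cost:.0f}/year")
```

This lands at 730M atoms, about 1.1 TB raw, roughly 180 GB compressed, and around $50/year to hold that volume at the assumed price.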

**Video/photos ARE the storage challenge.**

But with smart retention (keep processed, delete raw):
- 90% storage reduction
- Searchability preserved
- Cost manageable ($400/year at 50 users)

**This is absolutely tractable.** The cost is negligible compared to the value of captured institutional knowledge.

---

## Related

- [[compute-scaling]] - resonance: 28%