# Session Persistence Bug - Root Cause Analysis

**Issue:** Sessions created in one `mix run` cannot be accessed in another `mix run`

**Evidence:**
- ✅ Session creation works: `session=ceo_1762842800_145484,turn=1,tokens=793`
- ❌ Continuation fails: `FAIL:continuation` with `:session_not_found`
- ✅ Works perfectly WITHIN the same process (debug_session.exs succeeded)

---

## Possible Root Causes & Fix Probabilities

### 1. **ETS Table Lifecycle Issue** 🔴 HIGH PROBABILITY

**Probability:** 85%

**Root Cause:**
- ETS tables live **in the memory of a single BEAM VM** and are owned by the process that created them
- Each `mix run` starts a **NEW BEAM VM instance**
- Session GenServer creates ETS table `:llm_sessions`
- When the `mix run` VM exits → **ETS table destroyed**
- Next `mix run` → New GenServer → **Fresh empty ETS table**

**Evidence:**
```elixir
# apps/echo_shared/lib/echo_shared/llm/session.ex:328
:ets.new(@table_name, [:named_table, :set, :public, read_concurrency: true])
# This creates an in-memory table - NO persistence to disk
```

**Fix Options:**

#### Option A: PostgreSQL Session Storage (RECOMMENDED)
**Probability of Success:** 70%
**Effort:** Medium (2-3 hours)
**Pros:**
- ✅ True persistence across restarts
- ✅ Leverages existing infrastructure
- ✅ Production-ready (ACID compliance)
- ✅ Aligns with ECHO architecture (everything in DB)
- ✅ Session history queryable/analyzable
- ✅ Supports multi-instance deployments

**Cons:**
- ⚠️ Slightly slower than ETS (~5-10ms per query vs <1ms)
- ⚠️ Requires migration

**Implementation:**
1. Create `llm_sessions` table in PostgreSQL
2. Add `Session` schema with Ecto
3. Replace ETS calls with Repo calls
4. Keep cleanup logic (cron job or TTL)

#### Option B: DETS (Disk-based ETS)
**Probability of Success:** 50%
**Effort:** Low (30 minutes)
**Pros:**
- ✅ Simple drop-in replacement for ETS
- ✅ Automatic disk persistence

**Cons:**
- ❌ Slower than ETS (10-50x)
- ❌ File corruption risk
- ❌ Not suitable for production
- ❌ Single-file bottleneck
- ❌ Needs manual file management

#### Option C: Mnesia
**Probability of Success:** 60%
**Effort:** High (4-6 hours)
**Pros:**
- ✅ Distributed database built into Erlang
- ✅ Fast like ETS with persistence

**Cons:**
- ❌ Overkill for this use case
- ❌ Complex setup and clustering
- ❌ Another database to manage

---

### 2. **GenServer Not Starting in Test Context**

**Probability:** 5%

**Root Cause:**
- Session GenServer might not be supervised in test environment
- Application supervision tree not starting

**Evidence AGAINST This:**
```
✅ Logs show: "LLM Session manager started"
✅ Verification passed: Application supervision check
```

**Fix:** Not needed - already working

---

### 3. **ETS Table Configuration Issue**

**Probability:** 8%

**Root Cause:**
- Table not `public` (can't access from other processes)
- Table not `named_table` (can't find by name)
- Race condition on table creation

**Evidence AGAINST This:**
```elixir
:ets.new(@table_name, [:named_table, :set, :public, read_concurrency: true])
# :named_table and :public are both set - correct configuration
```

**Fix:** Not needed - configuration is correct

---

### 4. **Session Cleanup Too Aggressive**

**Probability:** 2%

**Root Cause:**
- Cleanup cron job runs too frequently
- Session deleted before continuation attempt

**Evidence AGAINST This:**
```elixir
@session_timeout_ms :timer.hours(1)       # 1 hour
@cleanup_interval_ms :timer.minutes(15)   # Every 15 minutes
# Test runs in < 5 minutes - should not trigger cleanup
```

**Fix:** Not needed - cleanup is fine

---

## Recommended Fix: PostgreSQL Session Storage

**Confidence:** 70% (HIGH)
**Effort:** Medium
**Impact:** Solves persistence + enables new features

### Implementation Plan

#### Step 1: Create Migration
```elixir
# apps/echo_shared/priv/repo/migrations/XXXXXX_create_llm_sessions.exs

defmodule EchoShared.Repo.Migrations.CreateLlmSessions do
  use Ecto.Migration

  def change do
    create table(:llm_sessions, primary_key: false) do
      add :session_id, :string, primary_key: true
      add :agent_role, :string, null: false
      add :startup_context, :text
      add :conversation_history, :jsonb, default: "[]"
      add :turn_count, :integer, default: 0
      add :total_tokens, :integer, default: 0
      add :created_at, :utc_datetime
      add :last_query_at, :utc_datetime
    end

    create index(:llm_sessions, [:agent_role])
    # Used by the cleanup query that deletes stale sessions
    create index(:llm_sessions, [:last_query_at])
  end
end
```

#### Step 2: Create Schema
```elixir
# apps/echo_shared/lib/echo_shared/schemas/llm_session.ex

defmodule EchoShared.Schemas.LlmSession do
  use Ecto.Schema
  import Ecto.Changeset

  # session_id is generated by the session manager, not by the database
  @primary_key {:session_id, :string, autogenerate: false}
  schema "llm_sessions" do
    field :agent_role, :string
    field :startup_context, :string
    field :conversation_history, {:array, :map}, default: []
    field :turn_count, :integer, default: 0
    field :total_tokens, :integer, default: 0
    field :created_at, :utc_datetime
    field :last_query_at, :utc_datetime
  end

  def changeset(session, attrs) do
    session
    |> cast(attrs, [:session_id, :agent_role, :startup_context,
                    :conversation_history, :turn_count, :total_tokens,
                    :created_at, :last_query_at])
    |> validate_required([:session_id, :agent_role])
  end
end
```

#### Step 3: Update Session Module
```elixir
# Replace ETS calls with Repo calls
alias EchoShared.Repo
alias EchoShared.Schemas.LlmSession

# OLD:
:ets.insert(@table_name, {session_id, session})

# NEW:
# :ets.insert overwrites an existing key, so upsert to keep that behavior
# when the same session is saved again on later turns.
%LlmSession{}
|> LlmSession.changeset(session)
|> Repo.insert(
  on_conflict: {:replace_all_except, [:session_id, :created_at]},
  conflict_target: :session_id
)

# OLD:
case :ets.lookup(@table_name, session_id) do
  [{^session_id, session}] -> session
  [] -> nil
end

# NEW (returns the session struct, or nil when not found):
Repo.get(LlmSession, session_id)
```

#### Step 4: Update Cleanup Logic
```elixir
# Replace ETS scan with DB query
import Ecto.Query

# OLD:
:ets.tab2list(@table_name)
|> Enum.filter(fn {_id, session} ->
  DateTime.compare(session.last_query_at, cutoff) == :lt
end)

# NEW:
from(s in LlmSession,
  where: s.last_query_at < ^cutoff
)
|> Repo.delete_all()
```

### Benefits of This Fix

1. **✅ Solves the bug** - Sessions persist across restarts
2. **✅ Production-ready** - ACID compliance, backups included
3. **✅ Enables features:**
   - Session history analysis
   - Multi-instance deployments (shared sessions)
   - Session resume after app restart
   - Long-running sessions (days/weeks)
4. **✅ Minimal performance impact** - ~5ms extra per query (acceptable)
5. **✅ Consistent architecture** - Everything in PostgreSQL

---

## Alternative: Quick Fix (Not Recommended)

If you want sessions to work ONLY within a single process:

**Run agents in continuous mode:**
```bash
# Start agent as long-running process
cd apps/ceo && iex -S mix

# Now all session_consult calls work in this iex session
```

**Pros:**
- ✅ Zero code changes
- ✅ Works immediately

**Cons:**
- ❌ Doesn't solve the real problem
- ❌ Sessions lost on restart
- ❌ Not production-ready
- ❌ Testing is awkward

---

## Decision Matrix

| Solution | Probability | Effort | Production Ready | Recommended |
|----------|-------------|--------|------------------|-------------|
| PostgreSQL | 70% | Medium | ✅ Yes | ✅ **BEST** |
| DETS | 50% | Low | ❌ No | ❌ No |
| Mnesia | 60% | High | ⚠️ Complex | ❌ No |
| Continuous mode | 90% | None | ⚠️ Workaround | ⚠️ Temporary |

---

## Recommendation

**Implement PostgreSQL session storage** because:
1. Highest confidence for production (70%)
2. Solves persistence properly
3. Enables future features (session analysis, multi-instance)
4. Aligns with ECHO architecture
5. ~5ms performance impact is acceptable for session operations

**Estimated time:** 2-3 hours
**Risk:** Low (well-understood technology)
**Impact:** HIGH (fixes bug + adds production capabilities)
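
---

## Appendix: Reproducing the ETS Lifecycle

The lifecycle behind root cause 1 can be reproduced in isolation, without the ECHO codebase. The script below is a minimal sketch, not the project's code: the table name `:demo_sessions` and the stored map are hypothetical stand-ins for `:llm_sessions` and a real session. It shows that a named, public ETS table is readable while its owner process is alive and disappears once the owner exits. A `mix run` ending is the same effect at VM scale, since every owner process dies with the VM.

```elixir
# Assumption: :demo_sessions is an illustrative table name, not one used by ECHO.
parent = self()

# The owner process creates the table with the same options as session.ex.
owner =
  spawn(fn ->
    :ets.new(:demo_sessions, [:named_table, :set, :public, read_concurrency: true])
    :ets.insert(:demo_sessions, {"session_1", %{turn_count: 1}})
    send(parent, :table_ready)

    # Stay alive until told to stop, like a GenServer holding the table.
    receive do
      :stop -> :ok
    end
  end)

# Wait until the table exists before reading from it.
receive do
  :table_ready -> :ok
end

# While the owner is alive, other processes in the same VM can read the table.
found_while_alive = :ets.lookup(:demo_sessions, "session_1")

# Stop the owner and wait for confirmation that it has exited.
ref = Process.monitor(owner)
send(owner, :stop)

receive do
  {:DOWN, ^ref, :process, ^owner, _reason} -> :ok
end

# The table is destroyed together with its owner; nothing was written to disk.
table_after_exit = :ets.whereis(:demo_sessions)

IO.inspect(found_while_alive, label: "while owner alive")
IO.inspect(table_after_exit, label: "after owner exit")
```

Saved under any file name and run with `elixir`, the script needs OTP 21 or later for `:ets.whereis/1`. The second lookup is expected to report that the table no longer exists, which matches the `:session_not_found` result seen when a second `mix run` looks for a session created by the first.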