# Session Persistence Bug - Root Cause Analysis

**Issue:** Sessions created in one `mix run` invocation cannot be accessed from a later `mix run` invocation.

**Evidence:**
- ✅ Session creation works: `session=ceo_1762842800_145484,turn=1,tokens=793`
- ❌ Continuation fails: `FAIL:continuation` with `:session_not_found`
- ✅ Works perfectly WITHIN the same process (debug_session.exs succeeded)

---

## Possible Root Causes & Fix Probabilities

### 1. **ETS Table Lifecycle Issue** 🔴 HIGH PROBABILITY

**Probability:** 85%

**Root Cause:**
- ETS tables live **in the memory of one BEAM VM** and are owned by the process that created them; nothing is written to disk
- Each `mix run` starts a **NEW BEAM VM instance**
- Session GenServer creates ETS table `:llm_sessions`
- When the VM exits at the end of `mix run` (or the owning GenServer terminates) → **ETS table destroyed**
- Next `mix run` → New GenServer → **Fresh empty ETS table**

**Evidence:**
```elixir
# apps/echo_shared/lib/echo_shared/llm/session.ex:328
:ets.new(@table_name, [:named_table, :set, :public, read_concurrency: true])
# This creates an in-memory table owned by the calling process - NO persistence to disk
```
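
The lifecycle described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: the table name `:demo_sessions` and the session id `"ceo_demo_1"` are hypothetical, and a short-lived spawned process stands in for the Session GenServer. The table is readable from other processes while its owner is alive and disappears once the owner exits, which is what happens to every table when the VM behind a `mix run` shuts down.

```elixir
# Hypothetical table name, used only for this demonstration.
table = :demo_sessions
parent = self()

# The spawned process plays the role of the Session GenServer: it owns the table.
owner =
  spawn(fn ->
    :ets.new(table, [:named_table, :set, :public, read_concurrency: true])
    :ets.insert(table, {"ceo_demo_1", %{turn_count: 1}})
    send(parent, :ready)

    # Stay alive until told to stop.
    receive do
      :stop -> :ok
    end
  end)

# Wait until the table exists and has been populated.
receive do
  :ready -> :ok
end

# While the owner is alive, any process in the same VM can read the table.
[{"ceo_demo_1", %{turn_count: 1}}] = :ets.lookup(table, "ceo_demo_1")

# Stop the owner and wait for it to terminate.
ref = Process.monitor(owner)
send(owner, :stop)

receive do
  {:DOWN, ^ref, :process, ^owner, _reason} -> :ok
end

# The table was destroyed together with its owner.
IO.inspect(:ets.whereis(table), label: "table after owner exit")
```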

**Fix Options:**

#### Option A: PostgreSQL Session Storage (RECOMMENDED)
**Probability of Success:** 70%
**Effort:** Medium (2-3 hours)
**Pros:**
- ✅ True persistence across restarts
- ✅ Leverages existing infrastructure
- ✅ Production-ready (ACID compliance)
- ✅ Aligns with ECHO architecture (everything in DB)
- ✅ Session history queryable/analyzable
- ✅ Supports multi-instance deployments

**Cons:**
- ⚠️ Slower than ETS (estimated ~5-10ms per query vs <1ms)
- ⚠️ Requires migration

**Implementation:**
1. Create `llm_sessions` table in PostgreSQL
2. Add `LlmSession` schema with Ecto
3. Replace ETS calls with Repo calls
4. Keep cleanup logic (cron job or TTL)

#### Option B: DETS (Disk-based ETS)
**Probability of Success:** 50%
**Effort:** Low (30 minutes)
**Pros:**
- ✅ Near drop-in replacement for ETS (similar API)
- ✅ Automatic disk persistence

**Cons:**
- ❌ Slower than ETS (10-50x)
- ❌ File corruption risk
- ❌ Not suitable for production
- ❌ Single-file bottleneck (a DETS file is capped at 2 GB)
- ❌ Needs manual file management
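
To make the DETS trade-off concrete, here is a minimal sketch of the open / insert / close / reopen cycle. The table name `:llm_sessions_demo`, the temp-file path, and the session contents are assumptions chosen for illustration; the real Session module would need its own file location and shutdown handling. Closing and reopening the file stands in for two separate `mix run` invocations.

```elixir
# Hypothetical file location; a real deployment would pick a stable path.
path =
  Path.join(
    System.tmp_dir!(),
    "llm_sessions_demo_#{System.unique_integer([:positive])}.dets"
  )

# :dets expects the file name as a charlist.
opts = [file: String.to_charlist(path), type: :set]

# "First run": open the table, store a session, close the file.
{:ok, table} = :dets.open_file(:llm_sessions_demo, opts)
:ok = :dets.insert(table, {"ceo_demo_1", %{turn_count: 1, total_tokens: 793}})
:ok = :dets.close(table)

# "Second run": reopen the same file and read the session back.
{:ok, table} = :dets.open_file(:llm_sessions_demo, opts)
[{"ceo_demo_1", restored}] = :dets.lookup(table, "ceo_demo_1")
:ok = :dets.close(table)

IO.inspect(restored, label: "restored session")

# Manual file management: the demo removes its own file.
File.rm!(path)
```

The explicit `close` and `File.rm!` calls illustrate the "manual file management" cost listed above: a file that is not closed cleanly has to be repaired by DETS the next time it is opened.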

#### Option C: Mnesia
**Probability of Success:** 60%
**Effort:** High (4-6 hours)
**Pros:**
- ✅ Distributed database built into Erlang
- ✅ ETS-like read speed with optional disk persistence (`disc_copies` tables)

**Cons:**
- ❌ Overkill for this use case
- ❌ Complex setup and clustering
- ❌ Another database to manage

---

### 2. **GenServer Not Starting in Test Context**

**Probability:** 5%

**Root Cause:**
- Session GenServer might not be supervised in the test environment
- Application supervision tree might not be starting

**Evidence AGAINST This:**
```
✅ Logs show: "LLM Session manager started"
✅ Verification passed: Application supervision check
```

**Fix:** Not needed - already working
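
Should this hypothesis need re-checking in another environment, a supervision check can be as small as the sketch below. It is an illustration under assumptions: `DemoSessionManager` and `:demo_llm_sessions` are hypothetical stand-ins for the real Session GenServer and its table, whose registered name is not shown in this analysis.

```elixir
defmodule DemoSessionManager do
  # Hypothetical stand-in for the real Session GenServer.
  use GenServer

  def start_link(_opts) do
    GenServer.start_link(__MODULE__, :ok, name: __MODULE__)
  end

  @impl true
  def init(:ok) do
    # The GenServer owns the table, as in the real module.
    table =
      :ets.new(:demo_llm_sessions, [:named_table, :set, :public, read_concurrency: true])

    {:ok, table}
  end
end

# Start the GenServer under a supervisor, as the application tree would.
{:ok, _sup} = Supervisor.start_link([DemoSessionManager], strategy: :one_for_one)

# Check 1: the process is registered and alive.
pid = Process.whereis(DemoSessionManager)
IO.inspect(is_pid(pid) and Process.alive?(pid), label: "session manager running")

# Check 2: the named table exists in this VM.
IO.inspect(:ets.whereis(:demo_llm_sessions) != :undefined, label: "table present")
```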

---

### 3. **ETS Table Configuration Issue**

**Probability:** 8%

**Root Cause:**
- Table not `public` (can't be accessed from other processes)
- Table not `named_table` (can't be found by name)
- Race condition on table creation

**Evidence AGAINST This:**
```elixir
:ets.new(@table_name, [:named_table, :set, :public, read_concurrency: true])
#                      ^^^^^^^^^^^^        ^^^^^^^
# Correct configuration: the table is both named and public
```

**Fix:** Not needed - configuration is correct

---

### 4. **Session Cleanup Too Aggressive**

**Probability:** 2%

**Root Cause:**
- Cleanup job runs too frequently
- Session deleted before continuation attempt

**Evidence AGAINST This:**
```elixir
@session_timeout_ms :timer.hours(1)      # 1 hour
@cleanup_interval_ms :timer.minutes(15)  # Every 15 minutes
# Test runs in < 5 minutes - a session that young is far from the 1-hour timeout,
# so even a cleanup pass during the test would not delete it
```

**Fix:** Not needed - cleanup is fine
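
The timing argument above can be checked with a few lines of arithmetic. This is a sketch under assumptions: the module name `SessionExpiryDemo` and the function `expired?/2` are hypothetical, and the comparison mirrors the `last_query_at` versus `cutoff` check that appears in the cleanup code later in this document.

```elixir
defmodule SessionExpiryDemo do
  # Same timeout value as quoted above.
  @session_timeout_ms :timer.hours(1)

  # A session is expired when its last query is older than now minus the timeout.
  def expired?(last_query_at, now) do
    cutoff = DateTime.add(now, -@session_timeout_ms, :millisecond)
    DateTime.compare(last_query_at, cutoff) == :lt
  end
end

now = ~U[2025-01-01 12:00:00Z]
five_minutes_ago = DateTime.add(now, -5 * 60, :second)
two_hours_ago = DateTime.add(now, -2 * 60 * 60, :second)

# A session touched five minutes ago is well inside the one-hour window.
IO.inspect(SessionExpiryDemo.expired?(five_minutes_ago, now), label: "5 min old")

# A session idle for two hours would be removed by the next cleanup pass.
IO.inspect(SessionExpiryDemo.expired?(two_hours_ago, now), label: "2 h old")
```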

---

## Recommended Fix: PostgreSQL Session Storage

**Confidence:** 70% (HIGH)
**Effort:** Medium
**Impact:** Solves persistence + enables new features

### Implementation Plan

#### Step 1: Create Migration
```elixir
# apps/echo_shared/priv/repo/migrations/XXXXXX_create_llm_sessions.exs

defmodule EchoShared.Repo.Migrations.CreateLlmSessions do
  use Ecto.Migration

  def change do
    # session_id is the primary key, so the default :id column is disabled
    create table(:llm_sessions, primary_key: false) do
      add :session_id, :string, primary_key: true
      add :agent_role, :string, null: false
      add :startup_context, :text
      add :conversation_history, :jsonb, default: "[]"
      add :turn_count, :integer, default: 0
      add :total_tokens, :integer, default: 0
      add :created_at, :utc_datetime
      add :last_query_at, :utc_datetime
    end

    create index(:llm_sessions, [:agent_role])
    # Supports the cleanup query, which filters on last_query_at
    create index(:llm_sessions, [:last_query_at])
  end
end
```

#### Step 2: Create Schema
```elixir
# apps/echo_shared/lib/echo_shared/schemas/llm_session.ex

defmodule EchoShared.Schemas.LlmSession do
  use Ecto.Schema
  import Ecto.Changeset

  # Matches the migration: string primary key supplied by the caller
  @primary_key {:session_id, :string, autogenerate: false}
  schema "llm_sessions" do
    field :agent_role, :string
    field :startup_context, :string
    field :conversation_history, {:array, :map}, default: []
    field :turn_count, :integer, default: 0
    field :total_tokens, :integer, default: 0
    field :created_at, :utc_datetime
    field :last_query_at, :utc_datetime
  end

  def changeset(session, attrs) do
    session
    |> cast(attrs, [:session_id, :agent_role, :startup_context,
                    :conversation_history, :turn_count, :total_tokens,
                    :created_at, :last_query_at])
    |> validate_required([:session_id, :agent_role])
  end
end
```

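#### Step 3: Update Session Module
```elixir
# Replace ETS calls with Repo calls
alias EchoShared.Repo
alias EchoShared.Schemas.LlmSession

# OLD (on a :set table this overwrites any existing entry for session_id):
:ets.insert(@table_name, {session_id, session})

# NEW (upsert, so continuation turns update the existing row instead of
# failing on the primary key; `session` must be a plain map, because
# cast rejects structs - use Map.from_struct/1 if needed):
%LlmSession{}
|> LlmSession.changeset(session)
|> Repo.insert(
  on_conflict: {:replace_all_except, [:session_id, :created_at]},
  conflict_target: :session_id
)

# OLD:
case :ets.lookup(@table_name, session_id) do
  [{^session_id, session}] -> session
  [] -> nil
end

# NEW (returns the %LlmSession{} struct, or nil when no row matches):
Repo.get(LlmSession, session_id)
```
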
#### Step 4: Update Cleanup Logic
```elixir
# Replace ETS scan with DB query
import Ecto.Query, only: [from: 2]

# OLD (scans the whole table to collect expired sessions):
:ets.tab2list(@table_name)
|> Enum.filter(fn {_id, session} ->
  DateTime.compare(session.last_query_at, cutoff) == :lt
end)

# NEW (deletes expired rows in a single statement):
from(s in LlmSession,
  where: s.last_query_at < ^cutoff
)
|> Repo.delete_all()
```

### Benefits of This Fix

1. **✅ Solves the bug** - Sessions persist across restarts
2. **✅ Production-ready** - ACID compliance, backups included
3. **✅ Enables features:**
   - Session history analysis
   - Multi-instance deployments (shared sessions)
   - Session resume after app restart
   - Long-running sessions (days/weeks)
4. **✅ Minimal performance impact** - estimated ~5-10ms extra per query (acceptable)
5. **✅ Consistent architecture** - Everything in PostgreSQL

---

## Alternative: Quick Fix (Not Recommended)

If you want sessions to work ONLY within a single running VM:

**Run agents in continuous mode:**
```bash
# Start agent as long-running process
cd apps/ceo && iex -S mix

# Now all session_consult calls work in this iex session
```

**Pros:**
- ✅ Zero code changes
- ✅ Works immediately

**Cons:**
- ❌ Doesn't solve the real problem
- ❌ Sessions lost on restart
- ❌ Not production-ready
- ❌ Testing is awkward

---

## Decision Matrix

| Solution | Probability of Success | Effort | Production Ready | Recommended |
|----------|------------------------|--------|------------------|-------------|
| PostgreSQL | 70% | Medium | ✅ Yes | ✅ **BEST** |
| DETS | 50% | Low | ❌ No | ❌ No |
| Mnesia | 60% | High | ⚠️ Complex | ❌ No |
| Continuous mode | 90% | None | ⚠️ Workaround | ⚠️ Temporary |

---

## Recommendation

**Implement PostgreSQL session storage** because:
1. Highest confidence among the production-ready options (70%)
2. Solves persistence properly
3. Enables future features (session analysis, multi-instance)
4. Aligns with ECHO architecture
5. Estimated ~5-10ms per-query overhead is acceptable for session operations

**Estimated time:** 2-3 hours
**Risk:** Low (well-understood technology)
**Impact:** HIGH (fixes bug + adds production capabilities)