# Context Compression and Caching

Hermes Agent uses a dual compression system and Anthropic prompt caching to
manage context window usage efficiently across long conversations.

Source files: `agent/context_engine.py` (ABC), `agent/context_compressor.py` (default engine),
`agent/prompt_caching.py`, `gateway/run.py` (session hygiene), `run_agent.py` (search for `_compress_context`)

## Pluggable Context Engine

Context management is built on the `ContextEngine` ABC (`agent/context_engine.py`). The built-in `ContextCompressor` is the default implementation, but plugins can replace it with alternative engines (e.g., Lossless Context Management).

```yaml
context:
  engine: "compressor"   # default — built-in lossy summarization
  # engine: "lcm"        # example — plugin providing lossless context
```

The engine is responsible for:
- Deciding when compaction should fire (`should_compress()`)
- Performing compaction (`compress()`)
- Optionally exposing tools the agent can call (e.g., `lcm_grep`)
- Tracking token usage from API responses

Selection is config-driven via `context.engine` in `config.yaml`. The resolution order:
1. Check the `plugins/context_engine/<name>/` directory
2. Check the general plugin system (`register_context_engine()`)
3. Fall back to the built-in `ContextCompressor`

Plugin engines are **never auto-activated** — the user must explicitly set `context.engine` to the plugin's name. The default `"compressor"` always uses the built-in engine.

Configure via `hermes plugins` → Provider Plugins → Context Engine, or edit `config.yaml` directly.

For building a context engine plugin, see [Context Engine Plugins](/docs/developer-guide/context-engine-plugin).

## Dual Compression System

Hermes has two separate compression layers that operate independently:

```
                   ┌──────────────────────────┐
Incoming message   │ Gateway Session Hygiene  │  Fires at 85% of context
─────────────────► │ (pre-agent, rough est.)  │  Safety net for large sessions
                   └─────────────┬────────────┘
                                 │
                                 ▼
                   ┌──────────────────────────┐
                   │ Agent ContextCompressor  │  Fires at 50% of context (default)
                   │ (in-loop, real tokens)   │  Normal context management
                   └──────────────────────────┘
```

### 1. Gateway Session Hygiene (85% threshold)

Located in `gateway/run.py` (search for `Session hygiene: auto-compress`). This is a **safety net** that
runs before the agent processes a message. It prevents API failures when sessions
grow too large between turns (e.g., overnight accumulation in Telegram/Discord).

- **Threshold**: Fixed at 85% of the model's context length
- **Token source**: Prefers actual API-reported tokens from the last turn; falls back
  to a rough character-based estimate (`estimate_messages_tokens_rough`)
- **Fires**: Only when `len(history) >= 4` and compression is enabled
- **Purpose**: Catch sessions that escaped the agent's own compressor

The gateway hygiene threshold is intentionally higher than the agent's compressor.
Setting it at 50% (same as the agent) caused premature compression on every turn
in long gateway sessions.

### 2. Agent ContextCompressor (50% threshold, configurable)

Located in `agent/context_compressor.py`. This is the **primary compression
system**. It runs inside the agent's tool loop with access to accurate,
API-reported token counts.
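The relationship between the two triggers can be sketched as follows. This is an illustrative sketch, not the actual code: the function names and signatures are hypothetical, while the 85%/50% thresholds and the `len(history) >= 4` guard come from the behavior described above.

```python
# Illustrative sketch of the two triggers (hypothetical names and
# signatures; the real checks live in gateway/run.py and
# agent/context_compressor.py).

GATEWAY_HYGIENE_THRESHOLD = 0.85  # fixed safety net, not configurable

def gateway_should_compress(history: list, est_tokens: int,
                            context_length: int, enabled: bool) -> bool:
    """Pre-agent safety net: rough token estimate, fixed 85% threshold."""
    if not enabled or len(history) < 4:
        return False
    return est_tokens >= GATEWAY_HYGIENE_THRESHOLD * context_length

def agent_should_compress(prompt_tokens: int, context_length: int,
                          threshold: float = 0.50) -> bool:
    """In-loop check: accurate API-reported tokens, configurable threshold."""
    return prompt_tokens >= threshold * context_length
```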
## Configuration

All compression settings are read from `config.yaml` under the `compression` key:

```yaml
compression:
  enabled: true          # Enable/disable compression (default: true)
  threshold: 0.50        # Fraction of context window (default: 0.50 = 50%)
  target_ratio: 0.20     # How much of threshold to keep as tail (default: 0.20)
  protect_last_n: 20     # Minimum protected tail messages (default: 20)

# Summarization model/provider configured under auxiliary:
auxiliary:
  compression:
    model: null          # Override model for summaries (default: auto-detect)
    provider: auto       # Provider: "auto", "openrouter", "nous", "main", etc.
    base_url: null       # Custom OpenAI-compatible endpoint
```

### Parameter Details

| Parameter | Default | Range | Description |
|-----------|---------|-------|-------------|
| `threshold` | `0.50` | 0.0-1.0 | Compression triggers when prompt tokens ≥ `threshold × context_length` |
| `target_ratio` | `0.20` | 0.10-0.80 | Controls the tail protection token budget: `threshold_tokens × target_ratio` |
| `protect_last_n` | `20` | ≥1 | Minimum number of recent messages always preserved |
| `protect_first_n` | `3` | (hardcoded) | System prompt + first exchange always preserved |

### Computed Values (for a 200K context model at defaults)

```
context_length     = 200,000
threshold_tokens   = 200,000 × 0.50 = 100,000
tail_token_budget  = 100,000 × 0.20 = 20,000
max_summary_tokens = min(200,000 × 0.05, 12,000) = 10,000
```

## Compression Algorithm

The `ContextCompressor.compress()` method follows a 4-phase algorithm:

### Phase 1: Prune Old Tool Results (cheap, no LLM call)

Old tool results (>200 chars) outside the protected tail are replaced with:
```
[Old tool output cleared to save context space]
```

This cheap pre-pass saves significant tokens from verbose tool
outputs (file contents, terminal output, search results).

### Phase 2: Determine Boundaries

```
┌─────────────────────────────────────────────────────────────┐
│ Message list                                                │
│                                                             │
│ [0..2]    ← protect_first_n (system + first exchange)       │
│ [3..N-1]  ← middle turns → SUMMARIZED                       │
│ [N..end]  ← tail (by token budget OR protect_last_n)        │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

Tail protection is **token-budget based**: the compressor walks backward from
the end, accumulating tokens until the budget is exhausted. It falls back to
the fixed `protect_last_n` count if the budget would protect fewer messages.

Boundaries are aligned to avoid splitting tool_call/tool_result groups.
The `_align_boundary_backward()` method walks past consecutive tool results
to find the parent assistant message, keeping groups intact.
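The alignment step is easiest to see in code. Below is a minimal sketch of the backward walk, not the actual `_align_boundary_backward()` implementation; it assumes OpenAI-style message dicts where tool results have `role: "tool"`.

```python
def align_boundary_backward(messages: list[dict], boundary: int) -> int:
    """Move a proposed tail-start index backward so the cut never splits
    a tool_call/tool_result group (illustrative sketch).

    If the tail would start on a tool result, walk backward past the whole
    run of consecutive results to the parent assistant message, so the
    call and its results stay together on the tail side of the cut.
    """
    while boundary > 0 and messages[boundary].get("role") == "tool":
        boundary -= 1
    return boundary
```

For example, if the token budget proposes a cut at a tool result, the walk lands on the assistant message that issued the call, and the whole group is protected in the tail.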
### Phase 3: Generate Structured Summary

:::warning Summary model context length
The summary model must have a context window **at least as large** as the main agent model's. The entire middle section is sent to the summary model in a single `call_llm(task="compression")` call. If the summary model's context is smaller, the API returns a context-length error — `_generate_summary()` catches it, logs a warning, and returns `None`. The compressor then drops the middle turns **without a summary**, silently losing conversation context. This is the most common cause of degraded compaction quality.
:::

The middle turns are summarized using the auxiliary LLM with a structured
template:

```
## Goal
[What the user is trying to accomplish]

## Constraints & Preferences
[User preferences, coding style, constraints, important decisions]

## Progress
### Done
[Completed work — specific file paths, commands run, results]
### In Progress
[Work currently underway]
### Blocked
[Any blockers or issues encountered]

## Key Decisions
[Important technical decisions and why]

## Relevant Files
[Files read, modified, or created — with brief note on each]

## Next Steps
[What needs to happen next]

## Critical Context
[Specific values, error messages, configuration details]
```

Summary budget scales with the amount of content being compressed:
- Formula: `content_tokens × 0.20` (the `_SUMMARY_RATIO` constant)
- Minimum: 2,000 tokens
- Maximum: `min(context_length × 0.05, 12,000)` tokens

### Phase 4: Assemble Compressed Messages

The compressed message list is:
1. Head messages (with a note appended to the system prompt on first compression)
2. Summary message (role chosen to avoid consecutive same-role violations)
3. Tail messages (unmodified)

Orphaned tool_call/tool_result pairs are cleaned up by `_sanitize_tool_pairs()`:
- Tool results referencing removed calls → removed
- Tool calls whose results were removed → stub result injected

### Iterative Re-compression

On subsequent compressions, the previous summary is passed to the LLM with
instructions to **update** it rather than summarize from scratch. This preserves
information across multiple compactions — items move from "In Progress" to "Done",
new progress is added, and obsolete information is removed.

The `_previous_summary` field on the compressor instance stores the last summary
text for this purpose.
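To make the update flow concrete, here is a minimal sketch of how a re-compression request might be assembled. The helper and the prompt wording are illustrative assumptions; only the `call_llm(task="compression")` entry point, the structured template, and the `_previous_summary` mechanism come from the source.

```python
def build_summary_prompt(middle_turns: str, previous_summary: str | None) -> str:
    """Assemble the summarization prompt (illustrative sketch)."""
    if previous_summary:
        # Iterative path: ask the model to revise the existing summary so
        # items migrate between sections (e.g., "In Progress" -> "Done").
        return (
            "Update the existing summary with the new conversation turns. "
            "Keep the same section structure, move finished items to Done, "
            "and drop obsolete details.\n\n"
            f"EXISTING SUMMARY:\n{previous_summary}\n\n"
            f"NEW TURNS:\n{middle_turns}"
        )
    # First compression: summarize from scratch into the structured template.
    return f"Summarize the conversation below using the template:\n\n{middle_turns}"
```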
## Before/After Example

### Before Compression (45 messages, ~95K tokens)

```
[0] system: "You are a helpful assistant..." (system prompt)
[1] user: "Help me set up a FastAPI project"
[2] assistant: <tool_call> terminal: mkdir project </tool_call>
[3] tool: "directory created"
[4] assistant: <tool_call> write_file: main.py </tool_call>
[5] tool: "file written (2.3KB)"
... 30 more turns of file editing, testing, debugging ...
[38] assistant: <tool_call> terminal: pytest </tool_call>
[39] tool: "8 passed, 2 failed\n..." (5KB output)
[40] user: "Fix the failing tests"
[41] assistant: <tool_call> read_file: tests/test_api.py </tool_call>
[42] tool: "import pytest\n..." (3KB)
[43] assistant: "I see the issue with the test fixtures..."
[44] user: "Great, also add error handling"
```

### After Compression (25 messages, ~45K tokens)

```
[0] system: "You are a helpful assistant...
    [Note: Some earlier conversation turns have been compacted...]"
[1] user: "Help me set up a FastAPI project"
[2] assistant: "[CONTEXT COMPACTION] Earlier turns were compacted...

    ## Goal
    Set up a FastAPI project with tests and error handling

    ## Progress
    ### Done
    - Created project structure: main.py, tests/, requirements.txt
    - Implemented 5 API endpoints in main.py
    - Wrote 10 test cases in tests/test_api.py
    - 8/10 tests passing

    ### In Progress
    - Fixing 2 failing tests (test_create_user, test_delete_user)

    ## Relevant Files
    - main.py — FastAPI app with 5 endpoints
    - tests/test_api.py — 10 test cases
    - requirements.txt — fastapi, pytest, httpx

    ## Next Steps
    - Fix failing test fixtures
    - Add error handling"
[3] user: "Fix the failing tests"
[4] assistant: <tool_call> read_file: tests/test_api.py </tool_call>
[5] tool: "import pytest\n..."
[6] assistant: "I see the issue with the test fixtures..."
[7] user: "Great, also add error handling"
```

## Prompt Caching (Anthropic)

Source: `agent/prompt_caching.py`

Prompt caching reduces input token costs by ~75% on multi-turn conversations by
caching the conversation prefix. It uses Anthropic's `cache_control` breakpoints.

### Strategy: system_and_3

Anthropic allows a maximum of 4 `cache_control` breakpoints per request. Hermes
uses the "system_and_3" strategy:

```
Breakpoint 1: System prompt (stable across all turns)
Breakpoint 2: 3rd-to-last non-system message ─┐
Breakpoint 3: 2nd-to-last non-system message  ├─ Rolling window
Breakpoint 4: Last non-system message        ─┘
```

### How It Works

`apply_anthropic_cache_control()` deep-copies the messages and injects
`cache_control` markers:

```python
# Cache marker format
marker = {"type": "ephemeral"}
# Or for a 1-hour TTL:
marker = {"type": "ephemeral", "ttl": "1h"}
```

The marker is applied differently based on content type:

| Content Type | Where Marker Goes |
|-------------|-------------------|
| String content | Converted to `[{"type": "text", "text": ..., "cache_control": ...}]` |
| List content | Added to the last element's dict |
| None/empty | Added as `msg["cache_control"]` |
| Tool messages | Added as `msg["cache_control"]` (native Anthropic only) |

### Cache-Aware Design Patterns

1. **Stable system prompt**: The system prompt is breakpoint 1 and cached across
   all turns. Avoid mutating it mid-conversation (compression appends a note
   only on the first compaction).

2. **Message ordering matters**: Cache hits require prefix matching. Adding or
   removing messages in the middle invalidates the cache for everything after.

3. **Compression cache interaction**: After compression, the cache is invalidated
   for the compressed region, but the system prompt cache survives. The rolling
   3-message window re-establishes caching within 1-2 turns.

4. **TTL selection**: The default is `5m` (5 minutes). Use `1h` for long-running
   sessions where the user takes breaks between turns.
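Putting the strategy and the content-type rules together, a condensed sketch of the marker injection might look like this. The function name is hypothetical, and the real `apply_anthropic_cache_control()` handles more edge cases (tool messages on native Anthropic, malformed content), but the breakpoint placement follows the system_and_3 strategy described above.

```python
import copy

def apply_cache_breakpoints(messages: list[dict], ttl: str = "5m") -> list[dict]:
    """Sketch of the system_and_3 strategy: mark the system prompt plus
    the last three non-system messages (4 breakpoints total)."""
    # Default 5m markers omit the ttl field; 1h markers carry it explicitly.
    marker = {"type": "ephemeral"} if ttl == "5m" else {"type": "ephemeral", "ttl": "1h"}
    msgs = copy.deepcopy(messages)

    def mark(msg: dict) -> None:
        content = msg.get("content")
        if isinstance(content, str):
            # String content becomes a one-block list so the marker has
            # a dict to attach to.
            msg["content"] = [{"type": "text", "text": content, "cache_control": marker}]
        elif isinstance(content, list) and content:
            content[-1]["cache_control"] = marker  # assumes dict elements
        else:
            msg["cache_control"] = marker  # None/empty content

    for msg in msgs:
        if msg.get("role") == "system":
            mark(msg)  # breakpoint 1: the stable prefix
            break
    non_system = [m for m in msgs if m.get("role") != "system"]
    for msg in non_system[-3:]:
        mark(msg)  # breakpoints 2-4: rolling window over recent turns
    return msgs
```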
### Enabling Prompt Caching

Prompt caching is automatically enabled when:
- The model is an Anthropic Claude model (detected by model name)
- The provider supports `cache_control` (native Anthropic API or OpenRouter)

```yaml
# config.yaml — TTL is configurable (must be "5m" or "1h")
prompt_caching:
  cache_ttl: "5m"
```

The CLI shows caching status at startup:
```
💾 Prompt caching: ENABLED (Claude via OpenRouter, 5m TTL)
```

## Context Pressure Warnings

Intermediate context-pressure warnings have been removed (see the iteration-budget block in `run_agent.py`, which notes: "No intermediate pressure warnings — they caused models to 'give up' prematurely on complex tasks"). Compression fires when prompt tokens reach the configured `compression.threshold` (default 50%) with no prior warning step; gateway session hygiene fires as the secondary safety net at 85% of the model's context window.