sigil.md
1 # SIGIL Protocol Specification 2 3 **Streaming Inference Grammar Interface Language** 4 5 SIGIL is the attestation layer for AI infrastructure - a verified wire format for streaming LLM tokens with semantic mode tracking. This document describes the protocol design, wire format, and reset-on-ambiguity semantics. 6 7 ## Design Philosophy 8 9 SIGIL enforces a strict discipline: **ambiguity must reset the connection**. When parsing upstream data (SSE, JSON, tool calls), any ambiguity triggers a full state reset rather than "best effort" parsing. 10 11 Why this matters: 12 1. **Trustworthy AI** - malformed input cannot corrupt reasoning chains 13 2. **Epistemic hygiene** - if parse is ambiguous, you don't know what you thought 14 3. **Future memory systems** - corrupted parse leads to corrupted memory leads to corrupted identity 15 16 ## Wire Format 17 18 ### Byte Encoding 19 20 SIGIL uses a compact byte encoding optimized for LLM token streams: 21 22 ``` 23 0xxxxxxx = hot token (ID in lower 7 bits, 0x00-0x7E) 24 10xxxxxx = extended token (varint follows) 25 1100xxxx = stream control (0xC0-0xCF) 26 ``` 27 28 ### Hot Tokens (0x00-0x7E) 29 30 The 127 most frequent tokens get single-byte encoding. These are typically common words, punctuation, and whitespace. The token ID is the byte value directly. 31 32 ### Extended Tokens (0x80-0xBF) 33 34 Less frequent tokens use a two-part encoding: 35 1. Marker byte (0x80-0xBF) - upper 6 bits are part of the token ID 36 2. LEB128 varint - remaining bits of the token ID 37 38 This allows token IDs up to 2^32 while keeping common tokens compact. 39 40 ### Control Opcodes (0xC0-0xCF) 41 42 | Opcode | Byte | Description | 43 |--------|------|-------------| 44 | CHUNK_END | 0xC0 | End of semantic chunk | 45 | TOOL_CALL_START | 0xC1 | Begin tool call block | 46 | TOOL_CALL_END | 0xC2 | End tool call block | 47 | THINK_START | 0xC3 | Begin thinking block | 48 | THINK_END | 0xC4 | End thinking block | 49 | CODE_BLOCK_START | 0xC5 | Begin code block | 50 | CODE_BLOCK_END | 0xC6 | End code block | 51 | FLUSH | 0xC7 | Flush (chunk incomplete) | 52 | Reserved | 0xC8-0xCE | Reserved for future use | 53 | STREAM_END | 0xCF | End of stream | 54 55 ## Parse Modes 56 57 The decoder tracks semantic context through parse modes: 58 59 | Mode | Description | 60 |------|-------------| 61 | `text` | Normal text output (default) | 62 | `think` | Thinking/reasoning block (may be hidden from user) | 63 | `toolCall` | Tool call JSON (must parse as valid JSON) | 64 | `codeBlock` | Code block content | 65 66 ### Mode Transitions 67 68 Valid transitions: 69 - `text` → any other mode (via START opcode) 70 - any mode → `text` (via matching END opcode) 71 72 Invalid transitions trigger reset: 73 - Starting a mode while not in `text` (nested modes) 74 - Ending a mode you're not in (unmatched end) 75 76 ## Reset-on-Ambiguity 77 78 ### StrictParseResult 79 80 Unlike binary ok/fail parsing, SIGIL uses a three-way result: 81 82 ```lean 83 inductive StrictParseResult (α : Type) where 84 | ok : α → Bytes → StrictParseResult α -- Unambiguous success 85 | incomplete : StrictParseResult α -- Need more bytes 86 | ambiguous : AmbiguityReason → StrictParseResult α -- Reset required 87 ``` 88 89 The key insight: `incomplete` is fine (wait for more data), but `ambiguous` must trigger a full state reset. 90 91 ### Ambiguity Reasons 92 93 ```lean 94 inductive AmbiguityReason where 95 | unmatchedModeEnd : ParseMode → AmbiguityReason -- End without matching start 96 | nestedModeStart : ParseMode → ParseMode → AmbiguityReason -- Mode in non-text 97 | reservedOpcode : UInt8 → AmbiguityReason -- Reserved opcode used 98 | varintOverflow : AmbiguityReason -- Token ID > 2^32 99 | jsonStructural : String → AmbiguityReason -- Bad JSON in tool call 100 | sseFraming : String → AmbiguityReason -- SSE framing error 101 | upstreamError : String → AmbiguityReason -- In-band error signal 102 ``` 103 104 ### The Ground State 105 106 The `initDecodeState` is the unique ground state: 107 108 ```lean 109 def initDecodeState : DecodeState := { 110 parseMode := .text 111 buffer := [] 112 leftover := ByteArray.empty 113 } 114 ``` 115 116 Reset always produces this exact state - no information leaks across ambiguity boundaries. 117 118 ## Verified Properties 119 120 Cornell proves 18 theorems about SIGIL with 0 sorry (unproven assumptions): 121 122 ### Core Reset Theorems 123 124 | Theorem | Property | 125 |---------|----------| 126 | `reset_is_ground` | Reset always produces `initDecodeState` | 127 | `ground_unique` | All resets produce the same state | 128 | `reset_idempotent` | `reset(reset(s)) = reset(s)` | 129 | `no_leakage` | Different states reset to identical states | 130 | `reset_erases_mode` | Mode is `text` after reset | 131 | `reset_erases_buffer` | Buffer is empty after reset | 132 | `reset_erases_leftover` | Leftover is empty after reset | 133 134 ### Mode Transition Theorems 135 136 | Theorem | Property | 137 |---------|----------| 138 | `nested_start_ambiguous` | Nested mode starts are detected | 139 | `mismatched_end_ambiguous` | Mismatched mode ends are detected | 140 141 ### handleControl Theorems 142 143 | Theorem | Property | 144 |---------|----------| 145 | `handleControl_reserved_resets` | Reserved opcodes reset to init | 146 | `handleControl_toolCallStart_think_resets` | toolCallStart in think mode resets | 147 | `handleControl_toolCallStart_toolCall_resets` | toolCallStart in toolCall mode resets | 148 | `handleControl_toolCallStart_codeBlock_resets` | toolCallStart in codeBlock mode resets | 149 | `handleControl_toolCallEnd_text_resets` | toolCallEnd in text mode resets | 150 | `handleControl_toolCallEnd_think_resets` | toolCallEnd in think mode resets | 151 | `handleControl_toolCallEnd_codeBlock_resets` | toolCallEnd in codeBlock mode resets | 152 | `handleControl_ambiguity_is_ground` | Ambiguity always produces ground state | 153 154 ## Code Generation 155 156 Cornell generates verified implementations: 157 158 ### C++ (evring/sigil.h) 159 160 ```cpp 161 #include "straylight/evring/sigil.h" 162 163 // Satisfies evring::machine concept 164 static_assert(evring::machine<evring::sigil::sigil_machine>); 165 166 // Use the machine 167 evring::sigil::sigil_machine decoder; 168 auto [new_state, ops] = decoder.step(state, byte_event); 169 ``` 170 171 ### Haskell (Evring/Sigil.hs) 172 173 ```haskell 174 import Evring.Sigil (SigilMachine, initSigilState, stepSigil) 175 176 -- Machine instance 177 instance Machine SigilMachine 178 179 -- Use the machine 180 let (newState, ops) = stepSigil state byteEvent 181 ``` 182 183 ## Example: Decoding a Stream 184 185 ``` 186 Input bytes: [0x48, 0x65, 0x6C, 0x6C, 0x6F, 0xC3, 0x01, 0x02, 0xC4, 0xC0] 187 "H" "e" "l" "l" "o" THINK tok tok /THINK CHUNK 188 189 Decode trace: 190 state: text, buffer: [] 191 0x48 → text, buffer: [72] 192 0x65 → text, buffer: [72, 101] 193 0x6C → text, buffer: [72, 101, 108] 194 0x6C → text, buffer: [72, 101, 108, 108] 195 0x6F → text, buffer: [72, 101, 108, 108, 111] 196 0xC3 → think, buffer: [], emit: Chunk{text: [72,101,108,108,111], complete: false} 197 0x01 → think, buffer: [1] 198 0x02 → think, buffer: [1, 2] 199 0xC4 → text, buffer: [], emit: Chunk{think: [1, 2], complete: true} 200 0xC0 → text, buffer: [], emit: Chunk{text: [], complete: true} 201 ``` 202 203 ## Example: Ambiguity Reset 204 205 ``` 206 Input bytes: [0x48, 0xC3, 0x01, 0xC1, ...] 207 "H" THINK tok TOOL_START ← nested mode! 208 209 Decode trace: 210 state: text, buffer: [] 211 0x48 → text, buffer: [72] 212 0xC3 → think, buffer: [], emit: Chunk{text: [72], complete: false} 213 0x01 → think, buffer: [1] 214 0xC1 → AMBIGUITY: nestedModeStart(think, toolCall) 215 Reset to initDecodeState 216 emit: Chunk{ambiguityReset: nestedModeStart(think, toolCall)} 217 ``` 218 219 ## Integration with Agentic Systems 220 221 SIGIL is designed for agentic LLM systems where: 222 223 1. **Tool calls must be trustworthy** - JSON in toolCall mode is parsed; structural errors reset 224 2. **Thinking may be private** - think blocks can be filtered before presenting to user 225 3. **Streaming is first-class** - incomplete parses wait for more data 226 4. **Ambiguity is fatal** - corrupted state cannot propagate 227 228 The reset-on-ambiguity discipline ensures that an agent can trust its own reasoning chain. If parsing ever becomes ambiguous, the agent returns to a known-good state rather than proceeding with potentially corrupted data.