Cradicle Explorer

/ docs / sigil.md
sigil.md
  1  # SIGIL Protocol Specification
  2  
  3  **Streaming Inference Grammar Interface Language**
  4  
  5  SIGIL is the attestation layer for AI infrastructure - a verified wire format for streaming LLM tokens with semantic mode tracking. This document describes the protocol design, wire format, and reset-on-ambiguity semantics.
  6  
  7  ## Design Philosophy
  8  
  9  SIGIL enforces a strict discipline: **ambiguity must reset the connection**. When parsing upstream data (SSE, JSON, tool calls), any ambiguity triggers a full state reset rather than "best effort" parsing.
 10  
 11  Why this matters:
 12  1. **Trustworthy AI** - malformed input cannot corrupt reasoning chains
 13  2. **Epistemic hygiene** - if parse is ambiguous, you don't know what you thought
 14  3. **Future memory systems** - corrupted parse leads to corrupted memory leads to corrupted identity
 15  
 16  ## Wire Format
 17  
 18  ### Byte Encoding
 19  
 20  SIGIL uses a compact byte encoding optimized for LLM token streams:
 21  
 22  ```
 23  0xxxxxxx  = hot token (ID in lower 7 bits, 0x00-0x7E)
 24  10xxxxxx  = extended token (varint follows)
 25  1100xxxx  = stream control (0xC0-0xCF)
 26  ```
 27  
 28  ### Hot Tokens (0x00-0x7E)
 29  
 30  The 127 most frequent tokens get single-byte encoding. These are typically common words, punctuation, and whitespace. The token ID is the byte value directly.
 31  
 32  ### Extended Tokens (0x80-0xBF)
 33  
 34  Less frequent tokens use a two-part encoding:
 35  1. Marker byte (0x80-0xBF) - upper 6 bits are part of the token ID
 36  2. LEB128 varint - remaining bits of the token ID
 37  
 38  This allows token IDs up to 2^32 while keeping common tokens compact.
 39  
 40  ### Control Opcodes (0xC0-0xCF)
 41  
 42  | Opcode | Byte | Description |
 43  |--------|------|-------------|
 44  | CHUNK_END | 0xC0 | End of semantic chunk |
 45  | TOOL_CALL_START | 0xC1 | Begin tool call block |
 46  | TOOL_CALL_END | 0xC2 | End tool call block |
 47  | THINK_START | 0xC3 | Begin thinking block |
 48  | THINK_END | 0xC4 | End thinking block |
 49  | CODE_BLOCK_START | 0xC5 | Begin code block |
 50  | CODE_BLOCK_END | 0xC6 | End code block |
 51  | FLUSH | 0xC7 | Flush (chunk incomplete) |
 52  | Reserved | 0xC8-0xCE | Reserved for future use |
 53  | STREAM_END | 0xCF | End of stream |
 54  
 55  ## Parse Modes
 56  
 57  The decoder tracks semantic context through parse modes:
 58  
 59  | Mode | Description |
 60  |------|-------------|
 61  | `text` | Normal text output (default) |
 62  | `think` | Thinking/reasoning block (may be hidden from user) |
 63  | `toolCall` | Tool call JSON (must parse as valid JSON) |
 64  | `codeBlock` | Code block content |
 65  
 66  ### Mode Transitions
 67  
 68  Valid transitions:
 69  - `text` → any other mode (via START opcode)
 70  - any mode → `text` (via matching END opcode)
 71  
 72  Invalid transitions trigger reset:
 73  - Starting a mode while not in `text` (nested modes)
 74  - Ending a mode you're not in (unmatched end)
 75  
 76  ## Reset-on-Ambiguity
 77  
 78  ### StrictParseResult
 79  
 80  Unlike binary ok/fail parsing, SIGIL uses a three-way result:
 81  
 82  ```lean
 83  inductive StrictParseResult (α : Type) where
 84    | ok : α → Bytes → StrictParseResult α        -- Unambiguous success
 85    | incomplete : StrictParseResult α             -- Need more bytes
 86    | ambiguous : AmbiguityReason → StrictParseResult α  -- Reset required
 87  ```
 88  
 89  The key insight: `incomplete` is fine (wait for more data), but `ambiguous` must trigger a full state reset.
 90  
 91  ### Ambiguity Reasons
 92  
 93  ```lean
 94  inductive AmbiguityReason where
 95    | unmatchedModeEnd : ParseMode → AmbiguityReason      -- End without matching start
 96    | nestedModeStart : ParseMode → ParseMode → AmbiguityReason  -- Mode in non-text
 97    | reservedOpcode : UInt8 → AmbiguityReason            -- Reserved opcode used
 98    | varintOverflow : AmbiguityReason                     -- Token ID > 2^32
 99    | jsonStructural : String → AmbiguityReason            -- Bad JSON in tool call
100    | sseFraming : String → AmbiguityReason                -- SSE framing error
101    | upstreamError : String → AmbiguityReason             -- In-band error signal
102  ```
103  
104  ### The Ground State
105  
106  The `initDecodeState` is the unique ground state:
107  
108  ```lean
109  def initDecodeState : DecodeState := {
110    parseMode := .text
111    buffer := []
112    leftover := ByteArray.empty
113  }
114  ```
115  
116  Reset always produces this exact state - no information leaks across ambiguity boundaries.
117  
118  ## Verified Properties
119  
120  Cornell proves 18 theorems about SIGIL with 0 sorry (unproven assumptions):
121  
122  ### Core Reset Theorems
123  
124  | Theorem | Property |
125  |---------|----------|
126  | `reset_is_ground` | Reset always produces `initDecodeState` |
127  | `ground_unique` | All resets produce the same state |
128  | `reset_idempotent` | `reset(reset(s)) = reset(s)` |
129  | `no_leakage` | Different states reset to identical states |
130  | `reset_erases_mode` | Mode is `text` after reset |
131  | `reset_erases_buffer` | Buffer is empty after reset |
132  | `reset_erases_leftover` | Leftover is empty after reset |
133  
134  ### Mode Transition Theorems
135  
136  | Theorem | Property |
137  |---------|----------|
138  | `nested_start_ambiguous` | Nested mode starts are detected |
139  | `mismatched_end_ambiguous` | Mismatched mode ends are detected |
140  
141  ### handleControl Theorems
142  
143  | Theorem | Property |
144  |---------|----------|
145  | `handleControl_reserved_resets` | Reserved opcodes reset to init |
146  | `handleControl_toolCallStart_think_resets` | toolCallStart in think mode resets |
147  | `handleControl_toolCallStart_toolCall_resets` | toolCallStart in toolCall mode resets |
148  | `handleControl_toolCallStart_codeBlock_resets` | toolCallStart in codeBlock mode resets |
149  | `handleControl_toolCallEnd_text_resets` | toolCallEnd in text mode resets |
150  | `handleControl_toolCallEnd_think_resets` | toolCallEnd in think mode resets |
151  | `handleControl_toolCallEnd_codeBlock_resets` | toolCallEnd in codeBlock mode resets |
152  | `handleControl_ambiguity_is_ground` | Ambiguity always produces ground state |
153  
154  ## Code Generation
155  
156  Cornell generates verified implementations:
157  
158  ### C++ (evring/sigil.h)
159  
160  ```cpp
161  #include "straylight/evring/sigil.h"
162  
163  // Satisfies evring::machine concept
164  static_assert(evring::machine<evring::sigil::sigil_machine>);
165  
166  // Use the machine
167  evring::sigil::sigil_machine decoder;
168  auto [new_state, ops] = decoder.step(state, byte_event);
169  ```
170  
171  ### Haskell (Evring/Sigil.hs)
172  
173  ```haskell
174  import Evring.Sigil (SigilMachine, initSigilState, stepSigil)
175  
176  -- Machine instance
177  instance Machine SigilMachine
178  
179  -- Use the machine
180  let (newState, ops) = stepSigil state byteEvent
181  ```
182  
183  ## Example: Decoding a Stream
184  
185  ```
186  Input bytes: [0x48, 0x65, 0x6C, 0x6C, 0x6F, 0xC3, 0x01, 0x02, 0xC4, 0xC0]
187               "H"   "e"   "l"   "l"   "o"   THINK  tok   tok   /THINK CHUNK
188  
189  Decode trace:
190    state: text, buffer: []
191    0x48 → text, buffer: [72]
192    0x65 → text, buffer: [72, 101]
193    0x6C → text, buffer: [72, 101, 108]
194    0x6C → text, buffer: [72, 101, 108, 108]
195    0x6F → text, buffer: [72, 101, 108, 108, 111]
196    0xC3 → think, buffer: [], emit: Chunk{text: [72,101,108,108,111], complete: false}
197    0x01 → think, buffer: [1]
198    0x02 → think, buffer: [1, 2]
199    0xC4 → text, buffer: [], emit: Chunk{think: [1, 2], complete: true}
200    0xC0 → text, buffer: [], emit: Chunk{text: [], complete: true}
201  ```
202  
203  ## Example: Ambiguity Reset
204  
205  ```
206  Input bytes: [0x48, 0xC3, 0x01, 0xC1, ...]
207               "H"   THINK  tok   TOOL_START  ← nested mode!
208  
209  Decode trace:
210    state: text, buffer: []
211    0x48 → text, buffer: [72]
212    0xC3 → think, buffer: [], emit: Chunk{text: [72], complete: false}
213    0x01 → think, buffer: [1]
214    0xC1 → AMBIGUITY: nestedModeStart(think, toolCall)
215           Reset to initDecodeState
216           emit: Chunk{ambiguityReset: nestedModeStart(think, toolCall)}
217  ```
218  
219  ## Integration with Agentic Systems
220  
221  SIGIL is designed for agentic LLM systems where:
222  
223  1. **Tool calls must be trustworthy** - JSON in toolCall mode is parsed; structural errors reset
224  2. **Thinking may be private** - think blocks can be filtered before presenting to user
225  3. **Streaming is first-class** - incomplete parses wait for more data
226  4. **Ambiguity is fatal** - corrupted state cannot propagate
227  
228  The reset-on-ambiguity discipline ensures that an agent can trust its own reasoning chain. If parsing ever becomes ambiguous, the agent returns to a known-good state rather than proceeding with potentially corrupted data.