/ README.md
README.md
  1  # Cornell
  2  
  3  > "Cornell boxes... the man made these things. Boxes with things inside them."
  4  
  5  **Verified Binary Format DSL in Lean4** - bidirectional codecs with machine-checked round-trip correctness.
  6  
  7  ## What is Cornell?
  8  
  9  Cornell is a domain-specific language for defining binary wire formats with **proofs**. If your format definition compiles, you get:
 10  
 11  1. **Roundtrip correctness**: `parse(serialize(a)) = a`
 12  2. **Consumption property**: `parse(serialize(a) ++ extra) = (a, extra)`
 13  3. **Determinism**: parsing is a function, not a relation
 14  4. **Code generation**: C++ and Haskell implementations extracted from the spec
 15  
 16  ## Reset-on-Ambiguity
 17  
 18  Cornell enforces a strict discipline for parsing upstream data: **ambiguity must reset the connection**. This is critical for agentic systems where trusting corrupted reasoning chains is worse than starting fresh.
 19  
 20  ```
 21  When parsing fails ambiguously:
 22    - DON'T: guess, recover, or best-effort parse
 23    - DO: reset to ground state, emit AmbiguityReset event
 24  ```
 25  
 26  This is proven correct in the SIGIL and ZMTP modules with theorems like:
 27  - `reset_is_ground`: reset always produces `initState`
 28  - `no_leakage`: all resets produce identical states
 29  - `reset_idempotent`: `reset(reset(s)) = reset(s)`
 30  
 31  ## Verified Protocols
 32  
 33  ### SIGIL - LLM Token Streaming (800 lines, 18 theorems, 0 sorry)
 34  
 35  Wire format for streaming LLM tokens with semantic mode tracking:
 36  
 37  ```
 38  Byte encoding:
 39    0xxxxxxx  = hot token (ID in lower 7 bits, 0x00-0x7E)
 40    10xxxxxx  = extended token (varint follows)
 41    1100xxxx  = stream control (0xC0-0xCF)
 42  ```
 43  
 44  Modes: Text, Think, ToolCall, CodeBlock - with proven reset on invalid transitions.
 45  
 46  See [docs/sigil.md](docs/sigil.md) for details.
 47  
 48  ### ZMTP 3.x - ZeroMQ Transport (563 lines, 6 theorems, 0 sorry)
 49  
 50  ZMTP protocol parser with proven properties:
 51  
 52  ```
 53  greeting = signature[10] version[2] mechanism[20] as-server[1] filler[31]
 54  frame    = flags[1] size[1|8] body[size]
 55  ```
 56  
 57  Reset triggers: reserved flags, invalid signature, invalid command name.
 58  
 59  See [docs/zmtp.md](docs/zmtp.md) for details.
 60  
 61  ### Nix Daemon Protocol (935 lines, 10 theorems, 5 axioms)
 62  
 63  Nix daemon wire format with handshake state machine:
 64  
 65  ```
 66  Primitives:
 67    u64le     = little-endian 64-bit integer
 68    NixString = u64 length + bytes + padding to 8-byte boundary
 69    Bool      = u64 (0 = false, nonzero = true)
 70  
 71  Handshake:
 72    client → WORKER_MAGIC_1 (0x6e697863) + version
 73    server → WORKER_MAGIC_2 (0x6478696f) + version
 74  ```
 75  
 76  Reset triggers: invalid magic, unsupported version, unknown worker op.
 77  
 78  See [docs/nix.md](docs/nix.md) for details.
 79  
 80  ### Protobuf Wire Format (767 lines, 12 theorems, 0 sorry)
 81  
 82  Protocol Buffers encoding with gRPC framing:
 83  
 84  ```
 85  Wire types:
 86    0 = varint     (int32, int64, bool, enum)
 87    1 = fixed64    (fixed64, double)
 88    2 = len-delim  (string, bytes, embedded messages)
 89    5 = fixed32    (fixed32, float)
 90    3,4 = DEPRECATED (groups - trigger ambiguity)
 91  
 92  gRPC frame:
 93    flag[1] + length[4 big-endian] + message[N]
 94  ```
 95  
 96  Reset triggers: varint overflow, deprecated wire types, invalid UTF-8.
 97  
 98  See [docs/protobuf.md](docs/protobuf.md) for details.
 99  
100  ## The Box Abstraction
101  
102  A `Box α` is a verified bidirectional codec:
103  
104  ```lean
105  structure Box (α : Type) where
106    parse : Bytes → ParseResult α
107    serialize : α → Bytes
108    roundtrip : ∀ a, parse (serialize a) = ok a empty
109    consumption : ∀ a extra, parse (serialize a ++ extra) = ok a extra
110  ```
111  
112  ## Code Generation
113  
114  Cornell generates implementation code from verified specs:
115  
116  | Executable | Output | Description |
117  |------------|--------|-------------|
118  | `cornell_sigil_gen` | `evring/sigil.h` | C++ SIGIL decoder |
119  | `cornell_haskell_gen` | `Evring/Sigil.hs` | Haskell SIGIL decoder |
120  | `cornell_zmtp_gen` | `evring/zmtp_gen.h` | C++ ZMTP parser |
121  | `cornell_zmtp_haskell_gen` | `Evring/Zmtp.hs` | Haskell ZMTP parser |
122  | `cornell_cpp_gen` | wire format code | C++ primitives |
123  | `cornell_sm_gen` | state machine code | C++ state machines |
124  
125  ### Regenerating Code
126  
127  ```bash
128  # Build all generators
129  lake build
130  
131  # Generate C++ headers
132  lake exe cornell_sigil_gen 2>/dev/null > path/to/sigil.h
133  lake exe cornell_zmtp_gen 2>/dev/null > path/to/zmtp_gen.h
134  
135  # Generate Haskell modules
136  lake exe cornell_haskell_gen 2>/dev/null > path/to/Sigil.hs
137  lake exe cornell_zmtp_haskell_gen 2>/dev/null > path/to/Zmtp.hs
138  ```
139  
140  ## Project Structure
141  
142  ```
143  Cornell/
144  ├── Basic.lean              # Core Box DSL + primitives
145  ├── Sigil.lean              # SIGIL wire format (18 theorems)
146  ├── Zmtp.lean               # ZMTP 3.x parser (6 theorems)
147  ├── ExtractSigil.lean       # C++ codegen for SIGIL
148  ├── ExtractHaskell.lean     # Haskell codegen for SIGIL
149  ├── ExtractZmtp.lean        # C++ codegen for ZMTP
150  ├── ExtractZmtpHaskell.lean # Haskell codegen for ZMTP
151  ├── Nix.lean                # Nix daemon wire format (10 theorems)
152  ├── Protobuf.lean           # Protobuf wire format (12 theorems)
153  ├── StateMachine.lean       # State machine DSL
154  ├── ExtractCpp.lean         # C++ wire format codegen
155  └── ExtractStateMachine.lean # C++ state machine codegen
156  
157  docs/
158  ├── sigil.md                # SIGIL protocol documentation
159  ├── zmtp.md                 # ZMTP protocol documentation
160  ├── nix.md                  # Nix daemon protocol documentation
161  ├── protobuf.md             # Protobuf wire format documentation
162  └── state-machine-dsl.md    # State machine DSL guide
163  ```
164  
165  ## Theorem Summary
166  
167  ### SIGIL (Cornell.Sigil)
168  
169  | Theorem | Property |
170  |---------|----------|
171  | `reset_is_ground` | Reset always produces `initDecodeState` |
172  | `no_leakage` | Different states reset to identical states |
173  | `reset_idempotent` | `reset(reset(s)) = reset(s)` |
174  | `reset_erases_mode` | Mode is `text` after reset |
175  | `reset_erases_buffer` | Buffer is empty after reset |
176  | `handleControl_reserved_resets` | Reserved opcodes trigger reset |
177  | `handleControl_nested_start_resets` | Nested mode starts trigger reset |
178  | `handleControl_unmatched_end_resets` | Unmatched mode ends trigger reset |
179  | ... | (18 total) |
180  
181  ### ZMTP (Cornell.Zmtp)
182  
183  | Theorem | Property |
184  |---------|----------|
185  | `parseGreeting_deterministic` | Same bytes → same result |
186  | `greeting_exactly_64` | Greeting consumes exactly 64 bytes |
187  | `flags_unambiguous` | With enough bytes, parsing completes |
188  | `reset_is_initial` | Reset returns to initial state |
189  | `no_backtrack_greeting` | Parser never re-reads bytes |
190  | `parseFrameHeader_deterministic` | Same bytes → same result |
191  
192  ### Nix Daemon (Cornell.Nix)
193  
194  | Theorem | Property |
195  |---------|----------|
196  | `reset_is_initial` | Reset always produces `initConnState` |
197  | `reset_idempotent` | `reset(reset(s)) = reset(s)` |
198  | `no_leakage` | Different states reset to identical states |
199  | `parseU64Strict_deterministic` | Same bytes → same result |
200  | `parseClientHello_invalid_magic_resets` | Invalid magic triggers ambiguity |
201  | `parseClientHello_unsupported_version_resets` | Unsupported version triggers ambiguity |
202  | `parseWorkerOp_unknown_resets` | Unknown op code triggers ambiguity |
203  | `parseU64Strict_complete` | With 8+ bytes, parse succeeds |
204  | `parseU64Strict_consumes_8` | Parse consumes exactly 8 bytes |
205  | `init_is_await_client` | Initial state is `awaitClientHello` |
206  
207  ### Protobuf (Cornell.Protobuf)
208  
209  | Theorem | Property |
210  |---------|----------|
211  | `reset_is_initial` | Reset always produces `initParseState` |
212  | `reset_idempotent` | `reset(reset(s)) = reset(s)` |
213  | `no_leakage` | Different states reset to identical states |
214  | `decodeVarintStrict_deterministic` | Same bytes → same result |
215  | `wireType3_ambiguous` | startGroup triggers ambiguity |
216  | `wireType4_ambiguous` | endGroup triggers ambiguity |
217  | `wireType6_invalid` | Wire type 6 is invalid |
218  | `wireType7_invalid` | Wire type 7 is invalid |
219  | `valid_wire_types` | 0, 1, 2, 5 are valid |
220  | `grpc_invalid_flag` | Compression flag > 1 triggers ambiguity |
221  | ... | (12 total) |
222  
223  ## Building
224  
225  ```bash
226  # Lean4
227  lake build
228  
229  # Run all proofs (type checking)
230  lake build Cornell
231  
232  # Build specific generator
233  lake build cornell_sigil_gen
234  ```
235  
236  ## Integration
237  
238  ### evring (C++)
239  
240  ```cpp
241  #include "straylight/evring/sigil.h"
242  #include "straylight/evring/zmtp_gen.h"
243  
244  // Both satisfy evring::machine concept
245  static_assert(evring::machine<evring::sigil::sigil_machine>);
246  static_assert(evring::machine<evring::zmtp::zmtp_machine>);
247  ```
248  
249  ### libevring (Haskell)
250  
251  ```haskell
252  import Evring.Sigil (SigilMachine, initSigilState)
253  import Evring.Zmtp (ZmtpMachine, initialZmtpState)
254  
255  -- Both are Machine instances
256  instance Machine SigilMachine
257  instance Machine ZmtpMachine
258  ```
259  
260  ## License
261  
262  MIT