/ README.md
README.md
1 # Cornell 2 3 > "Cornell boxes... the man made these things. Boxes with things inside them." 4 5 **Verified Binary Format DSL in Lean4** - bidirectional codecs with machine-checked round-trip correctness. 6 7 ## What is Cornell? 8 9 Cornell is a domain-specific language for defining binary wire formats with **proofs**. If your format definition compiles, you get: 10 11 1. **Roundtrip correctness**: `parse(serialize(a)) = a` 12 2. **Consumption property**: `parse(serialize(a) ++ extra) = (a, extra)` 13 3. **Determinism**: parsing is a function, not a relation 14 4. **Code generation**: C++ and Haskell implementations extracted from the spec 15 16 ## Reset-on-Ambiguity 17 18 Cornell enforces a strict discipline for parsing upstream data: **ambiguity must reset the connection**. This is critical for agentic systems where trusting corrupted reasoning chains is worse than starting fresh. 19 20 ``` 21 When parsing fails ambiguously: 22 - DON'T: guess, recover, or best-effort parse 23 - DO: reset to ground state, emit AmbiguityReset event 24 ``` 25 26 This is proven correct in the SIGIL and ZMTP modules with theorems like: 27 - `reset_is_ground`: reset always produces `initState` 28 - `no_leakage`: all resets produce identical states 29 - `reset_idempotent`: `reset(reset(s)) = reset(s)` 30 31 ## Verified Protocols 32 33 ### SIGIL - LLM Token Streaming (800 lines, 18 theorems, 0 sorry) 34 35 Wire format for streaming LLM tokens with semantic mode tracking: 36 37 ``` 38 Byte encoding: 39 0xxxxxxx = hot token (ID in lower 7 bits, 0x00-0x7E) 40 10xxxxxx = extended token (varint follows) 41 1100xxxx = stream control (0xC0-0xCF) 42 ``` 43 44 Modes: Text, Think, ToolCall, CodeBlock - with proven reset on invalid transitions. 45 46 See [docs/sigil.md](docs/sigil.md) for details. 47 48 ### ZMTP 3.x - ZeroMQ Transport (563 lines, 6 theorems, 0 sorry) 49 50 ZMTP protocol parser with proven properties: 51 52 ``` 53 greeting = signature[10] version[2] mechanism[20] as-server[1] filler[31] 54 frame = flags[1] size[1|8] body[size] 55 ``` 56 57 Reset triggers: reserved flags, invalid signature, invalid command name. 58 59 See [docs/zmtp.md](docs/zmtp.md) for details. 60 61 ### Nix Daemon Protocol (935 lines, 10 theorems, 5 axioms) 62 63 Nix daemon wire format with handshake state machine: 64 65 ``` 66 Primitives: 67 u64le = little-endian 64-bit integer 68 NixString = u64 length + bytes + padding to 8-byte boundary 69 Bool = u64 (0 = false, nonzero = true) 70 71 Handshake: 72 client → WORKER_MAGIC_1 (0x6e697863) + version 73 server → WORKER_MAGIC_2 (0x6478696f) + version 74 ``` 75 76 Reset triggers: invalid magic, unsupported version, unknown worker op. 77 78 See [docs/nix.md](docs/nix.md) for details. 79 80 ### Protobuf Wire Format (767 lines, 12 theorems, 0 sorry) 81 82 Protocol Buffers encoding with gRPC framing: 83 84 ``` 85 Wire types: 86 0 = varint (int32, int64, bool, enum) 87 1 = fixed64 (fixed64, double) 88 2 = len-delim (string, bytes, embedded messages) 89 5 = fixed32 (fixed32, float) 90 3,4 = DEPRECATED (groups - trigger ambiguity) 91 92 gRPC frame: 93 flag[1] + length[4 big-endian] + message[N] 94 ``` 95 96 Reset triggers: varint overflow, deprecated wire types, invalid UTF-8. 97 98 See [docs/protobuf.md](docs/protobuf.md) for details. 99 100 ## The Box Abstraction 101 102 A `Box α` is a verified bidirectional codec: 103 104 ```lean 105 structure Box (α : Type) where 106 parse : Bytes → ParseResult α 107 serialize : α → Bytes 108 roundtrip : ∀ a, parse (serialize a) = ok a empty 109 consumption : ∀ a extra, parse (serialize a ++ extra) = ok a extra 110 ``` 111 112 ## Code Generation 113 114 Cornell generates implementation code from verified specs: 115 116 | Executable | Output | Description | 117 |------------|--------|-------------| 118 | `cornell_sigil_gen` | `evring/sigil.h` | C++ SIGIL decoder | 119 | `cornell_haskell_gen` | `Evring/Sigil.hs` | Haskell SIGIL decoder | 120 | `cornell_zmtp_gen` | `evring/zmtp_gen.h` | C++ ZMTP parser | 121 | `cornell_zmtp_haskell_gen` | `Evring/Zmtp.hs` | Haskell ZMTP parser | 122 | `cornell_cpp_gen` | wire format code | C++ primitives | 123 | `cornell_sm_gen` | state machine code | C++ state machines | 124 125 ### Regenerating Code 126 127 ```bash 128 # Build all generators 129 lake build 130 131 # Generate C++ headers 132 lake exe cornell_sigil_gen 2>/dev/null > path/to/sigil.h 133 lake exe cornell_zmtp_gen 2>/dev/null > path/to/zmtp_gen.h 134 135 # Generate Haskell modules 136 lake exe cornell_haskell_gen 2>/dev/null > path/to/Sigil.hs 137 lake exe cornell_zmtp_haskell_gen 2>/dev/null > path/to/Zmtp.hs 138 ``` 139 140 ## Project Structure 141 142 ``` 143 Cornell/ 144 ├── Basic.lean # Core Box DSL + primitives 145 ├── Sigil.lean # SIGIL wire format (18 theorems) 146 ├── Zmtp.lean # ZMTP 3.x parser (6 theorems) 147 ├── ExtractSigil.lean # C++ codegen for SIGIL 148 ├── ExtractHaskell.lean # Haskell codegen for SIGIL 149 ├── ExtractZmtp.lean # C++ codegen for ZMTP 150 ├── ExtractZmtpHaskell.lean # Haskell codegen for ZMTP 151 ├── Nix.lean # Nix daemon wire format (10 theorems) 152 ├── Protobuf.lean # Protobuf wire format (12 theorems) 153 ├── StateMachine.lean # State machine DSL 154 ├── ExtractCpp.lean # C++ wire format codegen 155 └── ExtractStateMachine.lean # C++ state machine codegen 156 157 docs/ 158 ├── sigil.md # SIGIL protocol documentation 159 ├── zmtp.md # ZMTP protocol documentation 160 ├── nix.md # Nix daemon protocol documentation 161 ├── protobuf.md # Protobuf wire format documentation 162 └── state-machine-dsl.md # State machine DSL guide 163 ``` 164 165 ## Theorem Summary 166 167 ### SIGIL (Cornell.Sigil) 168 169 | Theorem | Property | 170 |---------|----------| 171 | `reset_is_ground` | Reset always produces `initDecodeState` | 172 | `no_leakage` | Different states reset to identical states | 173 | `reset_idempotent` | `reset(reset(s)) = reset(s)` | 174 | `reset_erases_mode` | Mode is `text` after reset | 175 | `reset_erases_buffer` | Buffer is empty after reset | 176 | `handleControl_reserved_resets` | Reserved opcodes trigger reset | 177 | `handleControl_nested_start_resets` | Nested mode starts trigger reset | 178 | `handleControl_unmatched_end_resets` | Unmatched mode ends trigger reset | 179 | ... | (18 total) | 180 181 ### ZMTP (Cornell.Zmtp) 182 183 | Theorem | Property | 184 |---------|----------| 185 | `parseGreeting_deterministic` | Same bytes → same result | 186 | `greeting_exactly_64` | Greeting consumes exactly 64 bytes | 187 | `flags_unambiguous` | With enough bytes, parsing completes | 188 | `reset_is_initial` | Reset returns to initial state | 189 | `no_backtrack_greeting` | Parser never re-reads bytes | 190 | `parseFrameHeader_deterministic` | Same bytes → same result | 191 192 ### Nix Daemon (Cornell.Nix) 193 194 | Theorem | Property | 195 |---------|----------| 196 | `reset_is_initial` | Reset always produces `initConnState` | 197 | `reset_idempotent` | `reset(reset(s)) = reset(s)` | 198 | `no_leakage` | Different states reset to identical states | 199 | `parseU64Strict_deterministic` | Same bytes → same result | 200 | `parseClientHello_invalid_magic_resets` | Invalid magic triggers ambiguity | 201 | `parseClientHello_unsupported_version_resets` | Unsupported version triggers ambiguity | 202 | `parseWorkerOp_unknown_resets` | Unknown op code triggers ambiguity | 203 | `parseU64Strict_complete` | With 8+ bytes, parse succeeds | 204 | `parseU64Strict_consumes_8` | Parse consumes exactly 8 bytes | 205 | `init_is_await_client` | Initial state is `awaitClientHello` | 206 207 ### Protobuf (Cornell.Protobuf) 208 209 | Theorem | Property | 210 |---------|----------| 211 | `reset_is_initial` | Reset always produces `initParseState` | 212 | `reset_idempotent` | `reset(reset(s)) = reset(s)` | 213 | `no_leakage` | Different states reset to identical states | 214 | `decodeVarintStrict_deterministic` | Same bytes → same result | 215 | `wireType3_ambiguous` | startGroup triggers ambiguity | 216 | `wireType4_ambiguous` | endGroup triggers ambiguity | 217 | `wireType6_invalid` | Wire type 6 is invalid | 218 | `wireType7_invalid` | Wire type 7 is invalid | 219 | `valid_wire_types` | 0, 1, 2, 5 are valid | 220 | `grpc_invalid_flag` | Compression flag > 1 triggers ambiguity | 221 | ... | (12 total) | 222 223 ## Building 224 225 ```bash 226 # Lean4 227 lake build 228 229 # Run all proofs (type checking) 230 lake build Cornell 231 232 # Build specific generator 233 lake build cornell_sigil_gen 234 ``` 235 236 ## Integration 237 238 ### evring (C++) 239 240 ```cpp 241 #include "straylight/evring/sigil.h" 242 #include "straylight/evring/zmtp_gen.h" 243 244 // Both satisfy evring::machine concept 245 static_assert(evring::machine<evring::sigil::sigil_machine>); 246 static_assert(evring::machine<evring::zmtp::zmtp_machine>); 247 ``` 248 249 ### libevring (Haskell) 250 251 ```haskell 252 import Evring.Sigil (SigilMachine, initSigilState) 253 import Evring.Zmtp (ZmtpMachine, initialZmtpState) 254 255 -- Both are Machine instances 256 instance Machine SigilMachine 257 instance Machine ZmtpMachine 258 ``` 259 260 ## License 261 262 MIT