= Cerro Torre Evaluation Report
:toc: left
:toclevels: 2

This document records periodic, evidence-backed evaluations of completion against the repo's checklists.
It is intentionally *veridical*: every claim must link to a command output, file path, or CI run.

== How to use this
* Create a new section per evaluation date.
* For each domain, choose a rating and provide evidence.
* Link to the exact checklist file and (where relevant) the exact checklist item IDs (e.g., "A1", "C2").

== Rating scale
* *Fullest* — Delight + polish; minimal friction; interop + migrations are boring and reliable.
* *Full* — Paved road end-to-end; strong diagnostics; stable surfaces; CI gates enforce the contract.
* *Strong* — Very usable; small sharp edges remain; most seams covered by conformance and interop.
* *Moderate* — Works for insiders; occasional setup/debug friction; partial conformance/interop.
* *Basic* — Core functions exist but are fragile/manual; little automation or diagnostics.
* *Not implemented* — Missing or not verifiable.

== Evidence rules
A rating without evidence is invalid.
Accepted evidence types:

* CI run URL(s) (preferred)
* exact command + output excerpt (redacted if needed) committed under `evidence/`
* file paths + commit hashes
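
As a minimal sketch of how these rules could be checked mechanically, the following Python helper sorts an evidence reference into one of the three accepted types. The function name, the regular expressions, and the assumption that CI runs are GitHub Actions run URLs are illustrative only; this report does not define an exact syntax for evidence references.

```python
import re
from typing import Optional

# Hypothetical patterns; the report does not define an exact syntax for evidence.
CI_RUN_URL = re.compile(r"^https://github\.com/[^/]+/[^/]+/actions/runs/\d+")
EVIDENCE_FILE = re.compile(r"^evidence/\d{4}-\d{2}-\d{2}/[\w.-]+$")
COMMIT_HASH = re.compile(r"\b[0-9a-f]{7,40}\b")


def classify_evidence(ref: str) -> Optional[str]:
    """Return the accepted evidence type for ref, or None if it matches none."""
    ref = ref.strip()
    if CI_RUN_URL.match(ref):
        return "ci-run"
    if EVIDENCE_FILE.match(ref):
        return "evidence-file"
    # A file path pinned to a commit, e.g. "QOL-AUDIT.adoc @ d4e378d".
    if "@" in ref and COMMIT_HASH.search(ref.split("@", 1)[1]):
        return "path-at-commit"
    return None
```

A reference for which this returns `None` would make the rating it supports invalid under the rule above.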

== Change control
* Any claim of *Full* or above MUST reference passing required CI checks.
* If a rating drops, record the regression cause and link to the issue/PR.
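
The two change-control rules can be expressed as small checks over an ordered rating scale. This is a sketch under the assumption that ratings are compared by their position in the scale listed in this report; the function names are hypothetical.

```python
# Ratings ordered from lowest to highest, mirroring the rating scale in this report.
SCALE = ["Not implemented", "Basic", "Moderate", "Strong", "Full", "Fullest"]


def rank(rating: str) -> int:
    """Position of a rating on the scale; raises ValueError for unknown ratings."""
    return SCALE.index(rating)


def needs_ci_evidence(rating: str) -> bool:
    """Full or above must reference passing required CI checks."""
    return rank(rating) >= rank("Full")


def is_regression(previous: str, current: str) -> bool:
    """A drop in rating must be recorded with its cause and an issue/PR link."""
    return rank(current) < rank(previous)
```

A compound value such as `Basic-Moderate` is not on the scale and would raise `ValueError` in this sketch, so a real checker would need a policy for intermediate ratings.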

---

=== 2025-12-30 (Evaluator: claude-code)

Checklist: link:QOL-AUDIT.adoc[QOL-AUDIT.adoc] @ commit `d4e378d7cfb688488795718847514aff530a33f5`

Scope: Branch `claude/cerro-torre-mvp-plan-tKJdw`, pre-v0.1

Platforms covered: linux x86_64 (specification only, no runtime tests)

==== Overall rating

*Rating:* Basic

*Justification:*

- All core CLI commands scaffolded with help text but no functional implementation
- Comprehensive specifications exist (canonicalization, policy, index, crypto suites)
- No tests directory, no conformance vectors, no CI job separation
- Mental model doc missing; onboarding story incomplete

==== Domain ratings (with evidence)

[cols="1,1,4",options="header"]
|===
| Domain | Rating | Evidence (links / commands)

| A. Onboarding + doctor
| Basic
| * `ct doctor` scaffolded but returns "Not yet implemented"
  * Evidence: `evidence/2025-12-30/cli-scaffold.txt`
  * CLI source: `src/cli/cerro_cli.adb:357-392`
  * Missing: actual crypto backend check, config validation

| B. CLI ergonomics + surfaces
| Basic-Moderate
| * 17+ commands scaffolded in `src/cli/cerro_cli.ads`
  * Exit codes defined: `src/cli/ct_errors.ads` (0-12)
  * Spec: `spec/cli-ergonomics.adoc` (comprehensive)
  * Evidence: `evidence/2025-12-30/cli-scaffold.txt`
  * Missing: implementation, --json output

| C. Happy-path E2E (README is executable)
| Not implemented
| * `ct pack` returns "Not yet implemented"
  * `ct verify` returns "Not yet implemented"
  * No tests/ directory exists
  * Evidence: `evidence/2025-12-30/tests-dir.txt`
  * No golden tests, no determinism check

| D. Config/state/cleanup
| Basic
| * Key storage location documented: `~/.config/cerro/keys/`
  * Keystore policy spec: `spec/keystore-policy.json`
  * Argon2id policy documented in spec
  * Evidence: `evidence/2025-12-30/schema-files.txt`
  * Missing: implementation, config.toml spec for mirrors

| E. Release/reproducibility
| Moderate
| * Canonicalization spec: `spec/manifest-canonicalization.adoc` (comprehensive)
  * Manifest format spec: `spec/manifest-format.md`
  * Bundle format: `spec/ctp-bundle-format.adoc`
  * Missing: conformance vectors, reproducible build evidence

| F. Seam + surface checks (conformance/interop)
| Not implemented
| * Svalinn integration spec exists: `spec/svalinn-integration.adoc`
  * No conformance test harness
  * No interop CI job
  * CI not separated: single `ada-spark-ci.yml`
  * Evidence: `.github/workflows/ada-spark-ci.yml` (40 lines, minimal)

| G. Smoothing docs + usability
| Not implemented
| * No `docs/mental-model.adoc`
  * No `.github/PULL_REQUEST_TEMPLATE.md`
  * Troubleshooting: `ct doctor` scaffolded only
  * Example policies exist: `evidence/2025-12-30/example-policies.txt`
|===

==== Evidence files committed

[source]
----
evidence/2025-12-30/
├── cli-scaffold.txt        # ls -la src/cli/*.ad[sb]
├── example-policies.txt    # ls -la examples/*.json
├── schema-files.txt        # ls -la spec/*.json
└── tests-dir.txt           # ls tests/ (shows non-existent)
----
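
The listing above pairs each evidence file with the command that produced it. A sketch of regenerating those files from the repository root might look like the following. The helper names and the `$ command` header line are assumptions, not an existing tool in this repo. Standard error is captured together with standard output so that a failing command, such as `ls tests/` on a missing directory, still leaves a record.

```python
import subprocess
from pathlib import Path

# Commands copied from the listing above; run through a shell so the globs expand.
EVIDENCE_COMMANDS = {
    "cli-scaffold.txt": "ls -la src/cli/*.ad[sb]",
    "example-policies.txt": "ls -la examples/*.json",
    "schema-files.txt": "ls -la spec/*.json",
    "tests-dir.txt": "ls tests/",
}


def capture(command: str, out_file: Path) -> int:
    """Run command in a shell and write the command plus its output to out_file."""
    result = subprocess.run(
        command,
        shell=True,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep errors such as "No such file or directory"
        text=True,
    )
    out_file.parent.mkdir(parents=True, exist_ok=True)
    out_file.write_text(f"$ {command}\n{result.stdout}")
    return result.returncode


def capture_all(date: str, root: Path = Path("evidence")) -> None:
    """Write one evidence file per command under root/date/."""
    for name, command in EVIDENCE_COMMANDS.items():
        capture(command, root / date / name)
```

Calling `capture_all("2025-12-30")` would write the four files under `evidence/2025-12-30/`. Because `shell=True` executes the string as given, the command table should stay hard-coded rather than accept untrusted input.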

==== Checklist item status (cross-reference to QOL-AUDIT.adoc)

[cols="1,1,3",options="header"]
|===
| Item | Status | Notes

| A1. `ct pack` deterministic
| Basic
| CLI scaffold exists, not implemented

| A2. `ct verify` structured errors
| Basic
| Exit codes defined in `ct_errors.ads`, not implemented

| A3. `ct explain` narrative
| Basic
| CLI scaffold exists, not implemented

| B1. Key lifecycle
| Basic
| All subcommands scaffolded, none implemented

| B2. Policy UX
| Moderate
| Schema complete (`policy-schema.json`), examples exist

| B3. Rotation/multi-signer
| Moderate
| Schema supports threshold/deny, `ct re-sign` scaffolded

| C1. Offline export/import
| Basic
| Scaffolded, not implemented

| C2. Mirror resolution
| Not implemented
| No mirror config section, no --offline flag

| D1. Canonicalization rules
| Strong
| Spec complete, ATS shadow exists, vectors missing

| D2. Turing-incomplete
| Strong
| TOML inherently bounded, spec documents limits

| E1. Proof gates separated
| Moderate
| Single CI workflow, SPARK proof optional

| E2. Developer mode
| Not implemented
| No --dev flag, no dev markers

| F1. Spec version pin
| Basic
| `ct version` prints 0.1.0-dev, no spec version

| F2. Conformance harness
| Not implemented
| No harness, no vectors imported

| F3. Svalinn interop
| Not implemented
| Spec exists, no CI job

| G1. `ct doctor`
| Basic
| Scaffolded with check list, not implemented

| G2. `ct diff`
| Basic
| Scaffolded with sample output, not implemented

| G3. Mental model doc
| Not implemented
| Does not exist
|===

==== Regressions since last evaluation

* N/A — First evaluation

==== Top 5 next actions (ranked)

. [ ] *Implement core crypto* — SHA-256 + Ed25519 via libsodium bindings. Required for any verification.
. [ ] *Create tests/ directory with conformance vectors* — `tests/canon/valid/`, `tests/canon/invalid/`, `tests/errors/`. Enables CI gates.
. [ ] *Implement `ct pack` and `ct verify`* — Core loop. Without this, nothing is usable.
. [ ] *Create `docs/mental-model.adoc`* — 2-page user guide. Critical for onboarding.
. [ ] *Separate CI into job matrix* — `build`, `unit-test`, `spark-proof`, `lint`. Enables proper gates.
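
To illustrate how the proposed `tests/canon/valid/` and `tests/canon/invalid/` directories could drive a CI gate, here is a hedged sketch of a vector runner. It assumes vectors are `*.toml` files and that the canonicalizer signals rejection by raising `ValueError`. The `canonicalize` callable is a stand-in supplied by the caller, not an existing Cerro Torre API.

```python
from pathlib import Path
from typing import Callable, List


def run_vectors(root: Path, canonicalize: Callable[[bytes], bytes]) -> List[str]:
    """Return a list of failure messages; an empty list means all vectors passed."""
    failures = []
    # Every file under valid/ must be accepted.
    for path in sorted((root / "valid").glob("*.toml")):
        try:
            canonicalize(path.read_bytes())
        except ValueError as err:
            failures.append(f"{path.name}: valid vector rejected ({err})")
    # Every file under invalid/ must be rejected.
    for path in sorted((root / "invalid").glob("*.toml")):
        try:
            canonicalize(path.read_bytes())
        except ValueError:
            continue  # rejection is the expected outcome
        failures.append(f"{path.name}: invalid vector accepted")
    return failures
```

A CI job could call `run_vectors(Path("tests/canon"), canonicalize)` and fail the build when the returned list is non-empty.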

==== Notes

* Schema completeness is high — `policy-schema.json`, `index-schema.json`, `crypto-suites.json`, `keystore-policy.schema.json` all exist and are comprehensive.
* CLI scaffold is thorough with detailed help text for all 17+ commands.
* The gap between "specified" and "implemented" is the primary issue.
* ATS2 shadow verifier exists (`tools/ats-shadow/`) but is non-authoritative.
* Next evaluation should be after implementing pack/verify and creating test vectors.

---

== Evaluation Template (copy for next evaluation)

=== YYYY-MM-DD (Evaluator: NAME/HANDLE)

Checklist: link:QOL-AUDIT.adoc[QOL-AUDIT.adoc] @ commit `COMMIT_HASH`

Scope: (release tag, branch, or commit range)

Platforms covered: (e.g., linux x86_64 rootless, linux aarch64 rootful, etc.)

==== Overall rating

*Rating:* (Fullest | Full | Strong | Moderate | Basic | Not implemented)

*Justification (1–3 sentences):*

- ...

==== Domain ratings (with evidence)

[cols="1,1,4",options="header"]
|===
| Domain | Rating | Evidence (links / commands)

| A. Onboarding + doctor
| (..)
| * CI: ...
  * Evidence file: `evidence/YYYY-MM-DD/doctor.txt`
  * Notes: ...

| B. CLI ergonomics + surfaces
| (..)
| * CI: ...
  * Schema: `spec/*.json` @ commit ...
  * Notes: ...

| C. Happy-path E2E (README is executable)
| (..)
| * CI: ...
  * Logs: ...
  * Notes: ...

| D. Config/state/cleanup
| (..)
| * Docs: `docs/...`
  * CI: ...
  * Notes: ...

| E. Release/reproducibility
| (..)
| * Release assets: ...
  * Attestations: ...
  * Notes: ...

| F. Seam + surface checks (conformance/interop)
| (..)
| * Conformance: ...
  * Interop matrix: ...
  * Notes: ...

| G. Smoothing docs + usability
| (..)
| * Troubleshooting: ...
  * Mental model: ...
  * Notes: ...
|===

==== Regressions since last evaluation

* None / list issues
- (issue/PR link) — what regressed, impact, mitigation

==== Top 5 next actions (ranked)

. [ ] (action) — link to issue/PR, owner, target milestone
. [ ] ...
. [ ] ...
. [ ] ...
. [ ] ...

==== Notes

* Anything learned during evaluation that should become a checklist item:
- ...