README.md
# /eval — AI Output Evaluation Skill for Claude Code

Automated evaluation suite for any AI-powered feature. Works with chat agents, content generators, classifiers, summarizers, and any system that takes input and produces AI output.

## What it does

1. Asks you what "good" looks like for your AI feature
2. Generates test scenarios and pass/fail criteria
3. Runs the scenarios against your API
4. Scores results and detects regressions
5. Helps you fix failures

## Install

Copy the `eval/` directory into your project's `.claude/skills/`:

```bash
cp -r eval/ your-project/.claude/skills/eval/
```

Or clone and symlink:

```bash
git clone https://github.com/nichetools/ai-marketing-skills.git
ln -s ai-marketing-skills/eval your-project/.claude/skills/eval
```

## Quick start

1. In Claude Code, type `/eval`
2. Answer the questions about what your AI does and what good/bad looks like
3. The skill generates `eval.config.json` with test scenarios
4. It runs the scenarios and shows you the results
5. Fix failures, re-run, repeat

## Running evals manually

```bash
# Run all scenarios
npx tsx .claude/skills/eval/run-eval.ts

# Use a specific config
npx tsx .claude/skills/eval/run-eval.ts --config my-eval.config.json

# See full AI outputs
npx tsx .claude/skills/eval/run-eval.ts --verbose

# Save current results as baseline
npx tsx .claude/skills/eval/run-eval.ts --baseline
```

## Supported AI types

| Type | How it works |
|------|-------------|
| **Chat agent** | Sends multi-turn messages, evaluates full conversation |
| **Content generator** | Sends single input, evaluates output quality |
| **Classifier** | Sends inputs with known labels, checks accuracy |
| **Summarizer** | Sends documents, checks summary quality |
| **Custom** | Any input/output API you define |

## Criterion types

| Type | What it checks |
|------|---------------|
| `contains` | Output includes a string (case-insensitive) |
| `not_contains` | Output does NOT include a string |
| `regex` | Output matches a regex pattern |
| `max_length` | Output is under N characters |
| `min_length` | Output is at least N characters |
| `max_sentences` | Output is under N sentences |
| `response_time` | API responds within N milliseconds |
| `json_valid` | Output is valid JSON |
| `json_field_equals` | A JSON field equals an expected value |
| `no_hallucination` | Output doesn't contain claims not in the input |
| `preserves_names` | Key names from input appear in output |
| `preserves_numbers` | Key numbers from input appear in output |

## Config format

```json
{
  "name": "My AI Feature Eval",
  "type": "chat",
  "endpoint": "https://my-api.com/chat",
  "method": "POST",
  "headers": { "Content-Type": "application/json" },
  "request_template": {
    "message": "{{message}}",
    "history": "{{history}}"
  },
  "response_field": "response",
  "threshold": 80,
  "scenarios": [
    {
      "name": "basic_greeting",
      "messages": ["Hello"],
      "criteria": [
        { "type": "contains", "value": "help", "description": "Offers help" }
      ]
    }
  ]
}
```

See `eval.config.example.json` for a full example.
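
To make the config semantics concrete, here is a minimal sketch of how the template substitution and a few of the criterion checks could work. The helper names (`fillTemplate`, `checkCriterion`) and the `Criterion` shape are illustrative assumptions, not the actual internals of `run-eval.ts`:

```typescript
// Illustrative sketch only: names and shapes are assumptions,
// not the actual internals of run-eval.ts.
type Criterion = { type: string; value?: unknown; description: string };

// Replace {{placeholders}} in a request_template string with scenario values.
function fillTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_, key) => vars[key] ?? "");
}

// Check one criterion against the AI output, per the table above.
function checkCriterion(output: string, c: Criterion): boolean {
  switch (c.type) {
    case "contains": // case-insensitive substring match
      return output.toLowerCase().includes(String(c.value).toLowerCase());
    case "not_contains":
      return !output.toLowerCase().includes(String(c.value).toLowerCase());
    case "max_length": // under N characters
      return output.length < Number(c.value);
    case "json_valid": // output parses as JSON
      try {
        JSON.parse(output);
        return true;
      } catch {
        return false;
      }
    default:
      throw new Error(`Unhandled criterion type: ${c.type}`);
  }
}

// fillTemplate('{"message": "{{message}}"}', { message: "Hello" })
// yields the request body for the basic_greeting scenario above.
```
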
## Regression detection

Save a baseline after your first passing run:

```bash
npx tsx .claude/skills/eval/run-eval.ts --baseline
```

Future runs automatically compare against the baseline and flag:
- Score drops
- Individual scenarios that got worse
- New failures that didn't exist before

A sketch of how this comparison might work appears at the end of this README.

## When to run evals

- After every prompt or model change
- Before deploying to production
- Weekly to catch quality drift from content/data changes
- When onboarding a new AI feature

## Adding custom criteria

Edit `run-eval.ts` and add a case to the `evaluateCriterion` function:

```typescript
case 'my_custom_check':
  // Your logic here
  return someCondition;
```

Then use it in your config:

```json
{ "type": "my_custom_check", "value": "whatever", "description": "My check" }
```

For a fuller worked example, see the end of this README.

## Philosophy

> "Don't ship prompts without evals. It's the AI equivalent of shipping code without tests."

Manual testing is important for tone and feel. But automated evals catch regressions, enforce quality standards, and give you a score to track over time. Use both.
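
As promised in the Regression detection section, here is a minimal sketch of how comparing a run against a saved baseline could work. The `Results` shape is assumed for illustration and is not the baseline format `run-eval.ts` actually writes:

```typescript
// Hypothetical result shape, assumed for illustration only.
type Results = { score: number; scenarios: Record<string, boolean> };

// Flag the three regression signals listed above.
function detectRegressions(baseline: Results, current: Results): string[] {
  const flags: string[] = [];
  if (current.score < baseline.score) {
    flags.push(`Score dropped: ${baseline.score} -> ${current.score}`);
  }
  for (const [name, passed] of Object.entries(current.scenarios)) {
    if (passed) continue;
    const before = baseline.scenarios[name];
    if (before === true) flags.push(`Scenario got worse: ${name}`);
    if (before === undefined) flags.push(`New failure: ${name}`);
  }
  return flags;
}
```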
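
And the worked custom-criterion example referenced in Adding custom criteria: a hypothetical `max_words` check. The parameter names below are assumptions about how `evaluateCriterion` passes data; adapt them to the real function in `run-eval.ts`:

```typescript
// Hypothetical custom criterion: pass when the output has at most N words.
// Parameter names are assumptions, not the real evaluateCriterion signature.
function maxWords(output: string, limit: number): boolean {
  return output.trim().split(/\s+/).filter(Boolean).length <= limit;
}

// Wired into evaluateCriterion's switch, something like:
// case 'max_words':
//   return maxWords(output, Number(criterion.value));
```

With a config entry such as:

```json
{ "type": "max_words", "value": 50, "description": "At most 50 words" }
```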