# /eval — AI Output Evaluation Skill for Claude Code

Automated evaluation suite for any AI-powered feature. Works with chat agents, content generators, classifiers, summarizers, and any system that takes input and produces AI output.

## What it does

1. Asks you what "good" looks like for your AI feature
2. Generates test scenarios and pass/fail criteria
3. Runs the scenarios against your API
4. Scores results and detects regressions
5. Helps you fix failures

## Install

Copy the `eval/` directory into your project's `.claude/skills/`:

```bash
cp -r eval/ your-project/.claude/skills/eval/
```

Or clone and symlink:

```bash
git clone https://github.com/nichetools/ai-marketing-skills.git
ln -s ai-marketing-skills/eval your-project/.claude/skills/eval
```

## Quick start

1. In Claude Code, type `/eval`
2. Answer the questions about what your AI does and what good/bad looks like
3. The skill generates `eval.config.json` with test scenarios
4. It runs the scenarios and shows you the results
5. Fix failures, re-run, repeat

## Running evals manually

```bash
# Run all scenarios
npx tsx .claude/skills/eval/run-eval.ts

# Use a specific config
npx tsx .claude/skills/eval/run-eval.ts --config my-eval.config.json

# See full AI outputs
npx tsx .claude/skills/eval/run-eval.ts --verbose

# Save current results as baseline
npx tsx .claude/skills/eval/run-eval.ts --baseline
```

## Supported AI types

| Type | How it works |
|------|-------------|
| **Chat agent** | Sends multi-turn messages, evaluates full conversation |
| **Content generator** | Sends single input, evaluates output quality |
| **Classifier** | Sends inputs with known labels, checks accuracy |
| **Summarizer** | Sends documents, checks summary quality |
| **Custom** | Any input/output API you define |
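
For instance, a classifier scenario might pair an input with a criterion asserting the expected label appears in the output. A sketch only: the top-level `input` field is an assumption for non-chat types (chat scenarios use `messages`, as shown under **Config format** below); check `eval.config.example.json` for the exact per-type schema.

```json
{
  "name": "spam_detection",
  "input": "Congratulations, you won a free cruise! Click here.",
  "criteria": [
    { "type": "contains", "value": "spam", "description": "Labels obvious spam as spam" }
  ]
}
```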

## Criterion types

| Type | What it checks |
|------|---------------|
| `contains` | Output includes a string (case-insensitive) |
| `not_contains` | Output does NOT include a string |
| `regex` | Output matches a regex pattern |
| `max_length` | Output is under N characters |
| `min_length` | Output is at least N characters |
| `max_sentences` | Output is under N sentences |
| `response_time` | API responds within N milliseconds |
| `json_valid` | Output is valid JSON |
| `json_field_equals` | A JSON field equals an expected value |
| `no_hallucination` | Output doesn't contain claims not in the input |
| `preserves_names` | Key names from input appear in output |
| `preserves_numbers` | Key numbers from input appear in output |
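
A scenario can combine several criteria. A minimal sketch, assuming the numeric checks take a number in `value` the same way `contains` takes a string (the config example below shows the confirmed `contains` shape):

```json
"criteria": [
  { "type": "contains", "value": "refund", "description": "Mentions the refund policy" },
  { "type": "not_contains", "value": "guarantee", "description": "Avoids overpromising" },
  { "type": "max_length", "value": 600, "description": "Stays under 600 characters" },
  { "type": "response_time", "value": 3000, "description": "Responds within 3 seconds" }
]
```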

## Config format

```json
{
  "name": "My AI Feature Eval",
  "type": "chat",
  "endpoint": "https://my-api.com/chat",
  "method": "POST",
  "headers": { "Content-Type": "application/json" },
  "request_template": {
    "message": "{{message}}",
    "history": "{{history}}"
  },
  "response_field": "response",
  "threshold": 80,
  "scenarios": [
    {
      "name": "basic_greeting",
      "messages": ["Hello"],
      "criteria": [
        { "type": "contains", "value": "help", "description": "Offers help" }
      ]
    }
  ]
}
```

See `eval.config.example.json` for a full example.
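
The `{{message}}` and `{{history}}` placeholders in `request_template` are filled in from each scenario before the request is sent. As an illustration of the idea only (this is not the actual `run-eval.ts` code), the substitution could look like:

```typescript
// Illustrative sketch, not the actual run-eval.ts implementation.
// Recursively walks a request template, swapping {{placeholder}} strings
// for per-scenario values.
function fillTemplate(template: unknown, values: Record<string, unknown>): unknown {
  if (typeof template === "string") {
    // A field that is exactly one placeholder gets the raw value,
    // so arrays and objects (like a chat history) survive intact.
    const exact = template.match(/^\{\{(\w+)\}\}$/);
    if (exact) return values[exact[1]];
    // Otherwise substitute inline as text.
    return template.replace(/\{\{(\w+)\}\}/g, (_, key) => String(values[key] ?? ""));
  }
  if (Array.isArray(template)) return template.map((item) => fillTemplate(item, values));
  if (template && typeof template === "object") {
    return Object.fromEntries(
      Object.entries(template).map(([key, val]) => [key, fillTemplate(val, values)])
    );
  }
  return template;
}

// fillTemplate({ message: "{{message}}", history: "{{history}}" },
//              { message: "Hello", history: [] })
// => { message: "Hello", history: [] }
```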

## Regression detection

Save a baseline after your first passing run:

```bash
npx tsx .claude/skills/eval/run-eval.ts --baseline
```

Future runs automatically compare against the baseline and flag:
- Score drops
- Individual scenarios that got worse
- New failures that didn't exist before
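
Conceptually, the comparison is a per-scenario diff against the saved scores. A sketch of the idea (again, an illustration rather than the actual implementation, and the result shapes are assumed):

```typescript
// Illustrative sketch of baseline comparison -- the shapes here are
// assumptions, not the real run-eval.ts types.
interface EvalRun {
  overall: number;                    // suite score, 0-100
  scenarios: Record<string, number>;  // per-scenario scores
}

function findRegressions(baseline: EvalRun, current: EvalRun): string[] {
  const flags: string[] = [];
  if (current.overall < baseline.overall) {
    flags.push(`Overall score dropped: ${baseline.overall} -> ${current.overall}`);
  }
  for (const [name, score] of Object.entries(current.scenarios)) {
    const before = baseline.scenarios[name];
    if (before !== undefined && score < before) {
      flags.push(`Scenario "${name}" got worse: ${before} -> ${score}`);
    }
  }
  return flags;
}
```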

## When to run evals

- After every prompt or model change
- Before deploying to production (see the gate sketch below)
- Weekly to catch quality drift from content/data changes
- When onboarding a new AI feature
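
For the pre-deploy case, one option is a gate in your deploy script. This hypothetical sketch assumes `run-eval.ts` exits non-zero when the suite fails its threshold; verify that against your copy before relying on it:

```bash
# Hypothetical pre-deploy gate: assumes a non-zero exit code on failure.
npx tsx .claude/skills/eval/run-eval.ts || { echo "Evals failed; aborting deploy"; exit 1; }
```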

## Adding custom criteria

Edit `run-eval.ts` and add a case to the `evaluateCriterion` function:

```typescript
case 'my_custom_check':
  // Your logic here: return true when the output passes the check,
  // false when it fails. `someCondition` is a placeholder.
  return someCondition;
```

Then use it in your config:

```json
{ "type": "my_custom_check", "value": "whatever", "description": "My check" }
```

## Philosophy

> "Don't ship prompts without evals. It's the AI equivalent of shipping code without tests."

Manual testing is important for tone and feel. But automated evals catch regressions, enforce quality standards, and give you a score to track over time. Use both.