<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <title>Cradicle Explorer</title>
    <link href="/css/bootstrap/bootstrap.min.css" rel="stylesheet">
    <style>
      .form-control-dark::placeholder {
          color: #aaa;
          opacity: 1;
      }
    </style>
    <link rel="stylesheet" href="/assets/fontawesome/css/all.min.css">
    <link rel="icon" type="image/png" href="/favicon.png">


                <link href="/css/dashboard.css" rel="stylesheet">
                </head>
                <body>
                <header class="navbar navbar-dark sticky-top bg-dark flex-md-nowrap p-0 shadow">
                  <a class="navbar-brand col-md-3 col-lg-2 me-0 px-3 fs-6" href="/">Cradicle Explorer</a>
                  <button class="navbar-toggler position-absolute d-md-none collapsed" type="button" data-bs-toggle="collapse" data-bs-target="#sidebarMenu" aria-controls="sidebarMenu" aria-expanded="false" aria-label="Toggle navigation">
                    <span class="navbar-toggler-icon"></span>
                  </button>
                  <form method="get" action="/cgi-bin/main" style="width:100%;"><input class="form-control form-control-dark w-100 rounded-0 border-0" type="text" name="q" placeholder="Search repos" aria-label="Search"></form>
                  <div class="navbar-nav flex-row">
                    <div class="nav-item text-nowrap">
                      <a class="nav-link px-3 active" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw">ai-ops-automation_ai-marketing-skills</a>
                    </div>
                  </div>
                </header>
                <div class="container-fluid">
                  <div class="row">
                    <nav id="sidebarMenu" class="col-md-3 col-lg-2 d-md-block bg-dark sidebar collapse">
                      <div class="position-sticky pt-3 sidebar-sticky">
                        <ul class="nav flex-column">
                          <li class="nav-item">
                            <a class="nav-link" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw">
                              <i class="align-text-bottom fa-solid fa-info"></i>
                              Info
                            </a>
                          </li>
                          <li class="nav-item">
                            <a class="nav-link" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&issue=list">
                              <i class="align-text-bottom fa-solid fa-layer-group"></i>
                              Issues
                            </a>
                          </li>
                          <li class="nav-item">
                            <a class="nav-link" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&patch=list">
                              <i class="align-text-bottom fa-solid fa-vest-patches"></i>
                              Patches
                            </a>
                          </li>
                          <li class="nav-item">
                            <a class="nav-link" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&wallet=list">
                              <i class="align-text-bottom fa-solid fa-wallet"></i>
                              Wallets
                            </a>
                          </li>
                          <li class="nav-item">
                            <a class="nav-link active" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=.">
                              <i class="align-text-bottom fa-solid fa-code"></i>
                              Source
                            </a>
                          </li>
                        </ul>
                        <h6 class="sidebar-heading d-flex justify-content-between align-items-center px-3 mt-4 mb-1 text-muted text-uppercase">
                          <span></span>
                        </h6>
                        <ul class="nav flex-column mb-2">
                        
    <h6 class="sidebar-heading d-flex justify-content-between align-items-center px-3 mt-1 mb-1 text-muted text-uppercase">
      <span>Source</span>
    </h6>
    <li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=autoresearch"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> autoresearch</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=clone-site"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> clone-site</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=content-eval"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> content-eval</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=content-ops"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> content-ops</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=conversion-ops"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> conversion-ops</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=deck-generator"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> deck-generator</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=eval"><i class="fa-solid fa-folder-open" style="color:#f0c040;"></i> eval</a></li><li><a class="nav-link py-0 active" style="padding-left:32px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&file=eval%2FCLAUDE.md"><i class="fa-solid fa-file" style="color:#888;"></i> CLAUDE.md</a></li><li><a class="nav-link py-0" style="padding-left:32px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&file=eval%2FREADME.md"><i class="fa-solid fa-file" style="color:#888;"></i> README.md</a></li><li><a class="nav-link py-0" style="padding-left:32px;" 
href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&file=eval%2Feval.config.example.json"><i class="fa-solid fa-file" style="color:#888;"></i> eval.config.example.json</a></li><li><a class="nav-link py-0" style="padding-left:32px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&file=eval%2Frun-eval.ts"><i class="fa-solid fa-file" style="color:#888;"></i> run-eval.ts</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=finance-ops"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> finance-ops</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=growth-engine"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> growth-engine</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=lead-dossier"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> lead-dossier</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=outbound-engine"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> outbound-engine</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=podcast-ops"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> podcast-ops</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=revenue-intelligence"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> revenue-intelligence</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=sales-pipeline"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> sales-pipeline</a></li><li><a class="nav-link py-0" style="padding-left:16px;" 
href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=sales-playbook"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> sales-playbook</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=security"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> security</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=seo-ops"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> seo-ops</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=short-form-pipeline"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> short-form-pipeline</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=team-ops"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> team-ops</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=telemetry"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> telemetry</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=video-caption-generator"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> video-caption-generator</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=video-clip-pipeline"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> video-clip-pipeline</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=x-longform-post"><i class="fa-solid fa-folder" style="color:#f0c040;"></i> x-longform-post</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=yt-competitive-analysis"><i 
class="fa-solid fa-folder" style="color:#f0c040;"></i> yt-competitive-analysis</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&file=.gitignore"><i class="fa-solid fa-file" style="color:#888;"></i> .gitignore</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&file=CONTRIBUTING.md"><i class="fa-solid fa-file" style="color:#888;"></i> CONTRIBUTING.md</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&file=LICENSE"><i class="fa-solid fa-file" style="color:#888;"></i> LICENSE</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&file=Makefile"><i class="fa-solid fa-file" style="color:#888;"></i> Makefile</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&file=README.md"><i class="fa-solid fa-file" style="color:#888;"></i> README.md</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&file=VERSION"><i class="fa-solid fa-file" style="color:#888;"></i> VERSION</a></li><li><a class="nav-link py-0" style="padding-left:16px;" href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&file=skill-safety.yml"><i class="fa-solid fa-file" style="color:#888;"></i> skill-safety.yml</a></li>
    
                        </ul>
                      </div>
                    </nav>
                <main class="col-md-9 ms-sm-auto col-lg-10">
                  <div class="container px-1 py-3">
        
<div class="mb-2" style="font-size:1.1rem;"><a href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=.">/</a> <a href="/cgi-bin/repo?id=z3VeDLzZu3FFvota9WFbezAjRHibw&source=eval">eval</a> / CLAUDE.md</div>
        <div class="list-group">
        <div class="list-group-item">
        <div class="mb-2" style="font-weight:bold;"><i class="fa-solid fa-file"></i> CLAUDE.md</div>
        <pre style="margin:0; font-size:0.85rem; overflow-x:auto; color:#fafafa;"><span style="color:#666; user-select:none;">  1</span>  # /eval — AI Output Evaluation Suite
<span style="color:#666; user-select:none;">  2</span>  
<span style="color:#666; user-select:none;">  3</span>  Evaluate any AI-powered feature: chat agents, content generators, classifiers, summarizers, code generators, or any system that takes input and produces AI output. The skill defines what &quot;good&quot; looks like through a conversation with the user, generates test cases, runs them, and scores the results.
<span style="color:#666; user-select:none;">  4</span>  
<span style="color:#666; user-select:none;">  5</span>  ## When to use
<span style="color:#666; user-select:none;">  6</span>  - After changing prompts, models, or system instructions
<span style="color:#666; user-select:none;">  7</span>  - Before deploying any AI feature to production
<span style="color:#666; user-select:none;">  8</span>  - Weekly to catch quality drift
<span style="color:#666; user-select:none;">  9</span>  - When defining quality standards for a new AI feature
<span style="color:#666; user-select:none;"> 10</span>  
<span style="color:#666; user-select:none;"> 11</span>  ## How to invoke
<span style="color:#666; user-select:none;"> 12</span>  
<span style="color:#666; user-select:none;"> 13</span>  ```
<span style="color:#666; user-select:none;"> 14</span>  /eval                          # Start fresh or run existing config
<span style="color:#666; user-select:none;"> 15</span>  /eval --run                    # Skip setup, run existing eval.config.json
<span style="color:#666; user-select:none;"> 16</span>  /eval --verbose                # Show full outputs during run
<span style="color:#666; user-select:none;"> 17</span>  /eval --baseline               # Save results as baseline for future comparisons
<span style="color:#666; user-select:none;"> 18</span>  ```
<span style="color:#666; user-select:none;"> 19</span>  
<span style="color:#666; user-select:none;"> 20</span>  ## Instructions for Claude
<span style="color:#666; user-select:none;"> 21</span>  
<span style="color:#666; user-select:none;"> 22</span>  When the user invokes `/eval`:
<span style="color:#666; user-select:none;"> 23</span>  
<span style="color:#666; user-select:none;"> 24</span>  ### Step 1: Check for existing config
<span style="color:#666; user-select:none;"> 25</span>  
<span style="color:#666; user-select:none;"> 26</span>  ```bash
<span style="color:#666; user-select:none;"> 27</span>  [ -f eval.config.json ] &amp;&amp; echo &quot;CONFIG_EXISTS&quot; || echo &quot;NO_CONFIG&quot;
<span style="color:#666; user-select:none;"> 28</span>  ```
<span style="color:#666; user-select:none;"> 29</span>  
<span style="color:#666; user-select:none;"> 30</span>  - If `CONFIG_EXISTS` and the user did NOT pass `--run`: ask &quot;You have an existing eval config with N scenarios. Want to run it, update it, or start fresh?&quot;
<span style="color:#666; user-select:none;"> 31</span>  - If `CONFIG_EXISTS` and the user passed `--run`: skip to Step 4.
<span style="color:#666; user-select:none;"> 32</span>  - If `NO_CONFIG`: proceed to Step 2.
<span style="color:#666; user-select:none;"> 33</span>  
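The branching in Step 1 can be sketched as a small decision function. This is only an illustration, not part of the skill: `nextStep` and the `NextStep` labels are hypothetical names, and the sketch assumes the only inputs are whether `eval.config.json` exists and whether `--run` was passed.

```typescript
// Hypothetical sketch of the Step 1 branching; all names are illustrative.
type NextStep = "ask_user" | "run_step_4" | "setup_step_2";

function nextStep(configExists: boolean, passedRun: boolean): NextStep {
  if (!configExists) {
    // NO_CONFIG: there is nothing to run yet, so gather requirements first
    return "setup_step_2";
  }
  // CONFIG_EXISTS: --run skips straight to the runner, otherwise ask the user
  return passedRun ? "run_step_4" : "ask_user";
}

console.log(nextStep(true, false)); // ask_user
console.log(nextStep(true, true)); // run_step_4
console.log(nextStep(false, true)); // setup_step_2
```
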
<span style="color:#666; user-select:none;"> 34</span>  ### Step 2: Understand what you&#x27;re evaluating
<span style="color:#666; user-select:none;"> 35</span>  
<span style="color:#666; user-select:none;"> 36</span>  Ask the user these questions ONE AT A TIME via AskUserQuestion. The first question determines the flow for the rest.
<span style="color:#666; user-select:none;"> 37</span>  
<span style="color:#666; user-select:none;"> 38</span>  **Q1: &quot;What type of AI output are you evaluating?&quot;**
<span style="color:#666; user-select:none;"> 39</span>  Options:
<span style="color:#666; user-select:none;"> 40</span>  - A) Chat agent / conversational AI (multi-turn: user sends messages, agent responds)
<span style="color:#666; user-select:none;"> 41</span>  - B) Content generator (single input, long-form output: blog posts, emails, plans)
<span style="color:#666; user-select:none;"> 42</span>  - C) Classifier / scorer (input data, output label or score)
<span style="color:#666; user-select:none;"> 43</span>  - D) Summarizer / extractor (input document, output summary or structured data)
<span style="color:#666; user-select:none;"> 44</span>  - E) Something else (describe it)
<span style="color:#666; user-select:none;"> 45</span>  
<span style="color:#666; user-select:none;"> 46</span>  **Q2: &quot;Describe what it does in one sentence.&quot;**
<span style="color:#666; user-select:none;"> 47</span>  Example: &quot;It generates SEO-optimized blog post outlines from a keyword.&quot;
<span style="color:#666; user-select:none;"> 48</span>  
<span style="color:#666; user-select:none;"> 49</span>  **Q3 (varies by type):**
<span style="color:#666; user-select:none;"> 50</span>  
<span style="color:#666; user-select:none;"> 51</span>  For CHAT AGENTS (A):
<span style="color:#666; user-select:none;"> 52</span>  - &quot;What&#x27;s the API endpoint? How do I send a message and get a response?&quot;
<span style="color:#666; user-select:none;"> 53</span>  - &quot;What should a good conversation look like? List 3-5 things the agent should always do.&quot;
<span style="color:#666; user-select:none;"> 54</span>  - &quot;What should it never do?&quot;
<span style="color:#666; user-select:none;"> 55</span>  - &quot;Who are ideal users vs. who should be turned away?&quot;
<span style="color:#666; user-select:none;"> 56</span>  
<span style="color:#666; user-select:none;"> 57</span>  For CONTENT GENERATORS (B):
<span style="color:#666; user-select:none;"> 58</span>  - &quot;What&#x27;s the API endpoint or function? What input does it take?&quot;
<span style="color:#666; user-select:none;"> 59</span>  - &quot;What makes the output GOOD? (length, tone, structure, accuracy, keywords)&quot;
<span style="color:#666; user-select:none;"> 60</span>  - &quot;What makes the output BAD? (hallucinations, wrong tone, too short/long, missing sections)&quot;
<span style="color:#666; user-select:none;"> 61</span>  - &quot;Show me one example of good output if you have it.&quot;
<span style="color:#666; user-select:none;"> 62</span>  
<span style="color:#666; user-select:none;"> 63</span>  For CLASSIFIERS (C):
<span style="color:#666; user-select:none;"> 64</span>  - &quot;What&#x27;s the API endpoint or function?&quot;
<span style="color:#666; user-select:none;"> 65</span>  - &quot;What are the possible output labels/scores?&quot;
<span style="color:#666; user-select:none;"> 66</span>  - &quot;Do you have labeled test data (known correct answers)?&quot;
<span style="color:#666; user-select:none;"> 67</span>  - &quot;What&#x27;s the cost of a false positive vs. false negative?&quot;
<span style="color:#666; user-select:none;"> 68</span>  
<span style="color:#666; user-select:none;"> 69</span>  For SUMMARIZERS (D):
<span style="color:#666; user-select:none;"> 70</span>  - &quot;What&#x27;s the API endpoint or function?&quot;
<span style="color:#666; user-select:none;"> 71</span>  - &quot;What should the summary include? What should it exclude?&quot;
<span style="color:#666; user-select:none;"> 72</span>  - &quot;What&#x27;s the max length?&quot;
<span style="color:#666; user-select:none;"> 73</span>  - &quot;Should it preserve specific details (names, numbers, dates)?&quot;
<span style="color:#666; user-select:none;"> 74</span>  
<span style="color:#666; user-select:none;"> 75</span>  For OTHER (E):
<span style="color:#666; user-select:none;"> 76</span>  - &quot;Walk me through: what goes in, what comes out?&quot;
<span style="color:#666; user-select:none;"> 77</span>  - &quot;How do you know when the output is good vs. bad?&quot;
<span style="color:#666; user-select:none;"> 78</span>  - &quot;What are the failure modes you&#x27;re worried about?&quot;
<span style="color:#666; user-select:none;"> 79</span>  
<span style="color:#666; user-select:none;"> 80</span>  ### Step 3: Generate eval config
<span style="color:#666; user-select:none;"> 81</span>  
<span style="color:#666; user-select:none;"> 82</span>  Based on the user&#x27;s answers, generate test cases appropriate to the type:
<span style="color:#666; user-select:none;"> 83</span>  
<span style="color:#666; user-select:none;"> 84</span>  **Chat agents:** 10-20 multi-turn conversation scenarios (qualified users, unqualified users, edge cases, product knowledge tests, hostile users, capability boundaries)
<span style="color:#666; user-select:none;"> 85</span>  
<span style="color:#666; user-select:none;"> 86</span>  **Content generators:** 10-15 input variations testing different topics, edge cases, and quality dimensions (accuracy, tone, length, structure, keyword inclusion)
<span style="color:#666; user-select:none;"> 87</span>  
<span style="color:#666; user-select:none;"> 88</span>  **Classifiers:** 20-30 test inputs with known correct labels, covering each class, edge cases, and adversarial inputs
<span style="color:#666; user-select:none;"> 89</span>  
<span style="color:#666; user-select:none;"> 90</span>  **Summarizers:** 10-15 test documents of varying length and complexity, checking for completeness, accuracy, length compliance, and hallucination
<span style="color:#666; user-select:none;"> 91</span>  
<span style="color:#666; user-select:none;"> 92</span>  **For all types, generate criteria based on:**
<span style="color:#666; user-select:none;"> 93</span>  - Things the user said make output GOOD -&gt; `contains`, `regex`, `max_length` checks
<span style="color:#666; user-select:none;"> 94</span>  - Things the user said make output BAD -&gt; `not_contains` checks
<span style="color:#666; user-select:none;"> 95</span>  - Type-specific quality checks (see criterion types below)
<span style="color:#666; user-select:none;"> 96</span>  
<span style="color:#666; user-select:none;"> 97</span>  Write the config to `eval.config.json`. Show the user and ask: &quot;Does this cover what matters? Want to add or change anything?&quot;
<span style="color:#666; user-select:none;"> 98</span>  
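The GOOD and BAD mapping in Step 3 might look roughly like the sketch below. It assumes the user's answers have already been reduced to short literal phrases, and the `Criterion` shape (`type` plus `value`) and the `criteriaFromAnswers` helper are hypothetical, since the exact criterion JSON is not shown in this document.

```typescript
// Hypothetical criterion shape; only the type names come from this document.
type Criterion = { type: "contains" | "not_contains"; value: string };

// Phrases the user wants present become `contains` checks;
// phrases the user wants absent become `not_contains` checks.
function criteriaFromAnswers(good: string[], bad: string[]): Criterion[] {
  const criteria: Criterion[] = [];
  for (const value of good) {
    criteria.push({ type: "contains", value });
  }
  for (const value of bad) {
    criteria.push({ type: "not_contains", value });
  }
  return criteria;
}

console.log(JSON.stringify(criteriaFromAnswers(["pricing"], ["guarantee"])));
```

Requirements that are not literal phrases (length limits, structure, tone) would need `regex`, `max_length`, or one of the type-specific checks instead.
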
<span style="color:#666; user-select:none;"> 99</span>  ### Step 4: Run the evals
<span style="color:#666; user-select:none;">100</span>  
<span style="color:#666; user-select:none;">101</span>  ```bash
<span style="color:#666; user-select:none;">102</span>  npx tsx .claude/skills/eval/run-eval.ts [--config eval.config.json] [--verbose] [--baseline]
<span style="color:#666; user-select:none;">103</span>  ```
<span style="color:#666; user-select:none;">104</span>  
<span style="color:#666; user-select:none;">105</span>  The runner handles all types. For chat agents, it sends messages sequentially and evaluates the full conversation. For single-input types, it sends one request per scenario and evaluates the output.
<span style="color:#666; user-select:none;">106</span>  
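A rough sketch of the dispatch described in Step 4, under stated assumptions: the `Scenario` shape with a `messages` array is hypothetical, `send` is a synchronous stand-in for the real HTTP request, and single-input types are assumed to use only the first message. The actual `run-eval.ts` may be structured differently.

```typescript
// Hypothetical shapes: the real scenario schema is not shown in this document.
type Scenario = { name: string; messages: string[] };
type Turn = { role: "user" | "assistant"; content: string };
// Stand-in for the HTTP call; it receives the conversation history so far.
type Send = (history: Turn[]) => string;

function runScenario(type: string, scenario: Scenario, send: Send): Turn[] {
  const history: Turn[] = [];
  // chat: send every message in order; other types: one request per scenario
  const inputs = type === "chat" ? scenario.messages : scenario.messages.slice(0, 1);
  for (const content of inputs) {
    history.push({ role: "user", content });
    history.push({ role: "assistant", content: send(history) });
  }
  return history; // criteria would then be evaluated against this transcript
}

const fakeSend: Send = (history) => "reply after " + history.length + " turn(s)";
const turns = runScenario(
  "chat",
  { name: "pricing", messages: ["Hi", "What does it cost?"] },
  fakeSend
);
console.log(turns.length); // 4: two user turns and two replies
```
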
<span style="color:#666; user-select:none;">107</span>  ### Step 5: Report results
<span style="color:#666; user-select:none;">108</span>  
<span style="color:#666; user-select:none;">109</span>  1. Summary table (scenario x criterion, pass/fail)
<span style="color:#666; user-select:none;">110</span>  2. Overall score: &quot;X/Y criteria passed (Z%)&quot;
<span style="color:#666; user-select:none;">111</span>  3. If baseline exists: &quot;Score changed from A% to B%&quot;
<span style="color:#666; user-select:none;">112</span>  4. Regressions: scenarios that got worse since last run
<span style="color:#666; user-select:none;">113</span>  5. Top 3 failures with diagnosis
<span style="color:#666; user-select:none;">114</span>  6. Recommendation: fix or ship
<span style="color:#666; user-select:none;">115</span>  
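Items 2 and 3 of the Step 5 report reduce to simple arithmetic. A minimal sketch, assuming a hypothetical flat list of pass/fail results and rounding to a whole percent (this document does not specify rounding or the layout of the results file):

```typescript
// Hypothetical result shape; the real results file layout is not shown here.
type Result = { scenario: string; criterion: string; passed: boolean };

function summarize(results: Result[], baselinePct?: number): string[] {
  const passed = results.filter((r) => r.passed).length;
  const total = results.length;
  // Assumption: whole-percent rounding, and 0% when there are no results
  const pct = total === 0 ? 0 : Math.round((passed / total) * 100);
  const lines = [`${passed}/${total} criteria passed (${pct}%)`];
  if (baselinePct !== undefined) {
    lines.push(`Score changed from ${baselinePct}% to ${pct}%`);
  }
  return lines;
}

const results: Result[] = [
  { scenario: "outline", criterion: "contains", passed: true },
  { scenario: "outline", criterion: "max_length", passed: true },
  { scenario: "edge case", criterion: "contains", passed: false },
  { scenario: "edge case", criterion: "regex", passed: true },
];
console.log(summarize(results, 50).join("\n"));
// 3/4 criteria passed (75%)
// Score changed from 50% to 75%
```
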
<span style="color:#666; user-select:none;">116</span>  ### Step 6: Iterate
<span style="color:#666; user-select:none;">117</span>  
<span style="color:#666; user-select:none;">118</span>  If failures exist:
<span style="color:#666; user-select:none;">119</span>  1. Read the failing output from `eval-results.json`
<span style="color:#666; user-select:none;">120</span>  2. Diagnose the root cause (prompt issue, missing data, model limitation)
<span style="color:#666; user-select:none;">121</span>  3. Suggest a fix
<span style="color:#666; user-select:none;">122</span>  4. After the fix is applied, re-run to verify
<span style="color:#666; user-select:none;">123</span>  
<span style="color:#666; user-select:none;">124</span>  ## Config format
<span style="color:#666; user-select:none;">125</span>  
<span style="color:#666; user-select:none;">126</span>  ```json
<span style="color:#666; user-select:none;">127</span>  {
<span style="color:#666; user-select:none;">128</span>    &quot;name&quot;: &quot;My AI Feature Eval&quot;,
<span style="color:#666; user-select:none;">129</span>    &quot;type&quot;: &quot;chat | content | classifier | summarizer | custom&quot;,
<span style="color:#666; user-select:none;">130</span>    &quot;endpoint&quot;: &quot;https://my-api.com/endpoint&quot;,
<span style="color:#666; user-select:none;">131</span>    &quot;method&quot;: &quot;POST&quot;,
<span style="color:#666; user-select:none;">132</span>    &quot;headers&quot;: {},
<span style="color:#666; user-select:none;">133</span>    &quot;request_template&quot;: {},
<span style="color:#666; user-select:none;">134</span>    &quot;response_field&quot;: &quot;response&quot;,
<span style="color:#666; user-select:none;">135</span>    &quot;threshold&quot;: 80,
<span style="color:#666; user-select:none;">136</span>    &quot;good_behaviors&quot;: [],
<span style="color:#666; user-select:none;">137</span>    &quot;bad_behaviors&quot;: [],
<span style="color:#666; user-select:none;">138</span>    &quot;scenarios&quot;: []
<span style="color:#666; user-select:none;">139</span>  }
<span style="color:#666; user-select:none;">140</span>  ```
<span style="color:#666; user-select:none;">141</span>  
<span style="color:#666; user-select:none;">142</span>  The `type` field determines how scenarios are executed:
<span style="color:#666; user-select:none;">143</span>  - `chat`: multi-turn (sends messages sequentially, maintains history)
<span style="color:#666; user-select:none;">144</span>  - `content | classifier | summarizer | custom`: single-turn (one request per scenario)
<span style="color:#666; user-select:none;">145</span>  
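For readers who prefer types, the config format can be written down as a TypeScript interface. The field names come from the JSON in this section; the value types are assumptions inferred from the example values, and `executionMode` is a hypothetical helper that restates the `type` rule.

```typescript
// Field names come from the config format; the value types are assumptions.
type EvalType = "chat" | "content" | "classifier" | "summarizer" | "custom";

interface EvalConfig {
  name: string;
  type: EvalType;
  endpoint: string;
  method: string;
  headers: { [key: string]: string };
  request_template: { [key: string]: unknown };
  response_field: string;
  threshold: number;
  good_behaviors: string[];
  bad_behaviors: string[];
  scenarios: unknown[]; // the scenario schema is not shown in this document
}

// chat is multi-turn; every other type is single-turn
function executionMode(type: EvalType): "multi-turn" | "single-turn" {
  return type === "chat" ? "multi-turn" : "single-turn";
}

const example: EvalConfig = {
  name: "My AI Feature Eval",
  type: "content",
  endpoint: "https://my-api.com/endpoint",
  method: "POST",
  headers: {},
  request_template: {},
  response_field: "response",
  threshold: 80,
  good_behaviors: [],
  bad_behaviors: [],
  scenarios: [],
};
console.log(executionMode(example.type)); // single-turn
```
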
<span style="color:#666; user-select:none;">146</span>  ## Criterion types
<span style="color:#666; user-select:none;">147</span>  
<span style="color:#666; user-select:none;">148</span>  | Type | Works for | Description |
<span style="color:#666; user-select:none;">149</span>  |------|-----------|-------------|
<span style="color:#666; user-select:none;">150</span>  | `contains` | All | Output contains a string (case-insensitive) |
<span style="color:#666; user-select:none;">151</span>  | `not_contains` | All | Output does NOT contain a string |
<span style="color:#666; user-select:none;">152</span>  | `regex` | All | Output matches a regex pattern |
<span style="color:#666; user-select:none;">153</span>  | `max_length` | All | Output is under N characters |
<span style="color:#666; user-select:none;">154</span>  | `min_length` | Content | Output is at least N characters |
<span style="color:#666; user-select:none;">155</span>  | `max_sentences` | Chat, Content | Output is under N sentences |
<span style="color:#666; user-select:none;">156</span>  | `response_time` | All | API responds within N milliseconds |
<span style="color:#666; user-select:none;">157</span>  | `json_valid` | Classifier, Custom | Output is valid JSON |
<span style="color:#666; user-select:none;">158</span>  | `json_field_equals` | Classifier | A JSON field equals an expected value |
<span style="color:#666; user-select:none;">159</span>  | `no_hallucination` | Content, Summarizer | Output doesn&#x27;t contain claims not in the input |
<span style="color:#666; user-select:none;">160</span>  | `preserves_names` | Summarizer | Key names from input appear in output |
<span style="color:#666; user-select:none;">161</span>  | `preserves_numbers` | Summarizer | Key numbers from input appear in output |
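
To make the simpler rows of the table concrete, here is a minimal sketch of how a few criterion types could be checked. It is not the runner's actual code: the `Criterion` shape (`type`, `value`, `field`) and the `check` and `parseJson` helpers are hypothetical, `not_contains` is assumed to be case-insensitive like `contains`, and "under N characters" is read as strictly fewer than N.

```typescript
// Hypothetical criterion shape: the table lists types, not the exact JSON fields.
type Criterion = { type: string; value?: string | number; field?: string };

// Parse without throwing, so invalid JSON becomes a failed check.
function parseJson(text: string): { ok: boolean; data: unknown } {
  try {
    return { ok: true, data: JSON.parse(text) };
  } catch {
    return { ok: false, data: undefined };
  }
}

function check(c: Criterion, output: string): boolean {
  switch (c.type) {
    case "contains": // case-insensitive, per the table
      return output.toLowerCase().includes(String(c.value).toLowerCase());
    case "not_contains": // assumption: also case-insensitive
      return !output.toLowerCase().includes(String(c.value).toLowerCase());
    case "regex":
      return new RegExp(String(c.value)).test(output);
    case "max_length": // strictly fewer than N characters
      return Number(c.value) > output.length;
    case "min_length": // at least N characters
      return output.length >= Number(c.value);
    case "json_valid":
      return parseJson(output).ok;
    case "json_field_equals": {
      const parsed = parseJson(output);
      if (!parsed.ok || typeof parsed.data !== "object" || parsed.data === null) {
        return false;
      }
      const obj = parsed.data as { [key: string]: unknown };
      return obj[String(c.field)] === c.value;
    }
    default:
      // The remaining types need sentence counting, timing, or comparison
      // against the input, so they are left out of this sketch.
      throw new Error("unsupported criterion type: " + c.type);
  }
}

console.log(check({ type: "contains", value: "SEO" }, "An seo outline")); // true
console.log(check({ type: "max_length", value: 5 }, "hello")); // false
```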
</pre>
        </div>
        </div>

</div>
</main>
</div>
</div>


</body>
</html>

