prompt_optimization_code_review_example.ipynb
# Prompt Optimization with Evidently: Code Review Quality Classifier

This tutorial demonstrates how to use Evidently's `PromptOptimizer` API to optimize prompts for LLM judges. We'll walk through optimizing a prompt that classifies the quality of code reviews written for junior developers.

## What you'll learn:
- How to set up a dataset for LLM evaluation
- How to define an LLM judge with a prompt template
- How to run the prompt optimization loop
- How to retrieve and inspect the best-performing prompt

```python
# If you haven't installed the required packages yet:
# !pip install evidently openai pandas
```

```python
import pandas as pd

from evidently import Dataset, DataDefinition, LLMClassification
from evidently.llm.templates import BinaryClassificationPromptTemplate
from evidently.descriptors import LLMEval
from evidently.llm.optimization import PromptOptimizer
```

```python
# Load your dataset
review_dataset = pd.read_csv("../datasets/code_review.csv")
review_dataset.head()
```

```python
# Define how Evidently should interpret your dataset
dd = DataDefinition(
    text_columns=["Generated review", "Expert comment"],
    categorical_columns=["Expert label"],
    llm=LLMClassification(
        input="Generated review",
        target="Expert label",
        reasoning="Expert comment",
    ),
)
```

```python
# Convert your pandas DataFrame into an Evidently Dataset
dataset = Dataset.from_pandas(review_dataset, data_definition=dd)
```

```python
# Define a prompt template and judge for classifying code review quality
criteria = '''A review is GOOD when it's actionable and constructive.
A review is BAD when it is non-actionable or overly critical.'''

feedback_quality = BinaryClassificationPromptTemplate(
    pre_messages=[("system", "You are evaluating the quality of code reviews given to junior developers.")],
    criteria=criteria,
    target_category="bad",
    non_target_category="good",
    uncertainty="unknown",
    include_reasoning=True,
)

judge = LLMEval(
    alias="Code Review Judge",
    provider="openai",
    model="gpt-4o-mini",
    column_name="Generated review",
    template=feedback_quality,
)
```

```python
# Initialize the optimizer and run the optimization with the "feedback" strategy
optimizer = PromptOptimizer("code_review_example", strategy="feedback", verbose=True)
await optimizer.arun(executor=judge, scorer="accuracy", dataset=dataset, repetitions=5)
# Sync version:
# optimizer.run(executor=judge, scorer="accuracy", dataset=dataset, repetitions=5)
```

```python
# Show the best-performing prompt template found by the optimizer
print(optimizer.best_prompt())
```

```python
# Inspect optimizer statistics and logs
optimizer.print_stats()
```

### LLM Judge from scratch

You can also create an LLM judge from scratch if you have a labeled dataset.

```python
from evidently.llm.optimization import BlankLLMJudge

optimizer = PromptOptimizer("code_review_scratch_example", strategy="feedback", verbose=True)
await optimizer.arun(executor=BlankLLMJudge(), scorer="accuracy", dataset=dataset, repetitions=5)
```

```python
new_judge = optimizer.best_executor()
new_judge
```

```python
optimizer.print_stats()
```

You can save and load the optimized template for later use:

```python
from evidently.descriptors import LLMJudge
from evidently.llm.templates import BaseLLMPromptTemplate

new_judge.template.dump("my_template.yaml")
template = BaseLLMPromptTemplate.load("my_template.yaml")
my_judge = LLMJudge(provider="openai", model="gpt-4o-mini", template=template, input_column="Generated review")
```
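A note on the `await optimizer.arun(...)` calls: top-level `await` works in a notebook because Jupyter already runs an event loop. In a plain Python script you would drive the loop yourself with `asyncio.run`; a minimal sketch, where the coroutine below is a hypothetical stand-in for the real `optimizer.arun(...)` call:

```python
import asyncio

async def run_optimization():
    # Stand-in for: await optimizer.arun(executor=judge, scorer="accuracy",
    #                                    dataset=dataset, repetitions=5)
    await asyncio.sleep(0)  # simulate awaiting the real optimizer call
    return "optimization finished"

# In a script (not inside a notebook), start the event loop explicitly:
result = asyncio.run(run_optimization())
print(result)
```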
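The `scorer="accuracy"` argument scores each candidate prompt by how often the judge's label agrees with the `Expert label` ground truth. The metric itself is just the fraction of matching rows, which you can sanity-check with plain pandas (the `Judge label` values here are made up for illustration):

```python
import pandas as pd

# Hypothetical judge outputs next to the expert ground truth
df = pd.DataFrame({
    "Expert label": ["good", "bad", "good", "bad"],
    "Judge label": ["good", "bad", "bad", "bad"],
})

# Accuracy = fraction of rows where the judge agrees with the expert
acc = (df["Expert label"] == df["Judge label"]).mean()
print(f"accuracy = {acc:.2f}")  # 3 of 4 labels agree -> 0.75
```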
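Conceptually, the "feedback" strategy alternates between scoring a candidate prompt and revising it based on where it went wrong. The toy loop below mimics that shape with a keyword-based stub judge and canned "revisions" — it illustrates the loop structure only, and is not Evidently's actual implementation (which delegates both judging and revising to an LLM):

```python
# Toy feedback-style optimization loop. The "judge" is a stub keyword
# classifier, and "revising" a prompt just appends a feedback hint.

def judge_label(prompt: str, review: str) -> str:
    # Stub judge: flags a review as "bad" if it mentions any term
    # the prompt tells it to watch for.
    cues = [w for w in ("vague", "harsh") if w in prompt]
    return "bad" if any(c in review.lower() for c in cues) else "good"

def accuracy(prompt, rows):
    hits = sum(judge_label(prompt, review) == label for review, label in rows)
    return hits / len(rows)

def optimize(prompt, rows, steps=2):
    # Score the starting prompt, then try one revised candidate per step
    # and keep whichever scores best.
    best, best_score = prompt, accuracy(prompt, rows)
    for hint in ("vague", "harsh")[:steps]:
        candidate = prompt + f" Watch for '{hint}' wording."
        score = accuracy(candidate, rows)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

rows = [
    ("This is vague, fix it", "bad"),
    ("Nice use of list comprehension; consider renaming x", "good"),
    ("Too harsh and not actionable", "bad"),
]
best, score = optimize("Classify review quality.", rows)
print(best)
print(round(score, 2))  # best score here: 2/3
```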