{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "15a7ac63",
   "metadata": {},
   "source": [
    "# Prompt Optimization with Evidently: Code Review Quality Classifier\n",
  9      "This tutorial demonstrates how to use Evidently's new `PromptOptimizer` API for optimizing prompts for LLM judges. \n",
 10      "We'll walk through optimizing a prompt that classifies the quality of code reviews written for junior developers.\n",
 11      "\n",
 12      "## What you'll learn:\n",
 13      "- How to set up a dataset for LLM evaluation\n",
 14      "- How to define an LLM judge with a prompt template\n",
 15      "- How to run the prompt optimization loop\n",
 16      "- How to retrieve and inspect the best performing prompt"
 17     ]
 18    },
 19    {
 20     "cell_type": "code",
 21     "id": "133c6d18",
 22     "metadata": {},
 23     "source": [
 24      "# If you haven't installed the required packages yet:\n",
 25      "# !pip install evidently openai pandas"
 26     ],
 27     "outputs": [],
 28     "execution_count": null
 29    },
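  {
   "cell_type": "markdown",
   "id": "a2f61c03",
   "metadata": {},
   "source": [
    "The judge and optimizer below call the OpenAI API, so an API key needs to be available in the environment. One minimal way to set it up (skip this cell if `OPENAI_API_KEY` is already set):"
   ]
  },
  {
   "cell_type": "code",
   "id": "b4d82e17",
   "metadata": {},
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "\n",
    "# Reuse an existing key if present; otherwise prompt for one\n",
    "if not os.environ.get(\"OPENAI_API_KEY\"):\n",
    "    os.environ[\"OPENAI_API_KEY\"] = getpass(\"OpenAI API key: \")"
   ],
   "outputs": [],
   "execution_count": null
  },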
  {
   "cell_type": "code",
   "id": "3cc9af2e",
   "metadata": {},
   "source": [
    "import pandas as pd\n",
    "\n",
    "from evidently import Dataset, DataDefinition, LLMClassification\n",
    "from evidently.llm.templates import BinaryClassificationPromptTemplate\n",
    "from evidently.descriptors import LLMEval\n",
    "from evidently.llm.optimization import PromptOptimizer"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "id": "fd5f6441",
   "metadata": {},
   "source": [
    "# Load your dataset\n",
    "review_dataset = pd.read_csv(\"../datasets/code_review.csv\")\n",
    "review_dataset.head()"
   ],
   "outputs": [],
   "execution_count": null
  },
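  {
   "cell_type": "markdown",
   "id": "d5c30a88",
   "metadata": {},
   "source": [
    "Since the optimizer below scores prompts by accuracy, it's worth checking that the expert labels are reasonably balanced; on a heavily skewed dataset, plain accuracy can be misleading. A quick sanity check:"
   ]
  },
  {
   "cell_type": "code",
   "id": "e6b41f99",
   "metadata": {},
   "source": [
    "# Check dataset size and label balance\n",
    "print(review_dataset.shape)\n",
    "review_dataset[\"Expert label\"].value_counts()"
   ],
   "outputs": [],
   "execution_count": null
  },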
  {
   "cell_type": "code",
   "id": "6e464810",
   "metadata": {},
   "source": [
    "# Define how Evidently should interpret your dataset\n",
    "dd = DataDefinition(\n",
    "    text_columns=[\"Generated review\", \"Expert comment\"],\n",
    "    categorical_columns=[\"Expert label\"],\n",
    "    llm=LLMClassification(input=\"Generated review\", target=\"Expert label\", reasoning=\"Expert comment\")\n",
    ")"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "id": "3957c58d",
   "metadata": {},
   "source": [
    "# Convert your pandas DataFrame into an Evidently Dataset\n",
    "dataset = Dataset.from_pandas(review_dataset, data_definition=dd)"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "id": "af027bae",
   "metadata": {},
   "source": [
    "# Define a prompt template and judge for classifying code review quality\n",
    "criteria = '''A review is GOOD when it's actionable and constructive.\n",
    "A review is BAD when it is non-actionable or overly critical.'''\n",
    "\n",
    "feedback_quality = BinaryClassificationPromptTemplate(\n",
    "    pre_messages=[(\"system\", \"You are evaluating the quality of code reviews given to junior developers.\")],\n",
    "    criteria=criteria,\n",
    "    target_category=\"bad\",\n",
    "    non_target_category=\"good\",\n",
    "    uncertainty=\"unknown\",\n",
    "    include_reasoning=True,\n",
    ")\n",
    "\n",
    "judge = LLMEval(\n",
    "    alias=\"Code Review Judge\",\n",
    "    provider=\"openai\",\n",
    "    model=\"gpt-4o-mini\",\n",
    "    column_name=\"Generated review\",\n",
    "    template=feedback_quality\n",
    ")"
   ],
   "outputs": [],
   "execution_count": null
  },
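  {
   "cell_type": "markdown",
   "id": "b7e2f4a0",
   "metadata": {},
   "source": [
    "Before optimizing, it helps to know how the starting prompt performs. The cell below is a minimal baseline sketch: it assumes the `LLMEval` descriptor writes its verdict to a column named after its alias (`Code Review Judge`), and it scores a fresh `Dataset` copy so the optimization input stays untouched. Note that it calls the OpenAI API once per row."
   ]
  },
  {
   "cell_type": "code",
   "id": "c9d1e3f5",
   "metadata": {},
   "source": [
    "# Score the dataset with the unoptimized judge\n",
    "baseline = Dataset.from_pandas(review_dataset, data_definition=dd)\n",
    "baseline.add_descriptors(descriptors=[judge])\n",
    "baseline_df = baseline.as_dataframe()\n",
    "\n",
    "# Compare the judge's verdicts to the expert labels (column name assumed from the alias)\n",
    "matches = baseline_df[\"Code Review Judge\"].str.lower() == baseline_df[\"Expert label\"].str.lower()\n",
    "print(f\"Baseline accuracy: {matches.mean():.2%}\")"
   ],
   "outputs": [],
   "execution_count": null
  },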
  {
   "cell_type": "code",
   "id": "6995309b",
   "metadata": {},
   "source": [
117      "# Initialize the optimizer and run optimization using feedback strategy\n",
118      "optimizer = PromptOptimizer(\"code_review_example\", strategy=\"feedback\", verbose=True)\n",
119      "await optimizer.arun(executor=judge, scorer=\"accuracy\", dataset=dataset, repetitions=5)\n",
120      "# for sync version:\n",
121      "# optimizer.run(judge, \"accuracy\")"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "cell_type": "code",
   "id": "e7f3162d",
   "metadata": {},
   "source": [
    "# Show the best-performing prompt template found by the optimizer\n",
    "print(optimizer.best_prompt())"
   ],
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
141      "# inspect optimizer statistics & logs\n",
142      "optimizer.print_stats()"
143     ],
144     "id": "9abd1c4ea5585177",
145     "outputs": [],
146     "execution_count": null
147    },
148    {
149     "metadata": {},
150     "cell_type": "markdown",
151     "source": [
152      "### LLM Judge from scratch\n",
153      "\n",
154      "You can also create LLM judges from scratch if you have labeled dataset."
   ],
   "id": "19d88462d4f97af2"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "from evidently.llm.optimization import BlankLLMJudge\n",
    "\n",
    "optimizer = PromptOptimizer(\"code_review_scratch_example\", strategy=\"feedback\", verbose=True)\n",
    "await optimizer.arun(executor=BlankLLMJudge(), scorer=\"accuracy\", dataset=dataset, repetitions=5)"
   ],
   "id": "d78caae22b9749cf",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "code",
   "source": [
    "# Retrieve the best judge found by the optimizer\n",
    "new_judge = optimizer.best_executor()\n",
    "new_judge"
   ],
   "id": "8b168c087f71ddc",
   "outputs": [],
   "execution_count": null
  },
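  {
   "cell_type": "markdown",
   "id": "f1a52b6c",
   "metadata": {},
   "source": [
    "To sanity-check the generated judge, you can apply it to the data like any other descriptor. This sketch assumes the executor returned by `best_executor()` can be used as a descriptor directly; inspect the resulting dataframe to find its output column:"
   ]
  },
  {
   "cell_type": "code",
   "id": "a3c74d8e",
   "metadata": {},
   "source": [
    "# Apply the generated judge to a fresh copy of the data and inspect the output\n",
    "check = Dataset.from_pandas(review_dataset, data_definition=dd)\n",
    "check.add_descriptors(descriptors=[new_judge])\n",
    "check.as_dataframe().head()"
   ],
   "outputs": [],
   "execution_count": null
  },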
  {
   "metadata": {},
   "cell_type": "code",
   "source": "optimizer.print_stats()",
   "id": "57f46604626cd00d",
   "outputs": [],
   "execution_count": null
  },
  {
   "metadata": {},
   "cell_type": "markdown",
193     "source": "You can save and load optimized template for later use",
194     "id": "d0eb0d104d0174d9"
195    },
196    {
197     "metadata": {},
198     "cell_type": "code",
199     "source": [
200      "from evidently.descriptors import LLMJudge\n",
201      "from evidently.llm.templates import BaseLLMPromptTemplate\n",
202      "\n",
203      "new_judge.template.dump(\"my_template.yaml\")\n",
204      "template = BaseLLMPromptTemplate.load(\"my_template.yaml\")\n",
205      "my_judge = LLMJudge(provider=\"openai\", model=\"gpt-4o-mini\", template=template, input_column=\"Generated review\")"
206     ],
207     "id": "51dde31029349848",
208     "outputs": [],
209     "execution_count": null
210    },
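  {
   "cell_type": "markdown",
   "id": "b8e95c2d",
   "metadata": {},
   "source": [
    "Once rebuilt, `my_judge` works like any other descriptor, so you can use it to score new batches of reviews. A usage sketch, reusing the same data definition:"
   ]
  },
  {
   "cell_type": "code",
   "id": "c2f06a4b",
   "metadata": {},
   "source": [
    "# Score the data with the reloaded judge\n",
    "scored = Dataset.from_pandas(review_dataset, data_definition=dd)\n",
    "scored.add_descriptors(descriptors=[my_judge])\n",
    "scored.as_dataframe().head()"
   ],
   "outputs": [],
   "execution_count": null
  }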
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}