---
title: Evaluating Agents
sidebar_position: 450
---

# Evaluating Agents

The framework includes an evaluation system that helps you test your agents' behavior in a structured way. You can define test suites, run them against your agents, and generate detailed reports to analyze the results. When running evaluations locally, you can also benchmark different language models to see how they affect your agents' responses.

At a high level, the evaluation process involves three main components:

*   **Test Case**: A `JSON` file that defines a single, specific task for an agent to perform. It includes the initial prompt, any required files (artifacts), and the criteria for a successful outcome.
*   **Test Suite**: A `JSON` file that groups one or more test cases into a single evaluation run. It also defines the environment for the evaluation, such as whether to run the agents locally or connect to a remote Agent Mesh.
*   **Evaluation Settings**: A configuration block within the test suite that specifies how to score the agent's performance. You can choose from several methods, from simple metric-based comparisons to more advanced evaluations using a language model.

This document guides you through creating test cases, assembling them into test suites, and running evaluations to test your agents.

## Creating a Test Case

A test case is a `JSON` file that defines a specific task for an agent to perform. It serves as the fundamental building block of an evaluation. You create a test case to represent a single interaction you want to test, such as asking a question, providing a file for processing, or requesting a specific action from an agent.

### Test Case Configuration

The following fields are available in the test case `JSON` file:

*   `test_case_id` (Required): A unique identifier for the test case.
*   `query` (Required): The initial prompt to be sent to the agent.
*   `target_agent` (Required): The name of the agent to which the query should be sent.
*   `category` (Optional): The category of the test case. Defaults to `Other`.
*   `description` (Optional): A description of the test case.
*   `artifacts` (Optional): A list of artifacts to be sent with the initial query. Each artifact has a `type` and a `path`.
*   `wait_time` (Optional): The maximum time in seconds to wait for a response from the agent. Defaults to `60`.
*   `evaluation` (Optional): The evaluation criteria for the test case.
    *   `expected_tools` (Optional): A list of tools that the agent is expected to use. Defaults to an empty list.
    *   `expected_response` (Optional): The expected final response from the agent. Defaults to an empty string.
    *   `criterion` (Optional): The criterion to be used by the `llm_evaluator`. Defaults to an empty string.

### Test Case Examples

Here is an example of a simple test case. It sends a greeting to an agent and checks for a standard response.

```json
{
    "test_case_id": "hello_world",
    "category": "Content Generation",
    "description": "A simple test case to check the basic functionality of the system.",
    "query": "Hello, world!",
    "target_agent": "OrchestratorAgent",
    "wait_time": 30,
    "evaluation": {
        "expected_tools": [],
        "expected_response": "Hello! How can I help you today?",
        "criterion": "Evaluate if the agent provides a standard greeting."
    }
}
```

This next example shows a more complex test case. It includes a `CSV` file as an artifact and asks the agent to filter the data in the file.

```json
{
  "test_case_id": "filter_csv_employees_by_age_and_country",
  "category": "Tool Usage",
  "description": "A test case to filter employees from a CSV file based on age and country.",
  "target_agent": "OrchestratorAgent",
  "query": "From the attached CSV, please list the names of all people who are older or equal to 30 and live in the USA.",
  "artifacts": [
    {
      "type": "file",
      "path": "artifacts/sample.csv"
    }
  ],
  "wait_time": 120,
  "evaluation": {
    "expected_tools": ["extract_content_from_artifact"],
    "expected_response": "The person who is 30 or older and lives in the USA is John Doe.",
    "criterion": "Evaluate if the agent correctly filters the CSV data."
  }
}
```

## Creating a Test Suite

The test suite is a `JSON` file that defines the parameters of an evaluation run. You use it to group test cases and configure the environment in which they run.

A common convention in the test suite configuration is to use keys ending with `_VAR`. These keys indicate that the corresponding value is the name of an environment variable from which the framework should read the actual value. This practice helps you keep sensitive information, such as API keys and credentials, out of your configuration files. This convention applies to the `broker` object, the `env` object within `llm_models`, and the `env` object within the `llm_evaluator` in `evaluation_settings`.
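
For example, given an entry such as `"SOLACE_BROKER_URL_VAR": "SOLACE_BROKER_URL"`, the framework reads the broker URL from the `SOLACE_BROKER_URL` environment variable rather than from the file itself. The sketch below illustrates the convention; `resolve_env_refs` is a hypothetical helper, not the framework's actual code:

```python
import os

def resolve_env_refs(config: dict) -> dict:
    """Resolve keys ending in `_VAR` by reading the named environment variable.

    Illustrative only; the framework's real resolution logic may differ.
    """
    resolved = {}
    for key, value in config.items():
        if key.endswith("_VAR"):
            # The value names an environment variable that holds the real setting.
            resolved[key[: -len("_VAR")]] = os.environ.get(value, "")
        else:
            resolved[key] = value
    return resolved

os.environ["SOLACE_BROKER_URL"] = "tcp://localhost:55555"
broker = resolve_env_refs({"SOLACE_BROKER_URL_VAR": "SOLACE_BROKER_URL"})
print(broker)  # {'SOLACE_BROKER_URL': 'tcp://localhost:55555'}
```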

You can run evaluations in two modes: local and remote. Both modes require a connection to a Solace event broker to function.

### Local Evaluation

In a local evaluation, the evaluation framework brings up a local instance of Solace Agent Mesh (SAM) and runs the agents on your local machine. This mode is useful for development and testing because it allows you to iterate quickly on your agents and test cases. You can also use this mode to benchmark different language models against your agents to see how they perform.

To run a local evaluation, you need to install the `sam-rest-gateway` plugin. This plugin allows the evaluation framework to communicate with the local SAM instance. You can install it with the following command:

```bash
pip install sam-rest-gateway
```

#### Local Test Suite Configuration

For a local evaluation, you must define the `agents`, `broker`, `llm_models`, and `test_cases` fields.

The `agents` field is a required list of paths to the agent configuration files. You must specify at least one agent.
```json
"agents": [ "examples/agents/a2a_agents_example.yaml" ]
```

The `broker` field is a required object containing the connection details for the Solace event broker.
```json
"broker": {
    "SOLACE_BROKER_URL_VAR": "SOLACE_BROKER_URL",
    "SOLACE_BROKER_USERNAME_VAR": "SOLACE_BROKER_USERNAME",
    "SOLACE_BROKER_PASSWORD_VAR": "SOLACE_BROKER_PASSWORD",
    "SOLACE_BROKER_VPN_VAR": "SOLACE_BROKER_VPN"
}
```

The `llm_models` field is a required list of language models to use. You must specify at least one model. The `env` object contains environment variables required by the model, such as the model name, endpoint, and API key.
```json
"llm_models": [
    {
        "name": "gpt-4-1",
        "env": {
            "LLM_SERVICE_PLANNING_MODEL_NAME": "openai/azure-gpt-4-1",
            "LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
            "LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
        }
    },
    {
        "name": "gemini-1.5-pro",
        "env": {
            "LLM_SERVICE_PLANNING_MODEL_NAME": "google/gemini-1.5-pro-latest",
            "LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT_GOOGLE",
            "LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY_GOOGLE"
        }
    }
]
```

The `test_cases` field is a required list of paths to the test case `JSON` files. You must specify at least one test case.
```json
"test_cases": [ "tests/evaluation/test_cases/hello_world.test.json" ]
```

You can also provide optional settings for `results_dir_name`, `runs`, `workers`, and `evaluation_settings`.

The `results_dir_name` field is an optional string that specifies the name of the directory for evaluation results. It defaults to `tests`.
```json
"results_dir_name": "my-local-test-results"
```

The `runs` field is an optional integer that specifies the number of times to run each test case. It defaults to `1`.
```json
"runs": 3
```

The `workers` field is an optional integer that specifies the number of parallel workers for running tests. It defaults to `4`.
```json
"workers": 8
```

The `evaluation_settings` field is an optional object that allows you to configure the evaluation. This object can contain `tool_match`, `response_match`, and `llm_evaluator` settings.
```json
"evaluation_settings": {
    "tool_match": {
        "enabled": true
    },
    "response_match": {
        "enabled": true
    },
    "llm_evaluator": {
        "enabled": true,
        "env": {
            "LLM_SERVICE_PLANNING_MODEL_NAME": "openai/gemini-2.5-pro",
            "LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
            "LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
        }
    }
}
```

#### Example Local Test Suite

```json
{
    "agents": [
        "examples/agents/a2a_agents_example.yaml",
        "examples/agents/multimodal_example.yaml",
        "examples/agents/orchestrator_example.yaml"
    ],
    "broker": {
        "SOLACE_BROKER_URL_VAR": "SOLACE_BROKER_URL",
        "SOLACE_BROKER_USERNAME_VAR": "SOLACE_BROKER_USERNAME",
        "SOLACE_BROKER_PASSWORD_VAR": "SOLACE_BROKER_PASSWORD",
        "SOLACE_BROKER_VPN_VAR": "SOLACE_BROKER_VPN"
    },
    "llm_models": [
        {
            "name": "gpt-4-1",
            "env": {
                "LLM_SERVICE_PLANNING_MODEL_NAME": "openai/azure-gpt-4-1",
                "LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
                "LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
            }
        }
    ],
    "results_dir_name": "sam-local-eval-test",
    "runs": 3,
    "workers": 4,
    "test_cases": [
        "tests/evaluation/test_cases/filter_csv_employees_by_age_and_country.test.json",
        "tests/evaluation/test_cases/hello_world.test.json"
    ],
    "evaluation_settings": {
        "tool_match": {
            "enabled": true
        },
        "response_match": {
            "enabled": true
        },
        "llm_evaluator": {
            "enabled": true,
            "env": {
                "LLM_SERVICE_PLANNING_MODEL_NAME": "openai/gemini-2.5-pro",
                "LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
                "LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
            }
        }
    }
}
```

### Remote Evaluation

In a remote evaluation, the evaluation framework sends requests to a remote Agent Mesh instance. This mode is useful for testing agents in a production-like environment where the agents are running on a separate server. The remote environment must have a REST gateway running to accept requests from the evaluation framework. You can also use an authentication token to communicate securely with the remote SAM instance.

#### Remote Test Suite Configuration

For a remote evaluation, you must define the `broker`, `remote`, and `test_cases` fields.

The `broker` field is a required object with connection details for the Solace event broker.
```json
"broker": {
    "SOLACE_BROKER_URL_VAR": "SOLACE_BROKER_URL",
    "SOLACE_BROKER_USERNAME_VAR": "SOLACE_BROKER_USERNAME",
    "SOLACE_BROKER_PASSWORD_VAR": "SOLACE_BROKER_PASSWORD",
    "SOLACE_BROKER_VPN_VAR": "SOLACE_BROKER_VPN"
}
```

The `remote` field is a required object containing the connection details for the remote Agent Mesh instance.
```json
"remote": {
    "EVAL_REMOTE_URL_VAR": "EVAL_REMOTE_URL",
    "EVAL_AUTH_TOKEN_VAR": "EVAL_AUTH_TOKEN",
    "EVAL_NAMESPACE_VAR": "EVAL_NAMESPACE"
}
```

The `test_cases` field is a required list of paths to the test case `JSON` files. You must specify at least one test case.
```json
"test_cases": [ "tests/evaluation/test_cases/hello_world.test.json" ]
```

You can also provide optional settings for `results_dir_name`, `runs`, and `evaluation_settings`.

The `results_dir_name` field is an optional string that specifies the name of the directory for evaluation results. It defaults to `tests`.
```json
"results_dir_name": "my-remote-test-results"
```

The `runs` field is an optional integer that specifies the number of times to run each test case. It defaults to `1`.
```json
"runs": 5
```

The `evaluation_settings` field is an optional object that allows you to configure the evaluation.
```json
"evaluation_settings": {
    "tool_match": {
        "enabled": true
    },
    "response_match": {
        "enabled": true
    }
}
```

#### Example Remote Test Suite

```json
{
    "broker": {
        "SOLACE_BROKER_URL_VAR": "SOLACE_BROKER_URL",
        "SOLACE_BROKER_USERNAME_VAR": "SOLACE_BROKER_USERNAME",
        "SOLACE_BROKER_PASSWORD_VAR": "SOLACE_BROKER_PASSWORD",
        "SOLACE_BROKER_VPN_VAR": "SOLACE_BROKER_VPN"
    },
    "remote": {
        "EVAL_REMOTE_URL_VAR": "EVAL_REMOTE_URL",
        "EVAL_AUTH_TOKEN_VAR": "EVAL_AUTH_TOKEN",
        "EVAL_NAMESPACE_VAR": "EVAL_NAMESPACE"
    },
    "results_dir_name": "sam-remote-eval-test",
    "runs": 1,
    "test_cases": [
        "tests/evaluation/test_cases/filter_csv_employees_by_age_and_country.test.json",
        "tests/evaluation/test_cases/hello_world.test.json"
    ],
    "evaluation_settings": {
        "tool_match": {
            "enabled": true
        },
        "response_match": {
            "enabled": true
        },
        "llm_evaluator": {
            "enabled": true,
            "env": {
                "LLM_SERVICE_PLANNING_MODEL_NAME": "openai/gemini-2.5-pro",
                "LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
                "LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
            }
        }
    }
}
```

## Evaluation Settings

The `evaluation_settings` block in the test suite `JSON` file allows you to configure how the evaluation is performed. Each enabled setting provides a score from 0 to 1, which contributes to the overall score for the test case.
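
As a rough illustration, combining the per-method scores with an equal weighting would look like the sketch below; the framework's actual aggregation may weight the methods differently:

```python
def overall_score(method_scores: dict[str, float]) -> float:
    """Combine per-method scores (each 0 to 1) into one overall score.

    Equal weighting is an assumption for illustration; the framework's
    actual aggregation may differ.
    """
    if not method_scores:
        return 0.0
    return sum(method_scores.values()) / len(method_scores)

print(overall_score({"tool_match": 1.0, "response_match": 0.5, "llm_evaluator": 0.9}))  # 0.8
```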

### `tool_match`

The `tool_match` setting compares the tools the agent used with the `expected_tools` defined in the test case. This is a simple, direct comparison and does not use a language model for the evaluation. It is most effective when the agent's expected behavior is straightforward and there is a clear, correct sequence of tools to be used. In more complex scenarios where multiple paths could lead to a successful outcome, this method may not be the best way to evaluate the agent's performance.
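
The idea of a direct comparison can be pictured as follows. This is a sketch only; the framework's exact matching rules, such as how it treats ordering or repeated tool calls, may differ:

```python
def tool_match_score(used_tools: list[str], expected_tools: list[str]) -> float:
    """Score 1.0 when the agent used exactly the expected tools, else 0.0.

    Order-insensitive by assumption; the framework may handle ordering
    and partial matches differently.
    """
    return 1.0 if sorted(used_tools) == sorted(expected_tools) else 0.0

print(tool_match_score(["extract_content_from_artifact"], ["extract_content_from_artifact"]))  # 1.0
print(tool_match_score([], ["extract_content_from_artifact"]))  # 0.0
```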

### `response_match`

The `response_match` setting compares the agent's final response with the `expected_response` from the test case. This comparison is based on the ROUGE metric, which evaluates the similarity between two responses by comparing their sequence of words. This method does not use a language model for the evaluation and does not work well with synonyms, so it is most effective when the expected answer is consistent. For more information about the ROUGE metric, see the [official documentation](https://pypi.org/project/rouge-metric/).
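
To see what word-overlap scoring measures, here is a simplified unigram-overlap (ROUGE-1-style) F1 score. This is an illustration of the idea, not the framework's implementation, which relies on the `rouge-metric` package and supports longer n-grams and other variants:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # words shared by both texts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Identical wording scores 1.0, while a synonym ("Hi" vs. "Hello") gets no
# credit at all -- which is why this method works best with consistent answers.
print(rouge1_f1("Hello! How can I help you today?", "Hello! How can I help you today?"))  # 1.0
print(rouge1_f1("Hi", "Hello"))  # 0.0
```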

### `llm_evaluator`

The `llm_evaluator` setting uses a language model to evaluate the entire lifecycle of a request within the agent mesh. This includes the initial prompt, all tool calls, delegation between agents, artifact inputs and outputs, and the final message output. The evaluation is based on a `criterion` you provide in the test case, which defines what a successful outcome looks like. This is the most comprehensive evaluation method because it considers the full context of the request's execution.

## Running Evaluations

After you create your test cases and test suite, you can run the evaluation from the command line using the `sam eval` command.

### Command

```bash
sam eval <PATH> [OPTIONS]
```

The command takes the path to the evaluation test suite `JSON` file as a required argument.

### Options

*   `-v`, `--verbose`: Enable verbose output to see detailed logs during the evaluation run.
*   `-h`, `--help`: Show a help message with information about the command and its options.

### Example

```bash
sam eval tests/evaluation/local_example.json --verbose
```

## Interpreting the Results

After an evaluation run is complete, the framework stores the results in a directory. The path to this directory is `results/` followed by the `results_dir_name` you specified in the test suite.

### Results Directory

The results directory has the following structure:

```
<results_dir_name>/
├── report.html
├── stats.json
└── <model_name>/
    ├── full_messages.json
    ├── results.json
    └── <test_case_id>/
        ├── run_1/
        │   ├── messages.json
        │   ├── summary.json
        │   └── test_case_info.json
        └── run_2/
            └── ...
```

*   **`report.html`**: An `HTML` report that provides a comprehensive overview of the evaluation results. See the HTML Report section below for details.
*   **`stats.json`**: A `JSON` file containing detailed statistics about the evaluation run, including scores for each evaluation metric.
*   **`<model_name>/`**: A directory for each language model tested (or a single `remote` directory for remote evaluations).
    *   **`full_messages.json`**: A log of all messages exchanged during the evaluation for that model.
    *   **`results.json`**: The raw evaluation results for each test case.
    *   **`<test_case_id>/`**: A directory for each test case, containing a `run_n` subdirectory for each run of the test case. These directories contain detailed logs and artifacts for each run.

### HTML Report

The `report.html` file provides a comprehensive overview of the evaluation results. It includes a summary of the test runs, a breakdown of the results for each test case, and detailed logs for each test run. This report is the primary tool for analyzing the results of an evaluation. You can open this file in a web browser to view the report.