---
title: Evaluating Agents
sidebar_position: 450
---

# Evaluating Agents

The framework includes an evaluation system that helps you test your agents' behavior in a structured way. You can define test suites, run them against your agents, and generate detailed reports to analyze the results. When running evaluations locally, you can also benchmark different language models to see how they affect your agents' responses.

At a high level, the evaluation process involves three main components:

* **Test Case**: A `JSON` file that defines a single, specific task for an agent to perform. It includes the initial prompt, any required files (artifacts), and the criteria for a successful outcome.
* **Test Suite**: A `JSON` file that groups one or more test cases into a single evaluation run. It also defines the environment for the evaluation, such as whether to run the agents locally or connect to a remote Agent Mesh.
* **Evaluation Settings**: A configuration block within the test suite that specifies how to score the agent's performance. You can choose from several methods, from simple metric-based comparisons to more advanced evaluations using a language model.

This document guides you through creating test cases, assembling them into test suites, and running evaluations to test your agents.

## Creating a Test Case

A test case is a `JSON` file that defines a specific task for an agent to perform. It serves as the fundamental building block of an evaluation. You create a test case to represent a single interaction you want to test, such as asking a question, providing a file for processing, or requesting a specific action from an agent.

### Test Case Configuration

The following fields are available in the test case `JSON` file.

* `test_case_id` (Required): A unique identifier for the test case.
* `query` (Required): The initial prompt to be sent to the agent.
* `target_agent` (Required): The name of the agent to which the query should be sent.
* `category` (Optional): The category of the test case. Defaults to `Other`.
* `description` (Optional): A description of the test case.
* `artifacts` (Optional): A list of artifacts to be sent with the initial query. Each artifact has a `type` and a `path`.
* `wait_time` (Optional): The maximum time in seconds to wait for a response from the agent. Defaults to `60`.
* `evaluation` (Optional): The evaluation criteria for the test case.
  * `expected_tools` (Optional): A list of tools that the agent is expected to use. Defaults to an empty list.
  * `expected_response` (Optional): The expected final response from the agent. Defaults to an empty string.
  * `criterion` (Optional): The criterion to be used by the `llm_evaluator`. Defaults to an empty string.

### Test Case Examples

Here is an example of a simple test case. It sends a greeting to an agent and checks for a standard response.

```json
{
  "test_case_id": "hello_world",
  "category": "Content Generation",
  "description": "A simple test case to check the basic functionality of the system.",
  "query": "Hello, world!",
  "target_agent": "OrchestratorAgent",
  "wait_time": 30,
  "evaluation": {
    "expected_tools": [],
    "expected_response": "Hello! How can I help you today?",
    "criterion": "Evaluate if the agent provides a standard greeting."
  }
}
```

This next example shows a more complex test case. It includes a `CSV` file as an artifact and asks the agent to filter the data in the file.
```json
{
  "test_case_id": "filter_csv_employees_by_age_and_country",
  "category": "Tool Usage",
  "description": "A test case to filter employees from a CSV file based on age and country.",
  "target_agent": "OrchestratorAgent",
  "query": "From the attached CSV, please list the names of all people who are older or equal to 30 and live in the USA.",
  "artifacts": [
    {
      "type": "file",
      "path": "artifacts/sample.csv"
    }
  ],
  "wait_time": 120,
  "evaluation": {
    "expected_tools": ["extract_content_from_artifact"],
    "expected_response": "The person who is 30 or older and lives in the USA is John Doe.",
    "criterion": "Evaluate if the agent correctly filters the CSV data."
  }
}
```

## Creating a Test Suite

The test suite is a `JSON` file that defines the parameters of an evaluation run. You use it to group test cases and configure the environment in which they run.

A common convention in the test suite configuration is to use keys ending with `_VAR`. These keys indicate that the corresponding value is the name of an environment variable from which the framework should read the actual value. This practice helps you keep sensitive information, like API keys and credentials, out of your configuration files. This convention applies to the `broker` object, the `env` object within `llm_models`, and the `env` object within the `llm_evaluator` in `evaluation_settings`.

You can run evaluations in two modes: local and remote. Both modes require a connection to a Solace event broker to function.

### Local Evaluation

In a local evaluation, the evaluation framework brings up a local instance of Solace Agent Mesh (SAM) and runs the agents on your local machine. This mode is useful for development and testing because it allows you to iterate quickly on your agents and test cases. You can also use this mode to benchmark different language models against your agents to see how they perform.
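As a concrete illustration of the `_VAR` convention described above, the following sketch shows how such a key resolves to a value at run time. The `resolve_vars` helper is hypothetical and exists only for illustration; the framework performs this resolution internally.

```python
import os

def resolve_vars(config: dict) -> dict:
    """Resolve keys ending in `_VAR` by reading the named environment variable.

    Hypothetical helper for illustration only; not part of the framework's API.
    """
    resolved = {}
    for key, value in config.items():
        if key.endswith("_VAR"):
            # The value is the *name* of an environment variable holding the real setting.
            resolved[key[: -len("_VAR")]] = os.environ[value]
        else:
            resolved[key] = value
    return resolved

# Only the variable names appear in the test suite file; the secret stays in the environment.
os.environ["SOLACE_BROKER_URL"] = "ws://localhost:8008"
broker = {"SOLACE_BROKER_URL_VAR": "SOLACE_BROKER_URL"}
print(resolve_vars(broker))  # {'SOLACE_BROKER_URL': 'ws://localhost:8008'}
```

This is why the example configurations below commit only variable names to the file, never the credentials themselves.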
To run a local evaluation, you need to install the `sam-rest-gateway` plugin. This plugin allows the evaluation framework to communicate with the local SAM instance. You can install it with the following command:

```bash
pip install sam-rest-gateway
```

#### Local Test Suite Configuration

For a local evaluation, you must define the `agents`, `broker`, `llm_models`, and `test_cases` fields.

The `agents` field is a required list of paths to the agent configuration files. You must specify at least one agent.

```json
"agents": [ "examples/agents/a2a_agents_example.yaml" ]
```

The `broker` field is a required object containing the connection details for the Solace event broker.

```json
"broker": {
  "SOLACE_BROKER_URL_VAR": "SOLACE_BROKER_URL",
  "SOLACE_BROKER_USERNAME_VAR": "SOLACE_BROKER_USERNAME",
  "SOLACE_BROKER_PASSWORD_VAR": "SOLACE_BROKER_PASSWORD",
  "SOLACE_BROKER_VPN_VAR": "SOLACE_BROKER_VPN"
}
```

The `llm_models` field is a required list of language models to use. You must specify at least one model. The `env` object contains environment variables required by the model, such as the model name, endpoint, and API key.

```json
"llm_models": [
  {
    "name": "gpt-4-1",
    "env": {
      "LLM_SERVICE_PLANNING_MODEL_NAME": "openai/azure-gpt-4-1",
      "LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
      "LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
    }
  },
  {
    "name": "gemini-1.5-pro",
    "env": {
      "LLM_SERVICE_PLANNING_MODEL_NAME": "google/gemini-1.5-pro-latest",
      "LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT_GOOGLE",
      "LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY_GOOGLE"
    }
  }
]
```

The `test_cases` field is a required list of paths to the test case `JSON` files. You must specify at least one test case.
```json
"test_cases": [ "tests/evaluation/test_cases/hello_world.test.json" ]
```

You can also provide optional settings for `results_dir_name`, `runs`, `workers`, and `evaluation_settings`.

The `results_dir_name` field is an optional string that specifies the name of the directory for evaluation results. It defaults to `tests`.

```json
"results_dir_name": "my-local-test-results"
```

The `runs` field is an optional integer that specifies the number of times to run each test case. It defaults to `1`.

```json
"runs": 3
```

The `workers` field is an optional integer that specifies the number of parallel workers for running tests. It defaults to `4`.

```json
"workers": 8
```

The `evaluation_settings` field is an optional object that allows you to configure the evaluation. This object can contain `tool_match`, `response_match`, and `llm_evaluator` settings.

```json
"evaluation_settings": {
  "tool_match": {
    "enabled": true
  },
  "response_match": {
    "enabled": true
  },
  "llm_evaluator": {
    "enabled": true,
    "env": {
      "LLM_SERVICE_PLANNING_MODEL_NAME": "openai/gemini-2.5-pro",
      "LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
      "LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
    }
  }
}
```

#### Example Local Test Suite

```json
{
  "agents": [
    "examples/agents/a2a_agents_example.yaml",
    "examples/agents/multimodal_example.yaml",
    "examples/agents/orchestrator_example.yaml"
  ],
  "broker": {
    "SOLACE_BROKER_URL_VAR": "SOLACE_BROKER_URL",
    "SOLACE_BROKER_USERNAME_VAR": "SOLACE_BROKER_USERNAME",
    "SOLACE_BROKER_PASSWORD_VAR": "SOLACE_BROKER_PASSWORD",
    "SOLACE_BROKER_VPN_VAR": "SOLACE_BROKER_VPN"
  },
  "llm_models": [
    {
      "name": "gpt-4-1",
      "env": {
        "LLM_SERVICE_PLANNING_MODEL_NAME": "openai/azure-gpt-4-1",
        "LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
        "LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
      }
    }
  ],
  "results_dir_name": "sam-local-eval-test",
  "runs": 3,
  "workers": 4,
  "test_cases": [
    "tests/evaluation/test_cases/filter_csv_employees_by_age_and_country.test.json",
    "tests/evaluation/test_cases/hello_world.test.json"
  ],
  "evaluation_settings": {
    "tool_match": {
      "enabled": true
    },
    "response_match": {
      "enabled": true
    },
    "llm_evaluator": {
      "enabled": true,
      "env": {
        "LLM_SERVICE_PLANNING_MODEL_NAME": "openai/gemini-2.5-pro",
        "LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
        "LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
      }
    }
  }
}
```

### Remote Evaluation

In a remote evaluation, the evaluation framework sends requests to a remote Agent Mesh instance. This mode is useful for testing agents in a production-like environment where the agents are running on a separate server. The remote environment must have a REST gateway running to accept requests from the evaluation framework. You can also use an authentication token to communicate securely with the remote SAM instance.

#### Remote Test Suite Configuration

For a remote evaluation, you must define the `broker`, `remote`, and `test_cases` fields.

The `broker` field is a required object with connection details for the Solace event broker.

```json
"broker": {
  "SOLACE_BROKER_URL_VAR": "SOLACE_BROKER_URL",
  "SOLACE_BROKER_USERNAME_VAR": "SOLACE_BROKER_USERNAME",
  "SOLACE_BROKER_PASSWORD_VAR": "SOLACE_BROKER_PASSWORD",
  "SOLACE_BROKER_VPN_VAR": "SOLACE_BROKER_VPN"
}
```

The `remote` field is a required object containing the connection details for the remote Agent Mesh instance.
```json
"remote": {
  "EVAL_REMOTE_URL_VAR": "EVAL_REMOTE_URL",
  "EVAL_AUTH_TOKEN_VAR": "EVAL_AUTH_TOKEN",
  "EVAL_NAMESPACE_VAR": "EVAL_NAMESPACE"
}
```

The `test_cases` field is a required list of paths to the test case `JSON` files. You must specify at least one test case.

```json
"test_cases": [ "tests/evaluation/test_cases/hello_world.test.json" ]
```

You can also provide optional settings for `results_dir_name`, `runs`, and `evaluation_settings`.

The `results_dir_name` field is an optional string that specifies the name of the directory for evaluation results. It defaults to `tests`.

```json
"results_dir_name": "my-remote-test-results"
```

The `runs` field is an optional integer that specifies the number of times to run each test case. It defaults to `1`.

```json
"runs": 5
```

The `evaluation_settings` field is an optional object that allows you to configure the evaluation.

```json
"evaluation_settings": {
  "tool_match": {
    "enabled": true
  },
  "response_match": {
    "enabled": true
  }
}
```

#### Example Remote Test Suite

```json
{
  "broker": {
    "SOLACE_BROKER_URL_VAR": "SOLACE_BROKER_URL",
    "SOLACE_BROKER_USERNAME_VAR": "SOLACE_BROKER_USERNAME",
    "SOLACE_BROKER_PASSWORD_VAR": "SOLACE_BROKER_PASSWORD",
    "SOLACE_BROKER_VPN_VAR": "SOLACE_BROKER_VPN"
  },
  "remote": {
    "EVAL_REMOTE_URL_VAR": "EVAL_REMOTE_URL",
    "EVAL_AUTH_TOKEN_VAR": "EVAL_AUTH_TOKEN",
    "EVAL_NAMESPACE_VAR": "EVAL_NAMESPACE"
  },
  "results_dir_name": "sam-remote-eval-test",
  "runs": 1,
  "test_cases": [
    "tests/evaluation/test_cases/filter_csv_employees_by_age_and_country.test.json",
    "tests/evaluation/test_cases/hello_world.test.json"
  ],
  "evaluation_settings": {
    "tool_match": {
      "enabled": true
    },
    "response_match": {
      "enabled": true
    },
    "llm_evaluator": {
      "enabled": true,
      "env": {
        "LLM_SERVICE_PLANNING_MODEL_NAME": "openai/gemini-2.5-pro",
        "LLM_SERVICE_ENDPOINT_VAR": "LLM_SERVICE_ENDPOINT",
        "LLM_SERVICE_API_KEY_VAR": "LLM_SERVICE_API_KEY"
      }
    }
  }
}
```

## Evaluation Settings

The `evaluation_settings` block in the test suite `JSON` file allows you to configure how the evaluation is performed. Each enabled setting provides a score from 0 to 1, which contributes to the overall score for the test case.

### `tool_match`

The `tool_match` setting compares the tools the agent used with the `expected_tools` defined in the test case. This is a simple, direct comparison and does not use a language model for the evaluation. It is most effective when the agent's expected behavior is straightforward and there is a clear, correct sequence of tools to be used. In more complex scenarios where multiple paths could lead to a successful outcome, this method may not be the best way to evaluate the agent's performance.

### `response_match`

The `response_match` setting compares the agent's final response with the `expected_response` from the test case. This comparison is based on the ROUGE metric, which evaluates the similarity between two responses by comparing their sequences of words. This method does not use a language model for the evaluation and does not work well with synonyms, so it is most effective when the expected answer is consistent. For more information about the ROUGE metric, see the [official documentation](https://pypi.org/project/rouge-metric/).

### `llm_evaluator`

The `llm_evaluator` setting uses a language model to evaluate the entire lifecycle of a request within the agent mesh. This includes the initial prompt, all tool calls, delegation between agents, artifact inputs and outputs, and the final message output.
The evaluation is based on a `criterion` you provide in the test case, which defines what a successful outcome looks like. This is the most comprehensive evaluation method because it considers the full context of the request's execution.

## Running Evaluations

After you create your test cases and test suite, you can run the evaluation from the command line using the `sam eval` command.

### Command

```bash
sam eval <PATH> [OPTIONS]
```

The command takes the path to the evaluation test suite `JSON` file as a required argument.

### Options

* `-v`, `--verbose`: Enable verbose output to see detailed logs during the evaluation run.
* `-h`, `--help`: Show a help message with information about the command and its options.

### Example

```bash
sam eval tests/evaluation/local_example.json --verbose
```

## Interpreting the Results

After an evaluation run is complete, the framework stores the results in a directory. The path to this directory is `results/` followed by the `results_dir_name` you specified in the test suite.

### Results Directory

The results directory has the following structure:

```
<results_dir_name>/
├── report.html
├── stats.json
└── <model_name>/
    ├── full_messages.json
    ├── results.json
    └── <test_case_id>/
        ├── run_1/
        │   ├── messages.json
        │   ├── summary.json
        │   └── test_case_info.json
        └── run_2/
            └── ...
```

* **`report.html`**: An `HTML` report that provides a comprehensive overview of the evaluation results. See the HTML Report section below.
* **`stats.json`**: A `JSON` file containing detailed statistics about the evaluation run, including scores for each evaluation metric.
* **`<model_name>/`**: A directory for each language model tested (or a single `remote` directory for remote evaluations).
  * **`full_messages.json`**: A log of all messages exchanged during the evaluation for that model.
  * **`results.json`**: The raw evaluation results for each test case.
  * **`<test_case_id>/`**: A directory for each test case, containing a `run_n` subdirectory for each run of the test case. These directories contain detailed logs and artifacts for each run.

### HTML Report

The `report.html` file provides a comprehensive overview of the evaluation results. It includes a summary of the test runs, a breakdown of the results for each test case, and detailed logs for each test run. This report is the primary tool for analyzing the results of an evaluation. You can open this file in a web browser to view the report.