---
title: "Evaluators"
id: evaluators-api
description: "Evaluate your pipelines or individual components."
slug: "/evaluators-api"
---

## answer_exact_match

### AnswerExactMatchEvaluator

An answer exact match evaluator class.

The evaluator checks if the predicted answers match any of the ground truth answers exactly.
The result is a number from 0.0 to 1.0 that represents the proportion of predicted answers
that matched one of the ground truth answers.
There can be multiple ground truth answers and multiple predicted answers as input.

Usage example:

```python
from haystack.components.evaluators import AnswerExactMatchEvaluator

evaluator = AnswerExactMatchEvaluator()
result = evaluator.run(
    ground_truth_answers=["Berlin", "Paris"],
    predicted_answers=["Berlin", "Lyon"],
)

print(result["individual_scores"])
# [1, 0]
print(result["score"])
# 0.5
```

#### run

```python
run(
    ground_truth_answers: list[str], predicted_answers: list[str]
) -> dict[str, Any]
```

Run the AnswerExactMatchEvaluator on the given inputs.

`ground_truth_answers` and `predicted_answers` must have the same length.

**Parameters:**

- **ground_truth_answers** (<code>list\[str\]</code>) – A list of expected answers.
- **predicted_answers** (<code>list\[str\]</code>) – A list of predicted answers.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
  - `individual_scores` - A list of 0s and 1s, where 1 means that the predicted answer matched one of the
    ground truth answers.
  - `score` - A number from 0.0 to 1.0 that represents the proportion of questions where any predicted
    answer matched one of the ground truth answers.

## context_relevance

### ContextRelevanceEvaluator

Bases: <code>LLMEvaluator</code>

Evaluator that checks if a provided context is relevant to the question.

An LLM breaks up a context into multiple statements and checks whether each statement
is relevant for answering a question.
The score for each context is a binary score of 1 or 0, where 1 indicates that the context is relevant
to the question and 0 indicates that it is not.
The evaluator also provides the relevant statements from the context and an average score over all the provided
question-context pairs.

Usage example:

```python
from haystack.components.evaluators import ContextRelevanceEvaluator

questions = ["Who created the Python language?", "Why does Java need a JVM?", "Is C++ better than Python?"]
contexts = [
    [(
        "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming "
        "language. Its design philosophy emphasizes code readability, and its language constructs aim to help "
        "programmers write clear, logical code for both small and large-scale software projects."
    )],
    [(
        "Java is a high-level, class-based, object-oriented programming language that is designed to have as few "
        "implementation dependencies as possible. The JVM has two primary functions: to allow Java programs to run "
        "on any device or operating system (known as the 'write once, run anywhere' principle), and to manage and "
        "optimize program memory."
    )],
    [(
        "C++ is a general-purpose programming language created by Bjarne Stroustrup as an extension of the C "
        "programming language."
    )],
]

evaluator = ContextRelevanceEvaluator()
result = evaluator.run(questions=questions, contexts=contexts)
print(result["score"])
# 0.67
print(result["individual_scores"])
# [1, 1, 0]
print(result["results"])
# [{
#   'relevant_statements': ['Python, created by Guido van Rossum in the late 1980s.'],
#   'score': 1.0
# },
# {
#   'relevant_statements': ['The JVM has two primary functions: to allow Java programs to run on any device or
#                           operating system (known as the "write once, run anywhere" principle), and to manage and
#                           optimize program memory'],
#   'score': 1.0
# },
# {
#   'relevant_statements': [],
#   'score': 0.0
# }]
```

#### __init__

```python
__init__(
    examples: list[dict[str, Any]] | None = None,
    progress_bar: bool = True,
    raise_on_failure: bool = True,
    chat_generator: ChatGenerator | None = None,
) -> None
```

Creates an instance of ContextRelevanceEvaluator.

If no LLM is specified using the `chat_generator` parameter, the component will use OpenAI in JSON mode.

**Parameters:**

- **examples** (<code>list\[dict\[str, Any\]\] | None</code>) – Optional few-shot examples conforming to the expected input and output format of ContextRelevanceEvaluator.
  Default examples will be used if none are provided.
  Each example must be a dictionary with keys "inputs" and "outputs".
  "inputs" must be a dictionary with keys "questions" and "contexts".
  "outputs" must be a dictionary with "relevant_statements".
  Expected format:

  ```python
  [{
      "inputs": {
          "questions": "What is the capital of Italy?", "contexts": ["Rome is the capital of Italy."],
      },
      "outputs": {
          "relevant_statements": ["Rome is the capital of Italy."],
      },
  }]
  ```

- **progress_bar** (<code>bool</code>) – Whether to show a progress bar during the evaluation.
- **raise_on_failure** (<code>bool</code>) – Whether to raise an exception if the API call fails.
- **chat_generator** (<code>ChatGenerator | None</code>) – A ChatGenerator instance which represents the LLM.
  In order for the component to work, the LLM should be configured to return a JSON object. For example,
  when using the OpenAIChatGenerator, you should pass `{"response_format": {"type": "json_object"}}` in the
  `generation_kwargs`.

#### run

```python
run(**inputs: Any) -> dict[str, Any]
```

Run the LLM evaluator.

**Parameters:**

- **questions** – A list of questions.
- **contexts** – A list of lists of contexts. Each list of contexts corresponds to one question.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
  - `score`: Mean context relevance score over all the provided input questions.
  - `individual_scores`: A list of binary relevance scores (0 or 1), one for each input question.
  - `results`: A list of dictionaries with `relevant_statements` and `score` for each input context.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serialize this component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> ContextRelevanceEvaluator
```

Deserialize this component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary representation of this component.

**Returns:**

- <code>ContextRelevanceEvaluator</code> – The deserialized component instance.

## document_map

### DocumentMAPEvaluator

A Mean Average Precision (MAP) evaluator for documents.

Evaluator that calculates the mean average precision of the retrieved documents, a metric
that measures how high retrieved documents are ranked.
Each question can have multiple ground truth documents and multiple retrieved documents.

`DocumentMAPEvaluator` doesn't normalize its inputs; use the `DocumentCleaner` component
to clean and normalize the documents before passing them to this evaluator.

Usage example:

```python
from haystack import Document
from haystack.components.evaluators import DocumentMAPEvaluator

evaluator = DocumentMAPEvaluator()
result = evaluator.run(
    ground_truth_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="9th")],
    ],
    retrieved_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="10th century"), Document(content="9th")],
    ],
)

print(result["individual_scores"])
# [1.0, 0.8333333333333333]
print(result["score"])
# 0.9166666666666666
```

#### __init__

```python
__init__(document_comparison_field: str = 'content') -> None
```

Create a DocumentMAPEvaluator component.

**Parameters:**

- **document_comparison_field** (<code>str</code>) – The Document field to use for comparison. Possible options:
  - `"content"`: uses `doc.content`
  - `"id"`: uses `doc.id`
  - A `meta.` prefix followed by a key name: uses `doc.meta["<key>"]`
    (e.g. `"meta.file_id"`, `"meta.page_number"`)
    Nested keys are supported (e.g. `"meta.source.url"`).

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### run

```python
run(
    ground_truth_documents: list[list[Document]],
    retrieved_documents: list[list[Document]],
) -> dict[str, Any]
```

Run the DocumentMAPEvaluator on the given inputs.

All lists must have the same length.

**Parameters:**

- **ground_truth_documents** (<code>list\[list\[Document\]\]</code>) – A list of expected documents for each question.
- **retrieved_documents** (<code>list\[list\[Document\]\]</code>) – A list of retrieved documents for each question.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
  - `score` - The average of calculated scores.
  - `individual_scores` - A list of numbers from 0.0 to 1.0 that represents how high retrieved documents
    are ranked.

## document_mrr

### DocumentMRREvaluator

Evaluator that calculates the mean reciprocal rank of the retrieved documents.

MRR measures how high the first retrieved document is ranked.
Each question can have multiple ground truth documents and multiple retrieved documents.

`DocumentMRREvaluator` doesn't normalize its inputs; use the `DocumentCleaner` component
to clean and normalize the documents before passing them to this evaluator.
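
To make the scores easier to interpret, here is a minimal sketch of the reciprocal-rank idea behind the metric.
This is an illustration of the standard formula, not the component's implementation; MRR is simply the mean of
these per-question values:

```python
# Illustration only: the reciprocal rank for one question is 1 / rank of the first
# retrieved document that matches a ground truth document (0.0 if none matches).
def reciprocal_rank(ground_truth: list[str], retrieved: list[str]) -> float:
    for rank, doc in enumerate(retrieved, start=1):
        if doc in ground_truth:
            return 1.0 / rank
    return 0.0


print(reciprocal_rank(["France"], ["France", "Germany"]))         # first match at rank 1 -> 1.0
print(reciprocal_rank(["9th century"], ["10th", "9th century"]))  # first match at rank 2 -> 0.5
```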

Usage example:

```python
from haystack import Document
from haystack.components.evaluators import DocumentMRREvaluator

evaluator = DocumentMRREvaluator()
result = evaluator.run(
    ground_truth_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="9th")],
    ],
    retrieved_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="10th century"), Document(content="9th")],
    ],
)
print(result["individual_scores"])
# [1.0, 1.0]
print(result["score"])
# 1.0
```

#### __init__

```python
__init__(document_comparison_field: str = 'content') -> None
```

Create a DocumentMRREvaluator component.

**Parameters:**

- **document_comparison_field** (<code>str</code>) – The Document field to use for comparison. Possible options:
  - `"content"`: uses `doc.content`
  - `"id"`: uses `doc.id`
  - A `meta.` prefix followed by a key name: uses `doc.meta["<key>"]`
    (e.g. `"meta.file_id"`, `"meta.page_number"`)
    Nested keys are supported (e.g. `"meta.source.url"`).

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### run

```python
run(
    ground_truth_documents: list[list[Document]],
    retrieved_documents: list[list[Document]],
) -> dict[str, Any]
```

Run the DocumentMRREvaluator on the given inputs.

`ground_truth_documents` and `retrieved_documents` must have the same length.

**Parameters:**

- **ground_truth_documents** (<code>list\[list\[Document\]\]</code>) – A list of expected documents for each question.
- **retrieved_documents** (<code>list\[list\[Document\]\]</code>) – A list of retrieved documents for each question.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
  - `score` - The average of calculated scores.
  - `individual_scores` - A list of numbers from 0.0 to 1.0 that represents how high the first retrieved
    document is ranked.

## document_ndcg

### DocumentNDCGEvaluator

Evaluator that calculates the normalized discounted cumulative gain (NDCG) of retrieved documents.

Each question can have multiple ground truth documents and multiple retrieved documents.
If the ground truth documents have relevance scores, the NDCG calculation uses these scores.
Otherwise, it assumes binary relevance of all ground truth documents.

Usage example:

```python
from haystack import Document
from haystack.components.evaluators import DocumentNDCGEvaluator

evaluator = DocumentNDCGEvaluator()
result = evaluator.run(
    ground_truth_documents=[[Document(content="France", score=1.0), Document(content="Paris", score=0.5)]],
    retrieved_documents=[[Document(content="France"), Document(content="Germany"), Document(content="Paris")]],
)
print(result["individual_scores"])
# [0.8869]
print(result["score"])
# 0.8869
```

#### run

```python
run(
    ground_truth_documents: list[list[Document]],
    retrieved_documents: list[list[Document]],
) -> dict[str, Any]
```

Run the DocumentNDCGEvaluator on the given inputs.

`ground_truth_documents` and `retrieved_documents` must have the same length.
The list items within `ground_truth_documents` and `retrieved_documents` can differ in length.

**Parameters:**

- **ground_truth_documents** (<code>list\[list\[Document\]\]</code>) – Lists of expected documents, one list per question. Binary relevance is used if documents have no scores.
- **retrieved_documents** (<code>list\[list\[Document\]\]</code>) – Lists of retrieved documents, one list per question.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
  - `score` - The average of calculated scores.
  - `individual_scores` - A list of numbers from 0.0 to 1.0 that represents the NDCG for each question.

#### validate_inputs

```python
validate_inputs(
    gt_docs: list[list[Document]], ret_docs: list[list[Document]]
) -> None
```

Validate the input parameters.

**Parameters:**

- **gt_docs** (<code>list\[list\[Document\]\]</code>) – The ground_truth_documents to validate.
- **ret_docs** (<code>list\[list\[Document\]\]</code>) – The retrieved_documents to validate.

**Raises:**

- <code>ValueError</code> – If the ground_truth_documents or the retrieved_documents are an empty list.
  If the length of ground_truth_documents and retrieved_documents differs.
  If any list of documents in ground_truth_documents contains a mix of documents with and without a score.

#### calculate_dcg

```python
calculate_dcg(gt_docs: list[Document], ret_docs: list[Document]) -> float
```

Calculate the discounted cumulative gain (DCG) of the retrieved documents.

**Parameters:**

- **gt_docs** (<code>list\[Document\]</code>) – The ground truth documents.
- **ret_docs** (<code>list\[Document\]</code>) – The retrieved documents.

**Returns:**

- <code>float</code> – The discounted cumulative gain (DCG) of the retrieved
  documents based on the ground truth documents.

#### calculate_idcg

```python
calculate_idcg(gt_docs: list[Document]) -> float
```

Calculate the ideal discounted cumulative gain (IDCG) of the ground truth documents.

**Parameters:**

- **gt_docs** (<code>list\[Document\]</code>) – The ground truth documents.

**Returns:**

- <code>float</code> – The ideal discounted cumulative gain (IDCG) of the ground truth documents.

## document_recall

### RecallMode

Bases: <code>Enum</code>

Enum for the mode to use for calculating the recall score.

#### from_str

```python
from_str(string: str) -> RecallMode
```

Convert a string to a RecallMode enum.

### DocumentRecallEvaluator

Evaluator that calculates the Recall score for a list of documents.

Returns both a list of scores for each question and the average.
There can be multiple ground truth documents and multiple predicted documents as input.
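
The `mode` and `document_comparison_field` parameters (described under `__init__` below) control how matches are
counted and which Document field is compared. A minimal construction sketch; the `meta.file_id` key is purely
illustrative, and the import path for `RecallMode` is assumed from this section's module name:

```python
from haystack.components.evaluators import DocumentRecallEvaluator
from haystack.components.evaluators.document_recall import RecallMode  # assumed module path

# Compare documents by a metadata key instead of their content and keep the
# default single-hit mode explicit; the meta key below is hypothetical.
evaluator = DocumentRecallEvaluator(
    mode=RecallMode.SINGLE_HIT,
    document_comparison_field="meta.file_id",
)
```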

Usage example:

```python
from haystack import Document
from haystack.components.evaluators import DocumentRecallEvaluator

evaluator = DocumentRecallEvaluator()
result = evaluator.run(
    ground_truth_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="9th")],
    ],
    retrieved_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="10th century"), Document(content="9th")],
    ],
)
print(result["individual_scores"])
# [1.0, 1.0]
print(result["score"])
# 1.0
```

#### __init__

```python
__init__(
    mode: str | RecallMode = RecallMode.SINGLE_HIT,
    document_comparison_field: str = "content",
) -> None
```

Create a DocumentRecallEvaluator component.

**Parameters:**

- **mode** (<code>str | RecallMode</code>) – Mode to use for calculating the recall score.
- **document_comparison_field** (<code>str</code>) – The Document field to use for comparison. Possible options:
  - `"content"`: uses `doc.content`
  - `"id"`: uses `doc.id`
  - A `meta.` prefix followed by a key name: uses `doc.meta["<key>"]`
    (e.g. `"meta.file_id"`, `"meta.page_number"`)
    Nested keys are supported (e.g. `"meta.source.url"`).

#### run

```python
run(
    ground_truth_documents: list[list[Document]],
    retrieved_documents: list[list[Document]],
) -> dict[str, Any]
```

Run the DocumentRecallEvaluator on the given inputs.

`ground_truth_documents` and `retrieved_documents` must have the same length.

**Parameters:**

- **ground_truth_documents** (<code>list\[list\[Document\]\]</code>) – A list of expected documents for each question.
- **retrieved_documents** (<code>list\[list\[Document\]\]</code>) – A list of retrieved documents for each question.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
  - `score` - The average of calculated scores.
  - `individual_scores` - A list of numbers from 0.0 to 1.0 that represents the proportion of matching
    documents retrieved. If the mode is `single_hit`, the individual scores are 0 or 1.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

## faithfulness

### FaithfulnessEvaluator

Bases: <code>LLMEvaluator</code>

Evaluator that checks if a generated answer can be inferred from the provided contexts.

An LLM separates the answer into multiple statements and checks whether each statement can be inferred from the
contexts. The final score for the full answer is a number from 0.0 to 1.0. It represents the proportion of
statements that can be inferred from the provided contexts.

Usage example:

```python
from haystack.components.evaluators import FaithfulnessEvaluator

questions = ["Who created the Python language?"]
contexts = [
    [(
        "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming "
        "language. Its design philosophy emphasizes code readability, and its language constructs aim to help "
        "programmers write clear, logical code for both small and large-scale software projects."
    )],
]
predicted_answers = [
    "Python is a high-level general-purpose programming language that was created by George Lucas."
]
evaluator = FaithfulnessEvaluator()
result = evaluator.run(questions=questions, contexts=contexts, predicted_answers=predicted_answers)

print(result["individual_scores"])
# [0.5]
print(result["score"])
# 0.5
print(result["results"])
# [{'statements': ['Python is a high-level general-purpose programming language.',
# 'Python was created by George Lucas.'], 'statement_scores': [1, 0], 'score': 0.5}]
```

#### __init__

```python
__init__(
    examples: list[dict[str, Any]] | None = None,
    progress_bar: bool = True,
    raise_on_failure: bool = True,
    chat_generator: ChatGenerator | None = None,
) -> None
```

Creates an instance of FaithfulnessEvaluator.

If no LLM is specified using the `chat_generator` parameter, the component will use OpenAI in JSON mode.

**Parameters:**

- **examples** (<code>list\[dict\[str, Any\]\] | None</code>) – Optional few-shot examples conforming to the expected input and output format of FaithfulnessEvaluator.
  Default examples will be used if none are provided.
  Each example must be a dictionary with keys "inputs" and "outputs".
  "inputs" must be a dictionary with keys "questions", "contexts", and "predicted_answers".
  "outputs" must be a dictionary with "statements" and "statement_scores".
  Expected format:

  ```python
  [{
      "inputs": {
          "questions": "What is the capital of Italy?", "contexts": ["Rome is the capital of Italy."],
          "predicted_answers": "Rome is the capital of Italy with more than 4 million inhabitants.",
      },
      "outputs": {
          "statements": ["Rome is the capital of Italy.", "Rome has more than 4 million inhabitants."],
          "statement_scores": [1, 0],
      },
  }]
  ```

- **progress_bar** (<code>bool</code>) – Whether to show a progress bar during the evaluation.
- **raise_on_failure** (<code>bool</code>) – Whether to raise an exception if the API call fails.
- **chat_generator** (<code>ChatGenerator | None</code>) – A ChatGenerator instance which represents the LLM.
  In order for the component to work, the LLM should be configured to return a JSON object. For example,
  when using the OpenAIChatGenerator, you should pass `{"response_format": {"type": "json_object"}}` in the
  `generation_kwargs`.

#### run

```python
run(**inputs: Any) -> dict[str, Any]
```

Run the LLM evaluator.

**Parameters:**

- **questions** – A list of questions.
- **contexts** – A nested list of contexts that correspond to the questions.
- **predicted_answers** – A list of predicted answers.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
  - `score`: Mean faithfulness score over all the provided input answers.
  - `individual_scores`: A list of faithfulness scores for each input answer.
  - `results`: A list of dictionaries with `statements` and `statement_scores` for each input answer.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serialize this component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with serialized data.
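
`to_dict` and `from_dict` (below) pair up to round-trip a configured evaluator, for example when storing pipeline
configurations. A minimal sketch, assuming the default OpenAI backend and that the `OPENAI_API_KEY` environment
variable is set:

```python
from haystack.components.evaluators import FaithfulnessEvaluator

evaluator = FaithfulnessEvaluator()               # default OpenAI backend, per the intro above
data = evaluator.to_dict()                        # plain dict with the component's init parameters
restored = FaithfulnessEvaluator.from_dict(data)  # rebuild an equivalent component from that dict
```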

#### from_dict

```python
from_dict(data: dict[str, Any]) -> FaithfulnessEvaluator
```

Deserialize this component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary representation of this component.

**Returns:**

- <code>FaithfulnessEvaluator</code> – The deserialized component instance.

## llm_evaluator

### LLMEvaluator

Uses an LLM to evaluate inputs based on a prompt containing instructions and examples.

The default API requires an OpenAI API key to be provided as an environment variable "OPENAI_API_KEY".
The inputs are lists that are user-defined depending on the desired metric.
The output is a dictionary with a key `results` containing a list of evaluation results.
Each result is a dictionary with user-defined keys and values of either 0 for FALSE or 1 for TRUE.

Usage example:

```python
from haystack.components.evaluators import LLMEvaluator

evaluator = LLMEvaluator(
    instructions="Is this answer problematic for children?",
    inputs=[("predicted_answers", list[str])],
    outputs=["score"],
    examples=[
        {"inputs": {"predicted_answers": "Damn, this is straight outta hell!!!"}, "outputs": {"score": 1}},
        {"inputs": {"predicted_answers": "Football is the most popular sport."}, "outputs": {"score": 0}},
    ],
)
predicted_answers = [
    "Football is the most popular sport with around 4 billion followers worldwide",
    "Python language was created by Guido van Rossum.",
]
results = evaluator.run(predicted_answers=predicted_answers)
print(results)
# {'results': [{'score': 0}, {'score': 0}]}
```

#### __init__

```python
__init__(
    instructions: str,
    inputs: list[tuple[str, type[list]]],
    outputs: list[str],
    examples: list[dict[str, Any]],
    progress_bar: bool = True,
    *,
    raise_on_failure: bool = True,
    chat_generator: ChatGenerator | None = None
) -> None
```

Creates an instance of LLMEvaluator.

If no LLM is specified using the `chat_generator` parameter, the component will use OpenAI in JSON mode.

**Parameters:**

- **instructions** (<code>str</code>) – The prompt instructions to use for evaluation.
  Should be a question about the inputs that can be answered with yes or no.
- **inputs** (<code>list\[tuple\[str, type\[list\]\]\]</code>) – The inputs that the component expects as incoming connections and that it evaluates.
  Each input is a tuple of an input name and input type. Input types must be lists.
- **outputs** (<code>list\[str\]</code>) – Output names of the evaluation results. They correspond to keys in the output dictionary.
- **examples** (<code>list\[dict\[str, Any\]\]</code>) – Few-shot examples conforming to the expected input and output format as defined in the `inputs` and
  `outputs` parameters.
  Each example is a dictionary with keys "inputs" and "outputs".
  They contain the input and output as dictionaries respectively.
- **raise_on_failure** (<code>bool</code>) – If True, the component will raise an exception on an unsuccessful API call.
- **progress_bar** (<code>bool</code>) – Whether to show a progress bar during the evaluation.
- **chat_generator** (<code>ChatGenerator | None</code>) – A ChatGenerator instance which represents the LLM.
  In order for the component to work, the LLM should be configured to return a JSON object. For example,
  when using the OpenAIChatGenerator, you should pass `{"response_format": {"type": "json_object"}}` in the
  `generation_kwargs`.

#### warm_up

```python
warm_up() -> None
```

Warm up the component by warming up the underlying chat generator.

#### validate_init_parameters

```python
validate_init_parameters(
    inputs: list[tuple[str, type[list]]],
    outputs: list[str],
    examples: list[dict[str, Any]],
) -> None
```

Validate the init parameters.

**Parameters:**

- **inputs** (<code>list\[tuple\[str, type\[list\]\]\]</code>) – The inputs to validate.
- **outputs** (<code>list\[str\]</code>) – The outputs to validate.
- **examples** (<code>list\[dict\[str, Any\]\]</code>) – The examples to validate.

**Raises:**

- <code>ValueError</code> – If the inputs are not a list of tuples with a string and a type of list.
  If the outputs are not a list of strings.
  If the examples are not a list of dictionaries.
  If any example does not have keys "inputs" and "outputs" with values that are dictionaries with string keys.

#### run

```python
run(**inputs: Any) -> dict[str, Any]
```

Run the LLM evaluator.

**Parameters:**

- **inputs** (<code>Any</code>) – The input values to evaluate. The keys are the input names and the values are lists of input values.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with a `results` entry that contains a list of results.
  Each result is a dictionary containing the keys as defined in the `outputs` parameter of the LLMEvaluator
  and the evaluation results as the values. If an exception occurs for a particular input value, the result
  will be `None` for that entry.
  If the API is "openai" and the response contains a "meta" key, the metadata from OpenAI will be included
  in the output dictionary, under the key "meta".

**Raises:**

- <code>ValueError</code> – Only if `raise_on_failure` is set to True and the received inputs are not lists or have
  different lengths, or if the output is not valid JSON or doesn't contain the expected keys.

#### prepare_template

```python
prepare_template() -> str
```

Prepare the prompt template.

Combine instructions, inputs, outputs, and examples into one prompt template with the following format:

Instructions:
`<instructions>`

Generate the response in JSON format with the following keys:
`<list of output keys>`
Consider the instructions and the examples below to determine those values.

Examples:
`<examples>`

Inputs:
`<inputs>`
Outputs:

**Returns:**

- <code>str</code> – The prompt template.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serialize this component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – The serialized component as a dictionary.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> LLMEvaluator
```

Deserialize this component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary representation of this component.

**Returns:**

- <code>LLMEvaluator</code> – The deserialized component instance.
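
As noted under `__init__`, the `chat_generator` must be configured to return a JSON object. A minimal sketch
wiring an OpenAIChatGenerator in JSON mode into the evaluator, reusing the values from the usage example above
(it assumes the `OPENAI_API_KEY` environment variable is set):

```python
from haystack.components.evaluators import LLMEvaluator
from haystack.components.generators.chat import OpenAIChatGenerator

# Configure the LLM to return a JSON object, as LLMEvaluator requires.
chat_generator = OpenAIChatGenerator(
    generation_kwargs={"response_format": {"type": "json_object"}},
)

evaluator = LLMEvaluator(
    instructions="Is this answer problematic for children?",
    inputs=[("predicted_answers", list[str])],
    outputs=["score"],
    examples=[
        {"inputs": {"predicted_answers": "Damn, this is straight outta hell!!!"}, "outputs": {"score": 1}},
        {"inputs": {"predicted_answers": "Football is the most popular sport."}, "outputs": {"score": 0}},
    ],
    chat_generator=chat_generator,
)
results = evaluator.run(predicted_answers=["Python language was created by Guido van Rossum."])
```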

#### validate_input_parameters

```python
validate_input_parameters(
    expected: dict[str, Any], received: dict[str, Any]
) -> None
```

Validate the input parameters.

**Parameters:**

- **expected** (<code>dict\[str, Any\]</code>) – The expected input parameters.
- **received** (<code>dict\[str, Any\]</code>) – The received input parameters.

**Raises:**

- <code>ValueError</code> – If not all expected inputs are present in the received inputs.
  If the received inputs are not lists or have different lengths.

## sas_evaluator

### SASEvaluator

SASEvaluator computes the Semantic Answer Similarity (SAS) between a list of predictions and a list of ground truths.

It's usually used in Retrieval Augmented Generation (RAG) pipelines to evaluate the quality of the generated
answers. The SAS is computed using a pre-trained model from the Hugging Face model hub. The model can be either a
Bi-Encoder or a Cross-Encoder. The choice of the model is based on the `model` parameter.

Usage example:

```python
from haystack.components.evaluators.sas_evaluator import SASEvaluator

evaluator = SASEvaluator(model="cross-encoder/ms-marco-MiniLM-L-6-v2")
ground_truths = [
    "A construction budget of US $2.3 billion",
    "The Eiffel Tower, completed in 1889, symbolizes Paris's cultural magnificence.",
    "The Meiji Restoration in 1868 transformed Japan into a modernized world power.",
]
predictions = [
    "A construction budget of US $2.3 billion",
    "The Eiffel Tower, completed in 1889, symbolizes Paris's cultural magnificence.",
    "The Meiji Restoration in 1868 transformed Japan into a modernized world power.",
]
result = evaluator.run(
    ground_truth_answers=ground_truths, predicted_answers=predictions
)

print(result["score"])
# 0.9999673763910929

print(result["individual_scores"])
# [0.9999765157699585, 0.999968409538269, 0.9999572038650513]
```

#### __init__

```python
__init__(
    model: str = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    batch_size: int = 32,
    device: ComponentDevice | None = None,
    token: Secret = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
) -> None
```

Creates a new instance of SASEvaluator.

**Parameters:**

- **model** (<code>str</code>) – SentenceTransformers semantic textual similarity model, should be a path or a string pointing to a
  downloadable model.
- **batch_size** (<code>int</code>) – Number of prediction-label pairs to encode at once.
- **device** (<code>ComponentDevice | None</code>) – The device on which the model is loaded. If `None`, the default device is automatically selected.
- **token** (<code>Secret</code>) – The Hugging Face token for HTTP bearer authorization.
  You can find your HF token in your [account settings](https://huggingface.co/settings/tokens).

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serialize this component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – The serialized component as a dictionary.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> SASEvaluator
```

Deserialize this component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary representation of this component.

**Returns:**

- <code>SASEvaluator</code> – The deserialized component instance.

#### warm_up

```python
warm_up() -> None
```

Initializes the component.

#### run

```python
run(
    ground_truth_answers: list[str], predicted_answers: list[str]
) -> dict[str, float | list[float]]
```

SASEvaluator component run method.

Run the SASEvaluator to compute the Semantic Answer Similarity (SAS) between a list of predicted answers
and a list of ground truth answers. Both must be lists of strings of the same length.

**Parameters:**

- **ground_truth_answers** (<code>list\[str\]</code>) – A list of expected answers for each question.
- **predicted_answers** (<code>list\[str\]</code>) – A list of generated answers for each question.

**Returns:**

- <code>dict\[str, float | list\[float\]\]</code> – A dictionary with the following outputs:
  - `score`: Mean SAS score over all the predictions/ground-truth pairs.
  - `individual_scores`: A list of similarity scores for each prediction/ground-truth pair.
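
When the component is used directly rather than inside a pipeline, call `warm_up` before the first `run` so the
model is loaded. A minimal sketch with the default model; it assumes the model can be downloaded from the
Hugging Face Hub and the answer strings are illustrative:

```python
from haystack.components.evaluators.sas_evaluator import SASEvaluator

evaluator = SASEvaluator()  # default Bi-Encoder model
evaluator.warm_up()         # load the SentenceTransformers model before running
result = evaluator.run(
    ground_truth_answers=["Berlin is the capital of Germany."],
    predicted_answers=["The capital of Germany is Berlin."],
)
print(result["score"], result["individual_scores"])
```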