---
title: "Evaluators"
id: evaluators-api
description: "Evaluate your pipelines or individual components."
slug: "/evaluators-api"
---

## answer_exact_match

### AnswerExactMatchEvaluator

An answer exact match evaluator class.

The evaluator checks whether each predicted answer matches any of the ground truth answers exactly.
The result is a number from 0.0 to 1.0 that represents the proportion of predicted answers
that matched one of the ground truth answers.
There can be multiple ground truth answers and multiple predicted answers as input.

Usage example:

```python
from haystack.components.evaluators import AnswerExactMatchEvaluator

evaluator = AnswerExactMatchEvaluator()
result = evaluator.run(
    ground_truth_answers=["Berlin", "Paris"],
    predicted_answers=["Berlin", "Lyon"],
)

print(result["individual_scores"])
# [1, 0]
print(result["score"])
# 0.5
```

#### run

```python
run(
    ground_truth_answers: list[str], predicted_answers: list[str]
) -> dict[str, Any]
```

Run the AnswerExactMatchEvaluator on the given inputs.

The `ground_truth_answers` and `predicted_answers` must have the same length.

**Parameters:**

- **ground_truth_answers** (<code>list\[str\]</code>) – A list of expected answers.
- **predicted_answers** (<code>list\[str\]</code>) – A list of predicted answers.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
  - `individual_scores` - A list of 0s and 1s, where 1 means that the predicted answer matched one of the
    ground truth answers.
  - `score` - A number from 0.0 to 1.0 that represents the proportion of questions where any predicted
    answer matched one of the ground truth answers.
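Exact match is strict string comparison, so trivial differences in casing or whitespace count as misses. A minimal sketch of normalizing answers before evaluation (the `normalize` helper below is hypothetical, not part of Haystack):

```python
from haystack.components.evaluators import AnswerExactMatchEvaluator


def normalize(answers: list[str]) -> list[str]:
    # Hypothetical helper: lowercase and strip surrounding whitespace
    # so that "berlin " and "Berlin" compare as equal.
    return [a.strip().lower() for a in answers]


evaluator = AnswerExactMatchEvaluator()
result = evaluator.run(
    ground_truth_answers=normalize(["Berlin", "Paris"]),
    predicted_answers=normalize(["berlin ", "Lyon"]),
)
print(result["score"])
# 0.5 - "berlin " now matches, "Lyon" still misses
```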
## context_relevance

### ContextRelevanceEvaluator

Bases: <code>LLMEvaluator</code>

Evaluator that checks if a provided context is relevant to the question.

An LLM breaks up a context into multiple statements and checks whether each statement
is relevant for answering a question.
The score for each context is binary: 1 indicates that the context is relevant
to the question, and 0 indicates that it is not.
The evaluator also returns the relevant statements from the context and an average score over all the provided
question-context pairs.

Usage example:

```python
from haystack.components.evaluators import ContextRelevanceEvaluator

questions = ["Who created the Python language?", "Why does Java need a JVM?", "Is C++ better than Python?"]
contexts = [
    [(
        "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming "
        "language. Its design philosophy emphasizes code readability, and its language constructs aim to help "
        "programmers write clear, logical code for both small and large-scale software projects."
    )],
    [(
        "Java is a high-level, class-based, object-oriented programming language that is designed to have as few "
        "implementation dependencies as possible. The JVM has two primary functions: to allow Java programs to run "
        "on any device or operating system (known as the 'write once, run anywhere' principle), and to manage and "
        "optimize program memory."
    )],
    [(
        "C++ is a general-purpose programming language created by Bjarne Stroustrup as an extension of the C "
        "programming language."
    )],
]

evaluator = ContextRelevanceEvaluator()
result = evaluator.run(questions=questions, contexts=contexts)
print(result["score"])
# 0.67
print(result["individual_scores"])
# [1, 1, 0]
print(result["results"])
# [{
#   'relevant_statements': ['Python, created by Guido van Rossum in the late 1980s.'],
#   'score': 1.0
# },
# {
#   'relevant_statements': ['The JVM has two primary functions: to allow Java programs to run on any device or
#    operating system (known as the "write once, run anywhere" principle), and to manage and
#    optimize program memory'],
#   'score': 1.0
# },
# {
#   'relevant_statements': [],
#   'score': 0.0
# }]
```

#### __init__

```python
__init__(
    examples: list[dict[str, Any]] | None = None,
    progress_bar: bool = True,
    raise_on_failure: bool = True,
    chat_generator: ChatGenerator | None = None,
)
```

Creates an instance of ContextRelevanceEvaluator.

If no LLM is specified using the `chat_generator` parameter, the component will use OpenAI in JSON mode.

**Parameters:**

- **examples** (<code>list\[dict\[str, Any\]\] | None</code>) – Optional few-shot examples conforming to the expected input and output format of ContextRelevanceEvaluator.
  Default examples will be used if none are provided.
  Each example must be a dictionary with keys "inputs" and "outputs".
  "inputs" must be a dictionary with keys "questions" and "contexts".
  "outputs" must be a dictionary with "relevant_statements".
  Expected format:

  ```python
  [{
      "inputs": {
          "questions": "What is the capital of Italy?", "contexts": ["Rome is the capital of Italy."],
      },
      "outputs": {
          "relevant_statements": ["Rome is the capital of Italy."],
      },
  }]
  ```

- **progress_bar** (<code>bool</code>) – Whether to show a progress bar during the evaluation.
- **raise_on_failure** (<code>bool</code>) – Whether to raise an exception if the API call fails.
- **chat_generator** (<code>ChatGenerator | None</code>) – A ChatGenerator instance which represents the LLM.
  For the component to work, the LLM should be configured to return a JSON object. For example,
  when using the OpenAIChatGenerator, you should pass `{"response_format": {"type": "json_object"}}` in the
  `generation_kwargs`.
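Following the `chat_generator` note above, a sketch of passing a custom LLM in the required JSON mode (the model name is illustrative):

```python
from haystack.components.evaluators import ContextRelevanceEvaluator
from haystack.components.generators.chat import OpenAIChatGenerator

# Illustrative model choice; any ChatGenerator works as long as it
# is configured to return a JSON object.
chat_generator = OpenAIChatGenerator(
    model="gpt-4o-mini",
    generation_kwargs={"response_format": {"type": "json_object"}},
)
evaluator = ContextRelevanceEvaluator(chat_generator=chat_generator)
```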
#### run

```python
run(**inputs) -> dict[str, Any]
```

Run the LLM evaluator.

**Parameters:**

- **questions** – A list of questions.
- **contexts** – A list of lists of contexts. Each list of contexts corresponds to one question.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
  - `score`: Mean context relevance score over all the provided input questions.
  - `individual_scores`: A list of binary relevance scores for each input question.
  - `results`: A list of dictionaries with `relevant_statements` and `score` for each input context.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serialize this component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> ContextRelevanceEvaluator
```

Deserialize this component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary representation of this component.

**Returns:**

- <code>ContextRelevanceEvaluator</code> – The deserialized component instance.

## document_map

### DocumentMAPEvaluator

A Mean Average Precision (MAP) evaluator for documents.

Evaluator that calculates the mean average precision of the retrieved documents, a metric
that measures how high retrieved documents are ranked.
Each question can have multiple ground truth documents and multiple retrieved documents.

`DocumentMAPEvaluator` doesn't normalize its inputs; use the `DocumentCleaner` component
to clean and normalize the documents before passing them to this evaluator.

Usage example:

```python
from haystack import Document
from haystack.components.evaluators import DocumentMAPEvaluator

evaluator = DocumentMAPEvaluator()
result = evaluator.run(
    ground_truth_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="9th")],
    ],
    retrieved_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="10th century"), Document(content="9th")],
    ],
)

print(result["individual_scores"])
# [1.0, 0.8333333333333333]
print(result["score"])
# 0.9166666666666666
```
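To see where the second score comes from: average precision rewards relevant documents that appear early in the ranking. A quick re-derivation of the `0.8333` above, using a standard formulation of average precision (plain Python, no Haystack required; the library's internals may differ in detail):

```python
def average_precision(ground_truth: set[str], retrieved: list[str]) -> float:
    # Precision is taken at each rank where a relevant document appears,
    # then averaged over the number of relevant documents found.
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(retrieved, start=1):
        if doc in ground_truth:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Relevant documents sit at ranks 1 and 3: (1/1 + 2/3) / 2 = 0.8333...
print(average_precision({"9th century", "9th"}, ["9th century", "10th century", "9th"]))
```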
#### __init__

```python
__init__(document_comparison_field: str = 'content')
```

Create a DocumentMAPEvaluator component.

**Parameters:**

- **document_comparison_field** (<code>str</code>) – The Document field to use for comparison. Possible options:
  - `"content"`: uses `doc.content`
  - `"id"`: uses `doc.id`
  - A `meta.` prefix followed by a key name: uses `doc.meta["<key>"]`
    (e.g. `"meta.file_id"`, `"meta.page_number"`).
    Nested keys are supported (e.g. `"meta.source.url"`).

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### run

```python
run(
    ground_truth_documents: list[list[Document]],
    retrieved_documents: list[list[Document]],
) -> dict[str, Any]
```

Run the DocumentMAPEvaluator on the given inputs.

All lists must have the same length.

**Parameters:**

- **ground_truth_documents** (<code>list\[list\[Document\]\]</code>) – A list of expected documents for each question.
- **retrieved_documents** (<code>list\[list\[Document\]\]</code>) – A list of retrieved documents for each question.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
  - `score` - The average of calculated scores.
  - `individual_scores` - A list of numbers from 0.0 to 1.0 that represents how high retrieved documents
    are ranked.

## document_mrr

### DocumentMRREvaluator

Evaluator that calculates the mean reciprocal rank of the retrieved documents.

MRR measures how high the first relevant retrieved document is ranked.
Each question can have multiple ground truth documents and multiple retrieved documents.

`DocumentMRREvaluator` doesn't normalize its inputs; use the `DocumentCleaner` component
to clean and normalize the documents before passing them to this evaluator.

Usage example:

```python
from haystack import Document
from haystack.components.evaluators import DocumentMRREvaluator

evaluator = DocumentMRREvaluator()
result = evaluator.run(
    ground_truth_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="9th")],
    ],
    retrieved_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="10th century"), Document(content="9th")],
    ],
)
print(result["individual_scores"])
# [1.0, 1.0]
print(result["score"])
# 1.0
```

#### __init__

```python
__init__(document_comparison_field: str = 'content')
```

Create a DocumentMRREvaluator component.

**Parameters:**

- **document_comparison_field** (<code>str</code>) – The Document field to use for comparison. Possible options:
  - `"content"`: uses `doc.content`
  - `"id"`: uses `doc.id`
  - A `meta.` prefix followed by a key name: uses `doc.meta["<key>"]`
    (e.g. `"meta.file_id"`, `"meta.page_number"`).
    Nested keys are supported (e.g. `"meta.source.url"`).
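When document contents differ across runs (for example, different chunkings of the same file), comparing by a metadata key can be more robust than comparing by content. A sketch using `document_comparison_field` with a hypothetical `file_id` meta key:

```python
from haystack import Document
from haystack.components.evaluators import DocumentMRREvaluator

# Compare documents by doc.meta["file_id"] instead of doc.content.
evaluator = DocumentMRREvaluator(document_comparison_field="meta.file_id")
result = evaluator.run(
    ground_truth_documents=[[Document(content="France", meta={"file_id": "doc-1"})]],
    retrieved_documents=[[Document(content="France (chunk 2)", meta={"file_id": "doc-1"})]],
)
print(result["score"])
# 1.0 - the contents differ, but the meta key matches at rank 1
```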
#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### run

```python
run(
    ground_truth_documents: list[list[Document]],
    retrieved_documents: list[list[Document]],
) -> dict[str, Any]
```

Run the DocumentMRREvaluator on the given inputs.

`ground_truth_documents` and `retrieved_documents` must have the same length.

**Parameters:**

- **ground_truth_documents** (<code>list\[list\[Document\]\]</code>) – A list of expected documents for each question.
- **retrieved_documents** (<code>list\[list\[Document\]\]</code>) – A list of retrieved documents for each question.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
  - `score` - The average of calculated scores.
  - `individual_scores` - A list of numbers from 0.0 to 1.0 that represents how high the first relevant
    retrieved document is ranked.

## document_ndcg

### DocumentNDCGEvaluator

Evaluator that calculates the normalized discounted cumulative gain (NDCG) of retrieved documents.

Each question can have multiple ground truth documents and multiple retrieved documents.
If the ground truth documents have relevance scores, the NDCG calculation uses these scores.
Otherwise, it assumes binary relevance of all ground truth documents.

Usage example:

```python
from haystack import Document
from haystack.components.evaluators import DocumentNDCGEvaluator

evaluator = DocumentNDCGEvaluator()
result = evaluator.run(
    ground_truth_documents=[[Document(content="France", score=1.0), Document(content="Paris", score=0.5)]],
    retrieved_documents=[[Document(content="France"), Document(content="Germany"), Document(content="Paris")]],
)
print(result["individual_scores"])
# [0.8869]
print(result["score"])
# 0.8869
```

#### run

```python
run(
    ground_truth_documents: list[list[Document]],
    retrieved_documents: list[list[Document]],
) -> dict[str, Any]
```

Run the DocumentNDCGEvaluator on the given inputs.

`ground_truth_documents` and `retrieved_documents` must have the same length.
The list items within `ground_truth_documents` and `retrieved_documents` can differ in length.

**Parameters:**

- **ground_truth_documents** (<code>list\[list\[Document\]\]</code>) – Lists of expected documents, one list per question. Binary relevance is used if documents have no scores.
- **retrieved_documents** (<code>list\[list\[Document\]\]</code>) – Lists of retrieved documents, one list per question.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
  - `score` - The average of calculated scores.
  - `individual_scores` - A list of numbers from 0.0 to 1.0 that represents the NDCG for each question.
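For reference, a standard formulation of the metric (the gain here is the ground truth score, or 1 under binary relevance; the implementation's exact gain and discount functions may differ in detail):

```latex
\mathrm{DCG} = \sum_{i=1}^{n} \frac{\mathrm{rel}_i}{\log_2(i + 1)},
\qquad
\mathrm{NDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}
```

where `rel_i` is the relevance of the document at rank `i` and IDCG is the DCG of the ideal ordering of the ground truth documents, so NDCG reaches 1.0 when the most relevant documents are ranked first.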
#### validate_inputs

```python
validate_inputs(gt_docs: list[list[Document]], ret_docs: list[list[Document]])
```

Validate the input parameters.

**Parameters:**

- **gt_docs** (<code>list\[list\[Document\]\]</code>) – The ground_truth_documents to validate.
- **ret_docs** (<code>list\[list\[Document\]\]</code>) – The retrieved_documents to validate.

**Raises:**

- <code>ValueError</code> – If the ground_truth_documents or the retrieved_documents are an empty list.
  If the length of ground_truth_documents and retrieved_documents differs.
  If any list of documents in ground_truth_documents contains a mix of documents with and without a score.

#### calculate_dcg

```python
calculate_dcg(gt_docs: list[Document], ret_docs: list[Document]) -> float
```

Calculate the discounted cumulative gain (DCG) of the retrieved documents.

**Parameters:**

- **gt_docs** (<code>list\[Document\]</code>) – The ground truth documents.
- **ret_docs** (<code>list\[Document\]</code>) – The retrieved documents.

**Returns:**

- <code>float</code> – The discounted cumulative gain (DCG) of the retrieved
  documents based on the ground truth documents.

#### calculate_idcg

```python
calculate_idcg(gt_docs: list[Document]) -> float
```

Calculate the ideal discounted cumulative gain (IDCG) of the ground truth documents.

**Parameters:**

- **gt_docs** (<code>list\[Document\]</code>) – The ground truth documents.

**Returns:**

- <code>float</code> – The ideal discounted cumulative gain (IDCG) of the ground truth documents.

## document_recall

### RecallMode

Bases: <code>Enum</code>

Enum for the mode to use for calculating the recall score.

#### from_str

```python
from_str(string: str) -> RecallMode
```

Convert a string to a RecallMode enum.

### DocumentRecallEvaluator

Evaluator that calculates the Recall score for a list of documents.

Returns both a list of scores for each question and the average.
There can be multiple ground truth documents and multiple predicted documents as input.

Usage example:

```python
from haystack import Document
from haystack.components.evaluators import DocumentRecallEvaluator

evaluator = DocumentRecallEvaluator()
result = evaluator.run(
    ground_truth_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="9th")],
    ],
    retrieved_documents=[
        [Document(content="France")],
        [Document(content="9th century"), Document(content="10th century"), Document(content="9th")],
    ],
)
print(result["individual_scores"])
# [1.0, 1.0]
print(result["score"])
# 1.0
```

#### __init__

```python
__init__(
    mode: str | RecallMode = RecallMode.SINGLE_HIT,
    document_comparison_field: str = "content",
)
```

Create a DocumentRecallEvaluator component.

**Parameters:**

- **mode** (<code>str | RecallMode</code>) – Mode to use for calculating the recall score.
- **document_comparison_field** (<code>str</code>) – The Document field to use for comparison. Possible options:
  - `"content"`: uses `doc.content`
  - `"id"`: uses `doc.id`
  - A `meta.` prefix followed by a key name: uses `doc.meta["<key>"]`
    (e.g. `"meta.file_id"`, `"meta.page_number"`).
    Nested keys are supported (e.g. `"meta.source.url"`).
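A sketch contrasting the two recall modes (assuming `RecallMode.MULTI_HIT` as the counterpart of the `single_hit` mode mentioned in the `run` documentation below; `single_hit` scores 1 as soon as any ground truth document is retrieved, while a multi-hit mode scores the proportion retrieved):

```python
from haystack import Document
from haystack.components.evaluators import DocumentRecallEvaluator
from haystack.components.evaluators.document_recall import RecallMode

ground_truth = [[Document(content="9th century"), Document(content="9th")]]
retrieved = [[Document(content="9th century"), Document(content="10th century")]]

# single_hit: 1.0 because at least one ground truth document was retrieved
single_hit = DocumentRecallEvaluator(mode=RecallMode.SINGLE_HIT)
print(single_hit.run(ground_truth_documents=ground_truth, retrieved_documents=retrieved)["score"])
# 1.0

# multi_hit: 0.5 because one of the two ground truth documents was retrieved
multi_hit = DocumentRecallEvaluator(mode=RecallMode.MULTI_HIT)
print(multi_hit.run(ground_truth_documents=ground_truth, retrieved_documents=retrieved)["score"])
# 0.5
```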
#### run

```python
run(
    ground_truth_documents: list[list[Document]],
    retrieved_documents: list[list[Document]],
) -> dict[str, Any]
```

Run the DocumentRecallEvaluator on the given inputs.

`ground_truth_documents` and `retrieved_documents` must have the same length.

**Parameters:**

- **ground_truth_documents** (<code>list\[list\[Document\]\]</code>) – A list of expected documents for each question.
- **retrieved_documents** (<code>list\[list\[Document\]\]</code>) – A list of retrieved documents for each question.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
  - `score` - The average of calculated scores.
  - `individual_scores` - A list of numbers from 0.0 to 1.0 that represents the proportion of matching
    documents retrieved. If the mode is `single_hit`, the individual scores are 0 or 1.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

## faithfulness

### FaithfulnessEvaluator

Bases: <code>LLMEvaluator</code>

Evaluator that checks if a generated answer can be inferred from the provided contexts.

An LLM separates the answer into multiple statements and checks whether each statement can be inferred from the
context or not. The final score for the full answer is a number from 0.0 to 1.0. It represents the proportion of
statements that can be inferred from the provided contexts.

Usage example:

```python
from haystack.components.evaluators import FaithfulnessEvaluator

questions = ["Who created the Python language?"]
contexts = [
    [(
        "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming "
        "language. Its design philosophy emphasizes code readability, and its language constructs aim to help "
        "programmers write clear, logical code for both small and large-scale software projects."
    )],
]
predicted_answers = [
    "Python is a high-level general-purpose programming language that was created by George Lucas."
]
evaluator = FaithfulnessEvaluator()
result = evaluator.run(questions=questions, contexts=contexts, predicted_answers=predicted_answers)

print(result["individual_scores"])
# [0.5]
print(result["score"])
# 0.5
print(result["results"])
# [{'statements': ['Python is a high-level general-purpose programming language.',
#   'Python was created by George Lucas.'], 'statement_scores': [1, 0], 'score': 0.5}]
```

#### __init__

```python
__init__(
    examples: list[dict[str, Any]] | None = None,
    progress_bar: bool = True,
    raise_on_failure: bool = True,
    chat_generator: ChatGenerator | None = None,
)
```

Creates an instance of FaithfulnessEvaluator.

If no LLM is specified using the `chat_generator` parameter, the component will use OpenAI in JSON mode.

**Parameters:**

- **examples** (<code>list\[dict\[str, Any\]\] | None</code>) – Optional few-shot examples conforming to the expected input and output format of FaithfulnessEvaluator.
  Default examples will be used if none are provided.
  Each example must be a dictionary with keys "inputs" and "outputs".
  "inputs" must be a dictionary with keys "questions", "contexts", and "predicted_answers".
  "outputs" must be a dictionary with "statements" and "statement_scores".
  Expected format:

  ```python
  [{
      "inputs": {
          "questions": "What is the capital of Italy?", "contexts": ["Rome is the capital of Italy."],
          "predicted_answers": "Rome is the capital of Italy with more than 4 million inhabitants.",
      },
      "outputs": {
          "statements": ["Rome is the capital of Italy.", "Rome has more than 4 million inhabitants."],
          "statement_scores": [1, 0],
      },
  }]
  ```

- **progress_bar** (<code>bool</code>) – Whether to show a progress bar during the evaluation.
- **raise_on_failure** (<code>bool</code>) – Whether to raise an exception if the API call fails.
- **chat_generator** (<code>ChatGenerator | None</code>) – A ChatGenerator instance which represents the LLM.
  For the component to work, the LLM should be configured to return a JSON object. For example,
  when using the OpenAIChatGenerator, you should pass `{"response_format": {"type": "json_object"}}` in the
  `generation_kwargs`.

#### run

```python
run(**inputs) -> dict[str, Any]
```

Run the LLM evaluator.

**Parameters:**

- **questions** – A list of questions.
- **contexts** – A nested list of contexts that correspond to the questions.
- **predicted_answers** – A list of predicted answers.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
  - `score`: Mean faithfulness score over all the provided input answers.
  - `individual_scores`: A list of faithfulness scores for each input answer.
  - `results`: A list of dictionaries with `statements` and `statement_scores` for each input answer.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serialize this component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> FaithfulnessEvaluator
```

Deserialize this component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary representation of this component.

**Returns:**

- <code>FaithfulnessEvaluator</code> – The deserialized component instance.
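A sketch of the serialization round trip using the `to_dict` and `from_dict` methods above, e.g. for persisting an evaluator's configuration:

```python
from haystack.components.evaluators import FaithfulnessEvaluator

evaluator = FaithfulnessEvaluator()
data = evaluator.to_dict()  # plain dict, e.g. for saving as YAML/JSON
restored = FaithfulnessEvaluator.from_dict(data)  # rebuild an equivalent instance
```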
## llm_evaluator

### LLMEvaluator

Uses an LLM to evaluate inputs based on a prompt containing instructions and examples.

The default API requires an OpenAI API key to be provided as the environment variable "OPENAI_API_KEY".
The inputs are lists that are user-defined depending on the desired metric.
The output is a dictionary with a key `results` containing a list of evaluation results.
Each result is a dictionary with user-defined keys, where a value of 1 means TRUE and 0 means FALSE.

Usage example:

```python
from haystack.components.evaluators import LLMEvaluator

evaluator = LLMEvaluator(
    instructions="Is this answer problematic for children?",
    inputs=[("predicted_answers", list[str])],
    outputs=["score"],
    examples=[
        {"inputs": {"predicted_answers": "Damn, this is straight outta hell!!!"}, "outputs": {"score": 1}},
        {"inputs": {"predicted_answers": "Football is the most popular sport."}, "outputs": {"score": 0}},
    ],
)
predicted_answers = [
    "Football is the most popular sport with around 4 billion followers worldwide",
    "Python language was created by Guido van Rossum.",
]
results = evaluator.run(predicted_answers=predicted_answers)
print(results)
# {'results': [{'score': 0}, {'score': 0}]}
```

#### __init__

```python
__init__(
    instructions: str,
    inputs: list[tuple[str, type[list]]],
    outputs: list[str],
    examples: list[dict[str, Any]],
    progress_bar: bool = True,
    *,
    raise_on_failure: bool = True,
    chat_generator: ChatGenerator | None = None
)
```

Creates an instance of LLMEvaluator.

If no LLM is specified using the `chat_generator` parameter, the component will use OpenAI in JSON mode.

**Parameters:**

- **instructions** (<code>str</code>) – The prompt instructions to use for evaluation.
  Should be a question about the inputs that can be answered with yes or no.
- **inputs** (<code>list\[tuple\[str, type\[list\]\]\]</code>) – The inputs that the component expects as incoming connections and that it evaluates.
  Each input is a tuple of an input name and input type. Input types must be lists.
- **outputs** (<code>list\[str\]</code>) – Output names of the evaluation results. They correspond to keys in the output dictionary.
- **examples** (<code>list\[dict\[str, Any\]\]</code>) – Few-shot examples conforming to the expected input and output format as defined in the `inputs` and
  `outputs` parameters.
  Each example is a dictionary with keys "inputs" and "outputs".
  They contain the input and output as dictionaries respectively.
- **progress_bar** (<code>bool</code>) – Whether to show a progress bar during the evaluation.
- **raise_on_failure** (<code>bool</code>) – If True, the component will raise an exception on an unsuccessful API call.
- **chat_generator** (<code>ChatGenerator | None</code>) – A ChatGenerator instance which represents the LLM.
  For the component to work, the LLM should be configured to return a JSON object. For example,
  when using the OpenAIChatGenerator, you should pass `{"response_format": {"type": "json_object"}}` in the
  `generation_kwargs`.
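Input names are user-defined, as noted above, so a metric can take several inputs at once. A sketch of a custom two-input metric (the instructions and examples are illustrative):

```python
from haystack.components.evaluators import LLMEvaluator

# Illustrative two-input metric: does the predicted answer address the question?
evaluator = LLMEvaluator(
    instructions="Does the predicted answer address the question?",
    inputs=[("questions", list[str]), ("predicted_answers", list[str])],
    outputs=["score"],
    examples=[
        {
            "inputs": {"questions": "What is the capital of Italy?", "predicted_answers": "Rome."},
            "outputs": {"score": 1},
        },
        {
            "inputs": {"questions": "What is the capital of Italy?", "predicted_answers": "I like pizza."},
            "outputs": {"score": 0},
        },
    ],
)
results = evaluator.run(
    questions=["Who created Python?"],
    predicted_answers=["Guido van Rossum created Python."],
)
```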
#### warm_up

```python
warm_up()
```

Warm up the component by warming up the underlying chat generator.

#### validate_init_parameters

```python
validate_init_parameters(
    inputs: list[tuple[str, type[list]]],
    outputs: list[str],
    examples: list[dict[str, Any]],
)
```

Validate the init parameters.

**Parameters:**

- **inputs** (<code>list\[tuple\[str, type\[list\]\]\]</code>) – The inputs to validate.
- **outputs** (<code>list\[str\]</code>) – The outputs to validate.
- **examples** (<code>list\[dict\[str, Any\]\]</code>) – The examples to validate.

**Raises:**

- <code>ValueError</code> – If the inputs are not a list of tuples with a string and a type of list.
  If the outputs are not a list of strings.
  If the examples are not a list of dictionaries.
  If any example does not have keys "inputs" and "outputs" with values that are dictionaries with string keys.

#### run

```python
run(**inputs) -> dict[str, Any]
```

Run the LLM evaluator.

**Parameters:**

- **inputs** – The input values to evaluate. The keys are the input names and the values are lists of input values.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with a `results` entry that contains a list of results.
  Each result is a dictionary containing the keys as defined in the `outputs` parameter of the LLMEvaluator
  and the evaluation results as the values. If an exception occurs for a particular input value, the result
  will be `None` for that entry.
  If the API is "openai" and the response contains a "meta" key, the metadata from OpenAI will be included
  in the output dictionary, under the key "meta".

**Raises:**

- <code>ValueError</code> – Only if `raise_on_failure` is set to True and the received inputs are not lists or have
  different lengths, or if the output is not valid JSON or doesn't contain the expected keys.
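Since failed calls yield `None` entries when `raise_on_failure` is False, it can be useful to filter results before aggregating. A defensive sketch (the metric definition and filtering step are illustrative):

```python
from haystack.components.evaluators import LLMEvaluator

evaluator = LLMEvaluator(
    instructions="Is this answer problematic for children?",
    inputs=[("predicted_answers", list[str])],
    outputs=["score"],
    examples=[
        {"inputs": {"predicted_answers": "Damn!"}, "outputs": {"score": 1}},
        {"inputs": {"predicted_answers": "Football is popular."}, "outputs": {"score": 0}},
    ],
    raise_on_failure=False,  # failed API calls produce None instead of raising
)
results = evaluator.run(predicted_answers=["Some answer", "Another answer"])
# Skip entries whose evaluation failed before aggregating.
valid = [r for r in results["results"] if r is not None]
```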
#### prepare_template

```python
prepare_template() -> str
```

Prepare the prompt template.

Combines instructions, inputs, outputs, and examples into one prompt template with the following format:

Instructions:
`<instructions>`

Generate the response in JSON format with the following keys:
`<list of output keys>`
Consider the instructions and the examples below to determine those values.

Examples:
`<examples>`

Inputs:
`<inputs>`
Outputs:

**Returns:**

- <code>str</code> – The prompt template.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serialize this component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – The serialized component as a dictionary.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> LLMEvaluator
```

Deserialize this component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary representation of this component.

**Returns:**

- <code>LLMEvaluator</code> – The deserialized component instance.

#### validate_input_parameters

```python
validate_input_parameters(
    expected: dict[str, Any], received: dict[str, Any]
) -> None
```

Validate the input parameters.

**Parameters:**

- **expected** (<code>dict\[str, Any\]</code>) – The expected input parameters.
- **received** (<code>dict\[str, Any\]</code>) – The received input parameters.

**Raises:**

- <code>ValueError</code> – If not all expected inputs are present in the received inputs.
  If the received inputs are not lists or have different lengths.

## sas_evaluator

### SASEvaluator

SASEvaluator computes the Semantic Answer Similarity (SAS) between a list of predictions and a list of ground truths.

It's usually used in Retrieval Augmented Generation (RAG) pipelines to evaluate the quality of the generated
answers. The SAS is computed using a pre-trained model from the Hugging Face model hub. The model can be either a
Bi-Encoder or a Cross-Encoder; the choice is made through the `model` parameter.

Usage example:

```python
from haystack.components.evaluators.sas_evaluator import SASEvaluator

evaluator = SASEvaluator(model="cross-encoder/ms-marco-MiniLM-L-6-v2")
ground_truths = [
    "A construction budget of US $2.3 billion",
    "The Eiffel Tower, completed in 1889, symbolizes Paris's cultural magnificence.",
    "The Meiji Restoration in 1868 transformed Japan into a modernized world power.",
]
predictions = [
    "A construction budget of US $2.3 billion",
    "The Eiffel Tower, completed in 1889, symbolizes Paris's cultural magnificence.",
    "The Meiji Restoration in 1868 transformed Japan into a modernized world power.",
]
result = evaluator.run(
    ground_truth_answers=ground_truths, predicted_answers=predictions
)

print(result["score"])
# 0.9999673763910929

print(result["individual_scores"])
# [0.9999765157699585, 0.999968409538269, 0.9999572038650513]
```

#### __init__

```python
__init__(
    model: str = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    batch_size: int = 32,
    device: ComponentDevice | None = None,
    token: Secret = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
) -> None
```

Creates a new instance of SASEvaluator.

**Parameters:**

- **model** (<code>str</code>) – SentenceTransformers semantic textual similarity model; should be a path or a string pointing to a
  downloadable model.
- **batch_size** (<code>int</code>) – Number of prediction-label pairs to encode at once.
- **device** (<code>ComponentDevice | None</code>) – The device on which the model is loaded. If `None`, the default device is automatically selected.
- **token** (<code>Secret</code>) – The Hugging Face token for HTTP bearer authorization.
  You can find your HF token in your [account settings](https://huggingface.co/settings/tokens).
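The default model is a Bi-Encoder; the usage example above swaps in a Cross-Encoder. A sketch of both choices (model names are the default and the one from the example above):

```python
from haystack.components.evaluators.sas_evaluator import SASEvaluator

# Bi-Encoder (the default): embeds each answer separately and compares the embeddings.
bi_encoder_eval = SASEvaluator()  # sentence-transformers/paraphrase-multilingual-mpnet-base-v2

# Cross-Encoder: scores each prediction/ground-truth pair jointly.
cross_encoder_eval = SASEvaluator(model="cross-encoder/ms-marco-MiniLM-L-6-v2")
```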
#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serialize this component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – The serialized component as a dictionary.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> SASEvaluator
```

Deserialize this component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – The dictionary representation of this component.

**Returns:**

- <code>SASEvaluator</code> – The deserialized component instance.

#### warm_up

```python
warm_up() -> None
```

Initializes the component.

#### run

```python
run(
    ground_truth_answers: list[str], predicted_answers: list[str]
) -> dict[str, float | list[float]]
```

SASEvaluator component run method.

Run the SASEvaluator to compute the Semantic Answer Similarity (SAS) between a list of predicted answers
and a list of ground truth answers. Both must be lists of strings of the same length.

**Parameters:**

- **ground_truth_answers** (<code>list\[str\]</code>) – A list of expected answers for each question.
- **predicted_answers** (<code>list\[str\]</code>) – A list of generated answers for each question.

**Returns:**

- <code>dict\[str, float | list\[float\]\]</code> – A dictionary with the following outputs:
  - `score`: Mean SAS score over all the prediction/ground-truth pairs.
  - `individual_scores`: A list of similarity scores for each prediction/ground-truth pair.
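Putting `warm_up` and `run` together; calling `warm_up` first loads the model so `run` does not pay the loading cost (the answer strings are illustrative):

```python
from haystack.components.evaluators.sas_evaluator import SASEvaluator

evaluator = SASEvaluator()  # default Bi-Encoder model
evaluator.warm_up()         # load the model before the first run

result = evaluator.run(
    ground_truth_answers=["Berlin is the capital of Germany."],
    predicted_answers=["Germany's capital is Berlin."],
)
print(result["score"])              # mean SAS score
print(result["individual_scores"])  # one similarity score per pair
```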