evaluators_api.md
   1  ---
   2  title: "Evaluators"
   3  id: evaluators-api
   4  description: "Evaluate your pipelines or individual components."
   5  slug: "/evaluators-api"
   6  ---
   7  
   8  
   9  ## answer_exact_match
  10  
  11  ### AnswerExactMatchEvaluator
  12  
  13  An answer exact match evaluator class.
  14  
The evaluator checks whether each predicted answer exactly matches any of the ground truth answers.
The result is a number from 0.0 to 1.0 that represents the proportion of predicted answers
that matched one of the ground truth answers.
  18  There can be multiple ground truth answers and multiple predicted answers as input.
  19  
  20  Usage example:
  21  
  22  ```python
  23  from haystack.components.evaluators import AnswerExactMatchEvaluator
  24  
  25  evaluator = AnswerExactMatchEvaluator()
  26  result = evaluator.run(
  27      ground_truth_answers=["Berlin", "Paris"],
  28      predicted_answers=["Berlin", "Lyon"],
  29  )
  30  
  31  print(result["individual_scores"])
  32  # [1, 0]
  33  print(result["score"])
  34  # 0.5
  35  ```
  36  
  37  #### run
  38  
  39  ```python
  40  run(
  41      ground_truth_answers: list[str], predicted_answers: list[str]
  42  ) -> dict[str, Any]
  43  ```
  44  
  45  Run the AnswerExactMatchEvaluator on the given inputs.
  46  
The `ground_truth_answers` and `predicted_answers` must have the same length.
  48  
  49  **Parameters:**
  50  
  51  - **ground_truth_answers** (<code>list\[str\]</code>) – A list of expected answers.
  52  - **predicted_answers** (<code>list\[str\]</code>) – A list of predicted answers.
  53  
  54  **Returns:**
  55  
  56  - <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
- `individual_scores` - A list of 0s and 1s, where 1 means that the predicted answer matched one of the
  ground truth answers.
  59  - `score` - A number from 0.0 to 1.0 that represents the proportion of questions where any predicted
  60    answer matched one of the ground truth answers.
  61  
  62  ## context_relevance
  63  
  64  ### ContextRelevanceEvaluator
  65  
  66  Bases: <code>LLMEvaluator</code>
  67  
  68  Evaluator that checks if a provided context is relevant to the question.
  69  
  70  An LLM breaks up a context into multiple statements and checks whether each statement
  71  is relevant for answering a question.
The score for each context is a binary score of 1 or 0, where 1 indicates that the context is relevant
to the question and 0 indicates that it is not.
The evaluator also provides the relevant statements from the context and an average score over all the provided
question-context pairs.
  76  
  77  Usage example:
  78  
  79  ```python
  80  from haystack.components.evaluators import ContextRelevanceEvaluator
  81  
questions = ["Who created the Python language?", "Why does Java need a JVM?", "Is C++ better than Python?"]
  83  contexts = [
  84      [(
  85          "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming "
  86          "language. Its design philosophy emphasizes code readability, and its language constructs aim to help "
  87          "programmers write clear, logical code for both small and large-scale software projects."
  88      )],
  89      [(
  90          "Java is a high-level, class-based, object-oriented programming language that is designed to have as few "
        "implementation dependencies as possible. The JVM has two primary functions: to allow Java programs to run "
        "on any device or operating system (known as the 'write once, run anywhere' principle), and to manage and "
        "optimize program memory."
  94      )],
  95      [(
  96          "C++ is a general-purpose programming language created by Bjarne Stroustrup as an extension of the C "
  97          "programming language."
  98      )],
  99  ]
 100  
 101  evaluator = ContextRelevanceEvaluator()
 102  result = evaluator.run(questions=questions, contexts=contexts)
 103  print(result["score"])
 104  # 0.67
 105  print(result["individual_scores"])
 106  # [1,1,0]
 107  print(result["results"])
 108  # [{
 109  #   'relevant_statements': ['Python, created by Guido van Rossum in the late 1980s.'],
 110  #    'score': 1.0
 111  #  },
 112  #  {
 113  #   'relevant_statements': ['The JVM has two primary functions: to allow Java programs to run on any device or
 114  #                           operating system (known as the "write once, run anywhere" principle), and to manage and
 115  #                           optimize program memory'],
 116  #   'score': 1.0
 117  #  },
 118  #  {
 119  #   'relevant_statements': [],
 120  #   'score': 0.0
 121  #  }]
 122  ```
 123  
 124  #### __init__
 125  
 126  ```python
 127  __init__(
 128      examples: list[dict[str, Any]] | None = None,
 129      progress_bar: bool = True,
 130      raise_on_failure: bool = True,
 131      chat_generator: ChatGenerator | None = None,
 132  ) -> None
 133  ```
 134  
 135  Creates an instance of ContextRelevanceEvaluator.
 136  
 137  If no LLM is specified using the `chat_generator` parameter, the component will use OpenAI in JSON mode.
 138  
 139  **Parameters:**
 140  
 141  - **examples** (<code>list\[dict\[str, Any\]\] | None</code>) – Optional few-shot examples conforming to the expected input and output format of ContextRelevanceEvaluator.
 142    Default examples will be used if none are provided.
 143    Each example must be a dictionary with keys "inputs" and "outputs".
 144    "inputs" must be a dictionary with keys "questions" and "contexts".
 145    "outputs" must be a dictionary with "relevant_statements".
 146    Expected format:
 147  
 148  ```python
 149  [{
 150      "inputs": {
 151          "questions": "What is the capital of Italy?", "contexts": ["Rome is the capital of Italy."],
 152      },
 153      "outputs": {
 154          "relevant_statements": ["Rome is the capital of Italy."],
 155      },
 156  }]
 157  ```
 158  
 159  - **progress_bar** (<code>bool</code>) – Whether to show a progress bar during the evaluation.
 160  - **raise_on_failure** (<code>bool</code>) – Whether to raise an exception if the API call fails.
- **chat_generator** (<code>ChatGenerator | None</code>) – A ChatGenerator instance that represents the LLM.
 162    In order for the component to work, the LLM should be configured to return a JSON object. For example,
 163    when using the OpenAIChatGenerator, you should pass `{"response_format": {"type": "json_object"}}` in the
 164    `generation_kwargs`.
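
Below is a minimal, illustrative sketch of passing a custom LLM to this evaluator. The model name is an assumption for the example; the only requirement stated above is that the generator returns a JSON object.

```python
from haystack.components.evaluators import ContextRelevanceEvaluator
from haystack.components.generators.chat import OpenAIChatGenerator

# JSON mode is enabled explicitly because the evaluator expects a JSON object back.
chat_generator = OpenAIChatGenerator(
    model="gpt-4o-mini",  # illustrative model name
    generation_kwargs={"response_format": {"type": "json_object"}},
)
evaluator = ContextRelevanceEvaluator(chat_generator=chat_generator)
```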
 165  
 166  #### run
 167  
 168  ```python
 169  run(**inputs: Any) -> dict[str, Any]
 170  ```
 171  
 172  Run the LLM evaluator.
 173  
 174  **Parameters:**
 175  
 176  - **questions** – A list of questions.
 177  - **contexts** – A list of lists of contexts. Each list of contexts corresponds to one question.
 178  
 179  **Returns:**
 180  
- <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
  - `score`: Mean context relevance score over all the provided input questions.
  - `individual_scores`: A list of context relevance scores (0 or 1) for each input question.
  - `results`: A list of dictionaries with `relevant_statements` and `score` for each input context.
 184  
 185  #### to_dict
 186  
 187  ```python
 188  to_dict() -> dict[str, Any]
 189  ```
 190  
 191  Serialize this component to a dictionary.
 192  
 193  **Returns:**
 194  
 195  - <code>dict\[str, Any\]</code> – A dictionary with serialized data.
 196  
 197  #### from_dict
 198  
 199  ```python
 200  from_dict(data: dict[str, Any]) -> ContextRelevanceEvaluator
 201  ```
 202  
 203  Deserialize this component from a dictionary.
 204  
 205  **Parameters:**
 206  
 207  - **data** (<code>dict\[str, Any\]</code>) – The dictionary representation of this component.
 208  
 209  **Returns:**
 210  
 211  - <code>ContextRelevanceEvaluator</code> – The deserialized component instance.
 212  
 213  ## document_map
 214  
 215  ### DocumentMAPEvaluator
 216  
 217  A Mean Average Precision (MAP) evaluator for documents.
 218  
 219  Evaluator that calculates the mean average precision of the retrieved documents, a metric
 220  that measures how high retrieved documents are ranked.
 221  Each question can have multiple ground truth documents and multiple retrieved documents.
 222  
`DocumentMAPEvaluator` doesn't normalize its inputs; use the `DocumentCleaner` component
to clean and normalize the documents before passing them to this evaluator.
 225  
 226  Usage example:
 227  
 228  ```python
 229  from haystack import Document
 230  from haystack.components.evaluators import DocumentMAPEvaluator
 231  
 232  evaluator = DocumentMAPEvaluator()
 233  result = evaluator.run(
 234      ground_truth_documents=[
 235          [Document(content="France")],
 236          [Document(content="9th century"), Document(content="9th")],
 237      ],
 238      retrieved_documents=[
 239          [Document(content="France")],
 240          [Document(content="9th century"), Document(content="10th century"), Document(content="9th")],
 241      ],
 242  )
 243  
 244  print(result["individual_scores"])
 245  # [1.0, 0.8333333333333333]
 246  print(result["score"])
 247  # 0.9166666666666666
 248  ```
 249  
 250  #### __init__
 251  
 252  ```python
 253  __init__(document_comparison_field: str = 'content') -> None
 254  ```
 255  
 256  Create a DocumentMAPEvaluator component.
 257  
 258  **Parameters:**
 259  
 260  - **document_comparison_field** (<code>str</code>) – The Document field to use for comparison. Possible options:
 261  - `"content"`: uses `doc.content`
 262  - `"id"`: uses `doc.id`
 263  - A `meta.` prefix followed by a key name: uses `doc.meta["<key>"]`
 264    (e.g. `"meta.file_id"`, `"meta.page_number"`)
 265    Nested keys are supported (e.g. `"meta.source.url"`).
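
As a hedged sketch, the snippet below compares documents on a metadata key rather than on content; the `file_id` meta key is hypothetical and only used for illustration.

```python
from haystack import Document
from haystack.components.evaluators import DocumentMAPEvaluator

# Compare documents by a (hypothetical) "file_id" metadata key instead of their content.
evaluator = DocumentMAPEvaluator(document_comparison_field="meta.file_id")
result = evaluator.run(
    ground_truth_documents=[[Document(content="Paris is the capital of France.", meta={"file_id": "doc-1"})]],
    retrieved_documents=[[Document(content="France's capital is Paris.", meta={"file_id": "doc-1"})]],
)
print(result["score"])  # 1.0 - the documents match on meta.file_id even though their content differs
```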
 266  
 267  #### to_dict
 268  
 269  ```python
 270  to_dict() -> dict[str, Any]
 271  ```
 272  
 273  Serializes the component to a dictionary.
 274  
 275  **Returns:**
 276  
 277  - <code>dict\[str, Any\]</code> – Dictionary with serialized data.
 278  
 279  #### run
 280  
 281  ```python
 282  run(
 283      ground_truth_documents: list[list[Document]],
 284      retrieved_documents: list[list[Document]],
 285  ) -> dict[str, Any]
 286  ```
 287  
 288  Run the DocumentMAPEvaluator on the given inputs.
 289  
 290  All lists must have the same length.
 291  
 292  **Parameters:**
 293  
 294  - **ground_truth_documents** (<code>list\[list\[Document\]\]</code>) – A list of expected documents for each question.
 295  - **retrieved_documents** (<code>list\[list\[Document\]\]</code>) – A list of retrieved documents for each question.
 296  
 297  **Returns:**
 298  
 299  - <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
 300  - `score` - The average of calculated scores.
 301  - `individual_scores` - A list of numbers from 0.0 to 1.0 that represents how high retrieved documents
 302    are ranked.
 303  
 304  ## document_mrr
 305  
 306  ### DocumentMRREvaluator
 307  
 308  Evaluator that calculates the mean reciprocal rank of the retrieved documents.
 309  
 310  MRR measures how high the first retrieved document is ranked.
 311  Each question can have multiple ground truth documents and multiple retrieved documents.
 312  
`DocumentMRREvaluator` doesn't normalize its inputs; use the `DocumentCleaner` component
to clean and normalize the documents before passing them to this evaluator.
 315  
 316  Usage example:
 317  
 318  ```python
 319  from haystack import Document
 320  from haystack.components.evaluators import DocumentMRREvaluator
 321  
 322  evaluator = DocumentMRREvaluator()
 323  result = evaluator.run(
 324      ground_truth_documents=[
 325          [Document(content="France")],
 326          [Document(content="9th century"), Document(content="9th")],
 327      ],
 328      retrieved_documents=[
 329          [Document(content="France")],
 330          [Document(content="9th century"), Document(content="10th century"), Document(content="9th")],
 331      ],
 332  )
 333  print(result["individual_scores"])
 334  # [1.0, 1.0]
 335  print(result["score"])
 336  # 1.0
 337  ```
 338  
 339  #### __init__
 340  
 341  ```python
 342  __init__(document_comparison_field: str = 'content') -> None
 343  ```
 344  
 345  Create a DocumentMRREvaluator component.
 346  
 347  **Parameters:**
 348  
 349  - **document_comparison_field** (<code>str</code>) – The Document field to use for comparison. Possible options:
 350  - `"content"`: uses `doc.content`
 351  - `"id"`: uses `doc.id`
 352  - A `meta.` prefix followed by a key name: uses `doc.meta["<key>"]`
 353    (e.g. `"meta.file_id"`, `"meta.page_number"`)
 354    Nested keys are supported (e.g. `"meta.source.url"`).
 355  
 356  #### to_dict
 357  
 358  ```python
 359  to_dict() -> dict[str, Any]
 360  ```
 361  
 362  Serializes the component to a dictionary.
 363  
 364  **Returns:**
 365  
 366  - <code>dict\[str, Any\]</code> – Dictionary with serialized data.
 367  
 368  #### run
 369  
 370  ```python
 371  run(
 372      ground_truth_documents: list[list[Document]],
 373      retrieved_documents: list[list[Document]],
 374  ) -> dict[str, Any]
 375  ```
 376  
 377  Run the DocumentMRREvaluator on the given inputs.
 378  
 379  `ground_truth_documents` and `retrieved_documents` must have the same length.
 380  
 381  **Parameters:**
 382  
 383  - **ground_truth_documents** (<code>list\[list\[Document\]\]</code>) – A list of expected documents for each question.
 384  - **retrieved_documents** (<code>list\[list\[Document\]\]</code>) – A list of retrieved documents for each question.
 385  
 386  **Returns:**
 387  
 388  - <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
 389  - `score` - The average of calculated scores.
 390  - `individual_scores` - A list of numbers from 0.0 to 1.0 that represents how high the first retrieved
 391    document is ranked.
 392  
 393  ## document_ndcg
 394  
 395  ### DocumentNDCGEvaluator
 396  
 397  Evaluator that calculates the normalized discounted cumulative gain (NDCG) of retrieved documents.
 398  
 399  Each question can have multiple ground truth documents and multiple retrieved documents.
 400  If the ground truth documents have relevance scores, the NDCG calculation uses these scores.
 401  Otherwise, it assumes binary relevance of all ground truth documents.
 402  
 403  Usage example:
 404  
 405  ```python
 406  from haystack import Document
 407  from haystack.components.evaluators import DocumentNDCGEvaluator
 408  
 409  evaluator = DocumentNDCGEvaluator()
 410  result = evaluator.run(
 411      ground_truth_documents=[[Document(content="France", score=1.0), Document(content="Paris", score=0.5)]],
 412      retrieved_documents=[[Document(content="France"), Document(content="Germany"), Document(content="Paris")]],
 413  )
 414  print(result["individual_scores"])
 415  # [0.8869]
 416  print(result["score"])
 417  # 0.8869
 418  ```
 419  
 420  #### run
 421  
 422  ```python
 423  run(
 424      ground_truth_documents: list[list[Document]],
 425      retrieved_documents: list[list[Document]],
 426  ) -> dict[str, Any]
 427  ```
 428  
 429  Run the DocumentNDCGEvaluator on the given inputs.
 430  
 431  `ground_truth_documents` and `retrieved_documents` must have the same length.
 432  The list items within `ground_truth_documents` and `retrieved_documents` can differ in length.
 433  
 434  **Parameters:**
 435  
 436  - **ground_truth_documents** (<code>list\[list\[Document\]\]</code>) – Lists of expected documents, one list per question. Binary relevance is used if documents have no scores.
 437  - **retrieved_documents** (<code>list\[list\[Document\]\]</code>) – Lists of retrieved documents, one list per question.
 438  
 439  **Returns:**
 440  
 441  - <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
 442  - `score` - The average of calculated scores.
 443  - `individual_scores` - A list of numbers from 0.0 to 1.0 that represents the NDCG for each question.
 444  
 445  #### validate_inputs
 446  
 447  ```python
 448  validate_inputs(
 449      gt_docs: list[list[Document]], ret_docs: list[list[Document]]
 450  ) -> None
 451  ```
 452  
 453  Validate the input parameters.
 454  
 455  **Parameters:**
 456  
 457  - **gt_docs** (<code>list\[list\[Document\]\]</code>) – The ground_truth_documents to validate.
 458  - **ret_docs** (<code>list\[list\[Document\]\]</code>) – The retrieved_documents to validate.
 459  
 460  **Raises:**
 461  
- <code>ValueError</code> – If the ground_truth_documents or the retrieved_documents are an empty list.
 463    If the length of ground_truth_documents and retrieved_documents differs.
 464    If any list of documents in ground_truth_documents contains a mix of documents with and without a score.
 465  
 466  #### calculate_dcg
 467  
 468  ```python
 469  calculate_dcg(gt_docs: list[Document], ret_docs: list[Document]) -> float
 470  ```
 471  
 472  Calculate the discounted cumulative gain (DCG) of the retrieved documents.
 473  
 474  **Parameters:**
 475  
 476  - **gt_docs** (<code>list\[Document\]</code>) – The ground truth documents.
 477  - **ret_docs** (<code>list\[Document\]</code>) – The retrieved documents.
 478  
 479  **Returns:**
 480  
 481  - <code>float</code> – The discounted cumulative gain (DCG) of the retrieved
 482    documents based on the ground truth documents.
 483  
 484  #### calculate_idcg
 485  
 486  ```python
 487  calculate_idcg(gt_docs: list[Document]) -> float
 488  ```
 489  
 490  Calculate the ideal discounted cumulative gain (IDCG) of the ground truth documents.
 491  
 492  **Parameters:**
 493  
 494  - **gt_docs** (<code>list\[Document\]</code>) – The ground truth documents.
 495  
 496  **Returns:**
 497  
 498  - <code>float</code> – The ideal discounted cumulative gain (IDCG) of the ground truth documents.
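
For reference, the textbook DCG and IDCG formulas behind these helpers are sketched below using binary relevance. This illustrates the general metric only; the component's exact gain and discount choices are not spelled out here and may differ.

```python
import math

def dcg(relevances: list[float]) -> float:
    # Standard discounted cumulative gain: sum of rel_i / log2(i + 1), with positions starting at 1.
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

# Binary relevance for a ranking of [relevant, not relevant, relevant]:
retrieved_relevances = [1, 0, 1]
ideal_relevances = [1, 1]  # the two relevant documents ranked first

print(dcg(retrieved_relevances) / dcg(ideal_relevances))  # NDCG = 1.5 / ~1.63 ≈ 0.92
```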
 499  
 500  ## document_recall
 501  
 502  ### RecallMode
 503  
 504  Bases: <code>Enum</code>
 505  
 506  Enum for the mode to use for calculating the recall score.
 507  
 508  #### from_str
 509  
 510  ```python
 511  from_str(string: str) -> RecallMode
 512  ```
 513  
 514  Convert a string to a RecallMode enum.
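
A small, hedged example of the conversion; the module path and the `"single_hit"` string mirror the default mode shown for `DocumentRecallEvaluator` below.

```python
from haystack.components.evaluators.document_recall import RecallMode

mode = RecallMode.from_str("single_hit")
print(mode)  # RecallMode.SINGLE_HIT
```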
 515  
 516  ### DocumentRecallEvaluator
 517  
 518  Evaluator that calculates the Recall score for a list of documents.
 519  
 520  Returns both a list of scores for each question and the average.
 521  There can be multiple ground truth documents and multiple predicted documents as input.
 522  
 523  Usage example:
 524  
 525  ```python
 526  from haystack import Document
 527  from haystack.components.evaluators import DocumentRecallEvaluator
 528  
 529  evaluator = DocumentRecallEvaluator()
 530  result = evaluator.run(
 531      ground_truth_documents=[
 532          [Document(content="France")],
 533          [Document(content="9th century"), Document(content="9th")],
 534      ],
 535      retrieved_documents=[
 536          [Document(content="France")],
 537          [Document(content="9th century"), Document(content="10th century"), Document(content="9th")],
 538      ],
 539  )
 540  print(result["individual_scores"])
 541  # [1.0, 1.0]
 542  print(result["score"])
 543  # 1.0
 544  ```
 545  
 546  #### __init__
 547  
 548  ```python
 549  __init__(
 550      mode: str | RecallMode = RecallMode.SINGLE_HIT,
 551      document_comparison_field: str = "content",
 552  ) -> None
 553  ```
 554  
 555  Create a DocumentRecallEvaluator component.
 556  
 557  **Parameters:**
 558  
- **mode** (<code>str | RecallMode</code>) – Mode to use for calculating the recall score (see the sketch after this parameter list).
 560  - **document_comparison_field** (<code>str</code>) – The Document field to use for comparison. Possible options:
 561  - `"content"`: uses `doc.content`
 562  - `"id"`: uses `doc.id`
 563  - A `meta.` prefix followed by a key name: uses `doc.meta["<key>"]`
 564    (e.g. `"meta.file_id"`, `"meta.page_number"`)
 565    Nested keys are supported (e.g. `"meta.source.url"`).
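
A hedged sketch of how the mode affects individual scores; the `"multi_hit"` value is assumed here as the proportional counterpart to the default `"single_hit"` mode.

```python
from haystack import Document
from haystack.components.evaluators import DocumentRecallEvaluator

# "multi_hit" (assumed name) scores the proportion of ground truth documents retrieved,
# while the default "single_hit" scores 1 as soon as any ground truth document is retrieved.
evaluator = DocumentRecallEvaluator(mode="multi_hit")
result = evaluator.run(
    ground_truth_documents=[[Document(content="9th century"), Document(content="9th")]],
    retrieved_documents=[[Document(content="9th century"), Document(content="10th century")]],
)
print(result["individual_scores"])  # [0.5] - one of the two ground truth documents was retrieved
```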
 566  
 567  #### run
 568  
 569  ```python
 570  run(
 571      ground_truth_documents: list[list[Document]],
 572      retrieved_documents: list[list[Document]],
 573  ) -> dict[str, Any]
 574  ```
 575  
 576  Run the DocumentRecallEvaluator on the given inputs.
 577  
 578  `ground_truth_documents` and `retrieved_documents` must have the same length.
 579  
 580  **Parameters:**
 581  
 582  - **ground_truth_documents** (<code>list\[list\[Document\]\]</code>) – A list of expected documents for each question.
- **retrieved_documents** (<code>list\[list\[Document\]\]</code>) – A list of retrieved documents for each question.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
  - `score` - The average of calculated scores.
  - `individual_scores` - A list of numbers from 0.0 to 1.0 that represents the proportion of matching
    documents retrieved. If the mode is `single_hit`, the individual scores are 0 or 1.
 588  
 589  #### to_dict
 590  
 591  ```python
 592  to_dict() -> dict[str, Any]
 593  ```
 594  
 595  Serializes the component to a dictionary.
 596  
 597  **Returns:**
 598  
 599  - <code>dict\[str, Any\]</code> – Dictionary with serialized data.
 600  
 601  ## faithfulness
 602  
 603  ### FaithfulnessEvaluator
 604  
 605  Bases: <code>LLMEvaluator</code>
 606  
 607  Evaluator that checks if a generated answer can be inferred from the provided contexts.
 608  
An LLM separates the answer into multiple statements and checks whether each statement can be inferred from the
context. The final score for the full answer is a number from 0.0 to 1.0 that represents the proportion of
statements that can be inferred from the provided contexts.
 612  
 613  Usage example:
 614  
 615  ```python
 616  from haystack.components.evaluators import FaithfulnessEvaluator
 617  
 618  questions = ["Who created the Python language?"]
 619  contexts = [
 620      [(
 621          "Python, created by Guido van Rossum in the late 1980s, is a high-level general-purpose programming "
 622          "language. Its design philosophy emphasizes code readability, and its language constructs aim to help "
 623          "programmers write clear, logical code for both small and large-scale software projects."
 624      )],
 625  ]
 626  predicted_answers = [
 627      "Python is a high-level general-purpose programming language that was created by George Lucas."
 628  ]
 629  evaluator = FaithfulnessEvaluator()
 630  result = evaluator.run(questions=questions, contexts=contexts, predicted_answers=predicted_answers)
 631  
 632  print(result["individual_scores"])
 633  # [0.5]
 634  print(result["score"])
 635  # 0.5
 636  print(result["results"])
 637  # [{'statements': ['Python is a high-level general-purpose programming language.',
 638  # 'Python was created by George Lucas.'], 'statement_scores': [1, 0], 'score': 0.5}]
 639  ```
 640  
 641  #### __init__
 642  
 643  ```python
 644  __init__(
 645      examples: list[dict[str, Any]] | None = None,
 646      progress_bar: bool = True,
 647      raise_on_failure: bool = True,
 648      chat_generator: ChatGenerator | None = None,
 649  ) -> None
 650  ```
 651  
 652  Creates an instance of FaithfulnessEvaluator.
 653  
 654  If no LLM is specified using the `chat_generator` parameter, the component will use OpenAI in JSON mode.
 655  
 656  **Parameters:**
 657  
 658  - **examples** (<code>list\[dict\[str, Any\]\] | None</code>) – Optional few-shot examples conforming to the expected input and output format of FaithfulnessEvaluator.
 659    Default examples will be used if none are provided.
 660    Each example must be a dictionary with keys "inputs" and "outputs".
 661    "inputs" must be a dictionary with keys "questions", "contexts", and "predicted_answers".
 662    "outputs" must be a dictionary with "statements" and "statement_scores".
 663    Expected format:
 664  
 665  ```python
 666  [{
 667      "inputs": {
 668          "questions": "What is the capital of Italy?", "contexts": ["Rome is the capital of Italy."],
 669          "predicted_answers": "Rome is the capital of Italy with more than 4 million inhabitants.",
 670      },
 671      "outputs": {
 672          "statements": ["Rome is the capital of Italy.", "Rome has more than 4 million inhabitants."],
 673          "statement_scores": [1, 0],
 674      },
 675  }]
 676  ```
 677  
 678  - **progress_bar** (<code>bool</code>) – Whether to show a progress bar during the evaluation.
 679  - **raise_on_failure** (<code>bool</code>) – Whether to raise an exception if the API call fails.
- **chat_generator** (<code>ChatGenerator | None</code>) – A ChatGenerator instance that represents the LLM.
 681    In order for the component to work, the LLM should be configured to return a JSON object. For example,
 682    when using the OpenAIChatGenerator, you should pass `{"response_format": {"type": "json_object"}}` in the
 683    `generation_kwargs`.
 684  
 685  #### run
 686  
 687  ```python
 688  run(**inputs: Any) -> dict[str, Any]
 689  ```
 690  
 691  Run the LLM evaluator.
 692  
 693  **Parameters:**
 694  
 695  - **questions** – A list of questions.
 696  - **contexts** – A nested list of contexts that correspond to the questions.
 697  - **predicted_answers** – A list of predicted answers.
 698  
 699  **Returns:**
 700  
 701  - <code>dict\[str, Any\]</code> – A dictionary with the following outputs:
 702    - `score`: Mean faithfulness score over all the provided input answers.
 703    - `individual_scores`: A list of faithfulness scores for each input answer.
 704    - `results`: A list of dictionaries with `statements` and `statement_scores` for each input answer.
 705  
 706  #### to_dict
 707  
 708  ```python
 709  to_dict() -> dict[str, Any]
 710  ```
 711  
 712  Serialize this component to a dictionary.
 713  
 714  **Returns:**
 715  
 716  - <code>dict\[str, Any\]</code> – A dictionary with serialized data.
 717  
 718  #### from_dict
 719  
 720  ```python
 721  from_dict(data: dict[str, Any]) -> FaithfulnessEvaluator
 722  ```
 723  
 724  Deserialize this component from a dictionary.
 725  
 726  **Parameters:**
 727  
 728  - **data** (<code>dict\[str, Any\]</code>) – The dictionary representation of this component.
 729  
 730  **Returns:**
 731  
 732  - <code>FaithfulnessEvaluator</code> – The deserialized component instance.
 733  
 734  ## llm_evaluator
 735  
 736  ### LLMEvaluator
 737  
 738  Uses an LLM to evaluate inputs based on a prompt containing instructions and examples.
 739  
 740  The default API requires an OpenAI API key to be provided as an environment variable "OPENAI_API_KEY".
 741  The inputs are lists that are user-defined depending on the desired metric.
 742  The output is a dictionary with a key `results` containing a list of evaluation results.
Each result is a dictionary with user-defined keys and values of either 0 (false) or 1 (true).
 744  
 745  Usage example:
 746  
 747  ```python
 748  from haystack.components.evaluators import LLMEvaluator
 749  evaluator = LLMEvaluator(
 750      instructions="Is this answer problematic for children?",
 751      inputs=[("predicted_answers", list[str])],
 752      outputs=["score"],
 753      examples=[
 754          {"inputs": {"predicted_answers": "Damn, this is straight outta hell!!!"}, "outputs": {"score": 1}},
 755          {"inputs": {"predicted_answers": "Football is the most popular sport."}, "outputs": {"score": 0}},
 756      ],
 757  )
 758  predicted_answers = [
 759      "Football is the most popular sport with around 4 billion followers worldwide",
 760      "Python language was created by Guido van Rossum.",
 761  ]
 762  results = evaluator.run(predicted_answers=predicted_answers)
 763  print(results)
 764  # {'results': [{'score': 0}, {'score': 0}]}
 765  ```
 766  
 767  #### __init__
 768  
 769  ```python
 770  __init__(
 771      instructions: str,
 772      inputs: list[tuple[str, type[list]]],
 773      outputs: list[str],
 774      examples: list[dict[str, Any]],
 775      progress_bar: bool = True,
 776      *,
 777      raise_on_failure: bool = True,
 778      chat_generator: ChatGenerator | None = None
 779  ) -> None
 780  ```
 781  
 782  Creates an instance of LLMEvaluator.
 783  
 784  If no LLM is specified using the `chat_generator` parameter, the component will use OpenAI in JSON mode.
 785  
 786  **Parameters:**
 787  
 788  - **instructions** (<code>str</code>) – The prompt instructions to use for evaluation.
 789    Should be a question about the inputs that can be answered with yes or no.
 790  - **inputs** (<code>list\[tuple\[str, type\[list\]\]\]</code>) – The inputs that the component expects as incoming connections and that it evaluates.
 791    Each input is a tuple of an input name and input type. Input types must be lists.
 792  - **outputs** (<code>list\[str\]</code>) – Output names of the evaluation results. They correspond to keys in the output dictionary.
 793  - **examples** (<code>list\[dict\[str, Any\]\]</code>) – Few-shot examples conforming to the expected input and output format as defined in the `inputs` and
 794    `outputs` parameters.
  Each example is a dictionary with keys "inputs" and "outputs",
  which contain the input and output as dictionaries respectively.
 797  - **raise_on_failure** (<code>bool</code>) – If True, the component will raise an exception on an unsuccessful API call.
 798  - **progress_bar** (<code>bool</code>) – Whether to show a progress bar during the evaluation.
- **chat_generator** (<code>ChatGenerator | None</code>) – A ChatGenerator instance that represents the LLM.
 800    In order for the component to work, the LLM should be configured to return a JSON object. For example,
 801    when using the OpenAIChatGenerator, you should pass `{"response_format": {"type": "json_object"}}` in the
 802    `generation_kwargs`.
 803  
 804  #### warm_up
 805  
 806  ```python
 807  warm_up() -> None
 808  ```
 809  
 810  Warm up the component by warming up the underlying chat generator.
 811  
 812  #### validate_init_parameters
 813  
 814  ```python
 815  validate_init_parameters(
 816      inputs: list[tuple[str, type[list]]],
 817      outputs: list[str],
 818      examples: list[dict[str, Any]],
 819  ) -> None
 820  ```
 821  
 822  Validate the init parameters.
 823  
 824  **Parameters:**
 825  
 826  - **inputs** (<code>list\[tuple\[str, type\[list\]\]\]</code>) – The inputs to validate.
 827  - **outputs** (<code>list\[str\]</code>) – The outputs to validate.
 828  - **examples** (<code>list\[dict\[str, Any\]\]</code>) – The examples to validate.
 829  
 830  **Raises:**
 831  
 832  - <code>ValueError</code> – If the inputs are not a list of tuples with a string and a type of list.
 833    If the outputs are not a list of strings.
 834    If the examples are not a list of dictionaries.
 835    If any example does not have keys "inputs" and "outputs" with values that are dictionaries with string keys.
 836  
 837  #### run
 838  
 839  ```python
 840  run(**inputs: Any) -> dict[str, Any]
 841  ```
 842  
 843  Run the LLM evaluator.
 844  
 845  **Parameters:**
 846  
 847  - **inputs** (<code>Any</code>) – The input values to evaluate. The keys are the input names and the values are lists of input values.
 848  
 849  **Returns:**
 850  
 851  - <code>dict\[str, Any\]</code> – A dictionary with a `results` entry that contains a list of results.
 852    Each result is a dictionary containing the keys as defined in the `outputs` parameter of the LLMEvaluator
 853    and the evaluation results as the values. If an exception occurs for a particular input value, the result
 854    will be `None` for that entry.
 855    If the API is "openai" and the response contains a "meta" key, the metadata from OpenAI will be included
 856    in the output dictionary, under the key "meta".
 857  
 858  **Raises:**
 859  
- <code>ValueError</code> – Raised only if `raise_on_failure` is set to True and the received inputs are not lists or have
  different lengths, or if the output is not valid JSON or doesn't contain the expected keys.
 862  
 863  #### prepare_template
 864  
 865  ```python
 866  prepare_template() -> str
 867  ```
 868  
 869  Prepare the prompt template.
 870  
 871  Combine instructions, inputs, outputs, and examples into one prompt template with the following format:
 872  Instructions:
 873  `<instructions>`
 874  
 875  Generate the response in JSON format with the following keys:
 876  `<list of output keys>`
 877  Consider the instructions and the examples below to determine those values.
 878  
 879  Examples:
 880  `<examples>`
 881  
 882  Inputs:
 883  `<inputs>`
 884  Outputs:
 885  
 886  **Returns:**
 887  
 888  - <code>str</code> – The prompt template.
 889  
 890  #### to_dict
 891  
 892  ```python
 893  to_dict() -> dict[str, Any]
 894  ```
 895  
 896  Serialize this component to a dictionary.
 897  
 898  **Returns:**
 899  
 900  - <code>dict\[str, Any\]</code> – The serialized component as a dictionary.
 901  
 902  #### from_dict
 903  
 904  ```python
 905  from_dict(data: dict[str, Any]) -> LLMEvaluator
 906  ```
 907  
 908  Deserialize this component from a dictionary.
 909  
 910  **Parameters:**
 911  
 912  - **data** (<code>dict\[str, Any\]</code>) – The dictionary representation of this component.
 913  
 914  **Returns:**
 915  
 916  - <code>LLMEvaluator</code> – The deserialized component instance.
 917  
 918  #### validate_input_parameters
 919  
 920  ```python
 921  validate_input_parameters(
 922      expected: dict[str, Any], received: dict[str, Any]
 923  ) -> None
 924  ```
 925  
 926  Validate the input parameters.
 927  
 928  **Parameters:**
 929  
 930  - **expected** (<code>dict\[str, Any\]</code>) – The expected input parameters.
 931  - **received** (<code>dict\[str, Any\]</code>) – The received input parameters.
 932  
 933  **Raises:**
 934  
- <code>ValueError</code> – If not all expected inputs are present in the received inputs.
  If the received inputs are not lists or have different lengths.
 937  
 938  ## sas_evaluator
 939  
 940  ### SASEvaluator
 941  
SASEvaluator computes the Semantic Answer Similarity (SAS) between a list of predictions and a list of ground truths.
 943  
 944  It's usually used in Retrieval Augmented Generation (RAG) pipelines to evaluate the quality of the generated
 945  answers. The SAS is computed using a pre-trained model from the Hugging Face model hub. The model can be either a
 946  Bi-Encoder or a Cross-Encoder. The choice of the model is based on the `model` parameter.
 947  
 948  Usage example:
 949  
 950  ```python
 951  from haystack.components.evaluators.sas_evaluator import SASEvaluator
 952  
 953  evaluator = SASEvaluator(model="cross-encoder/ms-marco-MiniLM-L-6-v2")
 954  ground_truths = [
 955      "A construction budget of US $2.3 billion",
 956      "The Eiffel Tower, completed in 1889, symbolizes Paris's cultural magnificence.",
 957      "The Meiji Restoration in 1868 transformed Japan into a modernized world power.",
 958  ]
 959  predictions = [
 960      "A construction budget of US $2.3 billion",
 961      "The Eiffel Tower, completed in 1889, symbolizes Paris's cultural magnificence.",
 962      "The Meiji Restoration in 1868 transformed Japan into a modernized world power.",
 963  ]
 964  result = evaluator.run(
 965      ground_truth_answers=ground_truths, predicted_answers=predictions
 966  )
 967  
 968  print(result["score"])
 969  # 0.9999673763910929
 970  
 971  print(result["individual_scores"])
 972  # [0.9999765157699585, 0.999968409538269, 0.9999572038650513]
 973  ```
 974  
 975  #### __init__
 976  
 977  ```python
 978  __init__(
 979      model: str = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
 980      batch_size: int = 32,
 981      device: ComponentDevice | None = None,
 982      token: Secret = Secret.from_env_var(
 983          ["HF_API_TOKEN", "HF_TOKEN"], strict=False
 984      ),
 985  ) -> None
 986  ```
 987  
 988  Creates a new instance of SASEvaluator.
 989  
 990  **Parameters:**
 991  
- **model** (<code>str</code>) – SentenceTransformers semantic textual similarity model; should be a path or string pointing to a
  downloadable model.
 994  - **batch_size** (<code>int</code>) – Number of prediction-label pairs to encode at once.
 995  - **device** (<code>ComponentDevice | None</code>) – The device on which the model is loaded. If `None`, the default device is automatically selected.
 996  - **token** (<code>Secret</code>) – The Hugging Face token for HTTP bearer authorization.
 997    You can find your HF token in your [account settings](https://huggingface.co/settings/tokens)
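
A brief, hedged sketch of pinning the evaluator to a specific device; the `ComponentDevice.from_str` call and device string are assumptions for illustration.

```python
from haystack.components.evaluators.sas_evaluator import SASEvaluator
from haystack.utils import ComponentDevice

evaluator = SASEvaluator(
    model="sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
    device=ComponentDevice.from_str("cpu"),  # assumed device string; e.g. "cuda:0" for a GPU
)
evaluator.warm_up()  # load the model before the first call to run
```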
 998  
 999  #### to_dict
1000  
1001  ```python
1002  to_dict() -> dict[str, Any]
1003  ```
1004  
1005  Serialize this component to a dictionary.
1006  
1007  **Returns:**
1008  
1009  - <code>dict\[str, Any\]</code> – The serialized component as a dictionary.
1010  
1011  #### from_dict
1012  
1013  ```python
1014  from_dict(data: dict[str, Any]) -> SASEvaluator
1015  ```
1016  
1017  Deserialize this component from a dictionary.
1018  
1019  **Parameters:**
1020  
1021  - **data** (<code>dict\[str, Any\]</code>) – The dictionary representation of this component.
1022  
1023  **Returns:**
1024  
1025  - <code>SASEvaluator</code> – The deserialized component instance.
1026  
1027  #### warm_up
1028  
1029  ```python
1030  warm_up() -> None
1031  ```
1032  
1033  Initializes the component.
1034  
1035  #### run
1036  
1037  ```python
1038  run(
1039      ground_truth_answers: list[str], predicted_answers: list[str]
1040  ) -> dict[str, float | list[float]]
1041  ```
1042  
1043  SASEvaluator component run method.
1044  
1045  Run the SASEvaluator to compute the Semantic Answer Similarity (SAS) between a list of predicted answers
and a list of ground truth answers. Both must be lists of strings of the same length.
1047  
1048  **Parameters:**
1049  
1050  - **ground_truth_answers** (<code>list\[str\]</code>) – A list of expected answers for each question.
1051  - **predicted_answers** (<code>list\[str\]</code>) – A list of generated answers for each question.
1052  
1053  **Returns:**
1054  
1055  - <code>dict\[str, float | list\[float\]\]</code> – A dictionary with the following outputs:
1056    - `score`: Mean SAS score over all the predictions/ground-truth pairs.
1057    - `individual_scores`: A list of similarity scores for each prediction/ground-truth pair.