---
title: "LLMEvaluator"
id: llmevaluator
slug: "/llmevaluator"
description: "This Evaluator uses an LLM to evaluate inputs based on a prompt containing user-defined instructions and examples."
---

# LLMEvaluator

This Evaluator uses an LLM to evaluate inputs based on a prompt containing user-defined instructions and examples.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | On its own or in an evaluation pipeline. To be used after a separate pipeline that has generated the inputs for the Evaluator. |
| **Mandatory init variables** | `instructions`: The prompt instructions string  <br /> <br />`inputs`: The expected inputs  <br /> <br />`outputs`: The output names of the evaluation results  <br /> <br />`examples`: Few-shot examples conforming to the input and output format |
| **Mandatory run variables** | `inputs`: Defined by the user – for example, questions or responses |
| **Output variables** | `results`: A dictionary containing keys defined by the user, such as score |
| **API reference** | [Evaluators](/reference/evaluators-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/evaluators/llm_evaluator.py |

</div>

## Overview

The `LLMEvaluator` component can evaluate answers, documents, or any other outputs of a Haystack pipeline based on a user-defined aspect. The component combines the instructions, examples, and expected output names into one prompt. It is meant for calculating user-defined model-based evaluation metrics. If you are looking for pre-defined model-based evaluators that work out of the box, have a look at Haystack’s [`FaithfulnessEvaluator`](faithfulnessevaluator.mdx) and [`ContextRelevanceEvaluator`](contextrelevanceevaluator.mdx) components instead.
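To make this concrete, here is a rough sketch in plain Python of how instructions, few-shot examples, and output names could be combined into a single prompt. The `build_eval_prompt` helper and its exact prompt wording are illustrative assumptions, not Haystack's actual implementation:

```python
import json

def build_eval_prompt(instructions: str, input_names: list[str],
                      output_names: list[str], examples: list[dict]) -> str:
    """Combine instructions, expected inputs/outputs, and few-shot
    examples into one prompt string (simplified illustration)."""
    lines = [
        f"Instructions: {instructions}",
        f"Inputs to evaluate: {', '.join(input_names)}",
        f"Respond with a JSON object containing the keys: {', '.join(output_names)}.",
    ]
    for example in examples:
        lines.append(f"Example inputs: {json.dumps(example['inputs'])}")
        lines.append(f"Example outputs: {json.dumps(example['outputs'])}")
    return "\n".join(lines)

prompt = build_eval_prompt(
    instructions="Is this answer problematic for children?",
    input_names=["responses"],
    output_names=["score"],
    examples=[
        {"inputs": {"responses": "Football is the most popular sport."},
         "outputs": {"score": 0}},
    ],
)
print(prompt)
```

The actual component also validates that the example keys match the declared input and output names before sending the prompt to the LLM.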

### Parameters

The default model for this Evaluator is `gpt-4o-mini`. You can override the model using the `chat_generator` parameter during initialization. This needs to be a Chat Generator instance configured to return a JSON object. For example, when using the [`OpenAIChatGenerator`](../generators/openaichatgenerator.mdx), you should pass `{"response_format": {"type": "json_object"}}` in its `generation_kwargs`.
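For instance, overriding the default model could look like the following configuration sketch. It assumes a valid `OPENAI_API_KEY` is set in your environment, and the model name `gpt-4o` is just an example:

```python
from typing import List

from haystack.components.evaluators import LLMEvaluator
from haystack.components.generators.chat import OpenAIChatGenerator

# A Chat Generator configured to return a JSON object
chat_generator = OpenAIChatGenerator(
    model="gpt-4o",
    generation_kwargs={"response_format": {"type": "json_object"}},
)

llm_evaluator = LLMEvaluator(
    instructions="Is this answer problematic for children?",
    inputs=[("responses", List[str])],
    outputs=["score"],
    examples=[
        {"inputs": {"responses": "Damn!"}, "outputs": {"score": 1}},
    ],
    chat_generator=chat_generator,
)
```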

If you don't initialize the Evaluator with your own Chat Generator, a valid OpenAI API key must be set as an `OPENAI_API_KEY` environment variable. For details, see our [documentation page on secret management](../../concepts/secret-management.mdx).

`LLMEvaluator` takes six initialization parameters:

- `instructions`: The prompt instructions to use for evaluation, such as a question about the inputs that the LLM can answer with _yes_, _no_, or a score.
- `inputs`: The inputs that the `LLMEvaluator` expects and that it evaluates. The inputs determine the incoming connections of the component. Each input is a tuple of an input name and input type. Input types must be lists. An example could be `[("responses", List[str])]`.
- `outputs`: Output names of the evaluation results corresponding to keys in the output dictionary. An example could be `["score"]`.
- `examples`: Use this parameter to pass few-shot examples conforming to the expected input and output format. These examples are included in the prompt sent to the LLM. Each example increases the number of tokens in the prompt and makes every request more costly, so add more than one or two examples only if the improved evaluation quality is worth the extra tokens.
- `raise_on_failure`: If `True` (the default), raise an exception on an unsuccessful API call.
- `progress_bar`: Whether to show a progress bar during the evaluation. Defaults to `True`.

Each example must be a dictionary with the keys `inputs` and `outputs`.
The keys of the `inputs` dictionary must match the input names you defined (in the example below, `questions` and `contexts`).
The keys of the `outputs` dictionary must match the defined output names (here, `statements` and `statement_scores`).

Here is the expected format:

```python
[
    {
        "inputs": {
            "questions": "What is the capital of Italy?",
            "contexts": ["Rome is the capital of Italy."],
        },
        "outputs": {
            "statements": [
                "Rome is the capital of Italy.",
                "Rome has more than 4 million inhabitants.",
            ],
            "statement_scores": [1, 0],
        },
    },
]
```

## Usage

### On its own

Below is an example where we use an `LLMEvaluator` component to evaluate a generated response. The aspect we evaluate is whether the response is problematic for children, as defined in the instructions. The `LLMEvaluator` returns one binary score per input response; in this example, both responses are evaluated as not problematic.

```python
from typing import List
from haystack.components.evaluators import LLMEvaluator

llm_evaluator = LLMEvaluator(
    instructions="Is this answer problematic for children?",
    inputs=[("responses", List[str])],
    outputs=["score"],
    examples=[
        {
            "inputs": {"responses": "Damn, this is straight outta hell!!!"},
            "outputs": {"score": 1},
        },
        {
            "inputs": {"responses": "Football is the most popular sport."},
            "outputs": {"score": 0},
        },
    ],
)
responses = [
    "Football is the most popular sport with around 4 billion followers worldwide",
    "Python language was created by Guido van Rossum.",
]
results = llm_evaluator.run(responses=responses)
print(results)
# {'results': [{'score': 0}, {'score': 0}]}
```

### In a pipeline

Below is an example where we use an `LLMEvaluator` in a pipeline to evaluate a response.

```python
from typing import List
from haystack import Pipeline
from haystack.components.evaluators import LLMEvaluator

pipeline = Pipeline()
llm_evaluator = LLMEvaluator(
    instructions="Is this answer problematic for children?",
    inputs=[("responses", List[str])],
    outputs=["score"],
    examples=[
        {
            "inputs": {"responses": "Damn, this is straight outta hell!!!"},
            "outputs": {"score": 1},
        },
        {
            "inputs": {"responses": "Football is the most popular sport."},
            "outputs": {"score": 0},
        },
    ],
)

pipeline.add_component("llm_evaluator", llm_evaluator)

responses = [
    "Football is the most popular sport with around 4 billion followers worldwide",
    "Python language was created by Guido van Rossum.",
]

result = pipeline.run({"llm_evaluator": {"responses": responses}})

for evaluator in result:
    print(result[evaluator]["results"])
# [{'score': 0}, {'score': 0}]
```