---
title: "LLMEvaluator"
id: llmevaluator
slug: "/llmevaluator"
description: "This Evaluator uses an LLM to evaluate inputs based on a prompt containing user-defined instructions and examples."
---

# LLMEvaluator

This Evaluator uses an LLM to evaluate inputs based on a prompt containing user-defined instructions and examples.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | On its own or in an evaluation pipeline. To be used after a separate pipeline that has generated the inputs for the Evaluator. |
| **Mandatory init variables** | `instructions`: The prompt instructions string <br /> <br />`inputs`: The expected inputs <br /> <br />`outputs`: The output names of the evaluation results <br /> <br />`examples`: Few-shot examples conforming to the input and output format |
| **Mandatory run variables** | `inputs`: Defined by the user – for example, questions or responses |
| **Output variables** | `results`: A dictionary containing keys defined by the user, such as score |
| **API reference** | [Evaluators](/reference/evaluators-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/evaluators/llm_evaluator.py |

</div>

## Overview

The `LLMEvaluator` component can evaluate answers, documents, or any other outputs of a Haystack pipeline based on a user-defined aspect. The component combines the instructions, examples, and expected output names into one prompt. It is meant for calculating user-defined model-based evaluation metrics. If you are looking for predefined model-based evaluators that work out of the box, have a look at Haystack's [`FaithfulnessEvaluator`](faithfulnessevaluator.mdx) and [`ContextRelevanceEvaluator`](contextrelevanceevaluator.mdx) components instead.

### Parameters

The default model for this Evaluator is `gpt-4o-mini`.
You can override the model using the `chat_generator` parameter during initialization. This needs to be a Chat Generator instance configured to return a JSON object. For example, when using the [`OpenAIChatGenerator`](../generators/openaichatgenerator.mdx), you should pass `{"response_format": {"type": "json_object"}}` in its `generation_kwargs`.

If you don't initialize the Evaluator with a Chat Generator of your own, a valid OpenAI API key must be set as the `OPENAI_API_KEY` environment variable. For details, see our [documentation page on secret management](../../concepts/secret-management.mdx).

`LLMEvaluator` requires six parameters for initialization:

- `instructions`: The prompt instructions to use for evaluation, such as a question about the inputs that the LLM can answer with _yes_, _no_, or a score.
- `inputs`: The inputs that the `LLMEvaluator` expects and evaluates. The inputs determine the incoming connections of the component. Each input is a tuple of an input name and input type. Input types must be lists. An example could be `[("responses", List[str])]`.
- `outputs`: Output names of the evaluation results corresponding to keys in the output dictionary. An example could be `["score"]`.
- `examples`: Use this parameter to pass few-shot examples conforming to the expected input and output format. These examples are included in the prompt sent to the LLM. Examples increase the number of tokens in the prompt and make each request more costly. Adding more than one or two examples can improve the quality of the evaluation at the cost of more tokens.
- `raise_on_failure`: If `True` (default), raise an exception on an unsuccessful API call.
- `progress_bar`: Whether to show a progress bar during the evaluation. `None` is the default.

Each example must be a dictionary with keys `inputs` and `outputs`.
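To illustrate how these parameters fit together, here is a minimal, hypothetical sketch of how instructions, expected output names, and few-shot examples could be stitched into a single JSON-oriented prompt. The `build_eval_prompt` helper and its exact wording are illustrative assumptions; the real template used by `LLMEvaluator` lives in the source file linked above and may differ.

```python
import json
from typing import Any, Dict, List


def build_eval_prompt(
    instructions: str,
    inputs: Dict[str, Any],
    outputs: List[str],
    examples: List[Dict[str, Dict[str, Any]]],
) -> str:
    """Hypothetical helper: combine instructions, expected output keys,
    and few-shot examples into one prompt string asking for JSON output."""
    lines = [
        f"Instructions: {instructions}",
        f"Respond with a JSON object containing the keys: {', '.join(outputs)}.",
    ]
    # Few-shot examples demonstrate the expected input/output format.
    for example in examples:
        lines.append(f"Inputs: {json.dumps(example['inputs'])}")
        lines.append(f"Outputs: {json.dumps(example['outputs'])}")
    # The actual inputs to evaluate come last, leaving the outputs to the LLM.
    lines.append(f"Inputs: {json.dumps(inputs)}")
    lines.append("Outputs:")
    return "\n".join(lines)


prompt = build_eval_prompt(
    instructions="Is this answer problematic for children?",
    inputs={"responses": "Football is the most popular sport."},
    outputs=["score"],
    examples=[
        {
            "inputs": {"responses": "Damn, this is straight outta hell!!!"},
            "outputs": {"score": 1},
        },
    ],
)
print(prompt)
```

Because the prompt grows linearly with the number of examples, each extra example adds its full JSON representation to every request.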
`inputs` must be a dictionary with keys `questions` and `contexts`.
`outputs` must be a dictionary with keys `statements` and `statement_scores`.

Here is the expected format:

```python
[
    {
        "inputs": {
            "questions": "What is the capital of Italy?",
            "contexts": ["Rome is the capital of Italy."],
        },
        "outputs": {
            "statements": [
                "Rome is the capital of Italy.",
                "Rome has more than 4 million inhabitants.",
            ],
            "statement_scores": [1, 0],
        },
    },
]
```

## Usage

### On its own

Below is an example where we use an `LLMEvaluator` component to evaluate a generated response. The aspect we evaluate is whether the response is problematic for children, as defined in the instructions. The `LLMEvaluator` returns one binary score per input response; in this case, neither response is flagged as problematic.

```python
from typing import List

from haystack.components.evaluators import LLMEvaluator

llm_evaluator = LLMEvaluator(
    instructions="Is this answer problematic for children?",
    inputs=[("responses", List[str])],
    outputs=["score"],
    examples=[
        {
            "inputs": {"responses": "Damn, this is straight outta hell!!!"},
            "outputs": {"score": 1},
        },
        {
            "inputs": {"responses": "Football is the most popular sport."},
            "outputs": {"score": 0},
        },
    ],
)
responses = [
    "Football is the most popular sport with around 4 billion followers worldwide",
    "Python language was created by Guido van Rossum.",
]
results = llm_evaluator.run(responses=responses)
print(results)
# {'results': [{'score': 0}, {'score': 0}]}
```

### In a pipeline

Below is an example where we use an `LLMEvaluator` in a pipeline to evaluate a response.
```python
from typing import List

from haystack import Pipeline
from haystack.components.evaluators import LLMEvaluator

pipeline = Pipeline()
llm_evaluator = LLMEvaluator(
    instructions="Is this answer problematic for children?",
    inputs=[("responses", List[str])],
    outputs=["score"],
    examples=[
        {
            "inputs": {"responses": "Damn, this is straight outta hell!!!"},
            "outputs": {"score": 1},
        },
        {
            "inputs": {"responses": "Football is the most popular sport."},
            "outputs": {"score": 0},
        },
    ],
)

pipeline.add_component("llm_evaluator", llm_evaluator)

responses = [
    "Football is the most popular sport with around 4 billion followers worldwide",
    "Python language was created by Guido van Rossum.",
]

result = pipeline.run({"llm_evaluator": {"responses": responses}})

for evaluator in result:
    print(result[evaluator]["results"])
# [{'score': 0}, {'score': 0}]
```
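The per-response scores in `results` can then be aggregated outside the pipeline. Here is a small plain-Python sketch, assuming the output shape shown above, that computes the share of responses flagged as problematic:

```python
# Aggregate LLMEvaluator-style output: each entry holds one binary score,
# where 1 means the response was flagged as problematic.
results = [{"score": 0}, {"score": 0}, {"score": 1}]

flagged = sum(r["score"] for r in results)
rate = flagged / len(results)
print(f"{flagged} of {len(results)} responses flagged ({rate:.0%})")
# 1 of 3 responses flagged (33%)
```

The same pattern works for any output names you define in `outputs`; just index the result dictionaries by the corresponding key.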