---
title: "Classifiers"
id: classifiers-api
description: "Classify documents based on the provided labels."
slug: "/classifiers-api"
---

## document_language_classifier

### DocumentLanguageClassifier

Classifies the language of each document and adds it to its metadata.

Provide a list of languages during initialization. If the document's text doesn't match any of the
specified languages, the metadata value is set to "unmatched".
To route documents based on their language, use the MetadataRouter component after DocumentLanguageClassifier.
For routing plain text, use the TextLanguageRouter component instead.

### Usage example

```python
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.classifiers import DocumentLanguageClassifier
from haystack.components.routers import MetadataRouter
from haystack.components.writers import DocumentWriter

docs = [
    Document(id="1", content="This is an English document"),
    Document(id="2", content="Este es un documento en español"),
]

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component(instance=DocumentLanguageClassifier(languages=["en"]), name="language_classifier")
p.add_component(
    instance=MetadataRouter(rules={
        "en": {
            "field": "meta.language",
            "operator": "==",
            "value": "en",
        }
    }),
    name="router",
)
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
p.connect("language_classifier.documents", "router.documents")
p.connect("router.en", "writer.documents")

p.run({"language_classifier": {"documents": docs}})

written_docs = document_store.filter_documents()
assert len(written_docs) == 1
assert written_docs[0] == Document(id="1", content="This is an English document", meta={"language": "en"})
```

#### __init__

```python
__init__(languages: list[str] | None = None)
```

Initializes the DocumentLanguageClassifier component.

**Parameters:**

- **languages** (<code>list\[str\] | None</code>) – A list of ISO language codes.
  See the supported languages in the [`langdetect` documentation](https://github.com/Mimino666/langdetect#languages).
  If not specified, defaults to `["en"]`.

#### run

```python
run(documents: list[Document])
```

Classifies the language of each document and adds it to its metadata.

If the document's text doesn't match any of the languages specified at initialization,
sets the metadata value to "unmatched".

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of documents for language classification.

**Returns:**

- A dictionary with the following key:
  - `documents`: A list of documents with an added `language` metadata field.

**Raises:**

- <code>TypeError</code> – If the input is not a list of Documents.

## zero_shot_document_classifier

### TransformersZeroShotDocumentClassifier

Performs zero-shot classification of documents based on given labels and adds the predicted label to their metadata.

The component uses a Hugging Face pipeline for zero-shot classification.
Provide the model and the set of labels to be used for categorization during initialization.
Additionally, you can configure the component to allow multiple labels to be true.

Classification is run on the document's content field by default. If you want it to run on another field, set
`classification_field` to one of the document's metadata fields.
Available models for the zero-shot classification task include:

- `valhalla/distilbart-mnli-12-3`
- `cross-encoder/nli-distilroberta-base`
- `cross-encoder/nli-deberta-v3-xsmall`

### Usage example

The following pipeline classifies documents based on predefined classification labels
retrieved from a search pipeline:

```python
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.core.pipeline import Pipeline
from haystack.components.classifiers import TransformersZeroShotDocumentClassifier

documents = [
    Document(id="0", content="Today was a nice day!"),
    Document(id="1", content="Yesterday was a bad day!"),
]

document_store = InMemoryDocumentStore()
retriever = InMemoryBM25Retriever(document_store=document_store)
document_classifier = TransformersZeroShotDocumentClassifier(
    model="cross-encoder/nli-deberta-v3-xsmall",
    labels=["positive", "negative"],
)

document_store.write_documents(documents)

pipeline = Pipeline()
pipeline.add_component(instance=retriever, name="retriever")
pipeline.add_component(instance=document_classifier, name="document_classifier")
pipeline.connect("retriever", "document_classifier")

queries = ["How was your day today?", "How was your day yesterday?"]
expected_predictions = ["positive", "negative"]

for idx, query in enumerate(queries):
    result = pipeline.run({"retriever": {"query": query, "top_k": 1}})
    assert result["document_classifier"]["documents"][0].to_dict()["id"] == str(idx)
    assert (result["document_classifier"]["documents"][0].to_dict()["classification"]["label"]
            == expected_predictions[idx])
```

#### __init__

```python
__init__(
    model: str,
    labels: list[str],
    multi_label: bool = False,
    classification_field: str | None = None,
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    huggingface_pipeline_kwargs: dict[str, Any] | None = None,
)
```

Initializes the TransformersZeroShotDocumentClassifier.

See the Hugging Face [website](https://huggingface.co/models?pipeline_tag=zero-shot-classification&sort=downloads&search=nli)
for the full list of zero-shot classification (NLI) models.

**Parameters:**

- **model** (<code>str</code>) – The name or path of a Hugging Face model for zero-shot document classification.
- **labels** (<code>list\[str\]</code>) – The set of possible class labels to classify each document into, for example,
  `["positive", "negative"]`. The labels depend on the selected model.
- **multi_label** (<code>bool</code>) – Whether or not multiple candidate labels can be true.
  If `False`, the scores are normalized such that
  the sum of the label likelihoods for each sequence is 1. If `True`, the labels are considered
  independent and probabilities are normalized for each candidate by doing a softmax of the entailment
  score vs. the contradiction score.
- **classification_field** (<code>str | None</code>) – Name of the document's meta field to use for classification.
  If not set, `Document.content` is used by default.
- **device** (<code>ComponentDevice | None</code>) – The device on which the model is loaded. If `None`, the default device is automatically
  selected. If a device/device map is specified in `huggingface_pipeline_kwargs`, it overrides this parameter.
- **token** (<code>Secret | None</code>) – The Hugging Face token to use as HTTP bearer authorization.
  Check your HF token in your [account settings](https://huggingface.co/settings/tokens).
- **huggingface_pipeline_kwargs** (<code>dict\[str, Any\] | None</code>) – Dictionary containing keyword arguments used to initialize the
  Hugging Face pipeline for text classification.

#### warm_up

```python
warm_up()
```

Initializes the component.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> TransformersZeroShotDocumentClassifier
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>TransformersZeroShotDocumentClassifier</code> – Deserialized component.

#### run

```python
run(documents: list[Document], batch_size: int = 1)
```

Classifies the documents based on the provided labels and adds the results to their metadata.

The classification results are stored in the `classification` dict within
each document's metadata. If `multi_label` is set to `True`, the scores for each label are available under
the `details` key within the `classification` dictionary.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to process.
- **batch_size** (<code>int</code>) – Batch size used for processing the content in each document.

**Returns:**

- A dictionary with the following key:
  - `documents`: A list of documents with an added metadata field called `classification`.
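The resulting metadata shape can be sketched as a plain dictionary. This is a hedged illustration of the layout described above with invented label names and scores, not actual model output:

```python
# Illustrative shape of doc.meta["classification"] after run() with multi_label=True;
# the labels and scores here are made up for the example.
classification = {
    "label": "positive",      # top predicted label
    "score": 0.91,            # score of the top label
    "details": {              # per-label scores, available when multi_label=True
        "positive": 0.91,
        "negative": 0.09,
    },
}

# The top-scoring entry in `details` matches the predicted label.
top_label = max(classification["details"], key=classification["details"].get)
assert top_label == classification["label"]
```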