---
title: Classifiers
id: classifiers-api
description: Classify documents based on the provided labels.
slug: "/classifiers-api"
---

<a id="document_language_classifier"></a>

# Module document\_language\_classifier

<a id="document_language_classifier.DocumentLanguageClassifier"></a>

## DocumentLanguageClassifier

Classifies the language of each document and adds it to its metadata.

Provide a list of languages during initialization. If the document's text doesn't match any of the
specified languages, the metadata value is set to "unmatched".
To route documents based on their language, use the MetadataRouter component after DocumentLanguageClassifier.
For routing plain text, use the TextLanguageRouter component instead.

### Usage example

```python
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.classifiers import DocumentLanguageClassifier
from haystack.components.routers import MetadataRouter
from haystack.components.writers import DocumentWriter

docs = [Document(id="1", content="This is an English document"),
        Document(id="2", content="Este es un documento en español")]

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component(instance=DocumentLanguageClassifier(languages=["en"]), name="language_classifier")
p.add_component(instance=MetadataRouter(rules={"en": {"language": {"$eq": "en"}}}), name="router")
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
p.connect("language_classifier.documents", "router.documents")
p.connect("router.en", "writer.documents")

p.run({"language_classifier": {"documents": docs}})

written_docs = document_store.filter_documents()
assert len(written_docs) == 1
assert written_docs[0] == Document(id="1", content="This is an English document", meta={"language": "en"})
```
<a id="document_language_classifier.DocumentLanguageClassifier.__init__"></a>

#### DocumentLanguageClassifier.\_\_init\_\_

```python
def __init__(languages: Optional[list[str]] = None)
```

Initializes the DocumentLanguageClassifier component.

**Arguments**:

- `languages`: A list of ISO language codes.
See the supported languages in [`langdetect` documentation](https://github.com/Mimino666/langdetect#languages).
If not specified, defaults to ["en"].

<a id="document_language_classifier.DocumentLanguageClassifier.run"></a>

#### DocumentLanguageClassifier.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Classifies the language of each document and adds it to its metadata.

If the document's text doesn't match any of the languages specified at initialization,
sets the metadata value to "unmatched".

**Arguments**:

- `documents`: A list of documents for language classification.

**Raises**:

- `TypeError`: if the input is not a list of Documents.

**Returns**:

A dictionary with the following key:
- `documents`: A list of documents with an added `language` metadata field.

<a id="zero_shot_document_classifier"></a>

# Module zero\_shot\_document\_classifier

<a id="zero_shot_document_classifier.TransformersZeroShotDocumentClassifier"></a>

## TransformersZeroShotDocumentClassifier

Performs zero-shot classification of documents based on given labels and adds the predicted label to their metadata.

The component uses a Hugging Face pipeline for zero-shot classification.
Provide the model and the set of labels to be used for categorization during initialization.
Additionally, you can configure the component to allow multiple labels to be true.

Classification is run on the document's content field by default.
If you want it to run on another field, set the
`classification_field` to one of the document's metadata fields.

Available models for the task of zero-shot-classification include:
- `valhalla/distilbart-mnli-12-3`
- `cross-encoder/nli-distilroberta-base`
- `cross-encoder/nli-deberta-v3-xsmall`

### Usage example

The following is a pipeline that classifies documents based on predefined classification labels
retrieved from a search pipeline:

```python
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.core.pipeline import Pipeline
from haystack.components.classifiers import TransformersZeroShotDocumentClassifier

documents = [Document(id="0", content="Today was a nice day!"),
             Document(id="1", content="Yesterday was a bad day!")]

document_store = InMemoryDocumentStore()
retriever = InMemoryBM25Retriever(document_store=document_store)
document_classifier = TransformersZeroShotDocumentClassifier(
    model="cross-encoder/nli-deberta-v3-xsmall",
    labels=["positive", "negative"],
)

document_store.write_documents(documents)

pipeline = Pipeline()
pipeline.add_component(instance=retriever, name="retriever")
pipeline.add_component(instance=document_classifier, name="document_classifier")
pipeline.connect("retriever", "document_classifier")

queries = ["How was your day today?", "How was your day yesterday?"]
expected_predictions = ["positive", "negative"]

for idx, query in enumerate(queries):
    result = pipeline.run({"retriever": {"query": query, "top_k": 1}})
    assert result["document_classifier"]["documents"][0].to_dict()["id"] == str(idx)
    assert (result["document_classifier"]["documents"][0].to_dict()["classification"]["label"]
            == expected_predictions[idx])
```

<a id="zero_shot_document_classifier.TransformersZeroShotDocumentClassifier.__init__"></a>

#### TransformersZeroShotDocumentClassifier.\_\_init\_\_

```python
def __init__(model: str,
             labels: list[str],
             multi_label: bool = False,
             classification_field: Optional[str] = None,
             device: Optional[ComponentDevice] = None,
             token: Optional[Secret] = Secret.from_env_var(
                 ["HF_API_TOKEN", "HF_TOKEN"], strict=False),
             huggingface_pipeline_kwargs: Optional[dict[str, Any]] = None)
```

Initializes the TransformersZeroShotDocumentClassifier.

See the Hugging Face [website](https://huggingface.co/models?pipeline_tag=zero-shot-classification&sort=downloads&search=nli)
for the full list of zero-shot classification (NLI) models.

**Arguments**:

- `model`: The name or path of a Hugging Face model for zero-shot document classification.
- `labels`: The set of possible class labels to classify each document into, for example,
["positive", "negative"]. The labels depend on the selected model.
- `multi_label`: Whether multiple candidate labels can be true.
If `False`, the scores are normalized such that
the sum of the label likelihoods for each sequence is 1. If `True`, the labels are considered
independent and probabilities are normalized for each candidate by doing a softmax of the entailment
score vs. the contradiction score.
- `classification_field`: Name of the document's meta field to be used for classification.
If not set, `Document.content` is used by default.
- `device`: The device on which the model is loaded. If `None`, the default device is automatically
selected. If a device/device map is specified in `huggingface_pipeline_kwargs`, it overrides this parameter.
- `token`: The Hugging Face token to use as HTTP bearer authorization.
Check your HF token in your [account settings](https://huggingface.co/settings/tokens).
- `huggingface_pipeline_kwargs`: Dictionary containing keyword arguments used to initialize the
Hugging Face pipeline for text classification.

<a id="zero_shot_document_classifier.TransformersZeroShotDocumentClassifier.warm_up"></a>

#### TransformersZeroShotDocumentClassifier.warm\_up

```python
def warm_up()
```

Initializes the component.

<a id="zero_shot_document_classifier.TransformersZeroShotDocumentClassifier.to_dict"></a>

#### TransformersZeroShotDocumentClassifier.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="zero_shot_document_classifier.TransformersZeroShotDocumentClassifier.from_dict"></a>

#### TransformersZeroShotDocumentClassifier.from\_dict

```python
@classmethod
def from_dict(
        cls, data: dict[str, Any]) -> "TransformersZeroShotDocumentClassifier"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="zero_shot_document_classifier.TransformersZeroShotDocumentClassifier.run"></a>

#### TransformersZeroShotDocumentClassifier.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document], batch_size: int = 1)
```

Classifies the documents based on the provided labels and adds the predicted labels to their metadata.

The classification results are stored in the `classification` dict within
each document's metadata. If `multi_label` is set to `True`, the scores for each label are available under
the `details` key within the `classification` dictionary.

**Arguments**:

- `documents`: Documents to process.
- `batch_size`: Batch size used for processing the content in each document.

**Returns**:

A dictionary with the following key:
- `documents`: A list of documents with an added metadata field called `classification`.