---
title: "Classifiers"
id: classifiers-api
description: "Classify documents based on the provided labels."
slug: "/classifiers-api"
---

## document_language_classifier

### DocumentLanguageClassifier

Classifies the language of each document and adds it to its metadata.

Provide a list of languages during initialization. If the document's text doesn't match any of the
specified languages, the metadata value is set to "unmatched".
To route documents based on their language, use the MetadataRouter component after DocumentLanguageClassifier.
For routing plain text, use the TextLanguageRouter component instead.

### Usage example

```python
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.classifiers import DocumentLanguageClassifier
from haystack.components.routers import MetadataRouter
from haystack.components.writers import DocumentWriter

docs = [
    Document(id="1", content="This is an English document"),
    Document(id="2", content="Este es un documento en español"),
]

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component(instance=DocumentLanguageClassifier(languages=["en"]), name="language_classifier")
p.add_component(
    instance=MetadataRouter(rules={
        "en": {
            "field": "meta.language",
            "operator": "==",
            "value": "en",
        }
    }),
    name="router",
)
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
p.connect("language_classifier.documents", "router.documents")
p.connect("router.en", "writer.documents")

p.run({"language_classifier": {"documents": docs}})

written_docs = document_store.filter_documents()
assert len(written_docs) == 1
assert written_docs[0] == Document(id="1", content="This is an English document", meta={"language": "en"})
```

#### __init__

```python
__init__(languages: list[str] | None = None)
```

Initializes the DocumentLanguageClassifier component.

**Parameters:**

- **languages** (<code>list\[str\] | None</code>) – A list of ISO language codes.
  See the supported languages in the [`langdetect` documentation](https://github.com/Mimino666/langdetect#languages).
  If not specified, defaults to `["en"]`.

#### run

```python
run(documents: list[Document])
```

Classifies the language of each document and adds it to its metadata.

If the document's text doesn't match any of the languages specified at initialization,
sets the metadata value to "unmatched".

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – A list of documents for language classification.

**Returns:**

- A dictionary with the following key:
  - `documents`: A list of documents with an added `language` metadata field.

**Raises:**

- <code>TypeError</code> – If the input is not a list of Documents.

## zero_shot_document_classifier

### TransformersZeroShotDocumentClassifier

Performs zero-shot classification of documents based on given labels and adds the predicted label to their metadata.

The component uses a Hugging Face pipeline for zero-shot classification.
Provide the model and the set of labels to be used for categorization during initialization.
Additionally, you can configure the component to allow multiple labels to be true.

Classification is run on the document's content field by default. If you want it to run on another field, set
`classification_field` to one of the document's metadata fields.
Available models for the zero-shot classification task include:

- `valhalla/distilbart-mnli-12-3`
- `cross-encoder/nli-distilroberta-base`
- `cross-encoder/nli-deberta-v3-xsmall`

### Usage example

The following pipeline classifies documents based on predefined classification labels
retrieved from a search pipeline:

```python
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.core.pipeline import Pipeline
from haystack.components.classifiers import TransformersZeroShotDocumentClassifier

documents = [
    Document(id="0", content="Today was a nice day!"),
    Document(id="1", content="Yesterday was a bad day!"),
]

document_store = InMemoryDocumentStore()
retriever = InMemoryBM25Retriever(document_store=document_store)
document_classifier = TransformersZeroShotDocumentClassifier(
    model="cross-encoder/nli-deberta-v3-xsmall",
    labels=["positive", "negative"],
)

document_store.write_documents(documents)

pipeline = Pipeline()
pipeline.add_component(instance=retriever, name="retriever")
pipeline.add_component(instance=document_classifier, name="document_classifier")
pipeline.connect("retriever", "document_classifier")

queries = ["How was your day today?", "How was your day yesterday?"]
expected_predictions = ["positive", "negative"]

for idx, query in enumerate(queries):
    result = pipeline.run({"retriever": {"query": query, "top_k": 1}})
    assert result["document_classifier"]["documents"][0].to_dict()["id"] == str(idx)
    assert (result["document_classifier"]["documents"][0].to_dict()["classification"]["label"]
            == expected_predictions[idx])
```

#### __init__

```python
__init__(
    model: str,
    labels: list[str],
    multi_label: bool = False,
    classification_field: str | None = None,
    device: ComponentDevice | None = None,
    token: Secret | None = Secret.from_env_var(
        ["HF_API_TOKEN", "HF_TOKEN"], strict=False
    ),
    huggingface_pipeline_kwargs: dict[str, Any] | None = None,
)
```

Initializes the TransformersZeroShotDocumentClassifier.

See the Hugging Face [website](https://huggingface.co/models?pipeline_tag=zero-shot-classification&sort=downloads&search=nli)
for the full list of zero-shot classification (NLI) models.

**Parameters:**

- **model** (<code>str</code>) – The name or path of a Hugging Face model for zero-shot document classification.
- **labels** (<code>list\[str\]</code>) – The set of possible class labels to classify each document into, for example,
  `["positive", "negative"]`. The labels depend on the selected model.
- **multi_label** (<code>bool</code>) – Whether or not multiple candidate labels can be true.
  If `False`, the scores are normalized such that
  the sum of the label likelihoods for each sequence is 1. If `True`, the labels are considered
  independent and probabilities are normalized for each candidate by doing a softmax of the entailment
  score vs. the contradiction score.
- **classification_field** (<code>str | None</code>) – Name of the document's meta field to use for classification.
  If not set, `Document.content` is used by default.
- **device** (<code>ComponentDevice | None</code>) – The device on which the model is loaded. If `None`, the default device is automatically
  selected. If a device/device map is specified in `huggingface_pipeline_kwargs`, it overrides this parameter.
- **token** (<code>Secret | None</code>) – The Hugging Face token to use as HTTP bearer authorization.
  Check your HF token in your [account settings](https://huggingface.co/settings/tokens).
- **huggingface_pipeline_kwargs** (<code>dict\[str, Any\] | None</code>) – Dictionary containing keyword arguments used to initialize the
  Hugging Face pipeline for text classification.

#### warm_up

```python
warm_up()
```

Initializes the component.

#### to_dict

```python
to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns:**

- <code>dict\[str, Any\]</code> – Dictionary with serialized data.

#### from_dict

```python
from_dict(data: dict[str, Any]) -> TransformersZeroShotDocumentClassifier
```

Deserializes the component from a dictionary.

**Parameters:**

- **data** (<code>dict\[str, Any\]</code>) – Dictionary to deserialize from.

**Returns:**

- <code>TransformersZeroShotDocumentClassifier</code> – Deserialized component.

#### run

```python
run(documents: list[Document], batch_size: int = 1)
```

Classifies the documents based on the provided labels and adds the results to their metadata.

The classification results are stored in the `classification` dict within
each document's metadata. If `multi_label` is set to `True`, the scores for each label are available under
the `details` key within the `classification` dictionary.

**Parameters:**

- **documents** (<code>list\[Document\]</code>) – Documents to process.
- **batch_size** (<code>int</code>) – Batch size used for processing the content in each document.

**Returns:**

- A dictionary with the following key:
  - `documents`: A list of documents with an added metadata field called `classification`.
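The resulting metadata shape can be sketched as a plain dictionary. This is a hedged illustration of the layout described above with invented label names and scores, not actual model output:

```python
# Illustrative shape of doc.meta["classification"] after run() with multi_label=True;
# the labels and scores here are made up for the example.
classification = {
    "label": "positive",      # top predicted label
    "score": 0.91,            # score of the top label
    "details": {              # per-label scores, available when multi_label=True
        "positive": 0.91,
        "negative": 0.09,
    },
}

# The top-scoring entry in `details` matches the predicted label.
top_label = max(classification["details"], key=classification["details"].get)
assert top_label == classification["label"]
```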