---
title: Classifiers
id: classifiers-api
description: Classify documents based on the provided labels.
slug: "/classifiers-api"
---

<a id="document_language_classifier"></a>

# Module document\_language\_classifier

<a id="document_language_classifier.DocumentLanguageClassifier"></a>

## DocumentLanguageClassifier

Classifies the language of each document and adds it to its metadata.

Provide a list of languages during initialization. If the document's text doesn't match any of the
specified languages, the metadata value is set to "unmatched".
To route documents based on their language, use the MetadataRouter component after DocumentLanguageClassifier.
For routing plain text, use the TextLanguageRouter component instead.

### Usage example

```python
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.classifiers import DocumentLanguageClassifier
from haystack.components.routers import MetadataRouter
from haystack.components.writers import DocumentWriter

docs = [Document(id="1", content="This is an English document"),
        Document(id="2", content="Este es un documento en español")]

document_store = InMemoryDocumentStore()

p = Pipeline()
p.add_component(instance=DocumentLanguageClassifier(languages=["en"]), name="language_classifier")
p.add_component(instance=MetadataRouter(rules={"en": {"language": {"$eq": "en"}}}), name="router")
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")
p.connect("language_classifier.documents", "router.documents")
p.connect("router.en", "writer.documents")

p.run({"language_classifier": {"documents": docs}})

written_docs = document_store.filter_documents()
assert len(written_docs) == 1
assert written_docs[0] == Document(id="1", content="This is an English document", meta={"language": "en"})
```
<a id="document_language_classifier.DocumentLanguageClassifier.__init__"></a>

#### DocumentLanguageClassifier.\_\_init\_\_

```python
def __init__(languages: Optional[list[str]] = None)
```

Initializes the DocumentLanguageClassifier component.

**Arguments**:

- `languages`: A list of ISO language codes.
See the supported languages in [`langdetect` documentation](https://github.com/Mimino666/langdetect#languages).
If not specified, defaults to ["en"].

<a id="document_language_classifier.DocumentLanguageClassifier.run"></a>

#### DocumentLanguageClassifier.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document])
```

Classifies the language of each document and adds it to its metadata.

If the document's text doesn't match any of the languages specified at initialization,
sets the metadata value to "unmatched".

**Arguments**:

- `documents`: A list of documents for language classification.

**Raises**:

- `TypeError`: if the input is not a list of Documents.

**Returns**:

A dictionary with the following key:
- `documents`: A list of documents with an added `language` metadata field.

<a id="zero_shot_document_classifier"></a>

# Module zero\_shot\_document\_classifier

<a id="zero_shot_document_classifier.TransformersZeroShotDocumentClassifier"></a>

## TransformersZeroShotDocumentClassifier

Performs zero-shot classification of documents based on given labels and adds the predicted label to their metadata.

The component uses a Hugging Face pipeline for zero-shot classification.
Provide the model and the set of labels to be used for categorization during initialization.
Additionally, you can configure the component to allow multiple labels to be true.

Classification is run on the document's content field by default.
If you want it to run on another field, set the
`classification_field` to one of the document's metadata fields.

Available models for the task of zero-shot-classification include:
- `valhalla/distilbart-mnli-12-3`
- `cross-encoder/nli-distilroberta-base`
- `cross-encoder/nli-deberta-v3-xsmall`

### Usage example

The following is a pipeline that classifies documents based on predefined classification labels
retrieved from a search pipeline:

```python
from haystack import Document
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.core.pipeline import Pipeline
from haystack.components.classifiers import TransformersZeroShotDocumentClassifier

documents = [Document(id="0", content="Today was a nice day!"),
             Document(id="1", content="Yesterday was a bad day!")]

document_store = InMemoryDocumentStore()
retriever = InMemoryBM25Retriever(document_store=document_store)
document_classifier = TransformersZeroShotDocumentClassifier(
    model="cross-encoder/nli-deberta-v3-xsmall",
    labels=["positive", "negative"],
)

document_store.write_documents(documents)

pipeline = Pipeline()
pipeline.add_component(instance=retriever, name="retriever")
pipeline.add_component(instance=document_classifier, name="document_classifier")
pipeline.connect("retriever", "document_classifier")

queries = ["How was your day today?", "How was your day yesterday?"]
expected_predictions = ["positive", "negative"]

for idx, query in enumerate(queries):
    result = pipeline.run({"retriever": {"query": query, "top_k": 1}})
    assert result["document_classifier"]["documents"][0].to_dict()["id"] == str(idx)
    assert (result["document_classifier"]["documents"][0].to_dict()["classification"]["label"]
            == expected_predictions[idx])
```

<a id="zero_shot_document_classifier.TransformersZeroShotDocumentClassifier.__init__"></a>

#### TransformersZeroShotDocumentClassifier.\_\_init\_\_

```python
def __init__(model: str,
             labels: list[str],
             multi_label: bool = False,
             classification_field: Optional[str] = None,
             device: Optional[ComponentDevice] = None,
             token: Optional[Secret] = Secret.from_env_var(
                 ["HF_API_TOKEN", "HF_TOKEN"], strict=False),
             huggingface_pipeline_kwargs: Optional[dict[str, Any]] = None)
```

Initializes the TransformersZeroShotDocumentClassifier.

See the Hugging Face [website](https://huggingface.co/models?pipeline_tag=zero-shot-classification&sort=downloads&search=nli)
for the full list of zero-shot classification (NLI) models.

**Arguments**:

- `model`: The name or path of a Hugging Face model for zero-shot document classification.
- `labels`: The set of possible class labels to classify each document into, for example,
["positive", "negative"]. The labels depend on the selected model.
- `multi_label`: Whether multiple candidate labels can be true.
If `False`, the scores are normalized such that
the sum of the label likelihoods for each sequence is 1. If `True`, the labels are considered
independent and probabilities are normalized for each candidate by doing a softmax of the entailment
score vs. the contradiction score.
- `classification_field`: Name of the document's meta field to be used for classification.
If not set, `Document.content` is used by default.
- `device`: The device on which the model is loaded. If `None`, the default device is automatically
selected. If a device/device map is specified in `huggingface_pipeline_kwargs`, it overrides this parameter.
- `token`: The Hugging Face token to use as HTTP bearer authorization.
Check your HF token in your [account settings](https://huggingface.co/settings/tokens).
- `huggingface_pipeline_kwargs`: Dictionary containing keyword arguments used to initialize the
Hugging Face pipeline for text classification.

<a id="zero_shot_document_classifier.TransformersZeroShotDocumentClassifier.warm_up"></a>

#### TransformersZeroShotDocumentClassifier.warm\_up

```python
def warm_up()
```

Initializes the component.

<a id="zero_shot_document_classifier.TransformersZeroShotDocumentClassifier.to_dict"></a>

#### TransformersZeroShotDocumentClassifier.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="zero_shot_document_classifier.TransformersZeroShotDocumentClassifier.from_dict"></a>

#### TransformersZeroShotDocumentClassifier.from\_dict

```python
@classmethod
def from_dict(
        cls, data: dict[str, Any]) -> "TransformersZeroShotDocumentClassifier"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="zero_shot_document_classifier.TransformersZeroShotDocumentClassifier.run"></a>

#### TransformersZeroShotDocumentClassifier.run

```python
@component.output_types(documents=list[Document])
def run(documents: list[Document], batch_size: int = 1)
```

Classifies the documents based on the provided labels and adds the predicted labels to their metadata.

The classification results are stored in the `classification` dict within
each document's metadata. If `multi_label` is set to `True`, the scores for each label are available under
the `details` key within the `classification` dictionary.

**Arguments**:

- `documents`: Documents to process.
- `batch_size`: Batch size used for processing the content in each document.

**Returns**:

A dictionary with the following key:
- `documents`: A list of documents with an added metadata field called `classification`.