---
title: "TextCleaner"
id: textcleaner
slug: "/textcleaner"
description: "Use `TextCleaner` to make text data more readable. It removes regex matches, punctuation, and numbers, and converts text to lowercase. This is especially useful for cleaning up text data before evaluation."
---

# TextCleaner

Use `TextCleaner` to make text data more readable. It removes regex matches, punctuation, and numbers, and converts text to lowercase. This is especially useful for cleaning up text data before evaluation.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | Between a [Generator](../generators.mdx) and an [Evaluator](../evaluators.mdx) |
| **Mandatory run variables** | `texts`: A list of strings to be cleaned |
| **Output variables** | `texts`: A list of cleaned texts |
| **API reference** | [PreProcessors](/reference/preprocessors-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/text_cleaner.py |

</div>

## Overview

`TextCleaner` expects a list of strings as input and returns a list of cleaned strings. It offers three optional cleaning steps: `convert_to_lowercase`, `remove_punctuation`, and `remove_numbers`. Each is a boolean parameter that you set when initializing the component.

- `convert_to_lowercase` converts all characters in the texts to lowercase.
- `remove_punctuation` removes all punctuation from the texts.
- `remove_numbers` removes all numerical digits from the texts.

In addition, you can pass one or more regular expressions with the `remove_regexps` parameter, and any matches are removed from the texts.

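To make the effect of each option concrete, here is a minimal standard-library sketch that mimics the cleaning steps described above. It is not the component's implementation: the `clean_texts` function is a hypothetical stand-in, and both the order in which the steps run and the use of `string.punctuation` and `string.digits` to define "punctuation" and "numbers" are assumptions made for illustration.

```python
import re
import string
from typing import List, Optional


def clean_texts(
    texts: List[str],
    remove_regexps: Optional[List[str]] = None,
    convert_to_lowercase: bool = False,
    remove_punctuation: bool = False,
    remove_numbers: bool = False,
) -> List[str]:
    """Hypothetical stand-in that mimics the cleaning steps described above."""
    # Combine all patterns into one alternation so a single pass removes every match.
    pattern = (
        re.compile("|".join(f"(?:{p})" for p in remove_regexps))
        if remove_regexps
        else None
    )
    # Translation tables that delete ASCII punctuation and digit characters.
    punctuation_table = str.maketrans("", "", string.punctuation)
    digit_table = str.maketrans("", "", string.digits)

    cleaned = []
    for text in texts:
        # Assumed order: regex removal, lowercasing, punctuation, digits.
        if pattern:
            text = pattern.sub("", text)
        if convert_to_lowercase:
            text = text.lower()
        if remove_punctuation:
            text = text.translate(punctuation_table)
        if remove_numbers:
            text = text.translate(digit_table)
        cleaned.append(text)
    return cleaned


sample = ["Order #42: Hello, World!"]
print(clean_texts(sample, convert_to_lowercase=True))  # ['order #42: hello, world!']
print(clean_texts(sample, remove_punctuation=True))  # ['Order 42 Hello World']
print(clean_texts(sample, remove_numbers=True))  # ['Order #: Hello, World!']
print(clean_texts(sample, remove_regexps=[r"Order #\d+: "]))  # ['Hello, World!']
```

Note that each step only deletes characters, so removing a token such as a number can leave extra whitespace behind.
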
## Usage

### On its own

You can use `TextCleaner` outside of a pipeline to clean up any texts:

```python
from haystack.components.preprocessors import TextCleaner

text_to_clean = (
    "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."
)

# Lowercase the text and strip digits, but keep punctuation.
cleaner = TextCleaner(
    convert_to_lowercase=True,
    remove_punctuation=False,
    remove_numbers=True,
)
result = cleaner.run(texts=[text_to_clean])
print(result["texts"])
```

### In a pipeline

In this example, `TextCleaner` runs after an `ExtractiveReader` and an `OutputAdapter` to remove punctuation from the extracted answers. A custom `ExactMatchEvaluator` component then compares the cleaned answers to the ground truth answer.

```python
from typing import List
from haystack import component, Document, Pipeline
from haystack.components.converters import OutputAdapter
from haystack.components.preprocessors import TextCleaner
from haystack.components.readers import ExtractiveReader
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

# Index a few sample documents in an in-memory store.
document_store = InMemoryDocumentStore()
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]
document_store.write_documents(documents=documents)


# Custom component: scores 1 if the expected answer is among the provided texts, else 0.
@component
class ExactMatchEvaluator:
    @component.output_types(score=int)
    def run(self, expected: str, provided: List[str]):
        return {"score": int(expected in provided)}


# Convert the reader's answers into a plain list of strings, skipping empty answers.
adapter = OutputAdapter(
    template="{{answers | extract_data}}",
    output_type=List[str],
    custom_filters={
        "extract_data": lambda data: [answer.data for answer in data if answer.data],
    },
)

p = Pipeline()
p.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
p.add_component("reader", ExtractiveReader())
p.add_component("adapter", adapter)
p.add_component("cleaner", TextCleaner(remove_punctuation=True))
p.add_component("evaluator", ExactMatchEvaluator())

# Wire the components: retriever -> reader -> adapter -> cleaner -> evaluator.
p.connect("retriever", "reader")
p.connect("reader", "adapter")
p.connect("adapter", "cleaner.texts")
p.connect("cleaner", "evaluator.provided")

question = "What behavior indicates a high level of self-awareness of elephants?"
ground_truth_answer = "recognizing themselves in mirrors"

result = p.run(
    {
        "retriever": {"query": question},
        "reader": {"query": question},
        "evaluator": {"expected": ground_truth_answer},
    },
)
print(result)
```
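
The following standard-library sketch illustrates why the cleaning step matters for an exact-match comparison like the one in `ExactMatchEvaluator`. The `exact_match` and `strip_punctuation` helpers and the sample answer with a trailing period are hypothetical, chosen only to show the effect; `strip_punctuation` approximates `remove_punctuation=True` using `string.punctuation`, which is an assumption about what counts as punctuation.

```python
import string
from typing import List


def exact_match(expected: str, provided: List[str]) -> int:
    # Same list-membership check as ExactMatchEvaluator.run in the pipeline example.
    return int(expected in provided)


def strip_punctuation(texts: List[str]) -> List[str]:
    # Hypothetical approximation of TextCleaner(remove_punctuation=True).
    table = str.maketrans("", "", string.punctuation)
    return [text.translate(table) for text in texts]


ground_truth = "recognizing themselves in mirrors"
# Hypothetical reader output that differs from the ground truth only by a period.
raw_answers = ["recognizing themselves in mirrors."]

print(exact_match(ground_truth, raw_answers))  # 0
print(exact_match(ground_truth, strip_punctuation(raw_answers)))  # 1
```

Because only the answers pass through the cleaner in the pipeline, the ground truth string needs to be free of punctuation as well for the comparison to succeed.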