---
title: "TextCleaner"
id: textcleaner
slug: "/textcleaner"
description: "Use `TextCleaner` to make text data more readable. It removes regex matches, punctuation, and numbers, and converts text to lowercase. This is especially useful to clean up text data before evaluation."
---

# TextCleaner

Use `TextCleaner` to make text data more readable. It removes regex matches, punctuation, and numbers, and converts text to lowercase. This is especially useful to clean up text data before evaluation.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | Between a [Generator](../generators.mdx) and an [Evaluator](../evaluators.mdx) |
| **Mandatory run variables** | `texts`: A list of strings to be cleaned |
| **Output variables** | `texts`: A list of cleaned texts |
| **API reference** | [PreProcessors](/reference/preprocessors-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/text_cleaner.py |

</div>

## Overview

`TextCleaner` expects a list of strings as input and returns a list of cleaned strings. The selectable cleaning steps are `convert_to_lowercase`, `remove_punctuation`, and `remove_numbers`. These three parameters are booleans that you set when initializing the component.

- `convert_to_lowercase` converts all characters in texts to lowercase.
- `remove_punctuation` removes all punctuation from the text.
- `remove_numbers` removes all numerical digits from the text.

In addition, you can specify regular expressions with the parameter `remove_regexps`, and any matches are removed from the texts.

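To make the effect of each option concrete, here is a minimal standard-library sketch of the cleaning steps listed above. It is not Haystack's implementation: the function name `clean_texts` is hypothetical, its parameters only mirror the option names described in this section, and the order of the steps (regex removal, then lowercasing, then character removal) is an assumption.

```python
import re
import string
from typing import List, Optional


def clean_texts(
    texts: List[str],
    remove_regexps: Optional[List[str]] = None,
    convert_to_lowercase: bool = False,
    remove_punctuation: bool = False,
    remove_numbers: bool = False,
) -> List[str]:
    """Hypothetical stand-in that mimics the cleaning steps described above."""
    # Join all patterns into one alternation so a single pass removes every match.
    pattern = re.compile("|".join(remove_regexps)) if remove_regexps else None

    # Build one translation table that deletes the selected character classes.
    to_delete = ""
    if remove_punctuation:
        to_delete += string.punctuation  # ASCII punctuation only
    if remove_numbers:
        to_delete += string.digits
    table = str.maketrans("", "", to_delete)

    cleaned = []
    for text in texts:
        if pattern:
            text = pattern.sub("", text)
        if convert_to_lowercase:
            text = text.lower()
        cleaned.append(text.translate(table))
    return cleaned


example = clean_texts(
    ["Call 555-0100 NOW!!! <b>Offer</b> ends soon."],
    remove_regexps=[r"<[^>]+>"],  # strip HTML-like tags
    convert_to_lowercase=True,
    remove_punctuation=True,
    remove_numbers=True,
)
print(example)
```

In this sketch, removing characters does not collapse the whitespace around them, so a deleted number can leave two consecutive spaces behind.
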
## Usage

### On its own

You can use `TextCleaner` outside of a pipeline to clean up any texts:

```python
from haystack.components.preprocessors import TextCleaner

text_to_clean = (
    "1Moonlight shimmered softly, 300 Wolves howled nearby, Night enveloped everything."
)

cleaner = TextCleaner(
    convert_to_lowercase=True,
    remove_punctuation=False,
    remove_numbers=True,
)
result = cleaner.run(texts=[text_to_clean])
```

### In a pipeline

This example uses `TextCleaner` after an `ExtractiveReader` and an `OutputAdapter` to remove the punctuation in texts. Then, a custom-made `ExactMatchEvaluator` component compares the retrieved answers to the ground truth answer.

```python
from typing import List
from haystack import component, Document, Pipeline
from haystack.components.converters import OutputAdapter
from haystack.components.preprocessors import TextCleaner
from haystack.components.readers import ExtractiveReader
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()
documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(
        content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors.",
    ),
    Document(
        content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves.",
    ),
]
document_store.write_documents(documents=documents)


@component
class ExactMatchEvaluator:
    @component.output_types(score=int)
    def run(self, expected: str, provided: List[str]):
        # Score 1 if the expected answer appears verbatim among the provided answers.
        return {"score": int(expected in provided)}


# Turn the reader's answer objects into a plain list of answer strings.
adapter = OutputAdapter(
    template="{{answers | extract_data}}",
    output_type=List[str],
    custom_filters={
        "extract_data": lambda data: [answer.data for answer in data if answer.data],
    },
)

p = Pipeline()
p.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
p.add_component("reader", ExtractiveReader())
p.add_component("adapter", adapter)
p.add_component("cleaner", TextCleaner(remove_punctuation=True))
p.add_component("evaluator", ExactMatchEvaluator())

p.connect("retriever", "reader")
p.connect("reader", "adapter")
p.connect("adapter", "cleaner.texts")
p.connect("cleaner", "evaluator.provided")

question = "What behavior indicates a high level of self-awareness of elephants?"
ground_truth_answer = "recognizing themselves in mirrors"

result = p.run(
    {
        "retriever": {"query": question},
        "reader": {"query": question},
        "evaluator": {"expected": ground_truth_answer},
    },
)
print(result)
```
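
The pipeline above cleans only the reader's answers; the ground truth string reaches the evaluator unchanged. That works here because the ground truth contains no punctuation. If the ground truth might itself contain punctuation or capital letters, one option is to normalize both sides before comparing. The following is a minimal plain-Python sketch of that idea, not part of Haystack: the helper names `normalize` and `exact_match` are hypothetical, and the same logic could be placed inside a custom evaluator's `run` method.

```python
import string
from typing import List


def normalize(text: str) -> str:
    """Hypothetical helper: lowercase, drop ASCII punctuation, collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def exact_match(expected: str, provided: List[str]) -> int:
    """Return 1 if the normalized expected answer equals any normalized candidate."""
    target = normalize(expected)
    return int(any(normalize(candidate) == target for candidate in provided))


score = exact_match(
    "recognizing themselves in mirrors",
    ["Recognizing themselves in mirrors."],
)
print(score)
```

Applying the same normalization to both strings keeps the comparison symmetric, so a difference in casing or trailing punctuation alone does not turn a correct answer into a mismatch.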