---
title: "CSVDocumentCleaner"
id: csvdocumentcleaner
slug: "/csvdocumentcleaner"
description: "Use `CSVDocumentCleaner` to clean CSV documents by removing empty rows and columns while preserving specific ignored rows and columns. It processes CSV content stored in documents and helps standardize data for further analysis."
---

# CSVDocumentCleaner

Use `CSVDocumentCleaner` to clean CSV documents by removing empty rows and columns while preserving specific ignored rows and columns. It processes CSV content stored in documents and helps standardize data for further analysis.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | In indexing pipelines after [Converters](../converters.mdx), before [Embedders](../embedders.mdx) or [Writers](../writers/documentwriter.mdx) |
| **Mandatory run variables**            | `documents`: A list of documents containing CSV content                                                                  |
| **Output variables**                   | `documents`: A list of cleaned CSV documents                                                                             |
| **API reference**                      | [PreProcessors](/reference/preprocessors-api)                                                                                   |
| **GitHub link**                        | https://github.com/deepset-ai/haystack/blob/main/haystack/components/preprocessors/csv_document_cleaner.py             |

</div>

## Overview

`CSVDocumentCleaner` expects a list of `Document` objects as input, each containing CSV-formatted content as text. It removes fully empty rows and columns from that content. You can also specify a number of leading rows and columns to preserve, which the cleaner sets aside before it checks for empty rows and columns.

### Parameters

- `ignore_rows`: Number of rows to ignore from the top of the CSV table before processing. If any columns are removed, the same columns are also dropped from the ignored rows.
- `ignore_columns`: Number of columns to ignore from the left of the CSV table before processing. If any rows are removed, the same rows are also dropped from the ignored columns.
- `remove_empty_rows`: Whether to remove entirely empty rows.
- `remove_empty_columns`: Whether to remove entirely empty columns.
- `keep_id`: Whether to retain the original document ID in the output document.

### Cleaning Process

The `CSVDocumentCleaner` algorithm follows these steps:

1. Reads each document's content as a CSV table using pandas.
2. Sets aside the specified number of `ignore_rows` from the top and `ignore_columns` from the left.
3. Drops any remaining rows and columns that are entirely empty (contain only NaN values).
4. Removes any dropped columns from the ignored rows as well.
5. Removes any dropped rows from the ignored columns as well.
6. Reattaches the remaining ignored rows and columns in their original positions.
7. Returns the cleaned CSV content as a new `Document` object.
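
The steps above can be sketched in plain Python. This is only an illustration, not Haystack's implementation: it uses the standard `csv` module instead of pandas, `clean_csv` and `is_empty` are hypothetical names, and it assumes that a cell counts as empty when it is the empty string and that input whose ignored region is larger than the table is returned unchanged.

```python
import csv
import io


def clean_csv(text, ignore_rows=0, ignore_columns=0,
              remove_empty_rows=True, remove_empty_columns=True):
    """Illustrative stand-in for the cleaning steps (not Haystack's code)."""
    # Step 1: parse the CSV text into a rectangular list of rows.
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return text
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]

    # Assumption of this sketch: leave the input untouched when the
    # ignored region is larger than the table itself.
    if ignore_rows > len(rows) or ignore_columns > width:
        return text

    def is_empty(cells):
        return all(cell == "" for cell in cells)

    # Steps 2-3: ignored rows/columns are always kept; every other row or
    # column is judged only on its cells outside the ignored region.
    keep_rows = [
        i for i, row in enumerate(rows)
        if i < ignore_rows
        or not remove_empty_rows
        or not is_empty(row[ignore_columns:])
    ]
    keep_cols = [
        j for j in range(width)
        if j < ignore_columns
        or not remove_empty_columns
        or not is_empty(row[j] for row in rows[ignore_rows:])
    ]

    # Steps 4-6: selecting by index drops a removed column from the ignored
    # rows too (and a removed row from the ignored columns), while
    # everything that survives keeps its original position.
    cleaned = [[rows[i][j] for j in keep_cols] for i in keep_rows]

    # Step 7: serialize the cleaned table back to CSV text.
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(cleaned)
    return out.getvalue()


# The "notes" column has a header but no data, so with ignore_rows=1 it is
# dropped together with its header cell; the blank middle row is dropped too.
raw = "id,name,notes\n1,Ada,\n,,\n2,Bob,\n"
print(clean_csv(raw, ignore_rows=1))
# id,name
# 1,Ada
# 2,Bob
```

Judging emptiness only outside the ignored region is what lets a header row survive even when it sits above a blank line, and it is also why a header-only column is removed in full.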

## Usage

### On its own

You can use `CSVDocumentCleaner` independently to clean up CSV documents:

```python
from haystack import Document
from haystack.components.preprocessors import CSVDocumentCleaner

# Keep the header row out of the emptiness check.
cleaner = CSVDocumentCleaner(ignore_rows=1, ignore_columns=0)

documents = [Document(content="col1,col2,col3\n,,\na,b,c\n,,")]

# run() returns a dictionary; the cleaned documents are under "documents".
result = cleaner.run(documents=documents)
cleaned_docs = result["documents"]
```

### In a pipeline

```python
from pathlib import Path
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import XLSXToDocument
from haystack.components.preprocessors import CSVDocumentCleaner
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()
p = Pipeline()
p.add_component(instance=XLSXToDocument(), name="xlsx_file_converter")
p.add_component(
    instance=CSVDocumentCleaner(ignore_rows=1, ignore_columns=1),
    name="csv_cleaner",
)
p.add_component(instance=DocumentWriter(document_store=document_store), name="writer")

# Converter output feeds the cleaner; cleaner output feeds the writer.
p.connect("xlsx_file_converter.documents", "csv_cleaner.documents")
p.connect("csv_cleaner.documents", "writer.documents")

p.run({"xlsx_file_converter": {"sources": [Path("your_xlsx_file.xlsx")]}})
```

In this pipeline, the cleaner sits between the converter and the writer, so the converted documents are cleaned before they are written to the document store.