---
title: "DOCXToDocument"
id: docxtodocument
slug: "/docxtodocument"
description: "Convert DOCX files to documents."
---

# DOCXToDocument

Convert DOCX files to documents.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | Before [PreProcessors](../preprocessors.mdx) or right at the beginning of an indexing pipeline |
| **Mandatory run variables** | `sources`: DOCX file paths or [`ByteStream`](../../concepts/data-classes.mdx#bytestream) objects |
| **Output variables** | `documents`: A list of documents |
| **API reference** | [Converters](/reference/converters-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/docx.py |

</div>

## Overview

The `DOCXToDocument` component converts DOCX files into documents. It takes a list of file paths or [`ByteStream`](../../concepts/data-classes.mdx#bytestream) objects as input and returns the converted result as a list of documents. The component can also extract the tables in your DOCX files, using the table format you define (CSV or Markdown). Optionally, you can attach metadata to the documents through the `meta` input parameter.

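As an illustration of the `meta` input, the following minimal sketch attaches a separate metadata dictionary to each source. It assumes that `meta` accepts a list with one dictionary per source, in the same order as `sources`. The helper names `build_meta` and `convert_with_meta` and the `file_name` key are hypothetical and only serve this example.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Any


def build_meta(sources: list[str]) -> list[dict[str, Any]]:
    """Build one metadata dict per source, in the same order as `sources`."""
    added = datetime.now(timezone.utc).isoformat()
    return [
        {"file_name": Path(source).name, "date_added": added}
        for source in sources
    ]


def convert_with_meta(sources: list[str]):
    """Convert DOCX files and attach a separate metadata dict to each one."""
    # Imported here so that build_meta stays usable without Haystack installed.
    from haystack.components.converters import DOCXToDocument

    converter = DOCXToDocument()
    # Assumption: `meta` may be a list whose length matches `sources`.
    result = converter.run(sources=sources, meta=build_meta(sources))
    return result["documents"]
```

Passing a single dictionary instead applies the same metadata to every converted document.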
## Usage

First, install the `python-docx` package to start using this converter:

```shell
pip install python-docx
```

### On its own

```python
from datetime import datetime

from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat

converter = DOCXToDocument()
# or define the table format
converter = DOCXToDocument(table_format=DOCXTableFormat.CSV)

# Convert a single file and attach the same metadata to every document
results = converter.run(
    sources=["sample.docx"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]

print(documents[0].content)

# 'This is the text from the DOCX file.'
```

### In a pipeline

```python
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import DOCXToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

# Indexing pipeline: convert -> clean -> split -> write to the document store
pipeline = Pipeline()
pipeline.add_component("converter", DOCXToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=5),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

# Placeholder list of DOCX paths; replace it with your own files
file_names = ["sample.docx"]

pipeline.run({"converter": {"sources": file_names}})
```
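
The pipeline above expects a list of DOCX paths in `file_names`. One possible way to build that list from a directory is sketched below using only the standard library. The helper name `collect_docx_files` is hypothetical, and skipping files that start with `~$` (temporary lock files that Word creates next to open documents) is an assumption about what you want to index.

```python
from pathlib import Path


def collect_docx_files(root: str) -> list[str]:
    """Return the sorted paths of all .docx files under `root`, searched recursively."""
    return sorted(
        str(path)
        for path in Path(root).rglob("*.docx")
        # Skip directories and Word lock files such as "~$report.docx".
        if path.is_file() and not path.name.startswith("~$")
    )
```

The returned list can then be passed as the `sources` input of the converter, for example `pipeline.run({"converter": {"sources": collect_docx_files("data")}})`, where `data` is a placeholder directory name.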