---
title: "DOCXToDocument"
id: docxtodocument
slug: "/docxtodocument"
description: "Convert DOCX files to documents."
---

# DOCXToDocument

Convert DOCX files to documents.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | Before [PreProcessors](../preprocessors.mdx) or right at the beginning of an indexing pipeline |
| **Mandatory run variables** | `sources`: DOCX file paths or [`ByteStream`](../../concepts/data-classes.mdx#bytestream) objects |
| **Output variables** | `documents`: A list of documents |
| **API reference** | [Converters](/reference/converters-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/docx.py |

</div>

## Overview

The `DOCXToDocument` component converts DOCX files into documents. It takes a list of file paths or [`ByteStream`](../../concepts/data-classes.mdx#bytestream) objects as input and outputs the converted result as a list of documents. By setting the table format (CSV or Markdown), you can also use this component to extract the tables in your DOCX files. Optionally, you can attach metadata to the documents through the `meta` input parameter.

## Usage

First, install the `python-docx` package to start using this converter:

```shell
pip install python-docx
```

### On its own

```python
from datetime import datetime

from haystack.components.converters.docx import DOCXToDocument, DOCXTableFormat

converter = DOCXToDocument()
# Or define the table format
converter = DOCXToDocument(table_format=DOCXTableFormat.CSV)

results = converter.run(
    sources=["sample.docx"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]

print(documents[0].content)

# 'This is the text from the DOCX file.'
```

### In a pipeline

```python
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import DOCXToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", DOCXToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=5),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

# Placeholder input: replace with the paths to your own DOCX files
file_names = ["sample.docx"]

pipeline.run({"converter": {"sources": file_names}})
```
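
The `file_names` value passed to the pipeline is a list of DOCX file paths. As a minimal sketch of one way to build that list from a folder, the helper below uses only the standard library. `collect_docx_paths` is a hypothetical name for illustration and is not part of Haystack; it assumes you want to skip the `~$` lock files that Word creates next to open documents.

```python
from pathlib import Path
from typing import List, Union


def collect_docx_paths(directory: Union[str, Path], recursive: bool = False) -> List[Path]:
    """Hypothetical helper (not a Haystack API): list the DOCX files in a folder."""
    root = Path(directory)
    # rglob walks subfolders as well; glob only looks at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        path
        for path in candidates
        if path.is_file()
        and path.suffix.lower() == ".docx"
        # Word writes "~$name.docx" lock files while a document is open; skip them.
        and not path.name.startswith("~$")
    )
```

You could then set `file_names = collect_docx_paths("path/to/your/folder")` before calling `pipeline.run`, since `sources` accepts file paths. Sorting the result keeps the indexing order stable between runs.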