---
title: "MarkItDownConverter"
id: markitdownconverter
slug: "/markitdownconverter"
description: "A component that converts files (PDF, Word, PowerPoint, Excel, HTML, images, and more) to Documents using Microsoft's MarkItDown library."
---

# MarkItDownConverter

A component that converts files to Documents using Microsoft's MarkItDown library.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | Before [PreProcessors](../preprocessors.mdx) or right at the beginning of an indexing pipeline |
| **Mandatory run variables** | `sources`: File paths or [`ByteStream`](../../concepts/data-classes.mdx#bytestream) objects |
| **Output variables** | `documents`: A list of documents |
| **API reference** | [MarkItDown](/reference/integrations-markitdown) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/markitdown |

</div>

## Overview

`MarkItDownConverter` converts files into Haystack Documents using Microsoft's [MarkItDown](https://github.com/microsoft/markitdown) library. MarkItDown converts many file formats to Markdown, including PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx), HTML, and more. All processing is performed locally without relying on external APIs.

The converter accepts file paths or [ByteStream](../../concepts/data-classes.mdx#bytestream) objects as input and outputs the converted result as a list of Documents. You can attach metadata to the Documents through the `meta` input parameter.

:::note
This component returns Markdown content. Avoid piping it through `DocumentCleaner()` with its default settings because `remove_extra_whitespaces=True` and `remove_empty_lines=True` can collapse line breaks and flatten headings, tables, lists, and image tags. Connect the converter directly to your next component, or disable those options if you need custom cleanup.
:::

## Usage

Install the MarkItDown integration:

```shell
pip install markitdown-haystack
```

### On its own

```python
from haystack_integrations.components.converters.markitdown import MarkItDownConverter

converter = MarkItDownConverter()
result = converter.run(sources=["document.pdf", "report.docx"])
documents = result["documents"]
```

### In a pipeline

```python
from haystack import Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.markitdown import MarkItDownConverter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", MarkItDownConverter())
pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=5),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "splitter")
pipeline.connect("splitter", "writer")

pipeline.run({"converter": {"sources": ["document.pdf", "report.docx"]}})
```
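
The pipeline above connects the converter straight to the splitter, which follows the Overview note's advice about not cleaning Markdown output with default settings. As a rough illustration of why, the sketch below uses only the Python standard library to show what happens when runs of whitespace in Markdown are collapsed. The `collapse_whitespace` helper and its regular expression are hypothetical stand-ins for an aggressive cleanup step, not Haystack's actual `DocumentCleaner` implementation.

```python
import re

# A small Markdown sample like the content the converter might return.
markdown = "# Title\n\nIntro paragraph.\n\n- item one\n- item two\n"


def collapse_whitespace(text: str) -> str:
    """Hypothetical stand-in for aggressive cleanup (not Haystack code):
    replace every run of two or more whitespace characters with one space."""
    return re.sub(r"\s{2,}", " ", text).strip()


flattened = collapse_whitespace(markdown)

# The blank lines that separated the heading, the paragraph, and the list
# are gone, so all three now share a single line.
print(flattened.splitlines()[0])
```

Once the blank lines disappear, a Markdown parser or a structure-aware splitter can no longer tell where the heading ends and the body begins, which is why the cleanup options are better left disabled for this converter's output.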