---
title: "MarkItDownConverter"
id: markitdownconverter
slug: "/markitdownconverter"
description: "A component that converts files (PDF, Word, PowerPoint, Excel, HTML, images, and more) to Documents using Microsoft's MarkItDown library."
---

# MarkItDownConverter

A component that converts files to Documents using Microsoft's MarkItDown library.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | Before [PreProcessors](../preprocessors.mdx) or right at the beginning of an indexing pipeline |
| **Mandatory run variables**            | `sources`: File paths or [`ByteStream`](../../concepts/data-classes.mdx#bytestream) objects |
| **Output variables**                   | `documents`: A list of documents |
| **API reference**                      | [MarkItDown](/reference/integrations-markitdown) |
| **GitHub link**                        | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/markitdown |

</div>

## Overview

`MarkItDownConverter` converts files into Haystack Documents using Microsoft's [MarkItDown](https://github.com/microsoft/markitdown) library. MarkItDown converts many file formats to Markdown, including PDF, Word (.docx), PowerPoint (.pptx), Excel (.xlsx), HTML, and more. All processing is performed locally without relying on external APIs.

The converter accepts file paths or [ByteStream](../../concepts/data-classes.mdx#bytestream) objects as input and outputs the converted result as a list of Documents. You can attach metadata to the Documents through the `meta` input parameter.

:::note
This component returns Markdown content. Avoid piping it through `DocumentCleaner()` with its default settings because `remove_extra_whitespaces=True` and `remove_empty_lines=True` can collapse line breaks and flatten headings, tables, lists, and image tags. Connect the converter directly to your next component, or disable those options if you need custom cleanup.
:::

## Usage

Install the MarkItDown integration:

```shell
pip install markitdown-haystack
```

### On its own

```python
from haystack_integrations.components.converters.markitdown import MarkItDownConverter

converter = MarkItDownConverter()
result = converter.run(sources=["document.pdf", "report.docx"])
documents = result["documents"]
```

### In a pipeline

```python
from haystack import Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.markitdown import MarkItDownConverter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", MarkItDownConverter())
pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=5),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "splitter")
pipeline.connect("splitter", "writer")

pipeline.run({"converter": {"sources": ["document.pdf", "report.docx"]}})
```