---
title: "PDFMinerToDocument"
id: pdfminertodocument
slug: "/pdfminertodocument"
description: "A component that converts complex PDF files to documents using pdfminer arguments."
---

# PDFMinerToDocument

A component that converts complex PDF files to documents using pdfminer arguments.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | Before [PreProcessors](../preprocessors.mdx) or right at the beginning of an indexing pipeline |
| **Mandatory run variables** | `sources`: PDF file paths or [`ByteStream`](../../concepts/data-classes.mdx#bytestream) objects |
| **Output variables** | `documents`: A list of documents |
| **API reference** | [Converters](/reference/converters-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/pdfminer.py |

</div>

## Overview

The `PDFMinerToDocument` component converts PDF files into documents with the [PDFMiner](https://pdfminersix.readthedocs.io/en/latest/) extraction tool and lets you pass PDFMiner's extraction arguments when you initialize it.

You can use it in an indexing pipeline to index the contents of a PDF file in a Document Store. It takes a list of file paths or [`ByteStream`](../../concepts/data-classes.mdx#bytestream) objects as input and outputs the converted result as a list of documents. Optionally, you can attach metadata to the documents through the `meta` input parameter.

When initializing the component, you can adjust several parameters to fit your PDF. See the full parameter list and descriptions in our [API reference](/reference/converters-api#pdfminertodocument).
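
As a minimal sketch of preparing the `sources` and `meta` inputs described above, the snippet below pairs each source path with its own metadata dictionary. It assumes that `meta` also accepts a list of dictionaries with one entry per source, so check the API reference for your version. The helper names `build_meta` and `convert_with_meta`, the metadata keys, and the example paths are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path


def build_meta(sources):
    """Return one metadata dict per source path (keys are illustrative)."""
    added = datetime.now(timezone.utc).isoformat()
    return [{"file_name": Path(s).name, "date_added": added} for s in sources]


def convert_with_meta(sources):
    """Run the converter with per-source metadata.

    Assumption: `meta` accepts a list with the same length as `sources`.
    """
    # Imported lazily so build_meta can be used without Haystack installed.
    from haystack.components.converters import PDFMinerToDocument

    converter = PDFMinerToDocument()
    result = converter.run(sources=sources, meta=build_meta(sources))
    return result["documents"]


# Build the metadata only; no PDF files are needed for this step.
meta = build_meta(["reports/q1.pdf", "reports/q2.pdf"])
print([m["file_name"] for m in meta])
```

Calling `convert_with_meta` with paths to real PDF files would then return the converted documents, each carrying the metadata built for its source.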

## Usage

First, install the `pdfminer.six` package to start using this converter:

```shell
pip install pdfminer.six
```

### On its own

```python
from datetime import datetime

from haystack.components.converters import PDFMinerToDocument

converter = PDFMinerToDocument()

# Convert the PDF and attach the same metadata to every resulting document.
results = converter.run(
    sources=["sample.pdf"],
    meta={"date_added": datetime.now().isoformat()},
)
documents = results["documents"]

print(documents[0].content)

# 'This is a text from the PDF file.'
```

### In a pipeline

```python
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import PDFMinerToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

# Indexing pipeline: convert -> clean -> split -> write to the Document Store.
pipeline = Pipeline()
pipeline.add_component("converter", PDFMinerToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=5),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

# Placeholder list; replace with the paths to your own PDF files.
file_names = ["sample.pdf"]

pipeline.run({"converter": {"sources": file_names}})
```
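
The `file_names` value passed to the pipeline is a list of PDF paths. One possible way to build it from a folder is sketched below using only the standard library. The helper `collect_pdfs` and the file names in the demonstration are hypothetical and not part of Haystack.

```python
import tempfile
from pathlib import Path


def collect_pdfs(folder):
    """Return sorted paths (as strings) of all PDF files under `folder`."""
    return sorted(
        str(p)
        for p in Path(folder).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.pdf").touch()
    (root / "nested" / "b.PDF").touch()
    (root / "notes.txt").touch()  # ignored: not a PDF

    file_names = collect_pdfs(root)
    found = [Path(name).name for name in file_names]

print(found)
```

With a real folder of PDFs, the resulting list can be passed to the pipeline as `{"converter": {"sources": file_names}}`.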