---
title: "HTMLToDocument"
id: htmltodocument
slug: "/htmltodocument"
description: "A component that converts HTML files to documents."
---

# HTMLToDocument

A component that converts HTML files to documents.

<div className="key-value-table">

|  |  |
| --- | --- |
| **Most common position in a pipeline** | Before [PreProcessors](../preprocessors.mdx), or right at the beginning of an indexing pipeline |
| **Mandatory run variables** | `sources`: A list of HTML file paths or [`ByteStream`](../../concepts/data-classes.mdx#bytestream) objects |
| **Output variables** | `documents`: A list of documents |
| **API reference** | [Converters](/reference/converters-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/html.py |

</div>

## Overview

The `HTMLToDocument` component converts HTML files into documents. It can be used in an indexing pipeline to index the contents of an HTML file into a Document Store or even in a querying pipeline after the [`LinkContentFetcher`](../fetchers/linkcontentfetcher.mdx). The `HTMLToDocument` component takes a list of HTML file paths or [`ByteStream`](../../concepts/data-classes.mdx#bytestream) objects as input and converts the files to a list of documents. Optionally, you can attach metadata to the documents through the `meta` input parameter.

When you initialize the component, you can optionally set `extraction_kwargs`, a dictionary containing keyword arguments to customize the extraction process. These are passed to the underlying Trafilatura `extract` function. For the full list of available arguments, see the [Trafilatura documentation](https://trafilatura.readthedocs.io/en/latest/corefunctions.html#extract).
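
The `meta` input and the `extraction_kwargs` setting described above can be combined roughly as in the following sketch. The helper names `collect_html_sources` and `convert_folder`, the `file_name` metadata key, and the choice of `include_tables=False` are assumptions made for illustration and are not part of the Haystack API; `include_tables` is one of the arguments accepted by Trafilatura's `extract` function.

```python
from pathlib import Path


def collect_html_sources(folder):
    """Return (paths, meta) for every .html file directly inside `folder`.

    Hypothetical helper: builds one metadata dict per file, in the same
    order as the paths, so the two lists can be passed together.
    """
    paths = sorted(Path(folder).glob("*.html"))
    meta = [{"file_name": path.name} for path in paths]
    return paths, meta


def convert_folder(folder):
    """Convert every HTML file in `folder` into Haystack documents."""
    # Imported here so the helper above stays usable without Haystack installed.
    from haystack.components.converters import HTMLToDocument

    # Assumption for illustration: ask Trafilatura to skip table content.
    converter = HTMLToDocument(extraction_kwargs={"include_tables": False})

    paths, meta = collect_html_sources(folder)
    # `meta` is a list with one dict per source, aligned by position.
    result = converter.run(sources=paths, meta=meta)
    return result["documents"]
```

With this sketch, calling `convert_folder("pages")` on a folder of saved pages should return a list of documents whose `meta` includes the `file_name` entry supplied for the matching source.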

## Usage

### On its own

```python
from pathlib import Path
from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()

# run() returns a dictionary; the converted documents are under "documents"
result = converter.run(sources=[Path("saved_page.html")])
docs = result["documents"]
```

### In a pipeline

Here's an example of an indexing pipeline that writes the contents of an HTML file into an `InMemoryDocumentStore`:

```python
from pathlib import Path

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import HTMLToDocument
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter

document_store = InMemoryDocumentStore()

pipeline = Pipeline()
pipeline.add_component("converter", HTMLToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="sentence", split_length=5),
)
pipeline.add_component("writer", DocumentWriter(document_store=document_store))
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "writer")

# Replace with the HTML files you want to index
file_names = [Path("saved_page.html")]

pipeline.run({"converter": {"sources": file_names}})
```