unstructured.md
---
title: "Unstructured"
id: integrations-unstructured
description: "Unstructured integration for Haystack"
slug: "/integrations-unstructured"
---

<a id="haystack_integrations.components.converters.unstructured.converter"></a>

## Module haystack\_integrations.components.converters.unstructured.converter

<a id="haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter"></a>

### UnstructuredFileConverter

A component for converting files to Haystack Documents using the Unstructured API (hosted or running locally).

For the supported file types and the specific API parameters, see
[Unstructured docs](https://docs.unstructured.io/api-reference/api-services/overview).

Usage example:
```python
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

# make sure to either set the environment variable UNSTRUCTURED_API_KEY
# or run the Unstructured API locally:
# docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest
# --port 8000 --host 0.0.0.0

converter = UnstructuredFileConverter(
    # api_url="http://localhost:8000/general/v0/general"  # <-- Uncomment this if running Unstructured locally
)
documents = converter.run(paths=["a/file/path.pdf", "a/directory/path"])["documents"]
```

<a id="haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter.__init__"></a>

#### UnstructuredFileConverter.\_\_init\_\_

```python
def __init__(api_url: str = UNSTRUCTURED_HOSTED_API_URL,
             api_key: Secret | None = Secret.from_env_var(
                 "UNSTRUCTURED_API_KEY", strict=False),
             document_creation_mode: Literal[
                 "one-doc-per-file", "one-doc-per-page",
                 "one-doc-per-element"] = "one-doc-per-file",
             separator: str = "\n\n",
             unstructured_kwargs: dict[str, Any] | None = None,
             progress_bar: bool = True)
```

**Arguments**:

- `api_url`: URL of the Unstructured API. Defaults to the URL of the hosted version.
If you run the API locally, specify the URL of your local API (e.g. `"http://localhost:8000/general/v0/general"`).
- `api_key`: API key for the Unstructured API.
It can be passed explicitly or read from the environment variable `UNSTRUCTURED_API_KEY` (recommended).
If you run the API locally, it is not needed.
- `document_creation_mode`: How to create Haystack Documents from the elements returned by Unstructured.
`"one-doc-per-file"`: One Haystack Document per file. All elements are concatenated into one text field.
`"one-doc-per-page"`: One Haystack Document per page.
All elements on a page are concatenated into one text field.
`"one-doc-per-element"`: One Haystack Document per element. Each element is converted to a Haystack Document.
- `separator`: Separator between elements when concatenating them into one text field.
- `unstructured_kwargs`: Additional parameters that are passed to the Unstructured API.
For the available parameters, see
[Unstructured API docs](https://docs.unstructured.io/api-reference/api-services/api-parameters).
- `progress_bar`: Whether to show a progress bar during the conversion.

<a id="haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter.to_dict"></a>

#### UnstructuredFileConverter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter.from_dict"></a>

#### UnstructuredFileConverter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "UnstructuredFileConverter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter.run"></a>

#### UnstructuredFileConverter.run

```python
@component.output_types(documents=list[Document])
def run(
    paths: list[str] | list[os.PathLike],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[Document]]
```

Convert files to Haystack Documents using the Unstructured API.

**Arguments**:

- `paths`: List of paths to convert. Paths can be files or directories.
If a path is a directory, all files in the directory are converted. Subdirectories are ignored.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a list of dictionaries or a single dictionary.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, the length of the list must match the number of paths, because the two lists will be zipped.
Please note that if the paths contain directories, `meta` can only be a single dictionary
(same metadata for all files).

**Raises**:

- `ValueError`: If `meta` is a list and `paths` contains directories.

**Returns**:

A dictionary with the following key:
- `documents`: List of Haystack Documents.
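
The `paths` and `meta` rules for `run` can be illustrated with a small standalone sketch. It does not call the Unstructured API and is not the converter's actual implementation: `pair_paths_with_meta` is a hypothetical helper, written only to show how a single dictionary, a list of dictionaries, and directory paths interact as described above. Raising `ValueError` on a length mismatch is an assumption of this sketch; the reference above only documents the directory case.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union


def pair_paths_with_meta(
    paths: List[str],
    meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
) -> List[Tuple[Path, Dict[str, Any]]]:
    """Hypothetical helper (not part of the integration) mirroring the documented rules."""
    has_directory = any(Path(p).is_dir() for p in paths)

    if isinstance(meta, list):
        # A list of metadata cannot be matched to files discovered inside directories.
        if has_directory:
            raise ValueError("`meta` must be a single dictionary when `paths` contains directories.")
        # Assumption of this sketch: a length mismatch is also reported as a ValueError.
        if len(meta) != len(paths):
            raise ValueError("The `meta` list must have the same length as `paths`.")
        # The two lists are zipped: the i-th dictionary goes with the i-th path.
        return [(Path(p), dict(m)) for p, m in zip(paths, meta)]

    # A single dictionary (or None) is shared by every file.
    shared = dict(meta or {})
    pairs: List[Tuple[Path, Dict[str, Any]]] = []
    for p in paths:
        path = Path(p)
        if path.is_dir():
            # Only the directory's direct files are used; subdirectories are ignored.
            files = sorted(child for child in path.iterdir() if child.is_file())
        else:
            files = [path]
        pairs.extend((f, dict(shared)) for f in files)
    return pairs
```

With two file paths and a two-item `meta` list, each file receives its own dictionary. With a directory path, only a single dictionary (or `None`) is accepted, and it is copied to every file found directly inside that directory.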