---
title: "Unstructured"
id: integrations-unstructured
description: "Unstructured integration for Haystack"
slug: "/integrations-unstructured"
---

<a id="haystack_integrations.components.converters.unstructured.converter"></a>

## Module haystack\_integrations.components.converters.unstructured.converter

<a id="haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter"></a>

### UnstructuredFileConverter

A component for converting files to Haystack Documents using the Unstructured API (hosted or running locally).

For the supported file types and the specific API parameters, see
[Unstructured docs](https://docs.unstructured.io/api-reference/api-services/overview).

Usage example:
```python
from haystack_integrations.components.converters.unstructured import UnstructuredFileConverter

# Either set the UNSTRUCTURED_API_KEY environment variable (hosted API)
# or run the Unstructured API locally:
# docker run -p 8000:8000 -d --rm --name unstructured-api \
#     quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0

converter = UnstructuredFileConverter(
    # api_url="http://localhost:8000/general/v0/general"  # <-- Uncomment this if running Unstructured locally
)
# `paths` accepts both files and directories.
documents = converter.run(paths=["a/file/path.pdf", "a/directory/path"])["documents"]
```

<a id="haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter.__init__"></a>

#### UnstructuredFileConverter.\_\_init\_\_

```python
def __init__(api_url: str = UNSTRUCTURED_HOSTED_API_URL,
             api_key: Secret | None = Secret.from_env_var(
                 "UNSTRUCTURED_API_KEY", strict=False),
             document_creation_mode: Literal[
                 "one-doc-per-file", "one-doc-per-page",
                 "one-doc-per-element"] = "one-doc-per-file",
             separator: str = "\n\n",
             unstructured_kwargs: dict[str, Any] | None = None,
             progress_bar: bool = True)
```

**Arguments**:

- `api_url`: URL of the Unstructured API. Defaults to the URL of the hosted version.
If you run the API locally, specify the URL of your local API (e.g. `"http://localhost:8000/general/v0/general"`).
- `api_key`: API key for the Unstructured API.
It can be passed explicitly or read from the environment variable `UNSTRUCTURED_API_KEY` (recommended).
It is not needed if you run the API locally.
- `document_creation_mode`: How to create Haystack Documents from the elements returned by Unstructured.
    - `"one-doc-per-file"`: One Haystack Document per file. All elements are concatenated into one text field.
    - `"one-doc-per-page"`: One Haystack Document per page. All elements on a page are concatenated into one text field.
    - `"one-doc-per-element"`: One Haystack Document per element. Each element is converted to a Haystack Document.
- `separator`: Separator between elements when concatenating them into one text field. Defaults to `"\n\n"`.
- `unstructured_kwargs`: Additional parameters that are passed to the Unstructured API.
For the available parameters, see
[Unstructured API docs](https://docs.unstructured.io/api-reference/api-services/api-parameters).
- `progress_bar`: Whether to show a progress bar during the conversion.
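
The three `document_creation_mode` values differ only in how elements are grouped before their text is joined with `separator`. The following is a minimal, library-free sketch of that grouping, not the converter's actual implementation. The `group_texts` helper and the element dictionaries (with only `text` and `page_number` fields) are hypothetical stand-ins for what Unstructured returns.

```python
from collections import defaultdict

# Hypothetical elements; real Unstructured elements carry more fields.
elements = [
    {"text": "Title", "page_number": 1},
    {"text": "First paragraph.", "page_number": 1},
    {"text": "Second paragraph.", "page_number": 2},
]


def group_texts(elements, mode, separator="\n\n"):
    """Return the text of each would-be Document for the given mode."""
    if mode == "one-doc-per-file":
        # Everything in the file ends up in a single text field.
        return [separator.join(el["text"] for el in elements)]
    if mode == "one-doc-per-page":
        # Elements are bucketed by page, then joined within each page.
        pages = defaultdict(list)
        for el in elements:
            pages[el["page_number"]].append(el["text"])
        return [separator.join(texts) for texts in pages.values()]
    if mode == "one-doc-per-element":
        # No concatenation, so the separator is not used.
        return [el["text"] for el in elements]
    raise ValueError(f"Unknown document creation mode: {mode}")


per_file = group_texts(elements, "one-doc-per-file")
per_page = group_texts(elements, "one-doc-per-page")
per_element = group_texts(elements, "one-doc-per-element")
print(len(per_file), len(per_page), len(per_element))  # prints: 1 2 3
```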

<a id="haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter.to_dict"></a>

#### UnstructuredFileConverter.to\_dict

```python
def to_dict() -> dict[str, Any]
```

Serializes the component to a dictionary.

**Returns**:

Dictionary with serialized data.

<a id="haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter.from_dict"></a>

#### UnstructuredFileConverter.from\_dict

```python
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "UnstructuredFileConverter"
```

Deserializes the component from a dictionary.

**Arguments**:

- `data`: Dictionary to deserialize from.

**Returns**:

Deserialized component.

<a id="haystack_integrations.components.converters.unstructured.converter.UnstructuredFileConverter.run"></a>

#### UnstructuredFileConverter.run

```python
@component.output_types(documents=list[Document])
def run(
    paths: list[str] | list[os.PathLike],
    meta: dict[str, Any] | list[dict[str, Any]] | None = None
) -> dict[str, list[Document]]
```

Converts files to Haystack Documents using the Unstructured API.

**Arguments**:

- `paths`: List of paths to convert. Paths can be files or directories.
If a path is a directory, all files in the directory are converted. Subdirectories are ignored.
- `meta`: Optional metadata to attach to the Documents.
This value can be either a single dictionary or a list of dictionaries.
If it's a single dictionary, its content is added to the metadata of all produced Documents.
If it's a list, its length must match the number of paths, because the two lists are zipped.
If `paths` contains directories, `meta` can only be a single dictionary
(same metadata for all files).
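
The directory rule for `paths` (direct child files are converted, subdirectories are ignored) can be illustrated with the standard library alone. This is a sketch under assumptions rather than the converter's own code: `expand_paths` is a hypothetical helper, and sorting the directory entries is a choice made here to keep the output deterministic.

```python
import tempfile
from pathlib import Path


def expand_paths(paths):
    """Expand directories into the files they directly contain."""
    files = []
    for path in map(Path, paths):
        if path.is_dir():
            # Keep only direct children that are files; subdirectories are skipped.
            files.extend(sorted(p for p in path.iterdir() if p.is_file()))
        else:
            # Anything that is not a directory is passed through unchanged.
            files.append(path)
    return files


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_text("a")
    (root / "b.txt").write_text("b")
    (root / "nested").mkdir()
    (root / "nested" / "c.txt").write_text("c")

    expanded_names = [p.name for p in expand_paths([root])]

print(expanded_names)  # prints: ['a.txt', 'b.txt']
```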

**Raises**:

- `ValueError`: If `meta` is a list and `paths` contains directories.
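
The `meta` rules can be summarized in a short, library-free sketch. `resolve_meta` is a hypothetical helper written for illustration, not part of the integration. Raising `ValueError` on a length mismatch is an assumption here, since the reference only states that the lengths must match.

```python
import tempfile
from pathlib import Path


def resolve_meta(paths, meta=None):
    """Pair each input path with a metadata dictionary."""
    if meta is None:
        meta = {}
    if isinstance(meta, dict):
        # A single dictionary is shared by every path (and every file in a directory).
        return [(path, dict(meta)) for path in paths]
    if any(Path(path).is_dir() for path in paths):
        # A list cannot be zipped with paths once a directory expands to many files.
        raise ValueError("`meta` must be a single dictionary when `paths` contains directories.")
    if len(meta) != len(paths):
        # Assumption: a mismatch is reported as a ValueError.
        raise ValueError("The length of `meta` must match the number of paths.")
    return list(zip(paths, meta))


with tempfile.TemporaryDirectory() as tmp:
    per_path = resolve_meta(["a.pdf", "b.pdf"], [{"lang": "en"}, {"lang": "de"}])
    shared = resolve_meta([tmp, "a.pdf"], {"source": "archive"})
    try:
        resolve_meta([tmp], [{"source": "archive"}])
        list_with_directory_rejected = False
    except ValueError:
        list_with_directory_rejected = True
```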

**Returns**:

A dictionary with the following key:
- `documents`: List of Haystack Documents.