---
title: "FileToFileContent"
id: filetofilecontent
slug: "/filetofilecontent"
description: "`FileToFileContent` reads local files and converts them into `FileContent` objects"
---

# FileToFileContent

`FileToFileContent` reads local files and converts them into `FileContent` objects. These are ready for multimodal AI pipelines that need to pass PDFs and other file types to an LLM.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | Before a `ChatPromptBuilder` in a query pipeline |
| **Mandatory run variables** | `sources`: A list of file paths or `ByteStream` objects |
| **Output variables** | `file_contents`: A list of `FileContent` objects |
| **API reference** | [Converters](/reference/converters-api) |
| **GitHub link** | https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/file_to_file_content.py |

</div>

## Overview

`FileToFileContent` processes a list of file sources and converts them into `FileContent` objects that can be embedded into a `ChatMessage` and passed to a Language Model.

Each source can be:

- A file path (string or `Path`), or
- A `ByteStream` object.

Optionally, you can provide extra provider-specific information using the `extra` parameter. This can be a single dictionary (applied to all files) or a list matching the length of `sources`.

Support for passing files to LLMs varies by provider. Some providers do not support file inputs, some restrict support to PDF files, and others accept a wider range of file types.
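The broadcasting behavior of `extra` can be sketched in plain Python. The helper name and pairing logic below are illustrative assumptions, not Haystack internals:

```python
# Illustrative sketch (not Haystack internals): a single `extra` dict
# is broadcast to every source, while a list of dicts must match the
# number of sources one-to-one.
def pair_sources_with_extra(sources, extra=None):
    if extra is None:
        return [(source, {}) for source in sources]
    if isinstance(extra, dict):
        # One dict: apply the same metadata to every file.
        return [(source, extra) for source in sources]
    if len(extra) != len(sources):
        raise ValueError("`extra` list must match the length of `sources`")
    # One dict per source, paired positionally.
    return list(zip(sources, extra))
```

For example, `pair_sources_with_extra(["a.pdf", "b.mp3"], {"provider": "x"})` tags both files with the same dict, while passing a two-element list assigns each dict to its corresponding file.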
## Usage

### On its own

```python
from haystack.components.converters import FileToFileContent

converter = FileToFileContent()

sources = ["document.pdf", "recording.mp3"]

result = converter.run(sources=sources)
file_contents = result["file_contents"]
print(file_contents)

# [
#     FileContent(
#         base64_data='JVBERi0x...', mime_type='application/pdf',
#         filename='document.pdf', extra={}
#     ),
#     FileContent(
#         base64_data='SUQzBA...', mime_type='audio/mpeg',
#         filename='recording.mp3', extra={}
#     )
# ]
```

### In a pipeline

Use `FileToFileContent` together with a `LinkContentFetcher` and a `ChatPromptBuilder` to build a pipeline that fetches a remote file, converts it, and passes it to an LLM.

```python
from haystack import Pipeline
from haystack.components.builders import ChatPromptBuilder
from haystack.components.converters import FileToFileContent
from haystack.components.fetchers import LinkContentFetcher
from haystack.components.generators.chat.openai import OpenAIChatGenerator

template = """
{% message role="user" %}
{% for file in files %}
{{ file | templatize_part }}
{% endfor %}
What's the main takeaway of the following document? Just one sentence.
{% endmessage %}
"""

pipeline = Pipeline()
pipeline.add_component("fetcher", LinkContentFetcher())
pipeline.add_component("converter", FileToFileContent())
pipeline.add_component("prompt_builder", ChatPromptBuilder(template=template))
pipeline.add_component("llm", OpenAIChatGenerator(model="gpt-4.1-mini"))

pipeline.connect("fetcher", "converter")
pipeline.connect("converter", "prompt_builder")
pipeline.connect("prompt_builder", "llm")

results = pipeline.run({"fetcher": {"urls": ["https://arxiv.org/pdf/2309.08632"]}})

print(results["llm"]["replies"][0].text)

# The document is a satirical paper humorously claiming that pretraining a
# small language model exclusively on evaluation benchmark test sets can achieve
# perfect performance, highlighting issues of data contamination in model
# evaluation.
```
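Each `FileContent` carries the raw file as a base64 string in `base64_data`, as the output of the first example shows. If you ever need the original bytes back, for instance to inspect a payload while debugging, a single standard-library decode is enough. The helper below is a small sketch, not part of the Haystack API:

```python
import base64


def raw_bytes(base64_data: str) -> bytes:
    """Decode a FileContent-style base64 payload back to raw bytes."""
    return base64.b64decode(base64_data)


# For a PDF, the decoded payload starts with b"%PDF", which is what the
# 'JVBERi0x...' prefix in the example output above encodes.
```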