---
title: "Fetchers"
id: fetchers-api
description: "Fetches content from a list of URLs and returns a list of extracted content streams."
slug: "/fetchers-api"
---

## link_content

### LinkContentFetcher

Fetches and extracts content from URLs.

It supports various content types, retries on failures, and automatically rotates user agents for failed web requests. Use it as the data-fetching step in your pipelines.

To convert LinkContentFetcher's output into a list of documents, use the `HTMLToDocument` converter.

### Usage example

```python
from haystack.components.fetchers.link_content import LinkContentFetcher

fetcher = LinkContentFetcher()
streams = fetcher.run(urls=["https://www.google.com"])["streams"]

assert len(streams) == 1
assert streams[0].meta == {'content_type': 'text/html', 'url': 'https://www.google.com'}
assert streams[0].data
```

For async usage:

```python
import asyncio
from haystack.components.fetchers import LinkContentFetcher

async def fetch_async():
    fetcher = LinkContentFetcher()
    result = await fetcher.run_async(urls=["https://www.google.com"])
    return result["streams"]

streams = asyncio.run(fetch_async())
```

#### __init__

```python
__init__(
    raise_on_failure: bool = True,
    user_agents: list[str] | None = None,
    retry_attempts: int = 2,
    timeout: int = 3,
    http2: bool = False,
    client_kwargs: dict | None = None,
    request_headers: dict[str, str] | None = None,
) -> None
```

Initializes the component.

**Parameters:**

- **raise_on_failure** (<code>bool</code>) – If `True`, raises an exception when fetching a single URL fails. For multiple URLs, errors are logged and the successfully fetched content is returned.
- **user_agents** (<code>list\[str\] | None</code>) – [User agents](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent) for fetching content. If `None`, a default user agent is used.
- **retry_attempts** (<code>int</code>) – The number of times to retry fetching a URL's content.
- **timeout** (<code>int</code>) – Timeout in seconds for the request.
- **http2** (<code>bool</code>) – Whether to enable HTTP/2 support for requests. Defaults to `False`. Requires the `h2` package to be installed (via `pip install httpx[http2]`).
- **client_kwargs** (<code>dict | None</code>) – Additional keyword arguments to pass to the httpx client. If `None`, default values are used.
- **request_headers** (<code>dict\[str, str\] | None</code>) – Custom headers to include in requests. If `None`, no extra headers are sent.

#### run

```python
run(urls: list[str]) -> dict[str, Any]
```

Fetches content from a list of URLs and returns a list of extracted content streams.

Each content stream is a `ByteStream` object containing the extracted content as binary data, and each `ByteStream` in the returned list corresponds to a single URL. The content type of each stream is stored in the `ByteStream` metadata under the key `"content_type"`, and the URL of the fetched content under the key `"url"`.

**Parameters:**

- **urls** (<code>list\[str\]</code>) – A list of URLs to fetch content from.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with a `"streams"` key containing the `ByteStream` objects that represent the extracted content.

**Raises:**

- <code>Exception</code> – If the provided list contains only a single URL and `raise_on_failure` is set to `True`, an exception is raised when content retrieval fails. In all other scenarios, retrieval errors are logged and a list of successfully retrieved `ByteStream` objects is returned.
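The shape of the returned dictionary can be illustrated with a minimal stand-in. Note that the `ByteStream` class below is a hypothetical simplification for illustration, reduced to the two fields this reference mentions, not the real Haystack class:

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for Haystack's ByteStream, reduced to the
# fields described above: binary content plus metadata.
@dataclass
class ByteStream:
    data: bytes
    meta: dict = field(default_factory=dict)

# run() returns one stream per fetched URL, with "content_type"
# and "url" stored in each stream's metadata.
result = {
    "streams": [
        ByteStream(
            data=b"<html>...</html>",
            meta={"content_type": "text/html", "url": "https://www.google.com"},
        )
    ]
}

assert result["streams"][0].meta["url"] == "https://www.google.com"
```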

#### run_async

```python
run_async(urls: list[str]) -> dict[str, Any]
```

Asynchronously fetches content from a list of URLs and returns a list of extracted content streams.

This is the asynchronous version of the `run` method, with the same parameters and return values.

**Parameters:**

- **urls** (<code>list\[str\]</code>) – A list of URLs to fetch content from.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with a `"streams"` key containing the `ByteStream` objects that represent the extracted content.
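The failure handling described at the top of this page (retry attempts combined with user-agent rotation) can be sketched in plain Python. This is an illustrative pattern only, under stated assumptions: `fetch_with_rotation`, `fake_fetch`, and the agent strings are hypothetical names, not part of the Haystack API:

```python
# Hypothetical user-agent pool; not the component's actual defaults.
USER_AGENTS = ["haystack/LinkContentFetcher", "Mozilla/5.0 (compatible)"]

def fetch_with_rotation(fetch_once, retry_attempts=2, user_agents=USER_AGENTS):
    """Call fetch_once(user_agent) up to 1 + retry_attempts times,
    rotating to the next user agent after each failure."""
    last_error = None
    for attempt in range(1 + retry_attempts):
        agent = user_agents[attempt % len(user_agents)]
        try:
            return fetch_once(agent)
        except Exception as err:
            last_error = err
    raise last_error

# Simulated endpoint that rejects the first user agent:
def fake_fetch(agent):
    if agent == "haystack/LinkContentFetcher":
        raise RuntimeError("403 Forbidden")
    return f"ok via {agent}"

# Succeeds after rotating to the second user agent.
print(fetch_with_rotation(fake_fetch))
```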