---
title: "Fetchers"
id: fetchers-api
description: "Fetches content from a list of URLs and returns a list of extracted content streams."
slug: "/fetchers-api"
---

<a id="link_content"></a>

## Module link\_content

<a id="link_content.LinkContentFetcher"></a>

### LinkContentFetcher

Fetches and extracts content from URLs.

It supports various content types, retries failed requests, and automatically rotates user agents when a request fails. Use it as the data-fetching step in your pipelines.

You may need to convert LinkContentFetcher's output into a list of documents. Use the HTMLToDocument converter to do this.

### Usage example

```python
from haystack.components.fetchers.link_content import LinkContentFetcher

fetcher = LinkContentFetcher()
streams = fetcher.run(urls=["https://www.google.com"])["streams"]

assert len(streams) == 1
assert streams[0].meta == {'content_type': 'text/html', 'url': 'https://www.google.com'}
assert streams[0].data
```

For async usage:

```python
import asyncio
from haystack.components.fetchers import LinkContentFetcher

async def fetch_async():
    fetcher = LinkContentFetcher()
    result = await fetcher.run_async(urls=["https://www.google.com"])
    return result["streams"]

streams = asyncio.run(fetch_async())
```

<a id="link_content.LinkContentFetcher.__init__"></a>

#### LinkContentFetcher.\_\_init\_\_

```python
def __init__(raise_on_failure: bool = True,
             user_agents: list[str] | None = None,
             retry_attempts: int = 2,
             timeout: int = 3,
             http2: bool = False,
             client_kwargs: dict | None = None,
             request_headers: dict[str, str] | None = None)
```

Initializes the component.

**Arguments**:

- `raise_on_failure`: If `True`, raises an exception when fetching a single URL fails. For multiple URLs, it logs errors and returns the content it successfully fetched.
- `user_agents`: [User agents](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent) for fetching content. If `None`, a default user agent is used.
- `retry_attempts`: The number of times to retry fetching the URL's content.
- `timeout`: Timeout in seconds for the request.
- `http2`: Whether to enable HTTP/2 support for requests. Defaults to `False`. Requires the `h2` package to be installed (via `pip install httpx[http2]`).
- `client_kwargs`: Additional keyword arguments to pass to the httpx client. If `None`, default values are used.
- `request_headers`: Custom HTTP headers to include in every request. These are merged with the default headers, with custom headers taking precedence.

<a id="link_content.LinkContentFetcher.__del__"></a>

#### LinkContentFetcher.\_\_del\_\_

```python
def __del__()
```

Cleans up resources when the component is deleted.

Closes both the synchronous and asynchronous HTTP clients to prevent resource leaks.

<a id="link_content.LinkContentFetcher.run"></a>

#### LinkContentFetcher.run

```python
@component.output_types(streams=list[ByteStream])
def run(urls: list[str])
```

Fetches content from a list of URLs and returns a list of extracted content streams.

Each content stream is a `ByteStream` object containing the extracted content as binary data. Each `ByteStream` object in the returned list corresponds to the contents of a single URL. The content type of each stream is stored in the metadata of the `ByteStream` object under the key `"content_type"`, and the URL of the fetched content under the key `"url"`.

**Arguments**:

- `urls`: A list of URLs to fetch content from.

**Raises**:

- `Exception`: If the provided list of URLs contains only a single URL and `raise_on_failure` is set to `True`, an exception is raised on any error during content retrieval. In all other scenarios, retrieval errors are logged, and a list of successfully retrieved `ByteStream` objects is returned.
**Returns**:

`ByteStream` objects representing the extracted content.

<a id="link_content.LinkContentFetcher.run_async"></a>

#### LinkContentFetcher.run\_async

```python
@component.output_types(streams=list[ByteStream])
async def run_async(urls: list[str])
```

Asynchronously fetches content from a list of URLs and returns a list of extracted content streams.

This is the asynchronous version of the `run` method, with the same parameters and return values.

**Arguments**:

- `urls`: A list of URLs to fetch content from.

**Returns**:

`ByteStream` objects representing the extracted content.