---
title: "Fetchers"
id: fetchers-api
description: "Fetches content from a list of URLs and returns a list of extracted content streams."
slug: "/fetchers-api"
---

## link_content

### LinkContentFetcher

Fetches and extracts content from URLs.

It supports various content types, retries on failures, and automatically rotates user agents for failed web requests. Use it as the data-fetching step in your pipelines.

To convert LinkContentFetcher's output into a list of documents, use the `HTMLToDocument` converter.

### Usage example

```python
from haystack.components.fetchers.link_content import LinkContentFetcher

fetcher = LinkContentFetcher()
streams = fetcher.run(urls=["https://www.google.com"])["streams"]

assert len(streams) == 1
assert streams[0].meta == {'content_type': 'text/html', 'url': 'https://www.google.com'}
assert streams[0].data
```

For async usage:

```python
import asyncio
from haystack.components.fetchers import LinkContentFetcher

async def fetch_async():
    fetcher = LinkContentFetcher()
    result = await fetcher.run_async(urls=["https://www.google.com"])
    return result["streams"]

streams = asyncio.run(fetch_async())
```

#### __init__

```python
__init__(
    raise_on_failure: bool = True,
    user_agents: list[str] | None = None,
    retry_attempts: int = 2,
    timeout: int = 3,
    http2: bool = False,
    client_kwargs: dict | None = None,
    request_headers: dict[str, str] | None = None,
) -> None
```

Initializes the component.

**Parameters:**

- **raise_on_failure** (<code>bool</code>) – If `True`, raises an exception when fetching a single URL fails. For multiple URLs, errors are logged and the successfully fetched content is returned.
- **user_agents** (<code>list\[str\] | None</code>) – [User agents](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent) for fetching content. If `None`, a default user agent is used.
- **retry_attempts** (<code>int</code>) – The number of times to retry fetching a URL's content.
- **timeout** (<code>int</code>) – Timeout in seconds for the request.
- **http2** (<code>bool</code>) – Whether to enable HTTP/2 support for requests. Defaults to `False`. Requires the `h2` package to be installed (via `pip install httpx[http2]`).
- **client_kwargs** (<code>dict | None</code>) – Additional keyword arguments to pass to the httpx client. If `None`, default values are used.
- **request_headers** (<code>dict\[str, str\] | None</code>) – Custom headers to include in requests. If `None`, no extra headers are sent.

#### run

```python
run(urls: list[str]) -> dict[str, Any]
```

Fetches content from a list of URLs and returns a list of extracted content streams.

Each content stream is a `ByteStream` object containing the extracted content as binary data, and each `ByteStream` in the returned list corresponds to a single URL. The content type of each stream is stored in the `ByteStream` metadata under the key `"content_type"`, and the URL of the fetched content under the key `"url"`.

**Parameters:**

- **urls** (<code>list\[str\]</code>) – A list of URLs to fetch content from.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with a `"streams"` key containing the `ByteStream` objects that represent the extracted content.

**Raises:**

- <code>Exception</code> – If the provided list contains only a single URL and `raise_on_failure` is set to `True`, an exception is raised when content retrieval fails. In all other scenarios, retrieval errors are logged and a list of successfully retrieved `ByteStream` objects is returned.
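The shape of the returned dictionary can be illustrated with a minimal stand-in. Note that the `ByteStream` class below is a hypothetical simplification for illustration, reduced to the two fields this reference mentions, not the real Haystack class:

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for Haystack's ByteStream, reduced to the
# fields described above: binary content plus metadata.
@dataclass
class ByteStream:
    data: bytes
    meta: dict = field(default_factory=dict)

# run() returns one stream per fetched URL, with "content_type"
# and "url" stored in each stream's metadata.
result = {
    "streams": [
        ByteStream(
            data=b"<html>...</html>",
            meta={"content_type": "text/html", "url": "https://www.google.com"},
        )
    ]
}

assert result["streams"][0].meta["url"] == "https://www.google.com"
```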

#### run_async

```python
run_async(urls: list[str]) -> dict[str, Any]
```

Asynchronously fetches content from a list of URLs and returns a list of extracted content streams.

This is the asynchronous version of the `run` method, with the same parameters and return values.

**Parameters:**

- **urls** (<code>list\[str\]</code>) – A list of URLs to fetch content from.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with a `"streams"` key containing the `ByteStream` objects that represent the extracted content.
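The failure handling described at the top of this page (retry attempts combined with user-agent rotation) can be sketched in plain Python. This is an illustrative pattern only, under stated assumptions: `fetch_with_rotation`, `fake_fetch`, and the agent strings are hypothetical names, not part of the Haystack API:

```python
# Hypothetical user-agent pool; not the component's actual defaults.
USER_AGENTS = ["haystack/LinkContentFetcher", "Mozilla/5.0 (compatible)"]

def fetch_with_rotation(fetch_once, retry_attempts=2, user_agents=USER_AGENTS):
    """Call fetch_once(user_agent) up to 1 + retry_attempts times,
    rotating to the next user agent after each failure."""
    last_error = None
    for attempt in range(1 + retry_attempts):
        agent = user_agents[attempt % len(user_agents)]
        try:
            return fetch_once(agent)
        except Exception as err:
            last_error = err
    raise last_error

# Simulated endpoint that rejects the first user agent:
def fake_fetch(agent):
    if agent == "haystack/LinkContentFetcher":
        raise RuntimeError("403 Forbidden")
    return f"ok via {agent}"

# Succeeds after rotating to the second user agent.
print(fetch_with_rotation(fake_fetch))
```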