---
title: "Fetchers"
id: fetchers-api
description: "Fetches content from a list of URLs and returns a list of extracted content streams."
slug: "/fetchers-api"
---

<a id="link_content"></a>

## Module link\_content

<a id="link_content.LinkContentFetcher"></a>

### LinkContentFetcher

Fetches and extracts content from URLs.

It supports various content types, retries failed requests, and automatically rotates user agents when a request fails. Use it as the data-fetching step in your pipelines.

You may need to convert LinkContentFetcher's output into a list of documents. Use the HTMLToDocument converter to do this.

### Usage example

```python
from haystack.components.fetchers.link_content import LinkContentFetcher

fetcher = LinkContentFetcher()
streams = fetcher.run(urls=["https://www.google.com"])["streams"]

assert len(streams) == 1
assert streams[0].meta == {'content_type': 'text/html', 'url': 'https://www.google.com'}
assert streams[0].data
```

For async usage:

```python
import asyncio
from haystack.components.fetchers import LinkContentFetcher

async def fetch_async():
    fetcher = LinkContentFetcher()
    result = await fetcher.run_async(urls=["https://www.google.com"])
    return result["streams"]

streams = asyncio.run(fetch_async())
```

<a id="link_content.LinkContentFetcher.__init__"></a>

#### LinkContentFetcher.\_\_init\_\_

```python
def __init__(raise_on_failure: bool = True,
             user_agents: list[str] | None = None,
             retry_attempts: int = 2,
             timeout: int = 3,
             http2: bool = False,
             client_kwargs: dict | None = None,
             request_headers: dict[str, str] | None = None)
```

Initializes the component.

**Arguments**:

- `raise_on_failure`: If `True`, raises an exception when fetching a single URL fails. For multiple URLs, it logs errors and returns the content it successfully fetched.
- `user_agents`: [User agents](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent) for fetching content. If `None`, a default user agent is used.
- `retry_attempts`: The number of times to retry fetching the URL's content.
- `timeout`: Timeout in seconds for the request.
- `http2`: Whether to enable HTTP/2 support for requests. Defaults to `False`. Requires the `h2` package to be installed (via `pip install httpx[http2]`).
- `client_kwargs`: Additional keyword arguments to pass to the httpx client. If `None`, default values are used.
- `request_headers`: Custom HTTP headers to include in every request. These are merged with the default headers, with custom headers taking precedence.

<a id="link_content.LinkContentFetcher.__del__"></a>

#### LinkContentFetcher.\_\_del\_\_

```python
def __del__()
```

Cleans up resources when the component is deleted.

Closes both the synchronous and asynchronous HTTP clients to prevent resource leaks.

<a id="link_content.LinkContentFetcher.run"></a>

#### LinkContentFetcher.run

```python
@component.output_types(streams=list[ByteStream])
def run(urls: list[str])
```

Fetches content from a list of URLs and returns a list of extracted content streams.

Each content stream is a `ByteStream` object containing the extracted content as binary data. Each `ByteStream` object in the returned list corresponds to the contents of a single URL. The content type of each stream is stored in the metadata of the `ByteStream` object under the key `"content_type"`, and the URL of the fetched content under the key `"url"`.

**Arguments**:

- `urls`: A list of URLs to fetch content from.

**Raises**:

- `Exception`: If the provided list of URLs contains only a single URL and `raise_on_failure` is set to `True`, an exception is raised on any error during content retrieval. In all other scenarios, retrieval errors are logged, and a list of successfully retrieved `ByteStream` objects is returned.
**Returns**:

`ByteStream` objects representing the extracted content.

<a id="link_content.LinkContentFetcher.run_async"></a>

#### LinkContentFetcher.run\_async

```python
@component.output_types(streams=list[ByteStream])
async def run_async(urls: list[str])
```

Asynchronously fetches content from a list of URLs and returns a list of extracted content streams.

This is the asynchronous version of the `run` method, with the same parameters and return values.

**Arguments**:

- `urls`: A list of URLs to fetch content from.

**Returns**:

`ByteStream` objects representing the extracted content.