firecrawl.md
---
title: "Firecrawl"
id: integrations-firecrawl
description: "Firecrawl integration for Haystack"
slug: "/integrations-firecrawl"
---

## haystack_integrations.components.fetchers.firecrawl.firecrawl_crawler

### FirecrawlCrawler

A component that uses Firecrawl to crawl one or more URLs and return the content as Haystack Documents.

Crawling starts from each given URL and follows links to discover subpages, up to a configurable limit.
This is useful for ingesting entire websites or documentation sites, not just single pages.

Firecrawl is a service that crawls websites and returns content in a structured format (e.g. Markdown)
suitable for LLMs. You need a Firecrawl API key from [firecrawl.dev](https://firecrawl.dev).

### Usage example

```python
from haystack.utils import Secret
from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler

crawler = FirecrawlCrawler(
    api_key=Secret.from_env_var("FIRECRAWL_API_KEY"),
    params={"limit": 5},  # crawl at most 5 pages per start URL
)
crawler.warm_up()

result = crawler.run(urls=["https://docs.haystack.deepset.ai/docs/intro"])
documents = result["documents"]
```

#### __init__

```python
__init__(
    api_key: Secret = Secret.from_env_var("FIRECRAWL_API_KEY"),
    params: dict[str, Any] | None = None,
) -> None
```

Initialize the FirecrawlCrawler.

**Parameters:**

- **api_key** (<code>Secret</code>) – API key for Firecrawl.
  Defaults to the `FIRECRAWL_API_KEY` environment variable.
- **params** (<code>dict\[str, Any\] | None</code>) – Parameters for the crawl request. See the
  [Firecrawl API reference](https://docs.firecrawl.dev/api-reference/endpoint/crawl-post)
  for available parameters.
  Defaults to `{"limit": 1, "scrape_options": {"formats": ["markdown"]}}`.
  Without a limit, Firecrawl may crawl all subpages and consume credits quickly.

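
Since the documented default caps a crawl at one page, it can help to build the `params` dictionary explicitly whenever a higher limit is needed. The following is a minimal sketch, assuming a hypothetical helper named `build_crawl_params` that is not part of the integration; it only mirrors the documented default shape and rejects a non-positive limit.

```python
from typing import Any, Optional


def build_crawl_params(limit: int = 1, formats: Optional[list[str]] = None) -> dict[str, Any]:
    """Hypothetical helper (not part of the integration): build crawl params with an explicit limit."""
    if limit < 1:
        # A missing or non-positive limit could let a crawl run unbounded and consume credits.
        raise ValueError("limit must be a positive integer")
    return {
        "limit": limit,
        # Same shape as the documented default: request Markdown output.
        "scrape_options": {"formats": list(formats) if formats else ["markdown"]},
    }


params = build_crawl_params(limit=5)
```

The resulting dictionary can be passed as `params` either at init time or to `run`.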

#### run

```python
run(urls: list[str], params: dict[str, Any] | None = None) -> dict[str, Any]
```

Crawls the given URLs and returns the extracted content as Documents.

**Parameters:**

- **urls** (<code>list\[str\]</code>) – List of URLs to crawl.
- **params** (<code>dict\[str, Any\] | None</code>) – Optional override of crawl parameters for this run.
  If provided, fully replaces the init-time params.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: List of documents, one for each URL crawled.

#### run_async

```python
run_async(
    urls: list[str], params: dict[str, Any] | None = None
) -> dict[str, Any]
```

Asynchronously crawls the given URLs and returns the extracted content as Documents.

**Parameters:**

- **urls** (<code>list\[str\]</code>) – List of URLs to crawl.
- **params** (<code>dict\[str, Any\] | None</code>) – Optional override of crawl parameters for this run.
  If provided, fully replaces the init-time params.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: List of documents, one for each URL crawled.

#### warm_up

```python
warm_up() -> None
```

Warm up the component by initializing the Firecrawl clients.
This is useful to avoid cold start delays when crawling many URLs.

## haystack_integrations.components.websearch.firecrawl.firecrawl_websearch

### FirecrawlWebSearch

A component that uses Firecrawl to search the web and return results as Haystack Documents.

This component wraps the Firecrawl Search API, enabling web search queries that return
structured documents with content and links. It follows the standard Haystack WebSearch
component interface.

Firecrawl is a service that crawls and scrapes websites, returning content in formats suitable
for LLMs.
You need a Firecrawl API key from [firecrawl.dev](https://firecrawl.dev).

### Usage example

```python
from haystack_integrations.components.websearch.firecrawl import FirecrawlWebSearch
from haystack.utils import Secret

websearch = FirecrawlWebSearch(
    api_key=Secret.from_env_var("FIRECRAWL_API_KEY"),
    top_k=5,
)
result = websearch.run(query="What is Haystack by deepset?")
documents = result["documents"]
links = result["links"]
```

#### __init__

```python
__init__(
    api_key: Secret = Secret.from_env_var("FIRECRAWL_API_KEY"),
    top_k: int | None = 10,
    search_params: dict[str, Any] | None = None,
) -> None
```

Initialize the FirecrawlWebSearch component.

**Parameters:**

- **api_key** (<code>Secret</code>) – API key for Firecrawl.
  Defaults to the `FIRECRAWL_API_KEY` environment variable.
- **top_k** (<code>int | None</code>) – Maximum number of documents to return.
  Defaults to 10. This can be overridden by the `"limit"` parameter in `search_params`.
- **search_params** (<code>dict\[str, Any\] | None</code>) – Additional parameters passed to the Firecrawl search API.
  See the [Firecrawl API reference](https://docs.firecrawl.dev/api-reference/endpoint/search)
  for available parameters. Supported keys include: `tbs`, `location`,
  `scrape_options`, `sources`, `categories`, `timeout`.

#### warm_up

```python
warm_up() -> None
```

Warm up the Firecrawl clients by initializing the sync and async clients.
This is useful to avoid cold start delays when performing searches.

#### run

```python
run(query: str, search_params: dict[str, Any] | None = None) -> dict[str, Any]
```

Search the web using Firecrawl and return results as Documents.

**Parameters:**

- **query** (<code>str</code>) – Search query string.
- **search_params** (<code>dict\[str, Any\] | None</code>) – Optional override of search parameters for this run.
  If provided, fully replaces the init-time search_params.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: List of documents with search result content.
  - `links`: List of URLs from the search results.

#### run_async

```python
run_async(
    query: str, search_params: dict[str, Any] | None = None
) -> dict[str, Any]
```

Asynchronously search the web using Firecrawl and return results as Documents.

**Parameters:**

- **query** (<code>str</code>) – Search query string.
- **search_params** (<code>dict\[str, Any\] | None</code>) – Optional override of search parameters for this run.
  If provided, fully replaces the init-time search_params.

**Returns:**

- <code>dict\[str, Any\]</code> – A dictionary with the following keys:
  - `documents`: List of documents with search result content.
  - `links`: List of URLs from the search results.

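
The override rules stated for `search_params` and `top_k` can be modeled in a few lines. This is only a sketch of the documented behavior, not the integration's actual code: `resolve_search_params` is a hypothetical name, the sample values are illustrative, and it assumes that `top_k` still supplies the limit when the chosen parameters carry no `"limit"` key.

```python
from typing import Any, Optional


def resolve_search_params(
    init_params: Optional[dict[str, Any]],
    run_params: Optional[dict[str, Any]],
    top_k: Optional[int] = 10,
) -> dict[str, Any]:
    """Hypothetical model of how run-time and init-time search params combine."""
    # Run-time params replace init-time params wholesale; nothing is merged.
    chosen = dict(run_params if run_params is not None else (init_params or {}))
    # An explicit "limit" wins; otherwise fall back to top_k when it is set (assumption).
    if "limit" not in chosen and top_k is not None:
        chosen["limit"] = top_k
    return chosen


init = {"location": "Germany", "timeout": 30000}
print(resolve_search_params(init, None, top_k=5))
# {'location': 'Germany', 'timeout': 30000, 'limit': 5}
print(resolve_search_params(init, {"limit": 3}, top_k=5))
# {'limit': 3}
```

In the second call the init-time `location` and `timeout` are dropped, which is the practical consequence of "fully replaces": any key that should survive an override has to be repeated in the run-time dictionary.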