---
title: "FirecrawlCrawler"
id: firecrawlcrawler
slug: "/firecrawlcrawler"
description: "Use Firecrawl to crawl websites and return the content as Haystack Documents. Unlike single-page fetchers, FirecrawlCrawler follows links and discovers subpages."
---

# FirecrawlCrawler

Use Firecrawl to crawl websites and return the content as Haystack Documents. Unlike single-page fetchers, FirecrawlCrawler follows links and discovers subpages.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | In indexing or query pipelines as the data fetching step |
| **Mandatory run variables** | `urls`: A list of URLs (strings) to start crawling from |
| **Output variables** | `documents`: A list of [Documents](../../concepts/data-classes.mdx) |
| **API reference** | [Firecrawl](/reference/integrations-firecrawl) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/firecrawl |

</div>

## Overview

`FirecrawlCrawler` uses [Firecrawl](https://firecrawl.dev) to crawl one or more URLs and return the extracted content as Haystack `Document` objects. Starting from each given URL, it follows links to discover subpages up to a configurable limit. This makes it well-suited for ingesting entire websites or documentation sites, not just single pages.

Firecrawl returns content in a structured format that works well as input for LLMs. Each crawled page becomes a separate `Document` with the page content in the `content` field and metadata, such as title, URL, and description, in the `meta` field.

### Crawl parameters

You can control the crawl behavior through the `params` argument. Some commonly used parameters:

- `limit`: Maximum number of pages to crawl per URL. Defaults to `1`. Without a limit, Firecrawl may crawl all subpages and consume credits quickly.
- `scrape_options`: Controls the output format. Defaults to `{"formats": ["markdown"]}`.

See the [Firecrawl API reference](https://docs.firecrawl.dev/api-reference/endpoint/crawl-post) for the full list of available parameters.
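For example, this sketch configures a crawler that fetches up to 20 pages per start URL and keeps the default Markdown output. The `limit` value of 20 is only an illustration; any other parameter from the Firecrawl API reference can be added to the same dictionary:

```python
from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler

# Crawl up to 20 pages per start URL and return Markdown content.
# The params dictionary is passed through to Firecrawl, so any
# parameter from the Firecrawl API reference can be set here.
crawler = FirecrawlCrawler(
    params={
        "limit": 20,
        "scrape_options": {"formats": ["markdown"]},
    }
)
```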
### Authorization

`FirecrawlCrawler` uses the `FIRECRAWL_API_KEY` environment variable by default. You can also pass the key explicitly at initialization:

```python
from haystack.utils import Secret
from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler

crawler = FirecrawlCrawler(api_key=Secret.from_token("<your-api-key>"))
```

To get an API key, sign up at [firecrawl.dev](https://firecrawl.dev).

### Installation

Install the Firecrawl integration with:

```shell
pip install firecrawl-haystack
```

## Usage

### On its own

```python
from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler

crawler = FirecrawlCrawler(params={"limit": 3})

result = crawler.run(urls=["https://docs.haystack.deepset.ai/docs/intro"])
documents = result["documents"]

for doc in documents:
    print(f"{doc.meta.get('title')} - {doc.meta.get('url')}")
```

### In a pipeline

Below is an example of an indexing pipeline that uses `FirecrawlCrawler` to crawl a documentation site and store the results in an `InMemoryDocumentStore`.

```python
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler

document_store = InMemoryDocumentStore()

crawler = FirecrawlCrawler(params={"limit": 10})
splitter = DocumentSplitter(split_by="sentence", split_length=5)
writer = DocumentWriter(document_store=document_store)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("crawler", crawler)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("writer", writer)

indexing_pipeline.connect("crawler.documents", "splitter.documents")
indexing_pipeline.connect("splitter.documents", "writer.documents")

indexing_pipeline.run(
    data={
        "crawler": {
            "urls": ["https://docs.haystack.deepset.ai/docs/intro"],
        },
    },
)
```
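After the pipeline finishes, you can run a quick sanity check on the store. A minimal sketch, assuming the `document_store` from the example above:

```python
# Quick sanity check: count the documents written by the pipeline.
# count_documents() is part of Haystack's DocumentStore protocol.
print(f"Indexed {document_store.count_documents()} documents")
```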