---
title: "FirecrawlCrawler"
id: firecrawlcrawler
slug: "/firecrawlcrawler"
description: "Use Firecrawl to crawl websites and return the content as Haystack Documents. Unlike single-page fetchers, FirecrawlCrawler follows links and discovers subpages."
---

# FirecrawlCrawler

Use Firecrawl to crawl websites and return the content as Haystack Documents. Unlike single-page fetchers, FirecrawlCrawler follows links and discovers subpages.

<div className="key-value-table">

| | |
| --- | --- |
| **Most common position in a pipeline** | In indexing or query pipelines as the data fetching step |
| **Mandatory run variables** | `urls`: A list of URLs (strings) to start crawling from |
| **Output variables** | `documents`: A list of [Documents](../../concepts/data-classes.mdx) |
| **API reference** | [Firecrawl](/reference/integrations-firecrawl) |
| **GitHub link** | https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/firecrawl |

</div>

## Overview

`FirecrawlCrawler` uses [Firecrawl](https://firecrawl.dev) to crawl one or more URLs and return the extracted content as Haystack `Document` objects. Starting from each given URL, it follows links to discover subpages up to a configurable limit. This makes it well-suited for ingesting entire websites or documentation sites, not just single pages.

Firecrawl returns content in a structured format that works well as input for LLMs. Each crawled page becomes a separate `Document` with the page content in the `content` field and metadata, such as title, URL, and description, in the `meta` field.

### Crawl parameters

You can control the crawl behavior through the `params` argument. Some commonly used parameters:

- `limit`: Maximum number of pages to crawl per URL. Defaults to `1`. Without a limit, Firecrawl may crawl all subpages and consume credits quickly.
- `scrape_options`: Controls the output format. Defaults to `{"formats": ["markdown"]}`.

See the [Firecrawl API reference](https://docs.firecrawl.dev/api-reference/endpoint/crawl-post) for the full list of available parameters.
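For example, this sketch configures a crawler that fetches up to 20 pages per start URL and keeps the default Markdown output. The `limit` value of 20 is only an illustration; any other parameter from the Firecrawl API reference can be added to the same dictionary:

```python
from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler

# Crawl up to 20 pages per start URL and return Markdown content.
# The params dictionary is passed through to Firecrawl, so any
# parameter from the Firecrawl API reference can be set here.
crawler = FirecrawlCrawler(
    params={
        "limit": 20,
        "scrape_options": {"formats": ["markdown"]},
    }
)
```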
### Authorization

`FirecrawlCrawler` uses the `FIRECRAWL_API_KEY` environment variable by default. You can also pass the key explicitly at initialization:

```python
from haystack.utils import Secret
from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler

crawler = FirecrawlCrawler(api_key=Secret.from_token("<your-api-key>"))
```

To get an API key, sign up at [firecrawl.dev](https://firecrawl.dev).

### Installation

Install the Firecrawl integration with:

```shell
pip install firecrawl-haystack
```

## Usage

### On its own

```python
from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler

crawler = FirecrawlCrawler(params={"limit": 3})

result = crawler.run(urls=["https://docs.haystack.deepset.ai/docs/intro"])
documents = result["documents"]

for doc in documents:
    print(f"{doc.meta.get('title')} - {doc.meta.get('url')}")
```

### In a pipeline

Below is an example of an indexing pipeline that uses `FirecrawlCrawler` to crawl a documentation site and store the results in an `InMemoryDocumentStore`.

```python
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.fetchers.firecrawl import FirecrawlCrawler

document_store = InMemoryDocumentStore()

crawler = FirecrawlCrawler(params={"limit": 10})
splitter = DocumentSplitter(split_by="sentence", split_length=5)
writer = DocumentWriter(document_store=document_store)

indexing_pipeline = Pipeline()
indexing_pipeline.add_component("crawler", crawler)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("writer", writer)

indexing_pipeline.connect("crawler.documents", "splitter.documents")
indexing_pipeline.connect("splitter.documents", "writer.documents")

indexing_pipeline.run(
    data={
        "crawler": {
            "urls": ["https://docs.haystack.deepset.ai/docs/intro"],
        },
    },
)
```
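After the pipeline finishes, you can run a quick sanity check on the store. A minimal sketch, assuming the `document_store` from the example above:

```python
# Quick sanity check: count the documents written by the pipeline.
# count_documents() is part of Haystack's DocumentStore protocol.
print(f"Indexed {document_store.count_documents()} documents")
```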