Firecrawl Crawl
Block for crawling multiple pages of a website using Firecrawl.
What it is
Firecrawl crawls websites to extract comprehensive data while bypassing blockers.
How it works
This block uses Firecrawl's API to crawl multiple pages of a website, starting from a given URL. It follows links across the site, handling JavaScript rendering and bypassing anti-bot measures to extract clean content from each page.
Control how many pages are crawled with the limit parameter, choose one or more output formats (markdown, HTML, raw HTML, links, screenshots, structured JSON, or change tracking), and optionally filter results to the main content only. The block also supports cached results with a configurable maximum age, plus a wait time to let dynamic content load.
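For illustration, here is a minimal sketch of starting a crawl with these parameters against Firecrawl's v1 REST API. The endpoint and field names (scrapeOptions, onlyMainContent, waitFor, maxAge) follow Firecrawl's public API, but the API key and start URL are placeholders, and the exact request shape the block sends is an assumption.

```python
import requests

API_KEY = "fc-your-api-key"  # assumption: your Firecrawl API key

# Start an asynchronous crawl job. The body mirrors this block's inputs:
# url, limit, formats, only_main_content, max_age, and wait_for.
response = requests.post(
    "https://api.firecrawl.dev/v1/crawl",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "url": "https://example.com",         # url: where crawling starts
        "limit": 10,                          # limit: max pages to crawl
        "scrapeOptions": {
            "formats": ["markdown", "links"],  # formats: desired outputs
            "onlyMainContent": True,           # only_main_content
            "maxAge": 3600000,                 # max_age: 1 hour, in ms
            "waitFor": 2000,                   # wait_for: 2s for dynamic content
        },
    },
)
job = response.json()
print(job)  # e.g. {"success": true, "id": "<job-id>", "url": "<status-url>"}
```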
Inputs
| Input | Description | Type | Required |
|-------|-------------|------|----------|
| url | The URL to start crawling from | str | Yes |
| limit | The maximum number of pages to crawl | int | No |
| only_main_content | Only return the main content of each page, excluding headers, navigation bars, footers, etc. | bool | No |
| max_age | The maximum age of a cached page, in milliseconds (default: 1 hour) | int | No |
| wait_for | Delay, in milliseconds, before fetching content, allowing the page sufficient time to load | int | No |
| formats | The output format(s) of the crawl | List["markdown" \| "html" \| "rawHtml" \| "links" \| "screenshot" \| "screenshot@fullPage" \| "json" \| "changeTracking"] | No |
Outputs
| Output | Description | Type |
|--------|-------------|------|
| error | Error message if the crawl failed | str |
| data | The full crawl result, one entry per crawled page | List[Dict[str, Any]] |
| markdown | The page content rendered as markdown | str |
| html | The processed HTML of the page | str |
| raw_html | The raw HTML of the page | str |
| links | The links found on the page | List[str] |
| screenshot | A screenshot of the page viewport | str |
| screenshot_full_page | A full-page screenshot of the page | str |
| json_data | Structured JSON data extracted from the page | Dict[str, Any] |
| change_tracking | Change-tracking data for the page | Dict[str, Any] |
Possible use cases
Documentation Indexing: Crawl entire documentation sites to build searchable knowledge bases or training data (see the sketch after this list).
Competitor Research: Extract content from competitor websites for market analysis and comparison.
Content Archival: Systematically archive website content for backup or compliance purposes.
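As a concrete illustration of the documentation-indexing case, the sketch below writes each crawled page's markdown to disk for later indexing. The docs_corpus directory, the naive URL-based file naming, and the status variable (from the polling sketch above) are all assumptions for the example.

```python
from pathlib import Path

out_dir = Path("docs_corpus")
out_dir.mkdir(exist_ok=True)

# Persist each crawled page as a markdown file named after the last URL
# segment (naive: assumes unique trailing segments), ready for ingestion
# into a search index or training pipeline.
for page in status.get("data", []):  # `status` from the polling sketch above
    url = page["metadata"]["sourceURL"]
    slug = url.rstrip("/").split("/")[-1] or "index"
    (out_dir / f"{slug}.md").write_text(page.get("markdown", ""), encoding="utf-8")
```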