Type: lib.browser.SpiderCrawl
Namespace: lib.browser
Description
Crawls websites, following links and emitting URLs with optional HTML content.
Tags: spider, crawler, web scraping, links, sitemap
Use cases:
- Build sitemaps and discover website structure
- Collect URLs for bulk processing
- Find all pages on a website
- Extract content from multiple pages
- Feed agentic workflows with discovered pages
- Analyze website content and structure
Properties
| Property | Type | Description | Default |
|---|---|---|---|
| start_url | str | The starting URL to begin crawling from | `` |
| max_depth | int | Maximum depth to crawl (0 = start page only, 1 = start + linked pages, etc.) | 2 |
| max_pages | int | Maximum number of pages to crawl (safety limit) | 50 |
| same_domain_only | bool | Only follow links within the same domain as the start URL | true |
| include_html | bool | Include the HTML content of each page in the output (increases bandwidth) | false |
| respect_robots_txt | bool | Respect robots.txt rules (follows web crawler best practices) | true |
| delay_ms | int | Delay in milliseconds between requests (politeness policy) | 1000 |
| timeout | int | Timeout in milliseconds for each page load | 30000 |
| url_pattern | str | Optional regex pattern to filter URLs (only crawl matching URLs) | `` |
| exclude_pattern | str | Optional regex pattern to exclude URLs (skip matching URLs) | `` |
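To make the interaction between these properties concrete, the following is a minimal, independent sketch of the bounded breadth-first crawl they describe, written with the `requests` and `beautifulsoup4` packages. It illustrates how `max_depth`, `max_pages`, `same_domain_only`, the regex filters, and `delay_ms` shape the crawl; it is not the node's implementation, and `respect_robots_txt` and `include_html` handling are omitted for brevity.

```python
import re
import time
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def spider_crawl(start_url, max_depth=2, max_pages=50, same_domain_only=True,
                 delay_ms=1000, timeout=30000, url_pattern="", exclude_pattern=""):
    """Breadth-first crawl bounded by depth, page count, and URL filters (illustrative sketch)."""
    start_domain = urlparse(start_url).netloc
    queue = [(start_url, 0)]  # (url, depth)
    visited = set()
    pages = []

    while queue and len(pages) < max_pages:
        url, depth = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)

        # Optional include/exclude regex filters; empty patterns disable the check.
        if url_pattern and not re.search(url_pattern, url):
            continue
        if exclude_pattern and re.search(exclude_pattern, url):
            continue

        resp = requests.get(url, timeout=timeout / 1000)
        pages.append({"url": url, "depth": depth, "status_code": resp.status_code})

        # Only expand links while below the depth limit.
        if depth < max_depth:
            soup = BeautifulSoup(resp.text, "html.parser")
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                if same_domain_only and urlparse(link).netloc != start_domain:
                    continue
                queue.append((link, depth + 1))

        time.sleep(delay_ms / 1000)  # politeness delay between requests

    return pages
```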
Outputs
| Output | Type | Description |
|---|---|---|
| url | str | URL of the crawled page |
| depth | int | Link distance of the page from the start URL (0 = start page) |
| html | str | HTML content of the page (populated when include_html is enabled) |
| title | str | Title of the page |
| status_code | int | HTTP status code returned for the page |
| pages | list | Aggregated list of all crawled page records |
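As a reading aid, here is one plausible shape for the per-page record implied by the outputs above, written as a Python TypedDict. The field names follow the table; the comments restate it and are not taken from the node's actual schema.

```python
from typing import List, TypedDict


class CrawledPage(TypedDict):
    """Per-page record suggested by the outputs table (illustrative, not the node's schema)."""
    url: str          # URL of the crawled page
    depth: int        # link distance from the start URL (0 for the start page)
    html: str         # page HTML, populated when include_html is true
    title: str        # page title text
    status_code: int  # HTTP response status code


# `pages` aggregates every record emitted during the crawl.
pages: List[CrawledPage] = []
```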
Related Nodes
Browse other nodes in the lib.browser namespace.