Type: lib.browser.SpiderCrawl

Namespace: lib.browser

Description

Crawls websites following links and emitting URLs with optional HTML content. spider, crawler, web scraping, links, sitemap

Use cases:
- Build sitemaps and discover website structure
- Collect URLs for bulk processing
- Find all pages on a website
- Extract content from multiple pages
- Feed agentic workflows with discovered pages
- Analyze website content and structure

Properties

Property Type Description Default
start_url str The starting URL to begin crawling from ``
max_depth int Maximum depth to crawl (0 = start page only, 1 = start + linked pages, etc.) 2
max_pages int Maximum number of pages to crawl (safety limit) 50
same_domain_only bool Only follow links within the same domain as the start URL true
include_html bool Include the HTML content of each page in the output (increases bandwidth) false
respect_robots_txt bool Respect robots.txt rules (follows web crawler best practices) true
delay_ms int Delay in milliseconds between requests (politeness policy) 1000
timeout int Timeout in milliseconds for each page load 30000
url_pattern str Optional regex pattern to filter URLs (only crawl matching URLs) ``
exclude_pattern str Optional regex pattern to exclude URLs (skip matching URLs) ``

Outputs

Output Type Description
url str  
depth int  
html str  
title str  
status_code int  
pages list  

Browse other nodes in the lib.browser namespace.