Configuration

TOML Config File

Pass a TOML configuration file instead of CLI flags:

palimpsest crawl -c crawl.toml

Example Configuration

seeds = ["https://example.com/", "https://docs.example.com/"]

[crawl]
seed = 42
max_depth = 3
max_urls = 500
concurrency = 10
user_agent = "PalimpsestBot/0.1"
browser_mode = false
scope = "same_domain"
output_dir = "./output"

[politeness]
min_host_delay_ms = 1000
max_concurrent_hosts = 100

Configuration Fields

Seeds

seeds = ["https://example.com/", "https://docs.example.com/"]

One or more seed URLs. The crawl starts from these and discovers links outward.

Crawl Seed

seed = 42

The deterministic seed value. It controls all randomness in the system: frontier ordering, host rotation, and browser JS overrides. Running a crawl again with the same seed reproduces the same crawl.
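The idea of seed-controlled ordering can be illustrated in a few lines of Python. This sketch only demonstrates the principle; it is not Palimpsest's frontier algorithm, and `frontier_order` is a hypothetical name. It sorts the URLs into a canonical order first, so the result depends only on the set of URLs and the seed, not on the order in which links happened to be discovered.

```python
import random


def frontier_order(urls: list[str], seed: int) -> list[str]:
    """Return the URLs in a pseudo-random order fully determined by `seed`."""
    rng = random.Random(seed)  # private generator, independent of global state
    ordered = sorted(urls)     # canonical starting order
    rng.shuffle(ordered)
    return ordered


urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://docs.example.com/",
    "https://example.com/c",
]

first = frontier_order(urls, seed=42)
second = frontier_order(list(reversed(urls)), seed=42)
print(first == second)  # same seed and same URL set give the same order
```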

Scope

scope = "same_domain"

| Value | Behavior |
|---|---|
| same_domain | Follow links within the registrable domain (e.g., www.example.com and docs.example.com both match example.com) |
| same_host | Exact host match only |
| any | No scope restriction (use with caution) |
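The three scope values can be read as a predicate over a seed URL and a candidate link. The sketch below is illustrative only: `in_scope` and `naive_registrable_domain` are hypothetical names, and the registrable domain is approximated by the last two host labels. That shortcut is wrong for suffixes such as `co.uk`; a real implementation would consult the Public Suffix List.

```python
from urllib.parse import urlsplit


def naive_registrable_domain(host: str) -> str:
    """Approximate the registrable domain as the last two labels (simplification)."""
    return ".".join(host.lower().split(".")[-2:])


def in_scope(seed_url: str, candidate_url: str, scope: str) -> bool:
    """Decide whether `candidate_url` may be followed from `seed_url`."""
    seed_host = urlsplit(seed_url).hostname or ""
    cand_host = urlsplit(candidate_url).hostname or ""
    if scope == "any":
        return True
    if scope == "same_host":
        return seed_host == cand_host
    if scope == "same_domain":
        return naive_registrable_domain(seed_host) == naive_registrable_domain(cand_host)
    raise ValueError(f"unknown scope: {scope!r}")


seed = "https://www.example.com/"
link = "https://docs.example.com/guide"
print(in_scope(seed, link, "same_domain"))  # True
print(in_scope(seed, link, "same_host"))    # False
```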

Politeness Policy

[politeness]
min_host_delay_ms = 1000      # Minimum delay between same-host requests
max_concurrent_hosts = 100     # Maximum hosts being fetched in parallel

Presets (when using the API directly):

| Preset | Host delay | Concurrent hosts |
|---|---|---|
| default_policy() | 1 second | 100 |
| aggressive() | 100 ms | 500 |
| no_delay() | 0 | unlimited |
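To make the effect of `min_host_delay_ms` concrete, here is a small Python sketch of per-host spacing. It is not the Palimpsest API: `HostDelay` and `schedule` are hypothetical, the delay is assumed to be measured from the start of one request to the start of the next, and `max_concurrent_hosts` is not modelled at all.

```python
from dataclasses import dataclass, field


@dataclass
class HostDelay:
    """Tracks the earliest time each host may be contacted again."""

    min_host_delay_ms: int
    next_allowed_ms: dict[str, int] = field(default_factory=dict)

    def schedule(self, host: str, now_ms: int) -> int:
        """Reserve and return the earliest start time (ms) for a request to `host`."""
        start = max(now_ms, self.next_allowed_ms.get(host, 0))
        self.next_allowed_ms[host] = start + self.min_host_delay_ms
        return start


limiter = HostDelay(min_host_delay_ms=1000)
print(limiter.schedule("example.com", now_ms=0))         # 0: first request starts at once
print(limiter.schedule("example.com", now_ms=200))       # 1000: pushed back by the delay
print(limiter.schedule("docs.example.com", now_ms=200))  # 200: a different host is unaffected
```

With a delay of 0, as in the `no_delay()` row of the table, every call would return `now_ms` unchanged.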

Depth and Limits

max_depth = 3       # Max link-following depth from seed (0 = seed page only)
max_urls = 500      # Hard cap on total URLs fetched
concurrency = 10    # Parallel fetch tasks
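
How `max_depth` and `max_urls` interact can be seen in a breadth-first traversal over a toy link graph. The sketch below is a sequential illustration under stated assumptions: `bounded_crawl` and the `links` dictionary are hypothetical, "fetching" is just appending to a list, and `concurrency` is left out entirely.

```python
from collections import deque


def bounded_crawl(seeds, links, max_depth, max_urls):
    """Breadth-first walk that stops at `max_depth` hops and `max_urls` fetches."""
    unique_seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(unique_seeds)
    queue = deque((url, 0) for url in unique_seeds)
    fetched = []
    while queue and len(fetched) < max_urls:
        url, depth = queue.popleft()
        fetched.append(url)  # stand-in for an actual fetch
        if depth == max_depth:
            continue         # do not follow links beyond the depth limit
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return fetched


links = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["e"]}
print(bounded_crawl(["a"], links, max_depth=0, max_urls=500))  # ['a']
print(bounded_crawl(["a"], links, max_depth=2, max_urls=500))  # ['a', 'b', 'c', 'd']
print(bounded_crawl(["a"], links, max_depth=3, max_urls=2))    # ['a', 'b']
```

The first call shows the documented meaning of `max_depth = 0` (seed page only); the last shows `max_urls` cutting the crawl short before the depth limit is reached.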

Browser Mode

browser_mode = true

Enables headless Chrome capture via CDP (the Chrome DevTools Protocol). Each page is loaded in a fresh browser context with determinism overrides applied: Date.now(), Math.random(), and performance.now() are all seeded from CrawlSeed.

Output Directory

output_dir = "./output"

When set, artifacts are persisted to disk: content-addressed blobs, SQLite index, WARC++ file, and frontier state. When omitted, the crawl runs in-memory only.
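
For readers unfamiliar with content-addressed storage, the sketch below shows the general technique: a blob's file name is derived from a hash of its bytes, so identical content is stored only once. The hash function (SHA-256), the two-character fan-out directory, and the name `store_blob` are all assumptions made for illustration; this page does not specify Palimpsest's on-disk layout.

```python
import hashlib
import tempfile
from pathlib import Path


def store_blob(root: Path, data: bytes) -> Path:
    """Write `data` under a path derived from its SHA-256 digest (hypothetical layout)."""
    digest = hashlib.sha256(data).hexdigest()
    path = root / digest[:2] / digest
    if not path.exists():  # identical content is written only once
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)
    return path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = store_blob(root, b"<html>hello</html>")
    again = store_blob(root, b"<html>hello</html>")
    print(first == again)  # True: same bytes map to the same path
```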

CLI Flag Mapping

| Config field | CLI flag | Default |
|---|---|---|
| seeds | positional args | (required) |
| seed | -s, --seed | 42 |
| max_depth | -d, --depth | 2 |
| max_urls | -m, --max-urls | 100 |
| min_host_delay_ms | --politeness-ms | 1000 |
| user_agent | --user-agent | PalimpsestBot/0.1 |
| browser_mode | --browser | false |
| output_dir | -o, --output-dir | (none) |
| config file | -c, --config | (none) |
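
The mapping above can be mirrored in a small `argparse` parser, which may help when wrapping the CLI from a script or checking which config field a flag corresponds to. This is a Python stand-in, not Palimpsest's own argument parser. Flag names and defaults are taken from the table; two details are assumptions: the positional seeds are made optional here because a config file can supply them, and how CLI flags and config-file values are merged is not covered by this page, so the sketch does not attempt it.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror the documented flags, storing each under its config field name."""
    p = argparse.ArgumentParser(prog="palimpsest crawl")
    p.add_argument("seeds", nargs="*")  # assumption: optional when -c is used
    p.add_argument("-s", "--seed", type=int, default=42)
    p.add_argument("-d", "--depth", dest="max_depth", type=int, default=2)
    p.add_argument("-m", "--max-urls", dest="max_urls", type=int, default=100)
    p.add_argument("--politeness-ms", dest="min_host_delay_ms", type=int, default=1000)
    p.add_argument("--user-agent", dest="user_agent", default="PalimpsestBot/0.1")
    p.add_argument("--browser", dest="browser_mode", action="store_true")
    p.add_argument("-o", "--output-dir", dest="output_dir", default=None)
    p.add_argument("-c", "--config", dest="config", default=None)
    return p


args = build_parser().parse_args(["-d", "3", "--browser", "https://example.com/"])
print(args.max_depth, args.max_urls, args.browser_mode, args.seeds)
```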