Configuration
TOML Config File
Pass a TOML configuration file instead of CLI flags:
palimpsest crawl -c crawl.toml
Example Configuration
seeds = ["https://example.com/", "https://docs.example.com/"]
[crawl]
seed = 42
max_depth = 3
max_urls = 500
concurrency = 10
user_agent = "PalimpsestBot/0.1"
browser_mode = false
scope = "same_domain"
output_dir = "./output"
[politeness]
min_host_delay_ms = 1000
max_concurrent_hosts = 100
Configuration Fields
Seeds
seeds = ["https://example.com/", "https://docs.example.com/"]
One or more seed URLs. The crawl starts from these and discovers links outward.
Crawl Seed
seed = 42
The deterministic seed value. It controls all randomness in the system: frontier ordering, host rotation, and browser JS overrides. Running with the same seed produces an identical crawl.
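The idea of seed-controlled ordering can be sketched in a few lines of Python. This is an illustration of the general technique only; `order_frontier` is a hypothetical helper, not Palimpsest's actual frontier algorithm.

```python
import random

def order_frontier(urls, seed):
    """Return urls in a pseudo-random order that depends only on the seed."""
    rng = random.Random(seed)   # private generator, no shared global state
    ordered = sorted(urls)      # normalise the input order before shuffling
    rng.shuffle(ordered)
    return ordered

urls = [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c",
]

first = order_frontier(urls, seed=42)
second = order_frontier(list(reversed(urls)), seed=42)
print(first == second)
```

Because the generator is created from the seed and the input is sorted first, the result does not depend on discovery order or on any other source of randomness in the process.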
Scope
scope = "same_domain"
| Value | Behavior |
|---|---|
| same_domain | Follow links within the registrable domain (e.g., www.example.com and docs.example.com both match example.com) |
| same_host | Exact host match only |
| any | No scope restriction (use with caution) |
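The three scope values can be read as a predicate over a candidate URL and the seed it was discovered from. The sketch below is a rough Python approximation: `in_scope` and `registrable_domain` are hypothetical names, and the registrable domain is naively taken as the last two host labels, which is wrong for suffixes such as `co.uk` (a real implementation would consult the Public Suffix List).

```python
from urllib.parse import urlsplit

def registrable_domain(host):
    """Naive stand-in: keep the last two labels of the host name."""
    labels = host.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])

def in_scope(candidate_url, seed_url, scope):
    """Decide whether candidate_url may be followed under the given scope."""
    cand = urlsplit(candidate_url).hostname or ""
    seed = urlsplit(seed_url).hostname or ""
    if scope == "any":
        return True
    if scope == "same_host":
        return cand == seed
    if scope == "same_domain":
        return registrable_domain(cand) == registrable_domain(seed)
    raise ValueError(f"unknown scope: {scope!r}")

seed_url = "https://www.example.com/"
print(in_scope("https://docs.example.com/guide", seed_url, "same_domain"))
print(in_scope("https://docs.example.com/guide", seed_url, "same_host"))
```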
Politeness Policy
[politeness]
min_host_delay_ms = 1000 # Minimum delay between same-host requests
max_concurrent_hosts = 100 # Maximum hosts being fetched in parallel
Presets (when using the API directly):
| Preset | Host Delay | Concurrent Hosts |
|---|---|---|
| default_policy() | 1000 ms | 100 |
| aggressive() | 100 ms | 500 |
| no_delay() | 0 ms | unlimited |
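As a rough illustration of what a minimum per-host delay means in practice, the Python sketch below models the three presets as data and computes how long a host must wait before its next request. `PolitenessPolicy` and `HostGate` are hypothetical names chosen for this example; they do not describe the library's real types, and the concurrent-host limit is carried as data but not enforced here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class PolitenessPolicy:
    min_host_delay_ms: int
    max_concurrent_hosts: Optional[int]  # None stands for "unlimited"

# Values mirror the preset table above.
DEFAULT = PolitenessPolicy(1000, 100)
AGGRESSIVE = PolitenessPolicy(100, 500)
NO_DELAY = PolitenessPolicy(0, None)

class HostGate:
    """Tracks the last fetch time per host, using caller-supplied timestamps in ms."""

    def __init__(self, policy):
        self.policy = policy
        self._last_fetch_ms = {}

    def wait_ms(self, host, now_ms):
        """Milliseconds to wait before `host` may be fetched again."""
        last = self._last_fetch_ms.get(host)
        if last is None:
            return 0
        return max(0, last + self.policy.min_host_delay_ms - now_ms)

    def record_fetch(self, host, now_ms):
        self._last_fetch_ms[host] = now_ms

gate = HostGate(DEFAULT)
gate.record_fetch("example.com", now_ms=0)
print(gate.wait_ms("example.com", now_ms=400))
print(gate.wait_ms("other.org", now_ms=400))
```

Passing timestamps in explicitly, rather than reading the system clock, keeps the sketch deterministic and easy to test.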
Depth and Limits
max_depth = 3 # Max link-following depth from seed (0 = seed page only)
max_urls = 500 # Hard cap on total URLs fetched
concurrency = 10 # Parallel fetch tasks
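The interaction between `max_depth` and `max_urls` can be sketched as a breadth-first walk over an in-memory link graph. This Python example is a simplified, sequential model (it ignores `concurrency`, scope, and politeness), and `crawl_order` is a hypothetical function, not part of Palimpsest.

```python
from collections import deque

def crawl_order(seeds, links, max_depth, max_urls):
    """Breadth-first walk that honours both the depth limit and the URL cap."""
    unique_seeds = list(dict.fromkeys(seeds))   # drop duplicate seeds, keep order
    seen = set(unique_seeds)
    queue = deque((url, 0) for url in unique_seeds)
    fetched = []
    while queue and len(fetched) < max_urls:
        url, depth = queue.popleft()
        fetched.append(url)                      # stands in for the real fetch
        if depth == max_depth:
            continue                             # do not follow links any deeper
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return fetched

# Toy link graph: page -> outgoing links.
links = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": ["f"]}

print(crawl_order(["a"], links, max_depth=0, max_urls=500))
print(crawl_order(["a"], links, max_depth=2, max_urls=500))
print(crawl_order(["a"], links, max_depth=3, max_urls=2))
```

With `max_depth=0` only the seed is fetched; with a small `max_urls` the walk stops as soon as the cap is reached, even if the depth limit would allow more.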
Browser Mode
browser_mode = true
Enables headless Chrome capture via the Chrome DevTools Protocol (CDP). Each page is loaded in a fresh browser context with determinism overrides applied: Date.now(), Math.random(), and performance.now() are all seeded from CrawlSeed.
Output Directory
output_dir = "./output"
When set, artifacts are persisted to disk: content-addressed blobs, SQLite index, WARC++ file, and frontier state. When omitted, the crawl runs in-memory only.
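For readers unfamiliar with content-addressed storage, the Python sketch below shows the general idea: a blob's file name is derived from a hash of its bytes, so identical content is stored once. The hash function (SHA-256), the `blobs/` directory layout, and the `store_blob` helper are all assumptions made for illustration; the actual on-disk format is not described here.

```python
import hashlib
import tempfile
from pathlib import Path

def store_blob(output_dir, body: bytes) -> Path:
    """Write body under a path derived from its SHA-256 digest and return that path."""
    digest = hashlib.sha256(body).hexdigest()
    # Assumed layout: blobs/<first two hex chars>/<full digest>
    path = Path(output_dir) / "blobs" / digest[:2] / digest
    if not path.exists():                      # identical content is stored once
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(body)
    return path

with tempfile.TemporaryDirectory() as tmp:
    first = store_blob(tmp, b"<html>hello</html>")
    second = store_blob(tmp, b"<html>hello</html>")
    print(first == second)
    print(first.read_bytes() == b"<html>hello</html>")
```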
CLI Flag Mapping
| Config Field | CLI Flag | Default |
|---|---|---|
| seeds | positional args | (required) |
| seed | -s, --seed | 42 |
| max_depth | -d, --depth | 2 |
| max_urls | -m, --max-urls | 100 |
| min_host_delay_ms | --politeness-ms | 1000 |
| user_agent | --user-agent | PalimpsestBot/0.1 |
| browser_mode | --browser | false |
| output_dir | -o, --output-dir | (none) |
| config file | -c, --config | (none) |
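The mapping in this table can be mirrored with Python's `argparse` to show how each flag lands on a config field name and default. This is only an illustrative model of the table, not the tool's real argument parser; in particular, how the real CLI combines `-c` with the other flags is not covered.

```python
import argparse

def build_parser():
    """Build a parser whose destinations match the config field names."""
    p = argparse.ArgumentParser(prog="palimpsest crawl")
    p.add_argument("seeds", nargs="+", help="one or more seed URLs")
    p.add_argument("-s", "--seed", type=int, default=42)
    p.add_argument("-d", "--depth", dest="max_depth", type=int, default=2)
    p.add_argument("-m", "--max-urls", dest="max_urls", type=int, default=100)
    p.add_argument("--politeness-ms", dest="min_host_delay_ms", type=int, default=1000)
    p.add_argument("--user-agent", dest="user_agent", default="PalimpsestBot/0.1")
    p.add_argument("--browser", dest="browser_mode", action="store_true")
    p.add_argument("-o", "--output-dir", dest="output_dir", default=None)
    p.add_argument("-c", "--config", dest="config", default=None)
    return p

args = build_parser().parse_args(
    ["-d", "3", "--browser", "-o", "./output", "https://example.com/"]
)
print(args.max_depth, args.max_urls, args.browser_mode, args.output_dir)
```

Fields not given on the command line fall back to the defaults listed in the table, which is why `max_urls` stays at 100 in this invocation.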