# Your First Crawl
## Basic Crawl

```sh
palimpsest crawl https://example.com -d 2 -m 50 -o ./output
```
| Flag | Meaning |
|---|---|
| `-d 2` | Maximum crawl depth from the seed URL |
| `-m 50` | Fetch at most 50 URLs |
| `-o ./output` | Persist artifacts to disk |
The default seed is 42. The default politeness delay is 1 second per host.
## Output Structure

After the crawl completes, `./output` contains:
```
output/
├── blobs/              # Content-addressed storage (BLAKE3 hashes)
│   ├── af/
│   │   └── 1349b9f5...   # Blob file named by hash
│   └── c7/
│       └── d2fe1a6b...
├── index.sqlite        # Temporal index database
├── output.warc         # WARC++ file (ISO 28500 compatible)
└── frontier.json       # Saved frontier state (for resumption)
```
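Content addressing means a blob's on-disk path is derived from its content hash, so identical content is stored exactly once. A minimal sketch of the sharded layout shown above, assuming blobs are named by their full hex digest with a two-character shard prefix, and substituting SHA-256 for BLAKE3 since BLAKE3 is not in the Python standard library:

```python
import hashlib
from pathlib import Path

def blob_path(root: Path, content: bytes) -> Path:
    """Derive a content-addressed path: blobs/<first 2 hex chars>/<digest>.

    SHA-256 stands in for BLAKE3 here; the sharding scheme is the point.
    """
    digest = hashlib.sha256(content).hexdigest()
    return root / "blobs" / digest[:2] / digest

path = blob_path(Path("./output"), b"<html>example</html>")
print(path)
```

The two-character prefix simply spreads blobs across subdirectories so no single directory grows unboundedly.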
## Replay a Captured URL

Reconstruct the captured version of a page from stored artifacts:

```sh
palimpsest replay https://example.com/ --data-dir ./output
```
This retrieves the stored blob, HTTP headers, and execution context to reproduce the original response.
## View Capture History

List all captures of a URL with timestamps and content hashes:

```sh
palimpsest history https://example.com/ --data-dir ./output
```
## Extract Text and RAG Chunks

Extract clean text and provenance-tagged chunks from a captured page:

```sh
palimpsest extract https://example.com/ --data-dir ./output --json
```
This strips HTML, removes scripts and styles, splits the text into chunks (1,000 characters by default, with a 200-character overlap), and tags each chunk with `source_url`, `captured_at`, `source_hash`, `chunk_hash`, and `char_offset`.
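The chunking described above can be sketched as a sliding window. This is an illustrative reimplementation, not Palimpsest's actual code; the field names follow the tags listed above, and SHA-256 stands in for the content hash:

```python
import hashlib

def chunk_text(text: str, source_url: str, captured_at: str,
               size: int = 1000, overlap: int = 200) -> list[dict]:
    """Split text into overlapping chunks, tagging each with provenance."""
    source_hash = hashlib.sha256(text.encode()).hexdigest()
    chunks, step = [], size - overlap
    for offset in range(0, max(len(text), 1), step):
        piece = text[offset:offset + size]
        chunks.append({
            "text": piece,
            "source_url": source_url,
            "captured_at": captured_at,
            "source_hash": source_hash,  # hash of the whole page text
            "chunk_hash": hashlib.sha256(piece.encode()).hexdigest(),
            "char_offset": offset,       # position of this chunk in the page
        })
        if offset + size >= len(text):
            break
    return chunks

chunks = chunk_text("x" * 2500, "https://example.com/", "2024-01-01T00:00:00Z")
print([c["char_offset"] for c in chunks])  # → [0, 800, 1600]
```

The overlap means each chunk repeats the last 200 characters of its predecessor, so a sentence straddling a boundary survives intact in at least one chunk.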
## Browser Capture

Capture JavaScript-rendered pages with headless Chrome:

```sh
palimpsest crawl https://example.com --browser -d 1 -m 10 -o ./output
```
This captures:
- Rendered DOM after JavaScript execution
- All sub-resources (CSS, JS, images, fonts)
- Resource dependency graph with load ordering
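A valid load ordering over a resource dependency graph is a topological sort: every resource is loaded after everything it depends on. A small sketch using Python's standard `graphlib`; the graph here is a made-up example, not Palimpsest's on-disk format:

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each resource maps to what it depends on.
deps = {
    "index.html": {"app.css", "app.js"},
    "app.js": {"vendor.js"},
    "app.css": {"font.woff2"},
}

# static_order() yields dependencies before their dependents.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Replaying a browser capture in this order reproduces a loadable page even offline, since nothing is requested before its prerequisites exist.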
## Using a Deterministic Seed
The seed controls all randomness (frontier ordering, host rotation, and browser JS overrides):
```sh
# These two runs produce identical output
palimpsest crawl https://example.com -s 42 -d 2 -m 50 -o ./run-a
palimpsest crawl https://example.com -s 42 -d 2 -m 50 -o ./run-b

# Verify: compare relative blob paths (no output = identical).
# The cd strips the ./run-a vs ./run-b prefix, which would otherwise
# make every line differ.
diff <(cd ./run-a/blobs && find . -type f | sort) \
     <(cd ./run-b/blobs && find . -type f | sort)
```
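The determinism property can be illustrated with an ordinary seeded RNG: the same seed produces the same frontier ordering every time. The frontier contents below are invented for the example:

```python
import random

def order_frontier(urls: list[str], seed: int) -> list[str]:
    """Shuffle the frontier with a seeded RNG; same seed, same order."""
    rng = random.Random(seed)
    shuffled = urls[:]
    rng.shuffle(shuffled)
    return shuffled

frontier = [f"https://example.com/page/{i}" for i in range(5)]
run_a = order_frontier(frontier, seed=42)
run_b = order_frontier(frontier, seed=42)
assert run_a == run_b  # identical, like run-a and run-b above
```

Because every source of randomness is drawn from the seeded generator, rerunning with the same seed replays exactly the same decisions.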
## Shadow Comparison

Compare output against a legacy crawler:

```sh
# Crawl with wget
wget --warc-file=legacy -r -l 1 https://example.com/

# Crawl with Palimpsest
palimpsest crawl https://example.com -d 1 -o ./palimpsest-out

# Compare
palimpsest shadow-compare --legacy ./ --palimpsest ./palimpsest-out
```
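Conceptually, a shadow comparison diffs the two crawls' captures, for example keyed by URL and content hash. A toy sketch with hypothetical capture records (not the `shadow-compare` implementation):

```python
def diff_crawls(legacy: dict[str, str], palimpsest: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {url: content_hash} maps from separate crawls."""
    return {
        "only_legacy": sorted(legacy.keys() - palimpsest.keys()),
        "only_palimpsest": sorted(palimpsest.keys() - legacy.keys()),
        "content_mismatch": sorted(
            u for u in legacy.keys() & palimpsest.keys() if legacy[u] != palimpsest[u]
        ),
    }

# Invented example data: both crawls saw "/", each saw one page the other missed.
legacy = {"https://example.com/": "aaa", "https://example.com/a": "bbb"}
pal = {"https://example.com/": "aaa", "https://example.com/b": "ccc"}
report = diff_crawls(legacy, pal)
print(report)
```

URL-only and hash-mismatch buckets are worth reporting separately: the first reveals coverage differences, the second reveals rendering or normalization differences on pages both crawlers fetched.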