Your First Crawl

Basic Crawl

palimpsest crawl https://example.com -d 2 -m 50 -o ./output
Flag          Meaning
-d 2          Crawl at most 2 levels deep from the seed URL
-m 50         Fetch at most 50 URLs
-o ./output   Persist artifacts to disk

The default seed is 42. The default politeness delay is 1 second per host.
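A per-host politeness delay like the one described above can be sketched as a small gate that remembers the last fetch time for each host. This is an illustrative sketch, not Palimpsest's internals; the class and method names here are hypothetical.

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Illustrative per-host delay gate (hypothetical names,
    not Palimpsest internals)."""

    def __init__(self, delay_seconds=1.0):
        self.delay = delay_seconds
        self.last_fetch = {}  # host -> monotonic timestamp of last fetch

    def wait(self, url):
        """Block until at least `delay_seconds` have passed since
        the previous fetch to this URL's host, then record the fetch."""
        host = urlparse(url).netloc
        last = self.last_fetch.get(host)
        if last is not None:
            remaining = self.delay - (time.monotonic() - last)
            if remaining > 0:
                time.sleep(remaining)
        self.last_fetch[host] = time.monotonic()
```

Because the delay is tracked per host, a crawler can interleave fetches to different hosts without any of them waiting on the others.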

Output Structure

After the crawl completes, ./output contains:

output/
  blobs/          # Content-addressed storage (BLAKE3 hashes)
    af/
      1349b9f5... # Blob file named by hash
    c7/
      d2fe1a6b...
  index.sqlite    # Temporal index database
  output.warc     # WARC++ file (ISO 28500 compatible)
  frontier.json   # Saved frontier state (for resumption)
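The blobs/ layout above is a common content-addressed fan-out: the first two hex characters of the hash become a subdirectory and the remaining characters name the file. A minimal sketch of that path derivation, assuming this split (computing an actual BLAKE3 digest requires the third-party `blake3` package, so only the path logic is shown):

```python
from pathlib import Path

def blob_path(root, hex_digest):
    """Derive the on-disk path for a content-addressed blob,
    assuming the fan-out shown in the tree above: first two hex
    chars form the directory, the rest name the file."""
    return Path(root) / "blobs" / hex_digest[:2] / hex_digest[2:]

print(blob_path("./output", "af1349b9f5f9a1a6a0404dea36dcc949"))
# → output/blobs/af/1349b9f5f9a1a6a0404dea36dcc949
```

The two-character fan-out keeps any single directory from accumulating millions of entries on large crawls.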

Replay a Captured URL

Reconstruct the captured version of a page from stored artifacts:

palimpsest replay https://example.com/ --data-dir ./output

This retrieves the stored blob, HTTP headers, and execution context to reproduce the original response.

View Capture History

List all captures of a URL with timestamps and content hashes:

palimpsest history https://example.com/ --data-dir ./output

Extract Text and RAG Chunks

Extract clean text and provenance-tagged chunks from a captured page:

palimpsest extract https://example.com/ --data-dir ./output --json

This strips HTML, removes scripts and styles, splits into chunks (default 1000 chars with 200 overlap), and tags each chunk with source_url, captured_at, source_hash, chunk_hash, and char_offset.
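The sliding-window chunking described above (1000-char chunks, 200-char overlap) can be sketched as follows. This is an illustration of the windowing arithmetic only; the exact boundary rules of `palimpsest extract` may differ, and `blake2b` stands in here for the tool's hashing.

```python
import hashlib

def chunk_text(text, size=1000, overlap=200):
    """Split `text` into overlapping windows, tagging each chunk
    with its char_offset and a content hash (blake2b as a stand-in)."""
    step = size - overlap  # 800 chars of new text per chunk
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        piece = text[start:start + size]
        chunks.append({
            "char_offset": start,
            "chunk_hash": hashlib.blake2b(piece.encode()).hexdigest(),
            "text": piece,
        })
    return chunks
```

With the defaults, consecutive chunks share their last/first 200 characters, so a sentence split by one boundary is intact in the neighboring chunk.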

Browser Capture

Capture JavaScript-rendered pages with headless Chrome:

palimpsest crawl https://example.com --browser -d 1 -m 10 -o ./output

This captures:

  • Rendered DOM after JavaScript execution
  • All sub-resources (CSS, JS, images, fonts)
  • Resource dependency graph with load ordering
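A load ordering over a resource dependency graph like the one captured above is just a topological sort: every resource comes after the resources it depends on. A minimal sketch with Python's standard `graphlib` and a hypothetical dependency map (these resource names are invented for illustration):

```python
from graphlib import TopologicalSorter

# Hypothetical captured dependencies: each resource maps to the
# set of resources that must load before it.
deps = {
    "index.html": set(),
    "style.css": {"index.html"},
    "app.js":    {"index.html"},
    "font.woff2": {"style.css"},  # the font is referenced from the stylesheet
}

# static_order() yields resources with all dependencies first.
order = list(TopologicalSorter(deps).static_order())
```

Replaying resources in this order reproduces a load sequence consistent with the original page's dependencies.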

Using a Deterministic Seed

The seed controls all randomness — frontier ordering, host rotation, and browser JS overrides:

# These two runs produce identical output
palimpsest crawl https://example.com -s 42 -d 2 -m 50 -o ./run-a
palimpsest crawl https://example.com -s 42 -d 2 -m 50 -o ./run-b

# Verify
diff <(find ./run-a/blobs -type f | sort) <(find ./run-b/blobs -type f | sort)
# No output = identical: blob names are content hashes, so equal
# filename sets imply byte-identical content
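The same check can be done programmatically: collect each run's blob names relative to its blobs/ directory and compare the sets. A minimal sketch (a helper invented for illustration, not a Palimpsest command):

```python
from pathlib import Path

def blob_set(root):
    """Return the set of blob paths under root/blobs, relative to
    that directory. Because blobs are named by BLAKE3 content hash,
    equal name sets imply byte-identical stored content."""
    base = Path(root) / "blobs"
    return {p.relative_to(base).as_posix()
            for p in base.rglob("*") if p.is_file()}

# identical = blob_set("./run-a") == blob_set("./run-b")
```

This compares only names, which is sufficient for content-addressed storage; for non-content-addressed files such as index.sqlite, a byte-level diff would still be needed.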

Shadow Comparison

Compare output against a legacy crawler:

# Crawl with wget
wget --warc-file=legacy -r -l 1 https://example.com/

# Crawl with Palimpsest
palimpsest crawl https://example.com -d 1 -o ./palimpsest-out

# Compare
palimpsest shadow-compare --legacy ./ --palimpsest ./palimpsest-out