Distributed Crawling

Palimpsest scales horizontally by pairing a single HTTP frontier server with any number of worker processes.

Architecture

                    ┌──────────────┐
    curl POST       │   Frontier   │ ◄── Deterministic ordering
    /seeds ────────►│   Server     │     (seed-driven)
                    │  :8090       │
                    └──┬───┬───┬──┘
                       │   │   │
              POST /pop│   │   │POST /discovered
                       │   │   │
                    ┌──┴┐ ┌┴──┐┌┴──┐
                    │W1 │ │W2 ││W3 │ ◄── Stateless workers
                    └───┘ └───┘└───┘
                       │   │   │
                       ▼   ▼   ▼
                    ┌──────────────┐
                    │ Shared Disk  │ (blobs, index, WARC)
                    └──────────────┘

Start the Frontier Server

palimpsest serve --port 8090 --seed 42 --politeness-ms 500

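The frontier owns URL ordering and politeness for the whole crawl: --seed fixes the deterministic ordering, and --politeness-ms sets the politeness delay (in milliseconds) that is enforced across all workers.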
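As a rough illustration of central politeness enforcement, the sketch below tracks the earliest time each host may be handed out again. It is not Palimpsest's implementation: the `PolitenessGate` class, its method names, and the assumption that the delay applies per host are all hypothetical, chosen only to show why enforcing the delay in one place keeps it consistent across workers.

```python
from urllib.parse import urlsplit


class PolitenessGate:
    """Hypothetical helper: per-host minimum delay between handed-out URLs."""

    def __init__(self, politeness_ms: int):
        self.delay = politeness_ms / 1000.0
        self.next_allowed = {}  # host -> earliest allowed timestamp (seconds)

    def try_acquire(self, url: str, now: float) -> bool:
        """Return True and reserve the host if `url` may be handed out at `now`."""
        host = urlsplit(url).hostname or ""
        if now < self.next_allowed.get(host, 0.0):
            return False  # this host was handed out too recently
        self.next_allowed[host] = now + self.delay
        return True


# Timestamps are passed in explicitly so the behaviour is easy to follow.
gate = PolitenessGate(politeness_ms=500)
first = gate.try_acquire("https://example.com/a", now=0.0)          # allowed
too_soon = gate.try_acquire("https://example.com/b", now=0.2)       # same host, refused
other_host = gate.try_acquire("https://docs.example.com/", now=0.2)  # different host
after_delay = gate.try_acquire("https://example.com/b", now=0.5)    # delay has elapsed
```

Because the gate lives in one process, adding workers does not multiply the request rate against any single host.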

Seed URLs

curl -X POST http://localhost:8090/seeds \
  -H 'Content-Type: application/json' \
  -d '{"urls": ["https://example.com/", "https://docs.example.com/"]}'

Start Workers

# Terminal 2
palimpsest worker --server http://localhost:8090 --output-dir ./data

# Terminal 3 (scale out)
palimpsest worker --server http://localhost:8090 --output-dir ./data

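Each worker runs the same loop: pop a URL, fetch it, store the artifacts, and push discovered URLs back to the frontier. Both commands point at the same ./data directory, which corresponds to the shared disk in the architecture diagram.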

Worker Flow

  1. POST /pop — receive next URL from frontier
  2. Fetch the URL (HTTP or browser)
  3. Store blob to content-addressed storage
  4. Insert entry into temporal index
  5. Write WARC++ records
  6. POST /discovered — push new URLs back to frontier
  7. Repeat
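The steps above can be sketched as a loop over injected callables. This is a minimal illustration, not the actual worker: the function names, the "return None when the queue is empty" convention, and the in-memory stand-ins for the frontier and storage are assumptions, since the request and response formats of /pop and /discovered are not described here.

```python
from collections import deque
from typing import Callable, Iterable, Optional


def run_worker(
    pop: Callable[[], Optional[str]],
    fetch: Callable[[str], bytes],
    store: Callable[[str, bytes], None],
    extract_links: Callable[[str, bytes], Iterable[str]],
    push_discovered: Callable[[list], None],
) -> int:
    """Run the pop -> fetch -> store -> push loop until the frontier is empty."""
    processed = 0
    while True:
        url = pop()                    # step 1: POST /pop in the real worker
        if url is None:
            return processed           # assumed "queue empty" signal
        body = fetch(url)              # step 2
        store(url, body)               # steps 3-5: blob, index, WARC++ records
        push_discovered(list(extract_links(url, body)))  # step 6
        processed += 1                 # step 7: repeat


# In-memory stand-ins so the sketch runs without a server or network.
pages = {
    "https://example.com/": (b"home", ["https://example.com/a"]),
    "https://example.com/a": (b"page a", ["https://example.com/"]),
}
queue = deque(["https://example.com/"])
seen = set(queue)
stored = {}


def fake_pop():
    return queue.popleft() if queue else None


def fake_push(urls):
    for u in urls:
        if u not in seen:  # frontier-side de-duplication
            seen.add(u)
            queue.append(u)


processed = run_worker(
    pop=fake_pop,
    fetch=lambda url: pages[url][0],
    store=stored.__setitem__,
    extract_links=lambda url, body: pages[url][1],
    push_discovered=fake_push,
)
```

Keeping all crawl state behind `pop` and `push_discovered` is what lets the worker itself stay stateless.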

Monitoring

# Check frontier status
curl http://localhost:8090/status

# Response:
# {"queue_size": 1234, "seen_count": 5678, "host_count": 42, "seed_value": 42}
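For scripted monitoring, the status response can be parsed with a few lines of Python. The sketch below works on the sample response shown above; the helper names are hypothetical, and treating `queue_size == 0` as "nothing left to pop" is an assumption about what the field means. In practice the raw string would come from a GET to /status, for example via `urllib.request.urlopen`.

```python
import json

# Sample body copied from the response shown above.
SAMPLE = '{"queue_size": 1234, "seen_count": 5678, "host_count": 42, "seed_value": 42}'


def summarize_status(raw: str) -> str:
    """Format the status fields as a single log-friendly line."""
    status = json.loads(raw)
    return (
        f"queued={status['queue_size']} seen={status['seen_count']} "
        f"hosts={status['host_count']} seed={status['seed_value']}"
    )


def frontier_drained(raw: str) -> bool:
    """Assumption: an empty queue means no URLs are waiting to be popped."""
    return json.loads(raw)["queue_size"] == 0


summary = summarize_status(SAMPLE)
drained = frontier_drained(SAMPLE)
```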

Determinism Guarantee

The frontier server produces the same seed-driven ordering regardless of how many workers connect or in what order they pop URLs. Given the same seed, the frontier ordering is always the same.
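One way to obtain an ordering that depends only on the seed and the URLs, not on arrival order, is to rank each URL by a hash of the seed and the URL. The sketch below illustrates that idea; it is an assumed technique for illustration, not a description of how Palimpsest computes its ordering, and `SeededFrontier`, `priority`, and `drain` are hypothetical names. It only demonstrates insertion-order independence for a fixed set of queued URLs.

```python
import hashlib
import heapq


def priority(seed: int, url: str) -> int:
    """Derive a stable rank from the seed and the URL (illustrative scheme)."""
    digest = hashlib.sha256(f"{seed}:{url}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class SeededFrontier:
    """Hypothetical frontier: pops the queued URL with the lowest seeded rank."""

    def __init__(self, seed: int):
        self.seed = seed
        self.heap = []
        self.seen = set()

    def push(self, url: str) -> None:
        if url in self.seen:
            return  # already queued or crawled
        self.seen.add(url)
        heapq.heappush(self.heap, (priority(self.seed, url), url))

    def pop(self):
        return heapq.heappop(self.heap)[1] if self.heap else None


def drain(seed: int, urls: list) -> list:
    """Push all URLs, then pop until empty, returning the pop order."""
    frontier = SeededFrontier(seed)
    for u in urls:
        frontier.push(u)
    order = []
    while (u := frontier.pop()) is not None:
        order.append(u)
    return order


urls = ["https://example.com/", "https://docs.example.com/", "https://example.com/a"]
order_a = drain(42, urls)
order_b = drain(42, list(reversed(urls)))  # same URLs, different arrival order
```

Here `order_a` and `order_b` are identical because the rank of each URL is a pure function of the seed and the URL string.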