Distributed Crawling
Palimpsest supports horizontal scaling via an HTTP frontier server and N worker processes.
Architecture
                 ┌───────────────┐
curl POST        │   Frontier    │ ◄── Deterministic ordering
/seeds ─────────►│    Server     │     (seed-driven)
                 │     :8090     │
                 └─┬─────┬─────┬─┘
                   │     │     │
         POST /pop │     │     │ POST /discovered
                   │     │     │
                 ┌─┴─┐ ┌─┴─┐ ┌─┴─┐
                 │W1 │ │W2 │ │W3 │ ◄── Stateless workers
                 └─┬─┘ └─┬─┘ └─┬─┘
                   │     │     │
                   ▼     ▼     ▼
                 ┌───────────────┐
                 │  Shared Disk  │ (blobs, index, WARC)
                 └───────────────┘
Start the Frontier Server
palimpsest serve --port 8090 --seed 42 --politeness-ms 500
The frontier maintains a deterministic URL ordering and enforces politeness across all workers.
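To make the politeness idea concrete, here is a minimal sketch of a per-host delay gate. It rests on an assumption: that `--politeness-ms` is a minimum delay between requests to the same host. This document does not spell out that semantics, and the `PolitenessGate` class is hypothetical, not part of Palimpsest.

```python
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks, per host, the earliest time the next URL may be handed out.

    Hypothetical helper: assumes politeness is a per-host minimum delay.
    """

    def __init__(self, politeness_ms: int) -> None:
        self._delay = politeness_ms / 1000.0
        self._next_allowed: dict[str, float] = {}

    def try_acquire(self, url: str, now: float) -> bool:
        """Return True and reserve the host if a request is allowed at `now`."""
        host = urlsplit(url).netloc
        if now < self._next_allowed.get(host, 0.0):
            return False  # too soon for this host
        self._next_allowed[host] = now + self._delay
        return True


gate = PolitenessGate(politeness_ms=500)
first = gate.try_acquire("https://example.com/a", now=0.0)       # first hit on the host
second = gate.try_acquire("https://example.com/b", now=0.2)      # same host, 200 ms later
other = gate.try_acquire("https://docs.example.com/", now=0.2)   # different host
later = gate.try_acquire("https://example.com/b", now=0.5)       # 500 ms have elapsed
```

Keeping this state in the frontier rather than in each worker is what lets the delay hold no matter which worker ends up fetching the URL.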
Seed URLs
curl -X POST http://localhost:8090/seeds \
  -H 'Content-Type: application/json' \
  -d '{"urls": ["https://example.com/", "https://docs.example.com/"]}'
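The same seeding request can be issued from a script. The sketch below builds it with Python's standard library. The `/seeds` path and the `urls` payload are taken from the curl command, while the helper name `build_seed_request` is illustrative only. The request is built but not sent, since sending requires a running frontier.

```python
import json
import urllib.request


def build_seed_request(server: str, urls: list[str]) -> urllib.request.Request:
    """Build (but do not send) the POST /seeds request from the curl example."""
    body = json.dumps({"urls": urls}).encode("utf-8")
    return urllib.request.Request(
        url=f"{server.rstrip('/')}/seeds",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_seed_request(
    "http://localhost:8090",
    ["https://example.com/", "https://docs.example.com/"],
)

# To submit the seeds against a running frontier:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```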
Start Workers
# Terminal 2
palimpsest worker --server http://localhost:8090 --output-dir ./data
# Terminal 3 (scale out)
palimpsest worker --server http://localhost:8090 --output-dir ./data
Each worker loops: pop URL -> fetch -> store artifacts -> push discovered URLs.
Worker Flow
- POST /pop: receive the next URL from the frontier
- Fetch the URL (HTTP or browser)
- Store blob to content-addressed storage
- Insert entry into temporal index
- Write WARC++ records
- POST /discovered: push new URLs back to the frontier
- Repeat
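The steps above can be sketched as a small loop. This is not Palimpsest's worker code: the callables stand in for the HTTP calls and storage steps, the request and response shapes of `/pop` and `/discovered` are not documented here, and stopping when `pop` returns `None` is an assumption made so the example terminates. A real worker may wait for more work instead.

```python
from collections import deque
from typing import Callable, Iterable, Optional


def run_worker(
    pop: Callable[[], Optional[str]],
    fetch: Callable[[str], bytes],
    store: Callable[[str, bytes], None],
    extract_links: Callable[[str, bytes], Iterable[str]],
    push_discovered: Callable[[list[str]], None],
) -> int:
    """Run pop -> fetch -> store -> push until pop() returns None."""
    processed = 0
    while True:
        url = pop()                      # stands in for POST /pop
        if url is None:                  # assumed "nothing left" signal
            return processed
        body = fetch(url)
        store(url, body)                 # blob, index entry, WARC++ records
        push_discovered(list(extract_links(url, body)))  # stands in for POST /discovered
        processed += 1


# In-memory stand-ins so the loop runs without a server or network.
queue = deque(["https://example.com/"])
seen = set(queue)
stored: dict[str, bytes] = {}
site = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/"],
    "https://example.com/b": [],
}


def fake_pop() -> Optional[str]:
    return queue.popleft() if queue else None


def fake_push(urls: list[str]) -> None:
    for url in urls:
        if url not in seen:              # deduplication happens on the frontier side
            seen.add(url)
            queue.append(url)


count = run_worker(
    pop=fake_pop,
    fetch=lambda url: " ".join(site[url]).encode(),
    store=stored.__setitem__,
    extract_links=lambda url, body: body.decode().split(),
    push_discovered=fake_push,
)
```

The worker keeps no state between iterations. Everything it needs comes from the frontier or goes to shared storage, which is why more workers can be added without coordination.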
Monitoring
# Check frontier status
curl http://localhost:8090/status
# Response:
# {"queue_size": 1234, "seen_count": 5678, "host_count": 42, "seed_value": 42}
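For scripted monitoring, the status body can be parsed into a typed record. The field names below come from the sample response. The `FrontierStatus` class and its `queue_empty` property are illustrative additions, and an empty queue does not by itself mean the crawl has finished, since workers may still be fetching.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class FrontierStatus:
    queue_size: int
    seen_count: int
    host_count: int
    seed_value: int

    @property
    def queue_empty(self) -> bool:
        """True when no URLs are currently waiting in the frontier."""
        return self.queue_size == 0


def parse_status(raw: str) -> FrontierStatus:
    """Parse the JSON body returned by GET /status."""
    data = json.loads(raw)
    return FrontierStatus(
        queue_size=int(data["queue_size"]),
        seen_count=int(data["seen_count"]),
        host_count=int(data["host_count"]),
        seed_value=int(data["seed_value"]),
    )


sample = '{"queue_size": 1234, "seen_count": 5678, "host_count": 42, "seed_value": 42}'
status = parse_status(sample)
```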
Determinism Guarantee
The frontier server maintains the same seed-driven ordering regardless of how many workers connect or in what order they pop URLs. Same seed = same frontier ordering.
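One way such a guarantee can be obtained is to derive each URL's priority from the seed and the URL itself, so the order does not depend on when or by whom a URL was pushed. The toy below illustrates that idea only. The `SeededFrontier` class and the SHA-256 keying are assumptions for the example, not a description of Palimpsest's internals, and the toy pushes every URL before popping, so it demonstrates independence from insertion order only.

```python
import hashlib
import heapq
from typing import Optional


class SeededFrontier:
    """Toy frontier whose pop order depends only on the seed and the URL set."""

    def __init__(self, seed: int) -> None:
        self._seed = seed
        self._heap: list[tuple[bytes, str]] = []
        self._seen: set[str] = set()

    def push(self, url: str) -> None:
        if url in self._seen:            # ignore URLs already seen
            return
        self._seen.add(url)
        # Priority is a pure function of (seed, url), never of arrival time.
        key = hashlib.sha256(f"{self._seed}:{url}".encode()).digest()
        heapq.heappush(self._heap, (key, url))

    def pop(self) -> Optional[str]:
        return heapq.heappop(self._heap)[1] if self._heap else None


def drain(seed: int, urls: list[str]) -> list[str]:
    """Push all URLs, then pop until empty, returning the pop order."""
    frontier = SeededFrontier(seed)
    for url in urls:
        frontier.push(url)
    order = []
    while (url := frontier.pop()) is not None:
        order.append(url)
    return order


urls = [
    "https://example.com/",
    "https://example.com/a",
    "https://docs.example.com/",
    "https://docs.example.com/guide",
]
order_a = drain(42, urls)
order_b = drain(42, list(reversed(urls)))  # same seed, different arrival order
```

With the same seed, `order_a` and `order_b` are identical even though the URLs arrived in opposite orders.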