Introduction
Palimpsest is a deterministic crawl kernel: not a crawler, not a Wayback clone, not a scraping framework. It is designed as a foundational memory layer for the web, a system where the same input and the same seed produce an identical crawl, identical artifacts, and identical replay. Every design decision bends around this property.
What Makes This Different
Traditional crawling and web archiving tools (Heritrix, wget, Scrapy, Brozzler) treat crawling as an inherently non-deterministic process. Network jitter, DNS resolution timing, thread scheduling, and random retry backoff all introduce entropy, so two runs of the same crawl can produce different results. Without reproducibility, a crawl cannot be verified by re-running it, replay is only approximate, and an audit has no stable artifact to check against.
Palimpsest eliminates this entropy. The system is governed by Six Laws (determinism, idempotence, content addressability, temporal integrity, replay fidelity, and observability as proof) that are enforced at every layer, from the frontier scheduler to the artifact serializer.
The result: a crawl kernel that auditors can trust, AI systems can consume, historians can depend on, and adversaries cannot easily corrupt.
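
To make the determinism claim concrete, here is a minimal sketch of how retry jitter and frontier ordering could be derived from a seed instead of from a clock or an OS random source. It is not Palimpsest's actual implementation: the names `backoff_ms` and `schedule` are hypothetical, and SplitMix64 and FNV-1a are stand-ins chosen only because they need no external crates.

```rust
/// SplitMix64 mixing step. Used here as a stand-in for whatever seeded
/// generator the real system uses (assumption, not taken from Palimpsest).
fn splitmix64(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// FNV-1a over the URL bytes, so derived values depend on the URL itself.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Hypothetical helper: retry delay in milliseconds.
/// The "jitter" is a pure function of (seed, url, attempt), so the same
/// inputs always yield the same delay.
fn backoff_ms(seed: u64, url: &str, attempt: u32) -> u64 {
    // Exponential base: 100 ms, 200 ms, ... capped at 6400 ms.
    let base = 100u64 << attempt.min(6);
    let mixed = splitmix64(seed ^ fnv1a(url.as_bytes()) ^ (attempt as u64));
    base + mixed % base // jitter falls in [0, base)
}

/// Hypothetical helper: order a frontier by a seeded key.
/// The result depends only on the seed and the set of URLs, not on the
/// order in which the URLs were discovered.
fn schedule(seed: u64, urls: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = urls.iter().map(|u| u.to_string()).collect();
    out.sort_by_key(|u| (splitmix64(seed ^ fnv1a(u.as_bytes())), u.clone()));
    out
}
```

Calling `schedule(7, &["a", "b", "c"])` and `schedule(7, &["c", "a", "b"])` returns the same ordering, and `backoff_ms` returns the same delay on every run for a given seed, URL, and attempt number. Changing the seed changes both, which is the property the text describes: same input plus same seed gives the same crawl.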
The System at a Glance
| Metric | Value |
|---|---|
| Crates | 15 Rust workspace members |
| Tests | 301 (zero failures) |
| Determinism proof | 10,000 pages, zero divergence |
| Storage | Content-addressed (BLAKE3) with structural deduplication |
| Format | WARC++ (ISO 28500 extension) |
| Index | Temporal graph: URL × time × hash × context |
| Capture | Raw HTTP + headless Chrome (CDP) |
| Distribution | HTTP frontier server + N workers |
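
The storage and index rows above can be illustrated with a small sketch of a content-addressed store paired with a temporal index. This is an illustration under stated assumptions, not the project's API: the `Archive` type and its methods are hypothetical, the 64-bit FNV-1a digest is a dependency-free stand-in for BLAKE3 (it is not collision-resistant and would not be suitable in practice), and the "context" dimension of the index is omitted.

```rust
use std::collections::{BTreeMap, HashMap};

/// Stand-in digest (FNV-1a, 64-bit). The real system is described as using
/// BLAKE3; this placeholder only keeps the sketch free of external crates.
fn digest(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Hypothetical archive: blobs keyed by content hash, plus an index from
/// (URL, capture time) to the hash observed at that time.
struct Archive {
    blobs: HashMap<u64, Vec<u8>>,
    index: BTreeMap<(String, u64), u64>,
}

impl Archive {
    fn new() -> Self {
        Archive { blobs: HashMap::new(), index: BTreeMap::new() }
    }

    /// Record a capture. Identical bodies hash to the same key, so the
    /// body is stored once no matter how many captures refer to it.
    fn record(&mut self, url: &str, time: u64, body: &[u8]) -> u64 {
        let key = digest(body);
        self.blobs.entry(key).or_insert_with(|| body.to_vec());
        self.index.insert((url.to_string(), time), key);
        key
    }

    /// Hash of the most recent capture of `url` at or before `time`.
    fn at(&self, url: &str, time: u64) -> Option<u64> {
        self.index
            .range((url.to_string(), 0)..=(url.to_string(), time))
            .next_back()
            .map(|(_, hash)| *hash)
    }

    /// Fetch a stored body by its content hash.
    fn body(&self, key: u64) -> Option<&[u8]> {
        self.blobs.get(&key).map(|v| v.as_slice())
    }

    /// Number of distinct bodies held, after deduplication.
    fn blob_count(&self) -> usize {
        self.blobs.len()
    }
}
```

Recording the same body for a URL at two different times adds two index entries but only one blob, which is the deduplication the table refers to. A lookup such as `archive.at(url, t)` then answers "what did this URL contain at time t" by returning the hash of the latest capture not after `t`.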
How to Read This Documentation
- Getting Started — Install, run your first crawl, configure the system.
- Architecture — System design, the Six Laws, crate dependency graph, data flow.
- Core Concepts — Deep dives into determinism, content addressability, the execution envelope, temporal indexing, and the WARC++ format.
- Crate Reference — Complete API documentation for all 15 crates.
- Operations — Docker deployment, distributed crawling, retrieval API, monitoring.
- Security — Trust boundaries, fetch safety, browser sandboxing.
- Testing — Testing philosophy, the simulation framework, adversarial universes.
- Contributing — Development setup, code standards, commit conventions.
- Appendix — Error taxonomy, API quick reference, glossary.