Introduction

Palimpsest

Palimpsest is a deterministic crawl kernel — not a crawler, not a Wayback clone, not a scraping framework. It is the foundational memory layer of the web: a system where the same input and the same seed produce an identical crawl, identical artifacts, and identical replay. Every design decision bends around this property.

What Makes This Different

Traditional web archiving tools (Heritrix, wget, Scrapy, Brozzler) treat crawling as an inherently non-deterministic process. Network jitter, DNS resolution timing, thread scheduling, and random retry backoff all introduce entropy. Two runs of the same crawl produce different results. This makes verification impossible, replay approximate, and auditing meaningless.

Palimpsest eliminates this. The system is governed by Six Laws — determinism, idempotence, content addressability, temporal integrity, replay fidelity, and observability as proof — that are enforced at every layer, from the frontier scheduler to the artifact serializer.
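To make the determinism law concrete, here is a minimal sketch, not Palimpsest's actual scheduler: the `Xorshift64` generator and `crawl_order` function are illustrative assumptions. The point it demonstrates is that all ordering entropy flows from the seed, never from wall clocks, network timing, or thread scheduling, so the same seed and the same URL set yield an identical crawl order on every run.

```rust
// Illustrative sketch only; Palimpsest's real frontier is not shown here.
// A hand-rolled xorshift64 PRNG stands in for whatever seeded source the
// kernel actually uses.
struct Xorshift64 {
    state: u64,
}

impl Xorshift64 {
    fn new(seed: u64) -> Self {
        Self { state: seed.max(1) } // xorshift state must be non-zero
    }
    fn next(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }
}

/// Assign each URL a priority derived only from the seed, then sort.
/// Same seed + same URL set => identical crawl order, run after run.
fn crawl_order(seed: u64, urls: &[&str]) -> Vec<String> {
    let mut rng = Xorshift64::new(seed);
    let mut keyed: Vec<(u64, String)> = urls
        .iter()
        .map(|u| (rng.next(), u.to_string()))
        .collect();
    keyed.sort(); // total order: priority first, URL as tiebreaker
    keyed.into_iter().map(|(_, u)| u).collect()
}

fn main() {
    let urls = ["https://a.example", "https://b.example", "https://c.example"];
    let first = crawl_order(42, &urls);
    let second = crawl_order(42, &urls);
    assert_eq!(first, second); // determinism: identical runs
}
```

Replayability falls out of the same property: replaying a crawl is just re-running the kernel with the recorded seed and inputs.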

The result: a crawl kernel that auditors can trust, AI systems can consume, historians can depend on, and adversaries cannot easily corrupt.

The System at a Glance

  • Crates: 15 Rust workspace members
  • Tests: 301 (zero failures)
  • Determinism proof: 10,000 pages, zero divergence
  • Storage: content-addressed (BLAKE3) with structural deduplication
  • Format: WARC++ (ISO 28500 extension)
  • Index: temporal graph of URL × time × hash × context
  • Capture: raw HTTP + headless Chrome (CDP)
  • Distribution: HTTP frontier server + N workers
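The content-addressed storage above can be sketched as follows. This is a hedged illustration, not Palimpsest's API: `Store` is a hypothetical name, and a toy FNV-1a hash stands in for BLAKE3. The dedup logic it shows is the same either way: the key is derived from the bytes themselves, so capturing identical content twice produces one artifact under one address.

```rust
use std::collections::HashMap;

// Toy FNV-1a hash standing in for BLAKE3 (assumption: the real store
// uses BLAKE3, per the table above; only the dedup mechanics matter here).
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Hypothetical content-addressed store: the address is a pure function
/// of the content, so identical payloads deduplicate automatically.
#[derive(Default)]
struct Store {
    blobs: HashMap<u64, Vec<u8>>,
}

impl Store {
    /// Returns the content address; re-inserting the same bytes is a no-op.
    fn put(&mut self, bytes: &[u8]) -> u64 {
        let addr = fnv1a(bytes);
        self.blobs.entry(addr).or_insert_with(|| bytes.to_vec());
        addr
    }
    fn get(&self, addr: u64) -> Option<&[u8]> {
        self.blobs.get(&addr).map(|v| v.as_slice())
    }
}

fn main() {
    let mut store = Store::default();
    let a = store.put(b"<html>same page</html>");
    let b = store.put(b"<html>same page</html>"); // duplicate capture
    assert_eq!(a, b);                 // same content => same address
    assert_eq!(store.blobs.len(), 1); // structural deduplication
    assert!(store.get(a).is_some());
}
```

Content addressing is also what makes the temporal index cheap: the index maps URL × time to a hash, and many (URL, time) pairs can point at the same stored artifact.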

How to Read This Documentation

  • Getting Started — Install, run your first crawl, configure the system.
  • Architecture — System design, the Six Laws, crate dependency graph, data flow.
  • Core Concepts — Deep dives into determinism, content addressability, the execution envelope, temporal indexing, and the WARC++ format.
  • Crate Reference — Complete API documentation for all 15 crates.
  • Operations — Docker deployment, distributed crawling, retrieval API, monitoring.
  • Security — Trust boundaries, fetch safety, browser sandboxing.
  • Testing — Testing philosophy, the simulation framework, adversarial universes.
  • Contributing — Development setup, code standards, commit conventions.
  • Appendix — Error taxonomy, API quick reference, glossary.