The Six Laws

Every design decision in Palimpsest bends around these six immutable laws. If a change violates any law, the change is wrong — not the law.

Law 1: Determinism

Frontier ordering is seed-driven. Retry logic is explicit. No hidden randomness anywhere.

Why it matters: Without determinism, you cannot verify a crawl, replay a crawl, or prove that two crawls are equivalent. Determinism is the foundation that makes every other law possible.

How it’s enforced:

  • All randomness flows from CrawlSeed through ChaCha8Rng (seeded PRNG)
  • No rand crate in any core path
  • BTreeMap for all ordered collections (never HashMap)
  • No Instant::now() or SystemTime::now() in core logic — time comes from the ExecutionEnvelope
  • Atomics are allowed for metrics counters only, never for control flow
  • Browser JS overrides: Date.now(), Math.random(), performance.now() are all seeded
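The seeded-PRNG and BTreeMap rules above can be sketched in std-only Rust. This is a hypothetical illustration, not the real frontier: `SeededRng` is a splitmix64 stand-in for the `ChaCha8Rng` the system actually uses, and `next_url` is an invented helper showing how a seed plus an ordered map removes all hidden randomness from tie-breaking:

```rust
use std::collections::BTreeMap;

// splitmix64: a tiny std-only stand-in for the seeded ChaCha8Rng.
// The seed value plays the role of CrawlSeed.
struct SeededRng(u64);

impl SeededRng {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

// The frontier lives in a BTreeMap so iteration order is stable across
// runs; the seeded RNG breaks ties among equal-priority URLs, so the
// same seed always picks the same URL.
fn next_url(frontier: &BTreeMap<String, u32>, rng: &mut SeededRng) -> Option<String> {
    let min = *frontier.values().min()?;
    let candidates: Vec<&String> = frontier
        .iter()
        .filter(|(_, p)| **p == min)
        .map(|(u, _)| u)
        .collect();
    let idx = (rng.next_u64() as usize) % candidates.len();
    Some(candidates[idx].clone())
}

fn main() {
    let mut frontier = BTreeMap::new();
    frontier.insert("https://a.example/".to_string(), 0);
    frontier.insert("https://b.example/".to_string(), 0);
    // Same seed, same choice, on every run.
    let pick1 = next_url(&frontier, &mut SeededRng(42)).unwrap();
    let pick2 = next_url(&frontier, &mut SeededRng(42)).unwrap();
    assert_eq!(pick1, pick2);
    println!("picked {pick1}");
}
```

Swapping the `BTreeMap` for a `HashMap` here would break the guarantee even with a fixed seed, because candidate order would vary run to run.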

What breaks if violated: Two runs with the same seed produce different output. Replay becomes approximate. Verification becomes impossible. The entire system reduces to a conventional crawler.

Law 2: Idempotence

Same URL + same execution context = identical artifact hash.

Why it matters: Idempotence enables deduplication, verification, and caching. If the same fetch produces different artifacts, you cannot distinguish content changes from system noise.

How it’s enforced:

  • ContentHash::of(data) produces a deterministic BLAKE3 hash
  • RecordId is generated from content_hash + record_type, not from random UUIDs
  • The ExecutionEnvelope freezes all inputs before the fetch begins
  • Response normalization is deterministic
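A minimal sketch of the `RecordId` rule, assuming `std`'s `DefaultHasher` as a stand-in for BLAKE3 (the real id would wrap a BLAKE3 digest): the id is a pure function of content hash plus record type, so the same inputs always yield the same id, unlike a random UUID.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// DefaultHasher stands in for BLAKE3 here; the point is only that the
// id is derived from (content_hash, record_type), never generated
// randomly, so repeated fetches of unchanged content collide on purpose.
fn record_id(content_hash: &str, record_type: &str) -> u64 {
    let mut h = DefaultHasher::new();
    content_hash.hash(&mut h);
    record_type.hash(&mut h);
    h.finish()
}

fn main() {
    let a = record_id("abc123", "response");
    let b = record_id("abc123", "response");
    assert_eq!(a, b); // same inputs, same id, every run
    assert_ne!(a, record_id("abc123", "request"));
    println!("record id: {a:x}");
}
```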

What breaks if violated: Storage bloats with duplicate content under different hashes. Change detection produces false positives. Audit trails become unreliable.

Law 3: Content Addressability

All artifacts are BLAKE3 hash-addressed. Deduplication is structural.

Why it matters: Content addressing makes storage self-verifying. You can detect tampering by recomputing the hash. You get deduplication for free — identical content maps to the same hash, stored once.

How it’s enforced:

  • Every WarcRecord carries a Palimpsest-Content-Hash header
  • Every blob in storage is stored at a path derived from its BLAKE3 hash
  • FileSystemBlobStore uses git-style layout: {hash[0..2]}/{hash[2..]}
  • Integrity is verified on every read
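The git-style layout can be shown directly. `blob_path` is a hypothetical helper, and the hex string in the example stands in for a real BLAKE3 digest:

```rust
// Derive a storage path from a hex content hash using the
// {hash[0..2]}/{hash[2..]} layout described above: the first two hex
// characters become the directory, the rest the file name.
fn blob_path(hex_hash: &str) -> String {
    assert!(hex_hash.len() > 2, "hash too short to shard");
    format!("{}/{}", &hex_hash[..2], &hex_hash[2..])
}

fn main() {
    // A placeholder hex digest, not a real BLAKE3 output.
    let path = blob_path("af1349b9f5f9a1a6");
    assert_eq!(path, "af/1349b9f5f9a1a6");
    println!("{path}");
}
```

The two-character prefix shards blobs across 256 directories, which keeps any single directory from growing unboundedly on large archives.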

What breaks if violated: Tampering becomes undetectable. Deduplication fails. Storage grows linearly instead of sublinearly.

Law 4: Temporal Integrity

Every capture binds wall clock + logical clock + crawl context + dependency chain.

Why it matters: The web changes constantly. Without precise temporal binding, you cannot answer “what did this page look like at time T?” or “which crawl produced this artifact?”

How it’s enforced:

  • CaptureInstant pairs wall clock (DateTime<Utc>) with logical clock (u64)
  • Every IndexEntry records URL, captured_at, content_hash, and crawl_context
  • CrawlContextId identifies the specific crawl session
  • CaptureGroup binds all records from a single fetch with their shared timestamp
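The wall-clock/logical-clock pairing can be sketched as follows, assuming unix milliseconds in place of `DateTime<Utc>`; `LogicalClock` is a hypothetical issuer, not the system's real API:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

// Sketch of a CaptureInstant: wall clock (unix millis, standing in for
// DateTime<Utc>) paired with a monotonically increasing logical clock.
// `logical` is the first field, so derived ordering compares it first
// and two captures stay totally ordered even if their wall clocks tie.
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
struct CaptureInstant {
    logical: u64,
    wall_ms: u128,
}

struct LogicalClock(u64);

impl LogicalClock {
    fn capture(&mut self) -> CaptureInstant {
        self.0 += 1;
        let wall_ms = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .expect("system clock before unix epoch")
            .as_millis();
        CaptureInstant { logical: self.0, wall_ms }
    }
}

fn main() {
    let mut clock = LogicalClock(0);
    let a = clock.capture();
    let b = clock.capture();
    // Unambiguous ordering even when both wall_ms values are equal.
    assert!(a < b);
    println!("{a:?} < {b:?}");
}
```

Note that in the real system the wall clock would come from the `ExecutionEnvelope` rather than `SystemTime::now()`, per Law 1; the direct call here is only to keep the sketch self-contained.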

What breaks if violated: History queries return ambiguous results. You cannot distinguish “same content, different time” from “different content, same time.”

Law 5: Replay Fidelity

Stored artifacts must be sufficient to reconstruct the HTTP exchange, DOM state, and resource dependency graph.

Why it matters: Replay is the proof that the system works. If you cannot reconstruct the original response from stored artifacts, the archive is incomplete.

How it’s enforced:

  • The ExecutionEnvelope stores the full context (seed, DNS, TLS, headers, browser config)
  • WARC++ records include envelope, dom-snapshot, resource-graph, and timing records
  • ReplayEngine reconstructs from envelope + stored artifacts
  • Same envelope + same artifacts = bit-identical reconstruction

What breaks if violated: The archive becomes a collection of blobs without enough context to interpret them. Legal and forensic use cases fail.

Law 6: Observability as Proof

Every decision is queryable. Every failure is replayable. Every artifact is verifiable.

Why it matters: A crawl system that cannot explain its own behavior is a black box. Observability is not a feature — it is the proof that the other five laws hold.

How it’s enforced:

  • Structured logging via tracing throughout the codebase
  • Prometheus metrics (9 atomic counters) exposed at /metrics
  • PalimpsestError classifies every failure into exactly one of seven categories
  • Errors are stored as artifacts in the crawl record
  • The temporal index makes every decision queryable
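The classification rule can be sketched as an enum with exhaustive matching, so the compiler itself enforces "exactly one category." The seven category names below are hypothetical, since the text does not list them:

```rust
// Hypothetical category names; the source only says PalimpsestError
// maps every failure into exactly one of seven categories.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ErrorCategory {
    Network,
    Dns,
    Tls,
    Protocol,
    Storage,
    Policy,
    Internal,
}

// Every error carries exactly one category, so per-category metrics
// and stored error artifacts are always well-defined.
struct PalimpsestError {
    category: ErrorCategory,
    message: String,
}

impl PalimpsestError {
    fn new(category: ErrorCategory, message: impl Into<String>) -> Self {
        Self { category, message: message.into() }
    }
}

fn main() {
    let err = PalimpsestError::new(ErrorCategory::Dns, "NXDOMAIN for example.test");
    assert_eq!(err.category, ErrorCategory::Dns);
    println!("[{:?}] {}", err.category, err.message);
}
```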

What breaks if violated: Debugging becomes guesswork. Compliance audits fail. Users cannot distinguish system bugs from legitimate content changes.