The Six Laws
Every design decision in Palimpsest bends around these six immutable laws. If a change violates any law, the change is wrong — not the law.
Law 1: Determinism
Frontier ordering is seed-driven. Retry logic is explicit. No hidden randomness anywhere.
Why it matters: Without determinism, you cannot verify a crawl, replay a crawl, or prove that two crawls are equivalent. Determinism is the foundation that makes every other law possible.
How it’s enforced:
- All randomness flows from `CrawlSeed` through `ChaCha8Rng` (seeded PRNG)
- No `rand` crate in any core path
- `BTreeMap` for all ordered collections (never `HashMap`)
- No `Instant::now()` or `SystemTime::now()` in core logic — time comes from the `ExecutionEnvelope`
- Atomics are allowed for metrics counters only, never for control flow
- Browser JS overrides: `Date.now()`, `Math.random()`, `performance.now()` are all seeded
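The seed-driven frontier ordering can be sketched as follows. This is a minimal illustration, not the real implementation: `SeededRng` is a splitmix64-style stand-in for `ChaCha8Rng`, and `frontier_order` is a hypothetical helper showing how a `BTreeMap` plus a seeded key makes iteration order fully reproducible.

```rust
use std::collections::BTreeMap;

/// Stand-in for the seeded PRNG. The real code derives ChaCha8Rng from
/// CrawlSeed; this splitmix64-style generator is illustrative only.
struct SeededRng(u64);

impl SeededRng {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical helper: order the frontier by a seeded key.
/// BTreeMap guarantees deterministic iteration; the seed supplies a
/// reproducible shuffle, so same seed + same URLs = same order.
fn frontier_order(seed: u64, urls: &[&str]) -> Vec<String> {
    let mut rng = SeededRng(seed);
    let mut frontier = BTreeMap::new();
    for url in urls {
        frontier.insert((rng.next_u64(), url.to_string()), ());
    }
    frontier.into_keys().map(|(_, url)| url).collect()
}

fn main() {
    let urls = ["https://a.example", "https://b.example", "https://c.example"];
    // The property Law 1 demands: replaying with the same seed
    // reproduces the exact same frontier order.
    assert_eq!(frontier_order(42, &urls), frontier_order(42, &urls));
}
```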
What breaks if violated: Two runs with the same seed produce different output. Replay becomes approximate. Verification becomes impossible. The entire system reduces to a conventional crawler.
Law 2: Idempotence
Same URL + same execution context = identical artifact hash.
Why it matters: Idempotence enables deduplication, verification, and caching. If the same fetch produces different artifacts, you cannot distinguish content changes from system noise.
How it’s enforced:
- `ContentHash::of(data)` produces a deterministic BLAKE3 hash
- `RecordId` is generated from `content_hash + record_type`, not from random UUIDs
- The `ExecutionEnvelope` freezes all inputs before the fetch begins
- Response normalization is deterministic
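The record-identity rule above can be sketched in a few lines. This is a simplified illustration: `std`'s `DefaultHasher` stands in for BLAKE3 (the real `ContentHash::of` uses BLAKE3), and `record_id` is a hypothetical helper showing the hash-plus-type derivation rather than the actual `RecordId` constructor.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in content hash. The real system uses BLAKE3; DefaultHasher
/// is used here only so the sketch stays std-only.
fn content_hash(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

/// Hypothetical helper: derive a record id from content hash + record
/// type. No random UUIDs anywhere, so the id is idempotent.
fn record_id(data: &[u8], record_type: &str) -> String {
    format!("{:016x}-{}", content_hash(data), record_type)
}

fn main() {
    let body = b"<html>hello</html>";
    // Same bytes, same type => identical id on every run.
    assert_eq!(record_id(body, "response"), record_id(body, "response"));
    // Same bytes under a different record type get a distinct id.
    assert_ne!(record_id(body, "response"), record_id(body, "request"));
}
```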
What breaks if violated: Storage bloats with duplicate content under different hashes. Change detection produces false positives. Audit trails become unreliable.
Law 3: Content Addressability
All artifacts are BLAKE3 hash-addressed. Deduplication is structural.
Why it matters: Content addressing makes storage self-verifying. You can detect tampering by recomputing the hash. You get deduplication for free — identical content maps to the same hash, stored once.
How it’s enforced:
- Every `WarcRecord` carries a `Palimpsest-Content-Hash` header
- Every blob in storage is stored at a path derived from its BLAKE3 hash
- `FileSystemBlobStore` uses git-style layout: `{hash[0..2]}/{hash[2..]}`
- Integrity is verified on every read
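The git-style layout is simple enough to show directly. A minimal sketch, assuming the hash arrives as a lowercase hex string; `blob_path` is a hypothetical helper, and the hash value in `main` is a made-up example, not a real BLAKE3 digest.

```rust
/// Derive a git-style storage path from a hex content hash: the first
/// two characters become the directory, the remainder the filename.
/// This fans blobs out across 256 directories, and the path itself is
/// the integrity claim — rehash the bytes on read and compare.
fn blob_path(hex_hash: &str) -> String {
    format!("{}/{}", &hex_hash[0..2], &hex_hash[2..])
}

fn main() {
    // Example value only — not an actual BLAKE3 digest.
    let hash = "af1349b9f5f9a1a6a0404dea36dcc949";
    assert_eq!(blob_path(hash), "af/1349b9f5f9a1a6a0404dea36dcc949");
}
```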
What breaks if violated: Tampering becomes undetectable. Deduplication fails. Storage grows linearly instead of sublinearly.
Law 4: Temporal Integrity
Every capture binds wall clock + logical clock + crawl context + dependency chain.
Why it matters: The web changes constantly. Without precise temporal binding, you cannot answer “what did this page look like at time T?” or “which crawl produced this artifact?”
How it’s enforced:
- `CaptureInstant` pairs wall clock (`DateTime<Utc>`) with logical clock (`u64`)
- Every `IndexEntry` records URL, `captured_at`, `content_hash`, and `crawl_context`
- `CrawlContextId` identifies the specific crawl session
- `CaptureGroup` binds all records from a single fetch with their shared timestamp
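The temporal binding can be sketched as a pair of types. This is a simplified illustration: field names follow the list above, but the wall clock is shown as Unix seconds rather than chrono's `DateTime<Utc>` to keep the sketch std-only, and all concrete values are made up.

```rust
/// Wall clock + logical clock, bound together. The logical clock
/// disambiguates captures that share a wall-clock instant.
#[derive(Debug, Clone, Copy, PartialEq)]
struct CaptureInstant {
    wall_clock: u64, // stand-in for DateTime<Utc> (Unix seconds here)
    logical: u64,    // monotonically increasing within a crawl
}

/// One row of the temporal index: what was captured, when, from
/// which crawl session.
#[derive(Debug)]
struct IndexEntry {
    url: String,
    captured_at: CaptureInstant,
    content_hash: String,  // hex digest of the artifact
    crawl_context: String, // CrawlContextId in the real code
}

fn main() {
    let entry = IndexEntry {
        url: "https://example.com/".into(),
        captured_at: CaptureInstant { wall_clock: 1_700_000_000, logical: 42 },
        content_hash: "af1349b9".into(), // example value only
        crawl_context: "crawl-session-0001".into(),
    };
    // "Same content, different time" stays distinguishable because the
    // entry binds hash and instant together.
    assert_eq!(entry.captured_at.logical, 42);
    println!("{:?}", entry);
}
```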
What breaks if violated: History queries return ambiguous results. You cannot distinguish “same content, different time” from “different content, same time.”
Law 5: Replay Fidelity
Stored artifacts must be sufficient to reconstruct the HTTP exchange, DOM state, and resource dependency graph.
Why it matters: Replay is the proof that the system works. If you cannot reconstruct the original response from stored artifacts, the archive is incomplete.
How it’s enforced:
- The `ExecutionEnvelope` stores the full context (seed, DNS, TLS, headers, browser config)
- WARC++ records include `envelope`, `dom-snapshot`, `resource-graph`, and `timing` records
- `ReplayEngine` reconstructs from `envelope + stored artifacts`
- Same envelope + same artifacts = bit-identical reconstruction
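The bit-identical-reconstruction property can be stated as a pure function. A minimal sketch, with heavily simplified types: the real `ExecutionEnvelope` carries DNS, TLS, and browser config, and `replay` here is a hypothetical placeholder that just concatenates its inputs deterministically to make the property testable.

```rust
use std::collections::BTreeMap;

/// Simplified envelope — the real one also freezes DNS, TLS, and
/// browser configuration. BTreeMap keeps header order deterministic.
#[derive(Clone, PartialEq, Debug)]
struct ExecutionEnvelope {
    seed: u64,
    headers: BTreeMap<String, String>,
}

/// Reconstruction as a pure function of envelope + stored artifacts.
/// Because there is no hidden input, the output is bit-identical on
/// every invocation — which is exactly what Law 5 requires.
fn replay(envelope: &ExecutionEnvelope, artifacts: &BTreeMap<String, Vec<u8>>) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&envelope.seed.to_be_bytes());
    for (name, blob) in artifacts {
        out.extend_from_slice(name.as_bytes());
        out.extend_from_slice(blob);
    }
    out
}

fn main() {
    let envelope = ExecutionEnvelope { seed: 7, headers: BTreeMap::new() };
    let mut artifacts = BTreeMap::new();
    artifacts.insert("dom-snapshot".to_string(), b"<html></html>".to_vec());
    // Same envelope + same artifacts = bit-identical reconstruction.
    assert_eq!(replay(&envelope, &artifacts), replay(&envelope, &artifacts));
}
```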
What breaks if violated: The archive becomes a collection of blobs without enough context to interpret them. Legal and forensic use cases fail.
Law 6: Observability as Proof
Every decision is queryable. Every failure is replayable. Every artifact is verifiable.
Why it matters: A crawl system that cannot explain its own behavior is a black box. Observability is not a feature — it is the proof that the other five laws hold.
How it’s enforced:
- Structured logging via `tracing` throughout the codebase
- Prometheus metrics (9 atomic counters) exposed at `/metrics`
- `PalimpsestError` classifies every failure into exactly one of seven categories
- Errors are stored as artifacts in the crawl record
- The temporal index makes every decision queryable
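The "exactly one of seven categories" rule maps naturally onto a Rust enum, which makes overlapping or unclassified failures unrepresentable. A sketch only: the seven variant names below are illustrative placeholders, not the real taxonomy, and `PalimpsestError` here is reduced to the two fields the sketch needs.

```rust
/// Illustrative placeholder taxonomy — the document states there are
/// exactly seven categories but does not name them; these names are
/// invented for the sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ErrorCategory {
    Network,
    Dns,
    Tls,
    Protocol,
    Browser,
    Storage,
    Policy,
}

/// Simplified error: the enum field forces every failure into exactly
/// one category, so classification can never be ambiguous.
struct PalimpsestError {
    category: ErrorCategory,
    message: String,
}

fn main() {
    let err = PalimpsestError {
        category: ErrorCategory::Dns,
        message: "NXDOMAIN for example.invalid".into(),
    };
    // An error is itself an artifact: structured, storable, queryable.
    assert_eq!(err.category, ErrorCategory::Dns);
    println!("[{:?}] {}", err.category, err.message);
}
```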
What breaks if violated: Debugging becomes guesswork. Compliance audits fail. Users cannot distinguish system bugs from legitimate content changes.