Palimpsest is a deterministic crawl kernel: same seed produces identical frontier ordering, identical artifacts, and identical replay. Not a crawler. Not a Wayback clone. A foundation for web archiving, AI training data, and compliance where every capture is cryptographically verifiable.
Determinism covers crawl scheduling, content hashing, artifact generation, index entries, and replay reconstruction. Network latency, TLS handshake timing, and performance metrics vary by environment.
Palimpsest is the foundational memory layer of the web: same input + same seed = identical crawl, identical artifacts, identical replay. Every design decision bends around six immutable laws covering determinism, content addressability, temporal integrity, and replay fidelity.
The system includes raw HTTP and headless Chrome capture, a SQLite temporal index, content-addressed blob storage (local + S3/GCS/Azure), WARC++ output compatible with legacy crawlers, robots.txt compliance, concurrent fetching, crawl resumption, shadow comparison against Heritrix/wget, and a deterministic simulation framework that proves correctness at 5,000 pages with zero divergence.
We looked at every tool in the space. No one combines these properties.
The web archiving world has Heritrix and Webrecorder — institutional tools that produce WARC files but can't get past a Cloudflare challenge page. The scraping world has Scrapy, Crawlee, and Firecrawl — fast tools with anti-detection but no archival integrity, no WARC output, no replay fidelity. The AI world has Crawl4AI and Unstructured — great at extraction but with zero provenance, zero reproducibility, and no way to prove when content was captured or by whom.
These three worlds have never been combined. Archivists can't reach the modern web. Scrapers can't prove what they captured. AI pipelines can't trace their training data. Palimpsest exists because the web's memory layer shouldn't require choosing between access, integrity, and provenance.
Same seed = identical frontier ordering, identical artifacts, identical replay. No existing crawler guarantees this.
WARC-producing tools have zero anti-detection. Anti-detection tools produce zero WARCs. These worlds have never met.
Every fetch seals DNS, TLS chain, headers, browser config, and seed into an immutable context. No other tool does this.
URL × time × content hash × crawl context. CDX is a flat lookup. This is a queryable history of the web.
Every RAG chunk traces to a content hash, timestamp, crawl context, and execution envelope. Audit-ready provenance.
Raw HTTP and headless Chrome produce the same artifact types through the same envelope model. Not bolted on — native.
As of April 2026, we have not found another system that combines determinism, content addressability, TLS stealth, browser capture, WARC output, temporal indexing, and RAG extraction with determinism guarantees in a single codebase.
Before writing a single line of code, you need a precisely configured environment. Claude Code is an AI-native development tool that operates as a CLI agent — it reads your codebase, executes commands, edits files, and runs tests. But it needs the right foundation to be effective.
Claude Code is distributed as an npm package. You need Node.js 18+ installed first.
# Install Node.js if you don't have it
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt-get install -y nodejs
# Install Claude Code globally
npm install -g @anthropic-ai/claude-code
# Verify installation
claude --version
You will need an Anthropic API key. Set it as an environment variable or Claude Code will prompt you on first run.
export ANTHROPIC_API_KEY="sk-ant-..."
Palimpsest uses Rust edition 2024, which requires Rust 1.85+. The stable toolchain works.
# Install rustup (Rust toolchain manager)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Source the environment
source "$HOME/.cargo/env"
# Install stable toolchain
rustup default stable
# Verify
rustc --version # Should be 1.85+
cargo --version
The project uses SQLite (bundled via rusqlite), headless Chrome for browser capture, and standard build tools.
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y build-essential pkg-config libssl-dev
# Install Google Chrome for browser capture
sudo install -d -m 0755 /etc/apt/keyrings  # apt-key is deprecated; use a signed-by keyring
wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | \
  sudo gpg --dearmor -o /etc/apt/keyrings/google-chrome.gpg
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/google-chrome.gpg] http://dl.google.com/linux/chrome/deb/ stable main" | \
  sudo tee /etc/apt/sources.list.d/google-chrome.list
sudo apt-get update
sudo apt-get install -y google-chrome-stable
# Verify
google-chrome --version
The GitHub CLI (gh) is used for CI management, PR creation, and repository operations directly from Claude Code.
sudo apt-get install -y gh
gh auth login
Claude Code creates commits on your behalf. Ensure your identity is configured.
git config --global user.name "Your Name"
git config --global user.email "you@example.com"
The key insight from this build: Claude Code is most effective when given clear architectural constraints upfront. The CLAUDE.md file is the contract between you and the AI — it defines invariants, coding standards, and non-negotiable rules.
This is the most important file in the repository. It tells Claude Code what the project is, what rules to follow, and what invariants to enforce. Every design decision in Palimpsest traces back to this document.
The Six Laws defined in CLAUDE.md were enforced by hooks that ran on every tool call. Claude Code checked its own work against these laws before every response — grep sweeps for .unwrap(), HashMap, rand, and Instant::now() were performed dozens of times throughout the session.
This project instruction file, which Claude Code reads on every startup, defines the Six Laws, error taxonomy, dependency policy, and testing philosophy.
Claude Code supports hooks — shell scripts that run before or after tool calls. We used two hooks to enforce the Six Laws automatically:
# .claude/hooks/guard-determinism.sh
# Runs before every Edit/Write tool call
# Warns about: rand crate, HashMap, Instant::now(),
# .unwrap()/.expect() in library code
# .claude/hooks/check-invariants.sh
# Runs before every response ends
# Verifies: Six Laws compliance, error classification,
# content addressability, temporal integrity
These hooks caught real issues — an .expect() that slipped into library code, a missing content hash verification, a HashMap that should have been a BTreeMap.
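The HashMap-versus-BTreeMap distinction matters because HashMap's iteration order is seeded from OS randomness and differs per process, while BTreeMap always iterates in key order. A minimal illustration of the safe choice:

```rust
use std::collections::BTreeMap;

// BTreeMap iterates in sorted key order on every run, which is what
// deterministic frontier ordering and index serialization require.
// std::collections::HashMap would yield a different order per process,
// violating Law 1.
fn ordered_keys(entries: &[(&str, u32)]) -> Vec<String> {
    let map: BTreeMap<&str, u32> = entries.iter().copied().collect();
    map.keys().map(|k| k.to_string()).collect()
}
```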
Claude Code skills are reusable prompt templates for domain-specific tasks. We defined 8 skills for the project:
# Available skills:
/shadow-compare # Compare against legacy crawlers
/replay-test # Verify replay fidelity
/crawl-verify # Determinism verification
/envelope-audit # Audit ExecutionEnvelope immutability
/frontier-sim # Simulate frontier scheduling
/invariant-check # Audit against Six Laws
/threat-model # Security review
/sentinel # Monitor Anthropic API changes
With the environment configured and constraints established, the actual build proceeded bottom-up through the dependency graph. Claude Code implemented each crate, wrote tests, ran clippy, and performed Six Laws audits at every commit boundary.
# Start Claude Code in the project directory
cd palimpsest
claude
# Claude Code reads CLAUDE.md, .claude/rules/, and .claude/hooks/
# automatically on startup. It understands the project's invariants
# before you type a single command.
The most effective pattern was direct and terse. No lengthy explanations needed — Claude Code had the context from CLAUDE.md.
# Start building
"let start working on this project"
# After reviewing the plan
"yes"
# Keep momentum
"proceed"
# Test against reality
"zuub.com"
# Scale up
"let run it"
# Ship
"lets commit and tag"
The build progressed through a series of tagged releases in a single session:
v0.1.0 — Core kernel: 8 crates, 239 tests, Six Laws verified. End-to-end pipeline from seed URL to replayed content.
v0.2.0 — Browser capture via CDP, SQLite index, object store backend, crawl resumption. Production-tested on zuub.com.
v0.3.0 — Polish: graceful shutdown, TOML config, CI pipeline, README, sub-resource indexing, JSON-to-SQLite migration.
Phase 3 — Scale: distributed frontier server, HTTP API, multi-worker architecture. One coordinator, N workers, zero external deps.
v0.4.0 — AI-Native: content extraction pipeline, RAG chunking with provenance, retrieval API server.
v0.5.0 — Semantic Intelligence: embedding generation, cosine similarity vector search, LCS change detection across captures.
These are the critical abstractions that make the system work. Each one enforces one or more of the Six Laws at the code level.
Every random decision in the entire system flows from this single seed through a ChaCha8 PRNG. No rand::thread_rng(), no OsRng, no entropy sources anywhere in the kernel.
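The property this buys is that the random stream is a pure function of the seed. A minimal sketch (the kernel uses ChaCha8; this substitutes the much simpler splitmix64 mixer to keep the example dependency-free):

```rust
/// Sketch of seed-derived randomness. Not the kernel's ChaCha8 -- a
/// splitmix64 mixer standing in to show the property that matters:
/// same seed, same stream, on every machine and every run.
struct SeededRng(u64);

impl SeededRng {
    fn new(seed: u64) -> Self {
        SeededRng(seed)
    }

    fn next_u64(&mut self) -> u64 {
        // splitmix64: advance the state, then scramble it.
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}
```

Because nothing else in the kernel touches entropy, replaying a crawl is just re-running the same function of the same seed.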
Every fetch runs inside an immutable envelope that captures seed, timestamp, DNS snapshot, TLS fingerprint, and browser config. This is what makes replay possible — given the same envelope, the system must produce identical artifacts.
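A hypothetical sketch of the envelope's shape (field names and types are illustrative, not the actual API): everything is set once at construction and nothing mutates after the fetch begins, which is what makes replay well-defined.

```rust
/// Illustrative envelope shape. All fields are private and written once;
/// an equal envelope must produce equal artifacts on replay.
#[derive(Debug, Clone, PartialEq)]
struct ExecutionEnvelope {
    seed: u64,
    timestamp_ms: u64,                   // logical capture time, not wall clock
    dns_snapshot: Vec<(String, String)>, // host -> resolved IP, sorted
    tls_fingerprint: String,             // e.g. a JA3-style digest
    browser_config: Option<String>,      // None for raw HTTP capture
}

impl ExecutionEnvelope {
    fn new(seed: u64, timestamp_ms: u64) -> Self {
        ExecutionEnvelope {
            seed,
            timestamp_ms,
            dns_snapshot: Vec::new(),
            tls_fingerprint: String::new(),
            browser_config: None,
        }
    }

    /// The minimal identity used to look up a capture for replay.
    fn replay_key(&self) -> (u64, u64) {
        (self.seed, self.timestamp_ms)
    }
}
```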
Injected before every page navigation, this JavaScript overrides all non-deterministic browser APIs with seed-derived values. Same seed = same Date.now(), same Math.random(), same performance.now().
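One way such a script could be built from the crawl seed — a hypothetical sketch only; the real injection covers many more APIs (crypto.getRandomValues, Date construction, timers) and the function name is invented here:

```rust
/// Hypothetical sketch: generate the pre-navigation override script from
/// the crawl seed. Same seed => same script => same in-page "randomness".
fn determinism_overrides(seed: u32, epoch_ms: u64) -> String {
    format!(
        "Date.now = () => {epoch_ms};\n\
         performance.now = () => 0;\n\
         (() => {{\n\
         let s = {seed} >>> 0; // seed-derived LCG state\n\
         Math.random = () => {{\n\
         s = (Math.imul(s, 1664525) + 1013904223) >>> 0;\n\
         return s / 4294967296;\n\
         }};\n\
         }})();"
    )
}
```

In a CDP-based capture, a script like this would be registered to run before any page script executes, so the page never observes a real clock or entropy source.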
The core proof: crawl a simulated web twice with the same seed, assert byte-identical URL order, blob hashes, and index entries. Any divergence is a Law 1 violation. Proven at 5,000 pages with zero divergence.
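The shape of that proof can be sketched with a toy frontier — a deterministic breadth-style walk over a simulated link graph, run twice and compared element for element. The graph and the seed-keyed priority function below are illustrative, not the kernel's actual scheduler:

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy determinism check: "crawl" a simulated link graph with seed-driven
/// ordering and return the visit sequence. Running it twice with the same
/// seed must give an identical sequence (Law 1).
fn crawl(seed: u64, graph: &BTreeMap<&str, Vec<&str>>, start: &str) -> Vec<String> {
    // Seed-keyed priority; BTreeSet keeps the frontier totally ordered,
    // so every pop is reproducible.
    let prio = |url: &str| {
        url.bytes()
            .fold(seed, |h, b| h.wrapping_mul(31).wrapping_add(u64::from(b)))
    };
    let mut frontier: BTreeSet<(u64, String)> = BTreeSet::new();
    let mut seen: BTreeSet<String> = BTreeSet::new();
    frontier.insert((prio(start), start.to_string()));
    seen.insert(start.to_string());

    let mut order = Vec::new();
    while let Some((_, url)) = frontier.pop_first() {
        for &link in graph.get(url.as_str()).into_iter().flatten() {
            if seen.insert(link.to_string()) {
                frontier.insert((prio(link), link.to_string()));
            }
        }
        order.push(url);
    }
    order
}
```

The real test does the same comparison over blob hashes and index entries as well as URL order, and over adversarial simulated universes rather than a hand-built graph.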
One coordinator. N workers. Zero external infrastructure. The frontier server exposes the same deterministic scheduler over HTTP — workers pull URLs, fetch pages, push discovered links back. Horizontal scaling without Redis, Kafka, or message queues.
# Terminal 1 — Start the frontier server
palimpsest serve -p 8090 -s 42 --politeness-ms 500
# Terminal 2 — Worker A
palimpsest worker --server http://localhost:8090 -o ./output-a
# Terminal 3 — Worker B
palimpsest worker --server http://localhost:8090 -o ./output-b
# Seed the crawl
curl -X POST http://localhost:8090/seeds \
-d '{"urls":["https://example.com/","https://zuub.com/"]}'
# Monitor progress
curl http://localhost:8090/status
# {"queue_size":847,"seen_count":1203,"host_count":2,"seed_value":42}
Why not Redis? The frontier server is the entire state machine — seed-driven ordering, politeness enforcement, URL dedup. Redis would require reimplementing all of this in Lua scripts. The HTTP server wraps the existing Frontier struct directly, preserving all Six Laws guarantees.
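The coordinator's contract is small enough to sketch in-process. This is a simplified, hypothetical stand-in (no seed ordering or politeness, invented names) showing the pull/push/dedup cycle the HTTP endpoints expose:

```rust
use std::collections::{BTreeSet, VecDeque};

/// Simplified sketch of the state the frontier server owns: a work queue
/// plus a dedup set. Workers pull a URL, fetch it, and push discovered
/// links back -- no Redis, no Kafka, just this one struct behind HTTP.
struct FrontierServer {
    queue: VecDeque<String>,
    seen: BTreeSet<String>,
}

impl FrontierServer {
    fn new(seeds: &[&str]) -> Self {
        let mut server = FrontierServer {
            queue: VecDeque::new(),
            seen: BTreeSet::new(),
        };
        for &url in seeds {
            server.push(url);
        }
        server
    }

    /// Seeding and link discovery both funnel through here; duplicates
    /// are dropped by the seen-set.
    fn push(&mut self, url: &str) {
        if self.seen.insert(url.to_string()) {
            self.queue.push_back(url.to_string());
        }
    }

    /// What a worker calls to claim the next URL.
    fn pull(&mut self) -> Option<String> {
        self.queue.pop_front()
    }

    /// Mirrors the /status endpoint: (queue_size, seen_count).
    fn status(&self) -> (usize, usize) {
        (self.queue.len(), self.seen.len())
    }
}
```

The real Frontier additionally orders pops by seed-derived priority and enforces per-host politeness delays; the dedup-then-enqueue shape is the same.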
CLAUDE.md was written before any Rust code. The Six Laws, error taxonomy, dependency policy, and testing philosophy were all defined upfront. Claude Code enforced these constraints on every tool call through hooks. The result: zero invariant violations across 269 tests and 27 commits.
We didn't wait until the end to test against production HTML. The zuub.com crawl at commit 9 exposed a real bug (politeness starvation) that no unit test would have caught. The wget side-by-side comparison found three more bugs (script tag extraction, scheme normalization, WARC angle brackets). Every bug found by reality was fixed and tested before moving on.
The simulation framework (palimpsest-sim) generates a virtual internet with six adversarial universes: LinkMaze, EncodingHell, MalformedDom, RedirectLabyrinth, ContentTrap, and TemporalDrift. Crawling this simulated web twice with the same seed and asserting byte-identical results is the ultimate proof of Law 1 (Determinism). We proved it at 5,000 pages.
Claude Code's agent system let us launch three independent tasks in parallel (sub-resource capture, object store backend, crawl resumption) on isolated git worktrees. While the agents explored complex CDP event wiring, we implemented the simpler features directly. The agents provided architectural insight even when they didn't produce final code.
Same seed = identical crawl, identical artifacts, identical replay (live — v0.1.0)
Headless Chrome with JS determinism overrides, sub-resource graph, DOM snapshots (live — v0.2.0)
Multi-dimensional queries: URL × time × hash × context (live — v0.2.0)
S3, GCS, Azure, local filesystem — content-addressed with BLAKE3 (live — v0.2.0)
Diff Palimpsest output against Heritrix, wget, Warcprox WARC files (live — v0.1.0)
6 adversarial universes, orchestrator-level verification, 5K page scale proof (live — v0.2.0)
HTTP frontier server + N workers. Zero external infrastructure. (shipped — v0.4.0)
HTML → clean text → provenance-tagged chunks for embedding (shipped — v0.4.0)
/v1/content, /v1/chunks, /v1/search — HTTP JSON with CORS (shipped — v0.4.0)
EmbeddingProvider trait + hash-based test embedder. SQLite vector store with BLOB serialization. (shipped — v0.5.0)
Cosine similarity search over stored embeddings. Top-k results with full provenance per match. (shipped — v0.5.0)
LCS-based line diff across captures. Hunks, similarity ratio, added/removed/unchanged counts. (shipped — v0.5.0)
Multi-stage build, 4-service compose: API, frontier, worker, crawl. Production-ready containers. (live — v0.6.0)
9 atomic counters exposed in text exposition format. Scrape-ready for Grafana dashboards. (live — v0.6.0)
Determinism verified at 10,000 pages across 5 adversarial universes. Zero divergence. (live — v0.6.0)
BoringSSL via wreq. JA3/JA4 matching with post-quantum key shares. 70+ browser profiles. (stealth — v0.7.0)
Akamai h2 passive fingerprint: SETTINGS, WINDOW_UPDATE, pseudo-header order per browser. (stealth — v0.7.0)
17 anti-detection patches: webdriver, chrome object, plugins, canvas/audio noise, WebGL. All seeded. (stealth — v0.7.0)
Unified identity: TLS + HTTP/2 + headers + JS. Seeded from CrawlSeed. Per-domain rotation. (stealth — v0.7.0)
Every layer impersonates a real browser. Cross-layer consistency prevents detection. All values deterministic — seeded from CrawlSeed (Law 1).
TLS: Cipher suites, extensions, curves, ALPN, post-quantum key share (X25519MLKEM768). 70+ browser profiles via wreq.
HTTP/2: SETTINGS frame values & order, WINDOW_UPDATE, pseudo-header order (:method, :authority, :scheme, :path). Per-browser.
JavaScript: navigator.webdriver, window.chrome, plugins, WebGL, canvas noise, AudioContext noise, ClientRect noise, permissions.
Profile: BrowserProfile ties all layers together. Chrome/Firefox/Safari/Edge presets. Per-domain rotation via BLAKE3(seed + domain).
Measured from deterministic simulation runs — two identical crawls, bit-compared.
48 commits. 288 tests. 15 crates. 7 releases. CI green. Every line written in a single Claude Code session.