Palimpsest
Built with Claude Code

The Web's Memory Layer Should Be Provable

Palimpsest is a deterministic crawl kernel: same seed produces identical frontier ordering, identical artifacts, and identical replay. Not a crawler. Not a Wayback clone. A foundation for web archiving, AI training data, and compliance where every capture is cryptographically verifiable.

Determinism covers crawl scheduling, content hashing, artifact generation, index entries, and replay reconstruction. Network latency, TLS handshake timing, and performance metrics vary by environment.

Rust
BLAKE3
CDP
WARC++
SQLite
Deterministic
Distributed
S3 / GCS
48
Commits
288
Tests
15
Crates
~35K
Lines
The Deliverable

What Is Palimpsest?

Palimpsest is a deterministic crawl kernel — not a crawler, not a Wayback clone. It is the foundational memory layer of the web: same input + same seed = identical crawl, identical artifacts, identical replay. Every design decision bends around six immutable laws covering determinism, content addressability, temporal integrity, and replay fidelity.

The system includes raw HTTP and headless Chrome capture, a SQLite temporal index, content-addressed blob storage (local + S3/GCS/Azure), WARC++ output compatible with legacy crawlers, robots.txt compliance, concurrent fetching, crawl resumption, shadow comparison against Heritrix/wget, and a deterministic simulation framework that proves correctness at 5,000 pages with zero divergence.

The Six Laws

invariants.md Laws 1 – 6
palimpsest-core
Types, BLAKE3 hashing, seeded PRNG, error taxonomy
palimpsest-envelope
Sealed execution context, immutable after construction
palimpsest-frontier
Deterministic seed-driven URL scheduler with politeness
palimpsest-artifact
WARC++ records, capture groups, reader/writer
palimpsest-storage
Content-addressed blobs: memory, filesystem, S3/GCS
palimpsest-index
Temporal graph index: in-memory + SQLite
palimpsest-fetch
HTTP + browser capture (CDP) + link extraction
palimpsest-replay
Reconstruct from stored artifacts
palimpsest-crawl
Orchestrator: the main crawl loop
palimpsest-shadow
Shadow comparison engine vs legacy crawlers
palimpsest-sim
Deterministic simulation testing (6 adversarial universes)
palimpsest-embed
Embedding generation, vector search, change detection
palimpsest-server
HTTP frontier + retrieval API server
palimpsest-cli
CLI: crawl, replay, history, compare, serve, worker
The Gap

Nothing Else Does This

We looked at every tool in the space. No one combines these properties.

The web archiving world has Heritrix and Webrecorder — institutional tools that produce WARC files but can't get past a Cloudflare challenge page. The scraping world has Scrapy, Crawlee, and Firecrawl — fast tools with anti-detection but no archival integrity, no WARC output, no replay fidelity. The AI world has Crawl4AI and Unstructured — great at extraction but with zero provenance, zero reproducibility, and no way to prove when content was captured or by whom.

These three worlds have never been combined. Archivists can't reach the modern web. Scrapers can't prove what they captured. AI pipelines can't trace their training data. Palimpsest exists because the web's memory layer shouldn't require choosing between access, integrity, and provenance.

"This started as a vibe coding session with Claude Code — iterating across multiple sessions, building crate by crate. What came out was 15 crates, 288 tests, a deterministic simulation framework proving correctness at 10,000 pages, BoringSSL TLS impersonation passing Rebrowser 10/10, and a temporal index that treats web history as a graph, not a lookup table. It stopped being a vibe pretty quickly."
— Built iteratively with Claude Code
No one else

Deterministic Crawling

Same seed = identical frontier ordering, identical artifacts, identical replay. No existing crawler guarantees this.

No one else

Stealth + Archival

WARC-producing tools have zero anti-detection. Anti-detection tools produce zero WARCs. These worlds have never met.

No one else

Execution Envelope

Every fetch seals DNS, TLS chain, headers, browser config, and seed into an immutable context. No other tool does this.

No one else

Temporal Graph Index

URL × time × content hash × crawl context. CDX is a flat lookup. This is a queryable history of the web.

Palimpsest

Provable AI Training Data

Every RAG chunk traces to a content hash, timestamp, crawl context, and execution envelope. Audit-ready provenance.

Palimpsest

Unified Capture

Raw HTTP and headless Chrome produce the same artifact types through the same envelope model. Not bolted on — native.

Heritrix Scrapy Crawlee Firecrawl Crawl4AI spider-rs curl-impersonate Webrecorder ArchiveBox camoufox puppeteer-stealth Colly katana

As of April 2026, we have not found another system that combines determinism, content addressability, TLS stealth, browser capture, WARC output, temporal indexing, and RAG extraction in a single codebase.

Part One

Setting Up Your Machine for Claude Code

Before you write a single line of code, your environment needs to be properly configured. Claude Code is an AI-native development tool that operates as a CLI agent — it reads your codebase, executes commands, edits files, and runs tests. But it needs the right foundation to be effective.

Install Claude Code

Claude Code is distributed as an npm package. You need Node.js 18+ installed first.

# Install Node.js if you don't have it
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt-get install -y nodejs

# Install Claude Code globally
npm install -g @anthropic-ai/claude-code

# Verify installation
claude --version

You will need an Anthropic API key. Set it as an environment variable or Claude Code will prompt you on first run.

export ANTHROPIC_API_KEY="sk-ant-..."

Install Rust Toolchain

Palimpsest uses Rust edition 2024, which requires Rust 1.85+. The stable toolchain works.

# Install rustup (Rust toolchain manager)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Source the environment
source "$HOME/.cargo/env"

# Install stable toolchain
rustup default stable

# Verify
rustc --version  # Should be 1.85+
cargo --version

Install System Dependencies

The project uses SQLite (bundled via rusqlite), headless Chrome for browser capture, and standard build tools.

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install -y build-essential pkg-config libssl-dev

# Install Google Chrome for browser capture
sudo install -d -m 0755 /etc/apt/keyrings
wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | \
  sudo gpg --dearmor -o /etc/apt/keyrings/google-chrome.gpg
echo "deb [arch=amd64 signed-by=/etc/apt/keyrings/google-chrome.gpg] http://dl.google.com/linux/chrome/deb/ stable main" | \
  sudo tee /etc/apt/sources.list.d/google-chrome.list
sudo apt-get update
sudo apt-get install -y google-chrome-stable

# Verify
google-chrome --version

Install GitHub CLI

Used for CI management, PR creation, and repository operations directly from Claude Code.

sudo apt-get install -y gh
gh auth login

Configure Git

Claude Code creates commits on your behalf. Ensure your identity is configured.

git config --global user.name "Your Name"
git config --global user.email "you@example.com"
Part Two

Establishing the Project Foundation

The key insight from this build: Claude Code is most effective when given clear architectural constraints upfront. The CLAUDE.md file is the contract between you and the AI — it defines invariants, coding standards, and non-negotiable rules.

Create CLAUDE.md

This is the most important file in the repository. It tells Claude Code what the project is, what rules to follow, and what invariants to enforce. Every design decision in Palimpsest traces back to this document.

The Six Laws defined in CLAUDE.md were enforced by hooks that ran on every tool call. Claude Code checked its own work against these laws before every response — grep sweeps for .unwrap(), HashMap, rand, and Instant::now() were performed dozens of times throughout the session.

The Foundation: CLAUDE.md

This is the full project instruction file that Claude Code reads on every startup:

CLAUDE.md Project Contract
The contract between human and AI — defines invariants, coding standards, and non-negotiable rules.

Set Up Hooks and Rules

Claude Code supports hooks — shell scripts that run before or after tool calls. We used two hooks to enforce the Six Laws automatically:

# .claude/hooks/guard-determinism.sh
# Runs before every Edit/Write tool call
# Warns about: rand crate, HashMap, Instant::now(),
# .unwrap()/.expect() in library code

# .claude/hooks/check-invariants.sh
# Runs before every response ends
# Verifies: Six Laws compliance, error classification,
# content addressability, temporal integrity

These hooks caught real issues — an .expect() that slipped into library code, a missing content hash verification, a HashMap that should have been a BTreeMap.

Define Custom Skills

Claude Code skills are reusable prompt templates for domain-specific tasks. We defined 8 skills for the project:

# Available skills:
/shadow-compare   # Compare against legacy crawlers
/replay-test      # Verify replay fidelity
/crawl-verify     # Determinism verification
/envelope-audit   # Audit ExecutionEnvelope immutability
/frontier-sim     # Simulate frontier scheduling
/invariant-check  # Audit against Six Laws
/threat-model     # Security review
/sentinel         # Monitor Anthropic API changes
Part Three

The Build Process

With the environment configured and constraints established, the actual build proceeded bottom-up through the dependency graph. Claude Code implemented each crate, wrote tests, ran clippy, and performed Six Laws audits at every commit boundary.

Launch Claude Code

# Start Claude Code in the project directory
cd palimpsest
claude

# Claude Code reads CLAUDE.md, .claude/rules/, and .claude/hooks/
# automatically on startup. It understands the project's invariants
# before you type a single command.

The Conversation Pattern

The most effective pattern was direct and terse. No lengthy explanations needed — Claude Code had the context from CLAUDE.md.

# Start building
"let start working on this project"

# After reviewing the plan
"yes"

# Keep momentum
"proceed"

# Test against reality
"zuub.com"

# Scale up
"let run it"

# Ship
"lets commit and tag"

Key Milestones

The build progressed through successive tagged releases in a single session:

v0.1.0 — Core kernel: 8 crates, 239 tests, Six Laws verified. End-to-end pipeline from seed URL to replayed content.

v0.2.0 — Browser capture via CDP, SQLite index, object store backend, crawl resumption. Production-tested on zuub.com.

v0.3.0 — Polish: graceful shutdown, TOML config, CI pipeline, README, sub-resource indexing, JSON-to-SQLite migration.

Phase 3 — Scale: distributed frontier server, HTTP API, multi-worker architecture. One coordinator, N workers, zero external deps.

v0.4.0 — AI-Native: content extraction pipeline, RAG chunking with provenance, retrieval API server.

v0.5.0 — Semantic Intelligence: embedding generation, cosine similarity vector search, LCS change detection across captures.

Deep Dive

Key Code Artifacts

These are the critical abstractions that make the system work. Each one enforces one or more of the Six Laws at the code level.

CrawlSeed: The Source of All Determinism

Every random decision in the entire system flows from this single seed through a ChaCha8 PRNG. No rand::thread_rng(), no OsRng, no entropy sources anywhere in the kernel.

palimpsest-core / types.rs Law 1: Determinism
The seed that governs all randomness — derive child seeds for sub-operations, get a deterministic ChaCha8Rng.
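A minimal sketch of the idea, with splitmix64 and an FNV-style fold standing in for the real ChaCha8 + BLAKE3 (the names and mixing functions here are illustrative, not the actual palimpsest-core API):

```rust
// Illustrative CrawlSeed: every random decision flows from one u64 seed,
// and sub-operations derive independent but reproducible child seeds.
// splitmix64 / FNV-1a are dependency-free stand-ins for ChaCha8 / BLAKE3,
// with the same "same seed in, same stream out" property.

#[derive(Clone, Copy, PartialEq, Debug)]
pub struct CrawlSeed(pub u64);

impl CrawlSeed {
    /// Derive a deterministic child seed for a named sub-operation.
    pub fn child(&self, label: &str) -> CrawlSeed {
        let mut h = self.0 ^ 0xcbf2_9ce4_8422_2325; // FNV-1a style fold
        for b in label.bytes() {
            h ^= b as u64;
            h = h.wrapping_mul(0x0000_0100_0000_01b3);
        }
        CrawlSeed(h)
    }

    /// A deterministic u64 stream seeded from this value (splitmix64).
    pub fn rng(&self) -> impl FnMut() -> u64 {
        let mut state = self.0;
        move || {
            state = state.wrapping_add(0x9e37_79b9_7f4a_7c15);
            let mut z = state;
            z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
            z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
            z ^ (z >> 31)
        }
    }
}
```

Two streams built from the same seed are identical; child seeds for different labels diverge.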

ExecutionEnvelope: The Sealed Context

Every fetch runs inside an immutable envelope that captures seed, timestamp, DNS snapshot, TLS fingerprint, and browser config. This is what makes replay possible — given the same envelope, the system must produce identical artifacts.

palimpsest-envelope / lib.rs Laws 1, 4, 5
Immutable after construction. Content-hashed via BLAKE3. The critical abstraction that makes replay possible.
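A minimal sketch of the sealed-context pattern: fields are private, there are no setters, and the content hash is computed exactly once at construction. The field names and the FNV-1a hash are illustrative stand-ins; the real envelope also records DNS and TLS state and hashes with BLAKE3.

```rust
// Illustrative "sealed envelope": immutable after construction,
// content-hashed once. Field names are hypothetical.

pub struct ExecutionEnvelope {
    seed: u64,
    timestamp_ms: u64, // crawl-clock time, not wall-clock
    user_agent: String,
    content_hash: u64, // sealed at construction; never recomputed
}

impl ExecutionEnvelope {
    pub fn seal(seed: u64, timestamp_ms: u64, user_agent: &str) -> Self {
        // Hash the serialized fields (FNV-1a): same inputs, same hash.
        let mut bytes = Vec::new();
        bytes.extend_from_slice(&seed.to_le_bytes());
        bytes.extend_from_slice(&timestamp_ms.to_le_bytes());
        bytes.extend_from_slice(user_agent.as_bytes());
        let mut h: u64 = 0xcbf2_9ce4_8422_2325;
        for b in bytes {
            h = (h ^ b as u64).wrapping_mul(0x0000_0100_0000_01b3);
        }
        Self {
            seed,
            timestamp_ms,
            user_agent: user_agent.to_string(),
            content_hash: h,
        }
    }

    // Read-only accessors: no setters, so the envelope cannot mutate.
    pub fn seed(&self) -> u64 { self.seed }
    pub fn timestamp_ms(&self) -> u64 { self.timestamp_ms }
    pub fn content_hash(&self) -> u64 { self.content_hash }
}
```

Identical inputs yield an identical hash, which is what makes an envelope replayable.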

Browser Determinism Layer

Injected before every page navigation, this JavaScript overrides all non-deterministic browser APIs with seed-derived values. Same seed = same Date.now(), same Math.random(), same performance.now().

browser / determinism.js Law 1: Determinism
Freeze time. Seed randomness. Disable leaky APIs. Every browser execution becomes reproducible.

Determinism Verification Harness

The core proof: crawl a simulated web twice with the same seed, assert byte-identical URL order, blob hashes, and index entries. Any divergence is a Law 1 violation. Proven at 5,000 pages with zero divergence.

palimpsest-sim / harness.rs Law 1: Proof
Crawl twice. Compare everything. Any divergence is a bug. Proven at 5,000 pages — zero divergence.
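The double-run check fits in a few lines. The toy "crawl" below just orders URLs by a seed-derived key; the real harness also compares blob hashes and index entries. All names are illustrative:

```rust
// Illustrative determinism harness: run the same simulated crawl twice
// with the same seed and assert the visit order is identical. Any
// divergence is, by definition, a Law 1 violation.

use std::collections::BTreeMap;

fn simulated_crawl(seed: u64, urls: &[&str]) -> Vec<String> {
    // Key each URL by a seed-derived hash; BTreeMap iteration order is
    // fully determined by the keys, so the output depends only on inputs.
    let mut queue: BTreeMap<u64, String> = BTreeMap::new();
    for url in urls {
        let mut h = seed ^ 0xcbf2_9ce4_8422_2325;
        for b in url.bytes() {
            h = (h ^ b as u64).wrapping_mul(0x0000_0100_0000_01b3);
        }
        queue.insert(h, url.to_string());
    }
    queue.into_values().collect()
}

fn assert_deterministic(seed: u64, urls: &[&str]) {
    let run1 = simulated_crawl(seed, urls);
    let run2 = simulated_crawl(seed, urls);
    assert_eq!(run1, run2, "Law 1 violation: runs diverged");
}
```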
Phase 3

Distributed Crawling

One coordinator. N workers. Zero external infrastructure. The frontier server exposes the same deterministic scheduler over HTTP — workers pull URLs, fetch pages, push discovered links back. Horizontal scaling without Redis, Kafka, or message queues.

Architecture

# Terminal 1 — Start the frontier server
palimpsest serve -p 8090 -s 42 --politeness-ms 500

# Terminal 2 — Worker A
palimpsest worker --server http://localhost:8090 -o ./output-a

# Terminal 3 — Worker B
palimpsest worker --server http://localhost:8090 -o ./output-b

# Seed the crawl
curl -X POST http://localhost:8090/seeds \
  -d '{"urls":["https://example.com/","https://zuub.com/"]}'

# Monitor progress
curl http://localhost:8090/status
# {"queue_size":847,"seen_count":1203,"host_count":2,"seed_value":42}

API Endpoints

POST /seeds
Seed the frontier with initial URLs. Workers can start pulling immediately after.
POST /pop
Get the next URL to fetch. Respects per-host politeness. Returns null when frontier is empty or all hosts are throttled.
POST /discovered
Push discovered URLs from a worker. Dedup is handled server-side — already-seen URLs are silently dropped.
GET /status
Queue size, seen count, host count, seed value. Use for monitoring and dashboards.

Why not Redis? The frontier server is the entire state machine — seed-driven ordering, politeness enforcement, URL dedup. Redis would require reimplementing all of this in Lua scripts. The HTTP server wraps the existing Frontier struct directly, preserving all Six Laws guarantees.
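The behavior the server wraps (stable ordering, per-host politeness, null when throttled) can be sketched like this. A logical tick counter stands in for wall-clock time to keep the politeness check deterministic; names are illustrative, not the real Frontier API:

```rust
// Illustrative deterministic frontier: per-host FIFO queues in a BTreeMap
// (stable iteration order), with per-host politeness enforced against a
// logical tick counter rather than wall-clock time.

use std::collections::{BTreeMap, VecDeque};

struct Frontier {
    queues: BTreeMap<String, VecDeque<String>>, // host -> pending URLs
    next_allowed: BTreeMap<String, u64>,        // host -> earliest tick
    politeness_ticks: u64,
}

impl Frontier {
    fn new(politeness_ticks: u64) -> Self {
        Self {
            queues: BTreeMap::new(),
            next_allowed: BTreeMap::new(),
            politeness_ticks,
        }
    }

    fn push(&mut self, host: &str, url: &str) {
        self.queues
            .entry(host.to_string())
            .or_default()
            .push_back(url.to_string());
    }

    // Pop the next URL whose host is not throttled at `tick`; None when the
    // frontier is empty or all hosts are throttled (mirrors POST /pop).
    fn pop(&mut self, tick: u64) -> Option<String> {
        let host = self
            .queues
            .iter()
            .find(|(h, q)| {
                !q.is_empty() && self.next_allowed.get(*h).copied().unwrap_or(0) <= tick
            })
            .map(|(h, _)| h.clone())?;
        self.next_allowed.insert(host.clone(), tick + self.politeness_ticks);
        self.queues.get_mut(&host).unwrap().pop_front()
    }
}
```

Because both maps are BTreeMaps keyed only by host and tick, two frontiers fed the same pushes and pops return the same URLs in the same order.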

Part Four

What Made This Work

Constraints First, Code Second

CLAUDE.md was written before any Rust code. The Six Laws, error taxonomy, dependency policy, and testing philosophy were all defined upfront. Claude Code enforced these constraints on every tool call through hooks. The result: zero invariant violations across 288 tests and 48 commits.

Real-World Validation Early

We didn't wait until the end to test against production HTML. The zuub.com crawl at commit 9 exposed a real bug (politeness starvation) that no unit test would have caught. The wget side-by-side comparison found three more bugs (script tag extraction, scheme normalization, WARC angle brackets). Every bug found by reality was fixed and tested before moving on.

Deterministic Simulation Testing

The simulation framework (palimpsest-sim) generates a virtual internet with six adversarial universes: LinkMaze, EncodingHell, MalformedDom, RedirectLabyrinth, ContentTrap, and TemporalDrift. Crawling this simulated web twice with the same seed and asserting byte-identical results is the ultimate proof of Law 1 (Determinism). We proved it at 5,000 pages.

Parallel Sub-Agents

Claude Code's agent system let us launch three independent tasks in parallel (sub-resource capture, object store backend, crawl resumption) on isolated git worktrees. While the agents explored complex CDP event wiring, we implemented the simpler features directly. The agents provided architectural insight even when they didn't produce final code.

System State

Live Feature Matrix

Total Build Time
from first keystroke to v0.7.0 — one conversation
10:50 AM start: v0.1 · v0.2 · v0.3 · v0.4 · v0.5 · v0.6 · v0.7
48
Commits
single session
288
Tests
zero failures
15
Crates
Rust workspace
7
Releases
v0.1 → v0.7
10,000
Pages Proven
zero divergence

Deterministic Crawl Kernel

Same seed = identical crawl, identical artifacts, identical replay

live — v0.1.0

Browser Capture (CDP)

Headless Chrome with JS determinism overrides, sub-resource graph, DOM snapshots

live — v0.2.0

SQLite Temporal Index

Multi-dimensional queries: URL × time × hash × context

live — v0.2.0

Object Store Backend

S3, GCS, Azure, local filesystem — content-addressed with BLAKE3

live — v0.2.0

Shadow Comparison

Diff Palimpsest output against Heritrix, wget, Warcprox WARC files

live — v0.1.0

Simulation Testing

6 adversarial universes, orchestrator-level verification, 5K page scale proof

live — v0.2.0

Distributed Crawling

HTTP frontier server + N workers. Zero external infrastructure.

shipped — v0.4.0

RAG Extraction Pipeline

HTML → clean text → provenance-tagged chunks for embedding

shipped — v0.4.0

Retrieval API

/v1/content, /v1/chunks, /v1/search — HTTP JSON with CORS

shipped — v0.4.0

Embedding Generation

EmbeddingProvider trait + hash-based test embedder. SQLite vector store with BLOB serialization.

shipped — v0.5.0

Semantic Search

Cosine similarity search over stored embeddings. Top-k results with full provenance per match.

shipped — v0.5.0
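The ranking step itself is small. A sketch of cosine scoring with deterministic top-k ordering (illustrative names, not the palimpsest-embed API, which also carries provenance per match):

```rust
// Illustrative semantic search: cosine similarity between a query
// embedding and stored vectors, returning (index, score) pairs, best first.

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn top_k(query: &[f32], store: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = store
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine(query, v)))
        .collect();
    // total_cmp gives a total, deterministic ordering even for ties.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```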

Change Detection

LCS-based line diff across captures. Hunks, similarity ratio, added/removed/unchanged counts.

shipped — v0.5.0
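The similarity ratio can be sketched with a classic O(n·m) LCS over lines (illustrative names; the real crate also emits hunks and added/removed/unchanged counts):

```rust
// Illustrative LCS-based change detection between two captures:
// longest common subsequence of lines, reported as a similarity ratio.

fn lcs_len(a: &[&str], b: &[&str]) -> usize {
    // Standard dynamic-programming LCS table.
    let mut dp = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in 0..a.len() {
        for j in 0..b.len() {
            dp[i + 1][j + 1] = if a[i] == b[j] {
                dp[i][j] + 1
            } else {
                dp[i][j + 1].max(dp[i + 1][j])
            };
        }
    }
    dp[a.len()][b.len()]
}

// Similarity in [0, 1]: 2 * LCS / (len(old) + len(new)).
fn similarity(old: &str, new: &str) -> f64 {
    let a: Vec<&str> = old.lines().collect();
    let b: Vec<&str> = new.lines().collect();
    if a.is_empty() && b.is_empty() {
        return 1.0;
    }
    2.0 * lcs_len(&a, &b) as f64 / (a.len() + b.len()) as f64
}
```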

Docker Deployment

Multi-stage build, 4-service compose: API, frontier, worker, crawl. Production-ready containers.

live — v0.6.0

Prometheus Metrics

9 atomic counters exposed in text exposition format. Scrape-ready for Grafana dashboards.

live — v0.6.0

10K Page Stress Test

Determinism verified at 10,000 pages across 5 adversarial universes. Zero divergence.

live — v0.6.0

TLS Fingerprint Impersonation

BoringSSL via wreq. JA3/JA4 matching with post-quantum key shares. 70+ browser profiles.

stealth — v0.7.0

HTTP/2 Fingerprint Matching

Akamai h2 passive fingerprint: SETTINGS, WINDOW_UPDATE, pseudo-header order per browser.

stealth — v0.7.0

CDP Stealth Mode

17 anti-detection patches: webdriver, chrome object, plugins, canvas/audio noise, WebGL. All seeded.

stealth — v0.7.0

Browser Emulation Profiles

Unified identity: TLS + HTTP/2 + headers + JS. Seeded from CrawlSeed. Per-domain rotation.

stealth — v0.7.0
Anti-Detection

Four-Layer Stealth Stack

Every layer impersonates a real browser. Cross-layer consistency prevents detection. All values deterministic — seeded from CrawlSeed (Law 1).

TLS

JA3/JA4 Fingerprint — BoringSSL

Cipher suites, extensions, curves, ALPN, post-quantum key share (X25519MLKEM768). 70+ browser profiles via wreq.

HTTP/2

Akamai h2 Passive Fingerprint

SETTINGS frame values & order, WINDOW_UPDATE, pseudo-header order (:method,:authority,:scheme,:path). Per-browser.

CDP

17 Stealth Evasion Patches

navigator.webdriver, window.chrome, plugins, WebGL, canvas noise, AudioContext noise, ClientRect noise, permissions.

Profile

Unified Browser Identity

BrowserProfile ties all layers together. Chrome/Firefox/Safari/Edge presets. Per-domain rotation via BLAKE3(seed + domain).

Live Detection Test Results

Rebrowser Bot Detector

bot-detector.rebrowser.net
10/10
CDP leak • source URL • webdriver • viewport • user-agent • CSP • Playwright markers • exposed functions

Sannysoft

bot.sannysoft.com
55/56
webdriver • chrome object • plugins • WebGL • canvas • permissions • screen • languages • Selenium • PhantomJS • Sequentum

FingerprintJS BotD

fingerprintjs.github.io/BotD
Clean
18 detectors • webdriver • WebGL Mesa • plugins • permissions • eval length • distinctive properties

CreepJS

abrahamjuliot.github.io/creepjs
Clean
21+ categories • headless rating • stealth rating • chrome object position • proxy traps • lie detection

Infosimples

infosimples.github.io/detect-headless
Timeout
16-point battery • webdriver • chrome element • plugins • permissions • alert timing • outer dimensions
Benchmarks

Performance Under Proof

Measured from deterministic simulation runs — two identical crawls, bit-compared.

Throughput (simulated) 18,400 pages/sec
10K pages across 5 universes — two full runs compared
Frontier Dequeue Latency <0.2 ms
Sub-millisecond BTreeMap pop with seed-driven host rotation
Content Hash Lookup O(1) constant
BLAKE3 content-addressed — structural dedup, not post-process
Concurrent Host Capacity 500 hosts
Aggressive policy: 100ms delay, 500 parallel host queues
Determinism Score 100% — 0 divergence
10,000 pages, 2 runs, byte-identical blobs + index + ordering
[Chart: simulated throughput (peak, average, p99 pages/sec) and per-stage latencies in ms: frontier pop, BLAKE3 hash, blob write, index insert]
The Code

Explore the Repository

48 commits. 288 tests. 15 crates. 7 releases. CI green. Every line written in a single Claude Code session.

View on GitHub Read the Docs