Introduction

Palimpsest

Palimpsest is a deterministic crawl kernel — not a crawler, not a Wayback clone, not a scraping framework. It is the foundational memory layer of the web: a system where the same input and the same seed produce an identical crawl, identical artifacts, and identical replay. Every design decision bends around this property.

What Makes This Different

Traditional web archiving tools (Heritrix, wget, Scrapy, Brozzler) treat crawling as an inherently non-deterministic process. Network jitter, DNS resolution timing, thread scheduling, and random retry backoff all introduce entropy. Two runs of the same crawl produce different results. This makes verification impossible, replay approximate, and auditing meaningless.

Palimpsest eliminates this. The system is governed by Six Laws — determinism, idempotence, content addressability, temporal integrity, replay fidelity, and observability as proof — that are enforced at every layer, from the frontier scheduler to the artifact serializer.

The result: a crawl kernel that auditors can trust, AI systems can consume, historians can depend on, and adversaries cannot easily corrupt.

The System at a Glance

Metric               Value
Crates               15 Rust workspace members
Tests                301 (zero failures)
Determinism proof    10,000 pages, zero divergence
Storage              Content-addressed (BLAKE3) with structural deduplication
Format               WARC++ (ISO 28500 extension)
Index                Temporal graph: URL x time x hash x context
Capture              Raw HTTP + headless Chrome (CDP)
Distribution         HTTP frontier server + N workers

How to Read This Documentation

  • Getting Started — Install, run your first crawl, configure the system.
  • Architecture — System design, the Six Laws, crate dependency graph, data flow.
  • Core Concepts — Deep dives into determinism, content addressability, the execution envelope, temporal indexing, and the WARC++ format.
  • Crate Reference — Complete API documentation for all 15 crates.
  • Operations — Docker deployment, distributed crawling, retrieval API, monitoring.
  • Security — Trust boundaries, fetch safety, browser sandboxing.
  • Testing — Testing philosophy, the simulation framework, adversarial universes.
  • Contributing — Development setup, code standards, commit conventions.
  • Appendix — Error taxonomy, API quick reference, glossary.

Installation

Prerequisites

Dependency            Required    Notes
Rust 1.86+            Yes         Stable toolchain via rustup
Git                   Yes         Source checkout
C compiler + CMake    Yes         BoringSSL build (via wreq)
Go 1.19+              Yes         BoringSSL build (via wreq)
Chrome or Chromium    Optional    Browser capture mode (--browser)
Docker                Optional    Containerized deployment

Why the C/Go toolchain?

Palimpsest uses wreq with BoringSSL for TLS fingerprint impersonation. BoringSSL is compiled from source during cargo build, which requires CMake, a C compiler, and Go.

macOS

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

# Install build dependencies (Xcode command line tools + CMake + Go)
xcode-select --install
brew install cmake go

# Clone and build
git clone https://github.com/copyleftdev/palimpsest.git
cd palimpsest
cargo build --release

# Verify
./target/release/palimpsest --help

Chrome for browser capture:

# Chrome is usually at:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --version

# Or install via Homebrew:
brew install --cask google-chrome

Linux (Ubuntu/Debian)

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

# Install build dependencies
sudo apt update
sudo apt install -y build-essential cmake golang-go pkg-config libclang-dev

# Clone and build
git clone https://github.com/copyleftdev/palimpsest.git
cd palimpsest
cargo build --release

# Verify
./target/release/palimpsest --help

Chrome for browser capture:

# Install Chrome
wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | sudo gpg --dearmor -o /usr/share/keyrings/google-chrome.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/google-chrome.gpg] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google-chrome.list
sudo apt update && sudo apt install -y google-chrome-stable

# Verify
google-chrome --version

Linux (Fedora/RHEL)

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env

# Install build dependencies
sudo dnf install -y gcc gcc-c++ cmake golang clang-devel pkg-config

# Clone and build
git clone https://github.com/copyleftdev/palimpsest.git
cd palimpsest
cargo build --release

Linux (Arch)

sudo pacman -S rust cmake go clang pkg-config
git clone https://github.com/copyleftdev/palimpsest.git
cd palimpsest
cargo build --release

Windows

Option A: Native (MSVC)

# 1. Install Rust from https://rustup.rs/ (choose MSVC toolchain)

# 2. Install Visual Studio Build Tools (C/C++ workload)
#    https://visualstudio.microsoft.com/visual-cpp-build-tools/

# 3. Install CMake
#    https://cmake.org/download/ (add to PATH during install)

# 4. Install Go
#    https://go.dev/dl/ (add to PATH during install)

# 5. Clone and build
git clone https://github.com/copyleftdev/palimpsest.git
cd palimpsest
cargo build --release

# 6. Verify
.\target\release\palimpsest.exe --help

Option B: WSL2

Windows Subsystem for Linux gives you a full Linux environment. Follow the Linux (Ubuntu/Debian) instructions above inside WSL2:

# Install WSL2 with Ubuntu
wsl --install -d Ubuntu

# Then inside the WSL2 terminal, follow the Linux instructions

Option C: Docker (any platform)

If you don’t want to install build tools, Docker works on all platforms:

docker build -t palimpsest .
docker run palimpsest --help
docker run -v "$(pwd)/output:/data" palimpsest crawl https://example.com -d 2 -o /data

See Docker Deployment for the full compose setup.

Verifying the Build

After building, you should see all 10 subcommands:

$ palimpsest --help
Usage: palimpsest <COMMAND>

Commands:
  crawl           Start a crawl with seed URLs
  replay          Reconstruct a captured URL from artifacts
  history         Show capture history for a URL
  extract         Extract text and RAG chunks from captured content
  shadow-compare  Compare against legacy crawler WARC files
  serve           Start a distributed frontier server
  worker          Connect to a frontier server and crawl
  api             Start the retrieval API server
  stats           Print workspace statistics
  migrate         Run storage migrations

Running the Test Suite

# All tests (288, excludes long-running scale tests)
cargo test --workspace

# Simulation framework only
cargo test -p palimpsest-sim --test simulation_tests

# Scale tests (1K + 5K pages, ~90 seconds)
cargo test -p palimpsest-sim --test scale_test

# Stress test (10K pages)
cargo test -p palimpsest-sim --test stress_test

# Stealth regression tests (requires Chrome + network)
cargo test -p palimpsest-fetch --test stealth_test -- --ignored --nocapture --test-threads=1

Troubleshooting

BoringSSL build fails

The most common build issue. Check:

cmake --version   # Need 3.x+
go version        # Need 1.19+
clang --version   # Or gcc — need a C compiler

On macOS, ensure Xcode command line tools are installed: xcode-select --install

On Windows, ensure Visual Studio Build Tools include the “Desktop development with C++” workload.

Chrome not found (browser capture)

Palimpsest looks for Chrome/Chromium in PATH. If installed in a non-standard location:

# macOS — add to PATH
export PATH="/Applications/Google Chrome.app/Contents/MacOS:$PATH"

# Windows — add to PATH
set PATH=%PATH%;C:\Program Files\Google\Chrome\Application

openssl-sys linker errors

Palimpsest uses BoringSSL (via wreq), not OpenSSL. If you see openssl-sys errors, another dependency may be pulling it in. Check with:

cargo tree -i openssl-sys

If present, the boring-sys2 crate’s prefix-symbols feature should prevent symbol conflicts on Linux. On macOS this is not typically an issue.

Your First Crawl

Basic Crawl

palimpsest crawl https://example.com -d 2 -m 50 -o ./output

Flag         Meaning
-d 2         Maximum depth from the seed URL
-m 50        Maximum of 50 URLs to fetch
-o ./output  Persist artifacts to disk

The default seed is 42. The default politeness delay is 1 second per host.

Output Structure

After the crawl completes, ./output contains:

output/
  blobs/          # Content-addressed storage (BLAKE3 hashes)
    af/
      1349b9f5... # Blob file named by hash
    c7/
      d2fe1a6b...
  index.sqlite    # Temporal index database
  output.warc     # WARC++ file (ISO 28500 compatible)
  frontier.json   # Saved frontier state (for resumption)

Replay a Captured URL

Reconstruct the captured version of a page from stored artifacts:

palimpsest replay https://example.com/ --data-dir ./output

This retrieves the stored blob, HTTP headers, and execution context to reproduce the original response.

View Capture History

List all captures of a URL with timestamps and content hashes:

palimpsest history https://example.com/ --data-dir ./output

Extract Text and RAG Chunks

Extract clean text and provenance-tagged chunks from a captured page:

palimpsest extract https://example.com/ --data-dir ./output --json

This strips HTML, removes scripts and styles, splits into chunks (default 1000 chars with 200 overlap), and tags each chunk with source_url, captured_at, source_hash, chunk_hash, and char_offset.
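The windowing above can be sketched with plain Rust. The 1000-char chunks and 200-char overlap match the stated defaults; the function name and return shape are illustrative, not the palimpsest-extract API.

```rust
/// Illustrative sketch of overlapping chunking (not the crate's real API).
/// Splits text into windows of `size` chars overlapping by `overlap` chars,
/// returning each chunk paired with its starting char_offset.
fn chunk_with_overlap(text: &str, size: usize, overlap: usize) -> Vec<(usize, String)> {
    assert!(overlap < size, "overlap must be smaller than chunk size");
    let chars: Vec<char> = text.chars().collect();
    let step = size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push((start, chars[start..end].iter().collect()));
        if end == chars.len() {
            break;
        }
        start += step;
    }
    chunks
}

fn main() {
    // 2200 chars with defaults 1000/200 → windows at offsets 0, 800, 1600.
    let text = "a".repeat(2200);
    for (offset, chunk) in chunk_with_overlap(&text, 1000, 200) {
        println!("char_offset={} len={}", offset, chunk.len());
    }
}
```

Working on chars rather than bytes avoids splitting a multi-byte UTF-8 sequence mid-chunk; whether the real extractor counts chars or bytes is not stated here.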

Browser Capture

Capture JavaScript-rendered pages with headless Chrome:

palimpsest crawl https://example.com --browser -d 1 -m 10 -o ./output

This captures:

  • Rendered DOM after JavaScript execution
  • All sub-resources (CSS, JS, images, fonts)
  • Resource dependency graph with load ordering

Using a Deterministic Seed

The seed controls all randomness — frontier ordering, host rotation, and browser JS overrides:

# These two runs produce identical output
palimpsest crawl https://example.com -s 42 -d 2 -m 50 -o ./run-a
palimpsest crawl https://example.com -s 42 -d 2 -m 50 -o ./run-b

# Verify
diff <(find ./run-a/blobs -type f | sort) <(find ./run-b/blobs -type f | sort)
# No output = identical

Shadow Comparison

Compare output against a legacy crawler:

# Crawl with wget
wget --warc-file=legacy -r -l 1 https://example.com/

# Crawl with Palimpsest
palimpsest crawl https://example.com -d 1 -o ./palimpsest-out

# Compare
palimpsest shadow-compare --legacy ./ --palimpsest ./palimpsest-out

Configuration

TOML Config File

Pass a TOML configuration file instead of CLI flags:

palimpsest crawl -c crawl.toml

Example Configuration

seeds = ["https://example.com/", "https://docs.example.com/"]

[crawl]
seed = 42
max_depth = 3
max_urls = 500
concurrency = 10
user_agent = "PalimpsestBot/0.1"
browser_mode = false
scope = "same_domain"
output_dir = "./output"

[politeness]
min_host_delay_ms = 1000
max_concurrent_hosts = 100

Configuration Fields

Seeds

seeds = ["https://example.com/", "https://docs.example.com/"]

One or more seed URLs. The crawl starts from these and discovers links outward.

Crawl Seed

seed = 42

The deterministic seed value. Controls all randomness in the system: frontier ordering, host rotation, browser JS overrides. Same seed = identical crawl.

Scope

scope = "same_domain"

Value        Behavior
same_domain  Follow links within the registrable domain (e.g., www.example.com and docs.example.com both match example.com)
same_host    Exact host match only
any          No scope restriction (use with caution)
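
A real same_domain check needs the Public Suffix List to find the registrable domain; the sketch below approximates it with a naive "last two labels" guess, purely to illustrate how the three scope values differ. The function name and signature are illustrative, not the palimpsest-frontier API.

```rust
/// Illustrative scope check (not the real API). `same_host` is an exact
/// match; `same_domain` is approximated with a naive suffix test — the
/// real system would consult the Public Suffix List.
fn in_scope(scope: &str, seed_host: &str, candidate_host: &str) -> bool {
    match scope {
        "same_host" => candidate_host == seed_host,
        "same_domain" => {
            // Naive registrable-domain guess: last two labels of the seed host.
            let labels: Vec<&str> = seed_host.rsplitn(3, '.').collect();
            let root = if labels.len() >= 2 {
                format!("{}.{}", labels[1], labels[0])
            } else {
                seed_host.to_string()
            };
            candidate_host == root || candidate_host.ends_with(&format!(".{root}"))
        }
        _ => true, // "any": no restriction
    }
}

fn main() {
    assert!(in_scope("same_domain", "www.example.com", "docs.example.com"));
    assert!(!in_scope("same_host", "www.example.com", "docs.example.com"));
    assert!(in_scope("any", "www.example.com", "other.org"));
    println!("scope checks pass");
}
```

Note the naive guess mishandles suffixes like .co.uk — which is exactly why the real implementation cannot get away without the Public Suffix List.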

Politeness Policy

[politeness]
min_host_delay_ms = 1000      # Minimum delay between same-host requests
max_concurrent_hosts = 100     # Maximum hosts being fetched in parallel

Presets (when using the API directly):

Preset            Host Delay    Concurrent Hosts
default_policy()  1 second      100
aggressive()      100 ms        500
no_delay()        0             unlimited
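
As a rough sketch of what these presets look like as a type — the field names follow the TOML section above, but the actual PolitenessPolicy definition in palimpsest-frontier may differ:

```rust
/// Illustrative politeness policy type (field names assumed from the
/// [politeness] TOML section; not the crate's real definition).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct PolitenessPolicy {
    min_host_delay_ms: u64,
    max_concurrent_hosts: Option<u32>, // None = unlimited
}

impl PolitenessPolicy {
    fn default_policy() -> Self {
        Self { min_host_delay_ms: 1000, max_concurrent_hosts: Some(100) }
    }
    fn aggressive() -> Self {
        Self { min_host_delay_ms: 100, max_concurrent_hosts: Some(500) }
    }
    fn no_delay() -> Self {
        Self { min_host_delay_ms: 0, max_concurrent_hosts: None }
    }
}

fn main() {
    let p = PolitenessPolicy::default_policy();
    assert_eq!(p.min_host_delay_ms, 1000);
    println!("{:?}", PolitenessPolicy::aggressive());
}
```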

Depth and Limits

max_depth = 3       # Max link-following depth from seed (0 = seed page only)
max_urls = 500      # Hard cap on total URLs fetched
concurrency = 10    # Parallel fetch tasks

Browser Mode

browser_mode = true

Enables headless Chrome capture via CDP. Each page is loaded in a fresh browser context with determinism overrides applied (Date.now(), Math.random(), performance.now() are all seeded from CrawlSeed).

Output Directory

output_dir = "./output"

When set, artifacts are persisted to disk: content-addressed blobs, SQLite index, WARC++ file, and frontier state. When omitted, the crawl runs in-memory only.

CLI Flag Mapping

Config Field       CLI Flag           Default
seeds              positional args    (required)
seed               -s, --seed         42
max_depth          -d, --depth        2
max_urls           -m, --max-urls     100
min_host_delay_ms  --politeness-ms    1000
user_agent         --user-agent       PalimpsestBot/0.1
browser_mode       --browser          false
output_dir         -o, --output-dir   (none)
(config file)      -c, --config       (none)

System Overview

Palimpsest is a crawl kernel, not a crawler. The distinction matters: a crawler is a tool that fetches web pages. A crawl kernel is the deterministic execution engine that schedules fetches, seals execution contexts, captures artifacts, stores content-addressed blobs, indexes temporal state, and enables bit-identical replay.

The CLI, server, and UI are thin wrappers. The kernel is the product.

Layer Model

The system is organized into five layers, each with strict responsibilities:

┌─────────────────────────────────────────────────┐
│  Interface Layer                                 │
│  palimpsest-cli · palimpsest-server              │
├─────────────────────────────────────────────────┤
│  Orchestration Layer                             │
│  palimpsest-crawl · palimpsest-sim               │
├─────────────────────────────────────────────────┤
│  Capture Layer                                   │
│  palimpsest-fetch · palimpsest-artifact          │
│  palimpsest-extract · palimpsest-embed           │
├─────────────────────────────────────────────────┤
│  Persistence Layer                               │
│  palimpsest-storage · palimpsest-index           │
│  palimpsest-replay · palimpsest-shadow           │
├─────────────────────────────────────────────────┤
│  Foundation Layer                                │
│  palimpsest-core · palimpsest-envelope           │
│  palimpsest-frontier                             │
└─────────────────────────────────────────────────┘

Design Principles

Zero shared mutable state. The core kernel has no global state. All state flows through explicit parameters — seeds, envelopes, configs.

The ExecutionEnvelope is the critical abstraction. It seals the execution context (seed, timestamp, DNS snapshot, TLS fingerprint, browser config, headers) before any fetch occurs. Without the envelope, you cannot replay, verify, or prove anything.

Errors are artifacts. Every failure is classified into one of seven categories and stored as part of the crawl record. Errors are not noise — they are history.

Content is addressed, not located. Every blob is stored and retrieved by its BLAKE3 hash. Deduplication is structural, not post-process.

The Six Laws

Every design decision in Palimpsest bends around these six immutable laws. If a change violates any law, the change is wrong — not the law.

Law 1: Determinism

Frontier ordering is seed-driven. Retry logic is explicit. No hidden randomness anywhere.

Why it matters: Without determinism, you cannot verify a crawl, replay a crawl, or prove that two crawls are equivalent. Determinism is the foundation that makes every other law possible.

How it’s enforced:

  • All randomness flows from CrawlSeed through ChaCha8Rng (seeded PRNG)
  • No rand crate in any core path
  • BTreeMap for all ordered collections (never HashMap)
  • No Instant::now() or SystemTime::now() in core logic — time comes from the ExecutionEnvelope
  • Atomics are allowed for metrics counters only, never for control flow
  • Browser JS overrides: Date.now(), Math.random(), performance.now() are all seeded

What breaks if violated: Two runs with the same seed produce different output. Replay becomes approximate. Verification becomes impossible. The entire system reduces to a conventional crawler.

Law 2: Idempotence

Same URL + same execution context = identical artifact hash.

Why it matters: Idempotence enables deduplication, verification, and caching. If the same fetch produces different artifacts, you cannot distinguish content changes from system noise.

How it’s enforced:

  • ContentHash::of(data) produces a deterministic BLAKE3 hash
  • RecordId is generated from content_hash + record_type, not from random UUIDs
  • The ExecutionEnvelope freezes all inputs before the fetch begins
  • Response normalization is deterministic

What breaks if violated: Storage bloats with duplicate content under different hashes. Change detection produces false positives. Audit trails become unreliable.

Law 3: Content Addressability

All artifacts are BLAKE3 hash-addressed. Deduplication is structural.

Why it matters: Content addressing makes storage self-verifying. You can detect tampering by recomputing the hash. You get deduplication for free — identical content maps to the same hash, stored once.

How it’s enforced:

  • Every WarcRecord carries a Palimpsest-Content-Hash header
  • Every blob in storage is stored at a path derived from its BLAKE3 hash
  • FileSystemBlobStore uses git-style layout: {hash[0..2]}/{hash[2..]}
  • Integrity is verified on every read

What breaks if violated: Tampering becomes undetectable. Deduplication fails. Storage grows linearly instead of sublinearly.

Law 4: Temporal Integrity

Every capture binds wall clock + logical clock + crawl context + dependency chain.

Why it matters: The web changes constantly. Without precise temporal binding, you cannot answer “what did this page look like at time T?” or “which crawl produced this artifact?”

How it’s enforced:

  • CaptureInstant pairs wall clock (DateTime<Utc>) with logical clock (u64)
  • Every IndexEntry records URL, captured_at, content_hash, and crawl_context
  • CrawlContextId identifies the specific crawl session
  • CaptureGroup binds all records from a single fetch with their shared timestamp

What breaks if violated: History queries return ambiguous results. You cannot distinguish “same content, different time” from “different content, same time.”

Law 5: Replay Fidelity

Stored artifacts must be sufficient to reconstruct the HTTP exchange, DOM state, and resource dependency graph.

Why it matters: Replay is the proof that the system works. If you cannot reconstruct the original response from stored artifacts, the archive is incomplete.

How it’s enforced:

  • The ExecutionEnvelope stores the full context (seed, DNS, TLS, headers, browser config)
  • WARC++ records include envelope, dom-snapshot, resource-graph, and timing records
  • ReplayEngine reconstructs from envelope + stored artifacts
  • Same envelope + same artifacts = bit-identical reconstruction

What breaks if violated: The archive becomes a collection of blobs without enough context to interpret them. Legal and forensic use cases fail.

Law 6: Observability as Proof

Every decision is queryable. Every failure is replayable. Every artifact is verifiable.

Why it matters: A crawl system that cannot explain its own behavior is a black box. Observability is not a feature — it is the proof that the other five laws hold.

How it’s enforced:

  • Structured logging via tracing throughout the codebase
  • Prometheus metrics (9 atomic counters) exposed at /metrics
  • PalimpsestError classifies every failure into exactly one of seven categories
  • Errors are stored as artifacts in the crawl record
  • The temporal index makes every decision queryable

What breaks if violated: Debugging becomes guesswork. Compliance audits fail. Users cannot distinguish system bugs from legitimate content changes.

Crate Map

All 15 Crates

Crate                Layer          Responsibility                                         Key Invariant
palimpsest-core      Foundation     Types, BLAKE3 hashing, seeded PRNG, error taxonomy     No IO. Pure types only.
palimpsest-envelope  Foundation     Sealed execution context                               Immutable after construction
palimpsest-frontier  Foundation     Deterministic URL scheduler with politeness            Same seed = same traversal order
palimpsest-fetch     Capture        HTTP client + browser capture (CDP) + link extraction  Every fetch wraps an envelope
palimpsest-artifact  Capture        WARC++ serialization, capture groups                   Content-addressed outputs
palimpsest-storage   Persistence    Content-addressable blobs (memory, fs, S3/GCS/Azure)   Dedup is structural
palimpsest-index     Persistence    Temporal graph: URL x time x hash x context            Queryable history
palimpsest-replay    Persistence    HTTP reconstruction, DOM rehydration                   Bit-identical replay from artifacts
palimpsest-crawl     Orchestration  Main crawl loop and coordination                       Integrates all layers
palimpsest-shadow    Persistence    Comparison engine vs legacy crawlers                   Cross-format validation
palimpsest-extract   Capture        HTML-to-text + RAG chunking with provenance            Deterministic extraction
palimpsest-embed     Capture        Embedding generation, vector search, change detection  BLAKE3-based test embeddings
palimpsest-server    Interface      HTTP frontier server + retrieval API + metrics         Thread-safe state
palimpsest-sim       Orchestration  Deterministic simulation testing framework             Proves Laws 1-6
palimpsest-cli       Interface      Command-line interface (10 subcommands)                Thin wrapper

Dependency Graph

palimpsest-cli
├── palimpsest-core
├── palimpsest-crawl
│   ├── palimpsest-core
│   ├── palimpsest-envelope
│   ├── palimpsest-frontier
│   │   └── palimpsest-core
│   ├── palimpsest-fetch
│   │   └── palimpsest-core
│   ├── palimpsest-artifact
│   │   └── palimpsest-core
│   ├── palimpsest-storage
│   │   └── palimpsest-core
│   └── palimpsest-index
│       └── palimpsest-core
├── palimpsest-frontier
├── palimpsest-index
├── palimpsest-storage
├── palimpsest-replay
├── palimpsest-server
│   ├── palimpsest-frontier
│   ├── palimpsest-index
│   └── palimpsest-storage
├── palimpsest-shadow
├── palimpsest-artifact
├── palimpsest-envelope
├── palimpsest-extract
└── palimpsest-fetch

Key Pattern

Every crate depends on palimpsest-core for shared types (CrawlSeed, ContentHash, CaptureInstant, PalimpsestError). No crate performs IO unless its responsibility requires it. The foundation layer is pure computation.

Data Flow

This chapter traces a single URL through the entire Palimpsest system, from seed to replay.

1. Seed URL Enters the Frontier

#![allow(unused)]
fn main() {
let seed = CrawlSeed::new(42);
let mut frontier = Frontier::new(seed, PolitenessPolicy::default_policy());
frontier.push_seed(Url::parse("https://example.com/").unwrap());
}

The frontier deduplicates by URL string and buckets entries by host.

2. Frontier Dequeues (Deterministic Ordering)

#![allow(unused)]
fn main() {
let entry: FrontierEntry = frontier.pop(now).unwrap();
// entry.url = "https://example.com/"
// entry.depth = 0
// entry.priority = 0
}

The dequeue order is deterministic: hosts are rotated via a seeded Fisher-Yates shuffle, and within each host, entries are ordered by priority then depth.
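
The "priority then depth" ordering within a host falls out naturally from a derived Ord, which compares struct fields in declaration order. A std-only sketch (not the real FrontierEntry definition):

```rust
use std::collections::BTreeSet;

/// Illustrative frontier entry: derived Ord compares priority, then depth,
/// then URL as a deterministic tiebreaker.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Entry {
    priority: u32,
    depth: u32,
    url: String,
}

fn main() {
    let mut queue = BTreeSet::new();
    queue.insert(Entry { priority: 1, depth: 0, url: "https://example.com/b".into() });
    queue.insert(Entry { priority: 0, depth: 2, url: "https://example.com/c".into() });
    queue.insert(Entry { priority: 0, depth: 1, url: "https://example.com/a".into() });
    // BTreeSet iterates in ascending order: lowest priority value first,
    // then shallowest depth, then URL.
    let first = queue.iter().next().unwrap();
    assert_eq!(first.url, "https://example.com/a");
}
```

Because BTreeSet stores entries in sorted order, the dequeue sequence depends only on the entries themselves, never on insertion timing.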

3. ExecutionEnvelope Seals the Context

#![allow(unused)]
fn main() {
let envelope = EnvelopeBuilder::new()
    .seed(seed)
    .timestamp(CaptureInstant::new(wall_time, logical_clock))
    .target_url(entry.url.clone())
    .dns_snapshot(DnsSnapshot { host: "example.com".into(), addrs: vec!["93.184.216.34".into()], ttl: 300 })
    .build()?;
}

The envelope is immutable after construction. It captures everything needed to reproduce this fetch.

4. Fetch Executes

#![allow(unused)]
fn main() {
let fetcher = HttpFetcher::with_defaults()?;
let result: FetchResult = fetcher.fetch(&envelope).await?;
}

For browser mode, BrowserFetcher launches headless Chrome with determinism overrides and captures DOM + sub-resources via CDP.

5. Link Extraction

#![allow(unused)]
fn main() {
let links: Vec<Url> = extract_links(&html_body, &entry.url);
for link in links {
    frontier.push_discovered(link, entry.depth + 1, content_hash);
}
}

Links are extracted from HTML (after stripping <script> and <style> tags), normalized (fragments stripped, query params sorted, default ports removed), deduplicated, and sorted for determinism.
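
The normalization steps can be sketched string-wise with std alone — the real crate works on parsed URLs, so this is only an illustration of why both forms of a link collapse to one frontier entry:

```rust
use std::collections::BTreeSet;

/// Illustrative URL normalization (string-based sketch, not the real code):
/// strip the fragment, drop explicit default ports, sort query parameters.
fn normalize(url: &str) -> String {
    // Drop the fragment.
    let url = url.split('#').next().unwrap_or(url);
    // Remove explicit default ports.
    let url = url.replace(":443/", "/").replace(":80/", "/");
    // Sort query parameters for a canonical form.
    match url.split_once('?') {
        Some((base, query)) => {
            let mut params: Vec<&str> = query.split('&').collect();
            params.sort_unstable();
            format!("{base}?{}", params.join("&"))
        }
        None => url.to_string(),
    }
}

fn main() {
    let raw = [
        "https://example.com:443/page?b=2&a=1#frag",
        "https://example.com/page?a=1&b=2",
    ];
    // BTreeSet gives deduplication and sorted order in one step.
    let normalized: BTreeSet<String> = raw.iter().map(|u| normalize(u)).collect();
    assert_eq!(normalized.len(), 1); // both forms collapse to one URL
}
```

Collecting into a BTreeSet mirrors the "deduplicated and sorted for determinism" step: the discovered links come out in the same order on every run.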

6. Artifact Creation

#![allow(unused)]
fn main() {
let record = WarcRecord::new(
    RecordType::Response,
    "application/http;msgtype=response".into(),
    response_bytes,
);
assert!(record.verify_integrity()); // BLAKE3 hash matches payload
}

The CaptureGroup bundles the envelope record, request record, response record, and optional DOM/resource-graph/timing records.

7. Content-Addressed Storage

#![allow(unused)]
fn main() {
let hash: ContentHash = store.put(response_bytes).await?;
// hash = blake3(response_bytes)
// Stored at: blobs/af/1349b9f5f9a1a6a0404dea36dcc949...
}

If a blob with the same hash already exists, the write is a no-op (structural deduplication).

8. Temporal Index Insert

#![allow(unused)]
fn main() {
index.insert(IndexEntry::new(
    entry.url.clone(),
    envelope.timestamp(),
    hash,
    CrawlContextId(1),
))?;
}

The index records this capture in four dimensions: URL, time, content hash, and crawl context.

9. WARC++ Output

#![allow(unused)]
fn main() {
write_warc_file(&path, &capture_group.all_records()).await?;
}

The WARC++ file contains standard ISO 28500 records plus Palimpsest extensions (envelope, dom-snapshot, resource-graph, timing). Standard WARC readers can parse the basic records; Palimpsest readers get the full execution context.

10. Replay

#![allow(unused)]
fn main() {
let content = store.get(&hash).await?;
let entries = index.query(&IndexQuery::for_url(&url))?;
}

Replay retrieves the stored blob and execution envelope, then reconstructs the original HTTP exchange, DOM state, and resource dependency graph. Same envelope + same artifacts = bit-identical output.

Determinism

Determinism is Law 1 — the foundation on which every other property depends. This chapter explains the technical mechanisms that enforce it.

CrawlSeed

All randomness in Palimpsest flows from a single 64-bit seed:

#![allow(unused)]
fn main() {
pub struct CrawlSeed {
    pub value: u64,
}

impl CrawlSeed {
    pub fn new(value: u64) -> Self { Self { value } }

    pub fn rng(&self) -> ChaCha8Rng {
        ChaCha8Rng::seed_from_u64(self.value)
    }

    pub fn derive(&self, index: u64) -> Self {
        // BLAKE3 mixing: hash(seed_bytes || index_bytes)
        let mut hasher = blake3::Hasher::new();
        hasher.update(&self.value.to_le_bytes());
        hasher.update(&index.to_le_bytes());
        let hash = hasher.finalize();
        let bytes: [u8; 8] = hash.as_bytes()[..8].try_into().unwrap();
        Self { value: u64::from_le_bytes(bytes) }
    }
}
}

ChaCha8Rng is a cryptographically secure PRNG that produces identical sequences for identical seeds on all platforms.

No rand Crate

The rand crate is forbidden in all core crates. Palimpsest uses rand_chacha and rand_core directly. The workspace Cargo.toml specifies rand with default-features = false — no OS entropy source is available.

Ordered Collections

HashMap iteration order is unspecified, and std's HashMap seeds its hasher randomly per process, so iteration order varies between runs. Palimpsest uses BTreeMap everywhere that iteration order is observable:

#![allow(unused)]
fn main() {
// The frontier's host queues
struct Frontier {
    host_queues: BTreeMap<String, BTreeSet<FrontierEntry>>,
    seen: BTreeSet<String>,
    // ...
}
}

This ensures the same URLs produce the same host ordering on every run.
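
The property is easy to demonstrate with std alone: insert the same hosts in two different orders and the iteration order comes out identical.

```rust
use std::collections::BTreeMap;

/// Insert hosts in the given order, return them in iteration order.
/// BTreeMap iterates in key order regardless of insertion order.
fn sorted_hosts(insertion_order: &[&str]) -> Vec<String> {
    let mut queues: BTreeMap<String, ()> = BTreeMap::new();
    for host in insertion_order {
        queues.insert(host.to_string(), ());
    }
    queues.into_keys().collect()
}

fn main() {
    let a = sorted_hosts(&["zeta.example", "alpha.example", "mid.example"]);
    let b = sorted_hosts(&["mid.example", "zeta.example", "alpha.example"]);
    assert_eq!(a, b); // identical order despite different insertion order
    assert_eq!(a, ["alpha.example", "mid.example", "zeta.example"]);
}
```

A HashMap version of the same experiment would give a different iteration order from process to process, which is exactly the entropy Law 1 forbids.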

Seeded Host Rotation

When the frontier rotates between hosts, it uses a seeded Fisher-Yates shuffle:

#![allow(unused)]
fn main() {
let mut hosts: Vec<&String> = self.host_queues.keys().collect();
let mut rng = self.seed.rng();
// Fisher-Yates shuffle with seeded RNG
for i in (1..hosts.len()).rev() {
    let j = rng.gen_range(0..=i);
    hosts.swap(i, j);
}
}

Same seed, same hosts = same rotation order.

Time is Explicit

No Instant::now() or SystemTime::now() appears in core logic. All time comes from one of two sources:

  1. Caller-provided — frontier.pop(now) takes a DateTime<Utc> parameter
  2. ExecutionEnvelope — envelope.timestamp() returns the sealed CaptureInstant

This means tests can inject fixed timestamps, and replays use the original timestamps exactly.

Browser Determinism

When using headless Chrome, Palimpsest injects JavaScript overrides before any page scripts execute:

// Seeded from CrawlSeed
Date.now = function() { return 1700000000000 + (__date_offset += 1); };
Math.random = function() { /* seeded xorshift */ };
performance.now = function() { return (__perf_offset += 0.1); };

This prevents JavaScript on the page from introducing non-determinism.

Verification

The determinism test pattern runs the same operation twice and asserts byte-identical output:

#![allow(unused)]
fn main() {
#[test]
fn frontier_ordering_is_deterministic() {
    let seed = CrawlSeed::new(42);
    let run_a = run_frontier(seed, &urls);
    let run_b = run_frontier(seed, &urls);
    assert_eq!(run_a, run_b);
}
}

The simulation framework (palimpsest-sim) proves this at scale: 10,000 pages across 5 adversarial universes, two full runs, zero divergence.

Content Addressability

Law 3 requires that all artifacts are BLAKE3 hash-addressed. This chapter explains the mechanics.

ContentHash

#![allow(unused)]
fn main() {
#[derive(Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct ContentHash([u8; 32]);

impl ContentHash {
    pub fn of(data: &[u8]) -> Self {
        Self(*blake3::hash(data).as_bytes())
    }

    pub fn as_bytes(&self) -> &[u8; 32] { &self.0 }
    pub fn as_hex(&self) -> String { hex::encode(self.0) }
    pub fn from_bytes(bytes: [u8; 32]) -> Self { Self(bytes) }
}
}

ContentHash is a Copy type — 32 bytes, passed by value. It implements Ord for use in BTreeMap keys.

Why BLAKE3

Property      BLAKE3                   SHA-256
Speed         ~1 GB/s (single core)    ~250 MB/s
Security      256-bit, cryptographic   256-bit, cryptographic
Parallelism   Tree-based, SIMD native  Sequential
Determinism   Platform-independent     Platform-independent
BLAKE3 is faster than SHA-256 with equivalent security properties. For a system that hashes every blob, every record, and every envelope, throughput matters.

Storage Layout

FileSystemBlobStore uses a git-style two-level directory structure:

blobs/
  af/
    1349b9f5f9a1a6a0404dea36dcc949...    # Remainder of the hash as filename
  c7/
    d2fe1a6b...

The first two hex characters of the hash form the directory name. This prevents any single directory from accumulating too many entries.
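
Deriving the blob path from a hex hash is a one-liner; the function name here is illustrative, not FileSystemBlobStore's actual internals:

```rust
use std::path::PathBuf;

/// Git-style blob path: {hash[0..2]}/{hash[2..]} under the store root.
/// Illustrative helper — not the crate's real API.
fn blob_path(root: &str, hash_hex: &str) -> PathBuf {
    let (dir, rest) = hash_hex.split_at(2);
    PathBuf::from(root).join(dir).join(rest)
}

fn main() {
    let p = blob_path("blobs", "af1349b9f5f9a1a6a0404dea36dcc949");
    println!("{}", p.display());
    assert!(p.starts_with("blobs/af"));
}
```

With 256 top-level directories, even a store of millions of blobs keeps each directory to a manageable size.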

Structural Deduplication

When BlobStore::put() is called, it computes the BLAKE3 hash and checks if that blob already exists:

#![allow(unused)]
fn main() {
async fn put(&self, data: Bytes) -> Result<ContentHash, StorageError> {
    let hash = ContentHash::of(&data);
    if self.exists(&hash).await? {
        return Ok(hash); // Already stored — no-op
    }
    // Atomic write: temp file + rename
    self.write_blob(&hash, &data).await?;
    Ok(hash)
}
}

Identical content maps to the same hash and is stored exactly once.

Integrity Verification

Every read verifies the hash of the retrieved data:

#![allow(unused)]
fn main() {
async fn get(&self, hash: &ContentHash) -> Result<Bytes, StorageError> {
    let data = self.read_blob(hash).await?;
    let actual = ContentHash::of(&data);
    if actual != *hash {
        return Err(StorageError::IntegrityError {
            expected: *hash,
            actual,
        });
    }
    Ok(data)
}
}

Tampering is detectable by any reader at any time.

WARC Record Hashing

Every WarcRecord carries its content hash in the Palimpsest-Content-Hash header:

WARC/1.1
WARC-Type: response
Palimpsest-Content-Hash: blake3:af1349b9f5f9a1a6a0404dea36dcc949...
Content-Length: 4096

[payload bytes]

RecordId is also derived from the content hash, not from random UUIDs:

#![allow(unused)]
fn main() {
pub fn from_content(content_hash: &ContentHash, record_type: &RecordType) -> Self {
    // Deterministic UUID v5 from hash + type
}
}

Execution Envelope

The ExecutionEnvelope is Palimpsest’s critical abstraction. It seals every input that affects a fetch — seed, timestamp, target URL, DNS state, TLS fingerprint, browser config, and custom headers — into an immutable record constructed before the fetch begins.

Without the envelope, you cannot replay a fetch, verify its output, or prove it was executed correctly.

Construction

Envelopes are built via the fluent EnvelopeBuilder:

let envelope = EnvelopeBuilder::new()
    .seed(CrawlSeed::new(42))
    .timestamp(CaptureInstant::new(wall_time, logical_clock))
    .target_url(Url::parse("https://example.com/").unwrap())
    .dns_snapshot(DnsSnapshot {
        host: "example.com".into(),
        addrs: vec!["93.184.216.34".into()],
        ttl: 300,
    })
    .tls_fingerprint(TlsFingerprint {
        protocol: "TLSv1.3".into(),
        cipher: "TLS_AES_256_GCM_SHA384".into(),
        cert_chain_hash: "blake3:...".into(),
    })
    .header("User-Agent".into(), "PalimpsestBot/0.1".into())
    .build()?;

Required Fields

| Field | Type | Purpose |
|---|---|---|
| seed | CrawlSeed | Deterministic randomness source |
| timestamp | CaptureInstant | Wall clock + logical clock |
| target_url | Url | The URL being fetched |
| dns_snapshot | DnsSnapshot | Recorded DNS resolution state |

Calling .build() without any required field returns an EnvelopeError.

Optional Fields

| Field | Type | Purpose |
|---|---|---|
| tls_fingerprint | TlsFingerprint | TLS protocol, cipher, cert chain hash |
| browser_config | BrowserConfig | Viewport, user agent, JS enabled |
| request_headers | Vec<(String, String)> | Custom HTTP headers |

Immutability

Once build() succeeds, the ExecutionEnvelope is frozen. There are no setter methods — only getters:

envelope.seed()             // CrawlSeed
envelope.timestamp()        // CaptureInstant
envelope.target_url()       // &Url
envelope.request_headers()  // &[(String, String)]
envelope.dns_snapshot()     // &DnsSnapshot
envelope.tls_fingerprint()  // Option<&TlsFingerprint>
envelope.browser_config()   // Option<&BrowserConfig>
envelope.content_hash()     // ContentHash (computed from canonical JSON)

Content Hash

The envelope’s content_hash() is computed from its canonical JSON serialization. This means two envelopes with identical fields produce the same hash, and any field change produces a different hash.
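The principle can be sketched with a stand-in serializer: because field order is fixed (here by a BTreeMap's key ordering), identical fields always yield identical bytes, and therefore an identical hash input. canonical_json below is illustrative only; the crate's actual canonical JSON rules are not shown here.

```rust
use std::collections::BTreeMap;

/// Minimal sketch of canonical serialization: the BTreeMap fixes
/// field order, so the same fields always serialize to the same
/// byte string regardless of insertion order.
fn canonical_json(fields: &BTreeMap<&str, &str>) -> String {
    let body: Vec<String> = fields
        .iter()
        .map(|(k, v)| format!("\"{k}\":\"{v}\""))
        .collect();
    format!("{{{}}}", body.join(","))
}
```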

WARC++ Envelope Record

The envelope is serialized as the first record in every WARC++ capture group:

WARC/1.1
WARC-Type: envelope
Palimpsest-Envelope-Version: 1
Content-Type: application/json

{
  "seed": 42,
  "timestamp": {"wall": "2026-04-12T10:30:00Z", "logical": 1234},
  "target_url": "https://example.com/",
  "dns_snapshot": {"host": "example.com", "addrs": ["93.184.216.34"], "ttl": 300},
  "tls_fingerprint": {"protocol": "TLSv1.3", "cipher": "...", "cert_chain_hash": "..."},
  "browser_config": null
}

Temporal Index

The temporal index is a graph, not a flat lookup table. It records every capture across four dimensions: URL, time, content hash, and crawl context. This enables queries like “show me every version of this page” or “what changed between these two crawls.”

CaptureInstant

Every capture is timestamped with a paired clock:

pub struct CaptureInstant {
    pub wall: DateTime<Utc>,    // Real-world time
    pub logical: u64,           // Monotonic counter within a crawl
}

Wall time records when the capture happened. Logical time records the ordering within a single crawl session, immune to clock drift.
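A small illustration of why the logical counter matters (order_by_logical is a hypothetical helper; timestamps are plain integers for brevity):

```rust
/// Sketch: ordering captures by the logical counter gives a stable
/// in-crawl order even if the wall clock steps backwards mid-crawl
/// (e.g. an NTP adjustment). Pairs are (wall_seconds, logical).
fn order_by_logical(mut captures: Vec<(i64, u64)>) -> Vec<u64> {
    captures.sort_by_key(|&(_, logical)| logical);
    captures.into_iter().map(|(_, logical)| logical).collect()
}
```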

IndexEntry

pub struct IndexEntry {
    pub url: Url,
    pub captured_at: CaptureInstant,
    pub content_hash: ContentHash,
    pub crawl_context: CrawlContextId,
}

Each entry represents one capture of one URL at one point in time, with a pointer (content hash) to the stored artifact.

Backends

InMemoryIndex

Uses BTreeMap for deterministic ordering. Suitable for testing and short-lived crawls.

let mut index = InMemoryIndex::new();
index.insert(entry);
let results = index.query(&IndexQuery::for_url(&url));

SqliteIndex

Persistent SQL-backed index with WAL mode for concurrent reads:

let mut index = SqliteIndex::open(Path::new("./output/index.sqlite"))?;
index.insert(entry)?;
let results = index.query(&IndexQuery::for_url(&url))?;

The schema includes a UNIQUE constraint on (url, wall_time, content_hash) to prevent duplicate entries.

Queries

IndexQuery supports multi-dimensional filtering:

  • By URL — all captures of a specific URL
  • By time range — all captures within a window
  • By content hash — find which URLs produced a specific blob
  • By crawl context — all captures from a specific crawl session

Results are ordered by captured_at (ascending), then URL string.

Use Cases

History — “Show me every version of https://example.com/ across all crawls”:

let history = index.query(&IndexQuery::for_url(&url))?;
for entry in &history {
    println!("{} -> {}", entry.captured_at.wall, entry.content_hash.as_hex());
}

Change Detection — Compare content hashes across captures to identify when a page changed.
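A sketch of hash-based change detection over a capture history (change_points is a hypothetical helper; hashes are plain strings here, where the real index compares ContentHash values):

```rust
/// Scan a URL's capture history (in capture order, as
/// (timestamp, content_hash) pairs) and return the timestamps at
/// which the stored hash differs from the previous capture.
fn change_points(history: &[(String, String)]) -> Vec<String> {
    history
        .windows(2)
        .filter(|pair| pair[0].1 != pair[1].1)
        .map(|pair| pair[1].0.clone())
        .collect()
}
```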

Provenance — Every RAG chunk and embedding links back to an IndexEntry via source_url, captured_at, and source_hash.

WARC++ Format

WARC++ extends ISO 28500 (the standard WARC format) with structured metadata for execution context, DOM snapshots, resource graphs, and timing breakdowns. Standard WARC readers can parse the basic records. Palimpsest-aware readers get the full execution context.

Standard Record Types

These follow ISO 28500 exactly:

| Type | Purpose |
|---|---|
| warcinfo | Crawl-level metadata |
| request | HTTP request |
| response | HTTP response |
| resource | Standalone resource |
| metadata | Additional metadata |

Extension Record Types

| Type | Purpose |
|---|---|
| envelope | Full ExecutionEnvelope (seed, timestamp, DNS, TLS, browser config) |
| dom-snapshot | Rendered DOM state after JavaScript execution |
| resource-graph | Dependency graph of all resources loaded for a page |
| timing | Detailed timing breakdown (DNS, connect, TLS, TTFB, transfer, render) |

Content Hash Header

Every record includes a Palimpsest-Content-Hash header:

Palimpsest-Content-Hash: blake3:af1349b9f5f9a1a6a0404dea36dcc949...

This enables content-addressable retrieval and integrity verification without reading the full record.

Envelope Record Example

WARC/1.1
WARC-Type: envelope
WARC-Record-ID: <urn:uuid:a1b2c3d4-...>
Content-Type: application/json
Palimpsest-Envelope-Version: 1
Palimpsest-Content-Hash: blake3:c7d2fe...

{
  "seed": 42,
  "timestamp": {"wall": "2026-04-12T10:30:00.123456789Z", "logical": 1234},
  "request_headers": [["User-Agent", "PalimpsestBot/0.1"]],
  "dns_snapshot": {"host": "example.com", "addrs": ["93.184.216.34"], "ttl": 300},
  "tls_fingerprint": {"protocol": "TLSv1.3", "cipher": "TLS_AES_256_GCM_SHA384", "cert_chain_hash": "blake3:..."},
  "browser_config": null
}

Resource Graph Record Example

{
  "root": "https://example.com/",
  "resources": [
    {"url": "https://example.com/style.css", "type": "stylesheet", "hash": "blake3:...", "initiated_by": "https://example.com/"},
    {"url": "https://example.com/app.js", "type": "script", "hash": "blake3:...", "initiated_by": "https://example.com/"}
  ],
  "load_order": [0, 1]
}

Serialization Rules

| Rule | Value |
|---|---|
| Text encoding | UTF-8 |
| JSON format | Compact (no pretty-print) |
| Timestamps | RFC 3339 with nanosecond precision |
| Record separator | CRLFCRLF (per ISO 28500) |
| Max payload | 4 GiB (WARC spec limit) |

Backward Compatibility

Standard WARC tools (warc-tools, warcio, pywb) can read the request, response, warcinfo, resource, and metadata records without modification. They skip extension records (envelope, dom-snapshot, resource-graph, timing) per the WARC spec’s extension handling rules. The Palimpsest-* headers are ignored by non-Palimpsest readers.

palimpsest-core

Shared types, BLAKE3 hashing, seeded PRNG, and error taxonomy. This crate performs no IO — it is pure types and computation.

CrawlSeed

pub struct CrawlSeed { pub value: u64 }

impl CrawlSeed {
    pub fn new(value: u64) -> Self;
    pub fn rng(&self) -> ChaCha8Rng;       // Deterministic PRNG
    pub fn derive(&self, index: u64) -> Self; // Child seed via BLAKE3 mixing
}
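The derive scheme can be sketched as hash(parent, index) -> child: each child stream is independent yet reproducible. The stand-in below uses std's SipHash purely for illustration; the real derive() mixes via BLAKE3.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Sketch of hierarchical seed derivation: hash the (parent seed,
/// child index) pair into a new 64-bit seed. Deterministic because
/// DefaultHasher::new() uses fixed keys.
fn derive_seed(parent: u64, index: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    parent.hash(&mut hasher);
    index.hash(&mut hasher);
    hasher.finish()
}
```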

ContentHash

pub struct ContentHash([u8; 32]); // Copy, Eq, Ord, Hash

impl ContentHash {
    pub fn of(data: &[u8]) -> Self;           // BLAKE3 hash
    pub fn as_bytes(&self) -> &[u8; 32];
    pub fn as_hex(&self) -> String;
    pub fn from_bytes(bytes: [u8; 32]) -> Self;
}

CaptureInstant

pub struct CaptureInstant {
    pub wall: DateTime<Utc>,  // Wall clock
    pub logical: u64,         // Monotonic counter
}

impl CaptureInstant {
    pub fn new(wall: DateTime<Utc>, logical: u64) -> Self;
}

Implements Copy, Ord, Serialize, Deserialize.

CrawlContextId

pub struct CrawlContextId(pub u64);

Opaque identifier for a crawl session. Implements Copy.

CrawlTarget

pub struct CrawlTarget {
    pub url: Url,
    pub depth: u32,
    pub parent: Option<ContentHash>,
}

PalimpsestError

#[non_exhaustive]
pub enum PalimpsestError {
    Network(String),
    Protocol(String),
    Rendering(String),
    Policy(String),
    DeterminismViolation { context: String, expected: String, actual: String },
    Storage(String),
    Replay(String),
}

Every failure in the system is classified into exactly one of these seven variants. See Error Taxonomy for details.

Key Invariant

This crate contains no IO, no async, no network calls. It is the foundation that every other crate depends on.

palimpsest-envelope

Sealed execution context — immutable after construction. The envelope captures every input that affects a fetch, enabling deterministic replay and verification.

ExecutionEnvelope

impl ExecutionEnvelope {
    pub fn seed(&self) -> CrawlSeed;
    pub fn timestamp(&self) -> CaptureInstant;
    pub fn target_url(&self) -> &Url;
    pub fn request_headers(&self) -> &[(String, String)];
    pub fn dns_snapshot(&self) -> &DnsSnapshot;
    pub fn tls_fingerprint(&self) -> Option<&TlsFingerprint>;
    pub fn browser_config(&self) -> Option<&BrowserConfig>;
    pub fn content_hash(&self) -> ContentHash;
}

No setter methods. Immutable after build().

EnvelopeBuilder

impl EnvelopeBuilder {
    pub fn new() -> Self;
    pub fn seed(self, seed: CrawlSeed) -> Self;
    pub fn timestamp(self, ts: CaptureInstant) -> Self;
    pub fn target_url(self, url: Url) -> Self;
    pub fn header(self, name: String, value: String) -> Self;
    pub fn headers(self, headers: Vec<(String, String)>) -> Self;
    pub fn dns_snapshot(self, dns: DnsSnapshot) -> Self;
    pub fn tls_fingerprint(self, tls: TlsFingerprint) -> Self;
    pub fn browser_config(self, config: BrowserConfig) -> Self;
    pub fn build(self) -> Result<ExecutionEnvelope, EnvelopeError>;
}

EnvelopeError

pub enum EnvelopeError {
    MissingSeed,
    MissingTimestamp,
    MissingTargetUrl,
    MissingDnsSnapshot,
}

Supporting Types

pub struct DnsSnapshot { pub host: String, pub addrs: Vec<String>, pub ttl: u32 }
pub struct TlsFingerprint { pub protocol: String, pub cipher: String, pub cert_chain_hash: String }
pub struct BrowserConfig { pub viewport_width: u32, pub viewport_height: u32, pub user_agent: String, pub js_enabled: bool }

Related Crates

  • palimpsest-core — provides CrawlSeed, CaptureInstant, ContentHash
  • palimpsest-fetch — consumes envelopes for fetch execution
  • palimpsest-artifact — serializes envelopes as WARC++ records

palimpsest-frontier

Deterministic seed-driven URL scheduler with politeness enforcement. Same seed + same URLs = identical dequeue order.

Frontier

impl Frontier {
    pub fn new(seed: CrawlSeed, policy: PolitenessPolicy) -> Self;
    pub fn push(&mut self, entry: FrontierEntry) -> bool;
    pub fn push_seed(&mut self, url: Url);
    pub fn push_discovered(&mut self, url: Url, depth: u32, parent: ContentHash) -> bool;
    pub fn pop(&mut self, now: DateTime<Utc>) -> Option<FrontierEntry>;
    pub fn len(&self) -> usize;
    pub fn is_empty(&self) -> bool;
    pub fn host_count(&self) -> usize;
    pub fn seen_count(&self) -> usize;
    pub fn save(&self, path: &Path) -> Result<(), FrontierPersistError>;
    pub fn load(&mut self, path: &Path) -> Result<usize, FrontierPersistError>;
    pub fn load_if_exists(seed: CrawlSeed, policy: PolitenessPolicy, path: &Path) -> Result<Self, FrontierPersistError>;
    pub fn seed(&self) -> CrawlSeed;
}

Internally uses BTreeMap<String, BTreeSet<FrontierEntry>> for host queues. URL deduplication via BTreeSet<String>.

FrontierEntry

pub struct FrontierEntry {
    pub url: Url,
    pub depth: u32,
    pub priority: u32,  // Lower = dequeued first
    pub parent: Option<ContentHash>,
}

Implements Ord: sorted by (priority, depth, url string).
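This contract can be demonstrated with plain std types: a struct whose derived Ord compares (priority, depth, url) in declaration order, collected into a BTreeSet so dequeue order is deterministic regardless of insertion order. The simplified Entry below omits the parent hash.

```rust
use std::collections::BTreeSet;

/// Simplified frontier entry: derived Ord compares fields in
/// declaration order, i.e. (priority, depth, url).
#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct Entry {
    priority: u32,
    depth: u32,
    url: String,
}

/// Collect entries into a BTreeSet and read them back in sorted
/// order, mimicking the frontier's deterministic dequeue sequence.
fn dequeue_order(entries: Vec<Entry>) -> Vec<String> {
    let set: BTreeSet<Entry> = entries.into_iter().collect();
    set.into_iter().map(|e| e.url).collect()
}
```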

PolitenessPolicy

pub struct PolitenessPolicy {
    pub min_host_delay: Duration,
    pub max_concurrent_hosts: usize,
}

impl PolitenessPolicy {
    pub fn default_policy() -> Self;  // 1s delay, 100 hosts
    pub fn aggressive() -> Self;      // 100ms delay, 500 hosts
    pub fn no_delay() -> Self;        // Zero delay, unlimited (testing only)
}

Persistence

save() serializes the frontier state to JSON. load() restores it. This enables crawl resumption — stop a crawl, restart later, continue from exactly where you left off.

Key Invariant

Same seed + same URLs pushed in same order = identical pop() sequence. Verified at 10,000 pages with zero divergence.

palimpsest-fetch

HTTP client + browser capture (CDP) + link extraction + robots.txt parsing + TLS/HTTP2 fingerprint impersonation + CDP stealth mode. Every fetch wraps an ExecutionEnvelope.

HttpFetcher

impl HttpFetcher {
    pub fn new(config: FetchConfig) -> Result<Self, PalimpsestError>;
    pub fn with_defaults() -> Result<Self, PalimpsestError>;
    pub async fn fetch(&self, envelope: &ExecutionEnvelope) -> Result<FetchResult, PalimpsestError>;
}

Uses wreq (BoringSSL backend) instead of reqwest. When an emulation profile is set, the TLS ClientHello and HTTP/2 SETTINGS frame match the selected browser.

FetchConfig

pub struct FetchConfig {
    pub connect_timeout: Duration,                   // Default: 30s
    pub total_timeout: Duration,                     // Default: 120s
    pub max_body_size: u64,                          // Default: 256 MiB
    pub max_redirects: usize,                        // Default: 10
    pub emulation: Option<wreq_util::Emulation>,     // Default: None
}

When emulation is set (e.g., Emulation::Chrome133), wreq impersonates the selected browser’s TLS fingerprint (JA3/JA4 including post-quantum key shares) and HTTP/2 settings (SETTINGS frame values/order, WINDOW_UPDATE, pseudo-header ordering). 70+ browser profiles available: Chrome 100-137, Firefox 109-139, Safari 15-18.5, Edge, Opera.

BrowserFetcher

impl BrowserFetcher {
    pub fn new(config: BrowserFetchConfig) -> Self;
    pub async fn fetch(&self, url: &Url, envelope: &ExecutionEnvelope, seed: CrawlSeed)
        -> Result<BrowserFetchResult, PalimpsestError>;
}

Launches headless Chrome via CDP. Injects determinism overrides and (optionally) 17 stealth evasion patches. Captures DOM snapshot, sub-resources via Network events, and resource dependency graph.

BrowserFetchConfig

pub struct BrowserFetchConfig {
    pub page_timeout: Duration,          // Default: 30s
    pub viewport_width: u32,             // Default: 1920
    pub viewport_height: u32,            // Default: 1080
    pub js_enabled: bool,                // Default: true
    pub user_agent: String,              // Default: "PalimpsestBot/0.1"
    pub stealth: bool,                   // Default: false
    pub webdriver_value: WebdriverValue, // Default: False
}

WebdriverValue

pub enum WebdriverValue {
    False,     // Matches real non-automated Chrome (default)
    Undefined, // Property appears deleted
}

Explicit, auditable config choice for navigator.webdriver. Default False passes Rebrowser Bot Detector (10/10).

CDP Stealth Mode

When stealth: true, the browser fetcher applies:

Chrome launch hardening:

  • --disable-blink-features=AutomationControlled
  • --disable-component-extensions-with-background-pages

17 stealth evasion patches (injected via addScriptToEvaluateOnNewDocument):

| # | Patch | What It Does |
|---|---|---|
| 1 | navigator.webdriver | Set to false (configurable via WebdriverValue) |
| 2 | window.chrome | Full object mock (app, csi, loadTimes, runtime) |
| 3 | navigator.plugins | Chrome PDF Plugin, Chrome PDF Viewer, Native Client |
| 4 | navigator.mimeTypes | application/pdf, application/x-nacl |
| 5 | navigator.permissions | Fix Notification state inconsistency |
| 6 | navigator.languages | ["en-US", "en"] |
| 7 | navigator.hardwareConcurrency | 8 |
| 8 | navigator.deviceMemory | 8 |
| 9 | WebGL vendor/renderer | Intel UHD Graphics 630 |
| 10 | Canvas fingerprint | Seeded sub-pixel noise (CrawlSeed) |
| 11 | Window dimensions | outerWidth/outerHeight match viewport + chrome UI |
| 12 | Screen dimensions | width/height/availWidth/availHeight/colorDepth |
| 13 | AudioContext | Seeded oscillator noise (CrawlSeed) |
| 14 | ClientRect | Seeded sub-pixel noise (CrawlSeed) |
| 15 | sourceURL markers | Strip pptr/playwright stack traces |
| 16 | navigator.userAgent | Consistent with HTTP header |
| 17 | navigator.maxTouchPoints | 0 |

All noise patches use deterministic xorshift PRNGs seeded from CrawlSeed (Law 1).
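A minimal xorshift64 generator of the kind described: seeded, fully deterministic, no shared state. The shift constants below are the standard xorshift64 triple; the exact generator used by the patches is an implementation detail.

```rust
/// Minimal xorshift64 PRNG, deterministic from its u64 seed.
struct XorShift64 {
    state: u64,
}

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // A zero state would lock the generator at zero forever.
        Self { state: seed.max(1) }
    }

    fn next_u64(&mut self) -> u64 {
        // Standard xorshift64 shift triple (13, 7, 17).
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }
}
```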

Browser Emulation Profiles

pub struct BrowserProfile { /* unified TLS + HTTP/2 + headers + JS identity */ }

pub enum ProfileMode {
    None,                        // No impersonation (default)
    Fixed(BrowserProfile),       // Same profile for all requests
    Seeded,                      // Generate from CrawlSeed
    RotatePerDomain,             // Per-domain via BLAKE3(seed + domain)
}

Pre-built profiles: BrowserProfile::chrome_windows(), firefox_linux(), safari_macos().

See profile module for details.

BrowserFetchResult

pub struct BrowserFetchResult {
    pub fetch_result: FetchResult,
    pub dom_snapshot: Option<DomSnapshot>,
    pub resource_graph: Option<ResourceGraph>,
    pub sub_resources: Vec<WarcRecord>,
}

Link Extraction

pub fn extract_links(html: &str, base_url: &Url) -> Vec<Url>;
pub fn normalize_url(url: &Url) -> Option<Url>;
pub fn normalize_url_for_comparison(url: &Url) -> String;

extract_links strips <script> and <style> content before scanning for href and src attributes. Output is deduplicated and sorted for determinism.
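The determinism step at the end of extraction amounts to sort-then-dedup (finalize_links is a hypothetical stand-in for that final step, not the crate's API):

```rust
/// Sort and deduplicate extracted links so repeated runs over the
/// same HTML always produce an identical Vec, regardless of the
/// order links were discovered in.
fn finalize_links(mut links: Vec<String>) -> Vec<String> {
    links.sort();
    links.dedup(); // dedup only removes adjacent duplicates, hence the sort first
    links
}
```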

Robots.txt

pub struct RobotsRules { pub crawl_delay: Option<Duration> }

impl RobotsRules {
    pub fn parse(body: &str) -> Self;  // RFC 9309 compliant
}

Per-origin caching in BTreeMap (deterministic).

Stealth Regression Tests

5 integration tests against live public detection sites:

| Site | Score | Key Checks |
|---|---|---|
| Rebrowser Bot Detector | 10/10 | CDP leak, webdriver, viewport, user-agent |
| Sannysoft | 55/56 | webdriver, chrome, plugins, WebGL, canvas, permissions |
| FingerprintJS BotD | Clean | 18 detectors, no bot verdict |
| CreepJS | Clean | Headless rating, stealth rating, lie detection |
| Infosimples | Skipped | Site timeout |

Run with: cargo test -p palimpsest-fetch --test stealth_test -- --ignored --nocapture --test-threads=1

Key Invariant

Every fetch receives an ExecutionEnvelope. The envelope seals the context before the network request begins, enabling replay and verification. Emulation profile and stealth config are deterministic inputs (Law 1).

palimpsest-artifact

WARC++ serialization: records, capture groups, reader/writer. Content-addressed outputs compatible with ISO 28500.

RecordType

#[non_exhaustive]
pub enum RecordType {
    // Standard (ISO 28500)
    Warcinfo, Request, Response, Resource, Metadata,
    // Palimpsest extensions
    Envelope, DomSnapshot, ResourceGraph, Timing,
}

impl RecordType {
    pub fn is_standard(&self) -> bool;
}

WarcRecord

pub struct WarcRecord {
    pub record_type: RecordType,
    pub record_id: RecordId,
    pub content_hash: ContentHash,
    pub content_type: String,
    pub content_length: u64,
    pub target_uri: Option<String>,
    pub payload: Bytes,
}

impl WarcRecord {
    pub fn new(record_type: RecordType, content_type: String, payload: Bytes) -> Self;
    pub fn verify_integrity(&self) -> bool;
}

RecordId

pub struct RecordId(String);

impl RecordId {
    pub fn from_content(content_hash: &ContentHash, record_type: &RecordType) -> Self;
    pub fn as_str(&self) -> &str;
}

Deterministic — derived from content hash + record type, not random UUID.

CaptureGroup

pub struct CaptureGroup {
    pub group_hash: ContentHash,
    pub url: Url,
    pub captured_at: CaptureInstant,
    pub crawl_context: CrawlContextId,
    pub envelope: WarcRecord,
    pub request: WarcRecord,
    pub response: WarcRecord,
    pub dom_snapshot: Option<DomSnapshot>,
    pub resource_graph: Option<ResourceGraph>,
    pub timing: Option<TimingBreakdown>,
}

Built via CaptureGroupBuilder (fluent builder with required fields validation).

WARC Writer/Reader

pub async fn write_warc_file(path: &Path, records: &[WarcRecord]) -> Result<(), WarcWriteError>;
pub fn parse_warc_records(data: &[u8]) -> Result<Vec<WarcRecord>, WarcWriteError>;

Key Invariant

All record IDs and content hashes are deterministic. Same content = same hash = same record ID.

palimpsest-storage

Content-addressable blob storage with three backends: in-memory, filesystem, and object store (S3/GCS/Azure). Deduplication is structural — same content is stored once.

BlobStore Trait

pub trait BlobStore: Send + Sync {
    async fn put(&self, data: Bytes) -> Result<ContentHash, StorageError>;
    async fn get(&self, hash: &ContentHash) -> Result<Bytes, StorageError>;
    async fn exists(&self, hash: &ContentHash) -> Result<bool, StorageError>;
    async fn delete(&self, hash: &ContentHash) -> Result<(), StorageError>;
    async fn metadata(&self, hash: &ContentHash) -> Result<BlobMetadata, StorageError>;
}

BlobMetadata

pub struct BlobMetadata { pub size: u64, pub stored_at: DateTime<Utc> }

StorageError

pub enum StorageError {
    Backend(String),
    NotFound(ContentHash),
    IntegrityError { expected: ContentHash, actual: ContentHash },
}

InMemoryBlobStore

impl InMemoryBlobStore {
    pub fn new() -> Self;
    pub fn len(&self) -> usize;
    pub fn is_empty(&self) -> bool;
    pub fn total_bytes(&self) -> u64;
}

Uses BTreeMap for deterministic ordering.

FileSystemBlobStore

impl FileSystemBlobStore {
    pub async fn new(root: impl Into<PathBuf>) -> Result<Self, StorageError>;
    pub fn root(&self) -> &Path;
}

Git-style layout: {root}/{hash[0..2]}/{hash[2..]}. Atomic writes via temp file + rename. Integrity verification on every read.

ObjectStoreBlobStore

S3, GCS, and Azure support via the object_store crate. Same BlobStore trait interface.

Key Invariant

Every put computes ContentHash::of(data). Every get verifies the hash of retrieved data. Tampering is always detectable.

palimpsest-index

Temporal graph index: URL x time x hash x context. Two backends — in-memory (BTreeMap) and SQLite (WAL mode).

IndexEntry

pub struct IndexEntry {
    pub url: Url,
    pub captured_at: CaptureInstant,
    pub content_hash: ContentHash,
    pub crawl_context: CrawlContextId,
}

impl IndexEntry {
    pub fn new(url: Url, captured_at: CaptureInstant, content_hash: ContentHash, crawl_context: CrawlContextId) -> Self;
}

Implements Ord: ordered by captured_at, then URL string.

InMemoryIndex

impl InMemoryIndex {
    pub fn new() -> Self;
    pub fn insert(&mut self, entry: IndexEntry);
    pub fn query(&self, query: &IndexQuery) -> Vec<IndexEntry>;
    pub fn history(&self, url: &Url) -> Vec<IndexEntry>;
}

SqliteIndex

impl SqliteIndex {
    pub fn open(path: &Path) -> Result<Self, IndexError>;
    pub fn insert(&mut self, entry: IndexEntry) -> Result<(), IndexError>;
    pub fn query(&self, query: &IndexQuery) -> Result<Vec<IndexEntry>, IndexError>;
    pub fn history(&self, url: &Url) -> Result<Vec<IndexEntry>, IndexError>;
}

Uses WAL mode for concurrent reads. Parameterized queries. UNIQUE constraint on (url, wall_time, content_hash).

IndexQuery

Multi-dimensional filtering: by URL, time range, content hash, or crawl context. Results ordered by captured_at ascending.

Key Invariant

The index is a graph, not a lookup table. It captures the temporal dimension of the web — when content appeared, changed, and disappeared.

palimpsest-replay

Deterministic reconstruction from stored artifacts. Same envelope + same storage = bit-identical output.

Concept

The replay engine retrieves the ExecutionEnvelope and stored blobs for a given URL and timestamp, then reconstructs:

  1. HTTP exchange — request and response headers + body
  2. DOM state — rendered DOM from the dom-snapshot record
  3. Resource graph — sub-resource dependency tree with load ordering

Usage

let entries = index.history(&url);
let latest = entries.last().unwrap();
let blob = store.get(&latest.content_hash).await?;

For full reconstruction including DOM and sub-resources, the replay engine reads the complete CaptureGroup from the WARC++ file and rehydrates each record.

Law 5 Guarantee

Replay fidelity is the proof that the archive works. If the same envelope and the same artifacts produce different output on two runs, Law 5 is violated.

The simulation framework verifies this: verify_determinism crawls twice with the same seed and asserts byte-identical blob hashes, index entries, and page counts.

Related Crates

  • palimpsest-storage — provides blob retrieval
  • palimpsest-index — provides temporal lookups
  • palimpsest-artifact — provides WARC++ record parsing
  • palimpsest-envelope — provides execution context

palimpsest-crawl

The orchestrator — the main crawl loop that integrates all layers: frontier scheduling, envelope construction, HTTP/browser fetching, link extraction, artifact creation, blob storage, temporal indexing, WARC output, and frontier persistence.

CrawlConfig

pub struct CrawlConfig {
    pub seeds: Vec<Url>,
    pub crawl_seed: CrawlSeed,
    pub crawl_context: CrawlContextId,
    pub max_depth: u32,
    pub max_urls: usize,
    pub politeness: PolitenessPolicy,
    pub scope: CrawlScope,
    pub concurrency: usize,
    pub user_agent: String,
    pub browser_mode: bool,
    pub output_dir: Option<PathBuf>,
}

impl CrawlConfig {
    pub fn for_test(seed_url: Url) -> Self;
    pub fn seed_hosts(&self) -> Vec<String>;
    pub fn seed_domains(&self) -> Vec<String>;
}

CrawlScope

pub enum CrawlScope {
    SameDomain,  // Registrable domain match
    SameHost,    // Exact host match
    Any,         // No restriction
}

CrawlStats

pub struct CrawlStats {
    pub urls_fetched: usize,
    pub urls_failed: usize,
    pub urls_discovered: usize,
    pub robots_blocked: usize,
    pub blobs_stored: usize,
    pub bytes_stored: u64,
    pub warc_path: Option<String>,
}

CrawlOrchestrator

impl CrawlOrchestrator {
    pub async fn new(config: CrawlConfig) -> Result<Self, PalimpsestError>;
}

The orchestrator loop:

  1. Pop batch of URLs from frontier (respects politeness)
  2. Build ExecutionEnvelope for each
  3. Fetch concurrently via tokio::spawn
  4. Extract links from HTML responses
  5. Push discovered URLs back to frontier (scope-filtered)
  6. Store blobs, insert index entries, write WARC records
  7. Save frontier state for resumption
  8. Repeat until frontier empty or max_urls reached

Key Invariant

The orchestrator is the integration point. It does not add non-determinism — all ordering comes from the frontier, all time from envelopes, all randomness from the seed.

palimpsest-shadow

Shadow comparison engine for validating Palimpsest output against legacy crawlers (Heritrix, wget, Warcprox, Brozzler).

Purpose

During migration from legacy crawl infrastructure, shadow comparison proves that Palimpsest captures the same content. It reads .warc and .warc.gz files from any crawler, normalizes URLs for cross-format comparison, and reports matches, mismatches, and coverage gaps.

Usage

palimpsest shadow-compare --legacy ./heritrix-warcs --palimpsest ./output [--json]

Comparison Logic

  1. Read all WARC records from the legacy directory (.warc and .warc.gz)
  2. Read all WARC records from the Palimpsest output
  3. Normalize URLs: strip fragments, unify schemes (http/https), sort query params, strip angle brackets
  4. Match records by normalized URL
  5. For matched pairs: compare content size, report byte-level diffs
  6. Report unmatched URLs in each direction (coverage gaps)

URL Normalization

Legacy crawlers store URLs differently:

  • wget uses <http://url> angle bracket syntax per WARC spec
  • wget stores post-redirect URLs (https), Palimpsest may store pre-redirect (http)
  • Fragment handling varies across tools

normalize_url_for_comparison() unifies all representations.
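The string-level parts of this normalization can be sketched as follows (normalize_for_comparison is a hypothetical stand-in; query-parameter sorting needs real URL parsing and is omitted here):

```rust
/// Sketch of URL normalization for cross-crawler comparison:
/// strip WARC angle brackets, drop the fragment, and unify
/// http/https to a single scheme.
fn normalize_for_comparison(raw: &str) -> String {
    let s = raw.trim_start_matches('<').trim_end_matches('>');
    // split('#') always yields at least one piece, so next() is Some.
    let s = s.split('#').next().unwrap_or(s);
    s.replacen("https://", "http://", 1).to_string()
}
```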

Output Format

Plain text by default, JSON with --json flag. Reports:

  • Total URLs in each dataset
  • Matched URLs with size comparison
  • Mismatches with byte-level size diffs
  • URLs present in legacy but missing from Palimpsest
  • URLs present in Palimpsest but missing from legacy

palimpsest-extract

HTML-to-text extraction and RAG chunking with full provenance tracking. Every chunk carries its source URL, capture timestamp, content hash, and character offset.

ExtractedDocument

pub struct ExtractedDocument {
    pub url: String,
    pub title: Option<String>,
    pub description: Option<String>,
    pub text: String,
    pub text_length: usize,
    pub chunks: Vec<ContentChunk>,
    pub text_hash: String,
    pub source_hash: String,
    pub captured_at: String,
}

extract_document

pub fn extract_document(
    raw_response: &[u8],
    source_url: &Url,
    captured_at: CaptureInstant,
    source_hash: ContentHash,
    chunk_config: &ChunkConfig,
) -> ExtractedDocument

Pipeline: raw HTTP response -> strip headers -> HTML to clean text -> chunk with provenance.

ContentChunk

pub struct ContentChunk {
    pub text: String,
    pub source_url: Url,
    pub captured_at: CaptureInstant,
    pub source_hash: ContentHash,
    pub chunk_hash: ContentHash,       // BLAKE3 of chunk text
    pub chunk_index: usize,
    pub total_chunks: usize,
    pub char_offset: usize,            // Position in source text
}

ChunkConfig

pub struct ChunkConfig {
    pub target_size: usize,  // Default: 1000 characters
    pub overlap: usize,      // Default: 200 characters
}

Chunking Strategy

Splitting respects natural boundaries in priority order:

  1. Paragraph boundaries (double newline)
  2. Sentence boundaries (period/question/exclamation + space)
  3. Word boundaries (space)
  4. Character boundary (last resort)

Each chunk overlaps with the next by overlap characters to preserve context at boundaries.
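The overlap arithmetic can be shown in isolation: each chunk starts target_size - overlap characters after the previous one. This sketch computes raw character windows only; the real chunker additionally snaps to the boundaries listed above.

```rust
/// Compute (start, end) character windows over a text of length
/// `len`, where consecutive windows overlap by `overlap` characters.
fn chunk_offsets(len: usize, target: usize, overlap: usize) -> Vec<(usize, usize)> {
    assert!(target > overlap, "overlap must be smaller than the chunk size");
    let step = target - overlap; // how far each chunk's start advances
    let mut out = Vec::new();
    let mut start = 0;
    while start < len {
        let end = (start + target).min(len);
        out.push((start, end));
        if end == len {
            break; // final chunk reached the end of the text
        }
        start += step;
    }
    out
}
```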

Key Invariant

Extraction is deterministic. Same input = same chunks = same hashes. Every chunk’s provenance chain is complete: chunk_hash -> source_hash -> source_url + captured_at.

palimpsest-embed

Embedding generation, SQLite vector search, and LCS-based change detection.

Embedding

#![allow(unused)]
fn main() {
pub struct Embedding { pub values: Vec<f32> }

impl Embedding {
    pub fn dimension(&self) -> usize;
    pub fn cosine_similarity(&self, other: &Embedding) -> f32;
}
}
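The similarity metric is standard cosine similarity. A self-contained sketch over raw `f32` slices (assuming, as is conventional, that a zero-magnitude vector yields 0.0):

```rust
// Cosine similarity: dot(a, b) / (|a| * |b|), with a guard for zero vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    // Identical direction: similarity 1.0; orthogonal: 0.0.
    assert!((cosine_similarity(&[1.0, 0.0], &[1.0, 0.0]) - 1.0).abs() < 1e-6);
    assert!(cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-6);
}
```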

EmbeddingProvider Trait

#![allow(unused)]
fn main() {
pub trait EmbeddingProvider: Send + Sync {
    async fn embed(&self, text: &str) -> Result<Embedding, PalimpsestError>;
    async fn embed_batch(&self, texts: &[&str]) -> Result<Vec<Embedding>, PalimpsestError>;
    fn dimension(&self) -> usize;
    fn name(&self) -> &str;
}
}

HashEmbedder

Deterministic test embedder using BLAKE3:

#![allow(unused)]
fn main() {
impl HashEmbedder {
    pub fn new(dimension: usize) -> Self;
}
}

Generates pseudo-embeddings by hashing the input text with BLAKE3 and mapping hash bytes to f32 values. Deterministic — same text = same embedding. Not semantically meaningful, but sufficient for testing the vector store pipeline.
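The idea can be sketched in a few lines. Note this uses std's `DefaultHasher` as a dependency-free stand-in for BLAKE3, and the per-dimension re-hashing scheme is illustrative, not the crate's actual construction:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Pseudo-embedding: hash (text, dimension index), map the 64-bit digest
// into [-1.0, 1.0]. Deterministic, but not semantically meaningful.
fn hash_embed(text: &str, dimension: usize) -> Vec<f32> {
    (0..dimension)
        .map(|i| {
            let mut h = DefaultHasher::new();
            text.hash(&mut h);
            i.hash(&mut h);
            (h.finish() as f64 / u64::MAX as f64 * 2.0 - 1.0) as f32
        })
        .collect()
}

fn main() {
    // Same text = same embedding; different text = different embedding.
    assert_eq!(hash_embed("hello", 8), hash_embed("hello", 8));
    assert_ne!(hash_embed("hello", 8), hash_embed("world", 8));
}
```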

VectorStore

SQLite-backed embedding storage with brute-force cosine similarity search:

#![allow(unused)]
fn main() {
impl VectorStore {
    pub fn open(path: &Path) -> Result<Self, VectorStoreError>;
    pub fn in_memory() -> Result<Self, VectorStoreError>;
    pub fn insert(&self, chunk_hash: &str, source_url: &str, captured_at: &str,
                  text: &str, embedding: &Embedding, provider: &str) -> Result<bool, VectorStoreError>;
    pub fn search(&self, query_embedding: &Embedding, limit: usize)
                  -> Result<Vec<StoredEmbedding>, VectorStoreError>;
}
}

StoredEmbedding

#![allow(unused)]
fn main() {
pub struct StoredEmbedding {
    pub chunk_hash: String,
    pub source_url: String,
    pub captured_at: String,
    pub text: String,
    pub similarity: f32,
}
}

Change Detection

LCS-based (Longest Common Subsequence) line-level diff:

#![allow(unused)]
fn main() {
pub struct ContentDiff {
    pub hunks: Vec<DiffHunk>,
    pub similarity: f32,      // 0.0 to 1.0
    pub added: usize,
    pub removed: usize,
    pub unchanged: usize,
}

pub enum DiffHunk {
    Added(String),
    Removed(String),
    Unchanged(String),
}
}

Compares two captures of the same URL to identify what changed between them.
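A sketch of the LCS core: dynamic-programming LCS over lines, then a walk that classifies each line (hunk coalescing and the similarity score are omitted):

```rust
#[derive(Debug, PartialEq)]
enum DiffHunk { Added(String), Removed(String), Unchanged(String) }

fn diff_lines(old: &str, new: &str) -> Vec<DiffHunk> {
    let a: Vec<&str> = old.lines().collect();
    let b: Vec<&str> = new.lines().collect();
    // lcs[i][j] = LCS length of a[i..] vs b[j..]
    let mut lcs = vec![vec![0usize; b.len() + 1]; a.len() + 1];
    for i in (0..a.len()).rev() {
        for j in (0..b.len()).rev() {
            lcs[i][j] = if a[i] == b[j] { lcs[i + 1][j + 1] + 1 }
                        else { lcs[i + 1][j].max(lcs[i][j + 1]) };
        }
    }
    // Walk the table, emitting hunks in order.
    let (mut i, mut j, mut out) = (0, 0, Vec::new());
    while i < a.len() && j < b.len() {
        if a[i] == b[j] {
            out.push(DiffHunk::Unchanged(a[i].to_string())); i += 1; j += 1;
        } else if lcs[i + 1][j] >= lcs[i][j + 1] {
            out.push(DiffHunk::Removed(a[i].to_string())); i += 1;
        } else {
            out.push(DiffHunk::Added(b[j].to_string())); j += 1;
        }
    }
    out.extend(a[i..].iter().map(|l| DiffHunk::Removed(l.to_string())));
    out.extend(b[j..].iter().map(|l| DiffHunk::Added(l.to_string())));
    out
}

fn main() {
    let hunks = diff_lines("a\nb\nc", "a\nx\nc");
    assert_eq!(hunks, vec![
        DiffHunk::Unchanged("a".into()),
        DiffHunk::Removed("b".into()),
        DiffHunk::Added("x".into()),
        DiffHunk::Unchanged("c".into()),
    ]);
}
```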

palimpsest-server

HTTP frontier server, retrieval API, and Prometheus metrics. Three distinct services in one crate.

Frontier API

Distributed crawling coordination. Workers pop URLs, fetch them, and push discoveries back.

FrontierState

#![allow(unused)]
fn main() {
pub struct FrontierState {
    pub frontier: Mutex<Frontier>,
    pub seed: CrawlSeed,
}
}

Endpoints

| Method | Path | Request Body | Response |
|--------|------|--------------|----------|
| POST | /seeds | {"urls": ["..."]} | {"accepted": N} |
| POST | /pop | {} | {"url": "...", "depth": 0, "priority": 0} |
| POST | /discovered | {"urls": [{"url": "...", "depth": 1, "parent_hash": "..."}]} | {"accepted": N} |
| GET | /status | (none) | {"queue_size": N, "seen_count": N, "host_count": N, "seed_value": N} |
| GET | /health | (none) | "ok" |

Retrieval API

Content serving for AI pipelines and search.

RetrievalState

#![allow(unused)]
fn main() {
pub struct RetrievalState {
    pub index: Mutex<SqliteIndex>,
    pub storage: FileSystemBlobStore,
    pub chunk_config: ChunkConfig,
}
}

Endpoints

| Method | Path | Query Params | Description |
|--------|------|--------------|-------------|
| GET | /v1/content | url | Raw captured content |
| GET | /v1/chunks | url | RAG-ready chunks with provenance |
| GET | /v1/history | url | All captures with timestamps |
| GET | /v1/search | q | Full-text search |
| GET | /health | (none) | Health check |

Metrics

#![allow(unused)]
fn main() {
pub struct Metrics {
    pub urls_fetched: AtomicU64,
    pub urls_failed: AtomicU64,
    pub urls_discovered: AtomicU64,
    pub robots_blocked: AtomicU64,
    pub bytes_stored: AtomicU64,
    pub blobs_stored: AtomicU64,
    pub api_requests: AtomicU64,
    pub frontier_pops: AtomicU64,
    pub frontier_pushes: AtomicU64,
}

impl Metrics {
    pub fn new() -> Self;
    pub fn render(&self) -> String;  // Prometheus text exposition format
}
}

All counters use AtomicU64 with Ordering::Relaxed — thread-safe, no locks, no control flow impact (Law 1 safe).
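A sketch of what rendering one counter in the text exposition format looks like. The metric name and HELP text mirror the documented output; the exact formatting of `Metrics::render` may differ:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// One counter in Prometheus text exposition format:
// # HELP, # TYPE, then "name value".
fn render_counter(name: &str, help: &str, value: &AtomicU64) -> String {
    format!(
        "# HELP {name} {help}\n# TYPE {name} counter\n{name} {}\n",
        value.load(Ordering::Relaxed)
    )
}

fn main() {
    let urls_fetched = AtomicU64::new(0);
    urls_fetched.fetch_add(4521, Ordering::Relaxed);
    let out = render_counter(
        "palimpsest_urls_fetched",
        "Total URLs successfully fetched.",
        &urls_fetched,
    );
    assert!(out.contains("palimpsest_urls_fetched 4521"));
}
```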

palimpsest-sim

Deterministic simulation testing framework. Proves the Six Laws hold at scale by crawling a virtual internet twice with the same seed and asserting byte-identical results.

SimulatedWeb

#![allow(unused)]
fn main() {
pub struct SimulatedWeb {
    seed: CrawlSeed,
    universes: BTreeMap<String, Box<dyn UniverseGenerator>>,
}

impl SimulatedWeb {
    pub fn new(seed: CrawlSeed) -> Self;
    pub fn add_universe(&mut self, generator: Box<dyn UniverseGenerator>);
    pub fn fetch(&self, url: &Url) -> Option<SimulatedResponse>;
}
}

UniverseGenerator Trait

#![allow(unused)]
fn main() {
pub trait UniverseGenerator: Send + Sync {
    fn domain(&self) -> &str;
    fn generate(&self, seed: &CrawlSeed, url: &Url) -> SimulatedResponse;
    fn page_count(&self) -> usize;
}
}

Each universe owns a domain (e.g., linkmaze.sim) and generates deterministic responses for any URL under that domain.

SimulatedResponse

#![allow(unused)]
fn main() {
pub struct SimulatedResponse {
    pub status: u16,
    pub headers: Vec<(String, String)>,
    pub body: Bytes,
    pub delay: Option<Duration>,
    pub fault: Option<FaultType>,
}

pub enum FaultType {
    ConnectionRefused, Timeout, Reset, RedirectLoop,
}
}

Six Adversarial Universes

| Universe | Domain | Tests |
|----------|--------|-------|
| LinkMaze | linkmaze.sim | Deep graph traversal, configurable fan-out |
| EncodingHell | encoding.sim | UTF-8 edge cases, mixed encodings, BOM |
| MalformedDom | malformed.sim | Broken HTML, unclosed tags, invalid attributes |
| RedirectLabyrinth | redirect.sim | Redirect chains, loops, cross-domain redirects |
| ContentTrap | trap.sim | Infinite calendars, session IDs, spider traps |
| TemporalDrift | drift.sim | Content changes between fetches |

Verification Harness

#![allow(unused)]
fn main() {
pub async fn verify_determinism<F>(
    web_factory: F,
    seeds: &[Url],
    max_depth: u32,
    max_urls: usize,
) -> Result<VerificationResult, String>
where F: Fn() -> SimulatedWeb;
}

Crawls twice, compares URLs, blob hashes, and index entries. Any divergence = failure.

VerificationResult

#![allow(unused)]
fn main() {
pub struct VerificationResult {
    pub urls: Vec<String>,
    pub blob_hashes: Vec<String>,
    pub index_entries: Vec<(String, String)>,
    pub pages_fetched: usize,
    pub errors: usize,
}
}

Scale

Proven deterministic at 1,000, 5,000, and 10,000 pages across all six universes with zero divergence.

palimpsest-cli

Command-line interface with 10 subcommands. Thin wrapper around the kernel crates.

crawl

Start a crawl with seed URLs.

palimpsest crawl <SEEDS>... [OPTIONS]

  -d, --depth <N>          Max crawl depth [default: 2]
  -m, --max-urls <N>       Max URLs to fetch [default: 100]
  -s, --seed <N>           Deterministic seed [default: 42]
  -o, --output-dir <DIR>   Persist to disk
      --browser            Headless Chrome capture
      --user-agent <UA>    User-Agent [default: PalimpsestBot/0.1]
      --politeness-ms <N>  Per-host delay in ms [default: 1000]
  -c, --config <FILE>      TOML config file

replay

palimpsest replay <URL> --data-dir <DIR>

history

palimpsest history <URL> --data-dir <DIR>

extract

palimpsest extract <URL> --data-dir <DIR> [--json]

shadow-compare

palimpsest shadow-compare --legacy <DIR> --palimpsest <DIR> [--json]

serve

Start a distributed frontier server.

palimpsest serve --port <PORT> --seed <N> --politeness-ms <N>

Default port: 8090.

worker

Connect to a frontier server and crawl.

palimpsest worker --server <URL> --output-dir <DIR> [--user-agent <UA>]

api

Start the retrieval API server.

palimpsest api --port <PORT> --data-dir <DIR>

Default port: 8080.

stats

Print workspace statistics.

palimpsest stats

migrate

Run storage migrations (JSON index to SQLite).

palimpsest migrate --data-dir <DIR>

Docker Deployment

Dockerfile

Multi-stage build: Rust 1.86 builder stage compiles a release binary, Debian slim runtime stage runs it.

# Build the image
docker build -t palimpsest .

# Single crawl
docker run -v "$(pwd)/output:/data" palimpsest crawl https://example.com -d 2 -o /data

# View help
docker run palimpsest --help

The final image includes only the stripped binary and minimal runtime dependencies (ca-certificates, libssl3).

Docker Compose

The compose file runs four services sharing a named volume:

docker compose up
| Service | Command | Port | Purpose |
|---------|---------|------|---------|
| api | api -p 8080 --data-dir /data | 8080 | Retrieval API |
| frontier | serve -p 8090 -s 42 --politeness-ms 500 | 8090 | Frontier server |
| worker | worker --server http://frontier:8090 -o /data | | Fetch worker |
| crawl | crawl <URL> -d 2 -m 50 -o /data | | One-shot crawl |

The crawl service uses the crawl profile — run it explicitly:

docker compose --profile crawl run crawl

Shared Volume

All services share the palimpsest-data named volume mounted at /data. This contains blobs, the SQLite index, WARC files, and frontier state.

Production Considerations

  • Set resource limits (mem_limit, cpus) per service
  • The frontier server is stateful — run a single instance
  • Workers are stateless — scale horizontally with docker compose up --scale worker=N
  • Mount the data volume to persistent storage for durability
  • Expose only the api service port externally; keep frontier internal

Distributed Crawling

Palimpsest supports horizontal scaling via an HTTP frontier server and N worker processes.

Architecture

                    ┌──────────────┐
    curl POST       │   Frontier   │ ◄── Deterministic ordering
    /seeds ────────►│   Server     │     (seed-driven)
                    │  :8090       │
                    └──┬───┬───┬──┘
                       │   │   │
              POST /pop│   │   │POST /discovered
                       │   │   │
                    ┌──┴┐ ┌┴──┐┌┴──┐
                    │W1 │ │W2 ││W3 │ ◄── Stateless workers
                    └───┘ └───┘└───┘
                       │   │   │
                       ▼   ▼   ▼
                    ┌──────────────┐
                    │ Shared Disk  │ (blobs, index, WARC)
                    └──────────────┘

Start the Frontier Server

palimpsest serve --port 8090 --seed 42 --politeness-ms 500

The frontier maintains deterministic URL ordering and politeness enforcement across all workers.

Seed URLs

curl -X POST http://localhost:8090/seeds \
  -H 'Content-Type: application/json' \
  -d '{"urls": ["https://example.com/", "https://docs.example.com/"]}'

Start Workers

# Terminal 2
palimpsest worker --server http://localhost:8090 --output-dir ./data

# Terminal 3 (scale out)
palimpsest worker --server http://localhost:8090 --output-dir ./data

Each worker loops: pop URL -> fetch -> store artifacts -> push discovered URLs.

Worker Flow

  1. POST /pop — receive next URL from frontier
  2. Fetch the URL (HTTP or browser)
  3. Store blob to content-addressed storage
  4. Insert entry into temporal index
  5. Write WARC++ records
  6. POST /discovered — push new URLs back to frontier
  7. Repeat

Monitoring

# Check frontier status
curl http://localhost:8090/status

# Response:
# {"queue_size": 1234, "seen_count": 5678, "host_count": 42, "seed_value": 42}

Determinism Guarantee

The frontier server maintains the same seed-driven ordering regardless of how many workers connect or in what order they pop URLs. Same seed = same frontier ordering.
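One reason this holds is structural: per-host queues live in ordered maps, so the order the frontier services hosts depends only on the keys, never on insertion timing or which worker pops next. A toy sketch (the real `Frontier` also applies seed-driven priority and politeness, omitted here):

```rust
use std::collections::{BTreeMap, VecDeque};

// Minimal frontier: BTreeMap keys give a worker-independent host order.
struct MiniFrontier {
    hosts: BTreeMap<String, VecDeque<String>>,
}

impl MiniFrontier {
    fn push(&mut self, host: &str, url: &str) {
        self.hosts.entry(host.to_string()).or_default().push_back(url.to_string());
    }
    fn pop(&mut self) -> Option<String> {
        // Always service the lexicographically first non-empty host.
        let host = self.hosts.iter().find(|(_, q)| !q.is_empty())?.0.clone();
        self.hosts.get_mut(&host)?.pop_front()
    }
}

fn main() {
    let mut f = MiniFrontier { hosts: BTreeMap::new() };
    // Insertion order differs from pop order; pops are still deterministic.
    f.push("b.example", "https://b.example/1");
    f.push("a.example", "https://a.example/1");
    assert_eq!(f.pop().as_deref(), Some("https://a.example/1"));
    assert_eq!(f.pop().as_deref(), Some("https://b.example/1"));
}
```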

Retrieval API

The retrieval API serves captured content over HTTP for AI pipelines, RAG systems, and content auditing.

Start the Server

palimpsest api --port 8080 --data-dir ./output

Endpoints

GET /v1/content

Retrieve raw captured content for a URL.

curl "http://localhost:8080/v1/content?url=https://example.com/"

Returns the stored HTTP response body.

GET /v1/chunks

Retrieve RAG-ready chunks with full provenance.

curl "http://localhost:8080/v1/chunks?url=https://example.com/"

Response:

{
  "url": "https://example.com/",
  "chunks": [
    {
      "text": "Example Domain. This domain is for use in illustrative examples...",
      "chunk_index": 0,
      "total_chunks": 3,
      "char_offset": 0,
      "chunk_hash": "blake3:af13...",
      "source_hash": "blake3:c7d2...",
      "captured_at": "2026-04-12T10:30:00Z"
    }
  ]
}

GET /v1/history

All captures of a URL with timestamps and content hashes.

curl "http://localhost:8080/v1/history?url=https://example.com/"

Response:

{
  "url": "https://example.com/",
  "captures": [
    {"captured_at": "2026-04-12T10:30:00Z", "content_hash": "blake3:af13...", "crawl_context": 1},
    {"captured_at": "2026-04-13T08:00:00Z", "content_hash": "blake3:b8e2...", "crawl_context": 2}
  ]
}

GET /v1/search

Search across captured content.

curl "http://localhost:8080/v1/search?q=example+domain"

GET /metrics

Prometheus-compatible metrics (see Monitoring).

GET /health

curl http://localhost:8080/health
# "ok"

Use Cases

  • RAG pipelines — /v1/chunks provides pre-chunked text with provenance for embedding
  • Content auditing — /v1/history shows exactly when content changed
  • AI training — /v1/content serves raw captured pages
  • Search systems — /v1/search provides full-text search across the archive

Monitoring & Metrics

Prometheus Endpoint

The API server exposes metrics at GET /metrics in Prometheus text exposition format:

curl http://localhost:8080/metrics
# HELP palimpsest_urls_fetched Total URLs successfully fetched.
# TYPE palimpsest_urls_fetched counter
palimpsest_urls_fetched 4521
# HELP palimpsest_urls_failed Total URLs that failed to fetch.
# TYPE palimpsest_urls_failed counter
palimpsest_urls_failed 12
# HELP palimpsest_urls_discovered Total URLs discovered via link extraction.
# TYPE palimpsest_urls_discovered counter
palimpsest_urls_discovered 15890
...

Available Metrics

| Metric | Type | Description |
|--------|------|-------------|
| palimpsest_urls_fetched | counter | Total URLs successfully fetched |
| palimpsest_urls_failed | counter | Total fetch failures |
| palimpsest_urls_discovered | counter | Total URLs discovered via links |
| palimpsest_robots_blocked | counter | Total URLs blocked by robots.txt |
| palimpsest_bytes_stored | counter | Total bytes written to blob storage |
| palimpsest_blobs_stored | gauge | Unique blobs in storage |
| palimpsest_api_requests | counter | Total API requests served |
| palimpsest_frontier_pops | counter | Total frontier pop operations |
| palimpsest_frontier_pushes | counter | Total frontier push operations |

All counters use AtomicU64 with Ordering::Relaxed — lock-free, thread-safe, no impact on crawl ordering (Law 1 safe).

Structured Logging

Palimpsest uses tracing with tracing-subscriber for structured logging:

# Set log level via environment
RUST_LOG=info palimpsest crawl https://example.com -o ./output

# Debug level for specific crate
RUST_LOG=palimpsest_frontier=debug palimpsest crawl ...

# Pretty-print structured logs for aggregation (assumes JSON log output)
RUST_LOG=info palimpsest crawl ... 2>&1 | jq .

Grafana Dashboard Suggestions

| Panel | Query | Type |
|-------|-------|------|
| Throughput | rate(palimpsest_urls_fetched[1m]) | Graph |
| Error Rate | rate(palimpsest_urls_failed[1m]) / rate(palimpsest_urls_fetched[1m]) | Gauge |
| Discovery Ratio | palimpsest_urls_discovered / palimpsest_urls_fetched | Stat |
| Robots Blocked | rate(palimpsest_robots_blocked[1m]) | Graph |
| Storage Growth | palimpsest_bytes_stored | Graph |
| API Load | rate(palimpsest_api_requests[1m]) | Graph |

Alerting Suggestions

  • Error rate > 5% — possible network or DNS issues
  • Throughput drop > 50% — politeness starvation or backend slowdown
  • Frontier pops = 0 — crawl may be stalled
  • Storage growth flatline — dedup working well, or crawl stopped

Trust Boundaries

Untrusted Inputs

All fetched content is untrusted. HTTP responses, HTML, JavaScript, CSS, images — all of it. Never execute, eval, or interpret fetched content outside a sandbox.

All URLs are untrusted. Validate scheme, host, and port. Block private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), link-local (169.254.0.0/16), and loopback (127.0.0.0/8) unless explicitly configured.

All DNS responses are untrusted. Record them in the ExecutionEnvelope for forensic replay, but verify against policy before connecting. DNS rebinding attacks can redirect requests to internal infrastructure.

All TLS certificates are recorded. The full certificate chain is stored in the envelope’s TlsFingerprint (protocol, cipher, cert chain hash). This enables forensic analysis of TLS state at capture time.

Storage Integrity

All artifacts are content-addressed. Tampering is detectable by recomputing the BLAKE3 hash and comparing it against the stored ContentHash. This verification happens on every read.

Storage backends must support atomic writes. The FileSystemBlobStore uses temp-file-plus-rename to prevent partial artifacts from being visible.

Blob deletion requires an explicit garbage collection pass — never inline during normal operation.
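The temp-file-plus-rename pattern can be sketched with std alone. On POSIX filesystems the rename is atomic, so readers never observe a partially written blob (file and directory names here are illustrative, not the store's actual layout):

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Write to a temporary sibling, fsync, then rename into place.
fn atomic_write(path: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = path.with_extension("tmp");
    let mut f = fs::File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?;           // flush data before the rename publishes it
    fs::rename(&tmp, path)?; // atomic publish
    Ok(())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("palimpsest-atomic-demo");
    fs::create_dir_all(&dir)?;
    let blob = dir.join("blob.bin");
    atomic_write(&blob, b"content-addressed bytes")?;
    assert_eq!(fs::read(&blob)?, b"content-addressed bytes");
    Ok(())
}
```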

Credential Safety

  • No credentials in source code, configuration files committed to git, or artifact metadata
  • HTTP auth credentials (for authenticated crawls) are injected via environment variables
  • TLS client certificates are loaded from a configured path, never embedded

Fetch Safety

Resource Limits

| Limit | Default | Configurable via |
|-------|---------|------------------|
| Maximum response body | 256 MiB | FetchConfig.max_body_size |
| Maximum redirect chain | 10 | FetchConfig.max_redirects |
| Connect timeout | 30 seconds | FetchConfig.connect_timeout |
| Total request timeout | 120 seconds | FetchConfig.total_timeout |

Decompression Bomb Protection

Responses with Content-Encoding: gzip (or brotli, deflate) are decompressed with size validation. The decompressed size is checked against Content-Length * reasonable_ratio to prevent zip bomb attacks.
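The ratio check reduces to a single guard. A sketch, with an illustrative ratio of 100x (not the actual default):

```rust
// Abort once decompressed output exceeds Content-Length * max_ratio.
fn check_decompressed_size(
    content_length: u64,
    decompressed: u64,
    max_ratio: u64,
) -> Result<(), String> {
    if decompressed > content_length.saturating_mul(max_ratio) {
        return Err(format!(
            "decompression bomb suspected: {decompressed} bytes from \
             {content_length} compressed (ratio > {max_ratio}x)"
        ));
    }
    Ok(())
}

fn main() {
    // 50x expansion is fine; 200x trips the guard.
    assert!(check_decompressed_size(1_000, 50_000, 100).is_ok());
    assert!(check_decompressed_size(1_000, 200_000, 100).is_err());
}
```

In practice the check runs incrementally during streaming decompression, so a bomb is rejected before it is fully inflated.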

Unsafe URL Schemes

Link extraction blocks unsafe URL schemes. These are logged but never followed:

  • javascript: — code execution
  • data: — embedded content (can be arbitrarily large)
  • blob: — browser-internal references
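A minimal allowlist sketch (assuming, per the list above, that only http and https are followable; relative URLs are resolved against the base URL before this check):

```rust
// Scheme filter: only http/https survive link extraction.
fn is_followable(url: &str) -> bool {
    let scheme = url.split(':').next().unwrap_or("").to_ascii_lowercase();
    matches!(scheme.as_str(), "http" | "https")
}

fn main() {
    assert!(is_followable("https://example.com/page"));
    assert!(!is_followable("javascript:void(0)"));
    assert!(!is_followable("data:text/html;base64,PGI+"));
    assert!(!is_followable("blob:https://example.com/some-uuid"));
}
```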

HTML Sanitization

Before link extraction, <script> and <style> tag content is stripped entirely. This prevents extracting junk URLs from JavaScript source code (e.g., minified variable names that look like relative paths).

#![allow(unused)]
fn main() {
pub fn extract_links(html: &str, base_url: &Url) -> Vec<Url> {
    let cleaned = strip_tag_content(html, &["script", "style"]);
    // ... scan for href, src attributes
}
}

robots.txt Enforcement

Palimpsest respects robots.txt per RFC 9309:

  • Fetches and caches robots.txt per origin before crawling
  • Respects Disallow directives for the configured user agent
  • Honors Crawl-delay when specified
  • Blocked URLs are counted in metrics (palimpsest_robots_blocked)

Browser Sandbox

Isolation Model

Headless Chrome runs in a sandboxed process with strict isolation:

  • No persistent storage — each page load starts from a clean browser context. No cookies, localStorage, or IndexedDB carry over between pages.
  • Controlled network — the browser communicates only through the fetch engine’s controlled proxy. Direct network access is blocked.
  • Disabled exfiltration — WebRTC, geolocation, notifications, and clipboard APIs are disabled to prevent data leakage.

Timeout Enforcement

Every page load has a hard timeout (default: 30 seconds). If the page does not complete loading within the timeout, the browser process is killed.

Determinism Overrides

Before any page scripts execute, Palimpsest injects JavaScript overrides seeded from CrawlSeed:

// Time is frozen and advances deterministically
Date.now = function() { return 1700000000000 + (__date_offset += 1); };

// Math.random is seeded (xorshift)
Math.random = function() { /* seeded PRNG */ };

// performance.now advances in fixed increments
performance.now = function() { return (__perf_offset += 0.1); };

This prevents JavaScript on the page from introducing non-determinism. Same seed = same execution.
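The seeded PRNG behind the Math.random override can be sketched in Rust. This is a generic xorshift64 mapped to [0, 1); the actual shift constants and seeding from CrawlSeed may differ:

```rust
// 64-bit xorshift state; same seed = same sequence.
struct Xorshift64(u64);

impl Xorshift64 {
    fn next_f64(&mut self) -> f64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        // Use the top 53 bits for a uniform double in [0, 1).
        (x >> 11) as f64 / (1u64 << 53) as f64
    }
}

fn main() {
    let (mut a, mut b) = (Xorshift64(42), Xorshift64(42));
    let seq_a: Vec<f64> = (0..5).map(|_| a.next_f64()).collect();
    let seq_b: Vec<f64> = (0..5).map(|_| b.next_f64()).collect();
    assert_eq!(seq_a, seq_b); // deterministic replay
    assert!(seq_a.iter().all(|v| (0.0..1.0).contains(v)));
}
```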

CDP Stealth Mode

When stealth: true is set on BrowserFetchConfig, a comprehensive anti-detection suite is applied on top of the determinism overrides.

Chrome Launch Hardening

--disable-blink-features=AutomationControlled
--disable-component-extensions-with-background-pages
--no-first-run
--no-default-browser-check

17 Stealth Evasion Patches

All patches injected via Page.addScriptToEvaluateOnNewDocument before navigation:

| Patch | Purpose |
|-------|---------|
| navigator.webdriver | Set to false (configurable via WebdriverValue enum) |
| window.chrome | Full Chrome object mock (app, csi, loadTimes, runtime) |
| navigator.plugins | 3 plugins matching real Chrome |
| navigator.mimeTypes | PDF + NaCl mime types |
| navigator.permissions | Fix Notification permission inconsistency |
| navigator.languages | ["en-US", "en"] |
| navigator.hardwareConcurrency | 8 cores |
| navigator.deviceMemory | 8 GB |
| WebGL vendor/renderer | Intel UHD Graphics 630 |
| Canvas fingerprint | Seeded sub-pixel noise |
| Window/screen dimensions | Match viewport + chrome UI offset |
| AudioContext | Seeded oscillator noise |
| ClientRect | Seeded sub-pixel noise |
| sourceURL markers | Strip automation stack traces |
| navigator.userAgent | Consistent with HTTP User-Agent header |
| navigator.maxTouchPoints | 0 |

Determinism Guarantee

All noise patches (canvas, audio, ClientRect) use deterministic xorshift PRNGs with sub-seeds derived from CrawlSeed. Same seed = same noise = same fingerprint. This is Law 1 compliant.

Verified Results

Tested against 5 public bot detection sites:

  • Rebrowser Bot Detector: 10/10 pass
  • Sannysoft: 55/56 pass (the only failure is the PluginArray prototype check)
  • FingerprintJS BotD: No bot verdict
  • CreepJS: No hard failures

Sub-Resource Capture

Chrome DevTools Protocol (CDP) network event listeners capture all sub-resources:

  • Network.requestWillBeSent — records every outgoing request
  • Network.responseReceived — captures response metadata
  • Network.getResponseBody — retrieves response body for each sub-resource

Each sub-resource is recorded as a separate WARC record with its own ContentHash, and the full dependency graph is stored in the resource-graph record.

Testing Philosophy

The Hierarchy

Tests are prioritized by the strength of the guarantee they provide:

  1. Determinism tests — Same seed + same input = bit-identical output. These are the proof that the system works. Highest priority.
  2. Property-based tests — proptest generates random inputs and verifies invariants hold for all of them. Catches edge cases humans miss.
  3. Snapshot tests — insta for serialization formats (WARC++, JSON, index entries). Snapshots are reviewed artifacts.
  4. Integration tests — Real HTTP via wiremock, real storage backends, real index queries.
  5. Unit tests — Standard #[test] for isolated logic.

No Mocking Core Interfaces

The storage layer, index, and envelope are the system’s integrity boundaries. Never mock them. Use real implementations with in-memory backends:

#![allow(unused)]
fn main() {
// Correct: real implementation, in-memory backend
let store = InMemoryBlobStore::new();
let index = InMemoryIndex::new();

// Wrong: mocked storage that always returns Ok
// let store = MockBlobStore::new();  // DON'T
}

Test Naming

test_{what}_{condition}_{expected_outcome}

Examples:

  • test_frontier_with_same_seed_produces_identical_order
  • test_artifact_hash_changes_when_content_differs
  • test_storage_put_get_roundtrip_preserves_content

Adversarial Testing

Every adversarial input must produce a classified error, never a panic or silent corruption:

  • Malformed HTTP responses
  • Truncated connections mid-transfer
  • DNS resolution failures
  • TLS certificate anomalies
  • Content that attempts to exploit parsers (polyglot files, zip bombs)

Test Coverage

301 tests across 21 test files, covering all 15 crates. The simulation framework proves determinism at 10,000 pages with zero divergence.

Simulation Framework

The simulation framework (palimpsest-sim) provides a virtual internet for testing. It replaces real HTTP with deterministic, seed-driven responses — enabling proof that the Six Laws hold at scale.

SimulatedWeb

A SimulatedWeb hosts multiple “universes,” each owning a domain:

#![allow(unused)]
fn main() {
let mut web = SimulatedWeb::new(CrawlSeed::new(42));
web.add_universe(Box::new(LinkMaze { links_per_page: 500, total_pages: 100_000 }));
web.add_universe(Box::new(EncodingHell));
web.add_universe(Box::new(MalformedDom));
}

Calling web.fetch(&url) returns a SimulatedResponse generated deterministically from the seed and URL.

SimulatedServer

Wraps SimulatedWeb with wiremock to serve responses over real HTTP. The CrawlOrchestrator connects to it as if it were the real web.

verify_determinism

The core harness:

#![allow(unused)]
fn main() {
let result = verify_determinism(
    || build_web(seed),  // Factory creates identical web each time
    &seed_urls,
    max_depth,
    max_urls,
).await?;
}

This function:

  1. Creates a SimulatedWeb from the factory
  2. Runs a full crawl (orchestrator + frontier + storage + index)
  3. Records all URLs, blob hashes, and index entries
  4. Creates a second SimulatedWeb from the same factory
  5. Runs an identical crawl
  6. Asserts the two runs produced byte-identical results

Any divergence in URLs, blob hashes, or index entries causes a test failure.
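The shape of the harness is simple: run the same crawl twice and demand identical observable results. A toy sketch, with a stand-in closure in place of the real orchestrator (which compares URLs, blob hashes, and index entries):

```rust
// Run the same deterministic "crawl" twice; any divergence is an error.
fn verify<F>(run: F) -> Result<Vec<String>, String>
where
    F: Fn() -> Vec<String>,
{
    let first = run();
    let second = run();
    if first == second {
        Ok(first)
    } else {
        Err("divergence between runs".to_string())
    }
}

fn main() {
    // A deterministic stand-in crawl: output derives purely from a seed.
    let seed = 42u64;
    let crawl = || (0..3).map(|i| format!("https://example.sim/{}", seed + i)).collect();
    assert!(verify(crawl).is_ok());
}
```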

verify_resumption_determinism

Tests crawl resumption:

  1. Crawl 500 pages, save frontier state
  2. Create new frontier, load saved state
  3. Continue crawling to 1000 pages
  4. Compare against a single 1000-page run

Same result = Law 1 holds across save/load boundaries.

Scale Tests

| Test | Pages | Universes | Result |
|------|-------|-----------|--------|
| test_scale_1000_pages_deterministic | 1,000 | 5 | Zero divergence |
| test_scale_5000_pages_linkmaze_only | 5,000 | 1 | Zero divergence |
| test_stress_10k_pages_deterministic | 10,000 | 5 | Zero divergence |

Adversarial Universes

The simulation framework includes six adversarial universes, each designed to stress a specific aspect of the crawl kernel.

LinkMaze

Domain: linkmaze.sim

A deep, wide graph. Each page contains links_per_page links to other pages in the maze. Tests frontier scheduling, deduplication, and depth limiting at scale.

#![allow(unused)]
fn main() {
LinkMaze { links_per_page: 500, total_pages: 1_000_000 }
}

EncodingHell

Domain: encoding.sim

UTF-8 edge cases: mixed encodings, byte-order marks, surrogate pairs, right-to-left text, zero-width characters, overlong sequences. Tests that content hashing and text extraction handle encoding correctly.

MalformedDom

Domain: malformed.sim

Broken HTML: unclosed tags, deeply nested tables, invalid attributes, missing doctype, mixed content models. Tests link extraction robustness — the parser must not crash or produce junk URLs.

RedirectLabyrinth

Domain: redirect.sim

Redirect chains (301 -> 302 -> 301 -> 200), redirect loops, cross-domain redirects, redirect-to-self. Tests redirect chain depth enforcement and URL normalization.

ContentTrap

Domain: trap.sim

Spider traps: infinite calendars (every date links to the next), session IDs in URLs (creating infinite unique URLs), query parameter permutations. Tests that max_urls and deduplication prevent infinite crawls.

TemporalDrift

Domain: drift.sim

Content changes between fetches. The same URL returns different content depending on the logical clock value. Tests temporal integrity — the index must correctly record each version.

#![allow(unused)]
fn main() {
TemporalDrift::new(1)  // Content changes every 1 logical tick
}

Composition

All six universes run simultaneously in scale tests:

#![allow(unused)]
fn main() {
let mut web = SimulatedWeb::new(seed);
web.add_universe(Box::new(LinkMaze { ... }));
web.add_universe(Box::new(EncodingHell));
web.add_universe(Box::new(MalformedDom));
web.add_universe(Box::new(RedirectLabyrinth));
web.add_universe(Box::new(ContentTrap));
web.add_universe(Box::new(TemporalDrift::new(1)));
}

The crawl must handle all six simultaneously — deterministic ordering across domains, correct error classification, and zero divergence between runs.

Development Setup

Prerequisites

| Tool | Version | Why |
|------|---------|-----|
| Rust | 1.86+ stable | rustup update stable |
| CMake | 3.x+ | BoringSSL compilation |
| Go | 1.19+ | BoringSSL compilation |
| C compiler | gcc, clang, or MSVC | BoringSSL compilation |
| Git | any | Source checkout |

See Installation for platform-specific setup (macOS, Linux, Windows).

Clone and Build

git clone https://github.com/copyleftdev/palimpsest.git
cd palimpsest
cargo build --workspace

First build takes 2-4 minutes (BoringSSL compiles from source). Subsequent builds are incremental.

Running Tests

# Full test suite (288 tests, excludes long-running scale tests)
cargo test --workspace

# Simulation tests only
cargo test -p palimpsest-sim --test simulation_tests

# Scale tests (1K + 5K pages, ~90 seconds)
cargo test -p palimpsest-sim --test scale_test

# Stress test (10K pages)
cargo test -p palimpsest-sim --test stress_test

# Stealth regression tests (requires Chrome + network access)
cargo test -p palimpsest-fetch --test stealth_test -- --ignored --nocapture --test-threads=1

# Single crate
cargo test -p palimpsest-frontier

Pre-Commit Checks

Before submitting a PR:

cargo fmt --check            # Formatting
cargo clippy -- -D warnings  # Lints (must be warning-free)
cargo test --workspace       # All tests pass

IDE Setup

rust-analyzer is recommended for all editors. The workspace Cargo.toml at the project root configures all 15 crates automatically.

| Editor | Setup |
|--------|-------|
| VS Code | Install rust-analyzer extension |
| JetBrains (CLion/RustRover) | Built-in Rust support |
| Neovim | mason.nvim → install rust-analyzer |
| Emacs | lsp-mode + rustic |

Docker Testing

docker build -t palimpsest .
docker run palimpsest --help

Platform Notes

macOS

BoringSSL builds cleanly with Xcode command line tools + Homebrew CMake + Go. No special flags needed.

Windows (MSVC)

Requires Visual Studio Build Tools with the “Desktop development with C++” workload. CMake and Go must be in PATH. WSL2 is the recommended alternative for a smoother experience.

Linux

All major distributions work. Ensure cmake, go, and clang (or gcc) are installed. See Installation for distro-specific package commands.

Code Standards

Deterministic Concurrency

  • BTreeMap over HashMap when iteration order is observable
  • tokio for all concurrency — no thread::spawn
  • No rand crate — all randomness via CrawlSeed -> ChaCha8Rng
  • No Instant::now() in core logic — time from ExecutionEnvelope or caller
  • Atomics for counters/metrics only, never for control flow

Error Handling

  • All errors typed as PalimpsestError variants — no anyhow or eyre in library crates
  • No .unwrap() or .expect() in library code — binary crates may use them in main() only
  • Every ? propagation must preserve the error taxonomy
  • panic! is a bug report, not control flow

Memory and Performance

  • bytes::Bytes for buffers crossing async boundaries
  • Zero-copy: &[u8] > Vec<u8> > String
  • No Clone on large types — use Arc<T> for shared ownership
  • Pre-allocate buffers in hot paths

Serialization

  • serde derive on all types crossing crate boundaries
  • #[serde(rename_all = "snake_case")] on enum variants
  • JSON for human-readable formats
  • Never change serialized field names without a migration plan

Type Design

  • Newtypes for domain concepts: ContentHash, CaptureInstant, CrawlSeed, CrawlContextId
  • Parse, don’t validate: constructors enforce invariants
  • Copy for small values (hashes, timestamps, IDs)
  • #[non_exhaustive] on public enums that may grow

Testing Requirements

  • Every public function has at least one test
  • Property-based tests (proptest) for data transformation functions
  • Snapshot tests (insta) for serialization formats
  • Determinism tests: same seed = byte-identical output

Commit & PR Conventions

Commit Messages

Format:

<type>(<scope>): <description>

<body — explains WHY, not what>

Types: feat, fix, refactor, test, docs, perf, chore

Scope: the crate name in parens: feat(frontier):, fix(envelope):, refactor(storage):

Examples:

feat(frontier): add crawl resumption via frontier save/load

Enables stopping and restarting crawls without losing state.
The frontier serializes its complete state (host queues, seen set,
politeness timestamps) to JSON and restores it on load.
fix(fetch): strip script/style content before link extraction

Link extraction was producing junk URLs from minified JavaScript.
Stripping <script> and <style> tags before scanning eliminates
false positives without affecting real links.

Breaking changes: prefix the body with BREAKING:

Special requirements:

  • Commits touching fetch/artifact/replay must include a replay fidelity test
  • Commits touching frontier/envelope must include a determinism test

Pull Requests

  • One concern per PR. No bundled drive-bys.
  • Must include tests exercising the invariant being changed
  • Benchmark before/after for performance-sensitive paths
  • cargo clippy -- -D warnings and cargo test must pass
  • New dependencies require justification in the PR description

Dependency Policy

Minimize external dependencies — every dep is attack surface.

Approved:

  • tokio — async runtime
  • reqwest/hyper — HTTP
  • serde + serde_json — serialization
  • blake3 — content hashing
  • chrono — temporal types
  • tracing — structured observability

Forbidden in core crates:

  • rand — use CrawlSeed for all randomness
  • anyhow/eyre — use typed PalimpsestError

Process:

  • Pin all versions in Cargo.lock (committed)
  • Run cargo audit before merging new deps
  • No build scripts that download or execute external code

Error Taxonomy

Every failure in Palimpsest is classified into exactly one category. No silent retries. No swallowed errors. Failures are stored artifacts — they are part of the crawl record, not noise to discard.

PalimpsestError

The top-level error enum. Every error in the system ultimately maps to one of these seven variants.

Network

PalimpsestError::Network(String)

Connection failures, DNS resolution errors, TCP timeouts, TLS handshake failures. The fetch could not reach the server.

Examples: DNS NXDOMAIN, connection refused, connect timeout, TLS certificate expired.

Protocol

PalimpsestError::Protocol(String)

HTTP protocol violations. The server responded, but the response is malformed or violates HTTP semantics.

Examples: Invalid status line, malformed headers, truncated chunked encoding, invalid Content-Length.

Rendering

PalimpsestError::Rendering(String)

Browser/DOM errors. Chrome launched but could not render the page correctly.

Examples: JavaScript execution error, page load timeout, CDP connection lost, DOM snapshot failure.

Policy

PalimpsestError::Policy(String)

The system refused to process a URL based on configured policy.

Examples: robots.txt disallow, scope violation (URL outside configured domain), rate limit enforcement, max depth exceeded.

DeterminismViolation

PalimpsestError::DeterminismViolation {
    context: String,
    expected: String,
    actual: String,
}

The nuclear option. This means a Law was broken. Two runs with the same seed produced different results. This should never happen in production — if it does, it’s a bug in the kernel.

Examples: Frontier produced different ordering for same seed, content hash mismatch for identical input, replay diverged from original.

Storage

PalimpsestError::Storage(String)

Blob store failures: write errors, read errors, integrity check failures, backend unavailability.

Examples: Disk full, permission denied, blob corrupted (hash mismatch on read), S3 connection error.

Replay

PalimpsestError::Replay(String)

Missing artifacts, incomplete capture groups, reconstruction failures. The stored data is insufficient to replay.

Examples: Missing blob for content hash, no envelope record in WARC, incomplete resource graph.

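Gathering the variants above into one place, a std-only sketch of the documented shape (Display and std::error::Error impls omitted; the category helper is illustrative, not part of the crate):

```rust
/// The seven documented variants of the top-level error enum.
#[derive(Debug, Clone, PartialEq, Eq)]
enum PalimpsestError {
    Network(String),
    Protocol(String),
    Rendering(String),
    Policy(String),
    DeterminismViolation { context: String, expected: String, actual: String },
    Storage(String),
    Replay(String),
}

/// Classification is total: every failure maps to exactly one category.
fn category(err: &PalimpsestError) -> &'static str {
    match err {
        PalimpsestError::Network(_) => "network",
        PalimpsestError::Protocol(_) => "protocol",
        PalimpsestError::Rendering(_) => "rendering",
        PalimpsestError::Policy(_) => "policy",
        PalimpsestError::DeterminismViolation { .. } => "determinism_violation",
        PalimpsestError::Storage(_) => "storage",
        PalimpsestError::Replay(_) => "replay",
    }
}

fn main() {
    let err = PalimpsestError::Policy("robots.txt disallow".into());
    assert_eq!(category(&err), "policy");
}
```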
Other Error Types

| Error | Crate | Variants |
|---|---|---|
| StorageError | palimpsest-storage | Backend, NotFound, IntegrityError |
| EnvelopeError | palimpsest-envelope | MissingSeed, MissingTimestamp, MissingTargetUrl, MissingDnsSnapshot |
| CaptureGroupError | palimpsest-artifact | MissingUrl, MissingTimestamp, MissingCrawlContext, MissingEnvelope, MissingRequest, MissingResponse |
| FrontierPersistError | palimpsest-frontier | Wraps serialization/IO errors |
| IndexError | palimpsest-index | Wraps SQLite errors |
| WarcWriteError | palimpsest-artifact | Wraps IO/format errors |
| VectorStoreError | palimpsest-embed | Wraps SQLite errors |

API Quick Reference

Frontier API (default port 8090)

POST /seeds

Seed the frontier with URLs to crawl.

curl -X POST http://localhost:8090/seeds \
  -H 'Content-Type: application/json' \
  -d '{"urls": ["https://example.com/", "https://docs.example.com/"]}'

Response: {"accepted": 2}

POST /pop

Pop the next URL from the frontier.

curl -X POST http://localhost:8090/pop \
  -H 'Content-Type: application/json' \
  -d '{}'

Response: {"url": "https://example.com/", "depth": 0, "priority": 0}

Returns {"url": null} when the frontier is empty.

POST /discovered

Push discovered URLs back to the frontier.

curl -X POST http://localhost:8090/discovered \
  -H 'Content-Type: application/json' \
  -d '{"urls": [{"url": "https://example.com/page", "depth": 1, "parent_hash": "af1349b9..."}]}'

Response: {"accepted": 1}

GET /status

curl http://localhost:8090/status

Response: {"queue_size": 1234, "seen_count": 5678, "host_count": 42, "seed_value": 42}

GET /health

curl http://localhost:8090/health

Response: "ok"


Retrieval API (default port 8080)

GET /v1/content

curl "http://localhost:8080/v1/content?url=https://example.com/"

Returns raw captured content.

GET /v1/chunks

curl "http://localhost:8080/v1/chunks?url=https://example.com/"

Returns RAG chunks with provenance (chunk_hash, source_hash, captured_at, char_offset).

GET /v1/history

curl "http://localhost:8080/v1/history?url=https://example.com/"

Returns all captures with timestamps and content hashes.

GET /v1/search

curl "http://localhost:8080/v1/search?q=example+domain"

Returns matching content across all captured pages.

GET /metrics

curl http://localhost:8080/metrics

Returns Prometheus text exposition format.

GET /health

curl http://localhost:8080/health

Response: "ok"

Glossary

Core Types

Crawl Kernel — The deterministic execution engine at the heart of Palimpsest. Schedules fetches, seals execution contexts, captures artifacts, stores blobs, indexes temporal state. Not a crawler — a kernel that crawlers are built on.

CrawlSeed — A 64-bit value that controls all randomness in the system. CrawlSeed::rng() returns a ChaCha8Rng PRNG. Same seed = identical behavior.

ContentHash — A 32-byte BLAKE3 hash. Used to address, store, retrieve, and verify every artifact. ContentHash::of(data) computes the hash.

CaptureInstant — A paired timestamp: wall clock (DateTime<Utc>) + logical clock (u64). Binds captures to both real-world time and crawl-internal ordering.

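A hypothetical std-only sketch of the pairing, with SystemTime standing in for chrono's DateTime<Utc> and CaptureStamp as an illustrative name: crawl-internal ordering uses the logical clock, which stays deterministic even when wall clocks jitter between runs.

```rust
use std::time::SystemTime;

/// Sketch of a paired timestamp: wall clock + logical clock.
#[derive(Clone, Copy, Debug)]
struct CaptureStamp {
    wall: SystemTime,
    logical: u64,
}

impl CaptureStamp {
    fn new(wall: SystemTime, logical: u64) -> Self {
        Self { wall, logical }
    }
}

/// Crawl-internal ordering compares logical ticks, not wall time.
fn earlier(a: &CaptureStamp, b: &CaptureStamp) -> bool {
    a.logical < b.logical
}

fn main() {
    let now = SystemTime::now();
    let a = CaptureStamp::new(now, 10);
    let b = CaptureStamp::new(now, 11); // same wall reading, later logical tick
    assert!(earlier(&a, &b));
}
```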
CrawlContextId — An opaque u64 identifier for a crawl session. Distinguishes captures from different runs.

ExecutionEnvelope — An immutable, sealed record of everything that affects a fetch: seed, timestamp, target URL, DNS snapshot, TLS fingerprint, browser config, and headers. Constructed via EnvelopeBuilder, frozen after build().

Frontier & Scheduling

Frontier — The deterministic URL scheduler. Maintains per-host priority queues in a BTreeMap, deduplicates by URL, and enforces politeness delays.

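A toy version of this scheduling discipline, assuming none of Palimpsest's real types (MiniFrontier is illustrative; priorities and politeness delays are omitted): BTreeMap iteration order is fixed, so the pop sequence is a pure function of the pushes.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Minimal deterministic frontier: per-host FIFO queues in a BTreeMap,
/// URL-level dedup via a seen set.
#[derive(Default)]
struct MiniFrontier {
    hosts: BTreeMap<String, VecDeque<String>>,
    seen: BTreeSet<String>,
}

impl MiniFrontier {
    fn push(&mut self, host: &str, url: &str) {
        // Deduplicate by URL: insert returns false if already seen.
        if self.seen.insert(url.to_string()) {
            self.hosts
                .entry(host.to_string())
                .or_default()
                .push_back(url.to_string());
        }
    }

    /// Pop from the first non-empty host queue in BTreeMap key order —
    /// the same pushes always yield the same pop sequence.
    fn pop(&mut self) -> Option<String> {
        let host = self.hosts.iter().find(|(_, q)| !q.is_empty())?.0.clone();
        self.hosts.get_mut(&host)?.pop_front()
    }
}

fn main() {
    let mut f = MiniFrontier::default();
    f.push("example.com", "https://example.com/");
    f.push("example.com", "https://example.com/"); // deduplicated
    f.push("docs.example.com", "https://docs.example.com/");
    assert_eq!(f.pop().as_deref(), Some("https://docs.example.com/"));
    assert_eq!(f.pop().as_deref(), Some("https://example.com/"));
    assert_eq!(f.pop(), None);
}
```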
FrontierEntry — A URL in the frontier with depth, priority, and parent hash.

PolitenessPolicy — Configurable per-host rate limiting: minimum delay between requests and maximum concurrent hosts.

Artifacts & WARC

WARC++ — Palimpsest’s extension of the ISO 28500 WARC format. Adds envelope, dom-snapshot, resource-graph, and timing record types while maintaining backward compatibility.

WarcRecord — A single WARC record with type, record ID, content hash, and payload.

CaptureGroup — A bundle of related WARC records from a single fetch: envelope + request + response + optional DOM/resource graph/timing.

RecordType — Enum of WARC record types: 5 standard (warcinfo, request, response, resource, metadata) + 4 extensions (envelope, dom-snapshot, resource-graph, timing).

DomSnapshot — The rendered DOM state after JavaScript execution, captured via CDP.

ResourceGraph — The dependency graph of all sub-resources loaded for a page, with type, hash, initiator, and load ordering.

Storage

BlobStore — The trait interface for content-addressed storage. Implementations: InMemoryBlobStore, FileSystemBlobStore, ObjectStoreBlobStore.

Content-Addressed Storage — Storage where the key is the hash of the content. Same content = same key = stored once. Integrity is verifiable by recomputing the hash.

Deduplication — Structural dedup: if ContentHash::of(data_a) == ContentHash::of(data_b), the data is stored once. Not a post-process step — built into the storage model.

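The storage model can be sketched with a std HashMap; DefaultHasher stands in for BLAKE3 and MiniBlobStore is illustrative, not the BlobStore trait: because the key is derived from the content, writing the same bytes twice cannot create a second copy.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for ContentHash::of — the real system uses 32-byte BLAKE3.
fn content_key(data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

/// Content-addressed store: the key *is* the hash, so identical
/// content is stored exactly once (structural deduplication).
#[derive(Default)]
struct MiniBlobStore {
    blobs: HashMap<u64, Vec<u8>>,
}

impl MiniBlobStore {
    fn put(&mut self, data: &[u8]) -> u64 {
        let key = content_key(data);
        self.blobs.entry(key).or_insert_with(|| data.to_vec());
        key
    }

    fn len(&self) -> usize {
        self.blobs.len()
    }
}

fn main() {
    let mut store = MiniBlobStore::default();
    let a = store.put(b"<html>same page</html>");
    let b = store.put(b"<html>same page</html>"); // same key, no new blob
    assert_eq!(a, b);
    assert_eq!(store.len(), 1);
}
```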
Index & Replay

Temporal Index — A multi-dimensional index mapping URL x time x hash x crawl_context. Not a lookup table — a queryable graph of web history.

IndexEntry — One capture record in the index: URL, CaptureInstant, ContentHash, CrawlContextId.

Replay Fidelity — The guarantee that stored artifacts are sufficient to reconstruct the original HTTP exchange, DOM state, and resource graph. Law 5.

Comparison & Analysis

Shadow Comparison — Side-by-side validation of Palimpsest output against legacy crawler WARC files (Heritrix, wget, Warcprox).

ContentChunk — A provenance-tagged text chunk for RAG pipelines. Carries source_url, captured_at, source_hash, chunk_hash, and char_offset.

Embedding — A vector of f32 values representing text semantics. Generated by an EmbeddingProvider.

EmbeddingProvider — The trait for embedding generation. HashEmbedder provides deterministic test embeddings via BLAKE3.

VectorStore — SQLite-backed storage for embeddings with brute-force cosine similarity search.

Cosine Similarity — The similarity metric between two embedding vectors. Range: -1.0 to 1.0. Used for semantic search.

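A direct implementation of the metric as used for brute-force search (illustrative, not the VectorStore internals): dot(a, b) / (|a| · |b|), yielding 1.0 for parallel vectors, 0.0 for orthogonal, -1.0 for opposite.

```rust
/// Cosine similarity between two embedding vectors. Range: -1.0 to 1.0.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embeddings must share a dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0; // convention chosen here for zero vectors
    }
    dot / (norm_a * norm_b)
}

fn main() {
    assert!((cosine_similarity(&[1.0, 0.0], &[1.0, 0.0]) - 1.0).abs() < 1e-6);
    assert!(cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-6);
    assert!((cosine_similarity(&[1.0, 0.0], &[-1.0, 0.0]) + 1.0).abs() < 1e-6);
}
```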
Simulation

SimulatedWeb — A virtual internet for testing. Hosts multiple UniverseGenerator instances, each responding to URLs on its domain.

UniverseGenerator — The trait for generating deterministic responses. Implementations: LinkMaze, EncodingHell, MalformedDom, RedirectLabyrinth, ContentTrap, TemporalDrift.

Adversarial Universe — A simulation universe designed to stress a specific aspect of the crawl kernel (encoding, DOM parsing, redirects, spider traps, temporal changes).

Anti-Detection

JA3 — TLS fingerprinting method that hashes five fields from the ClientHello: TLS version, cipher suites, extensions, supported groups, EC point formats. Legacy but still deployed by WAFs.

JA4 — Current TLS fingerprinting standard (FoxIO). Sorts before hashing to defeat extension randomization. Three sections: header, sorted cipher hash, sorted extension hash.

BoringSSL — Google’s fork of OpenSSL used by Chrome. Palimpsest uses it (via wreq) for full ClientHello control, enabling browser-grade TLS impersonation.

Akamai h2 Fingerprint — Passive HTTP/2 fingerprint capturing SETTINGS frame values/order, WINDOW_UPDATE, PRIORITY frames, and pseudo-header ordering. Distinguishes browsers from automation clients.

CDP Stealth Mode — Anti-detection suite for headless Chrome. 17 evasion patches covering navigator.webdriver, window.chrome, plugins, WebGL, canvas noise, AudioContext noise, and more.

BrowserProfile — A unified, internally consistent browser identity tying TLS fingerprint + HTTP/2 settings + HTTP headers + JS surface into a single profile. Prevents cross-layer detection mismatches.

ProfileMode — Controls how browser profiles are selected: None (default), Fixed (one profile), Seeded (deterministic from CrawlSeed), RotatePerDomain (per-domain via BLAKE3).

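A sketch of what RotatePerDomain-style selection might look like, with DefaultHasher standing in for the BLAKE3 derivation the documentation describes (profile_for_domain and its parameters are hypothetical): the chosen index is a pure function of seed and domain, so selection stays deterministic across runs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive a stable profile index from (seed, domain).
fn profile_for_domain(seed: u64, domain: &str, profile_count: usize) -> usize {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    domain.hash(&mut h);
    (h.finish() % profile_count as u64) as usize
}

fn main() {
    // Same seed + same domain => same profile, every time.
    assert_eq!(
        profile_for_domain(42, "example.com", 8),
        profile_for_domain(42, "example.com", 8)
    );
    // Index always falls within the profile pool.
    assert!(profile_for_domain(42, "docs.example.com", 8) < 8);
}
```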
WebdriverValue — Explicit config for navigator.webdriver in stealth mode. False (matches real Chrome, default) or Undefined (property deleted). Auditable, not hidden.