Introduction
Palimpsest is a deterministic crawl kernel — not a crawler, not a Wayback clone, not a scraping framework. It is the foundational memory layer of the web: a system where the same input and the same seed produce an identical crawl, identical artifacts, and identical replay. Every design decision bends around this property.
What Makes This Different
Traditional web archiving tools (Heritrix, wget, Scrapy, Brozzler) treat crawling as an inherently non-deterministic process. Network jitter, DNS resolution timing, thread scheduling, and random retry backoff all introduce entropy. Two runs of the same crawl produce different results. This makes verification impossible, replay approximate, and auditing meaningless.
Palimpsest eliminates this. The system is governed by Six Laws — determinism, idempotence, content addressability, temporal integrity, replay fidelity, and observability as proof — that are enforced at every layer, from the frontier scheduler to the artifact serializer.
The result: a crawl kernel that auditors can trust, AI systems can consume, historians can depend on, and adversaries cannot easily corrupt.
The System at a Glance
| Metric | Value |
|---|---|
| Crates | 15 Rust workspace members |
| Tests | 301 (zero failures) |
| Determinism proof | 10,000 pages, zero divergence |
| Storage | Content-addressed (BLAKE3) with structural deduplication |
| Format | WARC++ (ISO 28500 extension) |
| Index | Temporal graph: URL x time x hash x context |
| Capture | Raw HTTP + headless Chrome (CDP) |
| Distribution | HTTP frontier server + N workers |
How to Read This Documentation
- Getting Started — Install, run your first crawl, configure the system.
- Architecture — System design, the Six Laws, crate dependency graph, data flow.
- Core Concepts — Deep dives into determinism, content addressability, the execution envelope, temporal indexing, and the WARC++ format.
- Crate Reference — Complete API documentation for all 15 crates.
- Operations — Docker deployment, distributed crawling, retrieval API, monitoring.
- Security — Trust boundaries, fetch safety, browser sandboxing.
- Testing — Testing philosophy, the simulation framework, adversarial universes.
- Contributing — Development setup, code standards, commit conventions.
- Appendix — Error taxonomy, API quick reference, glossary.
Installation
Prerequisites
| Dependency | Required | Notes |
|---|---|---|
| Rust 1.86+ | Yes | Stable toolchain via rustup |
| Git | Yes | Source checkout |
| C compiler + CMake | Yes | BoringSSL build (via wreq) |
| Go 1.19+ | Yes | BoringSSL build (via wreq) |
| Chrome or Chromium | Optional | Browser capture mode (--browser) |
| Docker | Optional | Containerized deployment |
Why the C/Go toolchain?
Palimpsest uses wreq with BoringSSL for TLS fingerprint impersonation. BoringSSL is compiled from source during cargo build, which requires CMake, a C compiler, and Go.
macOS
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
# Install build dependencies (Xcode command line tools + CMake + Go)
xcode-select --install
brew install cmake go
# Clone and build
git clone https://github.com/copyleftdev/palimpsest.git
cd palimpsest
cargo build --release
# Verify
./target/release/palimpsest --help
Chrome for browser capture:
# Chrome is usually at:
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --version
# Or install via Homebrew:
brew install --cask google-chrome
Linux (Ubuntu/Debian)
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
# Install build dependencies
sudo apt update
sudo apt install -y build-essential cmake golang-go pkg-config libclang-dev
# Clone and build
git clone https://github.com/copyleftdev/palimpsest.git
cd palimpsest
cargo build --release
# Verify
./target/release/palimpsest --help
Chrome for browser capture:
# Install Chrome
wget -q -O - https://dl.google.com/linux/linux_signing_key.pub | sudo gpg --dearmor -o /usr/share/keyrings/google-chrome.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/google-chrome.gpg] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee /etc/apt/sources.list.d/google-chrome.list
sudo apt update && sudo apt install -y google-chrome-stable
# Verify
google-chrome --version
Linux (Fedora/RHEL)
# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
# Install build dependencies
sudo dnf install -y gcc gcc-c++ cmake golang clang-devel pkg-config
# Clone and build
git clone https://github.com/copyleftdev/palimpsest.git
cd palimpsest
cargo build --release
Linux (Arch)
sudo pacman -S rust cmake go clang pkg-config
git clone https://github.com/copyleftdev/palimpsest.git
cd palimpsest
cargo build --release
Windows
Option A: Native (MSVC)
# 1. Install Rust from https://rustup.rs/ (choose MSVC toolchain)
# 2. Install Visual Studio Build Tools (C/C++ workload)
# https://visualstudio.microsoft.com/visual-cpp-build-tools/
# 3. Install CMake
# https://cmake.org/download/ (add to PATH during install)
# 4. Install Go
# https://go.dev/dl/ (add to PATH during install)
# 5. Clone and build
git clone https://github.com/copyleftdev/palimpsest.git
cd palimpsest
cargo build --release
# 6. Verify
.\target\release\palimpsest.exe --help
Option B: WSL2 (recommended)
Windows Subsystem for Linux gives you a full Linux environment. Follow the Linux (Ubuntu/Debian) instructions above inside WSL2:
# Install WSL2 with Ubuntu
wsl --install -d Ubuntu
# Then inside the WSL2 terminal, follow the Linux instructions
Option C: Docker (any platform)
If you don’t want to install build tools, Docker works on all platforms:
docker build -t palimpsest .
docker run palimpsest --help
docker run -v "$(pwd)/output:/data" palimpsest crawl https://example.com -d 2 -o /data
See Docker Deployment for the full compose setup.
Verifying the Build
After building, you should see all 10 subcommands:
$ palimpsest --help
Usage: palimpsest <COMMAND>
Commands:
crawl Start a crawl with seed URLs
replay Reconstruct a captured URL from artifacts
history Show capture history for a URL
extract Extract text and RAG chunks from captured content
shadow-compare Compare against legacy crawler WARC files
serve Start a distributed frontier server
worker Connect to a frontier server and crawl
api Start the retrieval API server
stats Print workspace statistics
migrate Run storage migrations
Running the Test Suite
# All tests (288, excludes long-running scale tests)
cargo test --workspace
# Simulation framework only
cargo test -p palimpsest-sim --test simulation_tests
# Scale tests (1K + 5K pages, ~90 seconds)
cargo test -p palimpsest-sim --test scale_test
# Stress test (10K pages)
cargo test -p palimpsest-sim --test stress_test
# Stealth regression tests (requires Chrome + network)
cargo test -p palimpsest-fetch --test stealth_test -- --ignored --nocapture --test-threads=1
Troubleshooting
BoringSSL build fails
The most common build issue. Check:
cmake --version # Need 3.x+
go version # Need 1.19+
clang --version # Or gcc — need a C compiler
On macOS, ensure Xcode command line tools are installed: xcode-select --install
On Windows, ensure Visual Studio Build Tools include the “Desktop development with C++” workload.
Chrome not found (browser capture)
Palimpsest looks for Chrome/Chromium in PATH. If installed in a non-standard location:
# macOS — add to PATH
export PATH="/Applications/Google Chrome.app/Contents/MacOS:$PATH"
# Windows — add to PATH
set PATH=%PATH%;C:\Program Files\Google\Chrome\Application
openssl-sys linker errors
Palimpsest uses BoringSSL (via wreq), not OpenSSL. If you see openssl-sys errors, another dependency may be pulling it in. Check with:
cargo tree -i openssl-sys
If present, the boring-sys2 crate’s prefix-symbols feature should prevent symbol conflicts on Linux. On macOS this is not typically an issue.
Your First Crawl
Basic Crawl
palimpsest crawl https://example.com -d 2 -m 50 -o ./output
| Flag | Meaning |
|---|---|
| -d 2 | Maximum depth from seed URL |
| -m 50 | Maximum 50 URLs to fetch |
| -o ./output | Persist artifacts to disk |
The default seed is 42. The default politeness delay is 1 second per host.
Output Structure
After the crawl completes, ./output contains:
output/
blobs/ # Content-addressed storage (BLAKE3 hashes)
af/
1349b9f5... # Blob file named by hash
c7/
d2fe1a6b...
index.sqlite # Temporal index database
output.warc # WARC++ file (ISO 28500 compatible)
frontier.json # Saved frontier state (for resumption)
Replay a Captured URL
Reconstruct the captured version of a page from stored artifacts:
palimpsest replay https://example.com/ --data-dir ./output
This retrieves the stored blob, HTTP headers, and execution context to reproduce the original response.
View Capture History
List all captures of a URL with timestamps and content hashes:
palimpsest history https://example.com/ --data-dir ./output
Extract Text and RAG Chunks
Extract clean text and provenance-tagged chunks from a captured page:
palimpsest extract https://example.com/ --data-dir ./output --json
This strips HTML, removes scripts and styles, splits into chunks (default 1000 chars with 200 overlap), and tags each chunk with source_url, captured_at, source_hash, chunk_hash, and char_offset.
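The windowing can be sketched in a few lines of std-only Rust (a simplified model with a hypothetical `chunk_text` helper — the real palimpsest-extract API also attaches the hash and provenance fields listed above):

```rust
// Overlapping chunking: `size`-char windows advancing by `size - overlap`,
// each chunk paired with its starting character offset (char_offset).
// Hypothetical helper for illustration, not the actual palimpsest-extract API.
fn chunk_text(text: &str, size: usize, overlap: usize) -> Vec<(usize, String)> {
    assert!(overlap < size);
    let chars: Vec<char> = text.chars().collect();
    let step = size - overlap; // 800 chars per step with the defaults
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        chunks.push((start, chars[start..end].iter().collect()));
        if end == chars.len() {
            break; // final chunk is the remainder
        }
        start += step;
    }
    chunks
}

fn main() {
    let text = "x".repeat(2200);
    let chunks = chunk_text(&text, 1000, 200);
    // Offsets advance by 1000 - 200 = 800: 0, 800, 1600
    let offsets: Vec<usize> = chunks.iter().map(|(o, _)| *o).collect();
    assert_eq!(offsets, vec![0, 800, 1600]);
    assert_eq!(chunks[2].1.len(), 600); // trailing remainder
}
```

With the default 1000/200 parameters, consecutive chunks share a 200-character overlap so no sentence is split without context on both sides.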
Browser Capture
Capture JavaScript-rendered pages with headless Chrome:
palimpsest crawl https://example.com --browser -d 1 -m 10 -o ./output
This captures:
- Rendered DOM after JavaScript execution
- All sub-resources (CSS, JS, images, fonts)
- Resource dependency graph with load ordering
Using a Deterministic Seed
The seed controls all randomness — frontier ordering, host rotation, and browser JS overrides:
# These two runs produce identical output
palimpsest crawl https://example.com -s 42 -d 2 -m 50 -o ./run-a
palimpsest crawl https://example.com -s 42 -d 2 -m 50 -o ./run-b
# Verify
diff <(find ./run-a/blobs -type f | sort) <(find ./run-b/blobs -type f | sort)
# No output = identical
Shadow Comparison
Compare output against a legacy crawler:
# Crawl with wget
wget --warc-file=legacy -r -l 1 https://example.com/
# Crawl with Palimpsest
palimpsest crawl https://example.com -d 1 -o ./palimpsest-out
# Compare
palimpsest shadow-compare --legacy ./ --palimpsest ./palimpsest-out
Configuration
TOML Config File
Pass a TOML configuration file instead of CLI flags:
palimpsest crawl -c crawl.toml
Example Configuration
seeds = ["https://example.com/", "https://docs.example.com/"]
[crawl]
seed = 42
max_depth = 3
max_urls = 500
concurrency = 10
user_agent = "PalimpsestBot/0.1"
browser_mode = false
scope = "same_domain"
output_dir = "./output"
[politeness]
min_host_delay_ms = 1000
max_concurrent_hosts = 100
Configuration Fields
Seeds
seeds = ["https://example.com/", "https://docs.example.com/"]
One or more seed URLs. The crawl starts from these and discovers links outward.
Crawl Seed
seed = 42
The deterministic seed value. Controls all randomness in the system: frontier ordering, host rotation, browser JS overrides. Same seed = identical crawl.
Scope
scope = "same_domain"
| Value | Behavior |
|---|---|
| same_domain | Follow links within the registrable domain (e.g., www.example.com and docs.example.com both match example.com) |
| same_host | Exact host match only |
| any | No scope restriction (use with caution) |
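A minimal sketch of the scope check, assuming a simplified registrable-domain rule (the last two host labels stand in for real matching, which needs the Public Suffix List; the enum and helpers are illustrative, not the actual API):

```rust
// Illustrative scope model mirroring the table above.
enum Scope {
    SameDomain,
    SameHost,
    Any,
}

// Naive registrable-domain approximation: keep the last two labels.
// Real matching must consult the Public Suffix List (e.g. for ".co.uk").
fn registrable(host: &str) -> String {
    let labels: Vec<&str> = host.split('.').collect();
    labels[labels.len().saturating_sub(2)..].join(".")
}

fn in_scope(scope: &Scope, seed_host: &str, candidate: &str) -> bool {
    match scope {
        Scope::Any => true,
        Scope::SameHost => seed_host == candidate,
        Scope::SameDomain => registrable(seed_host) == registrable(candidate),
    }
}

fn main() {
    // www.example.com and docs.example.com both reduce to example.com
    assert!(in_scope(&Scope::SameDomain, "www.example.com", "docs.example.com"));
    assert!(!in_scope(&Scope::SameHost, "www.example.com", "docs.example.com"));
    assert!(in_scope(&Scope::Any, "www.example.com", "other.org"));
}
```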
Politeness Policy
[politeness]
min_host_delay_ms = 1000 # Minimum delay between same-host requests
max_concurrent_hosts = 100 # Maximum hosts being fetched in parallel
Presets (when using the API directly):
| Preset | Host Delay | Concurrent Hosts |
|---|---|---|
| default_policy() | 1 second | 100 |
| aggressive() | 100ms | 500 |
| no_delay() | 0 | unlimited |
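The three presets can be modeled as plain constructors. The method names follow the table; the struct shape here is an illustrative sketch, not the actual palimpsest-frontier API:

```rust
use std::time::Duration;

// Sketch of a politeness policy matching the preset table above.
struct PolitenessPolicy {
    min_host_delay: Duration,
    max_concurrent_hosts: Option<usize>, // None = unlimited
}

impl PolitenessPolicy {
    fn default_policy() -> Self {
        Self { min_host_delay: Duration::from_millis(1000), max_concurrent_hosts: Some(100) }
    }
    fn aggressive() -> Self {
        Self { min_host_delay: Duration::from_millis(100), max_concurrent_hosts: Some(500) }
    }
    fn no_delay() -> Self {
        Self { min_host_delay: Duration::ZERO, max_concurrent_hosts: None }
    }
}

fn main() {
    assert_eq!(PolitenessPolicy::default_policy().min_host_delay, Duration::from_secs(1));
    assert_eq!(PolitenessPolicy::aggressive().max_concurrent_hosts, Some(500));
    assert!(PolitenessPolicy::no_delay().min_host_delay.is_zero());
}
```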
Depth and Limits
max_depth = 3 # Max link-following depth from seed (0 = seed page only)
max_urls = 500 # Hard cap on total URLs fetched
concurrency = 10 # Parallel fetch tasks
Browser Mode
browser_mode = true
Enables headless Chrome capture via CDP. Each page is loaded in a fresh browser context with determinism overrides applied (Date.now(), Math.random(), performance.now() are all seeded from CrawlSeed).
Output Directory
output_dir = "./output"
When set, artifacts are persisted to disk: content-addressed blobs, SQLite index, WARC++ file, and frontier state. When omitted, the crawl runs in-memory only.
CLI Flag Mapping
| Config Field | CLI Flag | Default |
|---|---|---|
| seeds | positional args | (required) |
| seed | -s, --seed | 42 |
| max_depth | -d, --depth | 2 |
| max_urls | -m, --max-urls | 100 |
| min_host_delay_ms | --politeness-ms | 1000 |
| user_agent | --user-agent | PalimpsestBot/0.1 |
| browser_mode | --browser | false |
| output_dir | -o, --output-dir | (none) |
| (config file) | -c, --config | (none) |
System Overview
Palimpsest is a crawl kernel, not a crawler. The distinction matters: a crawler is a tool that fetches web pages. A crawl kernel is the deterministic execution engine that schedules fetches, seals execution contexts, captures artifacts, stores content-addressed blobs, indexes temporal state, and enables bit-identical replay.
The CLI, server, and UI are thin wrappers. The kernel is the product.
Layer Model
The system is organized into five layers, each with strict responsibilities:
┌─────────────────────────────────────────────────┐
│ Interface Layer │
│ palimpsest-cli · palimpsest-server │
├─────────────────────────────────────────────────┤
│ Orchestration Layer │
│ palimpsest-crawl · palimpsest-sim │
├─────────────────────────────────────────────────┤
│ Capture Layer │
│ palimpsest-fetch · palimpsest-artifact │
│ palimpsest-extract · palimpsest-embed │
├─────────────────────────────────────────────────┤
│ Persistence Layer │
│ palimpsest-storage · palimpsest-index │
│ palimpsest-replay · palimpsest-shadow │
├─────────────────────────────────────────────────┤
│ Foundation Layer │
│ palimpsest-core · palimpsest-envelope │
│ palimpsest-frontier │
└─────────────────────────────────────────────────┘
Design Principles
Zero shared mutable state. The core kernel has no global state. All state flows through explicit parameters — seeds, envelopes, configs.
The ExecutionEnvelope is the critical abstraction. It seals the execution context (seed, timestamp, DNS snapshot, TLS fingerprint, browser config, headers) before any fetch occurs. Without the envelope, you cannot replay, verify, or prove anything.
Errors are artifacts. Every failure is classified into one of seven categories and stored as part of the crawl record. Errors are not noise — they are history.
Content is addressed, not located. Every blob is stored and retrieved by its BLAKE3 hash. Deduplication is structural, not post-process.
The Six Laws
Every design decision in Palimpsest bends around these six immutable laws. If a change violates any law, the change is wrong — not the law.
Law 1: Determinism
Frontier ordering is seed-driven. Retry logic is explicit. No hidden randomness anywhere.
Why it matters: Without determinism, you cannot verify a crawl, replay a crawl, or prove that two crawls are equivalent. Determinism is the foundation that makes every other law possible.
How it’s enforced:
- All randomness flows from CrawlSeed through ChaCha8Rng (seeded PRNG)
- No rand crate in any core path
- BTreeMap for all ordered collections (never HashMap)
- No Instant::now() or SystemTime::now() in core logic — time comes from the ExecutionEnvelope
- Atomics are allowed for metrics counters only, never for control flow
- Browser JS overrides: Date.now(), Math.random(), performance.now() are all seeded
What breaks if violated: Two runs with the same seed produce different output. Replay becomes approximate. Verification becomes impossible. The entire system reduces to a conventional crawler.
Law 2: Idempotence
Same URL + same execution context = identical artifact hash.
Why it matters: Idempotence enables deduplication, verification, and caching. If the same fetch produces different artifacts, you cannot distinguish content changes from system noise.
How it’s enforced:
- ContentHash::of(data) produces a deterministic BLAKE3 hash
- RecordId is generated from content_hash + record_type, not from random UUIDs
- The ExecutionEnvelope freezes all inputs before the fetch begins
- Response normalization is deterministic
What breaks if violated: Storage bloats with duplicate content under different hashes. Change detection produces false positives. Audit trails become unreliable.
Law 3: Content Addressability
All artifacts are BLAKE3 hash-addressed. Deduplication is structural.
Why it matters: Content addressing makes storage self-verifying. You can detect tampering by recomputing the hash. You get deduplication for free — identical content maps to the same hash, stored once.
How it’s enforced:
- Every WarcRecord carries a Palimpsest-Content-Hash header
- Every blob in storage is stored at a path derived from its BLAKE3 hash
- FileSystemBlobStore uses a git-style layout: {hash[0..2]}/{hash[2..]}
- Integrity is verified on every read
What breaks if violated: Tampering becomes undetectable. Deduplication fails. Storage grows linearly instead of sublinearly.
Law 4: Temporal Integrity
Every capture binds wall clock + logical clock + crawl context + dependency chain.
Why it matters: The web changes constantly. Without precise temporal binding, you cannot answer “what did this page look like at time T?” or “which crawl produced this artifact?”
How it’s enforced:
- CaptureInstant pairs wall clock (DateTime<Utc>) with logical clock (u64)
- Every IndexEntry records URL, captured_at, content_hash, and crawl_context
- CrawlContextId identifies the specific crawl session
- CaptureGroup binds all records from a single fetch with their shared timestamp
What breaks if violated: History queries return ambiguous results. You cannot distinguish “same content, different time” from “different content, same time.”
Law 5: Replay Fidelity
Stored artifacts must be sufficient to reconstruct the HTTP exchange, DOM state, and resource dependency graph.
Why it matters: Replay is the proof that the system works. If you cannot reconstruct the original response from stored artifacts, the archive is incomplete.
How it’s enforced:
- The ExecutionEnvelope stores the full context (seed, DNS, TLS, headers, browser config)
- WARC++ records include envelope, dom-snapshot, resource-graph, and timing records
- ReplayEngine reconstructs from envelope + stored artifacts
- Same envelope + same artifacts = bit-identical reconstruction
What breaks if violated: The archive becomes a collection of blobs without enough context to interpret them. Legal and forensic use cases fail.
Law 6: Observability as Proof
Every decision is queryable. Every failure is replayable. Every artifact is verifiable.
Why it matters: A crawl system that cannot explain its own behavior is a black box. Observability is not a feature — it is the proof that the other five laws hold.
How it’s enforced:
- Structured logging via tracing throughout the codebase
- Prometheus metrics (9 atomic counters) exposed at /metrics
- PalimpsestError classifies every failure into exactly one of seven categories
- Errors are stored as artifacts in the crawl record
- The temporal index makes every decision queryable
What breaks if violated: Debugging becomes guesswork. Compliance audits fail. Users cannot distinguish system bugs from legitimate content changes.
Crate Map
All 15 Crates
| Crate | Layer | Responsibility | Key Invariant |
|---|---|---|---|
| palimpsest-core | Foundation | Types, BLAKE3 hashing, seeded PRNG, error taxonomy | No IO. Pure types only. |
| palimpsest-envelope | Foundation | Sealed execution context | Immutable after construction |
| palimpsest-frontier | Foundation | Deterministic URL scheduler with politeness | Same seed = same traversal order |
| palimpsest-fetch | Capture | HTTP client + browser capture (CDP) + link extraction | Every fetch wraps an envelope |
| palimpsest-artifact | Capture | WARC++ serialization, capture groups | Content-addressed outputs |
| palimpsest-storage | Persistence | Content-addressable blobs (memory, fs, S3/GCS/Azure) | Dedup is structural |
| palimpsest-index | Persistence | Temporal graph: URL x time x hash x context | Queryable history |
| palimpsest-replay | Persistence | HTTP reconstruction, DOM rehydration | Bit-identical replay from artifacts |
| palimpsest-crawl | Orchestration | Main crawl loop and coordination | Integrates all layers |
| palimpsest-shadow | Persistence | Comparison engine vs legacy crawlers | Cross-format validation |
| palimpsest-extract | Capture | HTML-to-text + RAG chunking with provenance | Deterministic extraction |
| palimpsest-embed | Capture | Embedding generation, vector search, change detection | BLAKE3-based test embeddings |
| palimpsest-server | Interface | HTTP frontier server + retrieval API + metrics | Thread-safe state |
| palimpsest-sim | Orchestration | Deterministic simulation testing framework | Proves Laws 1-6 |
| palimpsest-cli | Interface | Command-line interface (10 subcommands) | Thin wrapper |
Dependency Graph
palimpsest-cli
├── palimpsest-core
├── palimpsest-crawl
│ ├── palimpsest-core
│ ├── palimpsest-envelope
│ ├── palimpsest-frontier
│ │ └── palimpsest-core
│ ├── palimpsest-fetch
│ │ └── palimpsest-core
│ ├── palimpsest-artifact
│ │ └── palimpsest-core
│ ├── palimpsest-storage
│ │ └── palimpsest-core
│ └── palimpsest-index
│ └── palimpsest-core
├── palimpsest-frontier
├── palimpsest-index
├── palimpsest-storage
├── palimpsest-replay
├── palimpsest-server
│ ├── palimpsest-frontier
│ ├── palimpsest-index
│ └── palimpsest-storage
├── palimpsest-shadow
├── palimpsest-artifact
├── palimpsest-envelope
├── palimpsest-extract
└── palimpsest-fetch
Key Pattern
Every crate depends on palimpsest-core for shared types (CrawlSeed, ContentHash, CaptureInstant, PalimpsestError). No crate performs IO unless its responsibility requires it. The foundation layer is pure computation.
Data Flow
This chapter traces a single URL through the entire Palimpsest system, from seed to replay.
1. Seed URL Enters the Frontier
let seed = CrawlSeed::new(42);
let mut frontier = Frontier::new(seed, PolitenessPolicy::default_policy());
frontier.push_seed(Url::parse("https://example.com/").unwrap());
The frontier deduplicates by URL string and buckets entries by host.
2. Frontier Dequeues (Deterministic Ordering)
let entry: FrontierEntry = frontier.pop(now).unwrap();
// entry.url = "https://example.com/"
// entry.depth = 0
// entry.priority = 0
The dequeue order is deterministic: hosts are rotated via a seeded Fisher-Yates shuffle, and within each host, entries are ordered by priority then depth.
3. ExecutionEnvelope Seals the Context
let envelope = EnvelopeBuilder::new()
    .seed(seed)
    .timestamp(CaptureInstant::new(wall_time, logical_clock))
    .target_url(entry.url.clone())
    .dns_snapshot(DnsSnapshot { host: "example.com".into(), addrs: vec!["93.184.216.34".into()], ttl: 300 })
    .build()?;
The envelope is immutable after construction. It captures everything needed to reproduce this fetch.
4. Fetch Executes
let fetcher = HttpFetcher::with_defaults()?;
let result: FetchResult = fetcher.fetch(&envelope).await?;
For browser mode, BrowserFetcher launches headless Chrome with determinism overrides and captures DOM + sub-resources via CDP.
5. Link Extraction
let links: Vec<Url> = extract_links(&html_body, &entry.url);
for link in links {
    frontier.push_discovered(link, entry.depth + 1, content_hash);
}
Links are extracted from HTML (after stripping <script> and <style> tags), normalized (fragments stripped, query params sorted, default ports removed), deduplicated, and sorted for determinism.
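The normalization rules can be sketched on plain strings (a simplified model with a hypothetical `normalize` helper and naive port handling; real code would operate on a parsed URL type):

```rust
// Normalization sketch: strip fragment, sort query parameters, drop default
// ports. Illustrative only — not the palimpsest-fetch implementation.
fn normalize(url: &str) -> String {
    // 1. Strip the fragment.
    let url = url.split('#').next().unwrap();
    // 2. Split off the query string, if any.
    let (base, query) = match url.split_once('?') {
        Some((b, q)) => (b.to_string(), Some(q)),
        None => (url.to_string(), None),
    };
    // 3. Drop default ports (naive string form, sufficient for the sketch).
    let base = base.replace(":443/", "/").replace(":80/", "/");
    // 4. Sort query parameters so equivalent URLs compare equal.
    match query {
        Some(q) => {
            let mut params: Vec<&str> = q.split('&').collect();
            params.sort_unstable();
            format!("{}?{}", base, params.join("&"))
        }
        None => base,
    }
}

fn main() {
    let n = normalize("https://example.com:443/a?b=2&a=1#top");
    assert_eq!(n, "https://example.com/a?a=1&b=2");
}
```

Sorting parameters (step 4) is what makes deduplication stable: `?a=1&b=2` and `?b=2&a=1` collapse to one frontier entry.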
6. Artifact Creation
let record = WarcRecord::new(
    RecordType::Response,
    "application/http;msgtype=response".into(),
    response_bytes,
);
assert!(record.verify_integrity()); // BLAKE3 hash matches payload
The CaptureGroup bundles the envelope record, request record, response record, and optional DOM/resource-graph/timing records.
7. Content-Addressed Storage
let hash: ContentHash = store.put(response_bytes).await?;
// hash = blake3(response_bytes)
// Stored at: blobs/af/1349b9f5f9a1a6a0404dea36dcc949...
If a blob with the same hash already exists, the write is a no-op (structural deduplication).
8. Temporal Index Insert
index.insert(IndexEntry::new(
    entry.url.clone(),
    envelope.timestamp(),
    hash,
    CrawlContextId(1),
))?;
The index records this capture in four dimensions: URL, time, content hash, and crawl context.
9. WARC++ Output
write_warc_file(&path, &capture_group.all_records()).await?;
The WARC++ file contains standard ISO 28500 records plus Palimpsest extensions (envelope, dom-snapshot, resource-graph, timing). Standard WARC readers can parse the basic records; Palimpsest readers get the full execution context.
10. Replay
let content = store.get(&hash).await?;
let entries = index.query(&IndexQuery::for_url(&url))?;
Replay retrieves the stored blob and execution envelope, then reconstructs the original HTTP exchange, DOM state, and resource dependency graph. Same envelope + same artifacts = bit-identical output.
Determinism
Determinism is Law 1 — the foundation on which every other property depends. This chapter explains the technical mechanisms that enforce it.
CrawlSeed
All randomness in Palimpsest flows from a single 64-bit seed:
pub struct CrawlSeed {
    pub value: u64,
}

impl CrawlSeed {
    pub fn new(value: u64) -> Self { Self { value } }

    pub fn rng(&self) -> ChaCha8Rng {
        ChaCha8Rng::seed_from_u64(self.value)
    }

    pub fn derive(&self, index: u64) -> Self {
        // BLAKE3 mixing: hash(seed_bytes || index_bytes)
        let mut hasher = blake3::Hasher::new();
        hasher.update(&self.value.to_le_bytes());
        hasher.update(&index.to_le_bytes());
        let hash = hasher.finalize();
        let bytes: [u8; 8] = hash.as_bytes()[..8].try_into().unwrap();
        Self { value: u64::from_le_bytes(bytes) }
    }
}
ChaCha8Rng is a seeded PRNG from the ChaCha stream-cipher family; it produces identical sequences for identical seeds on all platforms.
No rand Crate
The rand crate is forbidden in all core paths; Palimpsest uses rand_chacha and rand_core directly. Where rand does appear, the workspace Cargo.toml pins it with default-features = false, so no OS entropy source is available.
Ordered Collections
HashMap iteration order is non-deterministic: the standard library's default hasher is randomly seeded per process. Palimpsest uses BTreeMap everywhere that iteration order is observable:
// The frontier's host queues
struct Frontier {
    host_queues: BTreeMap<String, BTreeSet<FrontierEntry>>,
    seen: BTreeSet<String>,
    // ...
}
This ensures the same URLs produce the same host ordering on every run.
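The guarantee is easy to demonstrate with the standard library alone — whatever order hosts are discovered in, a BTreeMap iterates them in sorted key order:

```rust
use std::collections::BTreeMap;

// Return the order in which a BTreeMap-backed frontier would visit these
// hosts: always sorted key order, regardless of discovery order.
fn iteration_order(hosts: &[&str]) -> Vec<String> {
    let mut queues: BTreeMap<String, ()> = BTreeMap::new();
    for h in hosts {
        queues.insert(h.to_string(), ());
    }
    queues.keys().cloned().collect()
}

fn main() {
    // Two different discovery orders, one observable iteration order.
    let a = iteration_order(&["zeta.org", "alpha.com", "mid.net"]);
    let b = iteration_order(&["mid.net", "zeta.org", "alpha.com"]);
    assert_eq!(a, b);
    assert_eq!(a, vec!["alpha.com", "mid.net", "zeta.org"]);
}
```

A HashMap-backed version of the same function could return a different order on every run, which is exactly the entropy Law 1 forbids.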
Seeded Host Rotation
When the frontier rotates between hosts, it uses a seeded Fisher-Yates shuffle:
let mut hosts: Vec<&String> = self.host_queues.keys().collect();
let mut rng = self.seed.rng();
// Fisher-Yates shuffle with seeded RNG
for i in (1..hosts.len()).rev() {
    let j = rng.gen_range(0..=i);
    hosts.swap(i, j);
}
Same seed, same hosts = same rotation order.
Time is Explicit
No Instant::now() or SystemTime::now() appears in core logic. All time comes from one of two sources:
- Caller-provided — frontier.pop(now) takes a DateTime<Utc> parameter
- ExecutionEnvelope — envelope.timestamp() returns the sealed CaptureInstant
This means tests can inject fixed timestamps, and replays use the original timestamps exactly.
Browser Determinism
When using headless Chrome, Palimpsest injects JavaScript overrides before any page scripts execute:
// Seeded from CrawlSeed
Date.now = function() { return 1700000000000 + (__date_offset += 1); };
Math.random = function() { /* seeded xorshift */ };
performance.now = function() { return (__perf_offset += 0.1); };
This prevents JavaScript on the page from introducing non-determinism.
Verification
The determinism test pattern runs the same operation twice and asserts byte-identical output:
#[test]
fn frontier_ordering_is_deterministic() {
    let seed = CrawlSeed::new(42);
    let run_a = run_frontier(seed, &urls);
    let run_b = run_frontier(seed, &urls);
    assert_eq!(run_a, run_b);
}
The simulation framework (palimpsest-sim) proves this at scale: 10,000 pages across 5 adversarial universes, two full runs, zero divergence.
Content Addressability
Law 3 requires that all artifacts are BLAKE3 hash-addressed. This chapter explains the mechanics.
ContentHash
#[derive(Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub struct ContentHash([u8; 32]);

impl ContentHash {
    pub fn of(data: &[u8]) -> Self {
        Self(*blake3::hash(data).as_bytes())
    }
    pub fn as_bytes(&self) -> &[u8; 32] { &self.0 }
    pub fn as_hex(&self) -> String { hex::encode(self.0) }
    pub fn from_bytes(bytes: [u8; 32]) -> Self { Self(bytes) }
}
ContentHash is a Copy type — 32 bytes, passed by value. It implements Ord for use in BTreeMap keys.
Why BLAKE3
| Property | BLAKE3 | SHA-256 |
|---|---|---|
| Speed | ~1 GB/s (single core) | ~250 MB/s |
| Security | 256-bit, cryptographic | 256-bit, cryptographic |
| Parallelism | Tree-based, SIMD native | Sequential |
| Determinism | Platform-independent | Platform-independent |
BLAKE3 is faster than SHA-256 with equivalent security properties. For a system that hashes every blob, every record, and every envelope, throughput matters.
Storage Layout
FileSystemBlobStore uses a git-style two-level directory structure:
blobs/
af/
1349b9f5f9a1a6a0404dea36dcc949... # Full hash as filename
c7/
d2fe1a6b...
The first two hex characters of the hash form the directory name. This prevents any single directory from accumulating too many entries.
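Deriving the blob path from a hex hash is a one-liner; the `blob_path` helper below is hypothetical, shown only to make the layout concrete:

```rust
use std::path::PathBuf;

// Git-style sharding: the first two hex characters become the directory,
// the remainder the filename. Illustrative helper, not the FileSystemBlobStore API.
fn blob_path(root: &str, hex: &str) -> PathBuf {
    PathBuf::from(root).join(&hex[..2]).join(&hex[2..])
}

fn main() {
    let p = blob_path("blobs", "af1349b9f5f9a1a6a0404dea36dcc949");
    assert_eq!(p, PathBuf::from("blobs/af/1349b9f5f9a1a6a0404dea36dcc949"));
}
```

With 256 possible two-character prefixes, blobs spread evenly across directories, keeping each one small even for multi-million-page crawls.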
Structural Deduplication
When BlobStore::put() is called, it computes the BLAKE3 hash and checks if that blob already exists:
async fn put(&self, data: Bytes) -> Result<ContentHash, StorageError> {
    let hash = ContentHash::of(&data);
    if self.exists(&hash).await? {
        return Ok(hash); // Already stored — no-op
    }
    // Atomic write: temp file + rename
    self.write_blob(&hash, &data).await?;
    Ok(hash)
}
Identical content maps to the same hash and is stored exactly once.
Integrity Verification
Every read verifies the hash of the retrieved data:
async fn get(&self, hash: &ContentHash) -> Result<Bytes, StorageError> {
    let data = self.read_blob(hash).await?;
    let actual = ContentHash::of(&data);
    if actual != *hash {
        return Err(StorageError::IntegrityError {
            expected: *hash,
            actual,
        });
    }
    Ok(data)
}
Tampering is detectable by any reader at any time.
WARC Record Hashing
Every WarcRecord carries its content hash in the Palimpsest-Content-Hash header:
WARC/1.1
WARC-Type: response
Palimpsest-Content-Hash: blake3:af1349b9f5f9a1a6a0404dea36dcc949...
Content-Length: 4096
[payload bytes]
RecordId is also derived from the content hash, not from random UUIDs:
pub fn from_content(content_hash: &ContentHash, record_type: &RecordType) -> Self {
    // Deterministic UUID v5 from hash + type
}
Execution Envelope
The ExecutionEnvelope is Palimpsest’s critical abstraction. It seals every input that affects a fetch — seed, timestamp, target URL, DNS state, TLS fingerprint, browser config, and custom headers — into an immutable record constructed before the fetch begins.
Without the envelope, you cannot replay a fetch, verify its output, or prove it was executed correctly.
Construction
Envelopes are built via the fluent EnvelopeBuilder:
let envelope = EnvelopeBuilder::new()
    .seed(CrawlSeed::new(42))
    .timestamp(CaptureInstant::new(wall_time, logical_clock))
    .target_url(Url::parse("https://example.com/").unwrap())
    .dns_snapshot(DnsSnapshot {
        host: "example.com".into(),
        addrs: vec!["93.184.216.34".into()],
        ttl: 300,
    })
    .tls_fingerprint(TlsFingerprint {
        protocol: "TLSv1.3".into(),
        cipher: "TLS_AES_256_GCM_SHA384".into(),
        cert_chain_hash: "blake3:...".into(),
    })
    .header("User-Agent".into(), "PalimpsestBot/0.1".into())
    .build()?;
Required Fields
| Field | Type | Purpose |
|---|---|---|
| seed | CrawlSeed | Deterministic randomness source |
| timestamp | CaptureInstant | Wall clock + logical clock |
| target_url | Url | The URL being fetched |
| dns_snapshot | DnsSnapshot | Recorded DNS resolution state |
Calling .build() with any required field missing returns an EnvelopeError.
Optional Fields
| Field | Type | Purpose |
|---|---|---|
| tls_fingerprint | TlsFingerprint | TLS protocol, cipher, cert chain hash |
| browser_config | BrowserConfig | Viewport, user agent, JS enabled |
| request_headers | Vec<(String, String)> | Custom HTTP headers |
Immutability
Once build() succeeds, the ExecutionEnvelope is frozen. There are no setter methods — only getters:
#![allow(unused)]
fn main() {
envelope.seed() // CrawlSeed
envelope.timestamp() // CaptureInstant
envelope.target_url() // &Url
envelope.request_headers() // &[(String, String)]
envelope.dns_snapshot() // &DnsSnapshot
envelope.tls_fingerprint() // Option<&TlsFingerprint>
envelope.browser_config() // Option<&BrowserConfig>
envelope.content_hash() // ContentHash (computed from canonical JSON)
}
Content Hash
The envelope’s content_hash() is computed from its canonical JSON serialization. This means two envelopes with identical fields produce the same hash, and any field change produces a different hash.
WARC++ Envelope Record
The envelope is serialized as the first record in every WARC++ capture group:
WARC/1.1
WARC-Type: envelope
Palimpsest-Envelope-Version: 1
Content-Type: application/json
{
"seed": 42,
"timestamp": {"wall": "2026-04-12T10:30:00Z", "logical": 1234},
"target_url": "https://example.com/",
"dns_snapshot": {"host": "example.com", "addrs": ["93.184.216.34"], "ttl": 300},
"tls_fingerprint": {"protocol": "TLSv1.3", "cipher": "...", "cert_chain_hash": "..."},
"browser_config": null
}
Temporal Index
The temporal index is a graph, not a flat lookup table. It records every capture across four dimensions: URL, time, content hash, and crawl context. This enables queries like “show me every version of this page” or “what changed between these two crawls.”
CaptureInstant
Every capture is timestamped with a paired clock:
#![allow(unused)]
fn main() {
pub struct CaptureInstant {
pub wall: DateTime<Utc>, // Real-world time
pub logical: u64, // Monotonic counter within a crawl
}
}
Wall time records when the capture happened. Logical time records the ordering within a single crawl session, immune to clock drift.
IndexEntry
#![allow(unused)]
fn main() {
pub struct IndexEntry {
pub url: Url,
pub captured_at: CaptureInstant,
pub content_hash: ContentHash,
pub crawl_context: CrawlContextId,
}
}
Each entry represents one capture of one URL at one point in time, with a pointer (content hash) to the stored artifact.
Backends
InMemoryIndex
Uses BTreeMap for deterministic ordering. Suitable for testing and short-lived crawls.
#![allow(unused)]
fn main() {
let mut index = InMemoryIndex::new();
index.insert(entry);
let results = index.query(&IndexQuery::for_url(&url));
}
SqliteIndex
Persistent SQL-backed index with WAL mode for concurrent reads:
#![allow(unused)]
fn main() {
let mut index = SqliteIndex::open(Path::new("./output/index.sqlite"))?;
index.insert(entry)?;
let results = index.query(&IndexQuery::for_url(&url))?;
}
The schema includes a UNIQUE constraint on (url, wall_time, content_hash) to prevent duplicate entries.
Queries
IndexQuery supports multi-dimensional filtering:
- By URL — all captures of a specific URL
- By time range — all captures within a window
- By content hash — find which URLs produced a specific blob
- By crawl context — all captures from a specific crawl session
Results are ordered by captured_at (ascending), then URL string.
Use Cases
History — “Show me every version of https://example.com/ across all crawls”:
#![allow(unused)]
fn main() {
let history = index.query(&IndexQuery::for_url(&url))?;
for entry in &history {
println!("{} -> {}", entry.captured_at.wall, entry.content_hash.as_hex());
}
}
Change Detection — Compare content hashes across captures to identify when a page changed.
Provenance — Every RAG chunk and embedding links back to an IndexEntry via source_url, captured_at, and source_hash.
WARC++ Format
WARC++ extends ISO 28500 (the standard WARC format) with structured metadata for execution context, DOM snapshots, resource graphs, and timing breakdowns. Standard WARC readers can parse the basic records. Palimpsest-aware readers get the full execution context.
Standard Record Types
These follow ISO 28500 exactly:
| Type | Purpose |
|---|---|
| warcinfo | Crawl-level metadata |
| request | HTTP request |
| response | HTTP response |
| resource | Standalone resource |
| metadata | Additional metadata |
Extension Record Types
| Type | Purpose |
|---|---|
| envelope | Full ExecutionEnvelope (seed, timestamp, DNS, TLS, browser config) |
| dom-snapshot | Rendered DOM state after JavaScript execution |
| resource-graph | Dependency graph of all resources loaded for a page |
| timing | Detailed timing breakdown (DNS, connect, TLS, TTFB, transfer, render) |
Content Hash Header
Every record includes a Palimpsest-Content-Hash header:
Palimpsest-Content-Hash: blake3:af1349b9f5f9a1a6a0404dea36dcc949...
This enables content-addressable retrieval and integrity verification without reading the full record.
Envelope Record Example
WARC/1.1
WARC-Type: envelope
WARC-Record-ID: <urn:uuid:a1b2c3d4-...>
Content-Type: application/json
Palimpsest-Envelope-Version: 1
Palimpsest-Content-Hash: blake3:c7d2fe...
{
"seed": 42,
"timestamp": {"wall": "2026-04-12T10:30:00.123456789Z", "logical": 1234},
"request_headers": [["User-Agent", "PalimpsestBot/0.1"]],
"dns_snapshot": {"host": "example.com", "addrs": ["93.184.216.34"], "ttl": 300},
"tls_fingerprint": {"protocol": "TLSv1.3", "cipher": "TLS_AES_256_GCM_SHA384", "cert_chain_hash": "blake3:..."},
"browser_config": null
}
Resource Graph Record Example
{
"root": "https://example.com/",
"resources": [
{"url": "https://example.com/style.css", "type": "stylesheet", "hash": "blake3:...", "initiated_by": "https://example.com/"},
{"url": "https://example.com/app.js", "type": "script", "hash": "blake3:...", "initiated_by": "https://example.com/"}
],
"load_order": [0, 1]
}
Serialization Rules
| Rule | Value |
|---|---|
| Text encoding | UTF-8 |
| JSON format | Compact (no pretty-print) |
| Timestamps | RFC 3339 with nanosecond precision |
| Record separator | CRLFCRLF (per ISO 28500) |
| Max payload | 4 GiB (WARC spec limit) |
Backward Compatibility
Standard WARC tools (warc-tools, warcio, pywb) can read the request, response, warcinfo, resource, and metadata records without modification. They skip extension records (envelope, dom-snapshot, resource-graph, timing) per the WARC spec’s extension handling rules. The Palimpsest-* headers are ignored by non-Palimpsest readers.
palimpsest-core
Shared types, BLAKE3 hashing, seeded PRNG, and error taxonomy. This crate performs no IO — it is pure types and computation.
CrawlSeed
#![allow(unused)]
fn main() {
pub struct CrawlSeed { pub value: u64 }
impl CrawlSeed {
pub fn new(value: u64) -> Self;
pub fn rng(&self) -> ChaCha8Rng; // Deterministic PRNG
pub fn derive(&self, index: u64) -> Self; // Child seed via BLAKE3 mixing
}
}
ContentHash
#![allow(unused)]
fn main() {
pub struct ContentHash([u8; 32]); // Copy, Eq, Ord, Hash
impl ContentHash {
pub fn of(data: &[u8]) -> Self; // BLAKE3 hash
pub fn as_bytes(&self) -> &[u8; 32];
pub fn as_hex(&self) -> String;
pub fn from_bytes(bytes: [u8; 32]) -> Self;
}
}
CaptureInstant
#![allow(unused)]
fn main() {
pub struct CaptureInstant {
pub wall: DateTime<Utc>, // Wall clock
pub logical: u64, // Monotonic counter
}
impl CaptureInstant {
pub fn new(wall: DateTime<Utc>, logical: u64) -> Self;
}
}
Implements Copy, Ord, Serialize, Deserialize.
CrawlContextId
#![allow(unused)]
fn main() {
pub struct CrawlContextId(pub u64);
}
Opaque identifier for a crawl session. Implements Copy.
CrawlTarget
#![allow(unused)]
fn main() {
pub struct CrawlTarget {
pub url: Url,
pub depth: u32,
pub parent: Option<ContentHash>,
}
}
PalimpsestError
#![allow(unused)]
fn main() {
#[non_exhaustive]
pub enum PalimpsestError {
Network(String),
Protocol(String),
Rendering(String),
Policy(String),
DeterminismViolation { context: String, expected: String, actual: String },
Storage(String),
Replay(String),
}
}
Every failure in the system is classified into exactly one of these seven variants. See Error Taxonomy for details.
Key Invariant
This crate contains no IO, no async, no network calls. It is the foundation that every other crate depends on.
palimpsest-envelope
Sealed execution context — immutable after construction. The envelope captures every input that affects a fetch, enabling deterministic replay and verification.
ExecutionEnvelope
#![allow(unused)]
fn main() {
impl ExecutionEnvelope {
pub fn seed(&self) -> CrawlSeed;
pub fn timestamp(&self) -> CaptureInstant;
pub fn target_url(&self) -> &Url;
pub fn request_headers(&self) -> &[(String, String)];
pub fn dns_snapshot(&self) -> &DnsSnapshot;
pub fn tls_fingerprint(&self) -> Option<&TlsFingerprint>;
pub fn browser_config(&self) -> Option<&BrowserConfig>;
pub fn content_hash(&self) -> ContentHash;
}
}
No setter methods. Immutable after build().
EnvelopeBuilder
#![allow(unused)]
fn main() {
impl EnvelopeBuilder {
pub fn new() -> Self;
pub fn seed(self, seed: CrawlSeed) -> Self;
pub fn timestamp(self, ts: CaptureInstant) -> Self;
pub fn target_url(self, url: Url) -> Self;
pub fn header(self, name: String, value: String) -> Self;
pub fn headers(self, headers: Vec<(String, String)>) -> Self;
pub fn dns_snapshot(self, dns: DnsSnapshot) -> Self;
pub fn tls_fingerprint(self, tls: TlsFingerprint) -> Self;
pub fn browser_config(self, config: BrowserConfig) -> Self;
pub fn build(self) -> Result<ExecutionEnvelope, EnvelopeError>;
}
}
EnvelopeError
#![allow(unused)]
fn main() {
pub enum EnvelopeError {
MissingSeed,
MissingTimestamp,
MissingTargetUrl,
MissingDnsSnapshot,
}
}
Supporting Types
#![allow(unused)]
fn main() {
pub struct DnsSnapshot { pub host: String, pub addrs: Vec<String>, pub ttl: u32 }
pub struct TlsFingerprint { pub protocol: String, pub cipher: String, pub cert_chain_hash: String }
pub struct BrowserConfig { pub viewport_width: u32, pub viewport_height: u32, pub user_agent: String, pub js_enabled: bool }
}
Related Crates
- palimpsest-core — provides CrawlSeed, CaptureInstant, ContentHash
- palimpsest-fetch — consumes envelopes for fetch execution
- palimpsest-artifact — serializes envelopes as WARC++ records
palimpsest-frontier
Deterministic seed-driven URL scheduler with politeness enforcement. Same seed + same URLs = identical dequeue order.
Frontier
#![allow(unused)]
fn main() {
impl Frontier {
pub fn new(seed: CrawlSeed, policy: PolitenessPolicy) -> Self;
pub fn push(&mut self, entry: FrontierEntry) -> bool;
pub fn push_seed(&mut self, url: Url);
pub fn push_discovered(&mut self, url: Url, depth: u32, parent: ContentHash) -> bool;
pub fn pop(&mut self, now: DateTime<Utc>) -> Option<FrontierEntry>;
pub fn len(&self) -> usize;
pub fn is_empty(&self) -> bool;
pub fn host_count(&self) -> usize;
pub fn seen_count(&self) -> usize;
pub fn save(&self, path: &Path) -> Result<(), FrontierPersistError>;
pub fn load(&mut self, path: &Path) -> Result<usize, FrontierPersistError>;
pub fn load_if_exists(seed: CrawlSeed, policy: PolitenessPolicy, path: &Path) -> Result<Self, FrontierPersistError>;
pub fn seed(&self) -> CrawlSeed;
}
}
Internally uses BTreeMap<String, BTreeSet<FrontierEntry>> for host queues. URL deduplication via BTreeSet<String>.
FrontierEntry
#![allow(unused)]
fn main() {
pub struct FrontierEntry {
pub url: Url,
pub depth: u32,
pub priority: u32, // Lower = dequeued first
pub parent: Option<ContentHash>,
}
}
Implements Ord: sorted by (priority, depth, url string).
PolitenessPolicy
#![allow(unused)]
fn main() {
pub struct PolitenessPolicy {
pub min_host_delay: Duration,
pub max_concurrent_hosts: usize,
}
impl PolitenessPolicy {
pub fn default_policy() -> Self; // 1s delay, 100 hosts
pub fn aggressive() -> Self; // 100ms delay, 500 hosts
pub fn no_delay() -> Self; // Zero delay, unlimited (testing only)
}
}
Persistence
save() serializes the frontier state to JSON. load() restores it. This enables crawl resumption — stop a crawl, restart later, continue from exactly where you left off.
Key Invariant
Same seed + same URLs pushed in same order = identical pop() sequence. Verified at 10,000 pages with zero divergence.
palimpsest-fetch
HTTP client + browser capture (CDP) + link extraction + robots.txt parsing + TLS/HTTP2 fingerprint impersonation + CDP stealth mode. Every fetch wraps an ExecutionEnvelope.
HttpFetcher
#![allow(unused)]
fn main() {
impl HttpFetcher {
pub fn new(config: FetchConfig) -> Result<Self, PalimpsestError>;
pub fn with_defaults() -> Result<Self, PalimpsestError>;
pub async fn fetch(&self, envelope: &ExecutionEnvelope) -> Result<FetchResult, PalimpsestError>;
}
}
Uses wreq (BoringSSL backend) instead of reqwest. When an emulation profile is set, the TLS ClientHello and HTTP/2 SETTINGS frame match the selected browser.
FetchConfig
#![allow(unused)]
fn main() {
pub struct FetchConfig {
pub connect_timeout: Duration, // Default: 30s
pub total_timeout: Duration, // Default: 120s
pub max_body_size: u64, // Default: 256 MiB
pub max_redirects: usize, // Default: 10
pub emulation: Option<wreq_util::Emulation>, // Default: None
}
}
When emulation is set (e.g., Emulation::Chrome133), wreq impersonates the selected browser’s TLS fingerprint (JA3/JA4 including post-quantum key shares) and HTTP/2 settings (SETTINGS frame values/order, WINDOW_UPDATE, pseudo-header ordering). 70+ browser profiles available: Chrome 100-137, Firefox 109-139, Safari 15-18.5, Edge, Opera.
BrowserFetcher
#![allow(unused)]
fn main() {
impl BrowserFetcher {
pub fn new(config: BrowserFetchConfig) -> Self;
pub async fn fetch(&self, url: &Url, envelope: &ExecutionEnvelope, seed: CrawlSeed)
-> Result<BrowserFetchResult, PalimpsestError>;
}
}
Launches headless Chrome via CDP. Injects determinism overrides and (optionally) 17 stealth evasion patches. Captures DOM snapshot, sub-resources via Network events, and resource dependency graph.
BrowserFetchConfig
#![allow(unused)]
fn main() {
pub struct BrowserFetchConfig {
pub page_timeout: Duration, // Default: 30s
pub viewport_width: u32, // Default: 1920
pub viewport_height: u32, // Default: 1080
pub js_enabled: bool, // Default: true
pub user_agent: String, // Default: "PalimpsestBot/0.1"
pub stealth: bool, // Default: false
pub webdriver_value: WebdriverValue, // Default: False
}
}
WebdriverValue
#![allow(unused)]
fn main() {
pub enum WebdriverValue {
False, // Matches real non-automated Chrome (default)
Undefined, // Property appears deleted
}
}
Explicit, auditable config choice for navigator.webdriver. Default False passes Rebrowser Bot Detector (10/10).
CDP Stealth Mode
When stealth: true, the browser fetcher applies:
Chrome launch hardening:
- --disable-blink-features=AutomationControlled
- --disable-component-extensions-with-background-pages
17 stealth evasion patches (injected via addScriptToEvaluateOnNewDocument):
| # | Patch | What It Does |
|---|---|---|
| 1 | navigator.webdriver | Set to false (configurable via WebdriverValue) |
| 2 | window.chrome | Full object mock (app, csi, loadTimes, runtime) |
| 3 | navigator.plugins | Chrome PDF Plugin, Chrome PDF Viewer, Native Client |
| 4 | navigator.mimeTypes | application/pdf, application/x-nacl |
| 5 | navigator.permissions | Fix Notification state inconsistency |
| 6 | navigator.languages | ["en-US", "en"] |
| 7 | navigator.hardwareConcurrency | 8 |
| 8 | navigator.deviceMemory | 8 |
| 9 | WebGL vendor/renderer | Intel UHD Graphics 630 |
| 10 | Canvas fingerprint | Seeded sub-pixel noise (CrawlSeed) |
| 11 | Window dimensions | outerWidth/outerHeight match viewport + chrome UI |
| 12 | Screen dimensions | width/height/availWidth/availHeight/colorDepth |
| 13 | AudioContext | Seeded oscillator noise (CrawlSeed) |
| 14 | ClientRect | Seeded sub-pixel noise (CrawlSeed) |
| 15 | sourceURL markers | Strip pptr/playwright stack traces |
| 16 | navigator.userAgent | Consistent with HTTP header |
| 17 | navigator.maxTouchPoints | 0 |
All noise patches use deterministic xorshift PRNGs seeded from CrawlSeed (Law 1).
Browser Emulation Profiles
#![allow(unused)]
fn main() {
pub struct BrowserProfile { /* unified TLS + HTTP/2 + headers + JS identity */ }
pub enum ProfileMode {
None, // No impersonation (default)
Fixed(BrowserProfile), // Same profile for all requests
Seeded, // Generate from CrawlSeed
RotatePerDomain, // Per-domain via BLAKE3(seed + domain)
}
}
Pre-built profiles: BrowserProfile::chrome_windows(), firefox_linux(), safari_macos().
See profile module for details.
BrowserFetchResult
#![allow(unused)]
fn main() {
pub struct BrowserFetchResult {
pub fetch_result: FetchResult,
pub dom_snapshot: Option<DomSnapshot>,
pub resource_graph: Option<ResourceGraph>,
pub sub_resources: Vec<WarcRecord>,
}
}
Link Extraction
#![allow(unused)]
fn main() {
pub fn extract_links(html: &str, base_url: &Url) -> Vec<Url>;
pub fn normalize_url(url: &Url) -> Option<Url>;
pub fn normalize_url_for_comparison(url: &Url) -> String;
}
extract_links strips <script> and <style> content before scanning for href and src attributes. Output is deduplicated and sorted for determinism.
Robots.txt
#![allow(unused)]
fn main() {
pub struct RobotsRules { pub crawl_delay: Option<Duration> }
impl RobotsRules {
pub fn parse(body: &str) -> Self; // RFC 9309 compliant
}
}
Per-origin caching in BTreeMap (deterministic).
Stealth Regression Tests
5 integration tests against live public detection sites:
| Site | Score | Key Checks |
|---|---|---|
| Rebrowser Bot Detector | 10/10 | CDP leak, webdriver, viewport, user-agent |
| Sannysoft | 55/56 | webdriver, chrome, plugins, WebGL, canvas, permissions |
| FingerprintJS BotD | Clean | 18 detectors, no bot verdict |
| CreepJS | Clean | Headless rating, stealth rating, lie detection |
| Infosimples | Skipped | Site timeout |
Run with: cargo test -p palimpsest-fetch --test stealth_test -- --ignored --nocapture --test-threads=1
Key Invariant
Every fetch receives an ExecutionEnvelope. The envelope seals the context before the network request begins, enabling replay and verification. Emulation profile and stealth config are deterministic inputs (Law 1).
palimpsest-artifact
WARC++ serialization: records, capture groups, reader/writer. Content-addressed outputs compatible with ISO 28500.
RecordType
#![allow(unused)]
fn main() {
#[non_exhaustive]
pub enum RecordType {
// Standard (ISO 28500)
Warcinfo, Request, Response, Resource, Metadata,
// Palimpsest extensions
Envelope, DomSnapshot, ResourceGraph, Timing,
}
impl RecordType {
pub fn is_standard(&self) -> bool;
}
}
WarcRecord
#![allow(unused)]
fn main() {
pub struct WarcRecord {
pub record_type: RecordType,
pub record_id: RecordId,
pub content_hash: ContentHash,
pub content_type: String,
pub content_length: u64,
pub target_uri: Option<String>,
pub payload: Bytes,
}
impl WarcRecord {
pub fn new(record_type: RecordType, content_type: String, payload: Bytes) -> Self;
pub fn verify_integrity(&self) -> bool;
}
}
RecordId
#![allow(unused)]
fn main() {
pub struct RecordId(String);
impl RecordId {
pub fn from_content(content_hash: &ContentHash, record_type: &RecordType) -> Self;
pub fn as_str(&self) -> &str;
}
}
Deterministic — derived from content hash + record type, not random UUID.
CaptureGroup
#![allow(unused)]
fn main() {
pub struct CaptureGroup {
pub group_hash: ContentHash,
pub url: Url,
pub captured_at: CaptureInstant,
pub crawl_context: CrawlContextId,
pub envelope: WarcRecord,
pub request: WarcRecord,
pub response: WarcRecord,
pub dom_snapshot: Option<DomSnapshot>,
pub resource_graph: Option<ResourceGraph>,
pub timing: Option<TimingBreakdown>,
}
}
Built via CaptureGroupBuilder (fluent builder with required fields validation).
WARC Writer/Reader
#![allow(unused)]
fn main() {
pub async fn write_warc_file(path: &Path, records: &[WarcRecord]) -> Result<(), WarcWriteError>;
pub fn parse_warc_records(data: &[u8]) -> Result<Vec<WarcRecord>, WarcWriteError>;
}
Key Invariant
All record IDs and content hashes are deterministic. Same content = same hash = same record ID.
palimpsest-storage
Content-addressable blob storage with three backends: in-memory, filesystem, and object store (S3/GCS/Azure). Deduplication is structural — same content is stored once.
BlobStore Trait
#![allow(unused)]
fn main() {
pub trait BlobStore: Send + Sync {
async fn put(&self, data: Bytes) -> Result<ContentHash, StorageError>;
async fn get(&self, hash: &ContentHash) -> Result<Bytes, StorageError>;
async fn exists(&self, hash: &ContentHash) -> Result<bool, StorageError>;
async fn delete(&self, hash: &ContentHash) -> Result<(), StorageError>;
async fn metadata(&self, hash: &ContentHash) -> Result<BlobMetadata, StorageError>;
}
}
BlobMetadata
#![allow(unused)]
fn main() {
pub struct BlobMetadata { pub size: u64, pub stored_at: DateTime<Utc> }
}
StorageError
#![allow(unused)]
fn main() {
pub enum StorageError {
Backend(String),
NotFound(ContentHash),
IntegrityError { expected: ContentHash, actual: ContentHash },
}
}
InMemoryBlobStore
#![allow(unused)]
fn main() {
impl InMemoryBlobStore {
pub fn new() -> Self;
pub fn len(&self) -> usize;
pub fn is_empty(&self) -> bool;
pub fn total_bytes(&self) -> u64;
}
}
Uses BTreeMap for deterministic ordering.
FileSystemBlobStore
#![allow(unused)]
fn main() {
impl FileSystemBlobStore {
pub async fn new(root: impl Into<PathBuf>) -> Result<Self, StorageError>;
pub fn root(&self) -> &Path;
}
}
Git-style layout: {root}/{hash[0..2]}/{hash[2..]}. Atomic writes via temp file + rename. Integrity verification on every read.
ObjectStoreBlobStore
S3, GCS, and Azure support via the object_store crate. Same BlobStore trait interface.
Key Invariant
Every put computes ContentHash::of(data). Every get verifies the hash of retrieved data. Tampering is always detectable.
palimpsest-index
Temporal graph index: URL x time x hash x context. Two backends — in-memory (BTreeMap) and SQLite (WAL mode).
IndexEntry
#![allow(unused)]
fn main() {
pub struct IndexEntry {
pub url: Url,
pub captured_at: CaptureInstant,
pub content_hash: ContentHash,
pub crawl_context: CrawlContextId,
}
impl IndexEntry {
pub fn new(url: Url, captured_at: CaptureInstant, content_hash: ContentHash, crawl_context: CrawlContextId) -> Self;
}
}
Implements Ord: ordered by captured_at, then URL string.
InMemoryIndex
#![allow(unused)]
fn main() {
impl InMemoryIndex {
pub fn new() -> Self;
pub fn insert(&mut self, entry: IndexEntry);
pub fn query(&self, query: &IndexQuery) -> Vec<IndexEntry>;
pub fn history(&self, url: &Url) -> Vec<IndexEntry>;
}
}
SqliteIndex
#![allow(unused)]
fn main() {
impl SqliteIndex {
pub fn open(path: &Path) -> Result<Self, IndexError>;
pub fn insert(&mut self, entry: IndexEntry) -> Result<(), IndexError>;
pub fn query(&self, query: &IndexQuery) -> Result<Vec<IndexEntry>, IndexError>;
pub fn history(&self, url: &Url) -> Result<Vec<IndexEntry>, IndexError>;
}
}
Uses WAL mode for concurrent reads. Parameterized queries. UNIQUE constraint on (url, wall_time, content_hash).
IndexQuery
Multi-dimensional filtering: by URL, time range, content hash, or crawl context. Results ordered by captured_at ascending.
Key Invariant
The index is a graph, not a lookup table. It captures the temporal dimension of the web — when content appeared, changed, and disappeared.
palimpsest-replay
Deterministic reconstruction from stored artifacts. Same envelope + same storage = bit-identical output.
Concept
The replay engine retrieves the ExecutionEnvelope and stored blobs for a given URL and timestamp, then reconstructs:
- HTTP exchange — request and response headers + body
- DOM state — rendered DOM from the dom-snapshot record
- Resource graph — sub-resource dependency tree with load ordering
Usage
#![allow(unused)]
fn main() {
let entries = index.history(&url);
let latest = entries.last().unwrap();
let blob = store.get(&latest.content_hash).await?;
}
For full reconstruction including DOM and sub-resources, the replay engine reads the complete CaptureGroup from the WARC++ file and rehydrates each record.
Law 5 Guarantee
Replay fidelity is the proof that the archive works. If the same envelope and the same artifacts produce different output on two runs, Law 5 is violated.
The simulation framework verifies this: verify_determinism crawls twice with the same seed and asserts byte-identical blob hashes, index entries, and page counts.
Related Crates
- palimpsest-storage — provides blob retrieval
- palimpsest-index — provides temporal lookups
- palimpsest-artifact — provides WARC++ record parsing
- palimpsest-envelope — provides execution context
palimpsest-crawl
The orchestrator — the main crawl loop that integrates all layers: frontier scheduling, envelope construction, HTTP/browser fetching, link extraction, artifact creation, blob storage, temporal indexing, WARC output, and frontier persistence.
CrawlConfig
#![allow(unused)]
fn main() {
pub struct CrawlConfig {
pub seeds: Vec<Url>,
pub crawl_seed: CrawlSeed,
pub crawl_context: CrawlContextId,
pub max_depth: u32,
pub max_urls: usize,
pub politeness: PolitenessPolicy,
pub scope: CrawlScope,
pub concurrency: usize,
pub user_agent: String,
pub browser_mode: bool,
pub output_dir: Option<PathBuf>,
}
impl CrawlConfig {
pub fn for_test(seed_url: Url) -> Self;
pub fn seed_hosts(&self) -> Vec<String>;
pub fn seed_domains(&self) -> Vec<String>;
}
}
CrawlScope
#![allow(unused)]
fn main() {
pub enum CrawlScope {
SameDomain, // Registrable domain match
SameHost, // Exact host match
Any, // No restriction
}
}
CrawlStats
#![allow(unused)]
fn main() {
pub struct CrawlStats {
pub urls_fetched: usize,
pub urls_failed: usize,
pub urls_discovered: usize,
pub robots_blocked: usize,
pub blobs_stored: usize,
pub bytes_stored: u64,
pub warc_path: Option<String>,
}
}
CrawlOrchestrator
#![allow(unused)]
fn main() {
impl CrawlOrchestrator {
pub async fn new(config: CrawlConfig) -> Result<Self, PalimpsestError>;
}
}
The orchestrator loop:
- Pop a batch of URLs from the frontier (respects politeness)
- Build an ExecutionEnvelope for each
- Fetch concurrently via tokio::spawn
- Extract links from HTML responses
- Push discovered URLs back to the frontier (scope-filtered)
- Store blobs, insert index entries, write WARC records
- Save frontier state for resumption
- Repeat until the frontier is empty or max_urls is reached
Key Invariant
The orchestrator is the integration point. It does not add non-determinism — all ordering comes from the frontier, all time from envelopes, all randomness from the seed.
palimpsest-shadow
Shadow comparison engine for validating Palimpsest output against legacy crawlers (Heritrix, wget, Warcprox, Brozzler).
Purpose
During migration from legacy crawl infrastructure, shadow comparison proves that Palimpsest captures the same content. It reads .warc and .warc.gz files from any crawler, normalizes URLs for cross-format comparison, and reports matches, mismatches, and coverage gaps.
Usage
palimpsest shadow-compare --legacy ./heritrix-warcs --palimpsest ./output [--json]
Comparison Logic
- Read all WARC records from the legacy directory (.warc and .warc.gz)
- Read all WARC records from the Palimpsest output
- Normalize URLs: strip fragments, unify schemes (http/https), sort query params, strip angle brackets
- Match records by normalized URL
- For matched pairs: compare content size, report byte-level diffs
- Report unmatched URLs in each direction (coverage gaps)
URL Normalization
Legacy crawlers store URLs differently:
- wget uses <http://url> angle-bracket syntax per the WARC spec
- wget stores post-redirect URLs (https), while Palimpsest may store pre-redirect (http)
- Fragment handling varies across tools
normalize_url_for_comparison() unifies all representations.
Output Format
Plain text by default, JSON with --json flag. Reports:
- Total URLs in each dataset
- Matched URLs with size comparison
- Mismatches with byte-level size diffs
- URLs present in legacy but missing from Palimpsest
- URLs present in Palimpsest but missing from legacy
palimpsest-extract
HTML-to-text extraction and RAG chunking with full provenance tracking. Every chunk carries its source URL, capture timestamp, content hash, and character offset.
ExtractedDocument
#![allow(unused)]
fn main() {
pub struct ExtractedDocument {
pub url: String,
pub title: Option<String>,
pub description: Option<String>,
pub text: String,
pub text_length: usize,
pub chunks: Vec<ContentChunk>,
pub text_hash: String,
pub source_hash: String,
pub captured_at: String,
}
}
extract_document
#![allow(unused)]
fn main() {
pub fn extract_document(
raw_response: &[u8],
source_url: &Url,
captured_at: CaptureInstant,
source_hash: ContentHash,
chunk_config: &ChunkConfig,
) -> ExtractedDocument
}
Pipeline: raw HTTP response -> strip headers -> HTML to clean text -> chunk with provenance.
ContentChunk
#![allow(unused)]
fn main() {
pub struct ContentChunk {
pub text: String,
pub source_url: Url,
pub captured_at: CaptureInstant,
pub source_hash: ContentHash,
pub chunk_hash: ContentHash, // BLAKE3 of chunk text
pub chunk_index: usize,
pub total_chunks: usize,
pub char_offset: usize, // Position in source text
}
}
ChunkConfig
#![allow(unused)]
fn main() {
pub struct ChunkConfig {
pub target_size: usize, // Default: 1000 characters
pub overlap: usize, // Default: 200 characters
}
}
Chunking Strategy
Splitting respects natural boundaries in priority order:
- Paragraph boundaries (double newline)
- Sentence boundaries (period/question/exclamation + space)
- Word boundaries (space)
- Character boundary (last resort)
Each chunk overlaps with the next by overlap characters to preserve context at boundaries.
Key Invariant
Extraction is deterministic. Same input = same chunks = same hashes. Every chunk’s provenance chain is complete: chunk_hash -> source_hash -> source_url + captured_at.
palimpsest-embed
Embedding generation, SQLite vector search, and LCS-based change detection.
Embedding
#![allow(unused)]
fn main() {
pub struct Embedding { pub values: Vec<f32> }
impl Embedding {
pub fn dimension(&self) -> usize;
pub fn cosine_similarity(&self, other: &Embedding) -> f32;
}
}
EmbeddingProvider Trait
#![allow(unused)]
fn main() {
pub trait EmbeddingProvider: Send + Sync {
async fn embed(&self, text: &str) -> Result<Embedding, PalimpsestError>;
async fn embed_batch(&self, texts: &[&str]) -> Result<Vec<Embedding>, PalimpsestError>;
fn dimension(&self) -> usize;
fn name(&self) -> &str;
}
}
HashEmbedder
Deterministic test embedder using BLAKE3:
#![allow(unused)]
fn main() {
impl HashEmbedder {
pub fn new(dimension: usize) -> Self;
}
}
Generates pseudo-embeddings by hashing the input text with BLAKE3 and mapping hash bytes to f32 values. Deterministic — same text = same embedding. Not semantically meaningful, but sufficient for testing the vector store pipeline.
VectorStore
SQLite-backed embedding storage with brute-force cosine similarity search:
#![allow(unused)]
fn main() {
impl VectorStore {
pub fn open(path: &Path) -> Result<Self, VectorStoreError>;
pub fn in_memory() -> Result<Self, VectorStoreError>;
pub fn insert(&self, chunk_hash: &str, source_url: &str, captured_at: &str,
text: &str, embedding: &Embedding, provider: &str) -> Result<bool, VectorStoreError>;
pub fn search(&self, query_embedding: &Embedding, limit: usize)
-> Result<Vec<StoredEmbedding>, VectorStoreError>;
}
}
StoredEmbedding
#![allow(unused)]
fn main() {
pub struct StoredEmbedding {
pub chunk_hash: String,
pub source_url: String,
pub captured_at: String,
pub text: String,
pub similarity: f32,
}
}
Change Detection
LCS-based (Longest Common Subsequence) line-level diff:
pub struct ContentDiff {
    pub hunks: Vec<DiffHunk>,
    pub similarity: f32, // 0.0 to 1.0
    pub added: usize,
    pub removed: usize,
    pub unchanged: usize,
}

pub enum DiffHunk {
    Added(String),
    Removed(String),
    Unchanged(String),
}
Compares two captures of the same URL to identify what changed between them.
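A minimal sketch of the LCS counting behind such a diff. The `similarity` formula used here, 2·LCS / (m+n), is an assumption; the document does not pin down the exact definition:

```rust
// Classic dynamic-programming LCS table over lines.
fn lcs_table(a: &[&str], b: &[&str]) -> Vec<Vec<usize>> {
    let mut t = vec![vec![0; b.len() + 1]; a.len() + 1];
    for i in 0..a.len() {
        for j in 0..b.len() {
            t[i + 1][j + 1] = if a[i] == b[j] {
                t[i][j] + 1
            } else {
                t[i][j + 1].max(t[i + 1][j])
            };
        }
    }
    t
}

// Returns (added, removed, unchanged, similarity) for two captures.
pub fn diff_counts(old: &str, new: &str) -> (usize, usize, usize, f32) {
    let a: Vec<&str> = old.lines().collect();
    let b: Vec<&str> = new.lines().collect();
    let lcs = lcs_table(&a, &b)[a.len()][b.len()];
    let removed = a.len() - lcs;
    let added = b.len() - lcs;
    let sim = if a.is_empty() && b.is_empty() {
        1.0
    } else {
        2.0 * lcs as f32 / (a.len() + b.len()) as f32
    };
    (added, removed, lcs, sim)
}

fn main() {
    let (added, removed, unchanged, sim) = diff_counts("a\nb\nc", "a\nc\nd");
    println!("{} {} {} {:.2}", added, removed, unchanged, sim);
}
```

Emitting `DiffHunk` values would follow from backtracking through the same table; the counts alone are enough to fill `ContentDiff`'s summary fields.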
palimpsest-server
HTTP frontier server, retrieval API, and Prometheus metrics. Three distinct services in one crate.
Frontier API
Distributed crawling coordination. Workers pop URLs, fetch them, and push discoveries back.
FrontierState
pub struct FrontierState {
    pub frontier: Mutex<Frontier>,
    pub seed: CrawlSeed,
}
Endpoints
| Method | Path | Request Body | Response |
|---|---|---|---|
| POST | /seeds | {"urls": ["..."]} | {"accepted": N} |
| POST | /pop | {} | {"url": "...", "depth": 0, "priority": 0} |
| POST | /discovered | {"urls": [{"url": "...", "depth": 1, "parent_hash": "..."}]} | {"accepted": N} |
| GET | /status | — | {"queue_size": N, "seen_count": N, "host_count": N, "seed_value": N} |
| GET | /health | — | "ok" |
Retrieval API
Content serving for AI pipelines and search.
RetrievalState
pub struct RetrievalState {
    pub index: Mutex<SqliteIndex>,
    pub storage: FileSystemBlobStore,
    pub chunk_config: ChunkConfig,
}
Endpoints
| Method | Path | Query Params | Description |
|---|---|---|---|
| GET | /v1/content | url | Raw captured content |
| GET | /v1/chunks | url | RAG-ready chunks with provenance |
| GET | /v1/history | url | All captures with timestamps |
| GET | /v1/search | q | Full-text search |
| GET | /health | — | Health check |
Metrics
pub struct Metrics {
    pub urls_fetched: AtomicU64,
    pub urls_failed: AtomicU64,
    pub urls_discovered: AtomicU64,
    pub robots_blocked: AtomicU64,
    pub bytes_stored: AtomicU64,
    pub blobs_stored: AtomicU64,
    pub api_requests: AtomicU64,
    pub frontier_pops: AtomicU64,
    pub frontier_pushes: AtomicU64,
}

impl Metrics {
    pub fn new() -> Self;
    pub fn render(&self) -> String; // Prometheus text exposition format
}
All counters use AtomicU64 with Ordering::Relaxed — thread-safe, no locks, no control flow impact (Law 1 safe).
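A sketch of how `render()` plausibly produces the text exposition format, reduced to two of the nine counters for brevity. The metric names follow the Monitoring section; the `# TYPE` layout is the standard Prometheus convention:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Reduced sketch: the real struct has nine counters.
pub struct Metrics {
    pub urls_fetched: AtomicU64,
    pub urls_failed: AtomicU64,
}

impl Metrics {
    pub fn render(&self) -> String {
        let mut out = String::new();
        for (name, value) in [
            ("palimpsest_urls_fetched", self.urls_fetched.load(Ordering::Relaxed)),
            ("palimpsest_urls_failed", self.urls_failed.load(Ordering::Relaxed)),
        ] {
            // Prometheus text exposition: TYPE line, then "name value".
            out.push_str(&format!("# TYPE {name} counter\n{name} {value}\n"));
        }
        out
    }
}

fn main() {
    let m = Metrics { urls_fetched: AtomicU64::new(3), urls_failed: AtomicU64::new(0) };
    m.urls_fetched.fetch_add(1, Ordering::Relaxed); // lock-free increment
    print!("{}", m.render());
}
```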
palimpsest-sim
Deterministic simulation testing framework. Proves the Six Laws hold at scale by crawling a virtual internet twice with the same seed and asserting byte-identical results.
SimulatedWeb
pub struct SimulatedWeb {
    seed: CrawlSeed,
    universes: BTreeMap<String, Box<dyn UniverseGenerator>>,
}

impl SimulatedWeb {
    pub fn new(seed: CrawlSeed) -> Self;
    pub fn add_universe(&mut self, generator: Box<dyn UniverseGenerator>);
    pub fn fetch(&self, url: &Url) -> Option<SimulatedResponse>;
}
UniverseGenerator Trait
pub trait UniverseGenerator: Send + Sync {
    fn domain(&self) -> &str;
    fn generate(&self, seed: &CrawlSeed, url: &Url) -> SimulatedResponse;
    fn page_count(&self) -> usize;
}
Each universe owns a domain (e.g., linkmaze.sim) and generates deterministic responses for any URL under that domain.
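A toy generator in the same spirit (the `mix` function is hypothetical, not the crate's): the body is a pure function of seed and URL, so every fetch of the same URL is byte-identical:

```rust
// Illustrative mixer: combines seed and URL bytes into one deterministic value.
fn mix(seed: u64, url: &str) -> u64 {
    let mut h = seed ^ 0x9E3779B97F4A7C15;
    for b in url.bytes() {
        h = (h ^ b as u64).wrapping_mul(0x100000001B3);
    }
    h
}

// Deterministic page body: content and outgoing links derive only from (seed, url).
pub fn generate_page(seed: u64, url: &str, links_per_page: usize) -> String {
    let h = mix(seed, url);
    let mut body = format!("<html><body><h1>page {:016x}</h1>\n", h);
    for i in 0..links_per_page {
        // Link targets are themselves derived from the page hash.
        body.push_str(&format!("<a href=\"/p/{:016x}\">next</a>\n", mix(h, &i.to_string())));
    }
    body.push_str("</body></html>");
    body
}

fn main() {
    let a = generate_page(42, "http://linkmaze.sim/p/0", 3);
    let b = generate_page(42, "http://linkmaze.sim/p/0", 3);
    assert_eq!(a, b); // same seed + same URL = identical response
    println!("{}", a.len());
}
```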
SimulatedResponse
pub struct SimulatedResponse {
    pub status: u16,
    pub headers: Vec<(String, String)>,
    pub body: Bytes,
    pub delay: Option<Duration>,
    pub fault: Option<FaultType>,
}

pub enum FaultType {
    ConnectionRefused,
    Timeout,
    Reset,
    RedirectLoop,
}
Six Adversarial Universes
| Universe | Domain | Tests |
|---|---|---|
| LinkMaze | linkmaze.sim | Deep graph traversal, configurable fan-out |
| EncodingHell | encoding.sim | UTF-8 edge cases, mixed encodings, BOM |
| MalformedDom | malformed.sim | Broken HTML, unclosed tags, invalid attributes |
| RedirectLabyrinth | redirect.sim | Redirect chains, loops, cross-domain redirects |
| ContentTrap | trap.sim | Infinite calendars, session IDs, spider traps |
| TemporalDrift | drift.sim | Content changes between fetches |
Verification Harness
pub async fn verify_determinism<F>(
    web_factory: F,
    seeds: &[Url],
    max_depth: u32,
    max_urls: usize,
) -> Result<VerificationResult, String>
where
    F: Fn() -> SimulatedWeb;
Crawls twice, compares URLs, blob hashes, and index entries. Any divergence = failure.
VerificationResult
pub struct VerificationResult {
    pub urls: Vec<String>,
    pub blob_hashes: Vec<String>,
    pub index_entries: Vec<(String, String)>,
    pub pages_fetched: usize,
    pub errors: usize,
}
Scale
Proven deterministic at 1,000, 5,000, and 10,000 pages across all six universes with zero divergence.
palimpsest-cli
Command-line interface with 10 subcommands. Thin wrapper around the kernel crates.
crawl
Start a crawl with seed URLs.
palimpsest crawl <SEEDS>... [OPTIONS]
-d, --depth <N> Max crawl depth [default: 2]
-m, --max-urls <N> Max URLs to fetch [default: 100]
-s, --seed <N> Deterministic seed [default: 42]
-o, --output-dir <DIR> Persist to disk
--browser Headless Chrome capture
--user-agent <UA> User-Agent [default: PalimpsestBot/0.1]
--politeness-ms <N> Per-host delay in ms [default: 1000]
-c, --config <FILE> TOML config file
replay
palimpsest replay <URL> --data-dir <DIR>
history
palimpsest history <URL> --data-dir <DIR>
extract
palimpsest extract <URL> --data-dir <DIR> [--json]
shadow-compare
palimpsest shadow-compare --legacy <DIR> --palimpsest <DIR> [--json]
serve
Start a distributed frontier server.
palimpsest serve --port <PORT> --seed <N> --politeness-ms <N>
Default port: 8090.
worker
Connect to a frontier server and crawl.
palimpsest worker --server <URL> --output-dir <DIR> [--user-agent <UA>]
api
Start the retrieval API server.
palimpsest api --port <PORT> --data-dir <DIR>
Default port: 8080.
stats
Print workspace statistics.
palimpsest stats
migrate
Run storage migrations (JSON index to SQLite).
palimpsest migrate --data-dir <DIR>
Docker Deployment
Dockerfile
Multi-stage build: Rust 1.86 builder stage compiles a release binary, Debian slim runtime stage runs it.
# Build the image
docker build -t palimpsest .
# Single crawl
docker run -v ./output:/data palimpsest crawl https://example.com -d 2 -o /data
# View help
docker run palimpsest --help
The final image includes only the stripped binary and minimal runtime dependencies (ca-certificates, libssl3).
Docker Compose
The compose file runs four services sharing a named volume:
docker compose up
| Service | Command | Port | Purpose |
|---|---|---|---|
| api | `api -p 8080 --data-dir /data` | 8080 | Retrieval API |
| frontier | `serve -p 8090 -s 42 --politeness-ms 500` | 8090 | Frontier server |
| worker | `worker --server http://frontier:8090 -o /data` | — | Fetch worker |
| crawl | `crawl <URL> -d 2 -m 50 -o /data` | — | One-shot crawl |
The `crawl` service uses the `crawl` profile — run it explicitly:
docker compose --profile crawl run crawl
Shared Volume
All services share the palimpsest-data named volume mounted at /data. This contains blobs, the SQLite index, WARC files, and frontier state.
Production Considerations
- Set resource limits (`mem_limit`, `cpus`) per service
- The frontier server is stateful — run a single instance
- Workers are stateless — scale horizontally with `docker compose up --scale worker=N`
- Mount the data volume to persistent storage for durability
- Expose only the `api` service port externally; keep `frontier` internal
Distributed Crawling
Palimpsest supports horizontal scaling via an HTTP frontier server and N worker processes.
Architecture
┌──────────────┐
curl POST │ Frontier │ ◄── Deterministic ordering
/seeds ────────►│ Server │ (seed-driven)
│ :8090 │
└──┬───┬───┬──┘
│ │ │
POST /pop│ │ │POST /discovered
│ │ │
┌──┴┐ ┌┴──┐┌┴──┐
│W1 │ │W2 ││W3 │ ◄── Stateless workers
└───┘ └───┘└───┘
│ │ │
▼ ▼ ▼
┌──────────────┐
│ Shared Disk │ (blobs, index, WARC)
└──────────────┘
Start the Frontier Server
palimpsest serve --port 8090 --seed 42 --politeness-ms 500
The frontier maintains deterministic URL ordering and politeness enforcement across all workers.
Seed URLs
curl -X POST http://localhost:8090/seeds \
-H 'Content-Type: application/json' \
-d '{"urls": ["https://example.com/", "https://docs.example.com/"]}'
Start Workers
# Terminal 2
palimpsest worker --server http://localhost:8090 --output-dir ./data
# Terminal 3 (scale out)
palimpsest worker --server http://localhost:8090 --output-dir ./data
Each worker loops: pop URL -> fetch -> store artifacts -> push discovered URLs.
Worker Flow
- `POST /pop` — receive next URL from frontier
- Fetch the URL (HTTP or browser)
- Store blob to content-addressed storage
- Insert entry into temporal index
- Write WARC++ records
- `POST /discovered` — push new URLs back to frontier
- Repeat
Monitoring
# Check frontier status
curl http://localhost:8090/status
# Response:
# {"queue_size": 1234, "seen_count": 5678, "host_count": 42, "seed_value": 42}
Determinism Guarantee
The frontier server maintains the same seed-driven ordering regardless of how many workers connect or in what order they pop URLs. Same seed = same frontier ordering.
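One way to picture this guarantee is a simplified, hypothetical frontier keyed by a totally ordered `BTreeSet`: the next URL is a pure function of frontier contents plus seed, never of worker timing or arrival order.

```rust
use std::collections::BTreeSet;

// Simplified sketch, not the real Frontier: entries are ordered by
// (priority, seed-derived tiebreak, url), so pop order is content-determined.
pub struct Frontier {
    queue: BTreeSet<(u32, u64, String)>,
    seed: u64,
}

impl Frontier {
    pub fn new(seed: u64) -> Self {
        Frontier { queue: BTreeSet::new(), seed }
    }

    pub fn push(&mut self, priority: u32, url: &str) {
        // Tiebreak derived from seed + url, not from arrival time.
        let mut h = self.seed;
        for b in url.bytes() {
            h = (h ^ b as u64).wrapping_mul(0x100000001B3);
        }
        self.queue.insert((priority, h, url.to_string()));
    }

    pub fn pop(&mut self) -> Option<String> {
        // Smallest element first: lower priority value pops earlier.
        let first = self.queue.pop_first()?;
        Some(first.2)
    }
}

fn main() {
    // Same seed, different insertion order: identical pop sequence.
    let mut a = Frontier::new(42);
    let mut b = Frontier::new(42);
    for url in ["http://x/1", "http://x/2", "http://x/3"] { a.push(0, url); }
    for url in ["http://x/3", "http://x/1", "http://x/2"] { b.push(0, url); }
    while let (Some(u), Some(v)) = (a.pop(), b.pop()) {
        assert_eq!(u, v);
    }
    println!("ok");
}
```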
Retrieval API
The retrieval API serves captured content over HTTP for AI pipelines, RAG systems, and content auditing.
Start the Server
palimpsest api --port 8080 --data-dir ./output
Endpoints
GET /v1/content
Retrieve raw captured content for a URL.
curl "http://localhost:8080/v1/content?url=https://example.com/"
Returns the stored HTTP response body.
GET /v1/chunks
Retrieve RAG-ready chunks with full provenance.
curl "http://localhost:8080/v1/chunks?url=https://example.com/"
Response:
{
"url": "https://example.com/",
"chunks": [
{
"text": "Example Domain. This domain is for use in illustrative examples...",
"chunk_index": 0,
"total_chunks": 3,
"char_offset": 0,
"chunk_hash": "blake3:af13...",
"source_hash": "blake3:c7d2...",
"captured_at": "2026-04-12T10:30:00Z"
}
]
}
GET /v1/history
All captures of a URL with timestamps and content hashes.
curl "http://localhost:8080/v1/history?url=https://example.com/"
Response:
{
"url": "https://example.com/",
"captures": [
{"captured_at": "2026-04-12T10:30:00Z", "content_hash": "blake3:af13...", "crawl_context": 1},
{"captured_at": "2026-04-13T08:00:00Z", "content_hash": "blake3:b8e2...", "crawl_context": 2}
]
}
GET /v1/search
Search across captured content.
curl "http://localhost:8080/v1/search?q=example+domain"
GET /metrics
Prometheus-compatible metrics (see Monitoring).
GET /health
curl http://localhost:8080/health
# "ok"
Use Cases
- RAG pipelines — `/v1/chunks` provides pre-chunked text with provenance for embedding
- Content auditing — `/v1/history` shows exactly when content changed
- AI training — `/v1/content` serves raw captured pages
- Search systems — `/v1/search` provides full-text search across the archive
Monitoring & Metrics
Prometheus Endpoint
The API server exposes metrics at GET /metrics in Prometheus text exposition format:
curl http://localhost:8080/metrics
# HELP palimpsest_urls_fetched Total URLs successfully fetched.
# TYPE palimpsest_urls_fetched counter
palimpsest_urls_fetched 4521
# HELP palimpsest_urls_failed Total URLs that failed to fetch.
# TYPE palimpsest_urls_failed counter
palimpsest_urls_failed 12
# HELP palimpsest_urls_discovered Total URLs discovered via link extraction.
# TYPE palimpsest_urls_discovered counter
palimpsest_urls_discovered 15890
...
Available Metrics
| Metric | Type | Description |
|---|---|---|
| `palimpsest_urls_fetched` | counter | Total URLs successfully fetched |
| `palimpsest_urls_failed` | counter | Total fetch failures |
| `palimpsest_urls_discovered` | counter | Total URLs discovered via links |
| `palimpsest_robots_blocked` | counter | Total URLs blocked by robots.txt |
| `palimpsest_bytes_stored` | counter | Total bytes written to blob storage |
| `palimpsest_blobs_stored` | gauge | Unique blobs in storage |
| `palimpsest_api_requests` | counter | Total API requests served |
| `palimpsest_frontier_pops` | counter | Total frontier pop operations |
| `palimpsest_frontier_pushes` | counter | Total frontier push operations |
All counters use AtomicU64 with Ordering::Relaxed — lock-free, thread-safe, no impact on crawl ordering (Law 1 safe).
Structured Logging
Palimpsest uses `tracing` with `tracing-subscriber` for structured logging:
# Set log level via environment
RUST_LOG=info palimpsest crawl https://example.com -o ./output
# Debug level for specific crate
RUST_LOG=palimpsest_frontier=debug palimpsest crawl ...
# JSON output for log aggregation
RUST_LOG=info palimpsest crawl ... 2>&1 | jq .
Grafana Dashboard Suggestions
| Panel | Query | Type |
|---|---|---|
| Throughput | rate(palimpsest_urls_fetched[1m]) | Graph |
| Error Rate | rate(palimpsest_urls_failed[1m]) / rate(palimpsest_urls_fetched[1m]) | Gauge |
| Discovery Ratio | palimpsest_urls_discovered / palimpsest_urls_fetched | Stat |
| Robots Blocked | rate(palimpsest_robots_blocked[1m]) | Graph |
| Storage Growth | palimpsest_bytes_stored | Graph |
| API Load | rate(palimpsest_api_requests[1m]) | Graph |
Alerting Suggestions
- Error rate > 5% — possible network or DNS issues
- Throughput drop > 50% — politeness starvation or backend slowdown
- Frontier pops = 0 — crawl may be stalled
- Storage growth flatline — dedup working well, or crawl stopped
Trust Boundaries
Untrusted Inputs
All fetched content is untrusted. HTTP responses, HTML, JavaScript, CSS, images — all of it. Never execute, eval, or interpret fetched content outside a sandbox.
All URLs are untrusted. Validate scheme, host, and port. Block private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), link-local (169.254.0.0/16), and loopback (127.0.0.0/8) unless explicitly configured.
All DNS responses are untrusted. Record them in the ExecutionEnvelope for forensic replay, but verify against policy before connecting. DNS rebinding attacks can redirect requests to internal infrastructure.
All TLS certificates are recorded. The full certificate chain is stored in the envelope’s TlsFingerprint (protocol, cipher, cert chain hash). This enables forensic analysis of TLS state at capture time.
Storage Integrity
All artifacts are content-addressed. Tampering is detectable by recomputing the BLAKE3 hash and comparing it against the stored ContentHash. This verification happens on every read.
Storage backends must support atomic writes. The FileSystemBlobStore uses temp-file-plus-rename to prevent partial artifacts from being visible.
Blob deletion requires an explicit garbage collection pass — never inline during normal operation.
Credential Safety
- No credentials in source code, configuration files committed to git, or artifact metadata
- HTTP auth credentials (for authenticated crawls) are injected via environment variables
- TLS client certificates are loaded from a configured path, never embedded
Fetch Safety
Resource Limits
| Limit | Default | Configurable |
|---|---|---|
| Maximum response body | 256 MiB | FetchConfig.max_body_size |
| Maximum redirect chain | 10 | FetchConfig.max_redirects |
| Connect timeout | 30 seconds | FetchConfig.connect_timeout |
| Total request timeout | 120 seconds | FetchConfig.total_timeout |
Decompression Bomb Protection
Responses with `Content-Encoding: gzip` (or `brotli`, `deflate`) are decompressed with size validation. The decompressed size is checked against `Content-Length` multiplied by a maximum expansion ratio, preventing decompression-bomb attacks.
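The guard can be sketched as a capped read over any streaming decoder. The limits and function name here are illustrative; the real wiring through `FetchConfig` is not shown:

```rust
use std::io::{self, Read};

// Copy from a decoder, failing once output exceeds compressed_len * max_ratio.
pub fn read_capped<R: Read>(
    mut decoder: R,
    compressed_len: u64,
    max_ratio: u64,
) -> io::Result<Vec<u8>> {
    let cap = compressed_len.saturating_mul(max_ratio);
    let mut out = Vec::new();
    let mut buf = [0u8; 8192];
    loop {
        let n = decoder.read(&mut buf)?;
        if n == 0 {
            return Ok(out);
        }
        if out.len() as u64 + n as u64 > cap {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "decompressed size exceeds allowed ratio",
            ));
        }
        out.extend_from_slice(&buf[..n]);
    }
}

fn main() {
    // Stand-in "decoder": a plain byte stream 100x larger than its claimed
    // compressed size trips the guard.
    let inflated = vec![0u8; 100_000];
    let err = read_capped(&inflated[..], 1_000, 10).unwrap_err();
    println!("{}", err.kind() == io::ErrorKind::InvalidData);
}
```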
Unsafe URL Schemes
Link extraction blocks unsafe URL schemes. These are logged but never followed:
- `javascript:` — code execution
- `data:` — embedded content (can be arbitrarily large)
- `blob:` — browser-internal references
HTML Sanitization
Before link extraction, <script> and <style> tag content is stripped entirely. This prevents extracting junk URLs from JavaScript source code (e.g., minified variable names that look like relative paths).
pub fn extract_links(html: &str, base_url: &Url) -> Vec<Url> {
    let cleaned = strip_tag_content(html, &["script", "style"]);
    // ... scan for href, src attributes
}
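A naive sketch of what `strip_tag_content` might do; the actual implementation likely handles malformed HTML more defensively:

```rust
// Drops everything between <tag ...> and </tag> for the given tags,
// case-insensitively. Naive: "<script" also matches "<scripts", and an
// unclosed tag swallows the rest of the document.
pub fn strip_tag_content(html: &str, tags: &[&str]) -> String {
    // ASCII lowercasing preserves byte offsets, so indices align with `html`.
    let lower = html.to_ascii_lowercase();
    let mut out = String::new();
    let mut pos = 0;
    'outer: while pos < html.len() {
        for tag in tags {
            let open = format!("<{tag}");
            if lower[pos..].starts_with(&open) {
                let close = format!("</{tag}>");
                // Skip to just past the closing tag (or to EOF if unclosed).
                match lower[pos..].find(&close) {
                    Some(rel) => pos += rel + close.len(),
                    None => pos = html.len(),
                }
                continue 'outer;
            }
        }
        let ch = html[pos..].chars().next().unwrap();
        out.push(ch);
        pos += ch.len_utf8();
    }
    out
}

fn main() {
    let html = "<p>keep</p><script>var x = '/fake/path';</script><p>more</p>";
    println!("{}", strip_tag_content(html, &["script", "style"]));
}
```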
robots.txt Enforcement
Palimpsest respects robots.txt per RFC 9309:
- Fetches and caches `robots.txt` per origin before crawling
- Respects `Disallow` directives for the configured user agent
- Honors `Crawl-delay` when specified
- Blocked URLs are counted in metrics (`palimpsest_robots_blocked`)
Browser Sandbox
Isolation Model
Headless Chrome runs in a sandboxed process with strict isolation:
- No persistent storage — each page load starts from a clean browser context. No cookies, localStorage, or IndexedDB carry over between pages.
- Controlled network — the browser communicates only through the fetch engine’s controlled proxy. Direct network access is blocked.
- Disabled exfiltration — WebRTC, geolocation, notifications, and clipboard APIs are disabled to prevent data leakage.
Timeout Enforcement
Every page load has a hard timeout (default: 30 seconds). If the page does not complete loading within the timeout, the browser process is killed.
Determinism Overrides
Before any page scripts execute, Palimpsest injects JavaScript overrides seeded from CrawlSeed:
// Time is frozen and advances deterministically
Date.now = function() { return 1700000000000 + (__date_offset += 1); };
// Math.random is seeded (xorshift)
Math.random = function() { /* seeded PRNG */ };
// performance.now advances in fixed increments
performance.now = function() { return (__perf_offset += 0.1); };
This prevents JavaScript on the page from introducing non-determinism. Same seed = same execution.
CDP Stealth Mode
When stealth: true is set on BrowserFetchConfig, a comprehensive anti-detection suite is applied on top of the determinism overrides.
Chrome Launch Hardening
--disable-blink-features=AutomationControlled
--disable-component-extensions-with-background-pages
--no-first-run
--no-default-browser-check
17 Stealth Evasion Patches
All patches injected via Page.addScriptToEvaluateOnNewDocument before navigation:
| Patch | Purpose |
|---|---|
| navigator.webdriver | Set to false (configurable via WebdriverValue enum) |
| window.chrome | Full Chrome object mock (app, csi, loadTimes, runtime) |
| navigator.plugins | 3 plugins matching real Chrome |
| navigator.mimeTypes | PDF + NaCl mime types |
| navigator.permissions | Fix Notification permission inconsistency |
| navigator.languages | ["en-US", "en"] |
| navigator.hardwareConcurrency | 8 cores |
| navigator.deviceMemory | 8 GB |
| WebGL vendor/renderer | Intel UHD Graphics 630 |
| Canvas fingerprint | Seeded sub-pixel noise |
| Window/screen dimensions | Match viewport + chrome UI offset |
| AudioContext | Seeded oscillator noise |
| ClientRect | Seeded sub-pixel noise |
| sourceURL markers | Strip automation stack traces |
| navigator.userAgent | Consistent with HTTP User-Agent header |
| navigator.maxTouchPoints | 0 |
Determinism Guarantee
All noise patches (canvas, audio, ClientRect) use deterministic xorshift PRNGs with sub-seeds derived from CrawlSeed. Same seed = same noise = same fingerprint. This is Law 1 compliant.
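The scheme can be sketched as follows; the sub-seed derivation here is illustrative, not the crate's exact key-derivation function:

```rust
// Classic xorshift64 step: fast, deterministic, non-zero state in, non-zero out.
fn xorshift64(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

// Illustrative sub-seed: mixes the crawl seed with a domain label
// ("canvas", "audio", ...) so each patch gets an independent stream.
fn sub_seed(crawl_seed: u64, label: &str) -> u64 {
    let mut h = crawl_seed | 1; // xorshift state must be non-zero
    for b in label.bytes() {
        h = (h ^ b as u64).wrapping_mul(0x100000001B3);
    }
    h | 1
}

// Sub-pixel noise in [-0.5, 0.5), e.g. for canvas or ClientRect jitter.
pub fn noise(crawl_seed: u64, label: &str, n: usize) -> Vec<f64> {
    let mut state = sub_seed(crawl_seed, label);
    (0..n)
        .map(|_| (xorshift64(&mut state) >> 11) as f64 / (1u64 << 53) as f64 - 0.5)
        .collect()
}

fn main() {
    // Same seed = same noise = same fingerprint.
    assert_eq!(noise(42, "canvas", 4), noise(42, "canvas", 4));
    // Different labels give independent streams.
    assert_ne!(noise(42, "canvas", 4), noise(42, "audio", 4));
    println!("ok");
}
```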
Verified Results
Tested against 5 public bot detection sites:
- Rebrowser Bot Detector: 10/10 pass
- Sannysoft: 55/56 pass (only PluginArray prototype)
- FingerprintJS BotD: No bot verdict
- CreepJS: No hard failures
Sub-Resource Capture
Chrome DevTools Protocol (CDP) network event listeners capture all sub-resources:
- `Network.requestWillBeSent` — records every outgoing request
- `Network.responseReceived` — captures response metadata
- `Network.getResponseBody` — retrieves response body for each sub-resource
Each sub-resource is recorded as a separate WARC record with its own ContentHash, and the full dependency graph is stored in the resource-graph record.
Testing Philosophy
The Hierarchy
Tests are prioritized by the strength of the guarantee they provide:
- Determinism tests — Same seed + same input = bit-identical output. These are the proof that the system works. Highest priority.
- Property-based tests — `proptest` generates random inputs and verifies invariants hold for all of them. Catches edge cases humans miss.
- Snapshot tests — `insta` for serialization formats (WARC++, JSON, index entries). Snapshots are reviewed artifacts.
- Integration tests — Real HTTP via `wiremock`, real storage backends, real index queries.
- Unit tests — Standard `#[test]` for isolated logic.
No Mocking Core Interfaces
The storage layer, index, and envelope are the system’s integrity boundaries. Never mock them. Use real implementations with in-memory backends:
// Correct: real implementation, in-memory backend
let store = InMemoryBlobStore::new();
let index = InMemoryIndex::new();

// Wrong: mocked storage that always returns Ok
// let store = MockBlobStore::new(); // DON'T
Test Naming
test_{what}_{condition}_{expected_outcome}
Examples:
- `test_frontier_with_same_seed_produces_identical_order`
- `test_artifact_hash_changes_when_content_differs`
- `test_storage_put_get_roundtrip_preserves_content`
Adversarial Testing
Every adversarial input must produce a classified error, never a panic or silent corruption:
- Malformed HTTP responses
- Truncated connections mid-transfer
- DNS resolution failures
- TLS certificate anomalies
- Content that attempts to exploit parsers (polyglot files, zip bombs)
Test Coverage
301 tests across 21 test files, covering all 15 crates. The simulation framework proves determinism at 10,000 pages with zero divergence.
Simulation Framework
The simulation framework (palimpsest-sim) provides a virtual internet for testing. It replaces real HTTP with deterministic, seed-driven responses — enabling proof that the Six Laws hold at scale.
SimulatedWeb
A SimulatedWeb hosts multiple “universes,” each owning a domain:
let mut web = SimulatedWeb::new(CrawlSeed::new(42));
web.add_universe(Box::new(LinkMaze { links_per_page: 500, total_pages: 100_000 }));
web.add_universe(Box::new(EncodingHell));
web.add_universe(Box::new(MalformedDom));
Calling web.fetch(&url) returns a SimulatedResponse generated deterministically from the seed and URL.
SimulatedServer
Wraps SimulatedWeb with wiremock to serve responses over real HTTP. The CrawlOrchestrator connects to it as if it were the real web.
verify_determinism
The core harness:
let result = verify_determinism(
    || build_web(seed), // Factory creates identical web each time
    &seed_urls,
    max_depth,
    max_urls,
).await?;
This function:
- Creates a `SimulatedWeb` from the factory
- Runs a full crawl (orchestrator + frontier + storage + index)
- Records all URLs, blob hashes, and index entries
- Creates a second `SimulatedWeb` from the same factory
- Runs an identical crawl
- Asserts the two runs produced byte-identical results
Any divergence in URLs, blob hashes, or index entries causes a test failure.
verify_resumption_determinism
Tests crawl resumption:
- Crawl 500 pages, save frontier state
- Create new frontier, load saved state
- Continue crawling to 1000 pages
- Compare against a single 1000-page run
Same result = Law 1 holds across save/load boundaries.
Scale Tests
| Test | Pages | Universes | Result |
|---|---|---|---|
| `test_scale_1000_pages_deterministic` | 1,000 | 5 | Zero divergence |
| `test_scale_5000_pages_linkmaze_only` | 5,000 | 1 | Zero divergence |
| `test_stress_10k_pages_deterministic` | 10,000 | 5 | Zero divergence |
Adversarial Universes
The simulation framework includes six adversarial universes, each designed to stress a specific aspect of the crawl kernel.
LinkMaze
Domain: linkmaze.sim
A deep, wide graph. Each page contains links_per_page links to other pages in the maze. Tests frontier scheduling, deduplication, and depth limiting at scale.
LinkMaze { links_per_page: 500, total_pages: 1_000_000 }
EncodingHell
Domain: encoding.sim
UTF-8 edge cases: mixed encodings, byte-order marks, surrogate pairs, right-to-left text, zero-width characters, overlong sequences. Tests that content hashing and text extraction handle encoding correctly.
MalformedDom
Domain: malformed.sim
Broken HTML: unclosed tags, deeply nested tables, invalid attributes, missing doctype, mixed content models. Tests link extraction robustness — the parser must not crash or produce junk URLs.
RedirectLabyrinth
Domain: redirect.sim
Redirect chains (301 -> 302 -> 301 -> 200), redirect loops, cross-domain redirects, redirect-to-self. Tests redirect chain depth enforcement and URL normalization.
ContentTrap
Domain: trap.sim
Spider traps: infinite calendars (every date links to the next), session IDs in URLs (creating infinite unique URLs), query parameter permutations. Tests that max_urls and deduplication prevent infinite crawls.
TemporalDrift
Domain: drift.sim
Content changes between fetches. The same URL returns different content depending on the logical clock value. Tests temporal integrity — the index must correctly record each version.
TemporalDrift::new(1) // Content changes every 1 logical tick
Composition
All six universes run simultaneously in scale tests:
let mut web = SimulatedWeb::new(seed);
web.add_universe(Box::new(LinkMaze { ... }));
web.add_universe(Box::new(EncodingHell));
web.add_universe(Box::new(MalformedDom));
web.add_universe(Box::new(RedirectLabyrinth));
web.add_universe(Box::new(ContentTrap));
web.add_universe(Box::new(TemporalDrift::new(1)));
The crawl must handle all six simultaneously — deterministic ordering across domains, correct error classification, and zero divergence between runs.
Development Setup
Prerequisites
| Tool | Version | Why |
|---|---|---|
| Rust | 1.86+ stable | rustup update stable |
| CMake | 3.x+ | BoringSSL compilation |
| Go | 1.19+ | BoringSSL compilation |
| C compiler | gcc, clang, or MSVC | BoringSSL compilation |
| Git | any | Source checkout |
See Installation for platform-specific setup (macOS, Linux, Windows).
Clone and Build
git clone https://github.com/copyleftdev/palimpsest.git
cd palimpsest
cargo build --workspace
First build takes 2-4 minutes (BoringSSL compiles from source). Subsequent builds are incremental.
Running Tests
# Full test suite (288 tests, excludes long-running scale tests)
cargo test --workspace
# Simulation tests only
cargo test -p palimpsest-sim --test simulation_tests
# Scale tests (1K + 5K pages, ~90 seconds)
cargo test -p palimpsest-sim --test scale_test
# Stress test (10K pages)
cargo test -p palimpsest-sim --test stress_test
# Stealth regression tests (requires Chrome + network access)
cargo test -p palimpsest-fetch --test stealth_test -- --ignored --nocapture --test-threads=1
# Single crate
cargo test -p palimpsest-frontier
Pre-Commit Checks
Before submitting a PR:
cargo fmt --check # Formatting
cargo clippy -- -D warnings # Lints (must be warning-free)
cargo test --workspace # All tests pass
IDE Setup
rust-analyzer is recommended for all editors. The workspace Cargo.toml at the project root configures all 15 crates automatically.
| Editor | Setup |
|---|---|
| VS Code | Install rust-analyzer extension |
| JetBrains (CLion/RustRover) | Built-in Rust support |
| Neovim | mason.nvim → install rust-analyzer |
| Emacs | lsp-mode + rustic |
Docker Testing
docker build -t palimpsest .
docker run palimpsest --help
Platform Notes
macOS
BoringSSL builds cleanly with Xcode command line tools + Homebrew CMake + Go. No special flags needed.
Windows (MSVC)
Requires Visual Studio Build Tools with the “Desktop development with C++” workload. CMake and Go must be in PATH. WSL2 is the recommended alternative for a smoother experience.
Linux
All major distributions work. Ensure cmake, go, and clang (or gcc) are installed. See Installation for distro-specific package commands.
Code Standards
Deterministic Concurrency
- `BTreeMap` over `HashMap` when iteration order is observable
- `tokio` for all concurrency — no `thread::spawn`
- No `rand` crate — all randomness via `CrawlSeed` -> `ChaCha8Rng`
- No `Instant::now()` in core logic — time from `ExecutionEnvelope` or caller
- Atomics for counters/metrics only, never for control flow
Error Handling
- All errors typed as `PalimpsestError` variants — no `anyhow` or `eyre` in library crates
- No `.unwrap()` or `.expect()` in library code — binary crates may use them in `main()` only
- Every `?` propagation must preserve the error taxonomy
- `panic!` is a bug report, not control flow
Memory and Performance
- `bytes::Bytes` for buffers crossing async boundaries
- Zero-copy: prefer `&[u8]` over `Vec<u8>` over `String`
- No `Clone` on large types — use `Arc<T>` for shared ownership
- Pre-allocate buffers in hot paths
Serialization
- `serde` derive on all types crossing crate boundaries
- `#[serde(rename_all = "snake_case")]` on enum variants
- JSON for human-readable formats
- Never change serialized field names without a migration plan
Type Design
- Newtypes for domain concepts: `ContentHash`, `CaptureInstant`, `CrawlSeed`, `CrawlContextId`
- Parse, don’t validate: constructors enforce invariants
- `Copy` for small values (hashes, timestamps, IDs)
- `#[non_exhaustive]` on public enums that may grow
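The "parse, don't validate" rule in newtype form: an illustrative `ContentHash` whose `blake3:` prefix check is inferred from the hashes shown in the API examples, not taken from the crate:

```rust
// Illustrative newtype: the private field means the constructor is the only
// way in, so every ContentHash in circulation satisfies the invariant.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ContentHash(String);

impl ContentHash {
    pub fn parse(s: &str) -> Result<Self, String> {
        // Assumed format: "blake3:" prefix followed by non-empty hex.
        let hex = s.strip_prefix("blake3:").ok_or("missing blake3: prefix")?;
        if hex.is_empty() || !hex.chars().all(|c| c.is_ascii_hexdigit()) {
            return Err("hash payload must be non-empty hex".into());
        }
        Ok(ContentHash(s.to_string()))
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

fn main() {
    assert!(ContentHash::parse("blake3:af13").is_ok());
    assert!(ContentHash::parse("sha256:af13").is_err());
    assert!(ContentHash::parse("blake3:zz").is_err());
    println!("ok");
}
```

Downstream code takes `ContentHash`, not `&str`, and never re-checks: the type itself carries the proof of validity.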
Testing Requirements
- Every public function has at least one test
- Property-based tests (`proptest`) for data transformation functions
- Snapshot tests (`insta`) for serialization formats
- Determinism tests: same seed = byte-identical output
Commit & PR Conventions
Commit Messages
Format:
<type>(<scope>): <description>
<body — explains WHY, not what>
Types: `feat`, `fix`, `refactor`, `test`, `docs`, `perf`, `chore`
Scope: the crate name in parens: `feat(frontier):`, `fix(envelope):`, `refactor(storage):`
Examples:
feat(frontier): add crawl resumption via frontier save/load
Enables stopping and restarting crawls without losing state.
The frontier serializes its complete state (host queues, seen set,
politeness timestamps) to JSON and restores it on load.
fix(fetch): strip script/style content before link extraction
Link extraction was producing junk URLs from minified JavaScript.
Stripping <script> and <style> tags before scanning eliminates
false positives without affecting real links.
Breaking changes: prefix the body with BREAKING:
Special requirements:
- Commits touching `fetch`/`artifact`/`replay` must include a replay fidelity test
- Commits touching `frontier`/`envelope` must include a determinism test
Pull Requests
- One concern per PR. No bundled drive-bys.
- Must include tests exercising the invariant being changed
- Benchmark before/after for performance-sensitive paths
- `cargo clippy -- -D warnings` and `cargo test` must pass
- New dependencies require justification in the PR description
Dependency Policy
Minimize external dependencies — every dep is attack surface.
Approved:
- `tokio` — async runtime
- `reqwest`/`hyper` — HTTP
- `serde` + `serde_json` — serialization
- `blake3` — content hashing
- `chrono` — temporal types
- `tracing` — structured observability
Forbidden in core crates:
- `rand` — use `CrawlSeed` for all randomness
- `anyhow`/`eyre` — use typed `PalimpsestError`
Process:
- Pin all versions in `Cargo.lock` (committed)
- Run `cargo audit` before merging new deps
- No build scripts that download or execute external code
Error Taxonomy
Every failure in Palimpsest is classified into exactly one category. No silent retries. No swallowed errors. Failures are stored artifacts — they are part of the crawl record, not noise to discard.
PalimpsestError
The top-level error enum. Every error in the system ultimately maps to one of these seven variants.
Network
PalimpsestError::Network(String)
Connection failures, DNS resolution errors, TCP timeouts, TLS handshake failures. The fetch could not reach the server.
Examples: DNS NXDOMAIN, connection refused, connect timeout, TLS certificate expired.
Protocol
PalimpsestError::Protocol(String)
HTTP protocol violations. The server responded, but the response is malformed or violates HTTP semantics.
Examples: Invalid status line, malformed headers, truncated chunked encoding, invalid Content-Length.
Rendering
PalimpsestError::Rendering(String)
Browser/DOM errors. Chrome launched but could not render the page correctly.
Examples: JavaScript execution error, page load timeout, CDP connection lost, DOM snapshot failure.
Policy
PalimpsestError::Policy(String)
The system refused to process a URL based on configured policy.
Examples: robots.txt disallow, scope violation (URL outside configured domain), rate limit enforcement, max depth exceeded.
DeterminismViolation
PalimpsestError::DeterminismViolation {
    context: String,
    expected: String,
    actual: String,
}
The nuclear option. This means a Law was broken. Two runs with the same seed produced different results. This should never happen in production — if it does, it’s a bug in the kernel.
Examples: Frontier produced different ordering for same seed, content hash mismatch for identical input, replay diverged from original.
Storage
```rust
PalimpsestError::Storage(String)
```
Blob store failures: write errors, read errors, integrity check failures, backend unavailability.
Examples: Disk full, permission denied, blob corrupted (hash mismatch on read), S3 connection error.
Replay
```rust
PalimpsestError::Replay(String)
```
Missing artifacts, incomplete capture groups, reconstruction failures. The stored data is insufficient to replay.
Examples: Missing blob for content hash, no envelope record in WARC, incomplete resource graph.
Other Error Types
| Error | Crate | Variants |
|---|---|---|
| `StorageError` | `palimpsest-storage` | `Backend`, `NotFound`, `IntegrityError` |
| `EnvelopeError` | `palimpsest-envelope` | `MissingSeed`, `MissingTimestamp`, `MissingTargetUrl`, `MissingDnsSnapshot` |
| `CaptureGroupError` | `palimpsest-artifact` | `MissingUrl`, `MissingTimestamp`, `MissingCrawlContext`, `MissingEnvelope`, `MissingRequest`, `MissingResponse` |
| `FrontierPersistError` | `palimpsest-frontier` | Wraps serialization/IO errors |
| `IndexError` | `palimpsest-index` | Wraps SQLite errors |
| `WarcWriteError` | `palimpsest-artifact` | Wraps IO/format errors |
| `VectorStoreError` | `palimpsest-embed` | Wraps SQLite errors |
API Quick Reference
Frontier API (default port 8090)
POST /seeds
Seed the frontier with URLs to crawl.
```bash
curl -X POST http://localhost:8090/seeds \
  -H 'Content-Type: application/json' \
  -d '{"urls": ["https://example.com/", "https://docs.example.com/"]}'
```
Response: `{"accepted": 2}`
POST /pop
Pop the next URL from the frontier.
```bash
curl -X POST http://localhost:8090/pop \
  -H 'Content-Type: application/json' \
  -d '{}'
```
Response: `{"url": "https://example.com/", "depth": 0, "priority": 0}`
Returns `{"url": null}` when the frontier is empty.
POST /discovered
Push discovered URLs back to the frontier.
```bash
curl -X POST http://localhost:8090/discovered \
  -H 'Content-Type: application/json' \
  -d '{"urls": [{"url": "https://example.com/page", "depth": 1, "parent_hash": "af1349b9..."}]}'
```
Response: `{"accepted": 1}`
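A worker's crawl loop alternates between `/pop`, fetching, and pushing links back via `/discovered`. The helper below formats the documented `/discovered` request body; it is a std-only sketch (no HTTP client, no escaping of quotes in inputs), and the function name is illustrative:

```rust
/// Build the JSON body for POST /discovered, matching the documented shape.
/// Std-only sketch: a real worker would use an HTTP client and a JSON library,
/// and would escape the URL and hash values properly.
fn discovered_body(url: &str, depth: u32, parent_hash: &str) -> String {
    format!(
        r#"{{"urls": [{{"url": "{}", "depth": {}, "parent_hash": "{}"}}]}}"#,
        url, depth, parent_hash
    )
}

fn main() {
    // One discovered link at depth 1, keyed to its parent's content hash.
    println!("{}", discovered_body("https://example.com/page", 1, "af1349b9"));
}
```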
GET /status
```bash
curl http://localhost:8090/status
```
Response: `{"queue_size": 1234, "seen_count": 5678, "host_count": 42, "seed_value": 42}`
GET /health
```bash
curl http://localhost:8090/health
```
Response: `"ok"`
Retrieval API (default port 8080)
GET /v1/content
```bash
curl "http://localhost:8080/v1/content?url=https://example.com/"
```
Returns raw captured content.
GET /v1/chunks
```bash
curl "http://localhost:8080/v1/chunks?url=https://example.com/"
```
Returns RAG chunks with provenance (`chunk_hash`, `source_hash`, `captured_at`, `char_offset`).
GET /v1/history
```bash
curl "http://localhost:8080/v1/history?url=https://example.com/"
```
Returns all captures with timestamps and content hashes.
GET /v1/search
```bash
curl "http://localhost:8080/v1/search?q=example+domain"
```
Returns matching content across all captured pages.
GET /metrics
```bash
curl http://localhost:8080/metrics
```
Returns Prometheus text exposition format.
GET /health
```bash
curl http://localhost:8080/health
```
Response: `"ok"`
Glossary
Core Types
Crawl Kernel — The deterministic execution engine at the heart of Palimpsest. Schedules fetches, seals execution contexts, captures artifacts, stores blobs, indexes temporal state. Not a crawler — a kernel that crawlers are built on.
CrawlSeed — A 64-bit value that controls all randomness in the system. `CrawlSeed::rng()` returns a `ChaCha8Rng` PRNG. Same seed = identical behavior.
ContentHash — A 32-byte BLAKE3 hash. Used to address, store, retrieve, and verify every artifact. `ContentHash::of(data)` computes the hash.
CaptureInstant — A paired timestamp: wall clock (`DateTime<Utc>`) + logical clock (`u64`). Binds captures to both real-world time and crawl-internal ordering.
CrawlContextId — An opaque `u64` identifier for a crawl session. Distinguishes captures from different runs.
ExecutionEnvelope — An immutable, sealed record of everything that affects a fetch: seed, timestamp, target URL, DNS snapshot, TLS fingerprint, browser config, and headers. Constructed via `EnvelopeBuilder`, frozen after `build()`.
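The "same seed = identical behavior" property of `CrawlSeed` can be demonstrated with any seeded PRNG. The sketch below uses a std-only splitmix64 generator as a stand-in for the `ChaCha8Rng` that `CrawlSeed::rng()` actually returns; the type name is illustrative:

```rust
/// Std-only stand-in for a seeded PRNG (splitmix64, not the ChaCha8Rng
/// Palimpsest uses) to illustrate "same seed = identical behavior".
struct SeededRng {
    state: u64,
}

impl SeededRng {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    // splitmix64 step: every operation is a bijection on u64, so
    // different seeds are guaranteed to produce different first outputs.
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

fn main() {
    let mut a = SeededRng::new(42);
    let mut b = SeededRng::new(42);
    // Same seed: the two streams are identical, step for step.
    assert!((0..10).all(|_| a.next_u64() == b.next_u64()));
}
```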
Frontier & Scheduling
Frontier — The deterministic URL scheduler. Maintains per-host priority queues in a `BTreeMap`, deduplicates by URL, and enforces politeness delays.
FrontierEntry — A URL in the frontier with depth, priority, and parent hash.
PolitenessPolicy — Configurable per-host rate limiting: minimum delay between requests and maximum concurrent hosts.
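The per-host minimum-delay side of politeness can be pictured as a readiness check against the last fetch time for each host. This is a sketch under assumed names and a millisecond clock, not the crate's actual types:

```rust
use std::collections::HashMap;

/// Minimal sketch of per-host politeness: a host is ready when at least
/// `min_delay_ms` has elapsed since its last fetch. Names are illustrative.
struct PolitenessPolicy {
    min_delay_ms: u64,
}

#[derive(Default)]
struct HostClock {
    last_fetch_ms: HashMap<String, u64>,
}

impl HostClock {
    fn ready(&self, host: &str, now_ms: u64, policy: &PolitenessPolicy) -> bool {
        match self.last_fetch_ms.get(host) {
            Some(last) => now_ms.saturating_sub(*last) >= policy.min_delay_ms,
            None => true, // never fetched: always ready
        }
    }

    fn mark_fetched(&mut self, host: &str, now_ms: u64) {
        self.last_fetch_ms.insert(host.to_string(), now_ms);
    }
}

fn main() {
    let policy = PolitenessPolicy { min_delay_ms: 500 };
    let mut clock = HostClock::default();
    clock.mark_fetched("example.com", 1_000);
    assert!(!clock.ready("example.com", 1_400, &policy)); // only 400 ms elapsed
    assert!(clock.ready("example.com", 1_500, &policy)); // delay satisfied
}
```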
Artifacts & WARC
WARC++ — Palimpsest’s extension of the ISO 28500 WARC format. Adds `envelope`, `dom-snapshot`, `resource-graph`, and `timing` record types while maintaining backward compatibility.
WarcRecord — A single WARC record with type, record ID, content hash, and payload.
CaptureGroup — A bundle of related WARC records from a single fetch: envelope + request + response + optional DOM/resource graph/timing.
RecordType — Enum of WARC record types: 5 standard (`warcinfo`, `request`, `response`, `resource`, `metadata`) + 4 extensions (`envelope`, `dom-snapshot`, `resource-graph`, `timing`).
DomSnapshot — The rendered DOM state after JavaScript execution, captured via CDP.
ResourceGraph — The dependency graph of all sub-resources loaded for a page, with type, hash, initiator, and load ordering.
Storage
BlobStore — The trait interface for content-addressed storage. Implementations: `InMemoryBlobStore`, `FileSystemBlobStore`, `ObjectStoreBlobStore`.
Content-Addressed Storage — Storage where the key is the hash of the content. Same content = same key = stored once. Integrity is verifiable by recomputing the hash.
Deduplication — Structural dedup: if `ContentHash::of(data_a) == ContentHash::of(data_b)`, the data is stored once. Not a post-process step — built into the storage model.
Index & Replay
Temporal Index — A multi-dimensional index mapping URL x time x hash x crawl_context. Not a lookup table — a queryable graph of web history.
IndexEntry — One capture record in the index: URL, `CaptureInstant`, `ContentHash`, `CrawlContextId`.
Replay Fidelity — The guarantee that stored artifacts are sufficient to reconstruct the original HTTP exchange, DOM state, and resource graph. Law 5.
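The URL x time x hash shape of the temporal index can be sketched with an ordered map keyed by `(url, logical_time)`; a range scan over one URL then yields its capture history, which is essentially what a `/v1/history` query returns. Palimpsest's real index is SQLite-backed, and hex strings stand in for BLAKE3 hashes here; all names are illustrative:

```rust
use std::collections::BTreeMap;

/// Sketch of a temporal index over (url, logical_time) -> content hash.
struct TemporalIndex {
    entries: BTreeMap<(String, u64), String>,
}

impl TemporalIndex {
    fn new() -> Self {
        Self { entries: BTreeMap::new() }
    }

    fn record(&mut self, url: &str, logical_time: u64, hash: &str) {
        self.entries
            .insert((url.to_string(), logical_time), hash.to_string());
    }

    /// All captures of one URL, ordered by logical time.
    fn history(&self, url: &str) -> Vec<(u64, String)> {
        let lo = (url.to_string(), u64::MIN);
        let hi = (url.to_string(), u64::MAX);
        self.entries
            .range(lo..=hi)
            .map(|((_, t), h)| (*t, h.clone()))
            .collect()
    }
}

fn main() {
    let mut idx = TemporalIndex::new();
    idx.record("https://example.com/", 1, "af1349b9");
    idx.record("https://example.com/", 7, "d4735e3a");
    idx.record("https://other.com/", 3, "4e074085");
    // Only example.com's two captures come back, in time order.
    assert_eq!(idx.history("https://example.com/").len(), 2);
}
```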
Comparison & Analysis
Shadow Comparison — Side-by-side validation of Palimpsest output against legacy crawler WARC files (Heritrix, wget, Warcprox).
ContentChunk — A provenance-tagged text chunk for RAG pipelines. Carries `source_url`, `captured_at`, `source_hash`, `chunk_hash`, and `char_offset`.
Embedding — A vector of `f32` values representing text semantics. Generated by an `EmbeddingProvider`.
EmbeddingProvider — The trait for embedding generation. `HashEmbedder` provides deterministic test embeddings via BLAKE3.
VectorStore — SQLite-backed storage for embeddings with brute-force cosine similarity search.
Cosine Similarity — The similarity metric between two embedding vectors. Range: -1.0 to 1.0. Used for semantic search.
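Cosine similarity is small enough to state directly. A sketch of the metric as used for brute-force semantic search (the function name and the zero-vector convention are assumptions):

```rust
/// Cosine similarity between two embedding vectors. Returns a value in
/// [-1.0, 1.0]; zero-magnitude inputs map to 0.0 by convention here.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embeddings must have equal dimensions");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

fn main() {
    // Parallel vectors: 1.0; orthogonal: 0.0; opposite: -1.0.
    assert!((cosine_similarity(&[1.0, 0.0], &[1.0, 0.0]) - 1.0).abs() < 1e-6);
    assert!(cosine_similarity(&[1.0, 0.0], &[0.0, 1.0]).abs() < 1e-6);
    assert!((cosine_similarity(&[1.0, 0.0], &[-1.0, 0.0]) + 1.0).abs() < 1e-6);
}
```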
Simulation
SimulatedWeb — A virtual internet for testing. Hosts multiple `UniverseGenerator` instances, each responding to URLs on its domain.
UniverseGenerator — The trait for generating deterministic responses. Implementations: `LinkMaze`, `EncodingHell`, `MalformedDom`, `RedirectLabyrinth`, `ContentTrap`, `TemporalDrift`.
Adversarial Universe — A simulation universe designed to stress a specific aspect of the crawl kernel (encoding, DOM parsing, redirects, spider traps, temporal changes).
Anti-Detection
JA3 — TLS fingerprinting method that hashes five fields from the ClientHello: TLS version, cipher suites, extensions, supported groups, EC point formats. Legacy but still deployed by WAFs.
JA4 — Current TLS fingerprinting standard (FoxIO). Sorts before hashing to defeat extension randomization. Three sections: header, sorted cipher hash, sorted extension hash.
BoringSSL — Google’s fork of OpenSSL used by Chrome. Palimpsest uses it (via `wreq`) for full ClientHello control, enabling browser-grade TLS impersonation.
Akamai h2 Fingerprint — Passive HTTP/2 fingerprint capturing SETTINGS frame values/order, WINDOW_UPDATE, PRIORITY frames, and pseudo-header ordering. Distinguishes browsers from automation clients.
CDP Stealth Mode — Anti-detection suite for headless Chrome. 17 evasion patches covering `navigator.webdriver`, `window.chrome`, plugins, WebGL, canvas noise, AudioContext noise, and more.
BrowserProfile — A unified, internally consistent browser identity tying TLS fingerprint + HTTP/2 settings + HTTP headers + JS surface into a single profile. Prevents cross-layer detection mismatches.
ProfileMode — Controls how browser profiles are selected: `None` (default), `Fixed` (one profile), `Seeded` (deterministic from `CrawlSeed`), `RotatePerDomain` (per-domain via BLAKE3).
WebdriverValue — Explicit config for `navigator.webdriver` in stealth mode. `False` (matches real Chrome, default) or `Undefined` (property deleted). Auditable, not hidden.
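The `RotatePerDomain` idea — the same (seed, domain) pair always selecting the same profile — can be sketched with any deterministic hash. Std's `DefaultHasher` stands in for the BLAKE3 derivation Palimpsest uses, and the function and profile names are illustrative:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Sketch of RotatePerDomain-style selection: hash (seed, domain) and
/// index into the profile list. DefaultHasher stands in for BLAKE3;
/// names are illustrative.
fn profile_for_domain<'a>(seed: u64, domain: &str, profiles: &[&'a str]) -> &'a str {
    let mut h = DefaultHasher::new();
    seed.hash(&mut h);
    domain.hash(&mut h);
    profiles[(h.finish() % profiles.len() as u64) as usize]
}

fn main() {
    let profiles = ["chrome-120", "chrome-121", "firefox-122"];
    // Deterministic: re-running with the same seed picks the same profile
    // for the same domain, so replays see a consistent browser identity.
    assert_eq!(
        profile_for_domain(42, "example.com", &profiles),
        profile_for_domain(42, "example.com", &profiles)
    );
}
```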