Data Flow

This chapter traces a single URL through the entire Palimpsest system, from seed to replay.

1. Seed URL Enters the Frontier

let seed = CrawlSeed::new(42);
let mut frontier = Frontier::new(seed, PolitenessPolicy::default_policy());
frontier.push_seed(Url::parse("https://example.com/").unwrap());

The frontier deduplicates by URL string and buckets entries by host.
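
The deduplication and bucketing described above can be sketched with standard collections. This is a minimal illustration, not Palimpsest's actual `Frontier`: `SketchFrontier` and `host_of` are hypothetical names, and the host extraction is a naive string split that ignores userinfo and ports.

```rust
use std::collections::{BTreeMap, HashSet, VecDeque};

/// Hypothetical stand-in for the real frontier; only dedup and host bucketing are modelled.
#[derive(Default)]
struct SketchFrontier {
    /// Exact URL strings that have already been queued.
    seen: HashSet<String>,
    /// Host -> pending URLs for that host, in arrival order.
    buckets: BTreeMap<String, VecDeque<String>>,
}

/// Naive host extraction: the text between "://" and the next '/', '?', or '#'.
fn host_of(url: &str) -> Option<&str> {
    let rest = url.split_once("://")?.1;
    let end = rest
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .unwrap_or(rest.len());
    let host = &rest[..end];
    if host.is_empty() { None } else { Some(host) }
}

impl SketchFrontier {
    /// Returns true if the URL was newly queued, false for a duplicate or a URL without a host.
    fn push(&mut self, url: &str) -> bool {
        let host = match host_of(url) {
            Some(h) => h.to_string(),
            None => return false,
        };
        // HashSet::insert returns false when the exact string was already present.
        if !self.seen.insert(url.to_string()) {
            return false;
        }
        self.buckets.entry(host).or_default().push_back(url.to_string());
        true
    }
}
```

Pushing `https://example.com/` twice queues it once; pushing `https://example.com/about` afterwards adds a second entry to the same `example.com` bucket.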

2. Frontier Dequeues (Deterministic Ordering)

let entry: FrontierEntry = frontier.pop(now).unwrap();
// entry.url = "https://example.com/"
// entry.depth = 0
// entry.priority = 0

The dequeue order is deterministic: hosts are rotated via a seeded Fisher-Yates shuffle, and within each host, entries are ordered by priority then depth.
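
To make the ordering rule concrete, here is a small sketch of a seeded Fisher-Yates shuffle and a per-host sort. Several things are assumptions for illustration: SplitMix64 stands in for whatever generator Palimpsest derives from the crawl seed, lower priority numbers are assumed to dequeue first, and the URL is used as a final tie-break. The names `SplitMix64`, `seeded_shuffle`, `Entry`, and `order_within_host` are hypothetical.

```rust
/// SplitMix64: a small, well-known PRNG, used here only as a stand-in seeded source.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Fisher-Yates shuffle driven by a seed: the same seed and input always give the same order.
fn seeded_shuffle<T>(items: &mut [T], seed: u64) {
    let mut rng = SplitMix64(seed);
    for i in (1..items.len()).rev() {
        // Pick j in 0..=i. Modulo bias is ignored in this sketch.
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        items.swap(i, j);
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
struct Entry {
    url: String,
    priority: u32,
    depth: u32,
}

/// Orders one host's entries by priority, then depth, then URL (assumed tie-break).
fn order_within_host(entries: &mut [Entry]) {
    entries.sort_by(|a, b| {
        (a.priority, a.depth, &a.url).cmp(&(b.priority, b.depth, &b.url))
    });
}
```

For the shuffle to be reproducible, the host list has to start in a canonical order (for example, sorted) before it is shuffled; otherwise the same seed applied to differently ordered input yields a different rotation.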

3. ExecutionEnvelope Seals the Context

let envelope = EnvelopeBuilder::new()
    .seed(seed)
    .timestamp(CaptureInstant::new(wall_time, logical_clock))
    .target_url(entry.url.clone())
    .dns_snapshot(DnsSnapshot {
        host: "example.com".into(),
        addrs: vec!["93.184.216.34".into()],
        ttl: 300,
    })
    .build()?;

The envelope is immutable after construction. It captures everything needed to reproduce this fetch.
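
One common way to get "immutable after construction" in Rust is a consuming builder that produces a value with private fields and read-only accessors. The sketch below illustrates that pattern under simplifying assumptions: `SketchEnvelope` and `SketchEnvelopeBuilder` are hypothetical, carry only three fields, and use a plain `String` error rather than whatever error type the real `EnvelopeBuilder` returns.

```rust
/// Hypothetical, reduced envelope. Fields are private to the module and there are
/// no setters, so callers elsewhere can only read a built value.
#[derive(Debug, Clone, PartialEq, Eq)]
struct SketchEnvelope {
    seed: u64,
    logical_clock: u64,
    target_url: String,
}

impl SketchEnvelope {
    fn seed(&self) -> u64 { self.seed }
    fn logical_clock(&self) -> u64 { self.logical_clock }
    fn target_url(&self) -> &str { &self.target_url }
}

/// Collects fields one at a time; every field starts out unset.
#[derive(Default)]
struct SketchEnvelopeBuilder {
    seed: Option<u64>,
    logical_clock: Option<u64>,
    target_url: Option<String>,
}

impl SketchEnvelopeBuilder {
    fn seed(mut self, seed: u64) -> Self {
        self.seed = Some(seed);
        self
    }

    fn logical_clock(mut self, tick: u64) -> Self {
        self.logical_clock = Some(tick);
        self
    }

    fn target_url(mut self, url: &str) -> Self {
        self.target_url = Some(url.to_string());
        self
    }

    /// Consumes the builder and fails if any required field is missing.
    fn build(self) -> Result<SketchEnvelope, String> {
        Ok(SketchEnvelope {
            seed: self.seed.ok_or("missing seed")?,
            logical_clock: self.logical_clock.ok_or("missing logical clock")?,
            target_url: self.target_url.ok_or("missing target URL")?,
        })
    }
}
```

Because `build` takes `self` by value, the builder cannot be reused to alter an envelope that has already been produced, and a build with a missing field returns an error instead of a partially filled envelope.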

4. Fetch Executes

let fetcher = HttpFetcher::with_defaults()?;
let result: FetchResult = fetcher.fetch(&envelope).await?;

For browser mode, BrowserFetcher launches headless Chrome with determinism overrides and captures DOM + sub-resources via CDP.

5. Link Extraction

let links: Vec<Url> = extract_links(&html_body, &entry.url);
for link in links {
    frontier.push_discovered(link, entry.depth + 1, content_hash);
}

Links are extracted from HTML (after stripping <script> and <style> tags), normalized (fragments stripped, query params sorted, default ports removed), deduplicated, and sorted for determinism.
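
The normalization rules listed above (fragment stripping, query-parameter sorting, default-port removal, then deduplication and sorting) can be sketched with plain string handling. This is an illustrative approximation, not Palimpsest's `extract_links`: `normalize` and `normalize_all` are hypothetical helpers, they accept only absolute `http`/`https` URLs, and they do not model `<script>`/`<style>` stripping, relative-URL resolution, or percent-encoding.

```rust
use std::collections::BTreeSet;

/// Normalizes an absolute http(s) URL string. Returns None for anything else.
fn normalize(url: &str) -> Option<String> {
    // 1. Strip the fragment: keep everything before the first '#'.
    let without_fragment = url.split('#').next()?;

    // 2. Split off the scheme and look up its default port.
    let (scheme, rest) = without_fragment.split_once("://")?;
    let default_port = match scheme {
        "http" => ":80",
        "https" => ":443",
        _ => return None,
    };

    // 3. Separate the authority from the path and query, then drop a default port.
    let (authority, path_and_query) = match rest.find(|c: char| c == '/' || c == '?') {
        Some(i) => rest.split_at(i),
        None => (rest, ""),
    };
    let authority = authority.strip_suffix(default_port).unwrap_or(authority);

    // 4. Sort query parameters as whole "key=value" strings.
    let (path, query) = match path_and_query.split_once('?') {
        Some((p, q)) => (p, Some(q)),
        None => (path_and_query, None),
    };
    let mut out = format!("{scheme}://{authority}{path}");
    if let Some(q) = query {
        let mut params: Vec<&str> = q.split('&').filter(|p| !p.is_empty()).collect();
        params.sort();
        if !params.is_empty() {
            out.push('?');
            out.push_str(&params.join("&"));
        }
    }
    Some(out)
}

/// Normalizes every link, drops the ones that fail, and returns them deduplicated and sorted.
fn normalize_all(links: &[&str]) -> Vec<String> {
    let unique: BTreeSet<String> = links.iter().filter_map(|l| normalize(l)).collect();
    unique.into_iter().collect()
}
```

Collecting into a `BTreeSet` gives deduplication and a stable sorted order in one step, so two pages that list the same links in different orders feed the frontier identically.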

6. Artifact Creation

let record = WarcRecord::new(
    RecordType::Response,
    "application/http;msgtype=response".into(),
    response_bytes,
);
assert!(record.verify_integrity()); // BLAKE3 hash matches payload

The CaptureGroup bundles the envelope record, request record, response record, and optional DOM/resource-graph/timing records.

7. Content-Addressed Storage

let hash: ContentHash = store.put(response_bytes).await?;
// hash = blake3(response_bytes)
// Stored at: blobs/af/1349b9f5f9a1a6a0404dea36dcc949...

If a blob with the same hash already exists, the write is a no-op (structural deduplication).
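
A minimal in-memory sketch of this behaviour follows. It is an illustration under stated assumptions: `SketchStore` is hypothetical, FNV-1a (64-bit) stands in for BLAKE3 so the example needs no external crates, and the path helper only mirrors the two-character shard layout shown in the comment above. FNV-1a is not collision-resistant and would not be suitable for a real content-addressed store.

```rust
use std::collections::HashMap;

/// FNV-1a 64-bit: a tiny non-cryptographic hash standing in for BLAKE3 in this sketch.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    for b in bytes {
        h ^= u64::from(*b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
    }
    h
}

/// Hypothetical in-memory content-addressed store.
#[derive(Default)]
struct SketchStore {
    /// Hex digest -> blob bytes.
    blobs: HashMap<String, Vec<u8>>,
    /// Number of blobs actually written (duplicates do not count).
    writes: usize,
}

impl SketchStore {
    /// Shard layout: the first two hex characters form a directory, the rest the file name.
    /// Expects a digest produced by `put` (16 ASCII hex characters).
    fn path_for(hex: &str) -> String {
        format!("blobs/{}/{}", &hex[..2], &hex[2..])
    }

    /// Stores the bytes under their digest; a second put of identical bytes writes nothing.
    fn put(&mut self, bytes: &[u8]) -> String {
        let hex = format!("{:016x}", fnv1a64(bytes));
        if !self.blobs.contains_key(&hex) {
            self.blobs.insert(hex.clone(), bytes.to_vec());
            self.writes += 1;
        }
        hex
    }

    fn get(&self, hex: &str) -> Option<&[u8]> {
        self.blobs.get(hex).map(|v| v.as_slice())
    }
}
```

Storing the same response body twice returns the same digest both times while `writes` stays at 1, which is the deduplication property the text describes: identical content captured from different URLs or at different times occupies one blob.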

8. Temporal Index Insert

index.insert(IndexEntry::new(
    entry.url.clone(),
    envelope.timestamp(),
    hash,
    CrawlContextId(1),
))?;

The index records this capture in four dimensions: URL, time, content hash, and crawl context.
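
As a rough picture of how such an index can answer "what did this URL look like at time t", the sketch below keys entries by URL, timestamp, and crawl context in an ordered map. It is a simplified assumption, not Palimpsest's index: `SketchIndex` and `SketchIndexEntry` are hypothetical, timestamps are plain integers, and lookups by content hash or crawl context alone would need additional secondary maps that are not shown.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
struct SketchIndexEntry {
    url: String,
    timestamp: u64, // logical capture time
    content_hash: String,
    crawl_context: u32,
}

/// Entries keyed by (url, timestamp, crawl context), so one URL's captures are
/// contiguous in the map and ordered by time.
#[derive(Default)]
struct SketchIndex {
    by_url_time: BTreeMap<(String, u64, u32), SketchIndexEntry>,
}

impl SketchIndex {
    fn insert(&mut self, e: SketchIndexEntry) {
        let key = (e.url.clone(), e.timestamp, e.crawl_context);
        self.by_url_time.insert(key, e);
    }

    /// All captures of `url`, oldest first.
    fn for_url(&self, url: &str) -> Vec<&SketchIndexEntry> {
        let lo = (url.to_string(), u64::MIN, u32::MIN);
        let hi = (url.to_string(), u64::MAX, u32::MAX);
        self.by_url_time.range(lo..=hi).map(|(_, e)| e).collect()
    }

    /// The most recent capture of `url` taken at or before time `t`.
    fn latest_at(&self, url: &str, t: u64) -> Option<&SketchIndexEntry> {
        let lo = (url.to_string(), u64::MIN, u32::MIN);
        let hi = (url.to_string(), t, u32::MAX);
        self.by_url_time.range(lo..=hi).next_back().map(|(_, e)| e)
    }
}
```

Because the map is ordered, both queries are range scans over a contiguous slice of keys rather than a filter over every entry.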

9. WARC++ Output

write_warc_file(&path, &capture_group.all_records()).await?;

The WARC++ file contains standard ISO 28500 records plus Palimpsest extensions (envelope, dom-snapshot, resource-graph, timing). Standard WARC readers can parse the basic records; Palimpsest readers get the full execution context.
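
The compatibility claim can be pictured as a reader that keeps the record types it recognizes and skips the rest. The sketch below assumes the extension names from the paragraph above are the literal `WARC-Type` values Palimpsest writes, which the text does not confirm; `classify` and `standard_view` are hypothetical helpers, and the standard list is the eight record types defined by the WARC format.

```rust
/// The eight record types defined by the WARC standard.
const STANDARD_TYPES: [&str; 8] = [
    "warcinfo", "response", "resource", "request",
    "metadata", "revisit", "conversion", "continuation",
];

/// Extension names taken from the text; treating them as literal WARC-Type
/// values is an assumption of this sketch.
const EXTENSION_TYPES: [&str; 4] = ["envelope", "dom-snapshot", "resource-graph", "timing"];

#[derive(Debug, PartialEq, Eq)]
enum Kind {
    Standard,
    Extension,
    Unknown,
}

/// Classifies a WARC-Type header value.
fn classify(warc_type: &str) -> Kind {
    if STANDARD_TYPES.iter().any(|t| *t == warc_type) {
        Kind::Standard
    } else if EXTENSION_TYPES.iter().any(|t| *t == warc_type) {
        Kind::Extension
    } else {
        Kind::Unknown
    }
}

/// What a reader that only knows the standard types would keep, in file order.
fn standard_view<'a>(types: &[&'a str]) -> Vec<&'a str> {
    types
        .iter()
        .copied()
        .filter(|t| classify(t) == Kind::Standard)
        .collect()
}
```

Given a capture group written as `envelope`, `request`, `response`, `dom-snapshot`, `timing`, a standard-only reader in this model sees just `request` and `response`, while a Palimpsest-aware reader sees all five.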

10. Replay

let content = store.get(&hash).await?;
let entries = index.query(&IndexQuery::for_url(&url))?;

Replay retrieves the stored blob and execution envelope, then reconstructs the original HTTP exchange, DOM state, and resource dependency graph. Given the same envelope and the same artifacts, replay produces bit-identical output.