Data Flow
This chapter traces a single URL through the entire Palimpsest system, from seed to replay.
1. Seed URL Enters the Frontier
let seed = CrawlSeed::new(42);
let mut frontier = Frontier::new(seed, PolitenessPolicy::default_policy());
frontier.push_seed(Url::parse("https://example.com/").unwrap());
The frontier deduplicates by URL string and buckets entries by host.
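The deduplication and host bucketing described above can be sketched with standard collections. This is a minimal illustration, not Palimpsest's implementation: `MiniFrontier`, `host_of`, `push`, `host_count`, and `queued_for` are hypothetical names, and the host is extracted with simple string handling (no IPv6 literals) where a real frontier would rely on a URL parser.

```rust
use std::collections::{BTreeMap, HashSet, VecDeque};

/// Hypothetical stand-in for the frontier; only dedup and host bucketing are modeled.
#[derive(Default)]
pub struct MiniFrontier {
    /// Every URL string ever accepted, used for deduplication.
    seen: HashSet<String>,
    /// One FIFO queue per host. BTreeMap keeps host iteration order stable.
    by_host: BTreeMap<String, VecDeque<String>>,
}

/// Extracts the host from an absolute URL using plain string handling.
/// Assumption: a real implementation would use a proper URL parser.
pub fn host_of(url: &str) -> Option<&str> {
    let rest = url.split_once("://")?.1;
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    // Drop any userinfo, then any port.
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

impl MiniFrontier {
    /// Returns true if the URL was new and queued, false if it was a
    /// duplicate or had no recognizable host.
    pub fn push(&mut self, url: &str) -> bool {
        let Some(host) = host_of(url) else {
            return false;
        };
        if !self.seen.insert(url.to_string()) {
            return false; // exact URL string already seen
        }
        self.by_host
            .entry(host.to_string())
            .or_default()
            .push_back(url.to_string());
        true
    }

    /// Number of distinct hosts with at least one accepted URL.
    pub fn host_count(&self) -> usize {
        self.by_host.len()
    }

    /// Number of URLs queued for the given host.
    pub fn queued_for(&self, host: &str) -> usize {
        self.by_host.get(host).map_or(0, |q| q.len())
    }
}
```

Pushing `https://example.com/` twice queues it once, while `https://example.com/about` lands in the same host bucket as a separate entry.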
2. Frontier Dequeues (Deterministic Ordering)
let entry: FrontierEntry = frontier.pop(now).unwrap();
// entry.url = "https://example.com/"
// entry.depth = 0
// entry.priority = 0
The dequeue order is deterministic: hosts are rotated via a seeded Fisher-Yates shuffle, and within each host, entries are ordered by priority then depth.
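To make the ordering rule concrete, here is a small sketch of a seeded Fisher-Yates shuffle and a priority-then-depth sort. Several things are assumptions for illustration: the SplitMix64 generator stands in for whatever PRNG Palimpsest derives from `CrawlSeed`, lower `priority` values are taken to dequeue first, and the URL is used as a final tie-break so the order is total. The names `SplitMix64`, `seeded_shuffle`, `Entry`, and `order_within_host` are hypothetical.

```rust
/// SplitMix64, a small well-known PRNG, used here only as a stand-in generator.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        SplitMix64(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Fisher-Yates shuffle driven by a seeded generator:
/// the same seed and the same input always give the same permutation.
pub fn seeded_shuffle<T>(items: &mut [T], seed: u64) {
    let mut rng = SplitMix64::new(seed);
    for i in (1..items.len()).rev() {
        // Modulo reduction has a slight bias, which is acceptable in a sketch.
        let j = (rng.next_u64() % (i as u64 + 1)) as usize;
        items.swap(i, j);
    }
}

/// Hypothetical, reduced frontier entry.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Entry {
    pub url: String,
    pub priority: u32,
    pub depth: u32,
}

/// Orders one host's entries by priority, then depth.
/// Assumptions: lower values come first; the URL breaks remaining ties.
pub fn order_within_host(entries: &mut [Entry]) {
    entries.sort_by(|a, b| {
        (a.priority, a.depth, &a.url).cmp(&(b.priority, b.depth, &b.url))
    });
}
```

Because neither function reads wall-clock time or ambient randomness, two runs with the same seed and the same frontier contents visit hosts and entries in the same order.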
3. ExecutionEnvelope Seals the Context
let envelope = EnvelopeBuilder::new()
    .seed(seed)
    .timestamp(CaptureInstant::new(wall_time, logical_clock))
    .target_url(entry.url.clone())
    .dns_snapshot(DnsSnapshot {
        host: "example.com".into(),
        addrs: vec!["93.184.216.34".into()],
        ttl: 300,
    })
    .build()?;
The envelope is immutable after construction. It captures everything needed to reproduce this fetch.
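One common way to get "immutable after construction" in Rust is a consuming builder that produces a type with private fields and read-only accessors. The sketch below shows that pattern on a reduced, hypothetical envelope with only three fields; the real `ExecutionEnvelope` carries more context, and its builder may well work differently.

```rust
pub mod envelope {
    /// Hypothetical, reduced envelope. Fields are private and there are no
    /// setters, so code outside this module cannot alter a built value.
    #[derive(Debug, Clone, PartialEq, Eq)]
    pub struct Envelope {
        seed: u64,
        logical_clock: u64,
        target_url: String,
    }

    impl Envelope {
        pub fn seed(&self) -> u64 {
            self.seed
        }
        pub fn logical_clock(&self) -> u64 {
            self.logical_clock
        }
        pub fn target_url(&self) -> &str {
            &self.target_url
        }
    }

    /// Reports the first required field that was never set.
    #[derive(Debug, PartialEq, Eq)]
    pub enum BuildError {
        Missing(&'static str),
    }

    /// Consuming builder: each method takes and returns `self`.
    #[derive(Default)]
    pub struct EnvelopeBuilder {
        seed: Option<u64>,
        logical_clock: Option<u64>,
        target_url: Option<String>,
    }

    impl EnvelopeBuilder {
        pub fn new() -> Self {
            Self::default()
        }
        pub fn seed(mut self, seed: u64) -> Self {
            self.seed = Some(seed);
            self
        }
        pub fn logical_clock(mut self, tick: u64) -> Self {
            self.logical_clock = Some(tick);
            self
        }
        pub fn target_url(mut self, url: &str) -> Self {
            self.target_url = Some(url.to_string());
            self
        }
        /// Fails if any required field is missing; otherwise seals the envelope.
        pub fn build(self) -> Result<Envelope, BuildError> {
            Ok(Envelope {
                seed: self.seed.ok_or(BuildError::Missing("seed"))?,
                logical_clock: self
                    .logical_clock
                    .ok_or(BuildError::Missing("logical_clock"))?,
                target_url: self.target_url.ok_or(BuildError::Missing("target_url"))?,
            })
        }
    }
}
```

Validating in `build()` means an incomplete envelope never exists as a value, which matches the `?` on `.build()` in the snippet above.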
4. Fetch Executes
let fetcher = HttpFetcher::with_defaults()?;
let result: FetchResult = fetcher.fetch(&envelope).await?;
In browser mode, BrowserFetcher launches headless Chrome with determinism overrides and captures the DOM and sub-resources via the Chrome DevTools Protocol (CDP).
5. Link Extraction
let links: Vec<Url> = extract_links(&html_body, &entry.url);
for link in links {
    frontier.push_discovered(link, entry.depth + 1, content_hash);
}
Links are extracted from the HTML after <script> and <style> elements have been stripped. They are then normalized (fragments stripped, query parameters sorted, default ports removed), deduplicated, and sorted so that the output order is deterministic.
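The normalization rules can be illustrated with a string-level sketch. It is a simplification under stated assumptions: `normalize` and `normalize_all` are hypothetical helpers, only absolute URLs with a `scheme://` prefix are handled, and an empty path is rewritten to `/` as URL parsers typically do. Palimpsest's `extract_links` presumably operates on parsed `Url` values rather than raw strings.

```rust
use std::collections::BTreeSet;

/// Normalizes an absolute URL: strips the fragment, removes the default
/// port for http/https, and sorts query parameters.
/// Returns None when the input has no "scheme://" prefix.
pub fn normalize(url: &str) -> Option<String> {
    // 1. Strip the fragment.
    let without_fragment = url.split('#').next()?;
    let (scheme, rest) = without_fragment.split_once("://")?;

    // Split the authority from the path and query.
    let (authority, path_and_query) = match rest.find(|c: char| c == '/' || c == '?') {
        Some(i) => (&rest[..i], &rest[i..]),
        None => (rest, ""),
    };
    let (path, query) = match path_and_query.split_once('?') {
        Some((p, q)) => (p, Some(q)),
        None => (path_and_query, None),
    };

    // 2. Remove the default port for the scheme, if present.
    let default_port = if scheme.eq_ignore_ascii_case("http") {
        Some(":80")
    } else if scheme.eq_ignore_ascii_case("https") {
        Some(":443")
    } else {
        None
    };
    let mut authority = authority.to_string();
    if let Some(port) = default_port {
        if authority.ends_with(port) {
            authority.truncate(authority.len() - port.len());
        }
    }

    // Assumption: an empty path is equivalent to "/".
    let path = if path.is_empty() { "/" } else { path };
    let mut out = format!("{scheme}://{authority}{path}");

    // 3. Sort query parameters so equivalent URLs compare equal.
    if let Some(q) = query {
        let mut params: Vec<&str> = q.split('&').filter(|p| !p.is_empty()).collect();
        params.sort();
        if !params.is_empty() {
            out.push('?');
            out.push_str(&params.join("&"));
        }
    }
    Some(out)
}

/// Normalizes, deduplicates, and sorts a batch of links.
/// BTreeSet provides both deduplication and a stable sorted order.
pub fn normalize_all(urls: &[&str]) -> Vec<String> {
    let set: BTreeSet<String> = urls.iter().filter_map(|u| normalize(u)).collect();
    set.into_iter().collect()
}
```

With these rules, `https://example.com:443/a?b=2&a=1#top` and `https://example.com/a?a=1&b=2` collapse to the same string, so the frontier's string-based deduplication treats them as one URL.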
6. Artifact Creation
let record = WarcRecord::new(
    RecordType::Response,
    "application/http;msgtype=response".into(),
    response_bytes,
);
assert!(record.verify_integrity()); // BLAKE3 hash matches payload
The CaptureGroup bundles the envelope record, request record, response record, and optional DOM/resource-graph/timing records.
7. Content-Addressed Storage
let hash: ContentHash = store.put(response_bytes).await?;
// hash = blake3(response_bytes)
// Stored at: blobs/af/1349b9f5f9a1a6a0404dea36dcc949...
If a blob with the same hash already exists, the write is a no-op (structural deduplication).
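The put-if-absent behavior can be sketched with an in-memory map. Note the loud assumption: the standard library has no BLAKE3, so this example substitutes 64-bit FNV-1a, which is not collision-resistant and is used only to keep the sketch dependency-free. `MemStore`, `fnv1a_hex`, and `blob_path` are hypothetical names. The two-character directory prefix follows the path layout shown in the comment above.

```rust
use std::collections::HashMap;

/// Stand-in digest (64-bit FNV-1a), NOT BLAKE3 and not collision-resistant.
/// It only gives this sketch a deterministic content key.
pub fn fnv1a_hex(bytes: &[u8]) -> String {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325; // FNV-1a 64-bit offset basis
    for b in bytes {
        h ^= u64::from(*b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
    }
    format!("{h:016x}")
}

/// Shards blobs by the first two hex characters of the digest,
/// giving paths of the form blobs/<2 chars>/<remaining chars>.
/// Assumes `hex` is an ASCII hex string of at least two characters.
pub fn blob_path(hex: &str) -> String {
    format!("blobs/{}/{}", &hex[..2], &hex[2..])
}

/// Hypothetical in-memory content-addressed store.
#[derive(Default)]
pub struct MemStore {
    blobs: HashMap<String, Vec<u8>>,
    /// Counts physical writes so deduplication is observable.
    writes: usize,
}

impl MemStore {
    /// Stores the bytes under their digest. If the digest already exists,
    /// nothing is written and the existing key is returned.
    pub fn put(&mut self, bytes: &[u8]) -> String {
        let key = fnv1a_hex(bytes);
        if !self.blobs.contains_key(&key) {
            self.blobs.insert(key.clone(), bytes.to_vec());
            self.writes += 1;
        }
        key
    }

    /// Looks up a blob by its digest.
    pub fn get(&self, key: &str) -> Option<&[u8]> {
        self.blobs.get(key).map(|v| v.as_slice())
    }

    /// Number of blobs actually written.
    pub fn writes(&self) -> usize {
        self.writes
    }
}
```

Storing the same payload twice returns the same key and increments `writes` only once, which is the deduplication property the text describes.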
8. Temporal Index Insert
index.insert(IndexEntry::new(
    entry.url.clone(),
    envelope.timestamp(),
    hash,
    CrawlContextId(1),
))?;
The index records this capture in four dimensions: URL, time, content hash, and crawl context.
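As a rough illustration of a URL-and-time index, the sketch below keys an ordered map on `(url, timestamp)` and stores the content hash and crawl context as the value. The types `MiniIndex` and `Capture`, the methods `history` and `latest_at`, and the use of a plain `u64` timestamp are assumptions made for the example; the actual `IndexEntry` and `IndexQuery` types are not shown in this chapter.

```rust
use std::collections::BTreeMap;

/// Hypothetical value stored per capture.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Capture {
    pub content_hash: String,
    pub crawl_context: u32,
}

/// Hypothetical index ordered by (URL, timestamp), so all captures of one
/// URL are contiguous and sorted by time.
#[derive(Default)]
pub struct MiniIndex {
    by_url_time: BTreeMap<(String, u64), Vec<Capture>>,
}

impl MiniIndex {
    /// Records one capture of `url` at time `ts`.
    pub fn insert(&mut self, url: &str, ts: u64, content_hash: &str, crawl_context: u32) {
        self.by_url_time
            .entry((url.to_string(), ts))
            .or_default()
            .push(Capture {
                content_hash: content_hash.to_string(),
                crawl_context,
            });
    }

    /// All captures of `url`, in ascending time order.
    pub fn history(&self, url: &str) -> Vec<(u64, &Capture)> {
        self.by_url_time
            .range((url.to_string(), 0)..=(url.to_string(), u64::MAX))
            .flat_map(|(key, caps)| {
                let ts = key.1;
                caps.iter().map(move |c| (ts, c))
            })
            .collect()
    }

    /// The most recent capture of `url` taken at or before `ts`, if any.
    pub fn latest_at(&self, url: &str, ts: u64) -> Option<(u64, &Capture)> {
        self.by_url_time
            .range((url.to_string(), 0)..=(url.to_string(), ts))
            .next_back()
            .and_then(|(key, caps)| caps.last().map(|c| (key.1, c)))
    }
}
```

A point-in-time lookup such as `latest_at` is the kind of query a replay needs: it yields a content hash, which can then be resolved against the blob store.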
9. WARC++ Output
write_warc_file(&path, &capture_group.all_records()).await?;
The WARC++ file contains standard ISO 28500 records plus Palimpsest extensions (envelope, dom-snapshot, resource-graph, timing). Standard WARC readers can parse the basic records; Palimpsest readers get the full execution context.
10. Replay
let content = store.get(&hash).await?;
let entries = index.query(&IndexQuery::for_url(&url))?;
Replay retrieves the stored blob and execution envelope, then reconstructs the original HTTP exchange, DOM state, and resource dependency graph. Given the same envelope and the same artifacts, replay produces bit-identical output.