palimpsest-shadow

Shadow comparison engine for validating Palimpsest output against legacy crawlers (Heritrix, wget, Warcprox, Brozzler).

Purpose

During migration from legacy crawl infrastructure, shadow comparison checks that Palimpsest captures the same content as the crawler it replaces. It reads .warc and .warc.gz files from any crawler, normalizes URLs for cross-format comparison, and reports matches, mismatches, and coverage gaps.

Usage

palimpsest shadow-compare --legacy ./heritrix-warcs --palimpsest ./output [--json]

Comparison Logic

Read all WARC records from the legacy directory (.warc and .warc.gz)
Read all WARC records from the Palimpsest output
Normalize URLs: strip fragments, unify schemes (http/https), sort query params, strip angle brackets
Match records by normalized URL
For matched pairs: compare content size and report the size difference in bytes
Report unmatched URLs in each direction (coverage gaps)
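
The matching and gap-reporting steps above can be sketched in a few lines. This is a minimal illustration, not the engine's actual code: it assumes each dataset has already been read and reduced to a mapping from normalized URL to content size in bytes, and the names ShadowReport and compare_captures are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ShadowReport:
    """Hypothetical result container; field names are illustrative only."""
    matched: List[str]                  # URLs whose sizes agree
    mismatched: Dict[str, int]          # URL -> size difference in bytes
    missing_from_palimpsest: List[str]  # present in legacy only
    missing_from_legacy: List[str]      # present in Palimpsest only


def compare_captures(legacy: Dict[str, int], palimpsest: Dict[str, int]) -> ShadowReport:
    """Compare two {normalized_url: content_size} mappings (an assumed input shape)."""
    matched: List[str] = []
    mismatched: Dict[str, int] = {}

    # URLs captured by both crawlers: compare sizes.
    for url in sorted(legacy.keys() & palimpsest.keys()):
        diff = palimpsest[url] - legacy[url]
        if diff == 0:
            matched.append(url)
        else:
            mismatched[url] = diff

    # URLs captured by only one side are coverage gaps.
    return ShadowReport(
        matched=matched,
        mismatched=mismatched,
        missing_from_palimpsest=sorted(legacy.keys() - palimpsest.keys()),
        missing_from_legacy=sorted(palimpsest.keys() - legacy.keys()),
    )
```

A positive difference here means the Palimpsest capture is larger than the legacy one; the sign convention is a choice made for this sketch.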

URL Normalization

Legacy crawlers store URLs differently:

wget wraps URLs in angle brackets (<http://url>), as the WARC 1.0 grammar specifies
wget stores post-redirect URLs (https), while Palimpsest may store the pre-redirect URL (http)
Fragment handling varies across tools

normalize_url_for_comparison() maps these variants to a single canonical form, so that equivalent URLs from different crawlers compare equal.
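
As a rough illustration of the normalization rules listed under Comparison Logic, the sketch below applies them with Python's standard library. It is not the real normalize_url_for_comparison(); the function name normalize_url_sketch is hypothetical, and mapping both http and https to https is an assumption about how schemes are unified.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url_sketch(raw: str) -> str:
    """Illustrative normalization; not the project's actual implementation."""
    url = raw.strip()

    # Strip the angle brackets that wget writes around URLs.
    if url.startswith("<") and url.endswith(">"):
        url = url[1:-1]

    parts = urlsplit(url)

    # Unify schemes: treat http and https as the same (assumed target: https).
    scheme = parts.scheme.lower()
    if scheme in ("http", "https"):
        scheme = "https"

    # Sort query parameters so ordering differences do not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))

    # Drop the fragment by passing an empty string as the last component.
    return urlunsplit((scheme, parts.netloc, parts.path, query, ""))
```

With this sketch, a wget-style record such as <http://example.com/a?b=2&a=1#frag> and a plain https://example.com/a?a=1&b=2 normalize to the same string, so they would be matched as one URL.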

Output Format

Output is plain text by default, or JSON with the --json flag. The report includes:

Total URLs in each dataset
Matched URLs with size comparison
Mismatches with byte-level size diffs
URLs present in legacy but missing from Palimpsest
URLs present in Palimpsest but missing from legacy