palimpsest-shadow
Shadow comparison engine for validating Palimpsest output against legacy crawlers (Heritrix, wget, Warcprox, Brozzler).
Purpose
During migration from legacy crawl infrastructure, shadow comparison proves that Palimpsest captures the same content. It reads .warc and .warc.gz files from any crawler, normalizes URLs for cross-format comparison, and reports matches, mismatches, and coverage gaps.
Usage
palimpsest shadow-compare --legacy ./heritrix-warcs --palimpsest ./output [--json]
Comparison Logic
- Read all WARC records from the legacy directory (.warc and .warc.gz)
- Read all WARC records from the Palimpsest output
- Normalize URLs: strip fragments, unify schemes (http/https), sort query params, strip angle brackets
- Match records by normalized URL
- For matched pairs: compare content size and report the size difference in bytes
- Report unmatched URLs in each direction (coverage gaps)
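The matching and gap-reporting steps above can be sketched roughly as follows. This is an illustrative outline only, not Palimpsest's actual implementation: `ShadowReport`, `compare_sketch`, and the field names are hypothetical. It assumes each dataset has already been reduced to a mapping from normalized URL to content size in bytes, and that a pair with identical sizes counts as a match.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ShadowReport:
    """Hypothetical result container; field names are illustrative."""
    same_size: list[str] = field(default_factory=list)
    # URL -> (Palimpsest size - legacy size) in bytes, for pairs that differ
    size_mismatch: dict[str, int] = field(default_factory=dict)
    missing_from_palimpsest: list[str] = field(default_factory=list)
    missing_from_legacy: list[str] = field(default_factory=list)


def compare_sketch(legacy: dict[str, int], palimpsest: dict[str, int]) -> ShadowReport:
    """Compare two {normalized_url: size_in_bytes} mappings."""
    report = ShadowReport()
    # URLs present on both sides are candidate matches.
    for url in sorted(legacy.keys() & palimpsest.keys()):
        delta = palimpsest[url] - legacy[url]
        if delta == 0:
            report.same_size.append(url)
        else:
            report.size_mismatch[url] = delta
    # Set differences give the coverage gaps in each direction.
    report.missing_from_palimpsest = sorted(legacy.keys() - palimpsest.keys())
    report.missing_from_legacy = sorted(palimpsest.keys() - legacy.keys())
    return report
```

Keying both sides by normalized URL keeps the comparison to a few set operations, so coverage gaps fall out of the same pass that finds size mismatches.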
URL Normalization
Legacy crawlers store URLs differently:
- wget uses <http://url> angle bracket syntax per WARC spec
- wget stores post-redirect URLs (https), Palimpsest may store pre-redirect (http)
- Fragment handling varies across tools
normalize_url_for_comparison() unifies all representations.
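For illustration, the four normalization steps could look something like the sketch below. It is not the real normalize_url_for_comparison(); the function name `normalize_url_sketch` is hypothetical, and mapping both http and https onto https is an assumption (the source only says the schemes are unified, not which one wins).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_url_sketch(raw: str) -> str:
    """Illustrative normalization; not Palimpsest's actual implementation."""
    url = raw.strip()
    # Strip the <...> wrapper that wget writes around the target URI.
    if url.startswith("<") and url.endswith(">"):
        url = url[1:-1]
    parts = urlsplit(url)
    # Assumption: collapse http/https onto https so pre-redirect and
    # post-redirect captures of the same resource compare equal.
    scheme = "https" if parts.scheme in ("http", "https") else parts.scheme
    # Sort query parameters so ordering differences do not cause mismatches.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Passing an empty fragment drops anything after '#'.
    return urlunsplit((scheme, parts.netloc, parts.path, query, ""))
```

Under these assumptions, `normalize_url_sketch("<http://example.com/a?b=2&a=1#frag>")` and `normalize_url_sketch("https://example.com/a?a=1&b=2")` both yield `https://example.com/a?a=1&b=2`. Note that round-tripping the query through `parse_qsl` and `urlencode` also re-encodes percent-escapes, which the real function may or may not do.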
Output Format
Plain text by default, JSON with --json flag. Reports:
- Total URLs in each dataset
- Matched URLs with size comparison
- Mismatches with byte-level size diffs
- URLs present in legacy but missing from Palimpsest
- URLs present in Palimpsest but missing from legacy