palimpsest-shadow

Shadow comparison engine for validating Palimpsest output against legacy crawlers (Heritrix, wget, Warcprox, Brozzler).

Purpose

During migration from legacy crawl infrastructure, shadow comparison checks that Palimpsest captures the same content as the legacy crawler. It reads .warc and .warc.gz files from any crawler, normalizes URLs so that records written by different tools can be compared, and reports matches, mismatches, and coverage gaps.
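As an illustration of reading both .warc and .warc.gz input, here is a minimal Python sketch using only the standard library. It is not Palimpsest's reader: the function names open_warc and iter_records are hypothetical, and the parser assumes well-formed records with a Content-Length header and no folded header lines.

```python
import gzip
import io
from pathlib import Path
from typing import IO, Dict, Iterator, Tuple


def open_warc(path: Path) -> IO[bytes]:
    """Open a WARC file, decompressing transparently when it ends in .gz."""
    if path.name.endswith(".gz"):
        # gzip.open reads concatenated gzip members in sequence, which is
        # how per-record-compressed WARC files are laid out.
        return gzip.open(path, "rb")
    return open(path, "rb")


def iter_records(stream: IO[bytes]) -> Iterator[Tuple[Dict[str, str], int]]:
    """Yield (headers, content_length) for each record in the stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # an empty line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        stream.read(length)  # skip the record body; only its size is needed
        yield headers, length
```

Header names are lowercased so that lookups such as headers["warc-target-uri"] work regardless of how a given crawler capitalizes them.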

Usage

palimpsest shadow-compare --legacy ./heritrix-warcs --palimpsest ./output [--json]

Comparison Logic

  1. Read all WARC records from the legacy directory (.warc and .warc.gz)
  2. Read all WARC records from the Palimpsest output
  3. Normalize URLs: strip fragments, unify schemes (http/https), sort query params, strip angle brackets
  4. Match records by normalized URL
  5. For matched pairs: compare content sizes and report the difference in bytes
  6. Report unmatched URLs in each direction (coverage gaps)
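Steps 4 to 6 can be sketched in Python as follows. This is an illustration under stated assumptions, not the engine's actual code: each dataset is taken to be already reduced to a mapping from normalized URL to content size in bytes, and the Comparison type and compare function are hypothetical names. The sketch also assumes that a matched pair with equal sizes counts as a match and one with different sizes counts as a mismatch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Comparison:
    matched: List[str] = field(default_factory=list)
    # (url, palimpsest size minus legacy size, in bytes)
    mismatched: List[Tuple[str, int]] = field(default_factory=list)
    missing_from_palimpsest: List[str] = field(default_factory=list)
    missing_from_legacy: List[str] = field(default_factory=list)


def compare(legacy: Dict[str, int], palimpsest: Dict[str, int]) -> Comparison:
    """Match records by normalized URL and classify every URL seen."""
    result = Comparison()
    for url in sorted(legacy.keys() | palimpsest.keys()):
        if url not in palimpsest:
            result.missing_from_palimpsest.append(url)  # coverage gap
        elif url not in legacy:
            result.missing_from_legacy.append(url)  # extra capture
        elif legacy[url] == palimpsest[url]:
            result.matched.append(url)
        else:
            result.mismatched.append((url, palimpsest[url] - legacy[url]))
    return result
```

Iterating over the sorted union of both key sets gives each URL exactly one classification and keeps the output order stable between runs.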

URL Normalization

Legacy crawlers store URLs differently:

  • wget wraps the target URI in angle brackets (<http://url>), as the WARC 1.0 grammar specifies
  • wget stores the post-redirect URL (https), while Palimpsest may store the pre-redirect URL (http)
  • Fragment handling varies across tools

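normalize_url_for_comparison() maps these representations to a single canonical form, so the same resource compares equal regardless of which crawler captured it.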
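The source of normalize_url_for_comparison() is not shown here, so the following Python sketch only illustrates the four documented steps (strip angle brackets, unify schemes, sort query parameters, strip fragments). The name normalize_url_sketch is hypothetical, and mapping http to https (rather than the reverse) is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url_sketch(raw: str) -> str:
    """Reduce a recorded URL to a canonical form for comparison only."""
    url = raw.strip()
    # Strip wget-style angle brackets: <http://url> -> http://url
    if url.startswith("<") and url.endswith(">"):
        url = url[1:-1]
    parts = urlsplit(url)
    # Unify schemes so pre- and post-redirect captures compare equal.
    scheme = "https" if parts.scheme in ("http", "https") else parts.scheme
    # Sort the raw query segments; percent-encoding is left untouched.
    query = "&".join(sorted(parts.query.split("&")))
    # Drop the fragment by passing an empty string in its place.
    return urlunsplit((scheme, parts.netloc, parts.path, query, ""))
```

For example, normalize_url_sketch("<http://example.com/a?b=2&a=1#top>") and normalize_url_sketch("https://example.com/a?a=1&b=2") both return "https://example.com/a?a=1&b=2". The result is a comparison key, not necessarily a URL to fetch.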

Output Format

Plain text by default; JSON with the --json flag. The report includes:

  • Total URLs in each dataset
  • Matched URLs with size comparison
  • Mismatches with byte-level size diffs
  • URLs present in legacy but missing from Palimpsest
  • URLs present in Palimpsest but missing from legacy
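To make the two output modes concrete, here is a hedged Python sketch of rendering one report either as plain text or as JSON. The actual schema and text layout of palimpsest shadow-compare are not documented here, so every field name below (legacy_total, size_diff_bytes, and so on) and the line format are hypothetical.

```python
import json


def render_report(report: dict, as_json: bool = False) -> str:
    """Render a comparison report as JSON or as plain-text lines."""
    if as_json:
        return json.dumps(report, indent=2, sort_keys=True)
    lines = [
        f"Legacy URLs:     {report['legacy_total']}",
        f"Palimpsest URLs: {report['palimpsest_total']}",
        f"Matched:         {len(report['matched'])}",
    ]
    for item in report["mismatches"]:
        # +d prints an explicit sign so growth and shrinkage are distinct
        lines.append(f"MISMATCH {item['url']} ({item['size_diff_bytes']:+d} bytes)")
    for url in report["missing_from_palimpsest"]:
        lines.append(f"MISSING FROM PALIMPSEST {url}")
    for url in report["missing_from_legacy"]:
        lines.append(f"MISSING FROM LEGACY {url}")
    return "\n".join(lines)


# Hypothetical report with one entry in each category.
example_report = {
    "legacy_total": 3,
    "palimpsest_total": 3,
    "matched": ["https://example.com/a"],
    "mismatches": [{"url": "https://example.com/b", "size_diff_bytes": 5}],
    "missing_from_palimpsest": ["https://example.com/c"],
    "missing_from_legacy": ["https://example.com/d"],
}
```

Keeping the report as a plain dictionary means the JSON mode is a direct serialization, while the text mode is just a different view of the same data.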