batch

batch runs parallel analysis across all supported files in a directory. It produces aggregated statistics, per-file summaries, and batch-level observations, processing hundreds or thousands of files in seconds via Rayon-based parallelism.

Where single-document commands analyze one file, batch analyzes the population.


Usage

vajra batch <directory> [flags]

Arguments:

Argument       Description
<directory>    Path to a directory containing the files to analyze

Flags:

Flag                    Description                                        Default
--format <fmt>          Output format: text, json, markdown, compact-ai    text
--profile <name>        Concern profile for essence generation             engineer
--input-format <fmt>    Override auto-detected input format                auto
--streaming             Force streaming mode for each file                 off
--redact                Apply built-in redaction before output             off
--quiet                 Suppress progress output                           off

What It Does

  1. File discovery. Scans the directory for all supported files (.json, .yaml, .csv, .ndjson, etc.).

  2. Parallel analysis. Each file is analyzed independently using Rayon’s work-stealing thread pool. On an 8-core machine, up to 8 files are analyzed simultaneously.

  3. Per-file statistics. For each file: node count, path count, depth, fingerprint, anomaly count.

  4. Aggregated statistics. Across the entire batch: merged frequency distributions, merged DDSketch quantiles, population-level entropy, cross-file type stability.

  5. Batch-level observations. Structural families (via clustering), population anomalies, files that deviate from the batch norm.
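
The first four steps can be sketched in a few lines of Python. This is an illustration of the shape of the pipeline, not vajra's implementation: a standard-library thread pool stands in for Rayon, only .json files are discovered, and the way nodes and depth are counted here (every object, array, and scalar is one node) is an assumption. The function names are hypothetical.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def walk(value, depth=1):
    """Return (node_count, max_depth) for a parsed JSON value."""
    if isinstance(value, dict):
        children = list(value.values())
    elif isinstance(value, list):
        children = value
    else:
        return 1, depth  # scalar leaf
    nodes, deepest = 1, depth
    for child in children:
        child_nodes, child_depth = walk(child, depth + 1)
        nodes += child_nodes
        deepest = max(deepest, child_depth)
    return nodes, deepest


def analyze_file(path):
    """Per-file statistics (step 3), reduced here to node count and depth."""
    nodes, depth = walk(json.loads(path.read_text(encoding="utf-8")))
    return {"file": path.name, "nodes": nodes, "depth": depth}


def analyze_directory(directory):
    # Step 1: discover files (this sketch only looks for .json).
    files = sorted(Path(directory).glob("*.json"))
    # Step 2: analyze each file independently; the pool stands in for Rayon.
    with ThreadPoolExecutor() as pool:
        per_file = list(pool.map(analyze_file, files))
    # Step 4: aggregate across the whole batch.
    return {
        "files_processed": len(per_file),
        "total_nodes": sum(result["nodes"] for result in per_file),
        "per_file": per_file,
    }
```

Because each file is analyzed without reference to any other, the per-file step is a plain parallel map, and only the final aggregation needs to see all results.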


Example: Text Output

vajra batch ./claims/
=== Batch Analysis ===
Directory: ./claims/
Files processed: 247
Total nodes: 208,729
Processing time: 1.4s (148,378 nodes/s)

=== Per-File Summary ===
  FILE                  NODES  PATHS  DEPTH  ANOMALIES  FINGERPRINT
  claim_001.json          847     23      6          0  a1b2c3d4...
  claim_002.json          891     23      6          0  a1b2c3d4...
  claim_003.json          723     23      6          1  a1b2c3d4...
  claim_048.json         1102     27      7          0  e5f6a7b8...
  claim_199.json          412     18      5          3  c9d0e1f2...
  ... (242 more files)

=== Structural Families ===
  Family 1: 198 files (80.2%) — 23 paths, signature a1b2c3d4...
  Family 2:  41 files (16.6%) — 27 paths, signature e5f6a7b8...
  Family 3:   8 files ( 3.2%) — 18 paths, signature c9d0e1f2...

=== Aggregated Statistics ===
  $.claims[*].service_lines[*].charge_amount
    Population median: 285.00
    Population MAD: 195.00
    Population p95: 1,420.00
    Cross-file consistency: high (coefficient of variation = 0.12)

  $.claims[*].service_lines[*].status
    Population entropy: 1.45 bits
    Dominant value: "adjudicated" (72.3%)
    Cardinality: 5 values across batch

=== Batch-Level Anomalies ===
  claim_199.json: structural outlier (Jaccard distance 0.31 from dominant family)
  claim_201.json: structural outlier (Jaccard distance 0.28 from dominant family)
  claim_834.json: contains numeric outlier (charge_amount = 47,250.00, z_MAD = 6.3)
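
The Jaccard distance reported for structural outliers can be read as "how much of the combined path set is not shared". A minimal sketch, assuming the distance is taken between the set of paths in a file and the set of paths in the dominant family (the page does not spell out the exact inputs); the path names below are made up for illustration:

```python
def jaccard_distance(paths_a, paths_b):
    """1 - |A ∩ B| / |A ∪ B| over two collections of paths."""
    a, b = set(paths_a), set(paths_b)
    union = a | b
    if not union:
        return 0.0  # two empty path sets are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical path sets: three paths shared, one unique to each side.
dominant_family = {"$.id", "$.status", "$.amount", "$.patient"}
candidate_file = {"$.id", "$.status", "$.amount", "$.notes"}

distance = jaccard_distance(dominant_family, candidate_file)
print(distance)
```

A distance of 0 means the path sets are identical and 1 means they share nothing, so values such as 0.31 indicate a file that overlaps with the dominant family but carries a noticeable number of missing or extra paths.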

Example: JSON Output

vajra batch ./claims/ --format json
{
  "directory": "./claims/",
  "files_processed": 247,
  "total_nodes": 208729,
  "processing_time_ms": 1400,
  "per_file": [
    {
      "file": "claim_001.json",
      "nodes": 847,
      "paths": 23,
      "depth": 6,
      "anomaly_count": 0,
      "fingerprint": "a1b2c3d4..."
    }
  ],
  "structural_families": [
    {
      "id": 0,
      "count": 198,
      "percentage": 80.2,
      "distinct_paths": 23,
      "signature": "a1b2c3d4..."
    },
    {
      "id": 1,
      "count": 41,
      "percentage": 16.6,
      "distinct_paths": 27,
      "signature": "e5f6a7b8..."
    },
    {
      "id": 2,
      "count": 8,
      "percentage": 3.2,
      "distinct_paths": 18,
      "signature": "c9d0e1f2..."
    }
  ],
  "aggregated_stats": {
    "$.claims[*].service_lines[*].charge_amount": {
      "population_median": 285.0,
      "population_mad": 195.0,
      "population_p95": 1420.0
    }
  },
  "batch_anomalies": [
    {
      "file": "claim_199.json",
      "type": "structural_outlier",
      "jaccard_distance": 0.31
    },
    {
      "file": "claim_834.json",
      "type": "numeric_outlier",
      "path": "$.claims[*].service_lines[*].charge_amount",
      "value": 47250.0,
      "z_mad": 6.3
    }
  ]
}
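
The JSON report is convenient to post-process in a script. The sketch below uses only field names that appear in the report above; the function name and the trimmed sample are illustrative, and a real report may contain fields not shown here.

```python
import json


def summarize_report(report_text):
    """Pull family shares and structural outliers out of a batch JSON report."""
    report = json.loads(report_text)
    families = [
        (family["id"], family["count"], family["percentage"])
        for family in report.get("structural_families", [])
    ]
    outliers = [
        anomaly["file"]
        for anomaly in report.get("batch_anomalies", [])
        if anomaly.get("type") == "structural_outlier"
    ]
    return {
        "files_processed": report["files_processed"],
        "families": families,
        "structural_outliers": outliers,
    }


# Trimmed copy of the example report.
SAMPLE = """
{
  "files_processed": 247,
  "structural_families": [
    {"id": 0, "count": 198, "percentage": 80.2},
    {"id": 1, "count": 41, "percentage": 16.6}
  ],
  "batch_anomalies": [
    {"file": "claim_199.json", "type": "structural_outlier", "jaccard_distance": 0.31},
    {"file": "claim_834.json", "type": "numeric_outlier", "z_mad": 6.3}
  ]
}
"""

summary = summarize_report(SAMPLE)
print(summary)
```

In practice the report text would come from saving the output of vajra batch ./claims/ --format json to a file and reading it back.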

Parallelism and Performance

Batch uses Rayon’s work-stealing thread pool. The number of threads defaults to the number of CPU cores.

Performance targets:

Batch Size                    Target
100 files, ~1 MB each         < 5 seconds
1,000 files, ~1 MB each       < 30 seconds
10,000 files, ~100 KB each    < 30 seconds

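DDSketch instances are computed per-file and merged globally with no additional accuracy loss: the merged sketch is identical to one built from all values in a single pass. Quantiles remain estimates within DDSketch's relative-error bound, but parallel batch processing gives the same answer as sequential processing would.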
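
The merge property can be illustrated with a simplified log-bucket sketch in the style of DDSketch. This is a teaching sketch under stated assumptions, not vajra's code or a full DDSketch: it handles positive values only, never collapses buckets, and the class name and the sample values are hypothetical.

```python
import math
from collections import Counter


class LogSketch:
    """Simplified DDSketch-style quantile sketch for positive values."""

    def __init__(self, alpha=0.01):
        # alpha is the relative-error target; gamma is the bucket growth factor.
        self.gamma = (1 + alpha) / (1 - alpha)
        self.buckets = Counter()
        self.count = 0

    def add(self, value):
        # Bucket i covers the interval (gamma^(i-1), gamma^i].
        index = math.ceil(math.log(value) / math.log(self.gamma))
        self.buckets[index] += 1
        self.count += 1

    def merge(self, other):
        # Merging is bucket-wise addition, so no information is lost.
        self.buckets.update(other.buckets)
        self.count += other.count

    def quantile(self, q):
        rank = q * (self.count - 1)
        seen = 0
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen > rank:
                # Representative value within alpha of everything in the bucket.
                return 2 * self.gamma ** index / (self.gamma + 1)
        raise ValueError("empty sketch")


# Two hypothetical files, sketched separately and then merged.
file_a = [120.0, 285.0, 90.0]
file_b = [1420.0, 310.0, 47250.0]

sketch_a, sketch_b, combined = LogSketch(), LogSketch(), LogSketch()
for value in file_a:
    sketch_a.add(value)
for value in file_b:
    sketch_b.add(value)
for value in file_a + file_b:
    combined.add(value)

sketch_a.merge(sketch_b)
print(sketch_a.buckets == combined.buckets)
```

Because a value's bucket depends only on the value itself, adding bucket counts from separate files yields exactly the sketch a single pass would have built. That is what lets each worker thread sketch its own files and defer all combination to the end.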


When to Use It

  • Daily batch monitoring. Run batch on each day’s incoming data. Track structural families, anomaly counts, and distribution shifts over time.
  • Pre-processing audit. Before feeding a batch to a downstream system, run batch to verify structural consistency and flag outliers.
  • Population baselines. Establish population-level statistics (median charge amount, expected null rates, typical structural signature) that individual-file analysis can compare against.
  • Quick directory survey. “What is in this folder?” — batch answers in seconds.
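
For daily monitoring, two saved JSON reports can be compared directly. The sketch below is a hand-rolled comparison, separate from the drift command, that flags paths whose population median moved by more than a chosen fraction. The field names come from the JSON example; the function name, the threshold, and the sample values for the second day are assumptions.

```python
def median_shifts(previous, current, threshold=0.10):
    """Return {path: relative change} for medians that moved by >= threshold."""
    shifts = {}
    previous_stats = previous.get("aggregated_stats", {})
    for path, stats in current.get("aggregated_stats", {}).items():
        before = previous_stats.get(path, {}).get("population_median")
        after = stats.get("population_median")
        if before in (None, 0) or after is None:
            continue  # path is new, has no median, or the baseline is zero
        change = (after - before) / abs(before)
        if abs(change) >= threshold:
            shifts[path] = change
    return shifts


PATH = "$.claims[*].service_lines[*].charge_amount"
yesterday = {"aggregated_stats": {PATH: {"population_median": 285.0}}}
today = {"aggregated_stats": {PATH: {"population_median": 342.0}}}  # hypothetical

shifts = median_shifts(yesterday, today)
print(shifts)
```

The same pattern extends to other aggregates in the report, such as population_p95, by swapping the field that is read.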

Pairs Well With

  • cluster — batch includes lightweight clustering; cluster provides detailed similarity analysis
  • anomalies — batch flags files with anomalies; drill into specific files for details
  • drift — compare today’s batch aggregates to yesterday’s for population-level drift
  • essence — run essence on specific files that batch identified as notable