stats

stats computes the statistical profile of a JSON document. Entropy, frequency distributions, numeric summaries, null rates, cardinality — the quantitative foundation that every other analysis depends on.

Where inspect tells you what exists, stats tells you how it behaves.


Usage

vajra stats <input> [flags]

Arguments:

Argument  Description
<input>   Path to a JSON file, - for stdin, or an HTTP URL

Flags:

Flag                  Description                                                              Default
--format <fmt>        Output format: text, json, markdown, compact-ai                          text
--input-format <fmt>  Override auto-detected input format                                      auto
--streaming           Force streaming mode (sketch-based approximations)                       off
--redact              Apply built-in redaction before output                                   off
--quiet               Suppress progress output                                                 off
--window <period>     Temporal windowing: month, week, or day                                  off
--time-field <path>   JSONPath to timestamp field (e.g., '$.date'). Auto-detected if omitted.  auto

Temporal Windowing

When --window is specified, stats partitions records by time period and computes per-window statistics. Cross-window trend lines are included in the output, showing how distributions shift over time.

The --time-field flag tells Vajra which field contains the timestamp. If omitted, Vajra auto-detects by scanning for fields with date/time patterns (ISO 8601, Unix timestamps, common date formats).

vajra stats commits.ndjson --window month --time-field '$.date'
=== Statistical Summary (windowed: month) ===
Document: commits.ndjson (947 records, 8 paths)

--- Window: 2026-01 (312 records) ---
  $.files_changed
    Mean: 4.2   Median: 3.0   p95: 12.0

--- Window: 2026-02 (298 records) ---
  $.files_changed
    Mean: 5.1   Median: 4.0   p95: 15.0

--- Window: 2026-03 (337 records) ---
  $.files_changed
    Mean: 6.8   Median: 5.0   p95: 19.0

--- Cross-Window Trends ---
  $.files_changed  mean: 4.2 -> 5.1 -> 6.8 (upward, +62% over 3 months)
  $.type           "fix" share: 0.18 -> 0.24 -> 0.31 (increasing)

Windowing works with any multi-record input: NDJSON, CSV, multi-document YAML, or directories.
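
The partition-then-summarize idea behind --window can be sketched in a few lines of Python. This is only an illustration, not Vajra's implementation: the helper names window_key and windowed_summary are hypothetical, timestamps are assumed to be ISO 8601 strings, and the week label format is an arbitrary choice.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median


def window_key(timestamp: str, period: str) -> str:
    """Map an ISO 8601 timestamp to a window label (hypothetical helper)."""
    dt = datetime.fromisoformat(timestamp)
    if period == "month":
        return dt.strftime("%Y-%m")
    if period == "week":
        iso = dt.isocalendar()  # (ISO year, ISO week, weekday)
        return f"{iso[0]}-W{iso[1]:02d}"
    if period == "day":
        return dt.strftime("%Y-%m-%d")
    raise ValueError(f"unknown period: {period}")


def windowed_summary(records, time_field, value_field, period="month"):
    """Group records by window, then summarize one numeric field per window."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[window_key(rec[time_field], period)].append(rec[value_field])
    return {
        key: {"count": len(vals), "mean": mean(vals), "median": median(vals)}
        for key, vals in sorted(buckets.items())
    }


# Made-up records shaped like the commits example above.
records = [
    {"date": "2026-01-05", "files_changed": 2},
    {"date": "2026-01-20", "files_changed": 4},
    {"date": "2026-02-03", "files_changed": 6},
]
summary = windowed_summary(records, "date", "files_changed")
for window, stats in summary.items():
    print(window, stats)
```

A cross-window trend is then a comparison of the same statistic across the sorted window keys.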


What It Reports

For every wildcard path in the document:

Frequency and Cardinality

  • Count — total observations at this path
  • Cardinality — number of distinct values
  • Top values — the most frequent values with their counts

Entropy

  • Shannon entropy — H(X) in bits. Measures information content.
  • Normalized entropy — H(X) / log2(|support|). Scales to [0, 1] regardless of cardinality.

The entropy pair is one of the most powerful signals in the system:

Entropy  Normalized  Interpretation
0        0           Constant: single value, pure boilerplate
Low      Low         Enum-like: few distinct states
Low      High        Near-uniform over tiny support
High     Moderate    Meaningful variation: identifiers, dates, codes
High     High        Near-uniform over large support: free text, UUIDs
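
To make the two formulas concrete, here is a minimal Python sketch that computes both numbers from a list of observed values. The function name entropy_pair is hypothetical, and treating a single-value path as normalized entropy 0 is an assumption taken from the first row of the table (the ratio itself is undefined there because log2(1) = 0).

```python
import math
from collections import Counter


def entropy_pair(values):
    """Return (Shannon entropy in bits, normalized entropy) for the values."""
    counts = Counter(values)
    total = sum(counts.values())
    # H(X) = sum over distinct values of p * log2(1 / p)
    h = sum((c / total) * math.log2(total / c) for c in counts.values())
    support = len(counts)
    # Assumed convention: a constant (or empty) path normalizes to 0.
    normalized = h / math.log2(support) if support > 1 else 0.0
    return h, normalized


print(entropy_pair(["x"] * 5))             # constant path
print(entropy_pair(["a", "a", "b", "b"]))  # uniform over two values
print(entropy_pair(list(range(8))))        # uniform over eight values
```

The last two calls both reach a normalized entropy of 1.0 while their raw entropies differ (1 bit versus 3 bits), which is why the pair is more informative than either number alone.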

Missingness

  • Null rate — fraction of observations that are JSON null
  • Absent rate — fraction of parent records where this path does not appear
  • Empty rate — fraction of values that are empty strings, empty arrays, or empty objects
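
The three rates use different denominators, which is easy to miss. The sketch below spells that out for one key across a list of parent records. It is an illustration only: the function name missingness is hypothetical, and using "observations where the key is present" as the denominator for the null and empty rates is an assumption about how the definitions above are applied.

```python
def missingness(records, key):
    """Compute absent, null, and empty rates for one key across parent records."""
    observed = [rec[key] for rec in records if key in rec]
    n_records, n_obs = len(records), len(observed)

    def is_empty(value):
        # Empty string, empty array, or empty object.
        return isinstance(value, (str, list, dict)) and len(value) == 0

    return {
        # Denominator: all parent records.
        "absent_rate": (n_records - n_obs) / n_records if n_records else 0.0,
        # Denominator (assumed): observations where the key is present.
        "null_rate": sum(v is None for v in observed) / n_obs if n_obs else 0.0,
        "empty_rate": sum(is_empty(v) for v in observed) / n_obs if n_obs else 0.0,
    }


# 14 made-up service lines, 3 of which lack allowed_amount entirely.
service_lines = [{"allowed_amount": 100.0}] * 11 + [{}] * 3
rates = missingness(service_lines, "allowed_amount")
print(rates)
```

A key that is present with a JSON null counts toward the null rate, while a key that is missing from the record counts toward the absent rate; the two are never interchangeable.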

Numeric Distributions (for numeric paths)

  • Min, max, mean, median
  • Percentiles — p01, p05, p25, p50, p75, p95, p99
  • MAD — Median Absolute Deviation (robust dispersion)
  • Skewness proxy — (mean - median) / MAD
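
MAD and the skewness proxy are simple to reproduce by hand. The following Python sketch follows the definitions in the list above; the function name robust_summary is hypothetical, and returning None when MAD is zero is an assumption, since the proxy is undefined in that case.

```python
from statistics import mean, median


def robust_summary(xs):
    """Median, MAD, and the (mean - median) / MAD skewness proxy."""
    med = median(xs)
    # MAD: median of the absolute deviations from the median.
    mad = median([abs(x - med) for x in xs])
    # Assumed convention: the proxy is undefined (None) when MAD is 0.
    skew_proxy = (mean(xs) - med) / mad if mad else None
    return {"mean": mean(xs), "median": med, "mad": mad, "skew_proxy": skew_proxy}


# One large outlier pulls the mean far above the median.
result = robust_summary([1, 2, 3, 4, 100])
print(result)
```

Here the median is 3 and the MAD is 1, so the outlier barely moves the robust statistics, while the mean of 22 produces a strongly positive proxy.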

Type Distribution

  • Breakdown of JSON types observed at each path (e.g., 98% number, 2% string)
  • Type instability score — fraction of observations deviating from the dominant type
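
As a rough illustration of the type instability score, the Python sketch below classifies decoded JSON values and measures the share that deviates from the dominant type. The names json_type and type_profile are hypothetical, and the type labels follow the JSON specification rather than any confirmed Vajra output.

```python
from collections import Counter


def json_type(value):
    """Name the JSON type of a decoded Python value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must precede int: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "object"
    raise TypeError(f"not a JSON value: {value!r}")


def type_profile(values):
    """Return per-type counts, the dominant type, and the instability score."""
    if not values:
        raise ValueError("no observations")
    counts = Counter(json_type(v) for v in values)
    total = sum(counts.values())
    dominant, dominant_count = counts.most_common(1)[0]
    return {
        "types": dict(counts),
        "dominant": dominant,
        "instability": (total - dominant_count) / total,
    }


# 98 numbers and 2 strings, as in the 98% / 2% example above.
profile = type_profile([1.5] * 98 + ["n/a"] * 2)
print(profile)
```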

Example: Text Output

vajra stats claim.json
=== Statistical Summary ===
Document: claim.json (847 nodes, 23 paths)

--- $.claims[*].service_lines[*].charge_amount ---
  Count:       14
  Cardinality: 12
  Entropy:     3.52 bits (normalized: 0.98)
  Type:        number (100%)
  Min:         45.00    Max: 1250.00
  Mean:        312.50   Median: 285.00
  MAD:         195.00
  p25:         125.00   p75: 425.00
  p95:         890.00   p99: 1125.00

--- $.claims[*].service_lines[*].status ---
  Count:       14
  Cardinality: 3
  Entropy:     1.09 bits (normalized: 0.69)
  Type:        string (100%)
  Top values:
    "adjudicated"  10 (71.4%)
    "pending"        3 (21.4%)
    "denied"         1 (7.1%)

--- $.claims[*].service_lines[*].allowed_amount ---
  Count:       11
  Cardinality: 9
  Entropy:     3.12 bits (normalized: 0.93)
  Type:        number (100%)
  Null rate:   0.000
  Absent rate: 0.214  ** notable: missing in 3 of 14 service lines **
  Min:         32.00    Max: 875.00
  Mean:        245.30   Median: 210.00
  MAD:         142.00

--- $.claims[*].diagnosis[*].code ---
  Count:       2
  Cardinality: 2
  Entropy:     1.00 bits (normalized: 1.00)
  Type:        string (100%)
  Top values:
    "E11.9"  1 (50.0%)
    "J44.1"  1 (50.0%)

--- $.claims[*].service_lines[*].adjustment.reason ---
  Count:       14
  Cardinality: 4
  Entropy:     1.61 bits (normalized: 0.81)
  Type:        string (100%)
  Top values:
    "CO-45"   8 (57.1%)
    "CO-97"   3 (21.4%)
    "PR-1"    2 (14.3%)
    "OA-23"   1 (7.1%)

Example: JSON Output

vajra stats claim.json --format json
{
  "document": "claim.json",
  "total_nodes": 847,
  "distinct_paths": 23,
  "paths": {
    "$.claims[*].service_lines[*].charge_amount": {
      "count": 14,
      "cardinality": 12,
      "entropy": 3.52,
      "normalized_entropy": 0.98,
      "types": {"number": 14},
      "null_rate": 0.0,
      "absent_rate": 0.0,
      "numeric": {
        "min": 45.0,
        "max": 1250.0,
        "mean": 312.5,
        "median": 285.0,
        "mad": 195.0,
        "percentiles": {
          "p01": 45.0, "p05": 52.0, "p25": 125.0,
          "p50": 285.0, "p75": 425.0, "p95": 890.0, "p99": 1125.0
        }
      },
      "top_values": [
        {"value": "285.00", "count": 2},
        {"value": "125.00", "count": 2}
      ]
    },
    "$.claims[*].service_lines[*].status": {
      "count": 14,
      "cardinality": 3,
      "entropy": 1.09,
      "normalized_entropy": 0.69,
      "types": {"string": 14},
      "null_rate": 0.0,
      "absent_rate": 0.0,
      "top_values": [
        {"value": "adjudicated", "count": 10},
        {"value": "pending", "count": 3},
        {"value": "denied", "count": 1}
      ]
    }
  }
}
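
Because the JSON output is keyed by path, it is straightforward to post-process. The Python sketch below reads a report and flags paths that are often absent or look enum-like. It assumes only the field names shown in the example above; the function name flag_paths and both thresholds are hypothetical, and the sample report is hand-written and trimmed rather than real tool output.

```python
import json


def flag_paths(report, absent_threshold=0.1, enum_max_cardinality=5):
    """Return {path: [notes]} for paths that deserve a closer look."""
    flags = {}
    for path, stats in report["paths"].items():
        notes = []
        if stats.get("absent_rate", 0.0) >= absent_threshold:
            notes.append(f"absent in {stats['absent_rate']:.1%} of parent records")
        # Non-numeric paths with few distinct values are enum candidates.
        if "numeric" not in stats and stats["cardinality"] <= enum_max_cardinality:
            notes.append("enum-like candidate")
        if notes:
            flags[path] = notes
    return flags


# Hand-written sample using the same field names as the report above.
sample = json.loads("""
{
  "document": "claim.json",
  "paths": {
    "$.claims[*].service_lines[*].status": {
      "count": 14, "cardinality": 3, "null_rate": 0.0, "absent_rate": 0.0
    },
    "$.claims[*].service_lines[*].allowed_amount": {
      "count": 11, "cardinality": 9, "null_rate": 0.0, "absent_rate": 0.214,
      "numeric": {"min": 32.0, "max": 875.0}
    }
  }
}
""")
flags = flag_paths(sample)
for path, notes in flags.items():
    print(path, "->", "; ".join(notes))
```

In practice the report would come from the output of vajra stats claim.json --format json rather than an inline string.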

When to Use It

  • Understanding data distributions. What does the charge_amount field actually look like? What are the common status values? How much entropy does this field carry?
  • Finding hidden nulls and absences. A field with 21% absent rate across service lines is operationally significant — stats surfaces this.
  • Establishing baselines. Run stats on today’s batch. Run it again tomorrow. Compare the distributions manually or feed them to drift.
  • Identifying enum-like fields. Low cardinality + low entropy = enum. High cardinality + high entropy = identifier. stats makes this distinction quantitative.
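
For the manual baseline comparison mentioned above, a small script over two JSON reports is often enough. This Python sketch diffs a few per-path rates between two runs. It is not a substitute for drift: the function name compare_reports is hypothetical, the compared fields are those shown in the JSON example, and the two reports are made-up fragments.

```python
def compare_reports(baseline, current,
                    fields=("null_rate", "absent_rate", "normalized_entropy")):
    """Return {path: {field: delta}} for fields that changed between two reports."""
    deltas = {}
    shared = baseline["paths"].keys() & current["paths"].keys()
    for path in sorted(shared):
        before, after = baseline["paths"][path], current["paths"][path]
        changed = {
            field: round(after[field] - before[field], 6)
            for field in fields
            if field in before and field in after and after[field] != before[field]
        }
        if changed:
            deltas[path] = changed
    return deltas


# Made-up fragments of two runs over the same path.
path = "$.claims[*].service_lines[*].allowed_amount"
today = {"paths": {path: {"null_rate": 0.0, "absent_rate": 0.0}}}
tomorrow = {"paths": {path: {"null_rate": 0.0, "absent_rate": 0.214}}}
deltas = compare_reports(today, tomorrow)
print(deltas)
```

Paths that appear in only one of the two reports are ignored here; a fuller comparison would report them as added or removed.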

Pairs Well With

  • inspect: structural overview before the statistical deep dive
  • anomalies: stats computes the distributions; anomalies flags what deviates from them
  • essence: the essence builder uses stats internally to score observation importance
  • invariants: cross-field analysis builds on per-field statistics