invariants

invariants discovers cross-field relationships from observed data. It finds fields that predict other fields, fields that always co-occur, and fields that are functionally dependent — all without prior knowledge of the schema.

This is data archaeology. Vajra examines the statistical co-occurrence of fields and extracts the latent rules that the data obeys.

Usage

vajra invariants <input> [flags]

Arguments:

Argument	Description
`<input>`	Path to a JSON file, NDJSON batch, `-` for stdin, or directory

Flags:

Flag	Description	Default
`--top-k <N>`	Maximum number of field pairs to consider	50
`--format <fmt>`	Output format: `text`, `json`, `markdown`, `compact-ai`	`text`
`--input-format <fmt>`	Override auto-detected input format	auto
`--redact`	Apply built-in redaction before output	off
`--quiet`	Suppress progress output	off

The Mathematics

Conditional Entropy

For field pairs (X, Y):

H(Y|X) = -sum p(x,y) * log2(p(y|x))

Low H(Y|X) means X strongly predicts Y. If H(Y|X) approaches 0, Y is functionally determined by X — knowing X tells you Y with near-certainty.
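
As a rough illustration of the formula, the sketch below estimates H(Y|X) from a list of observed (x, y) value pairs. The function name `conditional_entropy` and the sample data are hypothetical and are not part of Vajra; this shows the calculation only, not Vajra's implementation.

```python
from collections import Counter
from math import log2

def conditional_entropy(pairs):
    """Estimate H(Y|X) in bits from a list of observed (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)                      # counts of each (x, y)
    marginal_x = Counter(x for x, _ in pairs)   # counts of each x
    h = 0.0
    for (x, _y), c_xy in joint.items():
        p_xy = c_xy / n                  # p(x, y)
        p_y_given_x = c_xy / marginal_x[x]  # p(y | x)
        h -= p_xy * log2(p_y_given_x)
    return h

# Hypothetical data: each id maps to exactly one name.
exact = [("a", "Ann"), ("a", "Ann"), ("b", "Bob"), ("b", "Bob")]
# Hypothetical data: id "a" maps to two names equally often.
noisy = [("a", "Ann"), ("a", "Anne"), ("b", "Bob"), ("b", "Bob")]

print(conditional_entropy(exact))  # 0.0
print(conditional_entropy(noisy))  # 0.5
```

In the second case half of the records carry one bit of uncertainty about the name, so the result is 0.5 bits.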

Pointwise Mutual Information (PMI)

PMI(x, y) = log2(P(x, y) / (P(x) * P(y)))

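Positive PMI means x and y co-occur more often than they would if they were independent. Negative PMI means they co-occur less often than independence predicts. A PMI of zero means the pair co-occurs exactly as often as independence predicts.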

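PMI is a standard information-theoretic measure of association between two specific outcomes; averaged over all outcome pairs, it gives the mutual information of the two fields.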
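
The PMI formula can be evaluated directly from counts. The sketch below is illustrative only: the function name `pmi`, its parameters, and the counts are assumptions, and returning negative infinity for pairs that never co-occur is one possible convention rather than documented Vajra behavior.

```python
from math import log2

def pmi(n_xy, n_x, n_y, n):
    """PMI in bits from raw counts.

    n_xy: records where x and y both occur
    n_x, n_y: records where x (resp. y) occurs
    n: total records
    """
    if n_xy == 0:
        return float("-inf")  # assumed convention: never co-occur
    # P(x,y) / (P(x) * P(y)) simplifies to n_xy * n / (n_x * n_y).
    return log2(n_xy * n / (n_x * n_y))

# Hypothetical counts: 1,000 records, 100 containing x, 100 containing y.
print(pmi(80, 100, 100, 1000))  # 3.0: co-occur 8x more than chance
print(pmi(10, 100, 100, 1000))  # 0.0: exactly the count independence predicts
print(pmi(1, 100, 100, 1000))   # negative: co-occur 10x less than chance
```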

Discovery Procedure

Screen: consider only paths with observation count > 30 (configurable). This filters noise.
Compute: for all pairs among the top-k most frequent paths, calculate conditional entropy and PMI.
Rank: ascending H(Y|X) for dependency strength, descending |PMI| for association strength.
Report: the strongest relationships with examples from the data.

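With k = 50, this is 1,225 unordered pairs (2,450 if each direction of H(Y|X) is counted separately), which is trivial even on large datasets. Unlike general association rule mining, which explores an exponential itemset space, this approach is bounded by design: the number of pairs grows only quadratically in k.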
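
The screen and pair-enumeration steps can be sketched as follows. This is a minimal illustration under stated assumptions: `candidate_pairs`, `top_k`, and `min_count` are hypothetical names, the path counts are invented, and Vajra's internals may differ. Counting unordered pairs among 50 paths gives 1,225, the figure shown in the example output; evaluating H(Y|X) in both directions doubles the number of entropy calculations.

```python
from collections import Counter
from itertools import combinations

def candidate_pairs(path_counts, top_k=50, min_count=30):
    """Screen paths by observation count, keep the top_k most frequent,
    and return all unordered pairs among them."""
    screened = {p: c for p, c in path_counts.items() if c > min_count}
    top = [p for p, _ in Counter(screened).most_common(top_k)]
    return list(combinations(top, 2))

# Hypothetical path counts: 60 frequent paths and 5 rare ones.
counts = {f"$.field_{i}": 1000 - i for i in range(60)}
counts.update({f"$.rare_{i}": 3 for i in range(5)})

pairs = candidate_pairs(counts)
print(len(pairs))  # 1225, i.e. 50 * 49 / 2
```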

Example: Text Output

vajra invariants claims_batch.ndjson

=== Cross-Field Invariants ===
Records analyzed: 1,247
Field pairs screened: 1,225 (top 50 paths)

--- Functional Dependencies (H(Y|X) < 0.1) ---
  $.claims[*].subscriber.id -> $.claims[*].subscriber.name
    H(name|id) = 0.00
    subscriber.id fully determines subscriber.name
    Example: id "SUB-4421" -> name "Martinez, Elena" (47 records)

  $.claims[*].provider.npi -> $.claims[*].provider.name
    H(name|npi) = 0.03
    provider.npi nearly determines provider.name (3 exceptions in 1,247)
    Example: npi "1234567890" -> name "Valley Medical Group" (312 records)

--- Strong Co-occurrence (PMI > 2.0) ---
  $.claims[*].status = "denied" <-> $.claims[*].denial_reason present
    PMI = 3.8
    When status is "denied", denial_reason is present 97% of the time.
    When status is not "denied", denial_reason is present 2% of the time.

  $.claims[*].service_lines[*].procedure_code <-> $.claims[*].service_lines[*].service_date
    PMI = 3.2
    These fields co-occur in 99.8% of service lines. Effectively always together.

--- Conditional Presence ---
  $.claims[*].service_lines[*].modifier_codes
    Present in 100% of records where procedure_code starts with "9921"
    Present in 12% of records where procedure_code starts with "9939"
    Modifier presence is conditionally dependent on procedure type.

--- Anti-Correlation (PMI < -1.0) ---
  $.claims[*].status = "adjudicated" <-> $.claims[*].hold_reason present
    PMI = -2.1
    These rarely co-occur. Adjudicated claims almost never have hold reasons.

Example: JSON Output

vajra invariants claims_batch.ndjson --format json

{
  "records_analyzed": 1247,
  "pairs_screened": 1225,
  "functional_dependencies": [
    {
      "source": "$.claims[*].subscriber.id",
      "target": "$.claims[*].subscriber.name",
      "conditional_entropy": 0.0,
      "strength": "exact",
      "example": {
        "source_value": "SUB-4421",
        "target_value": "Martinez, Elena",
        "count": 47
      }
    },
    {
      "source": "$.claims[*].provider.npi",
      "target": "$.claims[*].provider.name",
      "conditional_entropy": 0.03,
      "strength": "near_exact",
      "exceptions": 3,
      "example": {
        "source_value": "1234567890",
        "target_value": "Valley Medical Group",
        "count": 312
      }
    }
  ],
  "co_occurrences": [
    {
      "field_a": "$.claims[*].status",
      "value_a": "denied",
      "field_b": "$.claims[*].denial_reason",
      "pmi": 3.8,
      "conditional_presence": 0.97
    }
  ],
  "anti_correlations": [
    {
      "field_a": "$.claims[*].status",
      "value_a": "adjudicated",
      "field_b": "$.claims[*].hold_reason",
      "pmi": -2.1
    }
  ]
}
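
A downstream script might consume this output as sketched below. The sketch assumes the JSON has the field names shown in the sample above; the helper `dependencies_with_exceptions` is hypothetical, and treating a missing `exceptions` key as zero is an assumption rather than documented behavior.

```python
import json

# A trimmed copy of the sample output above. In practice this text would
# come from running the command with --format json.
sample = """
{
  "functional_dependencies": [
    {"source": "$.claims[*].subscriber.id",
     "target": "$.claims[*].subscriber.name",
     "conditional_entropy": 0.0, "strength": "exact"},
    {"source": "$.claims[*].provider.npi",
     "target": "$.claims[*].provider.name",
     "conditional_entropy": 0.03, "strength": "near_exact", "exceptions": 3}
  ]
}
"""

def dependencies_with_exceptions(report):
    """Return (source, target, exceptions) for dependencies that are not exact."""
    return [
        (d["source"], d["target"], d.get("exceptions", 0))
        for d in report.get("functional_dependencies", [])
        if d.get("strength") != "exact"
    ]

report = json.loads(sample)
for source, target, exceptions in dependencies_with_exceptions(report):
    print(f"{source} -> {target}: {exceptions} exceptions")
```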

What Invariants Reveal

Functional dependencies are the strongest signal. When subscriber.id fully determines subscriber.name, that is not an accident — it reflects a real-world constraint. If that constraint breaks (a subscriber ID mapping to two different names), you have a data quality issue.

Co-occurrence patterns reveal implicit business rules. “When status is denied, denial_reason is present” is a rule that lives in the data, not in a schema. Vajra discovers it empirically.

Anti-correlations reveal mutual exclusions. Fields that rarely or never co-occur often represent different branches of a state machine: knowing which branch you are on determines which fields exist.

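Conditional presence reveals fields whose existence depends on the value of another field. This is where schemas tend to fall short in practice. JSON Schema can express “this field exists only when that field equals X” with `if`/`then`/`else` (draft-07 and later), but only once someone already knows the rule; Vajra finds it in the data.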
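
Conditional presence is a ratio: of the records that satisfy a condition, what fraction contain the field? The sketch below computes it over flat dictionaries. The function `presence_rate` and the sample service lines are hypothetical; the records are invented for illustration and are not taken from a real batch.

```python
def presence_rate(records, field, condition):
    """Fraction of records satisfying `condition` that contain `field`.
    Returns None when no record satisfies the condition."""
    matching = [r for r in records if condition(r)]
    if not matching:
        return None
    return sum(1 for r in matching if field in r) / len(matching)

# Hypothetical service lines.
lines = [
    {"procedure_code": "99213", "modifier_codes": ["25"]},
    {"procedure_code": "99214", "modifier_codes": ["25"]},
    {"procedure_code": "99395"},
    {"procedure_code": "99396", "modifier_codes": ["25"]},
]

def code_starts_with(prefix):
    """Build a condition that matches procedure codes with the given prefix."""
    return lambda r: r.get("procedure_code", "").startswith(prefix)

print(presence_rate(lines, "modifier_codes", code_starts_with("9921")))  # 1.0
print(presence_rate(lines, "modifier_codes", code_starts_with("9939")))  # 0.5
```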

When to Use It

Schema documentation. Discover the implicit rules that the data already obeys. Document them before they are lost.
Data quality rules. Turn discovered invariants into validation rules. If subscriber.id always determines subscriber.name, alert when it does not.
Onboarding. New to a dataset? invariants shows you the relationships between fields faster than reading documentation (which may not exist).
Audit evidence. Demonstrate that field dependencies are consistent across a batch.
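
As one way to turn a discovered functional dependency into a data quality rule, the sketch below flags source values that map to more than one target value. The function `dependency_violations` and the flattened records are hypothetical; a real pipeline would first extract the relevant fields from each claim.

```python
from collections import defaultdict

def dependency_violations(records, source_key, target_key):
    """Return source values that map to more than one distinct target value."""
    seen = defaultdict(set)
    for rec in records:
        if source_key in rec and target_key in rec:
            seen[rec[source_key]].add(rec[target_key])
    return {src: sorted(targets) for src, targets in seen.items() if len(targets) > 1}

# Hypothetical flattened subscriber records.
records = [
    {"id": "SUB-1", "name": "Ann Lee"},
    {"id": "SUB-1", "name": "Ann Lee"},
    {"id": "SUB-2", "name": "Bob Ray"},
    {"id": "SUB-2", "name": "Robert Ray"},
]

print(dependency_violations(records, "id", "name"))
# {'SUB-2': ['Bob Ray', 'Robert Ray']}
```

An empty result means the dependency holds exactly across the batch, which is also the kind of evidence an audit would ask for.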

Pairs Well With

stats — invariants build on per-field statistics (entropy, frequency, null rates)
anomalies — broken invariants (a dependency that holds 99% of the time but not in record 662) are anomalies
essence — discovered relationships appear in the essence as notable observations
drift — if an invariant holds in the baseline but breaks in the candidate, that is a significant drift signal
