Information Theory

Vajra’s analytical core is an information-theoretic pipeline. Every measure of diversity, anomaly, drift, similarity, and dependency traces back to a concept from information theory. This chapter covers the full stack — from foundational primitives through composite metrics to the scoring model that turns bits into insights.


Foundation: Shannon Entropy

The starting point. Shannon entropy measures the average surprise per observation:

H(X) = -sum p(x) * log2(p(x))
  • 0 bits: constant field (no information)
  • log2(k) bits: uniform distribution over k values (maximum information)
  • Between: the interesting space where identifiers, dates, and codes live

Normalized entropy scales to [0, 1]:

H_norm(X) = H(X) / log2(|support|)

This is the single most important signal in Vajra. A field with H_norm near 0 is nearly constant and carries almost no information. A field with H_norm near 1 is unstructured randomness. Meaningful variation lives in the middle.
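As an illustration of the two formulas above, here is a minimal sketch that computes H(X) and H_norm(X) from raw values. The function names are hypothetical and are not taken from vajra-stats; returning 0 when the support has fewer than two values is an assumption of the sketch, since the normalized formula is undefined there.

```rust
use std::collections::{HashMap, HashSet};
use std::hash::Hash;

/// Shannon entropy, in bits, of the empirical distribution of `values`.
fn shannon_entropy<T: Hash + Eq>(values: &[T]) -> f64 {
    let mut counts: HashMap<&T, usize> = HashMap::new();
    for v in values {
        *counts.entry(v).or_insert(0) += 1;
    }
    let n = values.len() as f64;
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum::<f64>()
}

/// H(X) / log2(|support|). Returning 0 for a support of size 0 or 1 is an
/// assumption of this sketch; the formula itself is undefined there.
fn normalized_entropy<T: Hash + Eq>(values: &[T]) -> f64 {
    let support = values.iter().collect::<HashSet<_>>().len();
    if support <= 1 {
        return 0.0;
    }
    shannon_entropy(values) / (support as f64).log2()
}

fn main() {
    let constant = ["active", "active", "active", "active"];
    let uniform = ["a1", "b2", "c3", "d4"];
    let skewed = ["x", "x", "x", "y"];
    println!("constant: {:.3}", normalized_entropy(&constant));
    println!("uniform:  {:.3}", normalized_entropy(&uniform));
    println!("skewed:   {:.3}", normalized_entropy(&skewed));
}
```

The constant field lands at 0, the all-distinct field at 1, and the skewed field in between.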

Files: vajra-stats/src/entropy.rs


The Renyi Spectrum

Shannon entropy is one point on a continuous family parameterized by alpha:

H_alpha(X) = (1 / (1 - alpha)) * log2(sum p(x)^alpha)
| alpha | Name | What It Measures |
|---|---|---|
| 0 | Hartley | log2(support size): how many distinct values exist |
| 1 | Shannon | average surprise (limit as alpha approaches 1) |
| 2 | Collision | -log2(sum p^2): based on the probability that two random draws match |
| infinity | Min-entropy | -log2(max p): worst-case unpredictability |

Why a spectrum? A single entropy number hides the shape of the distribution. The Renyi spectrum reveals it:

  • High Shannon, low min-entropy: long tail with one dominant value
  • All orders equal: near-uniform distribution
  • Large divergence (H0 - H_inf): heavy concentration with many rare values

Security application: Min-entropy is the correct measure for cryptographic key strength — not Shannon. A key with high Shannon but low min-entropy has a predictable most-likely value.

Spectrum divergence (H0 - H_inf) is itself a signal: it quantifies how far the distribution is from uniform. Zero divergence = uniform. High divergence = concentrated.

Complexity: O(n), same as Shannon. Computed from the same frequency counts.
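The spectrum can be sketched from a probability vector as follows. This is an illustrative implementation under stated assumptions, not the code in renyi.rs: the function name is hypothetical, zero-probability entries are dropped, and the orders 0, 1, and infinity are handled as the special cases listed in the table.

```rust
/// Rényi entropy of order `alpha`, in bits, from a probability vector.
/// Orders 0, 1 and +infinity are handled as limiting cases.
fn renyi_entropy(probs: &[f64], alpha: f64) -> f64 {
    let p: Vec<f64> = probs.iter().copied().filter(|&x| x > 0.0).collect();
    if p.is_empty() {
        return 0.0;
    }
    if alpha == 0.0 {
        // Hartley: log2 of the support size.
        (p.len() as f64).log2()
    } else if (alpha - 1.0).abs() < 1e-12 {
        // Shannon: the limit as alpha approaches 1.
        p.iter().map(|&x| -x * x.log2()).sum::<f64>()
    } else if alpha.is_infinite() {
        // Min-entropy: determined by the most likely value.
        -p.iter().copied().fold(0.0_f64, f64::max).log2()
    } else {
        p.iter().map(|&x| x.powf(alpha)).sum::<f64>().log2() / (1.0 - alpha)
    }
}

fn main() {
    let skewed = [0.5, 0.25, 0.125, 0.125];
    for alpha in [0.0, 1.0, 2.0, f64::INFINITY] {
        println!("H_{} = {:.4}", alpha, renyi_entropy(&skewed, alpha));
    }
    let divergence = renyi_entropy(&skewed, 0.0) - renyi_entropy(&skewed, f64::INFINITY);
    println!("H0 - H_inf = {:.4}", divergence);
}
```

For this skewed distribution the orders decrease from 2 bits (Hartley) to 1 bit (min-entropy), so the spectrum divergence is 1 bit; for a uniform distribution every order coincides.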

Files: vajra-stats/src/renyi.rs


Structural Complexity: Lempel-Ziv

Shannon entropy measures average information per symbol. It cannot distinguish:

| Input | Shannon Entropy | LZ Complexity |
|---|---|---|
| Random UUIDs | High | High |
| PROJ-001, PROJ-002, … | High | Low |
| Repeated "active" | Low | Low |

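Lempel-Ziv complexity (LZ76) measures the number of distinct phrases needed to describe a sequence. The LZ76 algorithm scans left to right, extending the current phrase until it forms a substring that has not been seen before, then counts that phrase and starts a new one. The phrase count is normalized by n / log2(n), the asymptotic phrase count of a random binary sequence of length n: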

Normalized C_LZ = phrase_count / (n / log2(n))

The entropy-complexity plane has four quadrants:

| | Low LZ | High LZ |
|---|---|---|
| High entropy | Structured (patterned identifiers) | Random (UUIDs, hashes) |
| Low entropy | Constant (repeated values) | Anomalous (theoretically unlikely) |

A field in the “structured” quadrant (high entropy, low complexity) is a generated identifier with a pattern. A field in the “random” quadrant is truly unpredictable. Shannon alone cannot tell them apart.

Complexity: O(n) single pass. No external dependencies.
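The LZ76 parsing rule can be sketched with a deliberately naive search. This is not the single-pass implementation described above: it rescans the prefix for every extension and is therefore much slower than O(n). The function names are hypothetical, and treating inputs shorter than two symbols as complexity 0 is an assumption.

```rust
/// Number of phrases in the LZ76 parsing of `s` (naive reference version).
fn lz76_phrase_count(s: &[u8]) -> usize {
    let n = s.len();
    let mut i = 0;
    let mut count = 0;
    while i < n {
        let mut len = 1;
        // Extend the phrase while s[i..i+len] already starts somewhere before i
        // (the earlier occurrence is allowed to overlap the current phrase).
        while i + len <= n && (0..i).any(|j| s[j..j + len] == s[i..i + len]) {
            len += 1;
        }
        count += 1; // s[i..i+len] is new, or the input ran out mid-phrase
        i += len;
    }
    count
}

/// Normalized C_LZ = phrase_count / (n / log2(n)).
fn normalized_lz(s: &[u8]) -> f64 {
    if s.len() < 2 {
        return 0.0; // assumption: too short to normalize
    }
    let n = s.len() as f64;
    lz76_phrase_count(s) as f64 / (n / n.log2())
}

fn main() {
    for s in ["aaaaaaaaaaaaaaaa", "abababababababab", "0001101001000101"] {
        println!(
            "{}: {} phrases, C_LZ = {:.3}",
            s,
            lz76_phrase_count(s.as_bytes()),
            normalized_lz(s.as_bytes())
        );
    }
}
```

The repeated and alternating strings parse into 2 and 3 phrases, while the irregular bit string parses into 6 (0, 001, 10, 100, 1000, 101).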

Files: vajra-stats/src/lz_complexity.rs


Relationships: Conditional Entropy and PMI

Conditional Entropy H(Y|X)

How much knowing X reduces uncertainty about Y:

H(Y|X) = -sum p(x,y) * log2(p(y|x))
  • H(Y|X) = 0: X completely determines Y (functional dependency)
  • H(Y|X) = H(Y): X tells you nothing about Y (independence)

Relationship strength normalizes this:

strength = 1 - H(Y|X) / H(Y)

Clamped to [0, 1]. Zero = independent. One = deterministic.
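A minimal sketch of conditional entropy and relationship strength over categorical pairs, computed directly from the formulas above. The function names and the string-pair input are assumptions for illustration; returning a strength of 0 when H(Y) = 0 is also an assumption, since the ratio is undefined in that case.

```rust
use std::collections::HashMap;

/// Shannon entropy in bits of a stream of categorical values.
fn entropy<'a>(values: impl Iterator<Item = &'a str>) -> f64 {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    let mut n = 0usize;
    for v in values {
        *counts.entry(v).or_insert(0) += 1;
        n += 1;
    }
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n as f64;
            -p * p.log2()
        })
        .sum::<f64>()
}

/// H(Y|X) = -sum p(x,y) * log2(p(y|x)) over observed (x, y) pairs.
fn conditional_entropy(pairs: &[(&str, &str)]) -> f64 {
    let n = pairs.len() as f64;
    let mut joint: HashMap<(&str, &str), usize> = HashMap::new();
    let mut x_counts: HashMap<&str, usize> = HashMap::new();
    for &(x, y) in pairs {
        *joint.entry((x, y)).or_insert(0) += 1;
        *x_counts.entry(x).or_insert(0) += 1;
    }
    joint
        .iter()
        .map(|(&(x, _), &c)| {
            let p_xy = c as f64 / n;
            let p_y_given_x = c as f64 / x_counts[x] as f64;
            -p_xy * p_y_given_x.log2()
        })
        .sum::<f64>()
}

/// strength = 1 - H(Y|X) / H(Y), clamped to [0, 1].
fn relationship_strength(pairs: &[(&str, &str)]) -> f64 {
    let h_y = entropy(pairs.iter().map(|&(_, y)| y));
    if h_y == 0.0 {
        return 0.0; // assumption: a constant Y has nothing to explain
    }
    (1.0 - conditional_entropy(pairs) / h_y).clamp(0.0, 1.0)
}

fn main() {
    let city_country = [("NYC", "US"), ("LA", "US"), ("Paris", "FR"), ("Lyon", "FR")];
    let coin_flips = [("a", "0"), ("a", "1"), ("b", "0"), ("b", "1")];
    println!("city -> country: {:.2}", relationship_strength(&city_country));
    println!("independent:     {:.2}", relationship_strength(&coin_flips));
}
```

The city column fully determines the country column (strength 1), while the second pair of columns is independent (strength 0).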

Pointwise Mutual Information

Measures co-occurrence strength between specific value pairs:

PMI(x, y) = log2(P(x,y) / (P(x) * P(y)))

Positive = co-occur more than chance. Negative = avoid each other. Zero = independent.
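As a worked example of the PMI formula, the sketch below estimates the probabilities from raw co-occurrence counts. The function name and count-based signature are hypothetical, and all counts are assumed to be non-zero.

```rust
/// PMI(x, y) in bits, estimated from counts over `n` observations.
/// Assumes all counts are non-zero.
fn pmi(count_xy: u64, count_x: u64, count_y: u64, n: u64) -> f64 {
    let p_xy = count_xy as f64 / n as f64;
    let p_x = count_x as f64 / n as f64;
    let p_y = count_y as f64 / n as f64;
    (p_xy / (p_x * p_y)).log2()
}

fn main() {
    // x and y each appear in 50 of 100 records.
    println!("always together: {:.2}", pmi(50, 50, 50, 100)); // co-occur twice as often as chance
    println!("independent:     {:.2}", pmi(25, 50, 50, 100)); // exactly the chance rate
    println!("avoid:           {:.2}", pmi(5, 50, 50, 100)); // rarer than chance
}
```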

Total Correlation

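Pairwise measures miss higher-order structure. Three fields can be pairwise independent yet jointly constrained: if a flag is the XOR of two independent bits, no pair of fields shows any dependency, but any two fields determine the third. More commonly, a group such as city + state + zip carries redundancy that no single pairwise measure summarizes. Total correlation captures both: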

TC(X1,...,Xn) = sum H(Xi) - H(X1,...,Xn)
  • TC = 0: all fields are independent
  • High TC: the schema has deep internal structure
  • TC / sum H(Xi): normalized to [0, 1]

Total correlation answers: “how much redundancy exists across all fields simultaneously?” This is the gap between pairwise analysis and true multivariate dependency.

Complexity: O(n) for marginals. Joint entropy estimated via binning, bounded to 8-field subsets for tractability.
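The following sketch computes total correlation exactly for small categorical rows. It does not reproduce the binning or the 8-field subset bound mentioned above, and the function names and row representation are assumptions. The demo uses an XOR relationship, where every pair of columns is independent but the three columns together are not.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Shannon entropy in bits of the empirical distribution of `items`.
fn entropy<T: Hash + Eq>(items: impl Iterator<Item = T>) -> f64 {
    let mut counts: HashMap<T, usize> = HashMap::new();
    let mut n = 0usize;
    for item in items {
        *counts.entry(item).or_insert(0) += 1;
        n += 1;
    }
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n as f64;
            -p * p.log2()
        })
        .sum::<f64>()
}

/// TC(X1,...,Xn) = sum H(Xi) - H(X1,...,Xn), with each row treated as one joint symbol.
fn total_correlation(rows: &[Vec<&str>]) -> f64 {
    let width = rows.first().map_or(0, |r| r.len());
    let marginals: f64 = (0..width)
        .map(|j| entropy(rows.iter().map(|r| r[j])))
        .sum();
    marginals - entropy(rows.iter())
}

fn main() {
    // Third column is the XOR of the first two.
    let rows = vec![
        vec!["0", "0", "0"],
        vec!["0", "1", "1"],
        vec!["1", "0", "1"],
        vec!["1", "1", "0"],
    ];
    println!("TC = {:.2} bits", total_correlation(&rows));
}
```

Each column carries 1 bit and the joint distribution carries 2 bits, so TC is 1 bit even though every pairwise strength is zero.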

Files: vajra-stats/src/relationships.rs, vajra-stats/src/total_correlation.rs


Distributional Drift: JSD and Wasserstein

Jensen-Shannon Divergence

Symmetric, bounded, and a proper metric (via square root):

JSD(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M)

where M = 0.5 * (P + Q) and KL is Kullback-Leibler divergence.

  • JSD in [0, 1] with log base 2
  • sqrt(JSD) satisfies the triangle inequality (Endres & Schindelin 2003)
  • Used for categorical distribution drift
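A minimal sketch of JSD for two categorical distributions given as aligned probability vectors. The function names are hypothetical, and both inputs are assumed to be indexed by the same categories and to sum to 1.

```rust
/// KL(P || Q) in bits; terms with p = 0 contribute nothing.
fn kl(p: &[f64], q: &[f64]) -> f64 {
    p.iter()
        .zip(q)
        .map(|(&pi, &qi)| if pi > 0.0 { pi * (pi / qi).log2() } else { 0.0 })
        .sum::<f64>()
}

/// JSD(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = 0.5 * (P + Q).
fn jsd(p: &[f64], q: &[f64]) -> f64 {
    assert_eq!(p.len(), q.len(), "distributions must share the same categories");
    let m: Vec<f64> = p.iter().zip(q).map(|(&a, &b)| 0.5 * (a + b)).collect();
    0.5 * kl(p, &m) + 0.5 * kl(q, &m)
}

fn main() {
    let last_week = [0.70, 0.20, 0.10];
    let this_week = [0.40, 0.20, 0.40];
    let d = jsd(&last_week, &this_week);
    println!("JSD = {:.4} bits, sqrt(JSD) = {:.4}", d, d.sqrt());
}
```

Because M is positive wherever P or Q is, the KL terms never divide by zero, which is what keeps JSD finite and bounded.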

1D Wasserstein Distance

For numeric distributions, measures the “earth mover’s distance”:

W1 = integral |CDF_a(x) - CDF_b(x)| dx

JSD tells you the distributions changed. Wasserstein tells you by how much — it captures the magnitude of the shift, not just its existence.
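The integral above can be evaluated exactly for two samples by integrating the gap between their empirical CDFs over the merged sample points. This is a sketch under assumptions: the function name is hypothetical, both samples are non-empty, and no value is NaN.

```rust
/// 1D Wasserstein-1 distance between two samples via their empirical CDFs.
/// Assumes both slices are non-empty.
fn wasserstein_1d(a: &[f64], b: &[f64]) -> f64 {
    let mut a = a.to_vec();
    let mut b = b.to_vec();
    a.sort_by(|x, y| x.total_cmp(y));
    b.sort_by(|x, y| x.total_cmp(y));
    let mut all: Vec<f64> = a.iter().chain(b.iter()).copied().collect();
    all.sort_by(|x, y| x.total_cmp(y));

    let mut dist = 0.0;
    for w in all.windows(2) {
        // Both empirical CDFs are constant on [w[0], w[1]).
        let cdf_a = a.partition_point(|&v| v <= w[0]) as f64 / a.len() as f64;
        let cdf_b = b.partition_point(|&v| v <= w[0]) as f64 / b.len() as f64;
        dist += (cdf_a - cdf_b).abs() * (w[1] - w[0]);
    }
    dist
}

fn main() {
    let before = [0.0, 1.0, 2.0];
    let after = [5.0, 6.0, 7.0];
    println!("W1 = {:.2}", wasserstein_1d(&before, &after));
}
```

Shifting every value by 5 gives W1 = 5, expressed in the same units as the data.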

When to use which:

| Data Type | Metric | Why |
|---|---|---|
| Categorical (strings, enums) | JSD | Probability mass redistribution |
| Numeric (amounts, counts) | Wasserstein | Shift magnitude in original units |

Files: vajra-drift/src/jsd.rs, vajra-drift/src/wasserstein.rs


Directed Information Flow: Transfer Entropy

Transfer entropy measures how much knowing the past of X reduces uncertainty about Y’s future, beyond what Y’s own past already tells you:

TE(X->Y) = H(Y_t | Y_{t-1}^k) - H(Y_t | Y_{t-1}^k, X_{t-1}^l)

Key properties:

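  • Directional: TE(X->Y) != TE(Y->X) in general, so the asymmetry indicates the direction of predictive information flow (suggestive of causal direction, not proof of it)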
  • Non-negative: information can only help prediction
  • Granger causality generalized: captures nonlinear dependencies

This transforms cascade detection from temporal pattern matching into rigorous directed information flow quantification. Instead of “A happened before B,” transfer entropy says “A’s past carries 2.3 bits of information about B’s future that B’s own history doesn’t contain.”

Net information flow = TE(X->Y) - TE(Y->X). Positive means X drives Y. Negative means Y drives X.

Complexity: O(n * k) where k is history depth. Deterministic with fixed binning.
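A plug-in sketch of transfer entropy for already-discretized series with history lengths k = l = 1. Each conditional entropy is expanded with the chain rule, H(A|B) = H(A,B) - H(B). The function names and the synthetic `lagged_copy` series are hypothetical and exist only for illustration; binning of real-valued data is not shown.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Shannon entropy in bits of the empirical distribution of `items`.
fn entropy<T: Hash + Eq>(items: impl Iterator<Item = T>) -> f64 {
    let mut counts: HashMap<T, usize> = HashMap::new();
    let mut n = 0usize;
    for item in items {
        *counts.entry(item).or_insert(0) += 1;
        n += 1;
    }
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n as f64;
            -p * p.log2()
        })
        .sum::<f64>()
}

/// TE(X->Y) = H(Y_t | Y_{t-1}) - H(Y_t | Y_{t-1}, X_{t-1}) with k = l = 1.
fn transfer_entropy(x: &[u8], y: &[u8]) -> f64 {
    assert_eq!(x.len(), y.len());
    if x.len() < 2 {
        return 0.0;
    }
    let t = 1..x.len();
    let h_y_ypast = entropy(t.clone().map(|i| (y[i], y[i - 1])));
    let h_ypast = entropy(t.clone().map(|i| y[i - 1]));
    let h_y_ypast_xpast = entropy(t.clone().map(|i| (y[i], y[i - 1], x[i - 1])));
    let h_ypast_xpast = entropy(t.clone().map(|i| (y[i - 1], x[i - 1])));
    (h_y_ypast - h_ypast) - (h_y_ypast_xpast - h_ypast_xpast)
}

/// Hypothetical demo data: x is a pseudo-random bit stream, y copies x one step later.
fn lagged_copy(n: usize) -> (Vec<u8>, Vec<u8>) {
    let mut state: u64 = 42;
    let x: Vec<u8> = (0..n)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            (state >> 63) as u8
        })
        .collect();
    let y: Vec<u8> = (0..n).map(|t| if t == 0 { 0 } else { x[t - 1] }).collect();
    (x, y)
}

fn main() {
    let (x, y) = lagged_copy(2000);
    let forward = transfer_entropy(&x, &y);
    let backward = transfer_entropy(&y, &x);
    println!("TE(X->Y) = {:.3} bits", forward);
    println!("TE(Y->X) = {:.3} bits", backward);
    println!("net flow = {:.3} bits", forward - backward);
}
```

Since y is a delayed copy of x, nearly a full bit flows from X to Y and almost nothing flows back, so the net flow is positive.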

Files: vajra-stats/src/transfer_entropy.rs


Universal Similarity: NCD

Normalized Compression Distance approximates the normalized information distance, a Kolmogorov-complexity-based measure that is universal among admissible normalized distances:

NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))

where C is a real compressor (zstd at fixed level 3).

Why NCD is more general than feature-based similarity: MinHash captures set overlap. SimHash captures angular proximity. Both require choosing features. NCD captures whatever regularities the compressor can exploit (structure, patterns, naming conventions, content) with zero feature engineering.

Two documents that share structural patterns but few literal values tend to have low NCD. Two documents that share scattered tokens but differ in structure tend to have high NCD.

  • NCD(x, x) approaches 0 (self-similarity)
  • NCD(x, random) approaches 1 (dissimilarity)
  • Approximately symmetric: NCD(x, y) ≈ NCD(y, x); a real compressor can give slightly different sizes for C(xy) and C(yx)
  • Deterministic given fixed compressor and level

Complexity: O(n) per compression of an n-byte input. An all-pairs matrix over m documents needs O(m^2) compressions of concatenated pairs, with each C(x) cached.
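The NCD formula can be sketched independently of the compressor by passing the compressed-size function in as a closure. To keep the example free of external crates, it uses a toy stand-in, an LZ78 phrase count, rather than zstd; that substitution and all names here are assumptions of the sketch. In practice the closure would return the byte length of the real compressor's output, and a toy compressor this crude does not bring NCD(x, x) close to 0.

```rust
use std::collections::HashSet;

/// NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
/// where `c` returns a compressed size for a byte slice.
fn ncd(x: &[u8], y: &[u8], c: impl Fn(&[u8]) -> usize) -> f64 {
    let (cx, cy) = (c(x), c(y));
    let xy = [x, y].concat();
    let cxy = c(&xy);
    (cxy as f64 - cx.min(cy) as f64) / cx.max(cy) as f64
}

/// Toy stand-in for a compressor: the number of phrases in an LZ78 parse.
/// Only meant to exercise the formula without external dependencies.
fn lz78_phrases(data: &[u8]) -> usize {
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut current: Vec<u8> = Vec::new();
    let mut phrases = 0;
    for &byte in data {
        current.push(byte);
        if seen.insert(current.clone()) {
            phrases += 1; // first time this phrase appears
            current.clear();
        }
    }
    if !current.is_empty() {
        phrases += 1; // trailing partial phrase
    }
    phrases
}

fn main() {
    let x = b"abababab";
    let y = b"cdefghij";
    println!("NCD(x, x) = {:.3}", ncd(x, x, lz78_phrases));
    println!("NCD(x, y) = {:.3}", ncd(x, y, lz78_phrases));
}
```

Even with this weak compressor, the self-comparison scores lower than the comparison against unrelated content.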

Files: vajra-fingerprint/src/ncd.rs


Anomaly Scoring

Self-Information (Surprisal)

The rarity of a single observation:

I(x) = -log2(p(x))

A value seen once in 10,000 observations carries 13.3 bits of rarity. This is the information-theoretic foundation of rare value detection.
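The 13.3-bit figure follows directly from the formula; a small sketch (with a hypothetical function name, and p(x) estimated as count / total) makes the arithmetic checkable.

```rust
/// Self-information in bits of a value seen `count` times in `total` observations.
fn surprisal_bits(count: u64, total: u64) -> f64 {
    -(count as f64 / total as f64).log2()
}

fn main() {
    println!("1 in 10,000: {:.1} bits", surprisal_bits(1, 10_000));
    println!("1 in 2:      {:.1} bits", surprisal_bits(5_000, 10_000));
}
```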

MAD-Based Outlier Detection

Median Absolute Deviation with modified z-scores:

z_MAD = 0.6745 * (x - median) / MAD

Values with |z_MAD| > 3.5 are flagged. MAD has a 50% breakdown point — half the data can be corrupted before it gives misleading results.
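A minimal sketch of the modified z-score test. The function names are hypothetical; returning no outliers when the MAD is 0 (more than half the values identical) is an assumption of the sketch, since the score is undefined there.

```rust
/// Median of a slice (sorts in place). Assumes the slice is non-empty.
fn median(v: &mut [f64]) -> f64 {
    v.sort_by(|a, b| a.total_cmp(b));
    let n = v.len();
    if n % 2 == 1 {
        v[n / 2]
    } else {
        0.5 * (v[n / 2 - 1] + v[n / 2])
    }
}

/// Values whose modified z-score 0.6745 * (x - median) / MAD exceeds `threshold`
/// in absolute value.
fn mad_outliers(data: &[f64], threshold: f64) -> Vec<f64> {
    if data.is_empty() {
        return Vec::new();
    }
    let mut sorted = data.to_vec();
    let med = median(&mut sorted);
    let mut deviations: Vec<f64> = data.iter().map(|x| (x - med).abs()).collect();
    let mad = median(&mut deviations);
    if mad == 0.0 {
        return Vec::new(); // assumption: score undefined, flag nothing
    }
    data.iter()
        .copied()
        .filter(|&x| (0.6745 * (x - med) / mad).abs() > threshold)
        .collect()
}

fn main() {
    let latencies = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 100.0];
    println!("outliers: {:?}", mad_outliers(&latencies, 3.5));
}
```

Here the median is 10.5 and the MAD is 0.5, so the value 100 scores far above 3.5 while 9 and 12 score about 2 and are kept.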

Benford’s Law

Leading digit distribution for numeric fields:

P(d) = log10(1 + 1/d)

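Conformity is tested via chi-squared and Nigrini’s MAD score (here MAD is the mean absolute deviation between observed and expected digit frequencies, not the median absolute deviation above). Non-conformity (MAD > 0.015) signals potentially fabricated or unusual numeric data.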
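A sketch of the first-digit conformity score: the mean absolute deviation between observed leading-digit frequencies and log10(1 + 1/d). The function names are hypothetical, the chi-squared test is not shown, and extracting the leading digit from scientific-notation formatting is a convenience chosen for this sketch.

```rust
/// Leading non-zero digit of |x|, or None for zero, NaN, and infinities.
fn leading_digit(x: f64) -> Option<usize> {
    // Scientific notation puts the leading digit first, e.g. "6.3e29".
    let s = format!("{:e}", x.abs());
    let d = s.chars().next()?.to_digit(10)? as usize;
    if d >= 1 {
        Some(d)
    } else {
        None
    }
}

/// Mean absolute deviation between observed first-digit frequencies and Benford's law.
fn benford_mad(values: &[f64]) -> Option<f64> {
    let mut counts = [0usize; 9];
    let mut n = 0usize;
    for d in values.iter().filter_map(|&x| leading_digit(x)) {
        counts[d - 1] += 1;
        n += 1;
    }
    if n == 0 {
        return None;
    }
    let total: f64 = (1..=9)
        .map(|d| {
            let expected = (1.0_f64 + 1.0 / d as f64).log10();
            let observed = counts[d - 1] as f64 / n as f64;
            (observed - expected).abs()
        })
        .sum();
    Some(total / 9.0)
}

fn main() {
    // Powers of two are a classic Benford-conforming sequence.
    let powers: Vec<f64> = (0..100).map(|k| 2f64.powi(k)).collect();
    // Every three-digit integer: leading digits are uniform, not Benford.
    let uniform: Vec<f64> = (100..1000).map(|v| v as f64).collect();
    println!("powers of two: {:?}", benford_mad(&powers));
    println!("100..999:      {:?}", benford_mad(&uniform));
}
```

The powers of two fall under the 0.015 threshold, while the uniformly distributed leading digits land well above it.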


The Six-Dimensional Scoring Model

Every observation is scored across six information-theoretic dimensions:

| Dimension | Source | Range |
|---|---|---|
| rarity | Self-information, cardinality | [0, 1] |
| instability | Type distribution: 1 - (dominant/total) | [0, 1] |
| entropy_signal | Normalized Shannon entropy | [0, 1] |
| structural_coverage | Null rate, enum-like patterns | [0, 1] |
| anomaly_strength | MAD z-scores, rarity magnitude | [0, 1] |
| concern_relevance | Domain-specific importance | [0, 1] |

The composite score is a weighted sum:

score = sum weight_i * dimension_i

Weights depend on the concern profile:

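| Profile | Rarity | Instability | Entropy | Coverage | Anomaly | Concern |
|---|---|---|---|---|---|---|
| Engineer | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 |
| Staff | 0.10 | 0.10 | 0.10 | 0.25 | 0.30 | 0.15 |
| Fraud | 0.25 | 0.10 | 0.10 | 0.10 | 0.35 | 0.10 |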
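The weighted sum can be sketched with the profile rows from the table as plain arrays, in the column order rarity, instability, entropy, coverage, anomaly, concern. The constant and function names are hypothetical, and the observation values in the demo are made up for illustration. The sketch does not assume the weights sum to 1 (the Engineer row as listed sums to 0.90).

```rust
// Weights in column order: rarity, instability, entropy, coverage, anomaly, concern.
const ENGINEER: [f64; 6] = [0.15, 0.15, 0.15, 0.15, 0.15, 0.15];
const STAFF: [f64; 6] = [0.10, 0.10, 0.10, 0.25, 0.30, 0.15];
const FRAUD: [f64; 6] = [0.25, 0.10, 0.10, 0.10, 0.35, 0.10];

/// score = sum weight_i * dimension_i, with each dimension in [0, 1].
fn composite_score(weights: &[f64; 6], dimensions: &[f64; 6]) -> f64 {
    weights.iter().zip(dimensions).map(|(w, d)| w * d).sum()
}

fn main() {
    // Illustrative observation: very rare and strongly anomalous, otherwise unremarkable.
    let observation = [0.9, 0.0, 0.5, 0.2, 0.8, 0.3];
    println!("engineer: {:.3}", composite_score(&ENGINEER, &observation));
    println!("staff:    {:.3}", composite_score(&STAFF, &observation));
    println!("fraud:    {:.3}", composite_score(&FRAUD, &observation));
}
```

The same observation scores highest under the Fraud profile because that profile puts the most weight on rarity and anomaly strength.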

The Integration Pipeline

JSON Document
  |
  v
[Stats Analyzer] --- entropy, Renyi spectrum, LZ complexity, cardinality, rarity
  |
  v
[Anomaly Analyzer] --- rare values (surprisal), type instabilities, MAD outliers
  |
  v
[Relationship Discovery] --- conditional entropy, PMI, total correlation
  |
  v
[Drift Analyzer] --- JSD (categorical), Wasserstein (numeric), severity
  |
  v
[Cascade Analyzer] --- transfer entropy, directed information flow
  |
  v
[Feature Store] --- PathFeatures with all information-theoretic signals
  |
  v
[Essence Builder] --- ScoredObservations across 6 dimensions
  |
  v
[Profile Scorer] --- Weighted composite score
  |
  v
[EssenceData] --- Prioritized findings for humans and AI

Every anomaly signal, every drift measurement, every relationship discovery, and every cascade detection is rooted in information theory. The entire system is fundamentally an information-theoretic lens on structured data.