Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Algorithms

This is the technical heart of Vajra. Every algorithm here was selected against three gates. Any algorithm that failed any gate was cut.


The Three Gates

Gate 1: Scale

O(n) or O(n log n) time complexity. Bounded or streaming-compatible memory. If an algorithm cannot handle a billion nodes without choking, it does not enter.

Gate 2: Battle-Tested

Published, peer-reviewed, deployed in production systems at scale. No novel algorithms. No research prototypes. No “clever tricks” that have not survived contact with real data.

Gate 3: Deterministic

Same input, same output. If an algorithm requires random sampling without seed control, or produces platform-dependent results, or depends on iteration order of an unordered collection — it does not enter.


The Algorithms

BLAKE3 Hashing

Provenance: O’Connor, Aumasson, Neves, Wilcox-O’Hearn, 2020. Rust-native reference implementation.

What it does: All hashing in Vajra. Path set fingerprints, typed path fingerprints, Merkle subtree hashing, content hashing, MinHash hash functions.

Why BLAKE3 over alternatives:

ContenderWhy Rejected
SHA-2563-7x slower on modern hardware. No parallelism.
SHA-3Slower than BLAKE3 on all platforms. No parallel tree structure.
xxHash / FNVNot cryptographic. Collision resistance matters for fingerprinting.
SipHashDesigned for hash tables, not content addressing. Slower for bulk data.

Why BLAKE3 wins: 3-7x faster than SHA-256. Internally parallelizable via Bao tree structure. 256-bit output with cryptographic strength. Rust-native. Deterministic. One algorithm for every hashing need in the system.

Complexity: O(n) time, O(1) memory per hash. Internally parallel for large inputs.


simd-json Parsing

Provenance: Langdale & Lemire, 2019. Based on the simdjson C++ library, ported to Rust.

What it does: DOM-mode JSON parsing at 2+ GB/s using SIMD instructions for structural character classification and string validation.

Why simd-json over alternatives:

ContenderWhy Rejected
serde_json400-800 MB/s. Adequate but not operational speed for large documents.
sonic-rsYoung ecosystem. Less battle-tested.
Manual SAX parserNecessary for streaming mode, but DOM mode needs random access.

Why simd-json wins: Measured 2+ GB/s on modern hardware. Uses SIMD for structural indexing. Operates on mutable borrowed slices for zero-copy access to string values.

Complexity: O(n) time. O(n) memory for the DOM.


RFC 8785 Canonicalization (JCS)

Provenance: IETF RFC 8785, published 2020. The standard for deterministic JSON serialization.

What it does: Removes irrelevant representational variance. Lexicographic key ordering by UTF-16 code unit sequence. Deterministic number formatting. No whitespace.

Extensions beyond the RFC:

  • Unicode NFC normalization (UAX #15) for string comparison stability
  • Null vs. absent distinction preservation (RFC 8785 does not address this; Vajra tracks it explicitly)
  • Configurable array order policy: preserve (default), set (unordered deduplicated), multiset (unordered with duplicates)

Complexity: O(n log k) where k = maximum keys per object. Memory: O(n).


Shannon Entropy

Provenance: Shannon, 1948. The foundation of information theory. Sixty-eight years of deployment in every field that measures information.

What it does: For each path, measures the information content of observed values.

H(X) = -sum p(x) * log2(p(x))

Normalized:

H_norm(X) = H(X) / log2(|support|)

Why entropy is the strongest universal primitive: It distinguishes boilerplate from signal without domain knowledge. A constant field (H = 0) is noise. A uniform random field (H_norm = 1) is unstructured. Meaningful variation lives in the middle — identifiers, dates, codes, status values.

Streaming computation: Maintained via running counts per value per path. When the value space exceeds memory, entropy is estimated from Count-Min Sketch frequency approximations.

Complexity: O(n) time, O(v) space where v = distinct values per path.


Count-Min Sketch (CMS) with Conservative Update

Provenance: Cormode & Muthukrishnan, 2005 (original CMS). Conservative update: Estan & Varghese, 2002.

What it does: Streaming frequency estimation when exact counts would exceed memory. Maintains a 2D array of counters with multiple hash functions.

Conservative update improvement: Instead of incrementing all d counters, only increment counters currently equal to the minimum. Provably reduces over-estimation error without changing the data structure.

Parameters:

  • Width w = ceil(e / epsilon) where epsilon = desired error rate (default 0.001)
  • Depth d = ceil(ln(1 / delta)) where delta = failure probability (default 0.01)
  • Default: w = 2,718, d = 5

Guarantees: Estimated count satisfies: true count <= estimate <= true count + epsilon * N with probability >= 1 - delta, where N = total count.

What it replaced:

ContenderWhy Rejected
Exact hash mapsUnbounded memory on high-cardinality paths.
Bloom filtersCannot count — only membership testing.
Count SketchReturns negative estimates. CMS guarantees non-negative.

Complexity: O(d) per update. O(w * d) memory. Both constants independent of data size.


Space-Saving Algorithm

Provenance: Metwally, Agrawal, El Abbadi, 2005.

What it does: Identifies the top-k most frequent elements in a stream using exactly k counters.

Mechanism: When a new element arrives that is not tracked, evict the element with the smallest count, replace it, and increment. Despite its simplicity, every element whose true frequency exceeds N/k is guaranteed to be in the summary.

What it replaced:

ContenderWhy Rejected
Frequent algorithm (Misra-Gries)Weaker error bounds for the same space.
Lossy CountingHigher space complexity. More complex implementation.
Full sortingO(n log n) and requires all data in memory.

Complexity: O(1) amortized per update with a min-heap. O(k) memory.


DDSketch

Provenance: Masson, Rim, Lee, 2019. Developed at Datadog. Deployed in production across billions of data points per second.

What it does: Streaming quantile estimation with relative error guarantees. For any quantile q, the returned value satisfies |estimate - true| <= alpha * |true| where alpha is the relative accuracy parameter.

Why DDSketch over alternatives:

ContenderWhy Rejected
t-digest (Dunning, 2019)No formal error guarantees. Accuracy is empirically good but theoretically unbounded.
Fixed-width histogramsAbsolute error is meaningless when values span orders of magnitude (cents to millions).
Random samplingNo guarantees on tail quantiles — precisely where anomalies live.
GK sketch (Greenwald-Khanna)Provides absolute error, not relative. DDSketch adapts to data scale.

Critical property: mergeability. DDSketch instances can be merged exactly, preserving accuracy guarantees. This enables parallel batch processing: analyze partitions independently, merge sketches for global statistics with zero accuracy loss.

Parameters: alpha = 0.01 (1% relative error) by default. Memory: O(log(max/min) / log(1 + alpha)) buckets — typically a few hundred for financial data spanning cents to millions.

Complexity: O(1) per insertion. O(1) per quantile query.


Median Absolute Deviation (MAD)

Provenance: Hampel, 1974. Standard robust statistics.

What it does: Robust measure of dispersion.

MAD = median(|x_i - median(X)|)

Modified z-score:

z_MAD = 0.6745 * (x_i - median(X)) / MAD

Why MAD over standard deviation: Standard deviation has a 0% breakdown point — a single extreme value inflates sigma enough to mask every other anomaly. MAD has a 50% breakdown point — half the data can be arbitrarily corrupted before MAD gives a misleading result. This is the strongest possible breakdown point for any location/scale estimator.

Anomaly threshold: |z_MAD| > 3.5 flags an anomaly candidate (Iglewicz & Hoaglin, 1993). Configurable per profile.

Streaming computation: Exact MAD requires sorted data. Running approximate median via DDSketch enables streaming MAD estimation with bounded relative error.

Complexity: O(n) with sorting, or O(n) streaming via DDSketch.


MinHash

Provenance: Broder, 1997. The foundation of modern near-duplicate detection and similarity search.

What it does: Estimates Jaccard similarity between sets in constant time per comparison.

Vajra computes MinHash signatures over wildcard path sets using k independent hash functions (k = 128 by default). For memory efficiency, the b-bit MinHash variant (Li & Konig, 2011) stores only the lowest b bits of each hash value.

What it replaced:

ContenderWhy Rejected
Exact pairwise JaccardO(n^2) for batch comparison. Fine for < 1,000 docs, breaks above that.
Random projection (SimHash)Better for cosine similarity. MinHash is optimal for Jaccard.

Complexity: O(n) for signature computation. O(k) for pairwise comparison. O(k) memory per document.


SimHash

Provenance: Charikar, 2002.

What it does: Fixed-width fingerprints where Hamming distance approximates cosine distance. Used for near-motif detection — subtrees that are semantically the same but differ in one or two fields.

SimHash operates over (key, value_type) feature pairs within each subtree. Subtrees whose SimHash values have Hamming distance <= t (default t = 3 out of 64 bits) are grouped as near-motifs.

Complexity: O(n) time. O(1) per fingerprint comparison.


Locality-Sensitive Hashing (LSH)

Provenance: Indyk & Motwani, 1998. Banded variant as described by Leskovec, Rajaraman & Ullman.

What it does: Partitions MinHash signatures into b bands of r rows, hashing each band into buckets. Documents sharing a bucket in any band are candidate pairs.

The S-curve probability: P(candidate) = 1 - (1 - s^r)^b

With k = 128, b = 16, r = 8: documents with Jaccard similarity > 0.5 have > 98% chance of being found. Documents with similarity < 0.2 have < 2% false positive rate.

What it replaced:

ContenderWhy Rejected
Hierarchical agglomerative clusteringO(n^2 log n) time, O(n^2) memory. Breaks on > 10K documents.
k-means / k-medoidsRequires specifying k in advance. Vajra cannot know the cluster count.

Complexity: O(n) amortized for LSH indexing.


Jensen-Shannon Divergence (JSD)

Provenance: Lin, 1991. Square root metric property: Endres & Schindelin, 2003; Osterreicher & Vajda, 2003.

What it does: Measures distributional drift between two value distributions.

JSD(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M)

where M = 0.5 * (P + Q).

Why JSD over alternatives:

ContenderWhy Rejected
KL divergenceAsymmetric. Infinite when Q(x) = 0 and P(x) > 0.
Chi-squared testSensitive to bin size. Not a proper metric.
Kolmogorov-SmirnovMeasures maximum deviation only, not overall distribution shape.
Total variation distanceDoes not account for similarity between nearby values.

Why JSD wins: Symmetric. Always finite. Bounded to [0, 1] with log base 2. Square root is a proper metric satisfying the triangle inequality. Drift magnitudes can be meaningfully compared.

Complexity: O(v) where v = union of value supports.


1D Wasserstein Distance (Earth Mover’s Distance)

Provenance: Kantorovich, 1942. Computational formulation: O(n log n) via CDF sorting.

What it does: For numeric paths, measures “how far did values move” — not just that the distribution changed, but the magnitude of the shift.

Why included alongside JSD: JSD measures probability mass redistribution. Wasserstein captures the distance of the shift. A distribution that shifts entirely from $100 to $100.01 has low Wasserstein but potentially high JSD. A distribution that shifts from $100 to $10,000 has high Wasserstein. Both measures together give a complete picture.

Complexity: O(n log n) via sorting.


Pointwise Mutual Information (PMI)

Provenance: Church & Hanks, 1989 (in NLP). Rooted in Shannon, 1948.

What it does: Measures association strength between field pairs.

PMI(x, y) = log2(P(x, y) / (P(x) * P(y)))

Positive = co-occur more than chance. Negative = avoid each other. Zero = independent.

Complexity: O(1) per pair given precomputed frequencies.


Conditional Entropy

Provenance: Shannon, 1948.

What it does: Measures how much knowing field X reduces uncertainty about field Y.

H(Y|X) = -sum p(x,y) * log2(p(y|x))

Low H(Y|X) means X predicts Y. H(Y|X) near 0 means functional dependency.

Complexity: O(n) to compute from co-occurrence counts.


Benford’s Law Analysis

Provenance: Newcomb, 1881. Benford, 1938. Formalized by Hill, 1995. Forensic application: Nigrini, 1996.

What it does: Tests whether leading digit distribution matches the expected logarithmic distribution:

P(d) = log10(1 + 1/d)

Departure measured via chi-squared goodness-of-fit. Effective for financial amounts, counts, and quantities. Not applicable to identifiers, codes, or constrained-range values — Vajra applies this only to paths classified as numeric with high cardinality and range spanning at least one order of magnitude.

Complexity: O(n) — single pass to count leading digits.


The Rejection List

These algorithms were evaluated and rejected. Each had a specific reason.

AlgorithmReason for Rejection
Isolation Forest (Liu et al., 2008)Non-deterministic without careful seeding. O(n log n) per tree. Contamination parameter requires tuning. MAD + rarity + structural distance cover the same space with stronger interpretability.
Local Outlier Factor (Breunig et al., 2000)O(n^2) naive, O(n log n) with spatial indexing. Sensitive to k parameter. Breaks universality on large datasets.
t-digest (Dunning, 2019)No formal error guarantees. Accuracy is empirically good but theoretically unbounded. DDSketch provides provable bounds.
Hierarchical agglomerative clusteringO(n^2 log n) time, O(n^2) memory. Replaced by LSH-based component detection at O(n).
k-means / k-medoidsRequires specifying k in advance. Also O(n^2) per iteration for k-medoids.
Any method requiring training dataVajra operates on cold data with no prior. Every method must work on a single document or batch with no history.
Any method requiring labeled examplesSame constraint. Unsupervised only.
Any method without deterministic outputThe determinism guarantee is non-negotiable.