VAJRA
Deterministic Semantic Reduction Engine
What Vajra Does
Feed it any structured data. Get back shape, signal, anomalies, and truth.
Vajra analyzes JSON, YAML, CSV, NDJSON, Markdown, and PDF. It extracts structural fingerprints, computes entropy and statistical profiles, detects anomalies and schema drift, discovers cross-field relationships, and renders deterministic essences tuned for humans, auditors, or AI pipelines.
Inspect
vajra inspect claim.json
Full structural analysis — paths, types, fingerprints, domain recognition.
Essence
vajra essence data.json --profile staff
Concern-oriented reduction. 7 profiles. Token budgets. Compact-AI output for LLMs.
Drift
vajra drift v1.json v2.json
Schema drift detection with JSD, Wasserstein distance, severity classification.
Anomalies
vajra anomalies batch.ndjson
MAD-based outliers, rarity scoring, type instability. Deterministic. Explainable.
Query
vajra query data.json 'entropy($.status) > 0.5'
Path expressions with analysis functions. Entropy, rarity, null rate, instability.
Cluster
vajra cluster batch/*.json
MinHash + LSH similarity clustering. Finds payload families in seconds.
Forged for the Agent Gods
Vajra was not designed for casual use. It was forged as a weapon — an instrument of precision for AI systems that need to understand structured data at scale.
The compact-ai output compresses a 1000-node JSON document into a token-efficient essence that preserves every anomaly, every structural motif, every statistical signal — in a format an LLM can parse in a single pass.
The chain-ready drill section tells the downstream model exactly which paths have deeper analysis available, enabling multi-turn investigation without re-processing.
The determinism guarantee means the same input always produces the same output. No drift. No randomness. No surprises. An AI pipeline that depends on Vajra can depend on Vajra.
vajra essence massive.json --profile ai --format compact-ai --budget 500
{
"v": "vajra/1",
"doc": {"nodes": 847, "paths": 23, "depth": 6},
"anomalies": [
{"p": "$.claims[*].allowed", "t": "type_instability", "s": 0.4},
{"p": "$.claims[*].charge", "t": "numeric_outlier", "v": 350, "z": 4.2}
],
"drill": [
{"path": "$.claims[*].service_lines", "available": ["stats", "anomalies", "motifs"]}
],
"meta": {"profile": "ai", "truncated": false}
}
The Engine
BLAKE3 Fingerprinting
Merkle subtree hashing. Path set signatures. Motif detection falls out for free. O(n).
Shannon Entropy
Distinguishes boilerplate from signal without domain knowledge. The strongest universal primitive.
MAD Outliers
50% breakdown point. Just under half the data can be corrupted before MAD gives a misleading result.
Jensen-Shannon Divergence
Symmetric. Bounded. A proper metric via sqrt. The right way to measure distribution drift.
DDSketch
Relative-error quantile estimation. Mergeable. O(1) per insert. Streams terabytes in megabytes of RAM.
MinHash + LSH
Sublinear similarity search. Cluster 10K documents in seconds. No O(n^2) anywhere.
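To make the MinHash + LSH idea concrete, here is a minimal Python sketch. It assumes each document is represented by its set of wildcard paths; the function names, the 64-hash signature size, and the 16-band split are illustrative choices, not Vajra's actual parameters. BLAKE2b from the standard library stands in for whatever seeded hash the engine uses.

```python
import hashlib

def _seeded_hash(seed: int, item: str) -> int:
    # hashlib instead of the built-in hash() so results are identical
    # across runs and machines.
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")

def minhash(items, num_hashes=64):
    """Signature = minimum seeded hash per seed. Assumes items is non-empty."""
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def lsh_candidates(signatures, bands=16):
    """Group documents whose signatures agree on at least one whole band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for name, sig in sorted(signatures.items()):
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(name)
    # Only buckets holding more than one document produce candidate pairs.
    return {k: v for k, v in buckets.items() if len(v) > 1}
```

Only documents that share a bucket are ever compared directly, which is how the quadratic all-pairs comparison is avoided.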
Install
cargo install vajra-cli
Or from source:
git clone https://github.com/copyleftdev/vajra
cd vajra
cargo build --release
First useful output in under 30 seconds:
echo '{"hello": "world"}' | vajra inspect -
Quickstart
You have 60 seconds. Let us not waste them.
Install
From crates.io:
cargo install vajra-cli
From source:
git clone https://github.com/copyleftdev/vajra
cd vajra
cargo build --release
# Binary lands at ./target/release/vajra
Verify:
vajra --help
Four Commands That Prove the Point
1. Inspect a JSON document
Feed Vajra a medical claim. Get back its skeleton — every path, every type, every fingerprint.
vajra inspect claim.json
=== Document Metadata ===
Total nodes: 847
Max depth: 6
Distinct paths: 23
Raw size: 14208 bytes
=== Wildcard Paths ===
PATH TYPE COUNT INSTABILITY NULL_RATE
$ object 1 0.0000 0.0000
$.claims array 1 0.0000 0.0000
$.claims[*] object 1 0.0000 0.0000
$.claims[*].patient.id string 1 0.0000 0.0000
$.claims[*].diagnosis[*].code string 2 0.0000 0.0000
$.claims[*].service_lines[*].procedure_code string 14 0.0000 0.0000
$.claims[*].service_lines[*].charge_amount number 14 0.0000 0.0000
$.claims[*].service_lines[*].allowed_amount number 11 0.0000 0.2143
$.claims[*].service_lines[*].status string 14 0.0000 0.0000
=== Fingerprints ===
Path set: a1b2c3d4e5f6...
Typed path: f7e8d9c0b1a2...
Shape: 1234abcd5678...
=== Domain Type Recognition ===
$.claims[*].diagnosis[*].code E11.9 ICD-10-CM
$.claims[*].service_lines[*].procedure_code 99213 CPT
Every path. Every type. Every structural fingerprint. Domain-specific codes recognized automatically. Zero configuration.
2. Generate an essence
Compress the entire document into what matters, shaped for a specific audience.
vajra essence claim.json --profile staff
=== Essence (staff profile) ===
Document Summary:
1 claim with 14 service lines, 1 patient, 2 diagnosis codes.
Primary status: partially adjudicated.
What Stands Out:
- 3 service lines are missing allowed amounts (lines 2, 7, 11).
This field is present in 79% of service lines — its absence is notable.
- Adjustment reason code "CO-45" repeats across 8 of 14 lines.
Repetition at this frequency suggests a systematic pattern, not random variation.
- 1 diagnosis structure differs from the other.
The second diagnosis carries an extra "qualifier" field.
What This Likely Means:
- Most of the claim is consistent and well-formed.
- A subset of service lines appears incomplete or differently processed.
- The repeated adjustment code points to a systematic issue.
Same command, different audience:
vajra essence claim.json --profile ai --format json --budget 500
{
"vajra_essence": {
"version": "0.1.0",
"profile": "ai",
"structure": {
"root_type": "object",
"total_nodes": 847,
"distinct_paths": 23,
"max_depth": 6
},
"dominant_motif": {
"path": "$.claims[0].service_lines[*]",
"count": 14,
"shape_hash": "f2c1..."
},
"anomalies": [
{"path": "$.claims[0].service_lines[2,7,11].allowed_amount", "type": "missing", "severity": 4.2},
{"path": "$.claims[0].diagnosis[1]", "type": "structural_deviation", "severity": 3.1}
]
}
}
3. Detect drift between versions
Compare yesterday’s API response to today’s. Find what changed and how much it changed.
vajra drift baseline.json current.json
Drift Report: baseline.json -> current.json
Structural similarity: 0.94 (Jaccard)
Added paths (2):
$.response.metadata.processing_flags [array of strings]
$.response.metadata.api_version [string]
Removed paths (0): none
Type changes (1):
$.response.items[*].quantity string -> number
Distribution shifts (1):
$.response.items[*].status JSD: 0.34
before: {"active": 0.82, "pending": 0.15, "error": 0.03}
after: {"active": 0.61, "pending": 0.12, "error": 0.27}
note: "error" rate increased 9x
Overall severity: MEDIUM
Two paths added. One type migrated. The error rate in status jumped ninefold. Vajra found all of it in one pass.
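For readers who want to see what a distribution-shift score measures, the following Python sketch computes Jensen-Shannon divergence between two categorical distributions. It is an illustration under stated assumptions: base-2 logarithms are used, the distributions are hypothetical, and the text here does not say which log base Vajra uses or whether it reports the divergence or its square root, so numbers from this sketch need not match the report above.

```python
import math

def kl_bits(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two {value: probability} dicts."""
    keys = sorted(set(p) | set(q))          # sorted for a stable ordering
    pv = [p.get(k, 0.0) for k in keys]
    qv = [q.get(k, 0.0) for k in keys]
    m = [(a + b) / 2 for a, b in zip(pv, qv)]  # the mixture distribution
    return 0.5 * kl_bits(pv, m) + 0.5 * kl_bits(qv, m)

def js_distance(p, q):
    """Square root of JSD, which satisfies the triangle inequality."""
    return math.sqrt(jsd(p, q))

# Hypothetical before/after distributions for a status field.
before = {"ok": 0.9, "error": 0.1}
after = {"ok": 0.6, "error": 0.3, "timeout": 0.1}
print(round(jsd(before, after), 4), round(js_distance(before, after), 4))
```

With base-2 logs the divergence is bounded between 0 (identical distributions) and 1 (no shared support), which is what makes a fixed severity scale possible.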
4. Surface anomalies
Find what deviates from the population — without defining what “normal” looks like.
vajra anomalies claims_batch.ndjson
=== Anomaly Report ===
Records analyzed: 1,247
Anomalies found: 8
Numeric outliers:
$.claims[*].service_lines[*].charge_amount
Record 834: value 47,250.00 (z_MAD = 6.3, median = 285.00, MAD = 195.00)
Record 1102: value 0.01 (z_MAD = -4.8)
Rarity outliers:
$.claims[*].status
Record 419: value "voided" (self-information = 10.3 bits, seen 1/1247)
Structural deviations:
Record 662: missing 4 paths present in 99%+ of records
- $.claims[*].subscriber.group_number
- $.claims[*].subscriber.member_id
- $.claims[*].provider.npi
- $.claims[*].provider.taxonomy
Type instability:
$.claims[*].service_lines[*].quantity
Records 88, 204, 917: string where number expected (instability = 0.002)
Eight anomalies across four dimensions. Every one carries its score, its evidence, and the statistical context that makes it interpretable.
What Just Happened
You did not configure a schema. You did not define rules. You did not train a model.
Vajra read the raw structure, computed its statistical profile, and surfaced what deviates from the population — deterministically, explainably, in seconds.
That is the point.
Next Steps
- Philosophy — why Vajra exists and what it refuses to be
- Commands — all 11 commands at a glance
- Profiles — tune the lens for your audience
- Algorithms — the mathematics behind every score
Philosophy
Vajra exists because JSON is lying to you — not about its content, but about its complexity.
A 14,000-line medical claim is not 14,000 lines of information. It is a handful of structural motifs repeated dozens of times, wrapped in representational noise, carrying a few critical signals buried at unpredictable depths. The humans who depend on this data cannot see the signal. The AI systems consuming it waste tokens on the noise. The auditors verifying it have no tools that operate at the right level of abstraction.
Vajra was forged to solve this. Not by transforming the data. Not by summarizing it probabilistically. By analyzing it — deterministically, mathematically, and completely — and rendering the result as a compressed, faithful essence tuned to the concern of whoever is reading it.
The Three Views of JSON
This is the foundational insight. Every JSON document is three things simultaneously.
A Tree
The literal parse tree. Parent-child relationships, nesting depth, sibling structure, array indices. This is what JSON.parse() gives you. It is necessary but not sufficient.
The tree tells you what is here. It does not tell you what matters.
A Graph
Repeated structures create implicit references. Co-occurring keys form relationships. A diagnosis[*].code that appears alongside a diagnosis[*].system and a diagnosis[*].display is not three independent strings — it is a coded concept. A subscriber.id that functionally determines a subscriber.name is a dependency edge, invisible in the tree but real in the data.
The graph tells you how things relate. It reveals structure that the tree hides.
A Distribution
Every key name, every value, every type, every path, every null, every length — all form measurable statistical distributions. Shannon entropy distinguishes boilerplate from signal. Frequency reveals what is common and what is rare. MAD scores expose outliers that standard deviation would mask. The distribution of leading digits (Benford’s Law) separates naturally occurring financial data from fabricated numbers.
The distribution tells you what is normal and what deviates. It does this without rules, without schemas, without training data.
Raw JSON exposes only the tree. Vajra reads all three simultaneously.
The Six Design Principles
These are not aspirations. They are constraints. Every design decision in Vajra was tested against all six. Anything that violated even one was cut.
1. Universal
Any JSON. Any size. Any schema. Any nesting depth. No required schema definition, no required domain knowledge, no assumption about structure. If it parses as JSON, Vajra handles it.
This means: the core engine cannot contain a single line of code that assumes the data is a medical claim, or a financial transaction, or an API response. Domain intelligence enters only through plugins and profiles — never through the engine.
2. Deterministic
Same input + same config + same version = same output. Always. Fingerprints, scores, orderings, essence text, anomaly rankings — all reproducible to the byte.
This is not a nice-to-have. It is the foundation that makes Vajra trustworthy in pipelines, audits, and CI. An AI system that depends on Vajra can depend on Vajra. A compliance team that runs it twice gets the same answer twice.
The cost of this constraint is real: HashMap is banned from all externally visible orderings (replaced by BTreeMap). Floating-point formatting uses ryu for platform independence. Every randomized algorithm is seeded. These costs are paid gladly.
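The same discipline can be sketched outside Rust. The Python analogue below is only an illustration (the engine itself is Rust, and none of these function names come from Vajra): keys are serialized in sorted order so that insertion order never leaks into the output, and any sampling goes through an explicitly seeded generator.

```python
import hashlib
import json
import random

def canonical_bytes(obj) -> bytes:
    """Serialize with sorted keys and fixed separators: one byte form per value."""
    return json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")

def stable_digest(obj) -> str:
    """Digest of the canonical form; SHA-256 is a stand-in hash here."""
    return hashlib.sha256(canonical_bytes(obj)).hexdigest()

def seeded_sample(items, k, seed=0):
    """Sample reproducibly: sort first, then draw from a seeded generator."""
    rng = random.Random(seed)
    return rng.sample(sorted(items), k)

# Two dicts built in different key order produce the same digest.
print(stable_digest({"b": 1, "a": 2}) == stable_digest({"a": 2, "b": 1}))
```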
3. Honest
Every inference is labeled as inference. Every score is decomposable. Every anomaly is explainable. Vajra never silently asserts a heuristic conclusion as truth.
When Vajra infers that a string is a date, it tells you the confidence level: definite (100% of values matched the DFA, the deterministic finite automaton for that type), dominant (>80%), heuristic (entropy-based), or unclassified (no inference applied). When it flags an anomaly, it shows the z-score, the median, the MAD, and the path. When it ranks an observation in an essence, --explain decomposes the score into its six contributing dimensions.
Magic is the enemy of trust. Vajra does not do magic.
4. Fast
Operational speed. Not batch-overnight speed. Seconds on typical payloads, minutes on gigabyte-scale files. Fast enough to use interactively in a terminal. Fast enough to gate a CI pipeline. Fast enough that reaching for Vajra is faster than opening the file.
The engine achieves this through simd-json for 2+ GB/s parsing throughput, O(n) single-pass analysis wherever possible, arena allocation for ephemeral analysis memory, and Rayon-based parallelism for batch operations.
5. Composable
The CLI, the Rust library, and the plugin system are each independently useful. Analyzers compose. Outputs chain. Profiles combine with formats and budgets.
vajra stats feeds vajra essence. vajra fingerprint feeds vajra drift. vajra anomalies can read from stdin in a pipeline. The library API exposes the same analyzers as the CLI, composable in Rust code without the CLI overhead.
6. Minimal Assumption
The core engine assumes nothing about the domain, the schema, or the purpose of the data. It analyzes structure, statistics, and deviation from population norms. It does not know what a “claim” is. It does not know what “E11.9” means. It does not know that allowed_amount should never be null.
Domain intelligence is real and valuable — but it enters through plugins (vajra-domain-med) and concern profiles (--profile auditor), never through hardcoded logic in the analysis pipeline.
This separation is what makes Vajra universal. The same engine that analyzes medical claims also analyzes IoT sensor payloads, financial transactions, API responses, and configuration files — because it never assumed it was analyzing any of them.
What Vajra Is NOT
Precision requires boundaries. Vajra is not:
- A replacement for jq. jq transforms JSON. Vajra analyzes and reduces it. They are complementary, not competitive. Use jq to reshape; use Vajra to understand.
- A probabilistic summarizer. Every reduction Vajra performs is deterministic and explainable. There is no language model in the pipeline. There is no sampling. There is no “approximately.”
- A database or data store. Vajra is ephemeral. It reads, analyzes, and emits. It does not persist data, cache results, or maintain state between runs.
- A schema registry. Vajra infers schema characteristics; it does not define or enforce them. It tells you what shape the data has, not what shape it should have.
- A GUI or BI platform. Vajra is a CLI and a library. It renders text, JSON, Markdown, and compact-AI output. Visualization is left to tools that specialize in it.
- A data transformation tool. Vajra never rewrites source data. It reads. It analyzes. It emits results. The input is sacred.
- A validator or linter. Vajra does not check against rules you define. It discovers what the data is and what deviates from what the data normally is. The difference is fundamental.
The Category Vajra Creates
There is no existing category that accurately describes Vajra. The closest neighbors are:
Structured-data observability. Like application observability (metrics, traces, logs) but for the data itself. What is the shape of this payload? What changed since yesterday? What is anomalous in this batch?
Semantic reduction. Not summarization (which loses information probabilistically) but reduction (which compresses information deterministically, preserving all signal above a configurable threshold).
Operational cognition tooling. Tools that make the shape of complex data legible to the humans and AI systems that depend on it.
Vajra sits at the intersection of these three. It is the first tool built specifically to occupy this space.
The Mantra
Break noise. Preserve truth.
Every decision in Vajra flows from these four words. Noise is representational redundancy, structural boilerplate, repeated motifs, and cognitive overhead. Truth is anomalies, deviations, relationships, and operational signal.
The essence is what remains when the noise is broken and the truth is preserved.
Commands
Vajra ships 11 commands. Each does one thing. They compose.
Reference Table
| Command | Purpose | Input | Key Output |
|---|---|---|---|
| inspect | Full structural analysis | Single document | Paths, types, fingerprints, domain hints |
| stats | Statistical summary | Single document | Entropy, frequency, distributions, null rates |
| anomalies | Anomaly detection | Single or batch | Outliers, rarity, structural deviations |
| fingerprint | Structural fingerprints | Single document | BLAKE3 hashes, MinHash signature |
| essence | Concern-oriented reduction | Single document | Compressed, ranked, profile-shaped output |
| drift | Schema drift detection | Two documents | Added/removed paths, type changes, JSD |
| cluster | Similarity clustering | Multiple documents | Cluster assignments, centroids, outliers |
| invariants | Cross-field relationships | Single or batch | Conditional entropy, PMI, dependencies |
| query | Path-based query with analysis functions | Single document | Filtered analysis results |
| batch | Parallel batch analysis | Directory | Aggregated stats, per-file summaries |
| profiles | List available profiles | None | Built-in and custom profile descriptions |
Global Flags
Every command accepts these flags:
--format <text|json|markdown|compact-ai> Output format (default: text)
--profile <name> Concern profile (default: engineer)
--config <path> Path to TOML config with custom profiles
--budget <N> Token budget for essence output
--streaming Force streaming mode (bounded memory)
--input-format <format> Override input format auto-detection
--redact Apply built-in redaction patterns
--quiet Suppress progress output
--explain Include score decomposition in output
Quick Examples
Inspect
vajra inspect claim.json
vajra inspect claim.json --format json
cat payload.json | vajra inspect -
Stats
vajra stats claim.json
vajra stats claim.json --format json
Anomalies
vajra anomalies claim.json
vajra anomalies claims_batch.ndjson --format json
Fingerprint
vajra fingerprint claim.json
vajra fingerprint claim.json --format json
Essence
vajra essence claim.json --profile staff
vajra essence claim.json --profile ai --format compact-ai --budget 500
vajra essence claim.json --profile auditor --format markdown
Drift
vajra drift v1.json v2.json
vajra drift baseline.json candidate.json --format json
Cluster
vajra cluster batch/*.json
vajra cluster file1.json file2.json file3.json --format json
Invariants
vajra invariants claims_batch.ndjson
vajra invariants claims_batch.ndjson --top-k 100
Query
vajra query claim.json 'entropy($.claims[*].status) > 0.5'
vajra query claim.json '$.claims[*].service_lines[*].charge_amount'
Batch
vajra batch ./claims_directory/
vajra batch ./claims_directory/ --format json --profile auditor
Profiles
vajra profiles
vajra profiles --config custom.toml
Input Conventions
All commands that accept <input> understand:
- File path: claim.json, ./data/payload.yaml
- Stdin: - (pipe data in)
- Directory: ./batch/ (processes all supported files)
- Compressed: .json.gz, .json.zst (auto-decompressed)
- HTTP URL: https://api.example.com/data.json (fetched, then analyzed)
Format is auto-detected from extension and content. Override with --input-format.
See Input Formats for the full list.
Output Conventions
All commands emit to stdout. All commands support --format json for machine-readable output. Diagnostics and errors go to stderr.
The --explain flag adds score decomposition to essence and anomaly output — showing exactly which dimensions contributed to each observation’s ranking.
The --redact flag applies built-in pattern redaction (SSN, email, phone, credit card) before any output is rendered. The essence never sees unredacted values.
inspect
The foundational command. inspect performs full structural analysis of a JSON document and reports every path, every type, every fingerprint, and every domain-recognized value it finds.
This is the command you reach for first. Before you know what you are looking for, inspect tells you what is there.
Usage
vajra inspect <input> [flags]
Arguments:
| Argument | Description |
|---|---|
| <input> | Path to a JSON file, - for stdin, or an HTTP URL |
Flags:
| Flag | Description | Default |
|---|---|---|
| --format <fmt> | Output format: text, json, markdown, compact-ai | text |
| --input-format <fmt> | Override auto-detected input format | auto |
| --streaming | Force streaming mode (bounded memory) | off |
| --redact | Apply built-in redaction before output | off |
| --quiet | Suppress progress output | off |
What It Reports
Document Metadata
Total node count, maximum nesting depth, number of distinct wildcard paths, raw byte size.
Wildcard Path Table
Every distinct path in the document, normalized with [*] for array indices. For each path:
- Dominant type — the most common JSON type at that path
- Count — how many times that path appears across the document
- Type instability — fraction of observations where the type differs from the dominant type (0.0 = perfectly stable)
- Null rate — fraction of observations that are null
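The path table can be approximated in a few lines of Python. This is a sketch of the idea, not Vajra's implementation, and it fixes two details the description leaves open: here the count includes null observations, and nulls are excluded when computing the dominant type and the instability score. Ties between types are broken alphabetically to keep the result deterministic.

```python
from collections import Counter, defaultdict

def json_type(value) -> str:
    if value is None:
        return "null"
    if isinstance(value, bool):      # check bool before int: bool is an int subclass
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "object"

def walk(node, path="$"):
    """Yield (wildcard_path, type) for every node; array indices become [*]."""
    yield path, json_type(node)
    if isinstance(node, dict):
        for key in sorted(node):
            yield from walk(node[key], f"{path}.{key}")
    elif isinstance(node, list):
        for item in node:
            yield from walk(item, f"{path}[*]")

def path_table(doc):
    seen = defaultdict(Counter)
    for path, kind in walk(doc):
        seen[path][kind] += 1
    table = {}
    for path in sorted(seen):
        counts = seen[path]
        total = sum(counts.values())
        # Assumption: nulls do not count as a competing type.
        pool = {t: n for t, n in counts.items() if t != "null"} or dict(counts)
        dominant = min(pool, key=lambda t: (-pool[t], t))
        typed = sum(pool.values())
        table[path] = {
            "type": dominant,
            "count": total,
            "instability": (typed - pool[dominant]) / typed,
            "null_rate": counts["null"] / total,
        }
    return table
```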
Structural Fingerprints
Three BLAKE3-based fingerprints:
- Path set fingerprint — hash of the sorted set of distinct wildcard paths. Captures what fields exist.
- Typed path fingerprint — hash of sorted (path, dominant_type) pairs. Captures what fields exist and what types they carry.
- Shape fingerprint — Merkle subtree hash computed bottom-up. Captures the full structural shape including nesting.
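A rough Python sketch of two of these fingerprints follows. Several things are assumptions made for illustration: BLAKE2b from the standard library stands in for BLAKE3 (which is not in the Python standard library), leaf values are reduced to their type so only shape is hashed, and array elements are hashed in order. The real engine may make different choices on each point.

```python
import hashlib

def _hash(*parts: bytes) -> bytes:
    h = hashlib.blake2b(digest_size=32)
    for part in parts:
        # Length-prefix every part so concatenations cannot collide.
        h.update(len(part).to_bytes(4, "big"))
        h.update(part)
    return h.digest()

def shape_hash(node) -> bytes:
    """Merkle-style hash computed bottom-up: children first, then the parent."""
    if isinstance(node, dict):
        children = [_hash(k.encode("utf-8"), shape_hash(node[k])) for k in sorted(node)]
        return _hash(b"object", *children)
    if isinstance(node, list):
        return _hash(b"array", *[shape_hash(v) for v in node])
    if node is None:
        return _hash(b"null")
    if isinstance(node, bool):
        return _hash(b"boolean")
    if isinstance(node, (int, float)):
        return _hash(b"number")
    return _hash(b"string")

def path_set_fingerprint(paths) -> str:
    """Hash of the sorted set of distinct wildcard paths."""
    return _hash(*[p.encode("utf-8") for p in sorted(set(paths))]).hex()
```

Because identical subtrees produce identical hashes, counting repeated subtree hashes is enough to find repeated structural motifs.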
Domain Type Recognition
Values matched against domain-specific type recognizers (e.g., the medical plugin recognizes ICD-10-CM codes, CPT codes, NPI numbers). Each match reports the path, the value, and the recognized type.
Example: Text Output
vajra inspect claim.json
=== Document Metadata ===
Total nodes: 847
Max depth: 6
Distinct paths: 23
Raw size: 14208 bytes
=== Wildcard Paths ===
PATH TYPE COUNT INSTABILITY NULL_RATE
$ object 1 0.0000 0.0000
$.claims array 1 0.0000 0.0000
$.claims[*] object 1 0.0000 0.0000
$.claims[*].claim_id string 1 0.0000 0.0000
$.claims[*].patient object 1 0.0000 0.0000
$.claims[*].patient.id string 1 0.0000 0.0000
$.claims[*].patient.name string 1 0.0000 0.0000
$.claims[*].diagnosis array 1 0.0000 0.0000
$.claims[*].diagnosis[*] object 2 0.0000 0.0000
$.claims[*].diagnosis[*].code string 2 0.0000 0.0000
$.claims[*].diagnosis[*].system string 2 0.0000 0.0000
$.claims[*].service_lines array 1 0.0000 0.0000
$.claims[*].service_lines[*] object 14 0.0000 0.0000
$.claims[*].service_lines[*].procedure_code string 14 0.0000 0.0000
$.claims[*].service_lines[*].charge_amount number 14 0.0000 0.0000
$.claims[*].service_lines[*].allowed_amount number 11 0.0000 0.2143
$.claims[*].service_lines[*].status string 14 0.0000 0.0000
$.claims[*].service_lines[*].service_date string 14 0.0000 0.0000
$.claims[*].service_lines[*].adjustment object 14 0.0000 0.0000
$.claims[*].service_lines[*].adjustment.reason string 14 0.0000 0.0000
$.claims[*].service_lines[*].adjustment.amount number 14 0.0000 0.0000
$.claims[*].provider.npi string 1 0.0000 0.0000
$.claims[*].subscriber.member_id string 1 0.0000 0.0000
=== Fingerprints ===
Path set: a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2
Typed path: f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8
Shape: 1234abcd56781234abcd56781234abcd56781234abcd56781234abcd56781234abcd
=== Domain Type Recognition ===
$.claims[*].diagnosis[*].code E11.9 ICD-10-CM
$.claims[*].diagnosis[*].code J44.1 ICD-10-CM
$.claims[*].service_lines[*].procedure_code 99213 CPT
$.claims[*].service_lines[*].procedure_code 99214 CPT
$.claims[*].provider.npi 1234567890 NPI
Example: JSON Output
vajra inspect claim.json --format json
{
"metadata": {
"total_nodes": 847,
"max_depth": 6,
"distinct_paths": 23,
"raw_size_bytes": 14208
},
"paths": [
{
"path": "$.claims[*].service_lines[*].charge_amount",
"dominant_type": "number",
"count": 14,
"type_instability": 0.0,
"null_rate": 0.0
},
{
"path": "$.claims[*].service_lines[*].allowed_amount",
"dominant_type": "number",
"count": 11,
"type_instability": 0.0,
"null_rate": 0.2143
}
],
"fingerprints": {
"path_set": "a1b2c3d4...",
"typed_path": "f7e8d9c0...",
"shape": "1234abcd..."
},
"domain_hints": [
{
"path": "$.claims[*].diagnosis[*].code",
"value": "E11.9",
"recognized_type": "ICD-10-CM"
}
]
}
When to Use It
- First contact with unfamiliar data. You just received a JSON payload and need to know its shape.
- Schema exploration. What paths exist? What types do they carry? How stable are those types?
- Domain validation. Does the medical plugin recognize the codes in this claim? Is the NPI present?
- Regression gating. Fingerprint the output of an API endpoint. If the fingerprint changes, the schema changed.
Pairs Well With
- stats: once you know the structure, stats tells you the statistical profile
- fingerprint: if you only need the fingerprints (faster, less output)
- drift: compare two inspect snapshots to find what changed
- essence: when you want the compressed version, not the full inventory
stats
stats computes the statistical profile of a JSON document. Entropy, frequency distributions, numeric summaries, null rates, cardinality — the quantitative foundation that every other analysis depends on.
Where inspect tells you what exists, stats tells you how it behaves.
Usage
vajra stats <input> [flags]
Arguments:
| Argument | Description |
|---|---|
| <input> | Path to a JSON file, - for stdin, or an HTTP URL |
Flags:
| Flag | Description | Default |
|---|---|---|
| --format <fmt> | Output format: text, json, markdown, compact-ai | text |
| --input-format <fmt> | Override auto-detected input format | auto |
| --streaming | Force streaming mode (sketch-based approximations) | off |
| --redact | Apply built-in redaction before output | off |
| --quiet | Suppress progress output | off |
| --window <period> | Temporal windowing: month, week, or day | off |
| --time-field <path> | JSONPath to timestamp field (e.g., '$.date'). Auto-detected if omitted. | auto |
Temporal Windowing
When --window is specified, stats partitions records by time period and computes per-window statistics. Cross-window trend lines are included in the output, showing how distributions shift over time.
The --time-field flag tells Vajra which field contains the timestamp. If omitted, Vajra auto-detects by scanning for fields with date/time patterns (ISO 8601, Unix timestamps, common date formats).
vajra stats commits.ndjson --window month --time-field '$.date'
=== Statistical Summary (windowed: month) ===
Document: commits.ndjson (1,247 records, 8 paths)
--- Window: 2026-01 (312 records) ---
$.files_changed
Mean: 4.2 Median: 3.0 p95: 12.0
--- Window: 2026-02 (298 records) ---
$.files_changed
Mean: 5.1 Median: 4.0 p95: 15.0
--- Window: 2026-03 (337 records) ---
$.files_changed
Mean: 6.8 Median: 5.0 p95: 19.0
--- Cross-Window Trends ---
$.files_changed mean: 4.2 -> 5.1 -> 6.8 (upward, +62% over 3 months)
$.type "fix" share: 0.18 -> 0.24 -> 0.31 (increasing)
Windowing works with any multi-record input: NDJSON, CSV, multi-document YAML, or directories.
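The partitioning step can be pictured with a short Python sketch. It is a simplification under stated assumptions: records are flat dicts, fields are addressed by plain key rather than JSONPath, timestamps are ISO 8601 strings without a timezone suffix, and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, median

def window_key(timestamp: str, window: str) -> str:
    """Map an ISO 8601 timestamp to the label of its month, week, or day."""
    dt = datetime.fromisoformat(timestamp)
    if window == "month":
        return dt.strftime("%Y-%m")
    if window == "week":
        year, week, _ = dt.isocalendar()
        return f"{year}-W{week:02d}"
    if window == "day":
        return dt.strftime("%Y-%m-%d")
    raise ValueError(f"unsupported window: {window}")

def windowed_stats(records, time_field, value_field, window="month"):
    """Group numeric values by window and summarize each group."""
    groups = defaultdict(list)
    for record in records:
        groups[window_key(record[time_field], window)].append(record[value_field])
    # Sorted keys give chronological order for these label formats.
    return {
        key: {"count": len(vals), "mean": mean(vals), "median": median(vals)}
        for key, vals in sorted(groups.items())
    }
```

A cross-window trend is then just a comparison of the same statistic across consecutive keys.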
What It Reports
For every wildcard path in the document:
Frequency and Cardinality
- Count — total observations at this path
- Cardinality — number of distinct values
- Top values — the most frequent values with their counts
Entropy
- Shannon entropy — H(X) in bits. Measures information content.
- Normalized entropy — H(X) / log2(|support|). Scales to [0, 1] regardless of cardinality.
The entropy pair is one of the most powerful signals in the system:
| Entropy | Normalized | Interpretation |
|---|---|---|
| 0 | 0 | Constant — single value, pure boilerplate |
| Low | Low | Enum-like — few distinct states |
| Low | High | Near-uniform over tiny support |
| High | Moderate | Meaningful variation — identifiers, dates, codes |
| High | High | Near-uniform over large support — free text, UUIDs |
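Both quantities in the table follow directly from the definitions above. A minimal Python sketch, assuming the values at a path are hashable and treating a single-value path as having normalized entropy 0:

```python
import math
from collections import Counter

def shannon_entropy(values) -> float:
    """H(X) in bits over the observed value frequencies."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def normalized_entropy(values) -> float:
    """H(X) / log2(|support|), which lies in [0, 1]."""
    support = len(set(values))
    if support <= 1:
        return 0.0  # log2(1) = 0, so a constant field is defined as 0 here
    return shannon_entropy(values) / math.log2(support)

print(shannon_entropy(["a", "b"]))                      # two equally likely values
print(normalized_entropy(["x"] * 9 + ["y"]))            # skewed two-value field
```

The first call illustrates the "Low / High" row (1 bit of entropy, fully uniform over a support of two), while the second shows a skewed field over the same small support.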
Missingness
- Null rate: fraction of observations that are JSON null
- Absent rate: fraction of parent records where this path does not appear
- Empty rate: fraction of values that are empty strings, empty arrays, or empty objects
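The three rates use different denominators, which is easy to miss. The Python sketch below makes that explicit for a single field on a list of flat records; the function name and the choice to divide null and empty counts by the number of present values are assumptions for illustration.

```python
def missingness(records, field):
    """Return null, absent, and empty rates for one field across records."""
    total = len(records)
    present = [r[field] for r in records if field in r]
    observed = len(present)
    nulls = sum(v is None for v in present)
    empties = sum(isinstance(v, (str, list, dict)) and len(v) == 0 for v in present)
    return {
        # Absent: the key is missing entirely, measured against parent records.
        "absent_rate": (total - observed) / total if total else 0.0,
        # Null and empty: measured against values that actually appear.
        "null_rate": nulls / observed if observed else 0.0,
        "empty_rate": empties / observed if observed else 0.0,
    }
```

A key that is missing and a key that is present with a null value are different situations, and the sketch keeps them apart in the same way the three definitions do.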
Numeric Distributions (for numeric paths)
- Min, max, mean, median
- Percentiles — p01, p05, p25, p50, p75, p95, p99
- MAD — Median Absolute Deviation (robust dispersion)
- Skewness proxy — (mean - median) / MAD
Type Distribution
- Breakdown of JSON types observed at each path (e.g., 98% number, 2% string)
- Type instability score — fraction of observations deviating from the dominant type
Example: Text Output
vajra stats claim.json
=== Statistical Summary ===
Document: claim.json (847 nodes, 23 paths)
--- $.claims[*].service_lines[*].charge_amount ---
Count: 14
Cardinality: 12
Entropy: 3.41 bits (normalized: 0.88)
Type: number (100%)
Min: 45.00 Max: 1250.00
Mean: 312.50 Median: 285.00
MAD: 195.00
p25: 125.00 p75: 425.00
p95: 890.00 p99: 1125.00
--- $.claims[*].service_lines[*].status ---
Count: 14
Cardinality: 3
Entropy: 1.22 bits (normalized: 0.77)
Type: string (100%)
Top values:
"adjudicated" 10 (71.4%)
"pending" 3 (21.4%)
"denied" 1 (7.1%)
--- $.claims[*].service_lines[*].allowed_amount ---
Count: 11
Cardinality: 9
Entropy: 3.12 bits (normalized: 0.93)
Type: number (100%)
Null rate: 0.000
Absent rate: 0.214 ** notable: missing in 3 of 14 service lines **
Min: 32.00 Max: 875.00
Mean: 245.30 Median: 210.00
MAD: 142.00
--- $.claims[*].diagnosis[*].code ---
Count: 2
Cardinality: 2
Entropy: 1.00 bits (normalized: 1.00)
Type: string (100%)
Top values:
"E11.9" 1 (50.0%)
"J44.1" 1 (50.0%)
--- $.claims[*].service_lines[*].adjustment.reason ---
Count: 14
Cardinality: 4
Entropy: 1.56 bits (normalized: 0.78)
Type: string (100%)
Top values:
"CO-45" 8 (57.1%)
"CO-97" 3 (21.4%)
"PR-1" 2 (14.3%)
"OA-23" 1 (7.1%)
Example: JSON Output
vajra stats claim.json --format json
{
"document": "claim.json",
"total_nodes": 847,
"distinct_paths": 23,
"paths": {
"$.claims[*].service_lines[*].charge_amount": {
"count": 14,
"cardinality": 12,
"entropy": 3.41,
"normalized_entropy": 0.88,
"types": {"number": 14},
"null_rate": 0.0,
"absent_rate": 0.0,
"numeric": {
"min": 45.0,
"max": 1250.0,
"mean": 312.5,
"median": 285.0,
"mad": 195.0,
"percentiles": {
"p01": 45.0, "p05": 52.0, "p25": 125.0,
"p50": 285.0, "p75": 425.0, "p95": 890.0, "p99": 1125.0
}
},
"top_values": [
{"value": "285.00", "count": 2},
{"value": "125.00", "count": 2}
]
},
"$.claims[*].service_lines[*].status": {
"count": 14,
"cardinality": 3,
"entropy": 1.22,
"normalized_entropy": 0.77,
"types": {"string": 14},
"null_rate": 0.0,
"absent_rate": 0.0,
"top_values": [
{"value": "adjudicated", "count": 10},
{"value": "pending", "count": 3},
{"value": "denied", "count": 1}
]
}
}
}
When to Use It
- Understanding data distributions. What does the charge_amount field actually look like? What are the common status values? How much entropy does this field carry?
- Finding hidden nulls and absences. A field with a 21% absent rate across service lines is operationally significant, and stats surfaces this.
- Establishing baselines. Run stats on today’s batch. Run it again tomorrow. Compare the distributions manually or feed them to drift.
- Identifying enum-like fields. Low cardinality + low entropy = enum. High cardinality + high entropy = identifier. stats makes this distinction quantitative.
Pairs Well With
- inspect: structural overview before statistical deep dive
- anomalies: stats computes the distributions; anomalies flags what deviates from them
- essence: the essence builder uses stats internally to score observation importance
- invariants: cross-field analysis builds on per-field statistics
anomalies
anomalies surfaces records, fields, and structural elements that deviate meaningfully from the population. It does this across four dimensions — numeric outliers, rarity, structural deviation, and type instability — using only deterministic, interpretable methods.
No training data. No labeled examples. No rules to configure. Feed it cold data and it finds what deviates from what the data says is normal.
Usage
vajra anomalies <input> [flags]
Arguments:
| Argument | Description |
|---|---|
| `<input>` | Path to a JSON file, NDJSON batch, `-` for stdin, or directory |
Flags:
| Flag | Description | Default |
|---|---|---|
| `--format <fmt>` | Output format: text, json, markdown, compact-ai | text |
| `--input-format <fmt>` | Override auto-detected input format | auto |
| `--streaming` | Force streaming mode | off |
| `--redact` | Apply built-in redaction before output | off |
| `--explain` | Include score decomposition for each anomaly | off |
| `--quiet` | Suppress progress output | off |
The Four Dimensions
Dimension 1: Numeric Outliers
Method: MAD-based modified z-scores.
For every numeric path, Vajra computes the median and the Median Absolute Deviation (MAD). Values whose modified z-score exceeds the threshold in absolute value (default 3.5) are flagged, so both unusually high and unusually low values surface.
z_MAD = 0.6745 * (value - median) / MAD
MAD has a 50% breakdown point — half the data can be arbitrarily corrupted before it gives a misleading result. Standard deviation has a 0% breakdown point. This distinction matters when the data you are analyzing might contain the very outliers you are trying to detect.
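The scoring rule above can be sketched in a few lines of Python. This is an illustrative reading of the formula, not Vajra's implementation: the function name `mad_outliers`, the zero-MAD handling, and the sample values are all assumptions.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return (index, value, z) for values whose |modified z-score| exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # Assumption: with zero spread the score is undefined, so flag nothing.
        return []
    flagged = []
    for i, v in enumerate(values):
        z = 0.6745 * (v - med) / mad
        if abs(z) > threshold:
            flagged.append((i, v, round(z, 2)))
    return flagged

# Hypothetical charge amounts; the last one is a gross outlier.
charges = [120.0, 250.0, 285.0, 300.0, 410.0, 47250.0]
outliers = mad_outliers(charges)
print(outliers)
```

Because both the center and the spread are medians, the single extreme charge barely moves either statistic, which is why it stands out so clearly.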
Dimension 2: Rarity Outliers
Method: self-information scoring.
For each (path, value) pair:
rarity = -log2(frequency / total)
A value seen once in 10,000 records scores ~13.3 bits. A value seen in half the records scores 1 bit. The threshold adapts per path: values exceeding mean_rarity + 2 * MAD_of_rarity are flagged.
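A minimal sketch of rarity scoring with the adaptive cutoff follows. It assumes the mean and MAD are taken over the distinct values of a path (the text does not say whether they are taken per value or per observation), and the status counts are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, median

def rarity_outliers(values):
    """Return {value: bits} for values whose self-information exceeds the adaptive cutoff."""
    total = len(values)
    bits = {v: -math.log2(c / total) for v, c in Counter(values).items()}
    scores = list(bits.values())
    med = median(scores)
    mad = median(abs(s - med) for s in scores)
    cutoff = mean(scores) + 2 * mad  # mean_rarity + 2 * MAD_of_rarity
    return {v: round(b, 2) for v, b in bits.items() if b > cutoff}

# Hypothetical status column with one value seen a single time.
statuses = (["adjudicated"] * 400 + ["pending"] * 300 + ["approved"] * 200
            + ["in_review"] * 90 + ["denied"] * 9 + ["voided"] * 1)
flagged = rarity_outliers(statuses)
print(flagged)
```

In this sample only the single "voided" record clears the cutoff; "denied" is rare but still within the spread of the other values.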
Dimension 3: Structural Deviations
Method: Jaccard distance from the dominant path set.
In batch analysis, Vajra computes the most common set of paths (the structural mode). Each document is compared:
structural_anomaly = 1 - Jaccard(doc_paths, mode_paths)
Documents with structural anomaly > 0.2 are flagged, with the specific missing and extra paths listed.
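As a rough illustration, the comparison against the structural mode might look like the sketch below. The input shape (a dict of record id to path set) and the function name are assumptions made for the example.

```python
from collections import Counter

def structural_anomalies(docs, threshold=0.2):
    """docs maps a record id to the set of wildcard paths seen in that record."""
    shapes = Counter(frozenset(paths) for paths in docs.values())
    mode_paths = shapes.most_common(1)[0][0]  # the structural mode
    report = {}
    for record, paths in docs.items():
        paths = set(paths)
        union = paths | mode_paths
        distance = 1 - len(paths & mode_paths) / len(union) if union else 0.0
        if distance > threshold:
            report[record] = {
                "distance": round(distance, 2),
                "missing": sorted(mode_paths - paths),
                "extra": sorted(paths - mode_paths),
            }
    return report

# Hypothetical batch: record 4 lacks both provider fields.
full = {"$.id", "$.status", "$.amount", "$.provider.npi", "$.provider.taxonomy"}
docs = {1: full, 2: full, 3: full, 4: full - {"$.provider.npi", "$.provider.taxonomy"}}
report = structural_anomalies(docs)
print(report)
```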
Dimension 4: Type Instability
Method: per-path type instability score.
instability = 1 - (dominant_type_count / total_observations)
Paths with instability > 0.01 are flagged. Individual records contributing the minority type are identified.
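The instability score is simple enough to sketch directly. The example below uses Python type names as a stand-in for JSON types, which is an assumption of the sketch, as is the sample data.

```python
from collections import Counter

def type_instability(values):
    """Return (dominant_type, instability, indices_of_minority_values)."""
    type_counts = Counter(type(v).__name__ for v in values)
    dominant, dominant_count = type_counts.most_common(1)[0]
    instability = 1 - dominant_count / len(values)
    minority = [i for i, v in enumerate(values) if type(v).__name__ != dominant]
    return dominant, instability, minority

# Hypothetical quantity column: one record carries a string instead of a number.
quantities = [1, 2, 1, "3", 4, 1, 2, 2, 1, 1]
dominant, instability, minority = type_instability(quantities)
print(dominant, round(instability, 3), minority)
```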
Example: Text Output
vajra anomalies claims_batch.ndjson
=== Anomaly Report ===
Records analyzed: 1,247
Anomalies found: 8
--- Numeric Outliers ---
$.claims[*].service_lines[*].charge_amount
Record 834: 47,250.00 (z_MAD = 6.3, median = 285.00, MAD = 195.00)
Record 1102: 0.01 (z_MAD = -4.8, median = 285.00, MAD = 195.00)
$.claims[*].service_lines[*].allowed_amount
Record 834: 45,000.00 (z_MAD = 5.9, median = 210.00, MAD = 142.00)
--- Rarity Outliers ---
$.claims[*].status
Record 419: "voided" (10.3 bits, 1 of 1,247 records)
$.claims[*].service_lines[*].adjustment.reason
Record 77: "N-832" (13.1 bits, 2 of 17,458 service lines)
--- Structural Deviations ---
Record 662: Jaccard distance 0.31 from structural mode
Missing paths:
$.claims[*].subscriber.group_number
$.claims[*].subscriber.member_id
$.claims[*].provider.npi
$.claims[*].provider.taxonomy
--- Type Instability ---
$.claims[*].service_lines[*].quantity
Records 88, 204, 917: string where number expected
Instability: 0.002 (3 of 1,247 records)
Example: JSON Output
vajra anomalies claims_batch.ndjson --format json
{
"records_analyzed": 1247,
"anomaly_count": 8,
"numeric_outliers": [
{
"path": "$.claims[*].service_lines[*].charge_amount",
"record": 834,
"value": 47250.0,
"z_mad": 6.3,
"median": 285.0,
"mad": 195.0
},
{
"path": "$.claims[*].service_lines[*].charge_amount",
"record": 1102,
"value": 0.01,
"z_mad": -4.8,
"median": 285.0,
"mad": 195.0
}
],
"rarity_outliers": [
{
"path": "$.claims[*].status",
"record": 419,
"value": "voided",
"self_information_bits": 10.3,
"frequency": 1,
"total": 1247
}
],
"structural_deviations": [
{
"record": 662,
"jaccard_distance": 0.31,
"missing_paths": [
"$.claims[*].subscriber.group_number",
"$.claims[*].subscriber.member_id",
"$.claims[*].provider.npi",
"$.claims[*].provider.taxonomy"
],
"extra_paths": []
}
],
"type_instability": [
{
"path": "$.claims[*].service_lines[*].quantity",
"records": [88, 204, 917],
"expected_type": "number",
"actual_type": "string",
"instability": 0.002
}
]
}
Example: With --explain
vajra anomalies claim.json --explain
--- Numeric Outliers ---
$.claims[*].service_lines[*].charge_amount
Record 834: 47,250.00
z_MAD: 6.3
median: 285.00
MAD: 195.00
threshold: 3.5
score decomposition:
rarity: 0.82
instability: 0.00
entropy_signal: 0.34
structural_coverage: 0.15
anomaly_strength: 0.95
concern_relevance: 0.40
composite: 0.71
When to Use It
- Cold data triage. You received a batch of claims and need to know what is unusual before reading any of them.
- Fraud screening. The `--profile fraud` variant amplifies rarity and numeric outlier weights. Unusual charge amounts, rare status values, and missing provider fields all surface.
- Data quality monitoring. Run `anomalies` on each day’s batch in CI. If the anomaly count spikes, something changed upstream.
- Pre-audit preparation. Give auditors the anomaly report alongside the raw data. They know where to look.
Pairs Well With
- `stats`: anomalies are scored against the statistical baseline that `stats` computes
- `essence`: anomalies feed into the essence as high-priority observations
- `drift`: anomalies detect deviations within a batch; `drift` detects changes between batches
- `cluster`: structural deviations often indicate documents that belong to different clusters
fingerprint
fingerprint computes structural fingerprints for a JSON document — cryptographic hashes that capture what the document looks like independently of its values.
Two documents with the same fingerprint have the same structure. If the fingerprint changes, the schema changed. This is the fastest possible regression check.
Usage
vajra fingerprint <input> [flags]
Arguments:
| Argument | Description |
|---|---|
| `<input>` | Path to a JSON file, `-` for stdin, or an HTTP URL |
Flags:
| Flag | Description | Default |
|---|---|---|
| `--format <fmt>` | Output format: text, json, markdown, compact-ai | text |
| `--input-format <fmt>` | Override auto-detected input format | auto |
| `--streaming` | Force streaming mode | off |
| `--redact` | Apply built-in redaction before output | off |
| `--quiet` | Suppress progress output | off |
Fingerprint Types
Path Set Fingerprint
BLAKE3 hash of the sorted set of distinct wildcard paths. Captures what fields exist, ignoring their types and values.
Two documents with the same path set fingerprint have identical field structures — the same keys at the same nesting levels, even if every value differs.
Typed Path Fingerprint
BLAKE3 hash of sorted (path, dominant_type) pairs. Captures what fields exist and what types they carry.
This is strictly more specific than the path set fingerprint. A type migration (e.g., quantity changing from string to number) changes the typed path fingerprint but not the path set fingerprint.
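To make the difference between the two fingerprints concrete, here is a small sketch. It is not Vajra's code: BLAKE3 is not in the Python standard library, so `hashlib.blake2b` is used as a stand-in, and the path syntax, serialization, and helper names (`walk`, `digest`, `fingerprints`) are assumptions.

```python
import hashlib
import json
from collections import Counter, defaultdict

def json_type(value):
    if isinstance(value, bool):
        return "boolean"  # check bool first: bool is a subclass of int in Python
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    return "array" if isinstance(value, list) else "object"

def walk(node, path="$"):
    """Yield (wildcard_path, type) for every node below the root."""
    if isinstance(node, dict):
        children = [(f"{path}.{key}", child) for key, child in node.items()]
    elif isinstance(node, list):
        children = [(f"{path}[*]", child) for child in node]
    else:
        return
    for child_path, child in children:
        yield child_path, json_type(child)
        yield from walk(child, child_path)

def digest(items):
    h = hashlib.blake2b(digest_size=32)  # stand-in for BLAKE3 (assumption)
    for item in sorted(items):
        h.update(json.dumps(item).encode() + b"\n")
    return h.hexdigest()

def fingerprints(doc):
    types = defaultdict(Counter)
    for path, kind in walk(doc):
        types[path][kind] += 1
    path_set = digest(types.keys())
    typed_path = digest((p, c.most_common(1)[0][0]) for p, c in types.items())
    return path_set, typed_path

# Same fields, but quantity migrated from string to number.
before = {"id": 1, "items": [{"quantity": "2"}, {"quantity": "3"}]}
after = {"id": 99, "items": [{"quantity": 5}]}
fp_before, fp_after = fingerprints(before), fingerprints(after)
print(fp_before[0] == fp_after[0], fp_before[1] == fp_after[1])
```

The two documents agree on the path set fingerprint and disagree on the typed path fingerprint, which is the type-migration case described above.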
Shape Fingerprint (Merkle)
Bottom-up hash computed via Merkle subtree hashing:
- Leaf nodes hash their type
- Objects hash the sorted concatenation of `(key, child_hash)` pairs
- Arrays hash the concatenation of child hashes
The root hash is the shape fingerprint. This captures the full structural shape including nesting hierarchy.
A critical secondary benefit: subtree hashes at every node enable motif detection as a byproduct. Identical subtrees produce identical hashes. This falls out of a single O(n) traversal.
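The bottom-up hashing and the motif byproduct can be sketched as follows. The payload encoding, the use of `hashlib.blake2b` in place of BLAKE3, and the choice to count only objects and arrays as motifs are assumptions of this example.

```python
import hashlib
from collections import Counter

def leaf_type(value):
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "boolean"
    return "number" if isinstance(value, (int, float)) else "string"

def shape_hash(node, motifs):
    """Return the Merkle shape hash of node, counting container hashes in motifs."""
    if isinstance(node, dict):
        parts = sorted(f"{key}:{shape_hash(child, motifs)}" for key, child in node.items())
        payload = "{" + ",".join(parts) + "}"
    elif isinstance(node, list):
        payload = "[" + ",".join(shape_hash(child, motifs) for child in node) + "]"
    else:
        payload = leaf_type(node)  # leaves hash their type, never their value
    digest = hashlib.blake2b(payload.encode(), digest_size=32).hexdigest()
    if isinstance(node, (dict, list)):
        motifs[digest] += 1
    return digest

# Hypothetical claim with two structurally identical service lines.
claim = {
    "service_lines": [
        {"code": "99213", "charge": 285.0},
        {"code": "99214", "charge": 125.0},
    ],
    "diagnosis": {"code": "E11.9"},
}
motifs = Counter()
root = shape_hash(claim, motifs)
repeated = {h[:8]: n for h, n in motifs.items() if n > 1}
print(root[:8], repeated)
```

Each node is visited once, and the repeated service line shape shows up in `motifs` without a second pass.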
MinHash Signature
A 128-hash MinHash signature over the path set, enabling constant-time Jaccard similarity estimation between documents. Used internally by cluster and drift, but exposed here for direct access.
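For readers unfamiliar with MinHash, the sketch below shows the general technique: one minimum per seeded hash function, with similarity estimated as the fraction of matching positions. The hash family (salted `hashlib.blake2b`) is an assumption and is unlikely to match Vajra's internal choice; the function also assumes a non-empty path set.

```python
import hashlib

NUM_HASHES = 128

def minhash(paths, num_hashes=NUM_HASHES):
    """Return a MinHash signature: the minimum 64-bit hash of the set per seed."""
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(p.encode(), digest_size=8, salt=salt).digest(), "big")
            for p in paths
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical path sets with a true Jaccard similarity of 20 / 40 = 0.5.
paths_a = {f"$.field_{i}" for i in range(30)}
paths_b = {f"$.field_{i}" for i in range(10, 40)}
sig_a, sig_b = minhash(paths_a), minhash(paths_b)
estimate = estimate_jaccard(sig_a, sig_b)
print(round(estimate, 2))
```

Comparing two signatures costs 128 integer comparisons regardless of how many paths each document has, which is what makes the estimate constant-time.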
Example: Text Output
vajra fingerprint claim.json
=== Fingerprints ===
Path set: a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2
Typed path: f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8
Shape: 1234abcd56781234abcd56781234abcd56781234abcd56781234abcd56781234abcd
MinHash: [128 x u64 values]
=== Subtree Motifs ===
Hash d4e5f6a1... appears 14 times (service line object)
Hash b2c3d4e5... appears 2 times (diagnosis object)
Example: JSON Output
vajra fingerprint claim.json --format json
{
"path_set": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2",
"typed_path": "f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8",
"shape": "1234abcd56781234abcd56781234abcd56781234abcd56781234abcd56781234abcd",
"minhash": [18446744073709551615, 12345678901234567890, "..."],
"motifs": [
{
"hash": "d4e5f6a1...",
"count": 14,
"node_count": 8,
"representative_path": "$.claims[*].service_lines[*]"
},
{
"hash": "b2c3d4e5...",
"count": 2,
"node_count": 3,
"representative_path": "$.claims[*].diagnosis[*]"
}
]
}
Use Cases
CI Regression Check
Store the fingerprint of your API’s response format. On every deploy, compare:
# Capture baseline
vajra fingerprint api_response.json --format json > baseline_fp.json
# On each CI run
vajra fingerprint today_response.json --format json > current_fp.json
diff baseline_fp.json current_fp.json
If the path set fingerprint changed, fields were added or removed. If the typed path fingerprint changed, a type migrated. If only the shape fingerprint changed, the nesting structure shifted.
Quick Structural Comparison
vajra fingerprint file_a.json --format json | jq .path_set
vajra fingerprint file_b.json --format json | jq .path_set
Same hash? Same structure. Different hash? Feed them to drift for the details.
Motif Discovery
The motif section reveals repeated substructures. In a medical claim, you will see the service line object repeated 14 times with the same hash — proof that those 14 elements are structurally identical.
When to Use It
- Schema regression gating. The fastest way to detect structural changes.
- Deduplication. Documents with identical shape fingerprints are structurally identical.
- Batch pre-screening. Fingerprint a batch before clustering to quickly identify structural families.
- Motif identification. What substructures repeat, and how many times?
Pairs Well With
- `drift`: when fingerprints differ, `drift` tells you exactly what changed
- `cluster`: uses MinHash signatures internally for similarity estimation
- `inspect`: `fingerprint` is the focused subset of what `inspect` computes
- `essence`: motif discovery feeds directly into essence compression
essence
essence is the command Vajra was built for. It takes a JSON document, runs the full analysis pipeline, scores every observation against a concern profile’s weight vector, and renders a compressed, ranked, faithful representation — shaped for whoever is reading it.
An essence is not a summary. A summary loses information probabilistically. An essence compresses information deterministically, preserving everything above a configurable importance threshold while collapsing structural noise.
Usage
vajra essence <input> [flags]
Arguments:
| Argument | Description |
|---|---|
| `<input>` | Path to a JSON file, `-` for stdin, directory, or HTTP URL |
Flags:
| Flag | Description | Default |
|---|---|---|
| `--format <fmt>` | Output format: text, json, markdown, compact-ai | text |
| `--profile <name>` | Concern profile: staff, engineer, auditor, ai, fraud, or custom | engineer |
| `--budget <N>` | Approximate token budget for output | unlimited |
| `--config <path>` | Path to TOML file with custom profile definitions | none |
| `--input-format <fmt>` | Override auto-detected input format | auto |
| `--streaming` | Force streaming mode | off |
| `--redact` | Apply built-in redaction before rendering | off |
| `--explain` | Include score decomposition for each observation | off |
| `--quiet` | Suppress progress output | off |
How Essence Construction Works
- Collect candidates. All observations from the analysis pipeline (notable fields, motifs, anomalies, relationship discoveries) become candidates.
- Score each candidate using the active profile’s six-dimensional weight vector:
  - `rarity`: self-information of the observation
  - `instability`: type instability at the path
  - `entropy_signal`: distance from 0.5 normalized entropy (both constants and noise score high)
  - `structural_coverage`: fraction of total nodes under this path
  - `anomaly_strength`: maximum anomaly score across dimensions
  - `concern_relevance`: profile-specific boost for this path or observation type
- Collapse motifs. Repeated structural patterns are represented once with a count and specific variations noted.
- Rank by composite score with deterministic tie-breaking (shallower paths first, then lexicographic).
- Apply token budget (if `--budget` is set). Greedy selection by score-per-token, the fractional knapsack approximation.
- Render using the profile’s vocabulary and rendering style.
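The scoring and ranking steps can be sketched as a weighted sum followed by a sort with an explicit tie-break key. Treat everything here as illustrative: the weight values, the depth measure (counting `.` and `[` in the path), and the candidate data are assumptions, not Vajra's internals.

```python
# Hypothetical weight vector; the six names follow the dimensions listed above.
WEIGHTS = {
    "rarity": 0.15, "instability": 0.25, "entropy_signal": 0.15,
    "structural_coverage": 0.15, "anomaly_strength": 0.15, "concern_relevance": 0.15,
}

def composite(dims, weights=WEIGHTS):
    """Weighted sum over the six dimensions; missing dimensions count as 0."""
    return sum(w * dims.get(name, 0.0) for name, w in weights.items())

def depth(path):
    # Assumption: depth is the number of "." and "[" segment markers in the path.
    return path.count(".") + path.count("[")

def rank(candidates):
    """Highest score first; ties go to shallower paths, then lexicographic order."""
    return sorted(
        candidates,
        key=lambda c: (-composite(c["dims"]), depth(c["path"]), c["path"]),
    )

candidates = [
    {"path": "$.claims[*].service_lines[*].allowed_amount",
     "dims": {"rarity": 0.4, "anomaly_strength": 0.6}},
    {"path": "$.claims[*].status",
     "dims": {"rarity": 0.4, "anomaly_strength": 0.6}},
    {"path": "$.claims[*].provider.npi",
     "dims": {"instability": 0.8, "anomaly_strength": 0.9}},
]
ordered = [c["path"] for c in rank(candidates)]
print(ordered)
```

Because the sort key is fully determined by the score, the depth, and the path string, the same candidates always come out in the same order, which is the property the tie-breaking rule is there to guarantee.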
Profiles at a Glance
| Profile | Vocabulary | Rendering | Emphasizes |
|---|---|---|---|
| `staff` | Plain language | Narrative sections | Anomalies, structural coverage |
| `engineer` | Technical, JSONPath | Tabular, list-based | Type instability, all dimensions balanced |
| `auditor` | Formal | Completeness-focused | Instability, concern relevance, missingness |
| `ai` | Compact, terse | Machine-readable | Entropy signal, structural coverage, anomalies |
| `fraud` | Investigative | Outlier-focused | Rarity, anomaly strength |
See Profiles for full weight vectors and customization.
Example: Staff Profile
vajra essence claim.json --profile staff
=== Essence (staff profile) ===
Document Summary:
1 claim with 14 service lines, 1 patient, 2 diagnosis codes.
Primary status: partially adjudicated.
What Stands Out:
- 3 service lines are missing allowed amounts (lines 2, 7, 11).
This field is present in 79% of service lines — its absence is notable.
- Adjustment reason code "CO-45" repeats across 8 of 14 lines.
Repetition at this frequency suggests a systematic pattern, not random variation.
- 1 diagnosis structure differs from the other.
The second diagnosis carries an extra "qualifier" field.
- Provider taxonomy code is absent.
This field is expected in 94% of claims in typical batches.
What This Likely Means:
- Most of the claim is consistent and well-formed.
- A subset of service lines appears incomplete or differently processed.
- The repeated adjustment code points to a systematic issue.
No JSONPath. No z-scores. No jargon. The staff member gets what they need to act.
Example: Engineer Profile
vajra essence claim.json --profile engineer
=== Essence (engineer profile) ===
Structure: 847 nodes, 23 distinct paths, max depth 6
Fingerprint (path set): a1b2c3d4...
Dominant motif: $.claims[*].service_lines[*] (14 instances, 8 fields each)
Notable paths:
$.claims[*].service_lines[*].allowed_amount
null_rate: 0.214, entropy: 3.12, type: number (100%)
absent in 3 of 14 service lines (indices 2, 7, 11)
$.claims[*].service_lines[*].adjustment.reason
entropy: 1.56, cardinality: 4
dominant value: "CO-45" (57.1%, 8 of 14)
$.claims[*].diagnosis[1]
structural deviation: extra field "qualifier" (not in diagnosis[0])
Type stability: 100% across all paths
Array homogeneity: service_lines 100% (1 shape hash), diagnosis 50% (2 shape hashes)
Example: AI Profile with Token Budget
vajra essence claim.json --profile ai --format json --budget 500
{
"vajra_essence": {
"version": "0.1.0",
"profile": "ai",
"input_hash": "b3a7f2c1d4e5...",
"structure": {
"root_type": "object",
"total_nodes": 847,
"distinct_paths": 23,
"max_depth": 6
},
"dominant_motif": {
"path": "$.claims[0].service_lines[*]",
"count": 14,
"shape_hash": "f2c1d4e5...",
"fields": ["procedure_code", "service_date", "charge_amount", "allowed_amount", "status", "adjustment"]
},
"anomalies": [
{
"path": "$.claims[0].service_lines[2,7,11].allowed_amount",
"type": "missing",
"severity": 4.2
},
{
"path": "$.claims[0].diagnosis[1]",
"type": "structural_deviation",
"severity": 3.1
}
],
"notable": [
{
"path": "$.claims[0].service_lines[*].adjustment.reason_code",
"observation": "value 'CO-45' in 8/14 instances (57%)"
}
],
"meta": {
"budget_tokens": 500,
"truncated": false,
"observations_included": 4,
"observations_total": 7
}
}
}
The AI profile collapses aggressively. Motifs are represented once with counts. Observations are sorted by score-per-token. The meta.truncated field tells the downstream model whether anything was cut.
Example: Compact-AI Format
vajra essence claim.json --profile ai --format compact-ai --budget 300
{"v":"vajra/1","n":847,"p":23,"d":6,"motif":{"p":"$.claims[0].service_lines[*]","c":14},"a":[{"p":"$.claims[0].service_lines[2,7,11].allowed_amount","t":"miss","s":4.2},{"p":"$.claims[0].diagnosis[1]","t":"struct","s":3.1}],"drill":[{"p":"$.claims[*].service_lines","avail":["stats","anomalies","motifs"]}]}
Maximum compression. Every key shortened. The drill section tells the LLM which paths have deeper analysis available for follow-up queries.
Example: With --explain
vajra essence claim.json --profile engineer --explain
Notable paths:
$.claims[*].service_lines[*].allowed_amount
null_rate: 0.214, entropy: 3.12
[score: 0.37]
rarity: 0.42 x weight 0.15 = 0.063
instability: 0.00 x weight 0.25 = 0.000
entropy_signal: 0.24 x weight 0.15 = 0.036
structural_coverage: 0.18 x weight 0.15 = 0.027
anomaly_strength: 0.89 x weight 0.15 = 0.134
concern_relevance: 0.75 x weight 0.15 = 0.113
Every score decomposed into its six dimensions. Nothing hidden. Nothing magic.
The Token Budget
When --budget N is specified, Vajra estimates the token cost of each observation (word count x 1.3) and selects greedily by score-per-token until the budget is exhausted. This is the greedy ratio rule from the fractional knapsack problem. Because observations are indivisible, the result is a fast approximation rather than a guaranteed optimum.
The budget is approximate, not exact. It prevents bloated output without requiring precise token counting.
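A minimal sketch of budgeted selection is shown below. It assumes the estimate is rounded up, and that an observation too large for the remaining budget is skipped while smaller ones are still considered; the text does not specify either detail. The observation strings are hypothetical.

```python
import math

def estimate_tokens(text):
    """Approximate token cost: word count x 1.3, rounded up, at least 1."""
    return max(1, math.ceil(len(text.split()) * 1.3))

def select_within_budget(observations, budget):
    """observations: list of (score, text). Greedy by score-per-token."""
    ranked = sorted(observations, key=lambda o: (-o[0] / estimate_tokens(o[1]), o[1]))
    chosen, used = [], 0
    for score, text in ranked:
        cost = estimate_tokens(text)
        if used + cost <= budget:  # skip anything that no longer fits
            chosen.append(text)
            used += cost
    return chosen, used

observations = [
    (0.9, "allowed_amount missing in 3 of 14 service lines"),
    (0.5, "CO-45 repeats in 8 of 14 lines"),
    (0.2, "diagnosis 1 has extra qualifier field"),
]
chosen, used = select_within_budget(observations, budget=20)
print(chosen, used)
```

With a budget of 20, the top observation costs 11 estimated tokens, the second (10) no longer fits, and the third (8) fills most of the remainder.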
When to Use It
- Non-technical stakeholders. `--profile staff` translates the data into plain language.
- AI pipelines. `--profile ai --format compact-ai --budget 500` compresses a 1000-node document into a token-efficient context.
- Audits. `--profile auditor` emphasizes completeness, missingness, and traceability.
- Fraud screening. `--profile fraud` amplifies anomalies and rare patterns.
- Documentation. `--format markdown` renders the essence as publishable documentation.
Pairs Well With
- `stats`: the statistical baseline that feeds scoring
- `anomalies`: anomalies are the highest-priority candidates in most profiles
- `drift`: drift observations appear in the essence when a baseline is available
- Profiles: full control over what gets emphasized and how it gets rendered
drift
drift detects and quantifies structural, type, and distributional changes between two JSON documents. It answers the question every engineer asks when something breaks: what changed?
Not what changed in the values — what changed in the shape, types, and statistical behavior of the data.
Usage
vajra drift <baseline> <candidate> [flags]
Arguments:
| Argument | Description |
|---|---|
| `<baseline>` | The reference document (the “before”) |
| `<candidate>` | The comparison document (the “after”) |
Flags:
| Flag | Description | Default |
|---|---|---|
| `--format <fmt>` | Output format: text, json, markdown, compact-ai | text |
| `--profile <name>` | Concern profile for severity weighting | engineer |
| `--input-format <fmt>` | Override auto-detected input format | auto |
| `--redact` | Apply built-in redaction before output | off |
| `--quiet` | Suppress progress output | off |
| `--group-by <path>` | JSONPath for population-level comparison (e.g., `'$.author_type'`) | off |
Population-Level Comparison
When --group-by is specified, drift partitions records by the field value and computes pairwise drift between all groups. Instead of comparing two documents, you compare two (or more) subpopulations within the same dataset.
vajra drift prs.ndjson --group-by '$.author_type'
Drift Report (grouped by $.author_type)
Groups: bot (412 records), human (835 records)
Pairwise drift: bot vs human
Structural similarity: 0.91 (Jaccard)
Distribution shifts:
$.files_changed JSD: 0.42 (high)
bot: median 1.0, p95 3.0
human: median 4.0, p95 18.0
$.review_comments JSD: 0.38 (moderate)
bot: median 0.0, p95 1.0
human: median 2.0, p95 8.0
Overall severity: HIGH (significant distributional divergence)
This is useful for comparing behavioral subgroups — bot vs. human PRs, different teams, production vs. staging, before vs. after a policy change — without needing separate files.
Drift Dimensions
Structural Drift
Path set symmetric difference:
added_paths = paths(candidate) \ paths(baseline)
removed_paths = paths(baseline) \ paths(candidate)
New fields appearing. Old fields disappearing. The most visible form of schema evolution.
Type Drift
For each path present in both documents, the dominant type is compared. Any path where the type changed (e.g., string to number, array to object) is flagged.
Distributional Drift
Jensen-Shannon Divergence (JSD) measures how much value distributions shifted between baseline and candidate:
JSD(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M)
where M = 0.5 * (P + Q).
With base-2 logarithms, JSD is symmetric, always finite, and bounded to [0, 1], and its square root is a proper metric. This means drift magnitudes can be meaningfully compared and accumulated across paths.
For numeric paths, Vajra also computes the 1D Wasserstein distance (earth mover’s distance), which captures how far values moved, not just that they moved.
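Both measures are short to write down. The sketch below computes JSD in bits over categorical distributions given as dicts, and the 1D Wasserstein distance for the special case of two equal-size samples (where it reduces to the mean absolute difference of the sorted values). The function names and inputs are assumptions for illustration.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence in bits between two categorical distributions."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

def wasserstein_1d(xs, ys):
    """Earth mover's distance between two equal-size numeric samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch only handles equal-size samples")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

same = jsd({"active": 0.8, "error": 0.2}, {"active": 0.8, "error": 0.2})
disjoint = jsd({"active": 1.0}, {"error": 1.0})
shift = wasserstein_1d([100.0, 200.0, 300.0], [150.0, 250.0, 350.0])
print(same, disjoint, shift)
```

Identical distributions score 0, distributions with no shared values score 1, and a uniform shift of 50 in the numeric sample yields a Wasserstein distance of 50, which is the "how far" that JSD alone does not report.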
Drift Classification
Each drifted path receives a classification:
| Class | Meaning |
|---|---|
| `additive` | New path appeared in candidate |
| `subtractive` | Path present in baseline, absent in candidate |
| `type-mutative` | Dominant type changed |
| `distributional` | Value distribution shifted (JSD > threshold) |
| `cardinality-shift` | Array lengths changed significantly |
| `null-rate-shift` | Null/missing ratio changed significantly |
Severity Scoring
The overall drift severity is a weighted sum of drift dimensions, tuned by the active profile:
- Auditor profiles weight subtractive drift highest (missing data is critical for compliance)
- Engineer profiles weight type-mutative drift highest (breaking changes)
- Fraud profiles weight distributional drift highest (behavioral shifts)
Example: Text Output
vajra drift yesterday.json today.json
Drift Report: yesterday.json -> today.json
Structural similarity: 0.94 (Jaccard)
Added paths (2):
$.response.metadata.processing_flags [array of strings]
$.response.metadata.api_version [string]
Removed paths (0): none
Type changes (1):
$.response.items[*].quantity string -> number (clean type migration)
Distribution shifts (1):
$.response.items[*].status JSD: 0.34 (moderate)
before: {"active": 0.82, "pending": 0.15, "error": 0.03}
after: {"active": 0.61, "pending": 0.12, "error": 0.27}
note: "error" rate increased 9x
Null rate changes (0): none
Overall severity: MEDIUM (structural additions + significant distribution shift)
Example: JSON Output
vajra drift yesterday.json today.json --format json
{
"baseline": "yesterday.json",
"candidate": "today.json",
"jaccard_similarity": 0.94,
"overall_severity": "medium",
"added_paths": [
{
"path": "$.response.metadata.processing_flags",
"type": "array"
},
{
"path": "$.response.metadata.api_version",
"type": "string"
}
],
"removed_paths": [],
"type_changes": [
{
"path": "$.response.items[*].quantity",
"baseline_type": "string",
"candidate_type": "number",
"jsd": 0.0
}
],
"distribution_shifts": [
{
"path": "$.response.items[*].status",
"jsd": 0.34,
"baseline_distribution": {
"active": 0.82,
"pending": 0.15,
"error": 0.03
},
"candidate_distribution": {
"active": 0.61,
"pending": 0.12,
"error": 0.27
}
}
],
"null_rate_changes": []
}
Example: Medical Claim Drift
vajra drift baseline_claim.json updated_claim.json --profile auditor
Drift Report: baseline_claim.json -> updated_claim.json
Structural similarity: 0.87 (Jaccard)
Added paths (3):
$.claims[*].service_lines[*].modifier_codes [array of strings]
$.claims[*].rendering_provider [object]
$.claims[*].rendering_provider.npi [string]
Removed paths (1):
$.claims[*].provider.taxonomy [string]
** SUBTRACTIVE: field present in baseline, absent in candidate **
Type changes (0): none
Distribution shifts (2):
$.claims[*].service_lines[*].status JSD: 0.22
before: {"adjudicated": 0.85, "pending": 0.15}
after: {"adjudicated": 0.64, "pending": 0.21, "denied": 0.15}
note: new value "denied" appeared
$.claims[*].service_lines[*].charge_amount Wasserstein: 125.40
before: median 285.00, p95 890.00
after: median 410.00, p95 1350.00
note: charges shifted upward
Overall severity: HIGH (subtractive drift in auditor profile)
The auditor profile flags the removed taxonomy path as high severity because subtractive drift — data that was present and is now absent — is the most dangerous form of schema evolution for compliance.
When to Use It
- API version migration. Compare the response shape before and after a deploy.
- Vendor data monitoring. Compare this week’s feed to last week’s. Detect undocumented schema changes before they break your pipeline.
- Regulatory compliance. Prove that the data structure has not drifted outside acceptable bounds.
- CI integration. Gate deploys on drift severity. If drift exceeds a threshold, fail the build and require review.
Pairs Well With
- `fingerprint`: quick structural same-or-different check before detailed drift analysis
- `inspect`: understand each document’s structure before comparing
- `anomalies`: drift detects changes between versions; anomalies detect deviations within a version
- `essence`: drift observations feed into essence generation when a baseline is provided
cluster
cluster groups similar JSON documents by structural similarity. Feed it a batch of files and it tells you how many structural families exist, which documents belong to each, and which documents are structural outliers that fit nowhere.
No predefined cluster count. No training. The algorithm discovers the natural grouping from the data.
Usage
vajra cluster <inputs...> [flags]
Arguments:
| Argument | Description |
|---|---|
| `<inputs...>` | One or more JSON files, glob patterns, or directories |
Flags:
| Flag | Description | Default |
|---|---|---|
| `--format <fmt>` | Output format: text, json, markdown, compact-ai | text |
| `--input-format <fmt>` | Override auto-detected input format | auto |
| `--redact` | Apply built-in redaction before output | off |
| `--quiet` | Suppress progress output | off |
How It Works
Small Batches (< 1,000 documents)
Exact pairwise Jaccard similarity over wildcard path sets:
J(A, B) = |paths(A) intersection paths(B)| / |paths(A) union paths(B)|
O(n^2) pairwise but tractable at small scale. Results are exact and deterministic.
Large Batches
MinHash + Locality-Sensitive Hashing (LSH).
- During fingerprinting, each document receives a 128-hash MinHash signature.
- LSH partitions each signature into bands, hashing each band into buckets.
- Documents sharing a bucket in any band are candidate pairs.
- Connected components in the candidate graph form initial clusters.
- Within each component, exact pairwise similarity refines the grouping.
The probability curve is tuned so that documents with Jaccard similarity > 0.5 have > 98% chance of being found as candidates, while documents with similarity < 0.2 have < 2% false positive rate.
This achieves near-linear time clustering: O(n) for MinHash, O(n) amortized for LSH indexing.
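The banding step can be sketched as follows. The band count is a free parameter here and the signatures are toy values; Vajra's actual band and row configuration is not documented in this section, so nothing below should be read as its tuning. The helper `candidate_probability` is the standard S-curve for banded LSH, 1 - (1 - s^r)^b.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, bands):
    """signatures: {doc_id: list of ints}. Return the set of candidate pairs."""
    rows = len(next(iter(signatures.values()))) // bands
    pairs = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            # Documents whose signature agrees on every row of this band collide.
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(doc)
        for docs in buckets.values():
            pairs.update(combinations(sorted(docs), 2))
    return pairs

def candidate_probability(similarity, bands, rows):
    """Chance that two documents with this Jaccard similarity share a bucket."""
    return 1 - (1 - similarity ** rows) ** bands

# Toy 8-value signatures split into 4 bands of 2 rows each.
signatures = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8],
    "B": [1, 2, 9, 9, 9, 9, 9, 9],   # agrees with A on the first band only
    "C": [0, 0, 0, 0, 0, 0, 0, 0],
}
pairs = lsh_candidates(signatures, bands=4)
print(pairs)
```

Only bucket collisions are compared, so the number of exact similarity computations depends on how many candidates the bands produce rather than on all n^2 pairs.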
Example: Text Output
vajra cluster claims_batch/*.json
=== Cluster Report ===
Documents: 247
Clusters: 3
--- Cluster 0 (198 documents, 80.2%) ---
Representative: claim_001.json
Distinct paths: 23
Structural signature: a1b2c3d4...
Members: claim_001.json, claim_002.json, claim_003.json, ... (+195 more)
--- Cluster 1 (41 documents, 16.6%) ---
Representative: claim_048.json
Distinct paths: 27
Structural signature: e5f6a7b8...
Additional paths vs Cluster 0:
$.claims[*].service_lines[*].modifier_codes
$.claims[*].rendering_provider
$.claims[*].rendering_provider.npi
$.claims[*].rendering_provider.taxonomy
Members: claim_048.json, claim_052.json, claim_067.json, ... (+38 more)
--- Cluster 2 (8 documents, 3.2%) ---
Representative: claim_199.json
Distinct paths: 18
Structural signature: c9d0e1f2...
Missing paths vs Cluster 0:
$.claims[*].subscriber.group_number
$.claims[*].subscriber.member_id
$.claims[*].provider.taxonomy
$.claims[*].service_lines[*].adjustment
$.claims[*].service_lines[*].adjustment.reason
Members: claim_199.json, claim_201.json, claim_215.json, ... (+5 more)
** Potential structural anomalies — missing common fields **
=== Similarity Matrix (cluster centroids) ===
Cluster 0 Cluster 1 Cluster 2
Cluster 0 1.000 0.852 0.783
Cluster 1 0.852 1.000 0.667
Cluster 2 0.783 0.667 1.000
Example: JSON Output
vajra cluster claims_batch/*.json --format json
{
"document_count": 247,
"cluster_count": 3,
"clusters": [
{
"id": 0,
"size": 198,
"representative": "claim_001.json",
"distinct_paths": 23,
"structural_signature": "a1b2c3d4...",
"members": ["claim_001.json", "claim_002.json", "..."]
},
{
"id": 1,
"size": 41,
"representative": "claim_048.json",
"distinct_paths": 27,
"structural_signature": "e5f6a7b8...",
"additional_paths": [
"$.claims[*].service_lines[*].modifier_codes",
"$.claims[*].rendering_provider",
"$.claims[*].rendering_provider.npi",
"$.claims[*].rendering_provider.taxonomy"
],
"members": ["claim_048.json", "claim_052.json", "..."]
},
{
"id": 2,
"size": 8,
"representative": "claim_199.json",
"distinct_paths": 18,
"structural_signature": "c9d0e1f2...",
"missing_paths": [
"$.claims[*].subscriber.group_number",
"$.claims[*].subscriber.member_id",
"$.claims[*].provider.taxonomy"
],
"members": ["claim_199.json", "claim_201.json", "..."]
}
],
"similarity_matrix": [
[1.0, 0.852, 0.783],
[0.852, 1.0, 0.667],
[0.783, 0.667, 1.0]
]
}
Interpreting the Results
Large dominant cluster + small outlier clusters is the most common pattern. It means most documents share a structural template, and the outliers represent schema variants, incomplete records, or data from a different source.
Many clusters of similar size suggests multiple payload families — perhaps different message types, different API versions, or different upstream sources mixed in a single directory.
High similarity between clusters (> 0.8) means the clusters differ by only a few fields. This often indicates optional fields that are sometimes present and sometimes absent.
Low similarity between clusters (< 0.5) means fundamentally different structural families. These probably should not be processed by the same pipeline.
When to Use It
- Batch triage. Before analyzing 10,000 claims, cluster them to understand how many structural families you are dealing with.
- Schema variant discovery. A vendor says they send one format. Clustering reveals three.
- Outlier isolation. The smallest cluster often contains the documents with missing fields or unusual structure — the ones that need manual review.
- Pipeline routing. Different structural families may need different processing logic. Clustering reveals the routing keys.
Pairs Well With
- `fingerprint`: clustering uses MinHash signatures from the fingerprinting layer
- `drift`: compare cluster representatives to understand how the families differ
- `anomalies`: documents in small outlier clusters are strong anomaly candidates
- `batch`: batch analysis with clustering to segment results by structural family
invariants
invariants discovers cross-field relationships from observed data. It finds fields that predict other fields, fields that always co-occur, and fields that are functionally dependent — all without prior knowledge of the schema.
This is data archaeology. Vajra examines the statistical co-occurrence of fields and extracts the latent rules that the data obeys.
Usage
vajra invariants <input> [flags]
Arguments:
| Argument | Description |
|---|---|
| `<input>` | Path to a JSON file, NDJSON batch, `-` for stdin, or directory |
Flags:
| Flag | Description | Default |
|---|---|---|
| `--top-k <N>` | Maximum number of field pairs to consider | 50 |
| `--format <fmt>` | Output format: text, json, markdown, compact-ai | text |
| `--input-format <fmt>` | Override auto-detected input format | auto |
| `--redact` | Apply built-in redaction before output | off |
| `--quiet` | Suppress progress output | off |
The Mathematics
Conditional Entropy
For field pairs (X, Y):
H(Y|X) = -sum p(x,y) * log2(p(y|x))
Low H(Y|X) means X strongly predicts Y. If H(Y|X) approaches 0, Y is functionally determined by X — knowing X tells you Y with near-certainty.
Pointwise Mutual Information (PMI)
PMI(x, y) = log2(P(x, y) / (P(x) * P(y)))
Positive PMI means x and y co-occur more than chance predicts. Negative PMI means they avoid each other. Zero means independence.
PMI is a standard information-theoretic measure of association strength.
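The PMI formula can be illustrated with plain counts. This is a sketch with hypothetical counts and a hypothetical function name; note that PMI is undefined when the two events never co-occur, because the logarithm of zero does not exist.

```python
import math

def pmi(n_xy, n_x, n_y, n_total):
    """PMI in bits from co-occurrence counts (requires n_xy > 0)."""
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical: 100 records, x in 20, y in 25, both together in 20
strong = pmi(20, 20, 25, 100)   # co-occur 4x more than chance, about 2 bits
# Hypothetical: both together in 5, exactly what independence predicts
neutral = pmi(5, 20, 25, 100)   # about 0 bits

print(round(strong, 2), round(neutral, 2))
```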
Discovery Procedure
- Screen: consider only paths with observation count > 30 (configurable). This filters noise.
- Compute: for all pairs among the top-k most frequent paths, calculate conditional entropy and PMI.
- Rank: ascending H(Y|X) for dependency strength, descending |PMI| for association strength.
- Report: the strongest relationships with examples from the data.
With k = 50, this is 1,225 unordered pairs (2,450 ordered directions for H(Y|X), which is asymmetric). That is trivial even on large datasets. Unlike general association rule mining (which explores an exponential itemset space), this approach is bounded by design.
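The screen and pair-enumeration steps can be sketched as follows. The function name, the parameter names, and the tie-breaking rule are assumptions for illustration; only the threshold of 30 and the top-k of 50 come from the text. Fifty paths yield 1,225 unordered pairs, which matches the "Field pairs screened" figure in the example output that follows.

```python
from itertools import combinations

def candidate_pairs(path_counts, min_count=30, top_k=50):
    """Screen paths by observation count, keep the top_k most frequent,
    and enumerate the unordered pairs among them."""
    screened = {p: c for p, c in path_counts.items() if c > min_count}
    # Descending count, then path name, so ties resolve deterministically
    top = sorted(screened, key=lambda p: (-screened[p], p))[:top_k]
    return list(combinations(top, 2))

# Hypothetical path counts: 60 well-observed paths and one rare path
counts = {f"$.field_{i}": 100 + i for i in range(60)}
counts["$.rare"] = 5   # dropped by the screen

pairs = candidate_pairs(counts)
print(len(pairs))  # 1225
```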
Example: Text Output
vajra invariants claims_batch.ndjson
=== Cross-Field Invariants ===
Records analyzed: 1,247
Field pairs screened: 1,225 (top 50 paths)
--- Functional Dependencies (H(Y|X) < 0.1) ---
$.claims[*].subscriber.id -> $.claims[*].subscriber.name
H(name|id) = 0.00
subscriber.id fully determines subscriber.name
Example: id "SUB-4421" -> name "Martinez, Elena" (47 records)
$.claims[*].provider.npi -> $.claims[*].provider.name
H(name|npi) = 0.03
provider.npi nearly determines provider.name (3 exceptions in 1,247)
Example: npi "1234567890" -> name "Valley Medical Group" (312 records)
--- Strong Co-occurrence (PMI > 2.0) ---
$.claims[*].status = "denied" <-> $.claims[*].denial_reason present
PMI = 3.8
When status is "denied", denial_reason is present 97% of the time.
When status is not "denied", denial_reason is present 2% of the time.
$.claims[*].service_lines[*].procedure_code <-> $.claims[*].service_lines[*].service_date
PMI = 3.2
These fields co-occur in 99.8% of service lines. Effectively always together.
--- Conditional Presence ---
$.claims[*].service_lines[*].modifier_codes
Present in 100% of records where procedure_code starts with "9921"
Present in 12% of records where procedure_code starts with "9939"
Modifier presence is conditionally dependent on procedure type.
--- Anti-Correlation (PMI < -1.0) ---
$.claims[*].status = "adjudicated" <-> $.claims[*].hold_reason present
PMI = -2.1
These rarely co-occur. Adjudicated claims almost never have hold reasons.
Example: JSON Output
vajra invariants claims_batch.ndjson --format json
{
"records_analyzed": 1247,
"pairs_screened": 1225,
"functional_dependencies": [
{
"source": "$.claims[*].subscriber.id",
"target": "$.claims[*].subscriber.name",
"conditional_entropy": 0.0,
"strength": "exact",
"example": {
"source_value": "SUB-4421",
"target_value": "Martinez, Elena",
"count": 47
}
},
{
"source": "$.claims[*].provider.npi",
"target": "$.claims[*].provider.name",
"conditional_entropy": 0.03,
"strength": "near_exact",
"exceptions": 3,
"example": {
"source_value": "1234567890",
"target_value": "Valley Medical Group",
"count": 312
}
}
],
"co_occurrences": [
{
"field_a": "$.claims[*].status",
"value_a": "denied",
"field_b": "$.claims[*].denial_reason",
"pmi": 3.8,
"conditional_presence": 0.97
}
],
"anti_correlations": [
{
"field_a": "$.claims[*].status",
"value_a": "adjudicated",
"field_b": "$.claims[*].hold_reason",
"pmi": -2.1
}
]
}
What Invariants Reveal
Functional dependencies are the strongest signal. When subscriber.id fully determines subscriber.name, that is not an accident — it reflects a real-world constraint. If that constraint breaks (a subscriber ID mapping to two different names), you have a data quality issue.
Co-occurrence patterns reveal implicit business rules. “When status is denied, denial_reason is present” is a rule that lives in the data, not in a schema. Vajra discovers it empirically.
Anti-correlations reveal mutual exclusions. Fields that never co-occur often represent different branches of a state machine — knowing which branch you are on determines which fields exist.
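Conditional presence reveals fields whose existence depends on the value of another field. Schemas often leave these rules unstated: JSON Schema can express “this field exists only when that field equals X” with conditional keywords such as if/then, but many real-world schemas never do. Vajra recovers the rule from the data itself.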
When to Use It
- Schema documentation. Discover the implicit rules that the data already obeys. Document them before they are lost.
- Data quality rules. Turn discovered invariants into validation rules. If `subscriber.id` always determines `subscriber.name`, alert when it does not.
- Onboarding. New to a dataset? `invariants` shows you the relationships between fields faster than reading documentation (which may not exist).
- Audit evidence. Demonstrate that field dependencies are consistent across a batch.
Pairs Well With
- `stats`: invariants build on per-field statistics (entropy, frequency, null rates)
- `anomalies`: broken invariants (a dependency that holds 99% of the time but not in record 662) are anomalies
- `essence`: discovered relationships appear in the essence as notable observations
- `drift`: if an invariant holds in the baseline but breaks in the candidate, that is a significant drift signal
query
query runs path-based expressions with analysis functions against a document. It lets you ask specific questions — what is the entropy at this path? Which values at this path are anomalous? What is the null rate for this field?
Where other commands analyze everything and present results, query lets you target a specific path and a specific measurement.
Usage
vajra query <input> '<expression>' [flags]
Arguments:
| Argument | Description |
|---|---|
| `<input>` | Path to a JSON file, `-` for stdin, or an HTTP URL |
| `<expression>` | Query expression (path filter or analysis function) |
Flags:
| Flag | Description | Default |
|---|---|---|
| `--format <fmt>` | Output format: text, json, markdown, compact-ai | text |
| `--input-format <fmt>` | Override auto-detected input format | auto |
| `--redact` | Apply built-in redaction before output | off |
| `--quiet` | Suppress progress output | off |
Expression Language
Vajra defines its own expression language inspired by JSONPath with analysis extensions. This is not JSONata; it is a purpose-built query system for structural analysis.
Path Filtering
Select values at a specific path:
vajra query claim.json '$.claims[*].service_lines[*].charge_amount'
Path: $.claims[*].service_lines[*].charge_amount
Values (14):
125.00, 285.00, 45.00, 890.00, 310.00, 425.00, 285.00,
1250.00, 175.00, 520.00, 95.00, 680.00, 340.00, 410.00
Analysis Functions
Apply analysis functions to a path:
vajra query claim.json 'entropy($.claims[*].service_lines[*].status)'
entropy($.claims[*].service_lines[*].status)
Shannon entropy: 1.22 bits
Normalized entropy: 0.77
Cardinality: 3
Interpretation: enum-like, few distinct states
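A minimal sketch of the two entropy figures, assuming normalized entropy is Shannon entropy divided by log2 of the cardinality. That assumption is consistent with the output above (1.22 / log2(3) is roughly 0.77), but the function name and sample values are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(values):
    """Return (entropy in bits, normalized entropy, cardinality)."""
    counts = Counter(values)
    n = len(values)
    # log2(n / c) equals -log2(c / n) and keeps every term non-negative
    h = sum((c / n) * math.log2(n / c) for c in counts.values())
    k = len(counts)
    # Assumption: normalize by the maximum entropy for k distinct values
    normalized = h / math.log2(k) if k > 1 else 0.0
    return h, normalized, k

h, norm, k = shannon_entropy(["paid", "paid", "denied", "pending"])
print(round(h, 2), round(norm, 2), k)  # 1.5 0.95 3
```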
Available Functions
| Function | Returns | Description |
|---|---|---|
| `entropy(path)` | Shannon entropy and normalized entropy | Information content at this path |
| `rarity(path, value)` | Self-information in bits | How rare a specific value is at this path |
| `instability(path)` | Type instability ratio | Fraction of values deviating from dominant type |
| `null_rate(path)` | Null and absent rates | Missingness profile at this path |
| `stats(path)` | Full statistical summary | Entropy, frequency, numeric distribution |
| `anomaly_score(path)` | Composite anomaly score | Maximum anomaly strength across dimensions |
| `motif(path)` | Dominant motif description | Repeated structural pattern at an array path |
Conditional Expressions
Filter by analysis thresholds:
vajra query claim.json 'entropy($.claims[*].service_lines[*].status) > 0.5'
entropy($.claims[*].service_lines[*].status) = 1.22
Condition: > 0.5
Result: TRUE
vajra query claim.json 'anomaly_score($.claims[*].service_lines[*].charge_amount) > 3.5'
anomaly_score($.claims[*].service_lines[*].charge_amount)
Max z_MAD across values: 6.3 (at value 47,250.00)
Condition: > 3.5
Result: TRUE
Flagged values:
47,250.00 (z_MAD = 6.3)
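For readers unfamiliar with MAD-based scoring, here is a sketch of the commonly used modified z-score, 0.6745 * |x - median| / MAD, with the conventional 3.5 cutoff. This section does not spell out Vajra's exact scaling, so treat the constant, the function name, and the sample values as assumptions rather than a reproduction of the output above.

```python
import statistics

def mad_scores(values):
    """Modified z-scores based on the median absolute deviation (MAD)."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # Degenerate case: more than half the values are identical.
        # This sketch simply reports no outliers.
        return [0.0 for _ in values]
    return [0.6745 * abs(v - med) / mad for v in values]

# Hypothetical charges with one extreme value
values = [100, 110, 120, 130, 1000]
flagged = [v for v, z in zip(values, mad_scores(values)) if z > 3.5]
print(flagged)  # [1000]
```

Because the median and MAD are barely moved by the extreme value itself, the outlier cannot mask its own presence the way it can with a mean and standard deviation.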
Example: Text Output
vajra query claim.json 'stats($.claims[*].service_lines[*].charge_amount)'
stats($.claims[*].service_lines[*].charge_amount)
Count: 14
Cardinality: 12
Entropy: 3.41 bits (normalized: 0.88)
Type: number (100%)
Min: 45.00
Max: 1250.00
Mean: 312.50
Median: 285.00
MAD: 195.00
p25: 125.00
p75: 425.00
p95: 890.00
p99: 1125.00
Example: JSON Output
vajra query claim.json 'entropy($.claims[*].status)' --format json
{
"function": "entropy",
"path": "$.claims[*].status",
"result": {
"shannon_entropy": 1.22,
"normalized_entropy": 0.77,
"cardinality": 3,
"support": ["adjudicated", "pending", "denied"]
}
}
Example: Rarity Check
vajra query claims_batch.ndjson 'rarity($.claims[*].status, "voided")'
rarity($.claims[*].status, "voided")
Self-information: 10.3 bits
Frequency: 1 of 1,247
Interpretation: extremely rare (> 10 bits)
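The self-information figure follows directly from the frequency. A one-function sketch (the function name is hypothetical):

```python
import math

def self_information(count, total):
    """Self-information -log2(p) in bits for a value seen `count` times out of `total`."""
    return -math.log2(count / total)

bits = self_information(1, 1247)
print(round(bits, 1))  # 10.3
```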
Example: Null Rate Investigation
vajra query claim.json 'null_rate($.claims[*].service_lines[*].allowed_amount)'
null_rate($.claims[*].service_lines[*].allowed_amount)
Null rate: 0.000 (0 of 14 are JSON null)
Absent rate: 0.214 (3 of 14 parent records lack this field)
Empty rate: 0.000
Total missingness: 0.214
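The three kinds of missingness are easy to conflate, so here is a small sketch that separates them for a list of parent records. The definition of "empty" (empty string, list, or object) and the function name are assumptions.

```python
def missingness(records, field):
    """Null, absent, and empty rates for `field` across parent records."""
    n = len(records)
    null = sum(1 for r in records if field in r and r[field] is None)
    absent = sum(1 for r in records if field not in r)
    # Assumption: "empty" means an empty string, list, or object
    empty = sum(1 for r in records if field in r and r[field] in ("", [], {}))
    return {
        "null": null / n,
        "absent": absent / n,
        "empty": empty / n,
        "total": (null + absent + empty) / n,
    }

# Hypothetical service lines: 11 with the field, 3 without it
lines = [{"allowed_amount": 100.0}] * 11 + [{}] * 3
rates = missingness(lines, "allowed_amount")
print({k: round(v, 3) for k, v in rates.items()})
```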
When to Use It
- Targeted investigation. You saw an anomaly in the essence. Now drill into the specific path.
- Threshold checks in CI. `vajra query data.json 'instability($.status) > 0.01'`: fail the build if type instability exceeds tolerance.
- Statistical spot-checks. What is the entropy of this field? What is the null rate? How rare is this value?
- Script integration. The `--format json` output is machine-readable. Parse it in your pipeline.
Pairs Well With
- `stats`: `query` targets a single path; `stats` gives you everything
- `anomalies`: `query` lets you drill into a specific anomaly with `anomaly_score(path)`
- `essence`: the AI profile’s `drill` section tells downstream models which paths to `query`
- `inspect`: `inspect` reveals the paths; `query` interrogates them
batch
batch runs parallel analysis across all supported files in a directory. It produces aggregated statistics, per-file summaries, and batch-level observations, processing hundreds or thousands of files in seconds via Rayon-based parallelism.
Where single-document commands analyze one file, batch analyzes the population.
Usage
vajra batch <directory> [flags]
Arguments:
| Argument | Description |
|---|---|
| `<directory>` | Path to a directory containing JSON files |
Flags:
| Flag | Description | Default |
|---|---|---|
| `--format <fmt>` | Output format: text, json, markdown, compact-ai | text |
| `--profile <name>` | Concern profile for essence generation | engineer |
| `--input-format <fmt>` | Override auto-detected input format | auto |
| `--streaming` | Force streaming mode for each file | off |
| `--redact` | Apply built-in redaction before output | off |
| `--quiet` | Suppress progress output | off |
What It Does
- Discovers files. Scans the directory for all supported files (`.json`, `.yaml`, `.csv`, `.ndjson`, etc.).
- Parallel analysis. Each file is analyzed independently using Rayon’s work-stealing thread pool. On an 8-core machine, 8 files are analyzed simultaneously.
- Per-file statistics. For each file: node count, path count, depth, fingerprint, anomaly count.
- Aggregated statistics. Across the entire batch: merged frequency distributions, merged DDSketch quantiles, population-level entropy, cross-file type stability.
- Batch-level observations. Structural families (via clustering), population anomalies, files that deviate from the batch norm.
Example: Text Output
vajra batch ./claims/
=== Batch Analysis ===
Directory: ./claims/
Files processed: 247
Total nodes: 208,729
Processing time: 1.4s (148,378 nodes/s)
=== Per-File Summary ===
FILE NODES PATHS DEPTH ANOMALIES FINGERPRINT
claim_001.json 847 23 6 0 a1b2c3d4...
claim_002.json 891 23 6 0 a1b2c3d4...
claim_003.json 723 23 6 1 a1b2c3d4...
claim_048.json 1102 27 7 0 e5f6a7b8...
claim_199.json 412 18 5 3 c9d0e1f2...
... (242 more files)
=== Structural Families ===
Family 1: 198 files (80.2%) — 23 paths, signature a1b2c3d4...
Family 2: 41 files (16.6%) — 27 paths, signature e5f6a7b8...
Family 3: 8 files ( 3.2%) — 18 paths, signature c9d0e1f2...
=== Aggregated Statistics ===
$.claims[*].service_lines[*].charge_amount
Population median: 285.00
Population MAD: 195.00
Population p95: 1,420.00
Cross-file consistency: high (coefficient of variation = 0.12)
$.claims[*].service_lines[*].status
Population entropy: 1.45 bits
Dominant value: "adjudicated" (72.3%)
Cardinality: 5 values across batch
=== Batch-Level Anomalies ===
claim_199.json: structural outlier (Jaccard distance 0.31 from dominant family)
claim_201.json: structural outlier (Jaccard distance 0.28 from dominant family)
claim_834.json: contains numeric outlier (charge_amount = 47,250.00, z_MAD = 6.3)
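The Jaccard distance reported for structural outliers can be understood as a comparison of path sets. The sketch below computes the exact value; it does not model any MinHash estimation, and the path names are hypothetical.

```python
def jaccard_distance(paths_a, paths_b):
    """1 - |A intersect B| / |A union B| over two collections of paths."""
    a, b = set(paths_a), set(paths_b)
    if not a and not b:
        return 0.0  # two empty documents are treated as identical
    return 1.0 - len(a & b) / len(a | b)

# Hypothetical path sets: the outlier shares 3 of 4 paths and adds 1 new one
dominant = {"$.id", "$.status", "$.amount", "$.provider.npi"}
outlier = {"$.id", "$.status", "$.amount", "$.hold_reason"}
print(round(jaccard_distance(dominant, outlier), 2))  # 0.4
```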
Example: JSON Output
vajra batch ./claims/ --format json
{
"directory": "./claims/",
"files_processed": 247,
"total_nodes": 208729,
"processing_time_ms": 1400,
"per_file": [
{
"file": "claim_001.json",
"nodes": 847,
"paths": 23,
"depth": 6,
"anomaly_count": 0,
"fingerprint": "a1b2c3d4..."
}
],
"structural_families": [
{
"id": 0,
"count": 198,
"percentage": 80.2,
"distinct_paths": 23,
"signature": "a1b2c3d4..."
},
{
"id": 1,
"count": 41,
"percentage": 16.6,
"distinct_paths": 27,
"signature": "e5f6a7b8..."
},
{
"id": 2,
"count": 8,
"percentage": 3.2,
"distinct_paths": 18,
"signature": "c9d0e1f2..."
}
],
"aggregated_stats": {
"$.claims[*].service_lines[*].charge_amount": {
"population_median": 285.0,
"population_mad": 195.0,
"population_p95": 1420.0
}
},
"batch_anomalies": [
{
"file": "claim_199.json",
"type": "structural_outlier",
"jaccard_distance": 0.31
},
{
"file": "claim_834.json",
"type": "numeric_outlier",
"path": "$.claims[*].service_lines[*].charge_amount",
"value": 47250.0,
"z_mad": 6.3
}
]
}
Parallelism and Performance
Batch uses Rayon’s work-stealing thread pool. The number of threads defaults to the number of CPU cores.
Performance targets:
| Batch Size | Target |
|---|---|
| 100 files, ~1 MB each | < 5 seconds |
| 1,000 files, ~1 MB each | < 30 seconds |
| 10,000 files, ~100 KB each | < 30 seconds |
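DDSketch instances are computed per-file and merged globally. Merging adds no accuracy loss: the merged sketch carries the same relative-error guarantee as a single sketch built over the whole population, so quantile estimates from a parallel batch are as reliable as those from a sequential pass.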
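The merge property is easiest to see with frequency counts, which combine by simple addition. The sketch below uses Python threads as a stand-in for Rayon and a `Counter` as a stand-in for a per-file summary; it does not implement DDSketch, and all names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def file_histogram(values):
    """Per-file frequency distribution (stands in for one file's analysis)."""
    return Counter(values)

def merged_histogram(files):
    """Analyze files independently, then merge the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(file_histogram, files))
    total = Counter()
    for part in partials:
        total += part  # addition is associative, so merge order does not matter
    return total

# Hypothetical status values from three files
files = [["adjudicated", "denied"], ["adjudicated"], ["pending", "adjudicated"]]
merged = merged_histogram(files)
single_pass = Counter(v for f in files for v in f)
print(merged == single_pass)  # True
```

Any summary with an associative, order-independent merge can be parallelized this way without changing the result.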
When to Use It
- Daily batch monitoring. Run `batch` on each day’s incoming data. Track structural families, anomaly counts, and distribution shifts over time.
- Pre-processing audit. Before feeding a batch to a downstream system, run `batch` to verify structural consistency and flag outliers.
- Population baselines. Establish population-level statistics (median charge amount, expected null rates, typical structural signature) that individual-file analysis can compare against.
- Quick directory survey. “What is in this folder?” `batch` answers in seconds.
Pairs Well With
- `cluster`: batch includes lightweight clustering; `cluster` provides detailed similarity analysis
- `anomalies`: batch flags files with anomalies; drill into specific files for details
- `drift`: compare today’s batch aggregates to yesterday’s for population-level drift
- `essence`: run essence on specific files that batch identified as notable
cascade
cascade detects temporal cause-effect chains in event data. Given a stream of timestamped events grouped by entity, it identifies sequences where one event type triggers another — and measures how reliably that pattern holds.
Where anomalies finds single-record outliers, cascade finds multi-record temporal patterns: event A happens to entity X, then event B follows within a window.
Usage
vajra cascade <input> [flags]
Arguments:
| Argument | Description |
|---|---|
| `<input>` | Path to a JSON/NDJSON file, `-` for stdin, or an HTTP URL |
Flags:
| Flag | Description | Default |
|---|---|---|
| `--entity-field <path>` | JSONPath to the entity identifier (e.g., `'$.author'`) | required |
| `--time-field <path>` | JSONPath to the timestamp field (e.g., `'$.date'`) | required |
| `--event-field <path>` | JSONPath to the event type field (e.g., `'$.type'`) | required |
| `--response-values <vals>` | Comma-separated list of event values that count as responses (e.g., `fix,revert`) | required |
| `--format <fmt>` | Output format: text, json, markdown, compact-ai | text |
| `--input-format <fmt>` | Override auto-detected input format | auto |
| `--quiet` | Suppress progress output | off |
What It Reports
Cascade Rate
The fraction of trigger events that are followed by a response event within the detection window. A high cascade rate means the cause-effect pattern is reliable.
Self-Fix Rate
The fraction of cascades where the same entity that caused the trigger also produced the response. Measures whether entities clean up their own problems.
Hot Entities
Entities that appear disproportionately in cascade chains. These are the nexus points — the authors, services, or components that most frequently participate in cause-and-effect sequences.
Cascade Chains
The full chain detail: trigger event, response event, entity, timestamps, and time delta between cause and effect.
Algorithm
O(n log n). Records are grouped by entity using a BTreeMap (ordered map), sorted by timestamp within each group, then scanned linearly to detect trigger-response pairs. The BTreeMap ensures deterministic iteration order regardless of input ordering.
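The group, sort, and scan steps can be sketched as below. This follows the paragraph above literally, so pairs are detected within each entity's own timeline. The `trigger_values` and `window` parameters are hypothetical, since this section does not say how triggers or the detection window are configured, and timestamps are plain numbers for simplicity.

```python
from collections import defaultdict

def find_cascades(events, trigger_values, response_values, window):
    """events: iterable of (entity, timestamp, event_type) tuples.
    Returns a list of (entity, trigger, response, delta) tuples."""
    by_entity = defaultdict(list)
    for entity, ts, kind in events:
        by_entity[entity].append((ts, kind))

    chains = []
    for entity in sorted(by_entity):           # sorted keys mimic an ordered map
        timeline = sorted(by_entity[entity])   # order by timestamp within the group
        pending = None                         # most recent unanswered trigger
        for ts, kind in timeline:
            if kind in trigger_values:
                pending = (ts, kind)
            elif kind in response_values and pending is not None:
                delta = ts - pending[0]
                if delta <= window:
                    chains.append((entity, pending[1], kind, delta))
                pending = None
    return chains

# Hypothetical events with day-number timestamps, deliberately unordered
events = [
    ("bob", 5, "bug"), ("alice", 1, "bug"), ("alice", 3, "fix"),
    ("bob", 30, "fix"), ("alice", 2, "feature"),
]
chains = find_cascades(events, {"bug", "regression"}, {"fix", "revert"}, window=7)
print(chains)  # [('alice', 'bug', 'fix', 2)]
```

Sorting dominates the cost, which gives the O(n log n) bound; the scan itself is linear.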
Example: Commit Cascade Analysis
vajra cascade commits.ndjson \
--entity-field '$.author' \
--time-field '$.date' \
--event-field '$.type' \
--response-values 'fix,revert'
=== Cascade Report ===
Records: 1,247
Entities: 34
Trigger events: 312
Response events: 89
Cascade rate: 0.285 (89 of 312 triggers followed by a response)
Self-fix rate: 0.742 (66 of 89 responses by the same entity)
Hot entities:
alice 23 cascades (25.8%)
bob 14 cascades (15.7%)
charlie 9 cascades (10.1%)
Cascade chains (top 5 by frequency):
bug -> fix 62 occurrences, median delta: 2.3 days
bug -> revert 18 occurrences, median delta: 0.4 days
regression -> fix 9 occurrences, median delta: 4.1 days
Example: JSON Output
vajra cascade commits.ndjson \
--entity-field '$.author' \
--time-field '$.date' \
--event-field '$.type' \
--response-values 'fix,revert' \
--format json
{
"records": 1247,
"entities": 34,
"trigger_events": 312,
"response_events": 89,
"cascade_rate": 0.285,
"self_fix_rate": 0.742,
"hot_entities": [
{"entity": "alice", "cascades": 23, "fraction": 0.258},
{"entity": "bob", "cascades": 14, "fraction": 0.157},
{"entity": "charlie", "cascades": 9, "fraction": 0.101}
],
"chains": [
{"trigger": "bug", "response": "fix", "count": 62, "median_delta_days": 2.3},
{"trigger": "bug", "response": "revert", "count": 18, "median_delta_days": 0.4},
{"trigger": "regression", "response": "fix", "count": 9, "median_delta_days": 4.1}
]
}
When to Use It
- Incident response analysis. Which errors lead to fixes, and how quickly? Which lead to reverts?
- Developer workflow. Who introduces bugs and who fixes them? Is there a self-fix pattern?
- Service dependency. Event A in service X triggers event B in service Y — cascade reveals the coupling.
- Repository health. Measure how reliably bugs get resolved and how long the resolution takes.
Pairs Well With
- `stats`: statistical profile of the event fields before cascade analysis
- `anomalies`: unusual cascade chains (an entity that never self-fixes) are anomaly candidates
- `invariants`: cascade patterns are temporal invariants; invariants discovers structural ones
- `essence`: cascade metrics feed into essence generation for project health assessments
Profiles
Profiles are the lens. They do not change what Vajra analyzes — they change how results are scored, ranked, and rendered.
The same document analyzed with --profile staff and --profile engineer produces the same underlying statistics. The difference is which observations surface, what language describes them, and what gets collapsed as noise.
The Scoring Model
Every observation in the analysis pipeline receives a composite importance score:
score = sum(weight_i * signal_i)
Six signal dimensions, each normalized to [0, 1]:
| Dimension | What It Measures |
|---|---|
| `rarity` | Self-information of the observation. Rare things score high. |
| `instability` | Type instability at the path. Mixed types score high. |
| `entropy_signal` | Distance from 0.5 normalized entropy. Constants and noise both score high. Meaningful variation scores low. |
| `structural_coverage` | Fraction of total nodes under this path. Wide-reaching paths score high. |
| `anomaly_strength` | Maximum anomaly score across all four dimensions. |
| `concern_relevance` | Profile-specific boost for certain paths or observation types. |
The profile defines the weights. The weights determine what rises to the top.
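To make the weighting concrete, the sketch below scores one hypothetical observation under two weight sets. The weights are copied from the staff and fraud tables in this section; the signal values and function name are invented for illustration.

```python
STAFF = {"rarity": 0.10, "instability": 0.05, "entropy_signal": 0.10,
         "structural_coverage": 0.25, "anomaly_strength": 0.30,
         "concern_relevance": 0.20}
FRAUD = {"rarity": 0.25, "instability": 0.10, "entropy_signal": 0.10,
         "structural_coverage": 0.05, "anomaly_strength": 0.35,
         "concern_relevance": 0.15}

def composite_score(weights, signals):
    """score = sum(weight_i * signal_i); missing signals count as 0."""
    return sum(w * signals.get(name, 0.0) for name, w in weights.items())

# Hypothetical observation: a rare, strongly anomalous value on a narrow path
signals = {"rarity": 0.9, "instability": 0.0, "entropy_signal": 0.2,
           "structural_coverage": 0.1, "anomaly_strength": 0.8,
           "concern_relevance": 0.5}

staff_score = composite_score(STAFF, signals)
fraud_score = composite_score(FRAUD, signals)
print(round(staff_score, 3), round(fraud_score, 3))  # 0.475 0.605
```

The same observation ranks higher under the fraud weights because rarity and anomaly strength carry more of the total there.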
Built-in Profiles
staff
For: Non-technical operations staff who need “what is this and what stands out.”
| Dimension | Weight |
|---|---|
| rarity | 0.10 |
| instability | 0.05 |
| entropy_signal | 0.10 |
| structural_coverage | 0.25 |
| anomaly_strength | 0.30 |
| concern_relevance | 0.20 |
Rendering: Plain language. No JSONPath. No technical jargon. Anomalies described in terms of business impact. Structural boilerplate hidden.
Section headers: “Document Summary,” “What Stands Out,” “What This Likely Means.”
vajra essence claim.json --profile staff
Document Summary:
1 claim with 14 service lines, 1 patient, 2 diagnosis codes.
What Stands Out:
- 3 service lines are missing allowed amounts.
- Adjustment reason "CO-45" repeats across 8 of 14 lines.
What This Likely Means:
- A subset of service lines appears incomplete.
- The repeated adjustment code suggests a systematic issue.
engineer
For: Engineers who need schema details, structural analysis, and regression signals.
| Dimension | Weight |
|---|---|
| rarity | 0.15 |
| instability | 0.25 |
| entropy_signal | 0.15 |
| structural_coverage | 0.15 |
| anomaly_strength | 0.15 |
| concern_relevance | 0.15 |
Rendering: Technical. JSONPath paths, type annotations, cardinalities. Diff-style output for drift. Fingerprints displayed.
vajra essence claim.json --profile engineer
Structure: 847 nodes, 23 distinct paths, max depth 6
Fingerprint (path set): a1b2c3d4...
Notable paths:
$.claims[*].service_lines[*].allowed_amount
null_rate: 0.214, entropy: 3.12, type: number (100%)
$.claims[*].service_lines[*].adjustment.reason
entropy: 1.56, cardinality: 4, dominant: "CO-45" (57.1%)
auditor
For: Auditors and compliance teams who need completeness, traceability, and consistency evidence.
| Dimension | Weight |
|---|---|
| rarity | 0.10 |
| instability | 0.20 |
| entropy_signal | 0.10 |
| structural_coverage | 0.10 |
| anomaly_strength | 0.20 |
| concern_relevance | 0.30 |
Rendering: Formal vocabulary. Missing fields listed with full paths. Type inconsistencies documented with examples. Drift metrics with severity scores.
Concern relevance boosts: completeness, traceability, required-field absence.
vajra essence claim.json --profile auditor --format markdown
## Audit Essence
### Completeness Assessment
- **21.4%** of service lines are missing `allowed_amount`
(3 of 14 service line records; field path: `$.claims[*].service_lines[*].allowed_amount`)
- Provider `taxonomy` field: absent
(expected presence rate in comparable data: 94%)
### Type Consistency
- All paths exhibit 100% type stability. No type inconsistencies detected.
### Pattern Observations
- Adjustment reason code `CO-45` appears in 57.1% of service lines (8 of 14).
This concentration exceeds typical variance for this field.
ai
For: Downstream LLM consumption. Maximum information density per token.
| Dimension | Weight |
|---|---|
| rarity | 0.15 |
| instability | 0.10 |
| entropy_signal | 0.20 |
| structural_coverage | 0.20 |
| anomaly_strength | 0.20 |
| concern_relevance | 0.15 |
Rendering: Compact, machine-readable. Motifs collapsed aggressively. Repeated structures represented once with count. Explicit caveats on inferences.
vajra essence claim.json --profile ai --format compact-ai --budget 300
{"v":"vajra/1","n":847,"p":23,"d":6,"motif":{"p":"$.claims[0].service_lines[*]","c":14,"f":["procedure_code","charge_amount","allowed_amount","status","adjustment"]},"a":[{"p":"$.claims[0].service_lines[2,7,11].allowed_amount","t":"miss","s":4.2}],"drill":[{"p":"$.claims[*].service_lines","avail":["stats","anomalies"]}]}
fraud
For: Fraud and risk analysts who need suspicious patterns, outliers, and unusual combinations.
| Dimension | Weight |
|---|---|
| rarity | 0.25 |
| instability | 0.10 |
| entropy_signal | 0.10 |
| structural_coverage | 0.05 |
| anomaly_strength | 0.35 |
| concern_relevance | 0.15 |
Rendering: Investigative framing. Outliers with full context. Benford’s Law departures. Suspicious value repetition. Unusual co-occurrence patterns.
Concern relevance boosts: numeric anomalies, identifier patterns, timing irregularities.
vajra essence claims_batch.ndjson --profile fraud
=== Fraud Screening Essence ===
Flagged Patterns:
- charge_amount outlier: $47,250.00 in record 834
(z_MAD = 6.3, population median = $285.00)
This value is 165x the median. Review recommended.
- Status value "voided" in record 419
(seen once in 1,247 records, self-information = 10.3 bits)
Extremely rare status. May warrant investigation.
- Benford's Law departure for charge_amount leading digits
Chi-squared: 14.2 (p = 0.028)
Observed leading digit "1": 18% (expected: 30%)
Observed leading digit "5": 22% (expected: 8%)
Suggestive of non-natural distribution.
- Identical charge_amount ($285.00) in 47 records from same provider
Exact-value concentration: 3.8% of population
Pattern is unusual for this field's typical variance.
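The Benford expectations quoted in the sample output (about 30% for a leading 1, about 8% for a leading 5) come from log10(1 + 1/d). Below is a sketch of that expectation and a plain chi-squared statistic against it. How Vajra derives its p-value is not described here, so the sketch stops at the statistic; all names are hypothetical.

```python
import math
from collections import Counter

def benford_expected(d):
    """Expected share of leading digit d (1..9) under Benford's Law."""
    return math.log10(1 + 1 / d)

def leading_digit(x):
    """First significant digit of a nonzero number, via scientific notation."""
    return int(f"{abs(x):.15e}"[0])

def benford_chi_squared(values):
    """Chi-squared statistic of observed leading digits against Benford."""
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    observed = Counter(digits)
    return sum(
        (observed.get(d, 0) - n * benford_expected(d)) ** 2 / (n * benford_expected(d))
        for d in range(1, 10)
    )

print(round(benford_expected(1), 3), round(benford_expected(5), 3))  # 0.301 0.079
print(leading_digit(47250.0), leading_digit(0.045))                  # 4 4
```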
health
For: Project and repository health assessment. Identifies risks, governance patterns, and sustainability signals.
| Dimension | Weight |
|---|---|
| entropy_signal | 0.25 |
| concern_relevance | 0.25 |
| anomaly_strength | 0.20 |
| rarity | 0.15 |
| instability | 0.10 |
| structural_coverage | 0.05 |
Rendering: Assessment-oriented. Sections organized around risk, governance, and sustainability. Designed for repository and project analysis.
Section headers: “Key Risks,” “Governance Signals,” “Sustainability Assessment.”
vajra essence ./my-repo --profile health
Key Risks:
- Bus factor: 2 contributors account for 78% of commits.
- Fix rate declining: 31% of bugs fixed in January vs 18% in March.
- Mean time to fix increasing: 2.3 days -> 4.1 days over 3 months.
Governance Signals:
- Review coverage: 64% of PRs received at least one review.
- Bot contribution: 33% of PRs from automated tools.
- Consistent commit cadence: 4.2 commits/day (low variance).
Sustainability Assessment:
- Moderate risk. High contributor concentration and declining fix rates
suggest capacity constraints. Review coverage is below recommended
thresholds for projects of this activity level.
Custom Profiles
Define custom profiles in TOML. Load with --config path/to/profiles.toml.
Full TOML Example
[profile.claims_review]
name = "claims-review"
description = "Internal review for claims processing teams"
[profile.claims_review.weights]
rarity = 0.15
instability = 0.20
entropy_signal = 0.10
structural_coverage = 0.10
anomaly_strength = 0.25
concern_relevance = 0.20
[profile.claims_review.rendering]
vocabulary = "plain" # plain | technical | formal
show_paths = false # hide JSONPath in output
show_scores = false # hide numeric scores
motif_collapse_threshold = 3 # collapse motifs repeated > N times
anomaly_threshold = 3.5 # MAD z-score threshold for flagging
[profile.claims_review.concern_boosts]
paths_containing = ["denied", "adjustment", "override", "void"]
observation_types = ["missingness", "type_instability"]
boost_factor = 1.5
Loading Custom Profiles
vajra essence claim.json --profile claims-review --config ./profiles.toml
Multiple Custom Profiles in One File
[profile.claims_review]
name = "claims-review"
description = "Internal claims processing review"
# ... weights, rendering, boosts ...
[profile.vendor_audit]
name = "vendor-audit"
description = "Vendor data feed quality assessment"
# ... weights, rendering, boosts ...
[profile.ml_preprocessing]
name = "ml-preprocessing"
description = "Data quality check before ML pipeline ingestion"
# ... weights, rendering, boosts ...
Listing Available Profiles
vajra profiles
=== Built-in Profiles ===
staff Plain vocabulary, narrative rendering; emphasizes anomalies and structural coverage
engineer Technical vocabulary, list-based rendering; balanced scoring
auditor Formal vocabulary, completeness-focused; emphasizes instability and concern relevance
ai Compact terse rendering optimized for machine consumption
fraud Investigative framing; emphasizes outliers, rarity, and suspicious patterns
health Assessment-oriented; emphasizes risks, governance, and sustainability
=== Custom Profiles ===
claims-review Internal claims processing review
vajra profiles --config ./profiles.toml --format json
[
{"name": "staff", "description": "...", "source": "built-in"},
{"name": "engineer", "description": "...", "source": "built-in"},
{"name": "auditor", "description": "...", "source": "built-in"},
{"name": "ai", "description": "...", "source": "built-in"},
{"name": "fraud", "description": "...", "source": "built-in"},
{"name": "health", "description": "...", "source": "built-in"},
{"name": "claims-review", "description": "Internal claims processing review", "source": "custom"}
]
Rendering Vocabulary
| Level | Description | Example |
|---|---|---|
| `plain` | No jargon, no paths, business-oriented language | “3 service lines are missing allowed amounts” |
| `technical` | JSONPath, type annotations, statistical measures | “$.claims[*].service_lines[2,7,11].allowed_amount: null_rate=0.21, anomaly_score=4.2” |
| `formal` | Full sentences, compliance-appropriate language | “Observations 2, 7, and 11 in the service line array exhibit absent allowed_amount fields.” |
Deterministic Tie-Breaking
When two observations have identical composite scores, ties are broken by:
- Path depth — shallower paths first (broader impact)
- Lexicographic path order — alphabetical by wildcard path
This ensures identical scores always resolve in the same order, regardless of platform or run.
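The tie-breaking rule maps naturally onto a compound sort key. In the sketch below, path depth is approximated by counting `.` separators, which is an assumption about how depth is measured; the observations and function names are hypothetical.

```python
def path_depth(path):
    """Approximate depth as the number of '.' separators in a wildcard path."""
    return path.count(".")

def rank(observations):
    """observations: list of (score, wildcard_path) tuples.
    Order: score descending, then shallower paths, then lexicographic path."""
    return sorted(observations, key=lambda o: (-o[0], path_depth(o[1]), o[1]))

obs = [
    (0.8, "$.claims[*].status"),
    (0.8, "$.meta"),
    (0.8, "$.claims[*].id"),
    (0.9, "$.a.b.c.d"),
]
ranked = rank(obs)
for score, path in ranked:
    print(score, path)
```

Because the key is a total order over (score, depth, path), the result does not depend on input order or on the stability of the sort.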
Input Formats
Vajra reads more than JSON. It reads anything that can be interpreted as structured data — and it auto-detects the format so you do not have to tell it.
Supported Formats
| Format | Extensions | Detection | Notes |
|---|---|---|---|
| JSON | .json | Content starts with { or [ | Primary format. Full DOM and streaming support. |
| NDJSON | .ndjson, .jsonl | Multiple JSON objects separated by newlines | Each line is a separate document. Batch analysis native. |
| YAML | .yaml, .yml | Content starts with --- or key-colon pattern | Multi-document YAML supported (separated by ---). |
| CSV | .csv | Comma-separated with consistent column count | First row treated as headers. Each row becomes a JSON object. |
| TSV | .tsv | Tab-separated with consistent column count | Same as CSV but tab-delimited. |
| Markdown | .md | Markdown structure with tables or code blocks | Tables extracted as arrays of objects. Code blocks parsed if JSON/YAML. |
| PDF | .pdf | PDF magic bytes | Text extracted and parsed for structured content. |
| Gzip | .gz, .json.gz | Gzip magic bytes (1f 8b) | Decompressed transparently. Inner format auto-detected. |
| Zstd | .zst, .json.zst | Zstd magic bytes | Decompressed transparently. Inner format auto-detected. |
| HTTP URL | http://, https:// | URL scheme prefix | Fetched via blocking HTTP GET. Response body auto-detected. |
| Source Code | .rs, .py, .js, .ts, .go, .java, .c, .cpp, .rb | File extension matches known language | Parsed via tree-sitter into AST. Requires vajra-source feature. |
| Git Repository | (directory) | Directory contains .git/ | Reads commit history directly. See flags below. |
| V8 CPU Profile | .cpuprofile | File extension | Parses V8 .cpuprofile JSON into analyzable structure. |
| strace Summary | — | Content contains % time header | Parses strace -c summary output into structured records. |
| Stdin | - | Explicit - argument | Content auto-detected from first bytes. |
Auto-Detection Logic
When no --input-format is specified, Vajra detects the format in this order:
- Check the argument. If it is -, read from stdin. If it starts with http:// or https://, fetch via HTTP. If it is a directory containing .git/, treat as a git repository.
- Check the extension. .json -> JSON. .ndjson/.jsonl -> NDJSON. .yaml/.yml -> YAML. .csv -> CSV. .tsv -> TSV. .md -> Markdown. .pdf -> PDF. .cpuprofile -> V8 CPU Profile. .rs/.py/.js/.go/etc. -> Source Code (via tree-sitter).
- Check for compression. If the extension is .gz or .zst, decompress and re-detect the inner format from the next extension (e.g., .json.gz -> decompress -> JSON).
- Check content. If the extension is ambiguous or missing, read the first bytes:
  - Starts with { or [ after whitespace -> JSON
  - Multiple {...}\n sequences -> NDJSON
  - Starts with --- or matches key: value pattern -> YAML
  - Consistent comma-separated columns -> CSV
  - PDF magic bytes (%PDF) -> PDF
  - Contains % time column header -> strace summary
- Fall back to JSON. If nothing else matches, attempt JSON parsing.
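The content-sniffing step can be sketched in a few lines of Python. This is an illustration only, not Vajra's detector: the function name `sniff_format` is hypothetical, and the order in which the content checks run here is an assumption made so that the magic-byte and strace checks are not shadowed by the looser YAML and CSV heuristics.

```python
import re

def sniff_format(head: str) -> str:
    """Guess a format name from the first characters of the input (hypothetical helper)."""
    text = head.lstrip()
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if text.startswith("%PDF"):
        return "pdf"
    if text.startswith(("{", "[")):
        # Several lines that each look like a complete object suggest NDJSON.
        looks_like_objects = all(
            ln.lstrip().startswith("{") and ln.rstrip().endswith("}") for ln in lines
        )
        return "ndjson" if len(lines) > 1 and looks_like_objects else "json"
    if "% time" in text:
        return "strace"
    if text.startswith("---") or re.match(r"^[\w-]+:\s", text):
        return "yaml"
    # CSV heuristic: every non-empty line has the same, non-zero comma count.
    comma_counts = {ln.count(",") for ln in lines}
    if len(lines) > 1 and len(comma_counts) == 1 and comma_counts != {0}:
        return "csv"
    return "json"  # final fallback, as in the last step of the list above

print(sniff_format('{"a": 1}\n{"a": 2}\n'))
print(sniff_format("claim_id,status\nC001,denied\n"))
```

A real detector would also need to handle quoted commas in CSV and byte-level input; the sketch works on already-decoded text to stay short.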
Format Override
Force a specific format with --input-format:
vajra inspect data.txt --input-format json
vajra stats records.log --input-format ndjson
vajra inspect data.bin --input-format yaml
This overrides all auto-detection. Useful when files have nonstandard extensions.
Format Details
JSON
The primary format. Parsed by simd-json in DOM mode (full random access, rich analysis) or streaming mode (bounded memory, SAX-style events).
vajra inspect claim.json
echo '{"patient": "Martinez", "status": "active"}' | vajra inspect -
NDJSON (Newline-Delimited JSON)
Each line is an independent JSON document. Natural format for logs, event streams, and batch data.
vajra anomalies claims.ndjson
NDJSON records are aggregated into a single array for analysis. Commands like stats, anomalies, invariants, and essence compute across all records as a unified population.
Example input:
{"claim_id": "C001", "status": "adjudicated", "amount": 285.00}
{"claim_id": "C002", "status": "denied", "amount": 0.00}
{"claim_id": "C003", "status": "adjudicated", "amount": 47250.00}
YAML
Single-document and multi-document YAML both supported. Parsed via serde_yaml and converted to Vajra’s internal document model.
vajra inspect config.yaml
Multi-document YAML (separated by ---):
---
claim_id: C001
status: adjudicated
amount: 285.00
---
claim_id: C002
status: denied
amount: 0.00
vajra anomalies multi_claims.yaml
CSV
The first row is treated as column headers. Each subsequent row becomes a JSON object with header names as keys.
vajra stats claims.csv
Example input:
claim_id,status,charge_amount,allowed_amount
C001,adjudicated,285.00,210.00
C002,denied,125.00,
C003,adjudicated,890.00,675.00
Vajra converts this to:
[
{"claim_id": "C001", "status": "adjudicated", "charge_amount": "285.00", "allowed_amount": "210.00"},
{"claim_id": "C002", "status": "denied", "charge_amount": "125.00", "allowed_amount": ""},
{"claim_id": "C003", "status": "adjudicated", "charge_amount": "890.00", "allowed_amount": "675.00"}
]
Empty cells are preserved as empty strings, allowing missingness analysis to detect them.
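The header-to-object conversion described above can be reproduced with Python's standard `csv` module. This is a sketch of the transformation, not Vajra's parser; `csv_to_records` is a hypothetical name. It also shows why keeping empty cells as empty strings matters: the empty rate of a column can be computed directly.

```python
import csv
import io

def csv_to_records(text: str) -> list[dict[str, str]]:
    """Turn CSV text into a list of objects keyed by the header row."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]

sample = """claim_id,status,charge_amount,allowed_amount
C001,adjudicated,285.00,210.00
C002,denied,125.00,
C003,adjudicated,890.00,675.00
"""

records = csv_to_records(sample)
# Values stay strings; the empty cell on the C002 row survives as "".
empty_rate = sum(1 for r in records if r["allowed_amount"] == "") / len(records)
print(records[1])
print(f"allowed_amount empty rate: {empty_rate:.2f}")
```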
TSV
Identical to CSV but tab-delimited. Same header-to-object conversion.
vajra stats data.tsv
vajra stats data.txt --input-format tsv
Markdown
Vajra extracts structured content from Markdown files:
- Tables are parsed into arrays of objects (headers become keys)
- JSON/YAML code blocks are parsed as embedded documents
vajra inspect report.md
PDF
Text is extracted from PDF files and parsed for any structured content (embedded tables, JSON fragments, structured text patterns).
vajra inspect document.pdf
PDF support depends on the pdf-extract crate. Complex layouts may lose structure during extraction.
Source Code
Vajra can analyze source code from any language supported by tree-sitter. The source file is parsed into a concrete syntax tree (CST), converted to a JSON structure, and analyzed through the full Vajra pipeline — entropy, anomalies, fingerprinting, drift, motifs, and essence all work on code.
vajra inspect main.rs # auto-detect Rust
vajra stats app.py # auto-detect Python
vajra drift v1/server.go v2/server.go # code structural drift
vajra essence lib.rs --profile engineer # code essence
vajra inspect main.rs --lang rust # explicit language
vajra inspect code.txt --input-format source --lang python # override format + language
Supported languages (each enabled by a feature flag, all on by default):
| Language | Extensions | Feature Flag |
|---|---|---|
| Rust | .rs | rust |
| Python | .py, .pyi | python |
| JavaScript | .js, .mjs, .cjs, .jsx | javascript |
| TypeScript | .ts, .tsx, .mts, .cts | typescript |
| Go | .go | go |
| Java | .java | java |
| C | .c, .h | c |
| C++ | .cpp, .cc, .cxx, .hpp | cpp |
| Ruby | .rb | ruby |
What Vajra reveals on code:
| Analysis | What It Finds |
|---|---|
| Entropy of AST node types | Structural diversity — boilerplate vs complex code |
| Rarity of node types | Unusual constructs — goto, unsafe, eval |
| Nesting depth anomalies | Complexity hotspots |
| Fingerprint comparison | Structural clones across files |
| Drift between versions | Added functions, removed classes, changed signatures |
| Motifs | Repeated structural patterns — copy-paste code |
Source code analysis requires the vajra-source crate (included by default). The companion vajra-domain-source plugin adds recognizers for naming conventions (snake_case, camelCase, PascalCase) and code structure relationships.
Semantic Paths
The --semantic-paths flag maps tree-sitter node kinds to human-readable labels in the output. Instead of raw AST node names like function_item or impl_item, you see function and implementation.
vajra inspect main.rs --semantic-paths
Without --semantic-paths:
$.program.function_item[0].identifier "process_record"
$.program.function_item[0].parameters.parameter[0] "record: &Record"
$.program.impl_item[0].identifier "Pipeline"
With --semantic-paths:
$.program.function[0].name "process_record"
$.program.function[0].parameters.param[0] "record: &Record"
$.program.implementation[0].name "Pipeline"
Covers 9 languages: Rust, Python, JavaScript, TypeScript, Go, Java, C, C++, and Ruby.
Git Repository
When the input is a directory containing a .git/ subdirectory, Vajra reads the commit history directly — no export step required.
vajra stats ./my-repo
vajra cascade ./my-repo --entity-field '$.author' --time-field '$.date' --event-field '$.type' --response-values 'fix,revert'
Each commit becomes a JSON record with fields like author, date, message, files_changed, and insertions/deletions.
Flags:
| Flag | Description | Default |
|---|---|---|
| --git-limit <N> | Maximum number of commits to read | 500 |
| --git-branch <branch> | Branch to read from | current HEAD |
vajra stats ./my-repo --git-limit 1000 --git-branch main
Auto-detection is based on the presence of .git/ in the input directory. To override, use --input-format git.
V8 CPU Profile
Vajra parses .cpuprofile files produced by V8-based tools (Chrome DevTools, Node.js --cpu-prof). The profile’s call tree is converted to a flat array of records with function name, source location, hit count, and self/total time.
vajra stats profile.cpuprofile
vajra anomalies profile.cpuprofile
Auto-detected by the .cpuprofile extension.
strace Summary
Vajra parses the summary table produced by strace -c. Each syscall row becomes a record with fields for time percentage, seconds, calls, errors, and syscall name.
strace -c ls 2>&1 | vajra stats -
vajra stats strace_output.txt --input-format strace
Auto-detected when content contains the % time column header characteristic of strace -c output.
Compressed Files (Gzip, Zstd)
Compression is transparent. Vajra decompresses on the fly and auto-detects the inner format.
vajra inspect claims.json.gz
vajra stats archive.json.zst
This works with any inner format — claims.ndjson.gz, data.yaml.zst, report.csv.gz.
HTTP URLs
Vajra fetches the URL via blocking HTTP GET and analyzes the response body.
vajra inspect https://api.example.com/v1/claims/12345
vajra stats https://data.example.com/feed.ndjson
The response content type and body are used for format detection. No authentication headers are supported in the current version — for authenticated endpoints, fetch with curl and pipe to stdin:
curl -H "Authorization: Bearer $TOKEN" https://api.example.com/data | vajra inspect -
Stdin
The - argument reads from standard input. Format is auto-detected from the content.
cat claim.json | vajra inspect -
curl https://api.example.com/data | vajra stats -
jq '.claims[]' data.json | vajra anomalies -
zcat claims.json.gz | vajra inspect -
Multi-Document Formats
NDJSON and multi-document YAML naturally contain multiple documents. NDJSON records are aggregated into a single array, so all commands, including stats, anomalies, invariants, and essence, compute across all records as a unified population.
vajra anomalies claims.ndjson # analyzes all lines as a batch
vajra stats claims.ndjson # computes stats across all records
Directory Input
When the input is a directory path, Vajra discovers all supported files:
vajra batch ./claims/ # processes all files in the directory
vajra cluster ./claims/ # clusters all files in the directory
Subdirectories are not traversed recursively by default.
The Engine
Vajra processes structured data through a six-layer pipeline. Each layer depends on the one before it. Each layer’s outputs are independently useful. The pipeline can exit early at any layer depending on the command.
The Six Layers
Raw Input
-> [1] Parse + Normalize
-> [2] Structural Analysis
-> [3] Statistical Analysis
-> [4] Semantic Lifting
-> [5] Concern-Oriented Scoring
-> [6] Deterministic Essence Rendering
Layer 1: Parse + Normalize
Responsibility: Take raw bytes and produce a traversable document model.
What happens:
- Format detection. Auto-detect or apply --input-format override. See Input Formats.
- Decompression. Gzip and Zstd payloads are decompressed transparently.
- Parsing. JSON via simd-json (DOM mode) or SAX-style streaming. YAML, CSV, TSV, Markdown, PDF converted to JSON-equivalent internal representation.
- Canonicalization. RFC 8785 (JSON Canonicalization Scheme) applied: lexicographic key ordering, deterministic number formatting, Unicode NFC normalization.
- Input hardening. Maximum nesting depth enforced (default 256). Maximum string length enforced. Malformed input produces clean errors with byte offset locations.
Output: A Document — the parsed value tree plus metadata (node count, depth, raw size, content hash).
Complexity: O(n) time. O(n) memory in DOM mode, O(1) in streaming.
Commands that stop here: None. Every command needs at least a parsed document.
Layer 2: Structural Analysis
Responsibility: Extract the structural skeleton — every path, every type, every parent-child relationship.
What happens:
- Path extraction. DFS traversal computes full JSONPath for every node. Array indices normalized to [*] for wildcard paths.
- Path trie construction. Wildcard paths stored in a trie. Each trie node holds aggregated metadata: count, type distribution, depth, parent type, sibling count.
- Fingerprinting. BLAKE3 path set hash, typed path hash, and Merkle subtree hashes computed in a single bottom-up traversal.
- Motif detection. Subtree hashes that appear more than once identify repeated structural patterns. Ranked by frequency times subtree size.
- Array morphology. Per-array cardinality distribution, type homogeneity, element uniqueness, nested shape diversity.
Output: Path trie, fingerprints, motif index, array morphology profiles.
Complexity: O(n) time, O(p) memory where p = distinct wildcard paths.
Commands that exit here: inspect, fingerprint.
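Path extraction with wildcard normalization can be illustrated with a short depth-first traversal. This is a simplified sketch: a flat dictionary stands in for the path trie, only the type distribution is tracked, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def type_name(value) -> str:
    """Map a parsed JSON value to its JSON type name."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return "array" if isinstance(value, list) else "object"

def wildcard_paths(value, path="$", out=None):
    """DFS that records the type distribution seen at every wildcard path."""
    out = defaultdict(Counter) if out is None else out
    out[path][type_name(value)] += 1
    if isinstance(value, dict):
        for key in sorted(value):  # sorted keys keep traversal order deterministic
            wildcard_paths(value[key], f"{path}.{key}", out)
    elif isinstance(value, list):
        for item in value:  # every array index collapses to [*]
            wildcard_paths(item, f"{path}[*]", out)
    return out

doc = {"claims": [{"id": "C001", "amount": 285.0}, {"id": "C002", "amount": None}]}
profile = wildcard_paths(doc)
for path in sorted(profile):
    print(path, dict(profile[path]))
```

The mixed `number`/`null` distribution at `$.claims[*].amount` is the kind of per-path metadata that later layers use for type instability and missingness.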
Layer 3: Statistical Analysis
Responsibility: Quantify the distribution of every observable quantity in the document.
What happens:
- Frequency analysis. Per-path value frequencies via exact counting (or Count-Min Sketch in streaming mode). Top-k values via Space-Saving.
- Entropy computation. Shannon entropy and normalized entropy per path. The most informative universal signal in the system.
- Missingness profiling. Null rate, absent rate, empty rate, type instability rate per path. Identifies quasi-required fields and suspicious omissions.
- Numeric distributions. Min, max, mean, median, MAD, percentiles via DDSketch. Skewness proxy. Heavy-tail indicator.
- Co-occurrence. Pointwise Mutual Information (PMI) between field pairs for the top-k most frequent paths.
Output: Per-path feature vectors stored in the feature store. The statistical backbone of everything downstream.
Complexity: O(n) time, O(p + v) memory where v = distinct values per path (bounded by sketches in streaming mode).
Commands that exit here: stats, anomalies.
Layer 4: Semantic Lifting
Responsibility: Infer likely semantic types from raw JSON scalar types and discover cross-field relationships.
What happens:
- Type inference. DFA bank runs against values: dates, currency-like values, identifiers, enum-like fields, code tokens, phone numbers, free text. Each inference carries a confidence label (definite, dominant, heuristic, unclassified).
- Relationship discovery. Conditional entropy between field pairs identifies functional dependencies. PMI identifies co-occurrence patterns.
- Domain plugin integration. Registered plugins contribute additional type recognizers and relationship hints. The medical plugin recognizes ICD-10, CPT, NPI, HCPCS patterns.
- Temporal analysis. When date/datetime fields are detected, inter-event intervals, monotonicity, gaps, and chronology violations are analyzed.
Output: Semantic type annotations, relationship graph, temporal observations, domain hints.
Complexity: O(n) for type inference, O(k^2 * n) for relationship discovery where k = top-k field screening threshold (default 50).
Commands that exit here: invariants, query.
Layer 5: Concern-Oriented Scoring
Responsibility: Score every observation against the active concern profile’s weight vector and select what matters.
What happens:
- Candidate collection. Every notable observation from layers 2-4 becomes a candidate: high-entropy fields, anomalies, motifs, relationship discoveries, drift observations.
- Signal normalization. Each of the six scoring dimensions normalized to [0, 1].
- Composite scoring. Weighted sum using the profile’s weight vector.
- Ranking. Candidates sorted by composite score with deterministic tie-breaking (path depth, then lexicographic).
- Token budget enforcement. If --budget N is set, greedy knapsack selection by score-per-token.
Output: Ranked, budgeted list of observations ready for rendering.
Complexity: O(c log c) where c = number of candidates (typically a few dozen to a few hundred).
Commands that exit here: None directly — this feeds rendering.
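The scoring and budgeting steps can be sketched as follows. Everything here is illustrative: the `Candidate` type, the unnamed six-element signal tuple, the weights, and the token costs are assumptions, and "path depth" is approximated by counting dots in the path.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    path: str
    signals: tuple  # six signals, each already normalized to [0, 1]
    tokens: int     # assumed rendering cost of this observation

def composite(c: Candidate, weights) -> float:
    """Weighted sum of the normalized signals."""
    return sum(w * s for w, s in zip(weights, c.signals))

def rank_key(c: Candidate, weights):
    # Highest score first; ties broken by path depth, then lexicographically.
    return (-composite(c, weights), c.path.count("."), c.path)

def select(candidates, weights, budget=None):
    ranked = sorted(candidates, key=lambda c: rank_key(c, weights))
    if budget is None:
        return ranked
    # Greedy knapsack: take candidates in order of score per token while they fit.
    by_density = sorted(
        ranked, key=lambda c: (-composite(c, weights) / c.tokens, c.path.count("."), c.path)
    )
    chosen, remaining = [], budget
    for c in by_density:
        if c.tokens <= remaining:
            chosen.append(c)
            remaining -= c.tokens
    return sorted(chosen, key=lambda c: rank_key(c, weights))

weights = (0.5, 0.1, 0.1, 0.1, 0.1, 0.1)
candidates = [
    Candidate("$.claims[*].amount", (0.9, 0, 0, 0, 0, 0), 40),
    Candidate("$.status", (0.4, 0, 0, 0, 0, 0), 10),
    Candidate("$.id", (0.4, 0, 0, 0, 0, 0), 10),
]
print([c.path for c in select(candidates, weights)])
print([c.path for c in select(candidates, weights, budget=25)])
```

With a budget of 25 tokens the highest-scoring candidate no longer fits, so the two cheaper observations are kept; the tie between them is resolved the same way on every run.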
Layer 6: Deterministic Essence Rendering
Responsibility: Transform the scored, ranked observations into the final output.
What happens:
- Motif collapsing. Repeated structures represented once with count and variation notes.
- Template application. The profile’s rendering configuration (vocabulary level, section headers, formatting rules) is applied.
- Format rendering. Output produced in the requested format: text, JSON, Markdown, or compact-AI.
- Redaction. If --redact is enabled, pattern-based redaction applied before final emission.
- Provenance attachment. Every essence includes: Vajra version, profile used, input hash, config hash, timestamp.
Output: The essence — a compressed, prioritized, faithful representation of the input data.
Complexity: O(c) where c = number of included observations.
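Commands that exit here: essence, drift, cluster, batch. Of these, only essence runs every preceding layer; drift, cluster, and batch skip the layers they do not need (see Early Exit Points).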
Data Flow Diagram
              +-----------+
              | Raw Input |
              +-----+-----+
                    |
          [1] Parse + Normalize
                    |
             +------v------+
             |  Document   |
             | (value tree |
             |  + metadata)|
             +------+------+
                    |
         [2] Structural Analysis
                    |
   +-------+--------+--------+--------+
   |       |        |        |        |
  Path  Finger-   Motif    Array   Domain
  Trie  prints    Index    Morph.   Hints
   |       |        |        |        |
   +-------+--------+--------+--------+
                    |
        [3] Statistical Analysis
                    |
             +------v------+
             |   Feature   |
             |    Store    |
             |  (per-path  |
             |   vectors)  |
             +------+------+
                    |
          [4] Semantic Lifting
                    |
       +-------+--------+--------+
       |       |        |        |
     Type  Relation- Temporal Plugin
    Infer.   ships   Patterns  Hints
       |       |        |        |
       +-------+--------+--------+
                    |
         [5] Scoring + Selection
                    |
             +------v------+
             |   Ranked    |
             | Observations|
             +------+------+
                    |
              [6] Rendering
                    |
             +------v------+
             |   Essence   |
             +-------------+
Early Exit Points
Not every command runs all six layers. The engine exits as early as possible:
| Command | Layers Used |
|---|---|
| inspect | 1, 2 |
| fingerprint | 1, 2 |
| stats | 1, 2, 3 |
| anomalies | 1, 2, 3 |
| invariants | 1, 2, 3, 4 |
| query | 1, 2, 3, 4 |
| essence | 1, 2, 3, 4, 5, 6 |
| drift | 1, 2, 3 (both docs), then comparison |
| cluster | 1, 2 (all docs), then similarity |
| batch | 1, 2, 3 (all docs), then aggregation |
This is why inspect is fast and essence is slower — inspect exits after structural analysis while essence runs the full pipeline.
Deep Dives
- Algorithms — every algorithm with provenance, complexity, and what it replaced
- Streaming — how the engine handles documents that exceed memory
- Determinism — how every source of nondeterminism is eliminated
Algorithms
This is the technical heart of Vajra. Every algorithm here was selected against three gates. Any algorithm that failed any gate was cut.
The Three Gates
Gate 1: Scale
O(n) or O(n log n) time complexity. Bounded or streaming-compatible memory. If an algorithm cannot handle a billion nodes without choking, it does not enter.
Gate 2: Battle-Tested
Published, peer-reviewed, deployed in production systems at scale. No novel algorithms. No research prototypes. No “clever tricks” that have not survived contact with real data.
Gate 3: Deterministic
Same input, same output. If an algorithm requires random sampling without seed control, or produces platform-dependent results, or depends on iteration order of an unordered collection — it does not enter.
The Algorithms
BLAKE3 Hashing
Provenance: O’Connor, Aumasson, Neves, Wilcox-O’Hearn, 2020. Rust-native reference implementation.
What it does: All hashing in Vajra. Path set fingerprints, typed path fingerprints, Merkle subtree hashing, content hashing, MinHash hash functions.
Why BLAKE3 over alternatives:
| Contender | Why Rejected |
|---|---|
| SHA-256 | 3-7x slower on modern hardware. No parallelism. |
| SHA-3 | Slower than BLAKE3 on all platforms. No parallel tree structure. |
| xxHash / FNV | Not cryptographic. Collision resistance matters for fingerprinting. |
| SipHash | Designed for hash tables, not content addressing. Slower for bulk data. |
Why BLAKE3 wins: 3-7x faster than SHA-256. Internally parallelizable via Bao tree structure. 256-bit output with cryptographic strength. Rust-native. Deterministic. One algorithm for every hashing need in the system.
Complexity: O(n) time, O(1) memory per hash. Internally parallel for large inputs.
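Merkle subtree hashing and motif detection can be illustrated with a bottom-up traversal. Python's standard library has no BLAKE3, so this sketch substitutes `hashlib.blake2b` as a stand-in; the shape encoding (keys plus scalar type names) and the function name are assumptions, not Vajra's format.

```python
import hashlib
from collections import Counter

def subtree_hash(value, counts: Counter) -> str:
    """Hash the shape of a subtree bottom-up and count how often each shape occurs."""
    if isinstance(value, dict):
        parts = [f"{key}:{subtree_hash(value[key], counts)}" for key in sorted(value)]
        payload = "{" + ",".join(parts) + "}"
    elif isinstance(value, list):
        payload = "[" + ",".join(subtree_hash(item, counts) for item in value) + "]"
    else:
        payload = type(value).__name__  # scalars contribute only their type
    # blake2b is a stand-in here; the text above specifies BLAKE3.
    digest = hashlib.blake2b(payload.encode("utf-8"), digest_size=16).hexdigest()
    if isinstance(value, (dict, list)):
        counts[digest] += 1  # only containers can be motifs
    return digest

doc = {"a": {"x": 1, "y": "s"}, "b": {"x": 2, "y": "t"}, "c": [1]}
counts = Counter()
root = subtree_hash(doc, counts)
motifs = [h for h, n in counts.items() if n > 1]
print(len(motifs), "repeated shape(s)")
```

The two objects under `a` and `b` differ in values but share keys and types, so they hash to the same digest and surface as one motif.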
simd-json Parsing
Provenance: Langdale & Lemire, 2019. Based on the simdjson C++ library, ported to Rust.
What it does: DOM-mode JSON parsing at 2+ GB/s using SIMD instructions for structural character classification and string validation.
Why simd-json over alternatives:
| Contender | Why Rejected |
|---|---|
| serde_json | 400-800 MB/s. Adequate but not operational speed for large documents. |
| sonic-rs | Young ecosystem. Less battle-tested. |
| Manual SAX parser | Necessary for streaming mode, but DOM mode needs random access. |
Why simd-json wins: Measured 2+ GB/s on modern hardware. Uses SIMD for structural indexing. Operates on mutable borrowed slices for zero-copy access to string values.
Complexity: O(n) time. O(n) memory for the DOM.
RFC 8785 Canonicalization (JCS)
Provenance: IETF RFC 8785, published 2020. The standard for deterministic JSON serialization.
What it does: Removes irrelevant representational variance. Lexicographic key ordering by UTF-16 code unit sequence. Deterministic number formatting. No whitespace.
Extensions beyond the RFC:
- Unicode NFC normalization (UAX #15) for string comparison stability
- Null vs. absent distinction preservation (RFC 8785 does not address this; Vajra tracks it explicitly)
- Configurable array order policy: preserve (default), set (unordered deduplicated), multiset (unordered with duplicates)
Complexity: O(n log k) where k = maximum keys per object. Memory: O(n).
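A reduced sketch of the canonicalization rules, covering key ordering by UTF-16 code units, NFC normalization, and whitespace-free output. It deliberately rejects non-integer numbers, because RFC 8785 requires ECMAScript-compatible number serialization, which is out of scope for a short example. The function name is hypothetical, and Python's `json.dumps` string escaping is assumed to be close enough for illustration.

```python
import json
import unicodedata

def canonicalize(value) -> str:
    """Serialize a parsed JSON value deterministically (simplified JCS-style sketch)."""
    if isinstance(value, dict):
        items = [(unicodedata.normalize("NFC", k), v) for k, v in value.items()]
        # Big-endian UTF-16 bytes compare in the same order as UTF-16 code units.
        items.sort(key=lambda kv: kv[0].encode("utf-16-be"))
        return "{" + ",".join(f"{canonicalize(k)}:{canonicalize(v)}" for k, v in items) + "}"
    if isinstance(value, list):
        return "[" + ",".join(canonicalize(v) for v in value) + "]"  # order preserved
    if isinstance(value, str):
        return json.dumps(unicodedata.normalize("NFC", value), ensure_ascii=False)
    if value is None or isinstance(value, (bool, int)):
        return json.dumps(value)
    raise TypeError("non-integer numbers need ECMAScript-style formatting (omitted here)")

print(canonicalize({"b": 1, "a": [True, None]}))
```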
Shannon Entropy
Provenance: Shannon, 1948. The foundation of information theory. More than seven decades of deployment in every field that measures information.
What it does: For each path, measures the information content of observed values.
H(X) = -sum p(x) * log2(p(x))
Normalized:
H_norm(X) = H(X) / log2(|support|)
Why entropy is the strongest universal primitive: It distinguishes boilerplate from signal without domain knowledge. A constant field (H = 0) is noise. A uniform random field (H_norm = 1) is unstructured. Meaningful variation lives in the middle — identifiers, dates, codes, status values.
Streaming computation: Maintained via running counts per value per path. When the value space exceeds memory, entropy is estimated from Count-Min Sketch frequency approximations.
Complexity: O(n) time, O(v) space where v = distinct values per path.
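The two formulas translate directly into code. This is a minimal exact-count sketch with hypothetical function names; treating a single-valued field as having normalized entropy 0 is a convention chosen here to avoid dividing by log2(1) = 0.

```python
import math
from collections import Counter

def shannon_entropy(values) -> float:
    """H(X) in bits, computed from exact value counts."""
    counts = Counter(values)
    total = sum(counts.values())
    # p * log2(1/p) is the same as -p * log2(p) and avoids returning -0.0.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def normalized_entropy(values) -> float:
    """H(X) divided by log2 of the support size; 0.0 for constant fields."""
    support = len(set(values))
    if support <= 1:
        return 0.0
    return shannon_entropy(values) / math.log2(support)

status = ["adjudicated", "adjudicated", "adjudicated", "denied"]
print(round(shannon_entropy(status), 4))
print(round(normalized_entropy(status), 4))
```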
Count-Min Sketch (CMS) with Conservative Update
Provenance: Cormode & Muthukrishnan, 2005 (original CMS). Conservative update: Estan & Varghese, 2002.
What it does: Streaming frequency estimation when exact counts would exceed memory. Maintains a 2D array of counters with multiple hash functions.
Conservative update improvement: Instead of incrementing all d counters, only increment counters currently equal to the minimum. Provably reduces over-estimation error without changing the data structure.
Parameters:
- Width w = ceil(e / epsilon) where epsilon = desired error rate (default 0.001)
- Depth d = ceil(ln(1 / delta)) where delta = failure probability (default 0.01)
- Default: w = 2,719, d = 5
Guarantees: Estimated count satisfies: true count <= estimate <= true count + epsilon * N with probability >= 1 - delta, where N = total count.
What it replaced:
| Contender | Why Rejected |
|---|---|
| Exact hash maps | Unbounded memory on high-cardinality paths. |
| Bloom filters | Cannot count — only membership testing. |
| Count Sketch | Returns negative estimates. CMS guarantees non-negative. |
Complexity: O(d) per update. O(w * d) memory. Both constants independent of data size.
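A compact sketch of a Count-Min Sketch with conservative update, sized from epsilon and delta as in the parameter list above. The class is illustrative: salted `hashlib.blake2b` stands in for the hash family, and items are assumed to be strings.

```python
import hashlib
import math

class CountMinSketch:
    def __init__(self, epsilon: float = 0.001, delta: float = 0.01):
        self.width = math.ceil(math.e / epsilon)      # w = ceil(e / epsilon)
        self.depth = math.ceil(math.log(1 / delta))   # d = ceil(ln(1 / delta))
        self.rows = [[0] * self.width for _ in range(self.depth)]

    def _cells(self, item: str):
        """Yield one (row, column) cell per hash function."""
        for row in range(self.depth):
            salt = row.to_bytes(8, "little")
            digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, item: str) -> None:
        cells = list(self._cells(item))
        # Conservative update: raise only the counters that sit below min + 1.
        target = min(self.rows[r][c] for r, c in cells) + 1
        for r, c in cells:
            if self.rows[r][c] < target:
                self.rows[r][c] = target

    def estimate(self, item: str) -> int:
        return min(self.rows[r][c] for r, c in self._cells(item))

cms = CountMinSketch()
for _ in range(100):
    cms.add("adjudicated")
for _ in range(3):
    cms.add("denied")
print(cms.width, cms.depth, cms.estimate("adjudicated"))
```

The estimate can never fall below the true count, because each update leaves every counter for the item at or above its previous minimum plus one.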
Space-Saving Algorithm
Provenance: Metwally, Agrawal, El Abbadi, 2005.
What it does: Identifies the top-k most frequent elements in a stream using exactly k counters.
Mechanism: When a new element arrives that is not tracked, evict the element with the smallest count, replace it, and increment. Despite its simplicity, every element whose true frequency exceeds N/k is guaranteed to be in the summary.
What it replaced:
| Contender | Why Rejected |
|---|---|
| Frequent algorithm (Misra-Gries) | Weaker error bounds for the same space. |
| Lossy Counting | Higher space complexity. More complex implementation. |
| Full sorting | O(n log n) and requires all data in memory. |
Complexity: O(1) per update with the Stream-Summary structure from the original paper (O(log k) with a min-heap). O(k) memory.
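The eviction mechanism is short enough to show in full. This sketch uses a plain dictionary and a linear scan for the minimum, so each eviction costs O(k) here; it illustrates the counting logic rather than an efficient layout. The class name and the key-based tie-break are assumptions.

```python
class SpaceSaving:
    def __init__(self, k: int):
        self.k = k
        self.counts: dict[str, int] = {}

    def add(self, item: str) -> None:
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.k:
            self.counts[item] = 1
        else:
            # Evict the smallest counter; break ties by key so runs are reproducible.
            victim = min(self.counts, key=lambda key: (self.counts[key], key))
            # The newcomer inherits the victim's count plus one (an overestimate).
            self.counts[item] = self.counts.pop(victim) + 1

    def top(self):
        return sorted(self.counts.items(), key=lambda kv: (-kv[1], kv[0]))

stream = []
for i in range(50):
    stream.append("a")
    if i < 30:
        stream.append("b")
    if i < 20:
        stream.append(f"x{i}")  # twenty one-off values acting as noise

ss = SpaceSaving(k=4)
for item in stream:
    ss.add(item)
print(ss.top())
```

With N = 100 and k = 4, any value occurring more than 25 times must survive, so both `a` (50) and `b` (30) remain tracked despite the noise.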
DDSketch
Provenance: Masson, Rim, Lee, 2019. Developed at Datadog. Deployed in production across billions of data points per second.
What it does: Streaming quantile estimation with relative error guarantees. For any quantile q, the returned value satisfies |estimate - true| <= alpha * |true| where alpha is the relative accuracy parameter.
Why DDSketch over alternatives:
| Contender | Why Rejected |
|---|---|
| t-digest (Dunning, 2019) | No formal error guarantees. Accuracy is empirically good but theoretically unbounded. |
| Fixed-width histograms | Absolute error is meaningless when values span orders of magnitude (cents to millions). |
| Random sampling | No guarantees on tail quantiles — precisely where anomalies live. |
| GK sketch (Greenwald-Khanna) | Provides absolute error, not relative. DDSketch adapts to data scale. |
Critical property: mergeability. DDSketch instances can be merged exactly, preserving accuracy guarantees. This enables parallel batch processing: analyze partitions independently, merge sketches for global statistics with zero accuracy loss.
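Parameters: alpha = 0.01 (1% relative error) by default. Memory: O(log(max/min) / log(1 + alpha)) buckets, which at alpha = 0.01 works out to roughly one to two thousand for financial data spanning cents to millions (a range of about 10^8).
Complexity: O(1) per insertion. O(b) per quantile query, where b = number of buckets (a cumulative scan over bucket counts).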
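A minimal sketch of the logarithmic bucketing behind DDSketch, restricted to positive values and with no bucket-collapsing limit. The class is illustrative rather than the Datadog implementation; it follows the paper's mapping with gamma = (1 + alpha) / (1 - alpha).

```python
import math
from collections import Counter

class DDSketch:
    def __init__(self, alpha: float = 0.01):
        self.gamma = (1 + alpha) / (1 - alpha)
        self.buckets = Counter()
        self.count = 0

    def add(self, value: float) -> None:
        """Insert a positive value into bucket i, which covers (gamma^(i-1), gamma^i]."""
        index = math.ceil(math.log(value, self.gamma))
        self.buckets[index] += 1
        self.count += 1

    def quantile(self, q: float) -> float:
        rank = q * (self.count - 1)
        seen = 0
        for index in sorted(self.buckets):  # cumulative scan in value order
            seen += self.buckets[index]
            if seen > rank:
                # Representative value whose relative error within the bucket is <= alpha.
                return 2 * self.gamma ** index / (self.gamma + 1)
        raise ValueError("empty sketch")

    def merge(self, other: "DDSketch") -> None:
        self.buckets.update(other.buckets)  # Counter.update adds counts
        self.count += other.count

whole, left, right = DDSketch(), DDSketch(), DDSketch()
for v in range(1, 1001):
    whole.add(v)
    (left if v % 2 else right).add(v)
left.merge(right)
print(round(whole.quantile(0.5), 2), left.buckets == whole.buckets)
```

Merging adds bucket counts, so the merged sketch is identical to one built from the full stream.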
Median Absolute Deviation (MAD)
Provenance: Hampel, 1974. Standard robust statistics.
What it does: Robust measure of dispersion.
MAD = median(|x_i - median(X)|)
Modified z-score:
z_MAD = 0.6745 * (x_i - median(X)) / MAD
Why MAD over standard deviation: Standard deviation has a 0% breakdown point — a single extreme value inflates sigma enough to mask every other anomaly. MAD has a 50% breakdown point — half the data can be arbitrarily corrupted before MAD gives a misleading result. This is the strongest possible breakdown point for any location/scale estimator.
Anomaly threshold: |z_MAD| > 3.5 flags an anomaly candidate (Iglewicz & Hoaglin, 1993). Configurable per profile.
Streaming computation: Exact MAD requires sorted data. Running approximate median via DDSketch enables streaming MAD estimation with bounded relative error.
Complexity: O(n log n) with sorting (O(n) with linear-time selection), or O(n) streaming via DDSketch.
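The modified z-score can be computed in a few lines with exact medians. This is a sketch with a hypothetical function name; returning all zeros when MAD is 0 is an assumption made here to avoid division by zero, not documented Vajra behavior.

```python
from statistics import median

def modified_z_scores(values):
    """Modified z-score per value: 0.6745 * (x - median) / MAD."""
    center = median(values)
    mad = median(abs(x - center) for x in values)
    if mad == 0:
        return [0.0 for _ in values]  # assumption: no scoring when dispersion is zero
    return [0.6745 * (x - center) / mad for x in values]

amounts = [285.0, 250.0, 47250.0, 310.0, 295.0, 270.0, 305.0]
scores = modified_z_scores(amounts)
flagged = [x for x, z in zip(amounts, scores) if abs(z) > 3.5]
print(flagged)  # → [47250.0]
```

Here the median is 295 and the MAD is 15, so the single extreme amount does not distort the scale used to judge the other values.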
MinHash
Provenance: Broder, 1997. The foundation of modern near-duplicate detection and similarity search.
What it does: Estimates Jaccard similarity between sets in constant time per comparison.
Vajra computes MinHash signatures over wildcard path sets using k independent hash functions (k = 128 by default). For memory efficiency, the b-bit MinHash variant (Li & Konig, 2011) stores only the lowest b bits of each hash value.
What it replaced:
| Contender | Why Rejected |
|---|---|
| Exact pairwise Jaccard | O(n^2) for batch comparison. Fine for < 1,000 docs, breaks above that. |
| Random projection (SimHash) | Better for cosine similarity. MinHash is optimal for Jaccard. |
Complexity: O(n) for signature computation. O(k) for pairwise comparison. O(k) memory per document.
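A small MinHash sketch over sets of path strings. Salted `hashlib.blake2b` stands in for the k hash functions, full 64-bit minima are kept (the b-bit truncation is not shown), and the function names are hypothetical.

```python
import hashlib

def minhash_signature(items, k: int = 128):
    """One minimum per salted hash function over the item set."""
    signature = []
    for seed in range(k):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature

def estimate_jaccard(sig_a, sig_b) -> float:
    """Fraction of signature positions that agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

paths_a = {f"$.path{i}" for i in range(100)}
paths_b = {f"$.path{i}" for i in range(50, 150)}  # true Jaccard = 50 / 150
sig_a, sig_b = minhash_signature(paths_a), minhash_signature(paths_b)
print(estimate_jaccard(sig_a, sig_b))
```

Each signature position agrees with probability equal to the true Jaccard similarity, so the estimate tightens as k grows.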
SimHash
Provenance: Charikar, 2002.
What it does: Fixed-width fingerprints where Hamming distance approximates cosine distance. Used for near-motif detection — subtrees that are semantically the same but differ in one or two fields.
SimHash operates over (key, value_type) feature pairs within each subtree. Subtrees whose SimHash values have Hamming distance <= t (default t = 3 out of 64 bits) are grouped as near-motifs.
Complexity: O(n) time. O(1) per fingerprint comparison.
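SimHash over (key, value_type) features can be sketched as below. Encoding each pair as a `key:type` string and hashing with `hashlib.blake2b` are assumptions made for the example; all features carry equal weight.

```python
import hashlib

def simhash(features, bits: int = 64) -> int:
    """Each feature votes +1 or -1 on every bit; the sign of the tally sets the bit."""
    weights = [0] * bits
    for feature in features:
        digest = hashlib.blake2b(feature.encode("utf-8"), digest_size=bits // 8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(bits):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

subtree_1 = ["claim_id:string", "status:string", "amount:number", "provider:object"]
subtree_2 = ["claim_id:string", "status:string", "amount:number", "provider:object", "notes:string"]
fp_1, fp_2 = simhash(subtree_1), simhash(subtree_2)
print(hamming(fp_1, fp_2), "bits differ")
```

Feature order does not matter because the votes are summed, which keeps the fingerprint stable under key reordering.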
Locality-Sensitive Hashing (LSH)
Provenance: Indyk & Motwani, 1998. Banded variant as described by Leskovec, Rajaraman & Ullman.
What it does: Partitions MinHash signatures into b bands of r rows, hashing each band into buckets. Documents sharing a bucket in any band are candidate pairs.
The S-curve probability: P(candidate) = 1 - (1 - s^r)^b
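With k = 128, b = 16, r = 8, the S-curve rises steeply near s = (1/b)^(1/r) ≈ 0.71. Evaluating the formula: documents with Jaccard similarity of 0.85 or higher have a > 99% chance of being found, documents at similarity 0.5 become candidates about 6% of the time, and documents with similarity < 0.2 have a < 0.01% false positive rate.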
What it replaced:
| Contender | Why Rejected |
|---|---|
| Hierarchical agglomerative clustering | O(n^2 log n) time, O(n^2) memory. Breaks on > 10K documents. |
| k-means / k-medoids | Requires specifying k in advance. Vajra cannot know the cluster count. |
Complexity: O(n) amortized for LSH indexing.
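The S-curve and the banding step can both be checked with a short script. The function names are hypothetical, and signatures are plain lists of integers so the example does not depend on any particular MinHash implementation.

```python
from collections import defaultdict
from itertools import combinations

def candidate_probability(s: float, bands: int = 16, rows: int = 8) -> float:
    """P(candidate) = 1 - (1 - s^r)^b for Jaccard similarity s."""
    return 1 - (1 - s ** rows) ** bands

def candidate_pairs(signatures: dict, bands: int = 16, rows: int = 8):
    """Documents that share at least one identical band become candidate pairs."""
    buckets = defaultdict(list)
    for doc_id in sorted(signatures):  # sorted ids keep the output deterministic
        sig = signatures[doc_id]
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(members, 2))
    return sorted(pairs)

for s in (0.2, 0.5, 0.85):
    print(s, round(candidate_probability(s), 5))

signatures = {
    "a": list(range(128)),
    "b": list(range(8)) + list(range(500, 620)),  # shares only the first band with "a"
    "c": list(range(1000, 1128)),
}
print(candidate_pairs(signatures))  # → [('a', 'b')]
```

Trading bands against rows moves the steep part of the curve: more rows per band raise the similarity needed to collide, more bands lower it.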
Jensen-Shannon Divergence (JSD)
Provenance: Lin, 1991. Square root metric property: Endres & Schindelin, 2003; Osterreicher & Vajda, 2003.
What it does: Measures distributional drift between two value distributions.
JSD(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M)
where M = 0.5 * (P + Q).
Why JSD over alternatives:
| Contender | Why Rejected |
|---|---|
| KL divergence | Asymmetric. Infinite when Q(x) = 0 and P(x) > 0. |
| Chi-squared test | Sensitive to bin size. Not a proper metric. |
| Kolmogorov-Smirnov | Measures maximum deviation only, not overall distribution shape. |
| Total variation distance | Does not account for similarity between nearby values. |
Why JSD wins: Symmetric. Always finite. Bounded to [0, 1] with log base 2. Square root is a proper metric satisfying the triangle inequality. Drift magnitudes can be meaningfully compared.
Complexity: O(v) where v = union of value supports.
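The definition maps directly onto code over two value samples. This is a sketch with a hypothetical function name; it skips zero-probability terms, which is the standard convention 0 * log(0) = 0.

```python
import math
from collections import Counter

def jsd(p_values, q_values) -> float:
    """Jensen-Shannon divergence in bits between two empirical distributions."""
    p, q = Counter(p_values), Counter(q_values)
    n_p, n_q = sum(p.values()), sum(q.values())
    total = 0.0
    for value in sorted(set(p) | set(q)):  # fixed iteration order for reproducible sums
        px, qx = p[value] / n_p, q[value] / n_q
        m = 0.5 * (px + qx)
        if px > 0:
            total += 0.5 * px * math.log2(px / m)
        if qx > 0:
            total += 0.5 * qx * math.log2(qx / m)
    return total

v1 = ["active", "active", "closed"]
v2 = ["active", "closed", "closed"]
print(round(jsd(v1, v2), 4))
print(jsd(["a"], ["b"]))  # → 1.0
```

Disjoint supports give the maximum of 1 bit, and identical samples give exactly 0.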
1D Wasserstein Distance (Earth Mover’s Distance)
Provenance: Kantorovich, 1942. Computational formulation: O(n log n) via CDF sorting.
What it does: For numeric paths, measures “how far did values move” — not just that the distribution changed, but the magnitude of the shift.
Why included alongside JSD: JSD measures probability mass redistribution. Wasserstein captures the distance of the shift. A distribution that shifts entirely from $100 to $100.01 has low Wasserstein but potentially high JSD. A distribution that shifts from $100 to $10,000 has high Wasserstein. Both measures together give a complete picture.
Complexity: O(n log n) via sorting.
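For one-dimensional samples the distance is the area between the two empirical CDFs. The sketch below integrates that area after sorting; the function name is hypothetical and the inputs are assumed to be non-empty numeric lists.

```python
def wasserstein_1d(a, b) -> float:
    """Area between the empirical CDFs of two numeric samples."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    distance, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        # Advance each pointer past all values <= left to get the CDF height there.
        while i < len(a) and a[i] <= left:
            i += 1
        while j < len(b) and b[j] <= left:
            j += 1
        distance += abs(i / len(a) - j / len(b)) * (right - left)
    return distance

small_shift = wasserstein_1d([100.0] * 3, [100.01] * 3)
large_shift = wasserstein_1d([100.0] * 3, [10000.0] * 3)
print(round(small_shift, 4), large_shift)
```

The two calls mirror the example in the text: the one-cent shift yields a distance of about 0.01, while the shift to 10,000 yields 9,900, even though both pairs of distributions have completely disjoint support.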
Pointwise Mutual Information (PMI)
Provenance: Church & Hanks, 1989 (in NLP). Rooted in Shannon, 1948.
What it does: Measures association strength between field pairs.
PMI(x, y) = log2(P(x, y) / (P(x) * P(y)))
Positive = co-occur more than chance. Negative = avoid each other. Zero = independent.
Complexity: O(1) per pair given precomputed frequencies.
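PMI over observed value pairs can be sketched as follows. The function name and the sample field values are illustrative.

```python
import math
from collections import Counter

def pmi_table(pairs):
    """PMI in bits for every observed (x, y) value pair."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    return {
        (x, y): math.log2((c / n) / ((left[x] / n) * (right[y] / n)))
        for (x, y), c in joint.items()
    }

# status and amount bucket always move together in this sample
dependent = [("denied", "zero"), ("denied", "zero"), ("adjudicated", "positive"), ("adjudicated", "positive")]
# every combination appears equally often in this one
independent = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
print(pmi_table(dependent)[("denied", "zero")])  # → 1.0
print(pmi_table(independent)[("a", "x")])        # → 0.0
```

Pairs that never co-occur have no entry in the table; their PMI would be negative infinity.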
Conditional Entropy
Provenance: Shannon, 1948.
What it does: Measures how much knowing field X reduces uncertainty about field Y.
H(Y|X) = -sum p(x,y) * log2(p(y|x))
Low H(Y|X) means X predicts Y. H(Y|X) near 0 means functional dependency.
Complexity: O(n) to compute from co-occurrence counts.
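Conditional entropy follows from the joint counts and the marginal counts of X. This sketch uses hypothetical names and sample fields.

```python
import math
from collections import Counter

def conditional_entropy(pairs) -> float:
    """H(Y|X) in bits from observed (x, y) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    joint = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    # -p(x,y) * log2(p(y|x)) with p(y|x) = count(x,y) / count(x)
    return sum((c / n) * math.log2(x_counts[x] / c) for (x, _), c in joint.items())

zip_to_state = [("10001", "NY"), ("10001", "NY"), ("94105", "CA"), ("94105", "CA")]
unrelated = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
print(conditional_entropy(zip_to_state))  # → 0.0
print(conditional_entropy(unrelated))     # → 1.0
```

A result of 0 means every value of X maps to exactly one value of Y, which is the functional-dependency case.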
Benford’s Law Analysis
Provenance: Newcomb, 1881. Benford, 1938. Formalized by Hill, 1995. Forensic application: Nigrini, 1996.
What it does: Tests whether leading digit distribution matches the expected logarithmic distribution:
P(d) = log10(1 + 1/d)
Departure measured via chi-squared goodness-of-fit. Effective for financial amounts, counts, and quantities. Not applicable to identifiers, codes, or constrained-range values — Vajra applies this only to paths classified as numeric with high cardinality and range spanning at least one order of magnitude.
Complexity: O(n) — single pass to count leading digits.
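The leading-digit test can be sketched with a chi-squared statistic against the Benford expectation. The function names are hypothetical, and the comparison value of 15.51 is the standard 5% critical value for 8 degrees of freedom, not a documented Vajra threshold.

```python
import math
from collections import Counter

def leading_digit(x) -> int:
    """First significant digit, read from scientific notation."""
    return int(f"{abs(x):e}"[0])

def benford_chi_squared(values) -> float:
    """Chi-squared distance between observed leading digits and Benford's distribution."""
    digits = Counter(leading_digit(v) for v in values if v != 0)
    n = sum(digits.values())
    chi2 = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        chi2 += (digits[d] - expected) ** 2 / expected
    return chi2

powers_of_two = [2 ** i for i in range(1, 200)]  # a classic Benford-conforming sequence
flat = list(range(100, 1000))                    # every leading digit equally common
print(round(benford_chi_squared(powers_of_two), 2))
print(round(benford_chi_squared(flat), 2))
```

The flat range fails badly because it spans less than one order of magnitude per digit, which is why the applicability conditions above matter.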
The Rejection List
These algorithms were evaluated and rejected. Each had a specific reason.
| Algorithm | Reason for Rejection |
|---|---|
| Isolation Forest (Liu et al., 2008) | Non-deterministic without careful seeding. O(n log n) per tree. Contamination parameter requires tuning. MAD + rarity + structural distance cover the same space with stronger interpretability. |
| Local Outlier Factor (Breunig et al., 2000) | O(n^2) naive, O(n log n) with spatial indexing. Sensitive to k parameter. Breaks universality on large datasets. |
| t-digest (Dunning, 2019) | No formal error guarantees. Accuracy is empirically good but theoretically unbounded. DDSketch provides provable bounds. |
| Hierarchical agglomerative clustering | O(n^2 log n) time, O(n^2) memory. Replaced by LSH-based component detection at O(n). |
| k-means / k-medoids | Requires specifying k in advance. Also O(n^2) per iteration for k-medoids. |
| Any method requiring training data | Vajra operates on cold data with no prior. Every method must work on a single document or batch with no history. |
| Any method requiring labeled examples | Same constraint. Unsupervised only. |
| Any method without deterministic output | The determinism guarantee is non-negotiable. |
Information Theory
Vajra’s analytical core is an information-theoretic pipeline. Every measure of diversity, anomaly, drift, similarity, and dependency traces back to a concept from information theory. This chapter covers the full stack — from foundational primitives through composite metrics to the scoring model that turns bits into insights.
Foundation: Shannon Entropy
The starting point. Shannon entropy measures the average surprise per observation:
H(X) = -sum p(x) * log2(p(x))
- 0 bits: constant field (no information)
- log2(k) bits: uniform distribution over k values (maximum information)
- Between: the interesting space where identifiers, dates, and codes live
Normalized entropy scales to [0, 1]:
H_norm(X) = H(X) / log2(|support|)
This is the single most important signal in Vajra. A field with H_norm near 0 is effectively constant and carries almost no information. A field with H_norm near 1 is unstructured randomness. Meaningful variation lives in the middle.
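As a minimal sketch of the two formulas above (not the code in vajra-stats), entropy can be computed from per-value counts. The function names are hypothetical, and returning 0 for a support of size 0 or 1 is an assumption made here to avoid dividing by log2(1) = 0.

```rust
use std::collections::BTreeMap;

/// Shannon entropy in bits, computed from per-value counts.
fn shannon_entropy(counts: &BTreeMap<&str, u64>) -> f64 {
    let total: u64 = counts.values().sum();
    if total == 0 {
        return 0.0;
    }
    counts
        .values()
        .filter(|c| **c > 0)
        .map(|c| {
            let p = *c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Normalized entropy in [0, 1].
/// Assumption: a support of size 0 or 1 is reported as 0.
fn normalized_entropy(counts: &BTreeMap<&str, u64>) -> f64 {
    let support = counts.values().filter(|c| **c > 0).count();
    if support <= 1 {
        return 0.0;
    }
    shannon_entropy(counts) / (support as f64).log2()
}
```

A field with four equally frequent values scores 2 bits and H_norm = 1; a field that always holds "active" scores 0 on both.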
Files: vajra-stats/src/entropy.rs
The Renyi Spectrum
Shannon entropy is one point on a continuous family parameterized by alpha:
H_alpha(X) = (1 / (1 - alpha)) * log2(sum p(x)^alpha)
| alpha | Name | What It Measures |
|---|---|---|
| 0 | Hartley | log2(support size) — how many distinct values exist |
| 1 | Shannon | average surprise (limit as alpha approaches 1) |
| 2 | Collision | -log2(sum p^2): negative log of the probability that two random draws match |
| infinity | Min-entropy | -log2(max p) — worst-case unpredictability |
Why a spectrum? A single entropy number hides the shape of the distribution. The Renyi spectrum reveals it:
- High Shannon, low min-entropy: long tail with one dominant value
- All orders equal: near-uniform distribution
- Large divergence (H0 - H_inf): heavy concentration with many rare values
Security application: Min-entropy is the correct measure for cryptographic key strength — not Shannon. A key with high Shannon but low min-entropy has a predictable most-likely value.
Spectrum divergence (H0 - H_inf) is itself a signal: it quantifies how far the distribution is from uniform. Zero divergence = uniform. High divergence = concentrated.
Complexity: O(n), same as Shannon. Computed from the same frequency counts.
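The spectrum can be sketched from a probability vector as follows. This is an illustration of the formula, not the contents of vajra-stats/src/renyi.rs; the function names are hypothetical, and the alpha = 1 and alpha = infinity cases are handled as explicit limits.

```rust
/// Renyi entropy of order `alpha` in bits for a probability vector.
/// alpha = 1 is treated as the Shannon limit, alpha = infinity as min-entropy.
fn renyi_entropy(probs: &[f64], alpha: f64) -> f64 {
    let support: Vec<f64> = probs.iter().copied().filter(|p| *p > 0.0).collect();
    if support.is_empty() {
        return 0.0;
    }
    if alpha == 1.0 {
        return support.iter().map(|p| -p * p.log2()).sum();
    }
    if alpha.is_infinite() {
        let max_p = support.iter().copied().fold(0.0_f64, f64::max);
        return -max_p.log2();
    }
    let power_sum: f64 = support.iter().map(|p| p.powf(alpha)).sum();
    power_sum.log2() / (1.0 - alpha)
}

/// Spectrum divergence H0 - H_inf: zero for a uniform distribution.
fn spectrum_divergence(probs: &[f64]) -> f64 {
    renyi_entropy(probs, 0.0) - renyi_entropy(probs, f64::INFINITY)
}
```

For a uniform distribution over four values every order equals 2 bits. For [0.7, 0.1, 0.1, 0.1] the orders fall monotonically from H0 = 2 bits toward a min-entropy of about 0.51 bits.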
Files: vajra-stats/src/renyi.rs
Structural Complexity: Lempel-Ziv
Shannon entropy measures average information per symbol. It cannot distinguish:
| Input | Shannon Entropy | LZ Complexity |
|---|---|---|
| Random UUIDs | High | High |
| PROJ-001, PROJ-002, … | High | Low |
| Repeated "active" | Low | Low |
Lempel-Ziv complexity (LZ76) measures the number of distinct subpatterns needed to describe a sequence. The LZ76 algorithm scans left-to-right, extending the current phrase until it hasn’t been seen before:
Normalized C_LZ = phrase_count / (n / log2(n))
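To make the parsing rule concrete, here is a deliberately naive sketch of LZ76 phrase counting. It rescans the prefix for every extension, so it is quadratic or worse; it illustrates the definition only and is not the single-pass implementation in vajra-stats. All function names are hypothetical.

```rust
/// Returns true if `needle` occurs anywhere in `haystack`.
fn occurs(haystack: &[u8], needle: &[u8]) -> bool {
    haystack.windows(needle.len()).any(|w| w == needle)
}

/// Number of phrases in the LZ76 parsing of `s`.
fn lz76_phrase_count(s: &[u8]) -> usize {
    let n = s.len();
    let (mut start, mut phrases) = (0, 0);
    while start < n {
        let mut len = 1;
        // Extend the phrase while it can still be copied from earlier text.
        // The copy must start before the phrase but may overlap it.
        while start + len <= n && occurs(&s[..start + len - 1], &s[start..start + len]) {
            len += 1;
        }
        phrases += 1;
        start += len;
    }
    phrases
}

/// Normalized complexity: phrase_count / (n / log2(n)).
fn normalized_lz(s: &[u8]) -> f64 {
    let n = s.len() as f64;
    if n < 2.0 {
        return 0.0;
    }
    lz76_phrase_count(s) as f64 / (n / n.log2())
}
```

The binary string 0001101001000101 parses as 0 | 001 | 10 | 100 | 1000 | 101, six phrases. A run of eight identical bytes parses into two.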
The entropy-complexity plane has four quadrants:
| | Low LZ | High LZ |
|---|---|---|
| High entropy | Structured (patterned identifiers) | Random (UUIDs, hashes) |
| Low entropy | Constant (repeated values) | Anomalous (theoretically unlikely) |
A field in the “structured” quadrant (high entropy, low complexity) is a generated identifier with a pattern. A field in the “random” quadrant is truly unpredictable. Shannon alone cannot tell them apart.
Complexity: O(n) single pass. No external dependencies.
Files: vajra-stats/src/lz_complexity.rs
Relationships: Conditional Entropy and PMI
Conditional Entropy H(Y|X)
How much knowing X reduces uncertainty about Y:
H(Y|X) = -sum p(x,y) * log2(p(y|x))
- H(Y|X) = 0: X completely determines Y (functional dependency)
- H(Y|X) = H(Y): X tells you nothing about Y (independence)
Relationship strength normalizes this:
strength = 1 - H(Y|X) / H(Y)
Clamped to [0, 1]. Zero = independent. One = deterministic.
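A minimal sketch of relationship strength over paired observations follows. It uses the chain rule H(Y|X) = H(X,Y) - H(X), which is equivalent to the sum above. The function names are hypothetical, and returning 0 when Y is constant (H(Y) = 0) is an assumption made here, not documented Vajra behavior.

```rust
use std::collections::BTreeMap;

/// Entropy in bits from raw counts that sum to `total`.
fn entropy_of_counts<I: Iterator<Item = u64>>(counts: I, total: u64) -> f64 {
    counts
        .filter(|c| *c > 0)
        .map(|c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// strength = 1 - H(Y|X) / H(Y), clamped to [0, 1].
fn relationship_strength(pairs: &[(&str, &str)]) -> f64 {
    let n = pairs.len() as u64;
    if n == 0 {
        return 0.0;
    }
    let mut x_counts: BTreeMap<&str, u64> = BTreeMap::new();
    let mut y_counts: BTreeMap<&str, u64> = BTreeMap::new();
    let mut xy_counts: BTreeMap<(&str, &str), u64> = BTreeMap::new();
    for pair in pairs {
        *x_counts.entry(pair.0).or_insert(0) += 1;
        *y_counts.entry(pair.1).or_insert(0) += 1;
        *xy_counts.entry((pair.0, pair.1)).or_insert(0) += 1;
    }
    let h_x = entropy_of_counts(x_counts.values().copied(), n);
    let h_y = entropy_of_counts(y_counts.values().copied(), n);
    let h_xy = entropy_of_counts(xy_counts.values().copied(), n);
    if h_y == 0.0 {
        return 0.0; // assumption: a constant Y has no relationship to report
    }
    let h_y_given_x = h_xy - h_x;
    (1.0 - h_y_given_x / h_y).max(0.0).min(1.0)
}
```

If every city maps to exactly one country, the strength of city -> country is 1. If the two fields vary independently, it is 0.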
Pointwise Mutual Information
Measures co-occurrence strength between specific value pairs:
PMI(x, y) = log2(P(x,y) / (P(x) * P(y)))
Positive = co-occur more than chance. Negative = avoid each other. Zero = independent.
Total Correlation
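Pairwise measures miss higher-order structure. Three fields can be independent in pairs yet jointly constrained (the textbook case is two random bits and their XOR), and real schemas carry group-level redundancy that no single pair fully exposes (city + state + zip). Total correlation captures this: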
TC(X1,...,Xn) = sum H(Xi) - H(X1,...,Xn)
- TC = 0: all fields are independent
- High TC: the schema has deep internal structure
- TC / sum H(Xi): normalized to [0, 1]
Total correlation answers: “how much redundancy exists across all fields simultaneously?” This is the gap between pairwise analysis and true multivariate dependency.
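The XOR case makes the gap visible. The sketch below computes TC exactly from observed rows, with no binning and no 8-field cap, so it illustrates the definition rather than the estimator in vajra-stats. Names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Empirical entropy in bits of a stream of observations.
fn entropy<K: Ord, I: Iterator<Item = K>>(items: I) -> f64 {
    let mut counts: BTreeMap<K, u64> = BTreeMap::new();
    let mut total = 0u64;
    for item in items {
        *counts.entry(item).or_insert(0) += 1;
        total += 1;
    }
    counts
        .values()
        .map(|c| {
            let p = *c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// TC = sum_i H(X_i) - H(X_1, ..., X_n) over rows of equal width.
fn total_correlation(rows: &[Vec<u8>]) -> f64 {
    let width = rows.first().map_or(0, |r| r.len());
    let marginals: f64 = (0..width)
        .map(|i| entropy(rows.iter().map(move |r| r[i])))
        .sum();
    let joint = entropy(rows.iter().cloned());
    marginals - joint
}
```

For the four rows (0,0,0), (0,1,1), (1,0,1), (1,1,0), where the third column is the XOR of the first two, every pair of columns has TC = 0 while the full triple has TC = 1 bit.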
Complexity: O(n) for marginals. Joint entropy estimated via binning, bounded to 8-field subsets for tractability.
Files: vajra-stats/src/relationships.rs, vajra-stats/src/total_correlation.rs
Distributional Drift: JSD and Wasserstein
Jensen-Shannon Divergence
Symmetric, bounded, and a proper metric (via square root):
JSD(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M)
where M = 0.5 * (P + Q) and KL is Kullback-Leibler divergence.
- JSD in [0, 1] with log base 2
- sqrt(JSD) satisfies the triangle inequality (Endres & Schindelin 2003)
- Used for categorical distribution drift
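A minimal sketch of JSD over two aligned probability vectors follows. It assumes both vectors are indexed by the same category order and each sums to 1; the function names are hypothetical.

```rust
/// KL(P || Q) in bits. Terms with p = 0 contribute nothing.
fn kl_divergence(p: &[f64], q: &[f64]) -> f64 {
    p.iter()
        .zip(q.iter())
        .map(|(pi, qi)| if *pi > 0.0 { pi * (pi / qi).log2() } else { 0.0 })
        .sum()
}

/// Jensen-Shannon divergence in bits, in [0, 1].
fn jsd(p: &[f64], q: &[f64]) -> f64 {
    assert_eq!(p.len(), q.len(), "distributions must share one category order");
    let m: Vec<f64> = p.iter().zip(q.iter()).map(|(a, b)| 0.5 * (a + b)).collect();
    0.5 * kl_divergence(p, &m) + 0.5 * kl_divergence(q, &m)
}
```

Because M is positive wherever P or Q is positive, the KL terms never divide by zero. Identical distributions give 0; distributions with disjoint support give 1.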
1D Wasserstein Distance
For numeric distributions, measures the “earth mover’s distance”:
W1 = integral |CDF_a(x) - CDF_b(x)| dx
JSD tells you the distributions changed. Wasserstein tells you by how much — it captures the magnitude of the shift, not just its existence.
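For two samples of equal size, the integral reduces to the mean absolute difference between sorted values. The sketch below covers only that special case, which is an assumption of this example; unequal sample sizes need a merge over both empirical CDFs. The function name is hypothetical.

```rust
/// W1 between two empirical samples of equal size.
fn wasserstein_1d(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len(), "this sketch assumes equal sample sizes");
    assert!(!a.is_empty(), "samples must be non-empty");
    let mut sa = a.to_vec();
    let mut sb = b.to_vec();
    // total_cmp gives a total order on f64, so sorting is deterministic.
    sa.sort_by(|x, y| x.total_cmp(y));
    sb.sort_by(|x, y| x.total_cmp(y));
    let total: f64 = sa.iter().zip(sb.iter()).map(|(x, y)| (x - y).abs()).sum();
    total / sa.len() as f64
}
```

Shifting every amount by 10 yields W1 = 10, in the same units as the data.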
When to use which:
| Data Type | Metric | Why |
|---|---|---|
| Categorical (strings, enums) | JSD | Probability mass redistribution |
| Numeric (amounts, counts) | Wasserstein | Shift magnitude in original units |
Files: vajra-drift/src/jsd.rs, vajra-drift/src/wasserstein.rs
Directed Information Flow: Transfer Entropy
Transfer entropy measures how much knowing the past of X reduces uncertainty about Y’s future, beyond what Y’s own past already tells you:
TE(X->Y) = H(Y_t | Y_{t-1}^k) - H(Y_t | Y_{t-1}^k, X_{t-1}^l)
Key properties:
- Directional: in general TE(X->Y) != TE(Y->X), which indicates the direction of information flow
- Non-negative: information can only help prediction
- Granger causality generalized: captures nonlinear dependencies
This transforms cascade detection from temporal pattern matching into rigorous directed information flow quantification. Instead of “A happened before B,” transfer entropy says “A’s past carries 2.3 bits of information about B’s future that B’s own history doesn’t contain.”
Net information flow = TE(X->Y) - TE(Y->X). Positive means X drives Y. Negative means Y drives X.
Complexity: O(n * k) where k is history depth. Deterministic with fixed binning.
Files: vajra-stats/src/transfer_entropy.rs
Universal Similarity: NCD
Normalized Compression Distance approximates the normalized information distance — provably the most general similarity metric:
NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
where C is a real compressor (zstd at fixed level 3).
Why NCD is strictly more powerful than feature-based similarity: MinHash captures set overlap. SimHash captures angular proximity. Both require choosing features. NCD captures all computable regularities — structure, patterns, naming conventions, content — with zero feature engineering.
Two documents that share structural patterns but zero literal values will have low NCD. Two documents with random shared tokens but different structure will have high NCD.
- NCD(x, x) approaches 0 (self-similarity)
- NCD(x, random) approaches 1 (dissimilarity)
- Symmetric: NCD(x, y) = NCD(y, x)
- Deterministic given fixed compressor and level
Complexity: O(n) per compression. O(n^2) for all-pairs matrix with C(x) caching.
Files: vajra-fingerprint/src/ncd.rs
Anomaly Scoring
Self-Information (Surprisal)
The rarity of a single observation:
I(x) = -log2(p(x))
A value seen once in 10,000 observations carries 13.3 bits of rarity. This is the information-theoretic foundation of rare value detection.
MAD-Based Outlier Detection
Median Absolute Deviation with modified z-scores:
z_MAD = 0.6745 * (x - median) / MAD
Values with |z_MAD| > 3.5 are flagged. MAD has a 50% breakdown point — half the data can be corrupted before it gives misleading results.
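The rule can be sketched as below. The function names are hypothetical, and returning no outliers when MAD = 0 (more than half the values identical) is an assumption of this sketch, since the modified z-score is undefined there.

```rust
/// Median of an already sorted, non-empty slice.
fn median(sorted: &[f64]) -> f64 {
    let n = sorted.len();
    if n % 2 == 1 {
        sorted[n / 2]
    } else {
        0.5 * (sorted[n / 2 - 1] + sorted[n / 2])
    }
}

/// Indices of values whose modified z-score exceeds `threshold`.
fn mad_outliers(values: &[f64], threshold: f64) -> Vec<usize> {
    if values.is_empty() {
        return Vec::new();
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let med = median(&sorted);
    let mut deviations: Vec<f64> = values.iter().map(|v| (v - med).abs()).collect();
    deviations.sort_by(|a, b| a.total_cmp(b));
    let mad = median(&deviations);
    if mad == 0.0 {
        return Vec::new(); // assumption: degenerate case reports nothing
    }
    values
        .iter()
        .enumerate()
        .filter_map(|(i, v)| {
            let z = 0.6745 * (v - med) / mad;
            if z.abs() > threshold { Some(i) } else { None }
        })
        .collect()
}
```

For [10, 11, 9, 10, 12, 10, 11, 100] the median is 10.5 and MAD is 0.5, so only the value 100 crosses the 3.5 threshold.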
Benford’s Law
Leading digit distribution for numeric fields:
P(d) = log10(1 + 1/d)
Conformity tested via chi-squared and Nigrini’s MAD score. Non-conformity (MAD > 0.015) signals potentially fabricated or unusual numeric data.
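The MAD score here is the mean absolute deviation between observed and expected first-digit proportions. A minimal sketch for non-negative integer amounts follows; the function names are hypothetical and the chi-squared test is omitted.

```rust
/// Leading decimal digit of `n`, or None for zero.
fn leading_digit(mut n: u64) -> Option<usize> {
    if n == 0 {
        return None;
    }
    while n >= 10 {
        n /= 10;
    }
    Some(n as usize)
}

/// Mean absolute deviation from Benford's first-digit distribution.
/// Returns None when no value has a leading digit.
fn benford_mad(values: &[u64]) -> Option<f64> {
    let mut counts = [0u64; 9];
    let mut total = 0u64;
    for d in values.iter().filter_map(|v| leading_digit(*v)) {
        counts[d - 1] += 1;
        total += 1;
    }
    if total == 0 {
        return None;
    }
    let sum: f64 = (1..=9usize)
        .map(|d| {
            let expected = (1.0 + 1.0 / d as f64).log10();
            let observed = counts[d - 1] as f64 / total as f64;
            (observed - expected).abs()
        })
        .sum();
    Some(sum / 9.0)
}
```

A field whose leading digits are spread evenly across 1 through 9 scores well above 0.015, because Benford's Law expects the digit 1 about 30% of the time.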
The Six-Dimensional Scoring Model
Every observation is scored across six information-theoretic dimensions:
| Dimension | Source | Range |
|---|---|---|
| rarity | Self-information, cardinality | [0, 1] |
| instability | Type distribution: 1 - (dominant/total) | [0, 1] |
| entropy_signal | Normalized Shannon entropy | [0, 1] |
| structural_coverage | Null rate, enum-like patterns | [0, 1] |
| anomaly_strength | MAD z-scores, rarity magnitude | [0, 1] |
| concern_relevance | Domain-specific importance | [0, 1] |
The composite score is a weighted sum:
score = sum weight_i * dimension_i
Weights depend on the concern profile:
| Profile | Rarity | Instability | Entropy | Coverage | Anomaly | Concern |
|---|---|---|---|---|---|---|
| Engineer | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 | 0.15 |
| Staff | 0.10 | 0.10 | 0.10 | 0.25 | 0.30 | 0.15 |
| Fraud | 0.25 | 0.10 | 0.10 | 0.10 | 0.35 | 0.10 |
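The weighted sum can be sketched directly from the table. The constants below copy the rows as printed; note that the Engineer row sums to 0.90 while Staff and Fraud sum to 1.00, so this sketch does not assume the weights are normalized. The type and function names are hypothetical.

```rust
/// Dimension order: rarity, instability, entropy_signal,
/// structural_coverage, anomaly_strength, concern_relevance.
type Vector6 = [f64; 6];

const ENGINEER: Vector6 = [0.15, 0.15, 0.15, 0.15, 0.15, 0.15];
const STAFF: Vector6 = [0.10, 0.10, 0.10, 0.25, 0.30, 0.15];
const FRAUD: Vector6 = [0.25, 0.10, 0.10, 0.10, 0.35, 0.10];

/// score = sum weight_i * dimension_i
fn composite_score(dimensions: &Vector6, weights: &Vector6) -> f64 {
    dimensions
        .iter()
        .zip(weights.iter())
        .map(|(d, w)| d * w)
        .sum()
}
```

An observation that is rare and strongly anomalous, say [0.9, 0.0, 0.2, 0.1, 0.8, 0.3], scores 0.565 under Fraud, 0.42 under Staff, and 0.345 under Engineer: the same finding ranks differently depending on who is asking.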
The Integration Pipeline
JSON Document
|
v
[Stats Analyzer] --- entropy, Renyi spectrum, LZ complexity, cardinality, rarity
|
v
[Anomaly Analyzer] --- rare values (surprisal), type instabilities, MAD outliers
|
v
[Relationship Discovery] --- conditional entropy, PMI, total correlation
|
v
[Drift Analyzer] --- JSD (categorical), Wasserstein (numeric), severity
|
v
[Cascade Analyzer] --- transfer entropy, directed information flow
|
v
[Feature Store] --- PathFeatures with all information-theoretic signals
|
v
[Essence Builder] --- ScoredObservations across 6 dimensions
|
v
[Profile Scorer] --- Weighted composite score
|
v
[EssenceData] --- Prioritized findings for humans and AI
Every anomaly signal, every drift measurement, every relationship discovery, and every cascade detection is rooted in information theory. The entire system is fundamentally an information-theoretic lens on structured data.
Streaming
Vajra handles JSON of any size. A 50 KB medical claim and a 10 GB event log enter the same pipeline. The streaming engine is what makes this possible.
Two Modes
DOM Mode
For documents that fit in memory. The parser builds a full in-memory tree with random access to every node. All analysis passes can access any part of the document at any time.
Parser: simd-json at 2+ GB/s.
Memory: O(n) where n = document size.
Activates: By default, for documents below the streaming threshold (default 100 MB, configurable).
Streaming Mode
For documents that exceed available memory. SAX-style event parsing with bounded memory. The parser emits events (start-object, key, value, end-object, start-array, end-array) and the analyzers update their accumulators incrementally.
Memory: O(p + s) where p = distinct paths and s = sum of sketch sizes. For typical JSON with < 1,000 distinct paths: < 10 MB regardless of document size.
Activates: Automatically when document size exceeds the streaming threshold. Force with --streaming.
The Two-Pass Hybrid Strategy
Streaming mode does not mean single-pass-only. Vajra uses a hybrid strategy that balances memory efficiency with analysis depth.
Pass 1: Profile the Document
A single streaming pass collects:
- Path extraction. Every wildcard path discovered and registered in the path trie.
- Frequency counting. Value frequencies per path via Count-Min Sketch (conservative update).
- Top-k identification. Most frequent values per path via Space-Saving algorithm.
- Type profiling. Type distribution per path tracked via simple counters.
- Numeric sketches. DDSketch accumulators for every numeric path — percentiles, median, MAD.
- Null and missingness tracking. Per-path counters for null, absent, empty.
- Entropy estimation. Computed from CMS frequency estimates when exact counting exceeds memory.
- Fingerprint accumulation. Merkle hashes built incrementally as subtrees complete.
After Pass 1, Vajra has a complete statistical profile of the document without having held more than one event in memory at a time.
Pass 2 (Optional): Selective DOM for High-Signal Subtrees
If the command requires rich analysis that streaming cannot provide (motif analysis, essence generation with deep context), Vajra can selectively parse high-signal subtrees into DOM.
The decision is based on Pass 1 results:
- Subtrees with high anomaly density are candidates for DOM parsing.
- Subtrees with high-entropy fields that need value-level analysis are candidates as well.
- The dominant motif’s representative instance.
Pass 2 is optional. Commands like stats and fingerprint need only Pass 1. Commands like essence may invoke Pass 2 for targeted depth.
Sketch Data Structures in Streaming Mode
DDSketch
Role: Numeric distribution analysis — percentiles, median, MAD.
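One DDSketch per numeric path. Each sketch maintains O(log(max/min) / log(1 + alpha)) buckets. With alpha = 0.01 and financial data spanning $0.01 to $1,000,000 (eight orders of magnitude), this is roughly 920 buckets if every bucket in the range is occupied, assuming the standard DDSketch base gamma = (1 + alpha) / (1 - alpha). That is still only a few KB of memory per path.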
Key property: Mergeability. When processing a batch in parallel, per-file DDSketch instances merge into a global sketch with zero accuracy loss.
// Streaming numeric stats: feed parser events into the accumulator one at a time.
let mut stats = StreamingStatsAccumulator::default();
for event in parser {
    stats.on_event(&event?)?;
}
let result = stats.finalize()?;
// result.numeric_stats contains DDSketch-derived percentiles
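The relative-error guarantee comes from logarithmic bucketing. The sketch below shows the index mapping described in the DDSketch paper for positive values, with base gamma = (1 + alpha) / (1 - alpha). It is an illustration under that assumption, not Vajra's implementation, and the `LogMapping` type is hypothetical.

```rust
/// Logarithmic bucket mapping with relative accuracy `alpha` (positive values only).
struct LogMapping {
    gamma: f64,
}

impl LogMapping {
    fn new(alpha: f64) -> Self {
        LogMapping { gamma: (1.0 + alpha) / (1.0 - alpha) }
    }

    /// Bucket i covers the interval (gamma^(i-1), gamma^i].
    fn index(&self, value: f64) -> i32 {
        (value.ln() / self.gamma.ln()).ceil() as i32
    }

    /// Value reported for a bucket; within relative error alpha of every member.
    fn representative(&self, index: i32) -> f64 {
        2.0 * self.gamma.powi(index) / (self.gamma + 1.0)
    }

    /// Number of bucket indices needed to cover [min, max].
    fn bucket_span(&self, min: f64, max: f64) -> i32 {
        self.index(max) - self.index(min) + 1
    }
}
```

Merging two sketches built with the same mapping is a per-bucket addition of counts, which is why merging loses no accuracy. `bucket_span` reports how many indices a value range covers under this particular mapping.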
Count-Min Sketch (CMS)
Role: Frequency estimation for values, paths, and key names when cardinality exceeds configurable thresholds.
Default configuration: width = 2,718, depth = 5. Total memory: ~54 KB per sketch. Error guarantee: estimated count within 0.1% of total count with 99% probability.
Activation: Exact counting is preferred when it fits in memory. CMS activates as a fallback when distinct value count per path exceeds the threshold (default: 10,000 distinct values).
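The default dimensions follow from the standard Count-Min Sketch bounds: width = e / epsilon and depth = ln(1 / delta). The helper names below are hypothetical, and the 4-byte counter size is an assumption inferred from the ~54 KB figure.

```rust
/// Width and depth for additive error epsilon * N with probability 1 - delta.
fn cms_dimensions(epsilon: f64, delta: f64) -> (usize, usize) {
    let width = (std::f64::consts::E / epsilon).ceil() as usize;
    let depth = (1.0 / delta).ln().ceil() as usize;
    (width, depth)
}

/// Memory for the counter table alone.
fn cms_memory_bytes(width: usize, depth: usize, counter_bytes: usize) -> usize {
    width * depth * counter_bytes
}
```

With epsilon = 0.001 and delta = 0.01, rounding up gives a width of 2,719 and a depth of 5. The 2,718 quoted above corresponds to truncating e / epsilon instead. At 4 bytes per counter, 2,718 x 5 counters occupy 54,360 bytes.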
Space-Saving
Role: Identifying top-k most frequent elements without storing all elements.
Maintains exactly k counters (default k = 100). Guaranteed to include every element whose true frequency exceeds N/k. Memory: k entries, a few KB.
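A minimal sketch of the Space-Saving update rule follows. When all k counters are taken, the smallest one is evicted and the new item inherits its count plus one, so counts can overestimate but never underestimate. The `SpaceSaving` type is hypothetical; breaking ties by key order is a choice made here to keep the result deterministic, and the linear scan for the minimum is for clarity, not speed.

```rust
use std::collections::BTreeMap;

struct SpaceSaving {
    k: usize,
    counters: BTreeMap<String, u64>,
}

impl SpaceSaving {
    fn new(k: usize) -> Self {
        assert!(k > 0, "k must be positive");
        SpaceSaving { k, counters: BTreeMap::new() }
    }

    fn observe(&mut self, item: &str) {
        if let Some(count) = self.counters.get_mut(item) {
            *count += 1;
            return;
        }
        if self.counters.len() < self.k {
            self.counters.insert(item.to_string(), 1);
            return;
        }
        // Find the minimum counter. BTreeMap iterates in key order, so the
        // strict `<` keeps the smallest key among ties.
        let mut victim: Option<(String, u64)> = None;
        for (key, count) in &self.counters {
            if victim.as_ref().map_or(true, |v| *count < v.1) {
                victim = Some((key.clone(), *count));
            }
        }
        let (min_key, min_count) = victim.expect("k > 0 guarantees a counter");
        self.counters.remove(&min_key);
        self.counters.insert(item.to_string(), min_count + 1);
    }
}
```

With k = 3 and the stream a, a, a, a, a, a, b, c, d, e, a, the item "a" (7 of 11 observations, above N/k) survives every eviction with an exact count of 7.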
Memory Budget
The total streaming memory budget is bounded:
| Component | Memory |
|---|---|
| Path trie | O(p) where p = distinct wildcard paths |
| DDSketch (per numeric path) | ~3 KB per path |
| CMS (per high-cardinality path) | ~54 KB per path |
| Space-Saving (per path) | ~4 KB per path (k=100) |
| Type counters (per path) | ~48 bytes per path |
| Null/absent counters (per path) | ~32 bytes per path |
| Fingerprint accumulator | O(current depth) |
For a document with 500 distinct paths, 100 numeric paths, and 50 high-cardinality paths:
Path trie: ~100 KB
DDSketch: ~300 KB (100 paths x 3 KB)
CMS: ~2.7 MB (50 paths x 54 KB)
Space-Saving: ~2.0 MB (500 paths x 4 KB)
Type/null counters: ~40 KB (500 paths x 80 bytes)
Fingerprint: ~10 KB
---
Total: ~5.2 MB
This budget holds regardless of whether the document is 100 MB or 100 GB. The streaming guarantee: bounded memory independent of input size.
DOM vs. Streaming: What Changes
| Capability | DOM Mode | Streaming Mode |
|---|---|---|
| Parsing speed | 2+ GB/s (simd-json) | ~500 MB/s (event parser) |
| Random access | Full | None (sequential events) |
| Exact frequency counts | Yes | Only when cardinality fits in memory; CMS otherwise |
| Exact percentiles | Yes (via sorting) | Approximate (DDSketch, 1% relative error) |
| Exact entropy | Yes | Approximate (from CMS estimates) |
| Motif detection | Full (Merkle subtree hashing) | Partial (incremental, no lookback) |
| Relationship discovery | Full (random access to value pairs) | Partial (co-occurrence counters) |
| Essence quality | Full | Slightly reduced (no selective subtree re-parse in Pass 1) |
Every streaming approximation carries formal error bounds. The output explicitly labels which statistics are exact and which are approximate.
When Each Mode Activates
Document size < streaming_threshold (default 100 MB)
-> DOM mode
Document size >= streaming_threshold
-> Streaming mode (automatic)
--streaming flag present
-> Streaming mode (forced, regardless of size)
The threshold is configurable in the TOML config:
[parsing]
streaming_threshold = 104_857_600 # 100 MB
The StreamAnalyzer Trait
Any analyzer that implements StreamAnalyzer can participate in streaming mode:
pub trait StreamAnalyzer {
    type Accumulator: Default;
    type Output;

    fn on_event(&self, event: &JsonEvent, acc: &mut Self::Accumulator) -> Result<()>;
    fn finalize(&self, acc: Self::Accumulator) -> Result<Self::Output>;
}
The accumulator holds all state. Events arrive one at a time. finalize produces the result when the stream ends.
This trait is the key to extensibility. Custom analyzers that implement it automatically work in both DOM and streaming modes — DOM mode simply feeds all events from the pre-parsed tree.
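To show the shape of a custom analyzer, here is a self-contained sketch that counts how often each key name appears. The `JsonEvent` enum, the `Result` alias, the `KeyCounter` analyzer, and the `run` driver are hypothetical stand-ins so the example compiles on its own; the real types live in Vajra's crates and may differ.

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for Vajra's event and error types.
#[allow(dead_code)]
#[derive(Debug)]
enum JsonEvent {
    StartObject,
    EndObject,
    StartArray,
    EndArray,
    Key(String),
    Value(String),
}
type Result<T> = std::result::Result<T, String>;

trait StreamAnalyzer {
    type Accumulator: Default;
    type Output;
    fn on_event(&self, event: &JsonEvent, acc: &mut Self::Accumulator) -> Result<()>;
    fn finalize(&self, acc: Self::Accumulator) -> Result<Self::Output>;
}

/// Counts occurrences of each key name, reported in sorted key order.
struct KeyCounter;

impl StreamAnalyzer for KeyCounter {
    type Accumulator = BTreeMap<String, u64>;
    type Output = Vec<(String, u64)>;

    fn on_event(&self, event: &JsonEvent, acc: &mut Self::Accumulator) -> Result<()> {
        if let JsonEvent::Key(name) = event {
            *acc.entry(name.clone()).or_insert(0) += 1;
        }
        Ok(())
    }

    fn finalize(&self, acc: Self::Accumulator) -> Result<Self::Output> {
        Ok(acc.into_iter().collect())
    }
}

/// Drives any analyzer over a slice of events, one event at a time.
fn run<A: StreamAnalyzer>(analyzer: &A, events: &[JsonEvent]) -> Result<A::Output> {
    let mut acc = A::Accumulator::default();
    for event in events {
        analyzer.on_event(event, &mut acc)?;
    }
    analyzer.finalize(acc)
}
```

The accumulator is a BTreeMap, so the output order depends only on the key names and never on arrival order or hashing.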
Differential Testing: DOM vs. Streaming
For every document in the test corpus, Vajra runs both modes and asserts:
- CMS frequency estimates are within proven error bounds of exact counts
- DDSketch quantile estimates are within relative accuracy of exact quantiles
- Path sets are identical
- Fingerprints are identical
- Type distributions are identical
This ensures streaming mode is not a second-class citizen. It is a formally bounded approximation of DOM mode, not a degraded fallback.
Determinism
Determinism in Vajra is not a feature. It is a structural guarantee.
Given identical input bytes, identical configuration, and identical Vajra version, the output is identical — byte for byte. Fingerprints, scores, orderings, essence text, anomaly rankings. Every run. Every platform. Every time.
This is what makes Vajra trustworthy in CI pipelines, audits, compliance workflows, and AI systems that depend on stable preprocessing.
The Guarantee
Identical:
- Input bytes
- Configuration (profile, flags, config file)
- Vajra version
Produces identical:
- Fingerprints
- Scores (to floating-point bit-level)
- Orderings
- Essence text (byte-for-byte)
- Anomaly rankings
Sources of Nondeterminism and How Each Is Eliminated
The HashMap Rule
Problem: Rust’s HashMap uses a random seed for its hash function (SipHash with random key by default). Iteration order is nondeterministic. Any code that iterates a HashMap and includes the iteration order in output produces nondeterministic results.
Mitigation: BTreeMap is used for all externally-visible orderings. HashMap is permitted only for internal scratch computations where iteration order is never observed in output.
This is enforced by code review and tested by the determinism test suite. If a HashMap iteration order leaks into output, the 10-run determinism test catches it immediately.
The Thread Scheduling Rule
Problem: Rayon’s parallel batch processing schedules work across threads nondeterministically. If results are merged in arrival order, the output depends on thread scheduling.
Mitigation: Deterministic merge order. After parallel analysis, results are sorted by input identity (file path or record index) before merging. Parallel execution affects speed, never output.
use rayon::prelude::*;

// Parallel analysis
let mut results: Vec<_> = files.par_iter()
    .map(|f| analyze(f))
    .collect();
// Deterministic merge: sorted by input identity, not arrival order
results.sort_by(|a, b| a.input_path.cmp(&b.input_path));
The Floating-Point Accumulation Rule
Problem: Floating-point addition is not associative. (a + b) + c can differ from a + (b + c) at the bit level. If summation order varies (due to thread scheduling, hash map iteration, or unstable sorting), floating-point results drift.
Mitigation: Fixed traversal order. All traversals are DFS, left-to-right. All summations occur in deterministic order defined by the path trie’s BTreeMap-based key ordering. The traversal order is a function of the input alone.
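The effect is easy to reproduce. The sketch below sums the same three values in two orders and gets different bits, then shows that summing in BTreeMap key order makes the result independent of insertion order. The function names and the example paths are hypothetical.

```rust
use std::collections::BTreeMap;

/// Left-to-right summation in the order the values arrive.
fn sum_in_order<I: IntoIterator<Item = f64>>(values: I) -> f64 {
    values.into_iter().fold(0.0, |acc, v| acc + v)
}

/// Sums per-path values in BTreeMap key order, independent of insertion order.
fn sum_by_path(per_path: &BTreeMap<String, f64>) -> f64 {
    sum_in_order(per_path.values().copied())
}
```

Summing 0.1, 0.2, 0.3 left to right gives 0.6000000000000001, while 0.3, 0.2, 0.1 gives 0.6. Keyed by path in a BTreeMap, both insertion orders produce the same bit pattern.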
The Seed Rule
Problem: MinHash and SimHash use hash functions that can be seeded. Different seeds produce different signatures (and different similarity estimates, cluster assignments, etc.).
Mitigation: Default seed is 0. The --seed flag provides explicit control.
vajra cluster batch/*.json # seed = 0 (default, deterministic)
vajra cluster batch/*.json --seed 42 # seed = 42 (different but still deterministic)
Same seed + same input = same output. Different seed = potentially different output. Both are deterministic within their seed.
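A seeded MinHash can be sketched with fixed, platform-independent hash functions. The choice of FNV-1a plus a SplitMix64 finalizer is an assumption of this example; the source does not say which hash family Vajra uses. All function names are hypothetical.

```rust
/// SplitMix64 finalizer: a fixed 64-bit mixer with no hidden state.
fn mix64(z: u64) -> u64 {
    let mut z = z.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// FNV-1a over bytes: deterministic, unlike std's randomly keyed SipHash.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// One minimum per hash function; every hash function is derived from `seed`.
fn minhash_signature(tokens: &[&str], num_hashes: usize, seed: u64) -> Vec<u64> {
    (0..num_hashes as u64)
        .map(|i| {
            let salt = mix64(seed ^ mix64(i));
            tokens
                .iter()
                .map(|t| mix64(fnv1a(t.as_bytes()) ^ salt))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Fraction of signature positions that agree.
fn estimated_jaccard(a: &[u64], b: &[u64]) -> f64 {
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }
    let matches = a.iter().zip(b.iter()).filter(|pair| pair.0 == pair.1).count();
    matches as f64 / a.len() as f64
}
```

Nothing in the signature depends on token order, process state, or the platform: the output is a pure function of the token set, the number of hashes, and the seed.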
The ryu Rule
Problem: Floating-point to string conversion varies across platforms. Rust’s default Display for f64 can produce different decimal representations on different architectures or with different optimization levels.
Mitigation: All float-to-string conversion uses the ryu crate — Ulf Adams’ algorithm (2018) for shortest round-trip-safe decimal representation. ryu is deterministic and platform-independent. The same f64 bit pattern produces the same string on every platform Vajra supports.
// Not this:
let s = format!("{}", value); // platform-dependent

// This (the buffer must outlive the returned &str):
let mut buf = ryu::Buffer::new();
let s = buf.format(value); // deterministic, platform-independent
The Canonicalization Rule
Problem: JSON objects are unordered by specification. {"a": 1, "b": 2} and {"b": 2, "a": 1} are semantically identical but textually different. Any operation that depends on key order (hashing, fingerprinting, rendering) must first impose a deterministic order.
Mitigation: RFC 8785 canonicalization. Keys sorted by UTF-16 code unit sequence (the RFC’s specified ordering). Numbers formatted deterministically. Strings are additionally Unicode NFC normalized; this step is Vajra’s own, since RFC 8785 itself leaves string data unnormalized. Applied before any hashing, fingerprinting, or comparison operation.
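The UTF-16 ordering matters because it differs from Rust's default string ordering (UTF-8 bytes, equivalent to code point order) for characters outside the Basic Multilingual Plane. A minimal sketch of the key comparison, with a hypothetical function name:

```rust
/// Sorts object keys by UTF-16 code unit sequence, as RFC 8785 specifies.
fn sort_keys_utf16(keys: &mut Vec<String>) {
    keys.sort_by(|a, b| a.encode_utf16().cmp(b.encode_utf16()));
}
```

Take the keys U+FF5E (fullwidth tilde) and U+1F600 (an emoji). In UTF-16 the emoji becomes the surrogate pair D83D DE00, which sorts before FF5E, while by code point it sorts after. Sorting with the default `String` ordering would therefore produce a different canonical form and a different hash.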
The Unicode Rule
Problem: The same visual string can have multiple Unicode representations. “e with acute accent” can be a single codepoint (U+00E9) or a base character plus combining mark (U+0065 U+0301). If these are treated as different strings, frequency counts, entropy, and fingerprints diverge.
Mitigation: Unicode NFC normalization (UAX #15) applied during canonicalization. All string comparisons, frequency counting, and hashing operate on NFC-normalized forms.
Verifying Determinism: The 10-Run Test
The determinism test suite runs every command on every document in the test corpus:
- Run Vajra N times (N >= 10) with identical configuration.
- Assert byte-identical output across all runs.
- Run with --seed 0 and --seed 42; outputs may differ between seeds.
- Run each seed N times and assert identical within-seed output.
# Manual verification
for i in $(seq 1 10); do
vajra essence claim.json --profile engineer --format json > "run_$i.json"
done
# All files must be identical
md5sum run_*.json
# Every line shows the same hash
If any two runs produce different output, the determinism contract is broken. This test runs in CI on every commit.
What Determinism Costs
The determinism guarantee imposes real engineering costs:
| Constraint | Cost | Payoff |
|---|---|---|
| BTreeMap everywhere | ~10-20% slower than HashMap for insertion-heavy code | Deterministic iteration order |
| Fixed traversal order | Cannot parallelize within-document traversal for speed | Deterministic accumulation |
| ryu for float formatting | Additional dependency | Platform-independent output |
| Seeded PRNG for MinHash | Cannot use hardware RNG for “better” randomness | Reproducible signatures |
| Deterministic merge order | Sorting step after parallel batch processing | Reproducible batch results |
Every cost is paid gladly. Determinism is not negotiable. Speed optimizations that violate it are rejected.
What Determinism Does NOT Cover
Determinism applies to the mapping from (input, config, version) to output. It does not mean:
- Different versions produce the same output. Algorithm changes, bug fixes, and threshold adjustments may change output between versions. The version is part of the contract.
- Different configs produce the same output. Changing the profile, the seed, the budget, or any flag may change output. The config is part of the contract.
- Streaming mode matches DOM mode exactly. Streaming mode uses approximate algorithms (DDSketch, CMS) that produce bounded approximations of DOM mode’s exact results. Both modes are internally deterministic. They may differ from each other within the documented error bounds.
For Library Users
The determinism guarantee extends to the Rust library API. If you call the same analyzer with the same Document and the same configuration, you get the same result.
use vajra_core::Document;
use vajra_stats::StatsAnalyzer;
use vajra_types::Analyzer;

let doc = Document::parse_file("claim.json")?;
let stats1 = StatsAnalyzer.analyze(&doc)?;
let stats2 = StatsAnalyzer.analyze(&doc)?;

// stats1 and stats2 are identical at the bit level
assert_eq!(
    serde_json::to_string(&stats1)?,
    serde_json::to_string(&stats2)?
);
Domain Plugins
Core Vajra is domain-agnostic. It analyzes structure, statistics, and deviation from norms — without knowing what the data represents. Domain intelligence enters through plugins that extend the engine without contaminating it.
A plugin does not change what Vajra computes. It enriches what Vajra knows.
The Plugin Architecture
Plugins contribute four kinds of extensions:
- Type recognizers — pattern matchers that identify domain-specific value types (ICD-10 codes, NPIs, SWIFT codes)
- Concern profiles — custom scoring weight vectors and rendering templates
- Relationship hints — domain knowledge about which fields form logical groups
- Custom renderers — domain-specific essence rendering templates
Plugins cannot modify the core analysis pipeline, access the filesystem beyond their own configuration, make network calls, or mutate the input document. They are additive. They are isolated.
The VajraPlugin Trait
pub trait VajraPlugin: Send + Sync {
    /// Plugin identifier.
    fn name(&self) -> &str;

    /// Plugin version string.
    fn version(&self) -> &str;

    /// Additional type recognizers beyond the core DFA bank.
    /// These run alongside the core recognizers during semantic lifting.
    fn type_recognizers(&self) -> Vec<Box<dyn TypeRecognizer>> {
        vec![]
    }

    /// Additional concern profile definitions.
    /// These appear alongside built-in profiles in `vajra profiles`.
    fn concern_profiles(&self) -> Vec<Box<dyn ConcernProfile>> {
        vec![]
    }

    /// Field relationship heuristics.
    /// Example: "code + description + system = coded concept"
    fn relationship_hints(&self) -> Vec<RelationshipHint> {
        vec![]
    }

    /// Custom rendering templates for essence output.
    fn renderers(&self) -> Vec<Box<dyn EssenceRenderer>> {
        vec![]
    }
}
Every method has a default implementation that returns empty. A plugin can implement only the capabilities it needs.
TypeRecognizer
Type recognizers extend Vajra’s semantic lifting layer. They match raw string values against domain-specific patterns.
pub trait TypeRecognizer: Send + Sync {
    /// The name of the recognized type (e.g., "ICD-10-CM", "CPT", "NPI").
    fn type_name(&self) -> &str;

    /// Returns true if the value matches this type's pattern.
    fn matches(&self, value: &str) -> bool;

    /// Optional confidence level for the match.
    fn confidence(&self, value: &str) -> f64 {
        if self.matches(value) { 1.0 } else { 0.0 }
    }
}
Type recognizers run during Layer 4 (Semantic Lifting) of the engine pipeline. They are evaluated after the core DFA bank, allowing domain-specific patterns to augment — not override — the core type inference.
RelationshipHint
Relationship hints tell Vajra that certain field combinations form logical groups:
pub struct RelationshipHint {
    /// Fields that form a logical group when co-located.
    pub field_patterns: Vec<String>,
    /// Name for this relationship.
    pub name: String,
    /// Description of what the group represents.
    pub description: String,
}
Example from the medical plugin:
RelationshipHint {
    field_patterns: vec![
        "code".to_string(),
        "system".to_string(),
        "display".to_string(),
    ],
    name: "coded-concept".to_string(),
    description: "A coded value with its coding system and human-readable display".to_string(),
}
When Vajra finds code, system, and display as sibling keys in an object, the medical plugin’s relationship hint identifies this as a coded concept — not three independent strings.
The Medical Plugin: vajra-domain-med
The medical plugin is the reference implementation. It demonstrates every plugin capability.
Type Recognizers
| Recognized Type | Pattern | Example Values |
|---|---|---|
| ICD-10-CM | [A-Z][0-9]{2}(\.[0-9A-Z]{1,4})? | E11.9, J44.1, M54.5 |
| ICD-10-PCS | [0-9A-HJ-NP-Z]{7} | 0SG00ZJ |
| CPT | [0-9]{5} (with known range validation) | 99213, 99214, 27447 |
| HCPCS | [A-V][0-9]{4} | J0129, G0438 |
| NDC | [0-9]{4,5}-[0-9]{3,4}-[0-9]{1,2} | 0069-0770-01 |
| NPI | [0-9]{10} (with Luhn check) | 1234567893 |
| Denial Reason | (CO\|PR\|OA\|PI\|CR)-[0-9]{1,3} | CO-45, PR-1, OA-23 |
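The "with Luhn check" qualifier on NPI is what separates a real provider identifier from any ten-digit number. A minimal sketch of that check follows, using the standard NPI rule of running Luhn over the identifier prefixed with 80840. The function name is hypothetical and this is not the plugin's code.

```rust
/// Returns true if `value` is ten ASCII digits that pass the NPI Luhn check.
fn is_valid_npi(value: &str) -> bool {
    if value.len() != 10 || !value.bytes().all(|b| b.is_ascii_digit()) {
        return false;
    }
    // NPIs are validated as 15-digit numbers with the fixed prefix 80840.
    let digits: Vec<u32> = "80840"
        .bytes()
        .chain(value.bytes())
        .map(|b| (b - b'0') as u32)
        .collect();
    // Luhn: from the right, double every second digit and subtract 9 if above 9.
    let sum: u32 = digits
        .iter()
        .rev()
        .enumerate()
        .map(|(i, d)| {
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                *d
            }
        })
        .sum();
    sum % 10 == 0
}
```

The example value 1234567893 passes. Changing the last digit to 0 fails the check, so 1234567890 would stay a plain string.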
Relationship Hints
| Hint | Fields | Meaning |
|---|---|---|
| Coded Concept | code, system, display | A value from a terminology system |
| Service Line | procedure_code, charge_amount, service_date, status | A line item on a claim |
| Patient Identity | patient.id, patient.name, patient.dob | Patient demographic group |
| Provider Identity | provider.npi, provider.name, provider.taxonomy | Provider identification group |
| Adjudication | allowed_amount, paid_amount, status, adjustment | Payment determination group |
What It Enables
With the medical plugin loaded, vajra inspect on a medical claim produces:
=== Domain Type Recognition ===
$.claims[*].diagnosis[*].code E11.9 ICD-10-CM
$.claims[*].diagnosis[*].code J44.1 ICD-10-CM
$.claims[*].service_lines[*].procedure_code 99213 CPT
$.claims[*].provider.npi 1234567893 NPI
$.claims[*].service_lines[*].adjustment.reason CO-45 Denial Reason
Without the plugin, those values are just strings. With it, they are clinically meaningful codes.
Building Your Own Plugin
Step 1: Create a Crate
cargo new vajra-domain-finance --lib
Step 2: Depend on vajra-types
# Cargo.toml
[dependencies]
vajra-types = { version = "0.1", path = "../vajra-types" }
Step 3: Implement the Trait
use vajra_types::traits::{VajraPlugin, TypeRecognizer, RelationshipHint};

pub struct FinancePlugin;

impl VajraPlugin for FinancePlugin {
    fn name(&self) -> &str { "finance" }
    fn version(&self) -> &str { "0.1.0" }

    fn type_recognizers(&self) -> Vec<Box<dyn TypeRecognizer>> {
        // IbanRecognizer and CurrencyCodeRecognizer follow the same shape as
        // SwiftCodeRecognizer below; their definitions are omitted here.
        vec![
            Box::new(SwiftCodeRecognizer),
            Box::new(IbanRecognizer),
            Box::new(CurrencyCodeRecognizer),
        ]
    }

    fn relationship_hints(&self) -> Vec<RelationshipHint> {
        vec![
            RelationshipHint {
                field_patterns: vec![
                    "amount".to_string(),
                    "currency".to_string(),
                ],
                name: "monetary-value".to_string(),
                description: "Amount with its currency denomination".to_string(),
            },
        ]
    }
}

struct SwiftCodeRecognizer;

impl TypeRecognizer for SwiftCodeRecognizer {
    fn type_name(&self) -> &str { "SWIFT/BIC" }

    fn matches(&self, value: &str) -> bool {
        let len = value.len();
        // Check ASCII first: the byte-range slices below would panic if an
        // index fell inside a multi-byte character.
        value.is_ascii()
            && (len == 8 || len == 11)
            && value[..4].chars().all(|c| c.is_ascii_uppercase())     // bank code
            && value[4..6].chars().all(|c| c.is_ascii_uppercase())    // country code
            && value[6..8].chars().all(|c| c.is_ascii_alphanumeric()) // location code
            && value[8..].chars().all(|c| c.is_ascii_alphanumeric())  // optional branch code
    }
}
Step 4: Register the Plugin
Static plugins are compiled into the binary at build time by adding the crate to vajra-cli’s dependencies.
Dynamic plugins are loaded at runtime via libloading from the plugin directory (default: ~/.vajra/plugins/).
Error Isolation
Plugins run in an isolation boundary. If a plugin panics or returns an error:
- The panic is caught at the plugin boundary (via std::panic::catch_unwind).
- Core analysis continues without the plugin’s contributions.
- The plugin failure is recorded in the output’s provenance metadata.
- A diagnostic message is emitted to stderr.
vajra: plugin "finance" failed during type recognition: index out of bounds
vajra: continuing analysis without finance plugin contributions
No plugin failure can crash Vajra. No plugin can corrupt the core analysis. The isolation is structural, not aspirational.
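The boundary can be sketched with the standard library alone. The `call_plugin` helper below is hypothetical; it shows one way to turn a panic inside a plugin callback into an error value that the caller can log and skip. It relies on the default unwinding panic strategy and does nothing for builds compiled with panic = "abort".

```rust
use std::panic::{self, AssertUnwindSafe};

/// Runs a plugin callback and converts a panic into an error message.
fn call_plugin<T, F: FnOnce() -> T>(plugin_name: &str, f: F) -> Result<T, String> {
    panic::catch_unwind(AssertUnwindSafe(f)).map_err(|payload| {
        // Panic payloads are usually a &str or a String.
        let reason = payload
            .downcast_ref::<&str>()
            .map(|s| s.to_string())
            .or_else(|| payload.downcast_ref::<String>().cloned())
            .unwrap_or_else(|| "unknown panic".to_string());
        format!("plugin \"{}\" failed: {}", plugin_name, reason)
    })
}
```

A caller would match on the result, record the error in provenance metadata, and continue with the remaining recognizers. `AssertUnwindSafe` is acceptable here only because the caller discards everything the failed callback touched.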
Plugin Constraints
A plugin may:
- Register type recognizers, profiles, relationship hints, and renderers
- Read its own configuration files
- Use any safe Rust code internally
A plugin may not:
- Modify the core analysis pipeline
- Access the filesystem beyond its own config directory
- Make network calls
- Mutate the input document
- Introduce nondeterminism (all plugin methods must be deterministic)
Shipped Plugins
Six domain plugins ship with Vajra, all enabled by default via feature flags:
| Domain | Plugin | Type Recognizers | Hints |
|---|---|---|---|
| Medical / EDI | vajra-domain-med | ICD-10, CPT, HCPCS, NDC, NPI, Diagnosis Code | 6 (claim service line, diagnosis, patient, provider, adjudication, denial) |
| Security | vajra-domain-sec | CVE, IPv4, IPv6, CIDR, MAC, SHA-256, SHA-1, MD5, JWT, MITRE ATT&CK Technique, MITRE Tactic, CVSS | 6 (network flow, alert classification, vulnerability, auth, process execution, DNS) |
| DevOps | vajra-domain-devops | Container ID, Semver, Git SHA, Docker Image, AWS ARN, GCP Resource, CIDR, Cron, K8s Namespace, Terraform Resource | 6 (K8s pod spec, deployment metadata, service endpoint, Terraform, CI pipeline, container spec) |
| Source Code | vajra-domain-source | snake_case, camelCase, PascalCase, SCREAMING_SNAKE, import paths, source file paths | 6 (function definition, class definition, import statement, parameter list, conditional, loop) |
| Encoding | vajra-domain-encoding | Base64, Base64URL, hex, URL-encoded, HTML entities, Unicode escapes, PEM, data URI, quoted-printable, MIME encoded word, Punycode, double-encoded, mixed-encoding | 3 (content+encoding, transfer encoding, encoded/decoded pairs) |
| GitHub | vajra-domain-github | PR number, issue number, GitHub username, repo slug, commit SHA, branch name, label, milestone, review state, merge method | 7 (pull request, issue, review, commit, release, workflow run, discussion) |
Feature Flags
# vajra-cli/Cargo.toml
[features]
default = ["medical", "security", "devops", "source", "encoding", "github"]
medical = ["vajra-domain-med"]
security = ["vajra-domain-sec"]
devops = ["vajra-domain-devops"]
source = ["vajra-source", "vajra-domain-source"]
encoding = ["vajra-domain-encoding"]
github = ["vajra-domain-github"]
all-plugins = ["medical", "security", "devops", "source", "encoding", "github"]
Build with only selected plugins (here, security and DevOps): cargo build --no-default-features --features security,devops
The Security Plugin: vajra-domain-sec
The security plugin recognizes types commonly found in SIEM events, vulnerability scans, threat intelligence feeds, and network flow data.
Type Recognizers
| Recognized Type | Pattern | Example Values |
|---|---|---|
| CVE ID | CVE-YYYY-NNNNN | CVE-2024-3400, CVE-2023-44487 |
| IPv4 | Dotted-quad, each octet 0-255 | 192.168.1.1, 10.0.0.1 |
| IPv6 | Full, compressed, mixed notation | 2001:db8::1, ::1 |
| CIDR | IPv4/prefix (0-32) | 10.0.0.0/8, 192.168.1.0/24 |
| MAC Address | Colon or hyphen separated | aa:bb:cc:dd:ee:ff |
| SHA-256 | 64 lowercase hex chars | e3b0c44298fc1c14... |
| SHA-1 | 40 lowercase hex chars | da39a3ee5e6b4b0d... |
| MD5 | 32 lowercase hex chars | d41d8cd98f00b204... |
| JWT | eyJ...\.eyJ...\.sig | JSON Web Tokens |
| MITRE ATT&CK Technique | T\d{4}(\.\d{3})? | T1059, T1059.001 |
| MITRE ATT&CK Tactic | TA\d{4} | TA0001, TA0040 |
| CVSS Vector | CVSS:3.x/AV:.../... | Full CVSS v3 vector strings |
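Two of the patterns in this table are simple enough to check without a regex engine. The sketch below is illustrative only: `is_cve_id` and `is_attack_technique` are hypothetical helper names, and the CVE check assumes a 4-digit year followed by four or more sequence digits (which is what the example values in the table require).

```rust
/// Returns true for `CVE-YYYY-NNNN...`: a 4-digit year and 4+ sequence digits.
pub fn is_cve_id(value: &str) -> bool {
    let mut parts = value.split('-');
    match (parts.next(), parts.next(), parts.next(), parts.next()) {
        (Some("CVE"), Some(year), Some(seq), None) => {
            year.len() == 4
                && year.bytes().all(|b| b.is_ascii_digit())
                && seq.len() >= 4
                && seq.bytes().all(|b| b.is_ascii_digit())
        }
        _ => false,
    }
}

/// Returns true for `T` + 4 digits, optionally followed by `.` + 3 digits.
pub fn is_attack_technique(value: &str) -> bool {
    let Some(rest) = value.strip_prefix('T') else {
        return false;
    };
    let (base, sub) = match rest.split_once('.') {
        Some((base, sub)) => (base, Some(sub)),
        None => (rest, None),
    };
    // Exactly `n` ASCII digits.
    let digits = |s: &str, n: usize| s.len() == n && s.bytes().all(|b| b.is_ascii_digit());
    digits(base, 4) && sub.map_or(true, |s| digits(s, 3))
}
```

Splitting on the literal `.` is what the escaped `\.` in the technique pattern expresses: `T1059.001` matches, `T1059x001` does not.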
The DevOps Plugin: vajra-domain-devops
The DevOps plugin recognizes types in Kubernetes manifests, Terraform state, CI/CD pipeline output, Docker configurations, and cloud infrastructure JSON.
Type Recognizers
| Recognized Type | Pattern | Example Values |
|---|---|---|
| Container ID | 12 or 64 lowercase hex chars | a1b2c3d4e5f6 |
| Semver | v?MAJOR.MINOR.PATCH(-pre)?(+build)? | v1.2.3, 1.0.0-beta.1 |
| Git SHA | 7-12 or 40 lowercase hex chars | a1b2c3d, full 40-char SHA |
| Docker Image | [registry/]repo:tag or repo@sha256:digest | nginx:latest, gcr.io/proj/img:v1 |
| AWS ARN | arn:aws:service:region:account:resource | arn:aws:s3:::my-bucket |
| GCP Resource | projects/*/... or organizations/*/... | projects/my-proj/topics/t |
| CIDR Block | IPv4/prefix (0-32) | 10.0.0.0/16 |
| Cron Expression | 5-field cron pattern | 0 */6 * * * |
| K8s Namespace | DNS-1123 labels, known system namespaces | kube-system, my-app-staging |
| Terraform Resource | provider_type.name | aws_instance.web |
The Source Code Plugin: vajra-domain-source
The source code plugin recognizes patterns in the JSON trees produced by vajra-source (tree-sitter CST-to-JSON output). It works alongside vajra-source, which handles the parsing.
Type Recognizers
| Recognized Type | Pattern | Example Values |
|---|---|---|
| snake_case identifier | [a-z][a-z0-9]*(_[a-z0-9]+)+ | my_function, get_value |
| camelCase identifier | [a-z]...[A-Z]... | myFunction, getValue |
| PascalCase identifier | [A-Z][a-zA-Z0-9]+ | MyClass, HttpClient |
| SCREAMING_SNAKE_CASE | [A-Z][A-Z0-9]*(_[A-Z0-9]+)+ | MAX_SIZE, HTTP_STATUS |
| Import path | mod::path or pkg.Class or @scope/pkg | std::collections::HashMap |
| Source file path | Path ending in .rs, .py, .go, etc. | src/main.rs, lib/utils.py |
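The four naming-convention rows can be approximated with plain byte checks. This is a simplified sketch under stated assumptions, not the plugin's implementation: `Naming` and `classify_identifier` are hypothetical names, and PascalCase is additionally required to contain a lowercase letter so that it stays distinct from all-caps identifiers.

```rust
/// Naming conventions from the table above, plus a fallback.
#[derive(Debug, PartialEq, Eq)]
pub enum Naming {
    Snake,
    Camel,
    Pascal,
    ScreamingSnake,
    Unknown,
}

/// Classifies an identifier by its case and underscore usage.
pub fn classify_identifier(name: &str) -> Naming {
    let Some(&first) = name.as_bytes().first() else {
        return Naming::Unknown;
    };
    if !name.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'_') {
        return Naming::Unknown;
    }
    let has_underscore = name.contains('_');
    let has_upper = name.bytes().any(|b| b.is_ascii_uppercase());
    let has_lower = name.bytes().any(|b| b.is_ascii_lowercase());
    // Rejects leading, trailing, and doubled underscores.
    let segments_ok = name.split('_').all(|segment| !segment.is_empty());

    if first.is_ascii_lowercase() {
        if has_underscore && !has_upper && segments_ok {
            return Naming::Snake;
        }
        if !has_underscore && has_upper {
            return Naming::Camel;
        }
    } else if first.is_ascii_uppercase() {
        if has_underscore && !has_lower && segments_ok {
            return Naming::ScreamingSnake;
        }
        if !has_underscore && has_lower {
            return Naming::Pascal;
        }
    }
    Naming::Unknown
}
```

A single lowercase word such as `value` is deliberately left as `Unknown`, since the snake_case pattern in the table requires at least one underscore.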
Relationship Hints
| Hint | Pattern | Meaning |
|---|---|---|
| Function definition | name + parameters + body | A function or method |
| Class definition | name + body + inheritance | A class or struct |
| Import statement | path + optional alias | A use/import declaration |
| Parameter list | type + name pairs | Function parameters |
| Conditional block | condition + consequence + alternative | An if/else construct |
| Loop block | condition/iterator + body | A for/while loop |
The Encoding Plugin: vajra-domain-encoding
The encoding plugin detects data encodings embedded in JSON string values. It identifies Base64, hex, URL encoding, HTML entities, PEM certificates, and more — including adversarial patterns like double encoding and mixed encoding used for evasion.
Type Recognizers (3 Tiers)
Tier 1 — Definite confidence (structural markers, near-zero false positives):
| Recognized Type | Pattern | Example Values |
|---|---|---|
| PEM block | -----BEGIN ...----- prefix/suffix | Certificates, private keys |
| Data URI | data:mime;base64,... | Embedded images, payloads |
| MIME encoded word | =?charset?B/Q?...?= | Email header encoding |
| Punycode | xn-- prefix | Internationalized domain names |
Tier 2 — Dominant confidence (strong patterns, low false positives):
| Recognized Type | Pattern | Example Values |
|---|---|---|
| URL encoded | 2+ %XX sequences + trial decode | hello%20world%21 |
| Quoted-printable | 3+ =XX sequences | MIME email encoding |
| HTML entity | 2+ &...; entities | &lt;script&gt; |
| Unicode escape | 2+ \uXXXX or \xNN | \u0048\u0065 |
| Base64URL | 16+ chars, URL-safe alphabet | API tokens, URL-safe data |
Tier 3 — Heuristic (aggressive false positive gating):
| Recognized Type | Detection | Security Signal |
|---|---|---|
| Base64 | 24+ chars, div-by-4, trial decode, entropy gate | Obfuscated payloads, exfiltration |
| Hex encoded | 32+ chars, excludes known hash lengths | Shellcode, binary blobs |
| Double encoded | Decode reveals another encoding | Evasion technique (%253C → %3C → <) |
| Mixed encoding | 2+ encoding types in one value | Obfuscation, WAF bypass |
Layer Peeling API
Beyond type recognition, the plugin provides detect_encoding_layers() for recursive analysis:
#![allow(unused)]
fn main() {
use vajra_domain_encoding::detect_encoding_layers;
let layers = detect_encoding_layers("%2548ello%2520world", 5);
// Returns: [url_encoded(depth=0), url_encoded(depth=1)]
}
Bounded at depth 5, decode capped at 4KB per layer. Catches base64(url(hex(payload))).
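To make the layer-peeling idea concrete, here is a minimal sketch restricted to URL encoding. It is not the plugin's `detect_encoding_layers()`; `percent_decode_once` and `peel_url_layers` are hypothetical helpers, and the 4KB per-layer cap is omitted for brevity. Each pass decodes `%XX` escapes exactly once, and the loop stops when a pass changes nothing or the depth bound is reached.

```rust
/// Converts one ASCII hex digit to its value.
fn hex_val(b: u8) -> Option<u8> {
    match b {
        b'0'..=b'9' => Some(b - b'0'),
        b'a'..=b'f' => Some(b - b'a' + 10),
        b'A'..=b'F' => Some(b - b'A' + 10),
        _ => None,
    }
}

/// Decodes `%XX` escapes in a single pass. Returns `None` when nothing was
/// decoded or when the decoded bytes are not valid UTF-8.
pub fn percent_decode_once(input: &str) -> Option<String> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut decoded_any = false;
    let mut i = 0;
    while i < bytes.len() {
        let escape = if bytes[i] == b'%' && i + 2 < bytes.len() {
            hex_val(bytes[i + 1]).zip(hex_val(bytes[i + 2]))
        } else {
            None
        };
        match escape {
            Some((hi, lo)) => {
                out.push(hi * 16 + lo);
                decoded_any = true;
                i += 3;
            }
            None => {
                out.push(bytes[i]);
                i += 1;
            }
        }
    }
    if decoded_any { String::from_utf8(out).ok() } else { None }
}

/// Peels URL-encoding layers up to `max_depth`.
/// Returns the number of layers removed and the innermost value reached.
pub fn peel_url_layers(input: &str, max_depth: usize) -> (usize, String) {
    let mut current = input.to_string();
    let mut layers = 0;
    while layers < max_depth {
        match percent_decode_once(&current) {
            Some(next) => {
                current = next;
                layers += 1;
            }
            None => break,
        }
    }
    (layers, current)
}
```

For the input used above, the first pass turns `%2548ello%2520world` into `%48ello%20world` and the second pass yields `Hello world`, which is why two layers are reported.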
The GitHub Plugin: vajra-domain-github
The GitHub plugin recognizes types commonly found in GitHub API responses, webhook payloads, and exported repository data (PRs, issues, commits, reviews, releases, workflow runs).
Type Recognizers
| Recognized Type | Pattern | Priority | Confidence | Example Values |
|---|---|---|---|---|
| PR Number | #\d+ or bare integer in PR context | 10 | 0.90 | #142, 1587 |
| Issue Number | #\d+ or bare integer in issue context | 10 | 0.90 | #23, 456 |
| GitHub Username | [a-zA-Z0-9](-?[a-zA-Z0-9]){0,38} | 20 | 0.75 | copyleftdev, octocat |
| Repo Slug | owner/repo pattern | 15 | 0.85 | copyleftdev/vajra, rust-lang/rust |
| Commit SHA | 7-40 hex chars in commit context | 10 | 0.95 | a1b2c3d, full 40-char SHA |
| Branch Name | Ref-like strings with / separators | 25 | 0.70 | main, feature/cascade-cmd |
| Label | Known label patterns (bug, enhancement, etc.) | 30 | 0.65 | bug, good first issue |
| Milestone | Version-like or sprint-like strings | 30 | 0.60 | v1.0, Sprint 12 |
| Review State | One of: approved, changes_requested, commented, dismissed | 5 | 1.00 | approved, changes_requested |
| Merge Method | One of: merge, squash, rebase | 5 | 1.00 | squash, rebase |
Relationship Hints
| Hint | Field Patterns | Meaning |
|---|---|---|
| Pull Request | number, title, state, author, base, head | A pull request record |
| Issue | number, title, state, labels, assignees | An issue record |
| Review | author, state, body, submitted_at | A PR review |
| Commit | sha, message, author, date | A commit record |
| Release | tag_name, name, published_at, assets | A release record |
| Workflow Run | name, status, conclusion, run_number | A CI workflow run |
| Discussion | title, author, category, answer | A GitHub discussion |
Future Plugin Domains
The architecture supports any domain:
| Domain | Plugin | Type Recognizers |
|---|---|---|
| Financial | vajra-domain-finance | SWIFT, IBAN, CUSIP, currency codes |
| Telecom | vajra-domain-telecom | E.164 numbers, IMSI, CDR fields |
| IoT / Sensor | vajra-domain-iot | Sensor types, unit patterns, device IDs |
Architecture
Vajra is a Rust workspace of 17 crates. Each crate has a single responsibility. Dependencies flow downward. Nothing cycles.
The 17-Crate Workspace
vajra/
├── vajra-types/ Shared types, traits, contracts
├── vajra-core/ Parsing, traversal, canonicalization, path extraction
├── vajra-fingerprint/ BLAKE3 hashing, Merkle trees, MinHash, SimHash, LSH
├── vajra-stats/ CMS, Space-Saving, DDSketch, MAD, entropy, frequency
├── vajra-anomaly/ Outlier scoring, instability, rarity, structural anomaly
├── vajra-drift/ JSD, Wasserstein, path diff, drift classification
├── vajra-motif/ Motif counting, near-motif grouping, motif compression
├── vajra-essence/ Profiles, scoring, ranking, rendering, templates
├── vajra-query/ Expression parsing, path filtering, analysis functions
├── vajra-source/ Source code parsing via tree-sitter (Rust, Python, Go, JS, +5)
├── vajra-cli/ CLI argument parsing, command dispatch, output formatting
├── vajra-domain-med/ Medical/EDI type recognizers (ICD-10, CPT, NPI, NDC, HCPCS)
├── vajra-domain-sec/ Security type recognizers (CVE, MITRE ATT&CK, IPs, hashes, JWT)
├── vajra-domain-devops/ DevOps type recognizers (K8s, Docker, Terraform, ARN, semver)
├── vajra-domain-source/ Source code recognizers (naming conventions, import paths)
├── vajra-domain-encoding/ Encoding detection (Base64, hex, URL, PEM, layers)
├── vajra-domain-github/ GitHub type recognizers (PRs, issues, commits, reviews, workflow runs)
└── Cargo.toml Workspace root
Dependency Graph
vajra-types
/ | \
/ | \
vajra-core | vajra-domain-{med,sec,devops}
/ \ | /
/ \ | /
vajra-fingerprint vajra-stats
| \ / |
| \ / |
| vajra-anomaly
| |
| vajra-drift
| |
| vajra-motif
| / |
| / |
vajra-essence
|
vajra-query
|
vajra-cli
Root crates (no internal dependencies):
- vajra-types: shared types, trait definitions, result contracts
- vajra-core depends only on vajra-types
Leaf crate (depends on everything):
- vajra-cli: the binary. It orchestrates all other crates.
Crate Responsibilities
vajra-types
The foundation. Shared types that every crate depends on.
- Document: the parsed document model (value tree + path trie + metadata)
- WildcardPath: normalized path representation with [*] array indices
- PathTrie: trie data structure for efficient path storage and lookup
- FeatureStore: per-path feature vectors
- JsonType: enum of JSON types (object, array, string, number, boolean, null)
- Core traits: Analyzer, StreamAnalyzer, FeatureExtractor, ConcernProfile, Fingerprinter, DriftDetector
#![allow(unused)]
fn main() {
pub trait Analyzer {
type Output;
fn analyze(&self, doc: &Document) -> Result<Self::Output>;
}
pub trait StreamAnalyzer {
type Accumulator: Default;
type Output;
fn on_event(&self, event: &JsonEvent, acc: &mut Self::Accumulator) -> Result<()>;
fn finalize(&self, acc: Self::Accumulator) -> Result<Self::Output>;
}
}
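As an illustration of the streaming contract, the sketch below implements a `StreamAnalyzer` that tracks maximum nesting depth. The trait is repeated so the example is self-contained; `JsonEvent` here is a simplified, hypothetical event type (the real one is not shown in this document), `Result` is aliased to a `String` error, and `DepthAcc` and `MaxDepthAnalyzer` are invented for the example.

```rust
/// Simplified, hypothetical event type for illustration.
pub enum JsonEvent {
    StartObject,
    EndObject,
    StartArray,
    EndArray,
    Scalar,
}

/// Stand-in for the crate's Result alias; the real error type may differ.
pub type Result<T> = std::result::Result<T, String>;

pub trait StreamAnalyzer {
    type Accumulator: Default;
    type Output;
    fn on_event(&self, event: &JsonEvent, acc: &mut Self::Accumulator) -> Result<()>;
    fn finalize(&self, acc: Self::Accumulator) -> Result<Self::Output>;
}

/// Running state: current depth and the deepest level seen so far.
#[derive(Default)]
pub struct DepthAcc {
    current: usize,
    max: usize,
}

pub struct MaxDepthAnalyzer;

impl StreamAnalyzer for MaxDepthAnalyzer {
    type Accumulator = DepthAcc;
    type Output = usize;

    fn on_event(&self, event: &JsonEvent, acc: &mut DepthAcc) -> Result<()> {
        match event {
            JsonEvent::StartObject | JsonEvent::StartArray => {
                acc.current += 1;
                acc.max = acc.max.max(acc.current);
            }
            JsonEvent::EndObject | JsonEvent::EndArray => {
                // An unmatched close is reported as an error, never a panic.
                acc.current = acc
                    .current
                    .checked_sub(1)
                    .ok_or("unbalanced close event")?;
            }
            JsonEvent::Scalar => {}
        }
        Ok(())
    }

    fn finalize(&self, acc: DepthAcc) -> Result<usize> {
        if acc.current == 0 {
            Ok(acc.max)
        } else {
            Err("unclosed container at end of stream".to_string())
        }
    }
}
```

The analyzer itself holds no state; everything mutable lives in the accumulator, so memory use stays constant regardless of document size.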
vajra-core
Parsing, traversal, and the foundational index.
- simd-json integration for DOM-mode parsing
- Multi-format input support (JSON, NDJSON, YAML, CSV, TSV, Markdown, PDF)
- Compression handling (gzip, zstd)
- HTTP URL fetching
- RFC 8785 canonicalization
- DFS path extraction and path trie construction
- Unicode NFC normalization
- Redaction engine (vajra_core::redact)
- Input hardening (depth limits, string length limits, size limits)
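The DFS path extraction step can be sketched on a toy value model. The `Json` enum and `wildcard_paths` function below are hypothetical and far simpler than Vajra's Document and PathTrie; the point is only to show how array indices collapse into `[*]` so that every element of an array contributes to the same path.

```rust
use std::collections::BTreeSet;

/// Minimal JSON value model for illustration only.
pub enum Json {
    Null,
    Bool(bool),
    Number(f64),
    String(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Collects every leaf path depth-first, replacing array indices with `[*]`.
/// Empty objects and arrays contribute no path in this simplified version.
pub fn wildcard_paths(root: &Json) -> BTreeSet<String> {
    fn walk(node: &Json, prefix: &str, out: &mut BTreeSet<String>) {
        match node {
            Json::Object(fields) => {
                for (key, child) in fields {
                    walk(child, &format!("{prefix}.{key}"), out);
                }
            }
            Json::Array(items) => {
                for child in items {
                    walk(child, &format!("{prefix}[*]"), out);
                }
            }
            _ => {
                out.insert(prefix.to_string());
            }
        }
    }
    // A BTreeSet keeps the result sorted, so iteration order is deterministic.
    let mut out = BTreeSet::new();
    walk(root, "$", &mut out);
    out
}
```

For a document shaped like `{"items": [{"id": 1}, {"id": 2}]}`, both array elements map to the single path `$.items[*].id`.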
vajra-fingerprint
Structural identity.
- BLAKE3 path set fingerprint
- BLAKE3 typed path fingerprint
- Merkle subtree hashing (shape fingerprint)
- MinHash signature computation (k = 128)
- SimHash for near-motif detection
- LSH bucketing for scalable similarity search
- Cluster computation from LSH candidates
- StreamingFingerprintAccumulator for streaming mode
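A MinHash signature of the kind listed above can be sketched in a few lines. This version swaps in the standard library's `DefaultHasher` as the hash family purely to stay dependency-free; Vajra is described as using BLAKE3, and `DefaultHasher` output is not guaranteed stable across Rust releases. `minhash_signature` and `estimate_jaccard` are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

/// Computes a k-slot MinHash signature: for each seed, the minimum hash
/// over all items in the set.
pub fn minhash_signature(items: &BTreeSet<String>, k: usize) -> Vec<u64> {
    (0..k as u64)
        .map(|seed| {
            items
                .iter()
                .map(|item| {
                    let mut hasher = DefaultHasher::new();
                    seed.hash(&mut hasher);
                    item.hash(&mut hasher);
                    hasher.finish()
                })
                .min()
                .unwrap_or(u64::MAX) // empty set: sentinel value
        })
        .collect()
}

/// Estimates Jaccard similarity as the fraction of matching signature slots.
pub fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}
```

Each slot matches with probability equal to the true Jaccard similarity, so the estimate's standard error shrinks roughly as one over the square root of `k`.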
vajra-stats
The statistical engine.
- Shannon entropy (exact and CMS-approximate)
- Normalized entropy
- Count-Min Sketch with conservative update
- Space-Saving top-k
- DDSketch for streaming quantiles
- MAD and modified z-scores
- Frequency analysis (key, path, value)
- Missingness profiling (null rate, absent rate, empty rate)
- Numeric distribution summary (min, max, mean, median, percentiles)
- Co-occurrence and PMI computation
- Benford’s Law leading digit analysis
- StreamingStatsAccumulator for streaming mode
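The first two primitives in this list are compact enough to show in full. The sketch below computes exact Shannon entropy in bits from a frequency table and a normalized variant scaled to the range 0 to 1. The function signatures are assumptions made for illustration and may not match the crate's API.

```rust
use std::collections::BTreeMap;

/// Exact Shannon entropy in bits: H = -sum(p * log2(p)) over observed values.
pub fn shannon_entropy(counts: &BTreeMap<&str, u64>) -> f64 {
    let total: u64 = counts.values().sum();
    if total == 0 {
        return 0.0;
    }
    counts
        .values()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Entropy divided by its maximum, log2(number of distinct values).
/// Returns 0.0 when there are fewer than two distinct values.
pub fn normalized_entropy(counts: &BTreeMap<&str, u64>) -> f64 {
    let distinct = counts.values().filter(|&&c| c > 0).count();
    if distinct <= 1 {
        0.0
    } else {
        shannon_entropy(counts) / (distinct as f64).log2()
    }
}
```

Four equally frequent values give 2.0 bits and a normalized entropy of 1.0; a field that always holds the same value gives 0 for both.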
vajra-anomaly
Deviation detection.
- Numeric outlier detection (MAD-based z-scores)
- Rarity scoring (self-information)
- Structural deviation detection (Jaccard distance from mode)
- Type instability detection
- Composite anomaly scoring
- Anomaly report generation
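MAD-based outlier detection can be sketched as follows. The modified z-score uses the conventional 0.6745 scaling constant, and a threshold of 3.5 is a common default; both are assumptions about how the crate is configured. `mad_outliers` and `median` are hypothetical helper names.

```rust
/// Median of an already sorted slice, or None when it is empty.
fn median(sorted: &[f64]) -> Option<f64> {
    let n = sorted.len();
    if n == 0 {
        return None;
    }
    Some(if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    })
}

/// Returns the values whose modified z-score exceeds `threshold`:
/// |0.6745 * (x - median) / MAD| > threshold.
pub fn mad_outliers(values: &[f64], threshold: f64) -> Vec<f64> {
    let mut sorted = values.to_vec();
    sorted.sort_by(f64::total_cmp);
    let Some(med) = median(&sorted) else {
        return Vec::new();
    };
    let mut deviations: Vec<f64> = sorted.iter().map(|v| (v - med).abs()).collect();
    deviations.sort_by(f64::total_cmp);
    let mad = median(&deviations).unwrap_or(0.0);
    if mad == 0.0 {
        // More than half the values are identical; the score is undefined.
        return Vec::new();
    }
    values
        .iter()
        .copied()
        .filter(|v| (0.6745 * (v - med) / mad).abs() > threshold)
        .collect()
}
```

Because both the center and the spread are medians, a single extreme value barely moves them, which is what keeps the method robust where mean and standard deviation are not.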
vajra-drift
Change detection between documents.
- Path set symmetric difference (structural drift)
- Type drift detection
- Jensen-Shannon Divergence for distributional drift
- 1D Wasserstein distance for numeric drift magnitude
- Drift classification (additive, subtractive, type-mutative, distributional, cardinality-shift, null-rate-shift)
- Severity scoring with profile-dependent weights
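For distributional drift, Jensen-Shannon Divergence compares each distribution with their midpoint. The sketch below works on two aligned probability vectors and reports the result in bits, so it ranges from 0 (identical) to 1 (disjoint support). The signature is an assumption for illustration, and the function does not verify that the inputs sum to 1.

```rust
/// Kullback-Leibler divergence in bits, skipping zero-probability terms.
fn kl_divergence_bits(a: &[f64], m: &[f64]) -> f64 {
    a.iter()
        .zip(m)
        .filter(|(ai, _)| **ai > 0.0)
        .map(|(ai, mi)| ai * (ai / mi).log2())
        .sum()
}

/// JSD(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M), with M = (P + Q) / 2.
/// Returns None when the vectors are empty or differ in length.
pub fn jensen_shannon_divergence(p: &[f64], q: &[f64]) -> Option<f64> {
    if p.is_empty() || p.len() != q.len() {
        return None;
    }
    let midpoint: Vec<f64> = p.iter().zip(q).map(|(a, b)| 0.5 * (a + b)).collect();
    Some(0.5 * kl_divergence_bits(p, &midpoint) + 0.5 * kl_divergence_bits(q, &midpoint))
}
```

Unlike plain KL divergence, the midpoint is nonzero wherever either input is nonzero, so the result is always finite and symmetric in its arguments.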
vajra-motif
Repeated structure analysis.
- Motif counting from Merkle subtree hash frequencies
- Near-motif grouping via SimHash Hamming distance
- Motif ranking (frequency x subtree size)
- Motif compression for essence generation
- Array morphology analysis (homogeneity, uniqueness, shape diversity)
vajra-essence
The rendering engine.
- Built-in profiles: StaffProfile, EngineerProfile, AuditorProfile, AiProfile, FraudProfile
- Custom profile loading from TOML
- Six-dimensional scoring model
- Candidate collection and ranking
- Token budget enforcement (greedy knapsack)
- Text, JSON, Markdown, and compact-AI renderers
- Motif collapsing
- --explain score decomposition
- Provenance metadata attachment
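Token budget enforcement by greedy knapsack can be sketched like this. The `Candidate` struct, its fields, and the ranking by score per token are assumptions; the actual scoring model is six-dimensional and is not reproduced here. Ties are broken by `id` so that the selection stays deterministic.

```rust
/// Hypothetical essence candidate: an observation with a score and a token cost.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub id: String,
    pub score: f64,
    pub tokens: usize,
}

/// Greedy knapsack: take candidates in order of score per token while they fit.
pub fn select_within_budget(mut candidates: Vec<Candidate>, budget: usize) -> Vec<Candidate> {
    // Zero-cost candidates would make the density undefined; drop them here.
    candidates.retain(|c| c.tokens > 0);
    candidates.sort_by(|a, b| {
        let density_a = a.score / a.tokens as f64;
        let density_b = b.score / b.tokens as f64;
        density_b
            .total_cmp(&density_a)
            .then_with(|| a.id.cmp(&b.id))
    });
    let mut remaining = budget;
    let mut chosen = Vec::new();
    for candidate in candidates {
        if candidate.tokens <= remaining {
            remaining -= candidate.tokens;
            chosen.push(candidate);
        }
    }
    chosen
}
```

Greedy selection is not guaranteed to be optimal for the knapsack problem, but it is fast and, with a total ordering on candidates, reproducible.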
vajra-query
Path-based query engine.
- Expression parser for path filters and analysis functions
- entropy(path), rarity(path, value), instability(path), null_rate(path), stats(path), anomaly_score(path), motif(path)
- Conditional expression evaluation (e.g., entropy($.status) > 0.5)
- Integration with stats, anomaly, and motif analyzers
vajra-cli
The command-line interface.
- Clap-based argument parsing
- Command dispatch (inspect, stats, anomalies, fingerprint, essence, drift, cluster, invariants, query, batch, profiles)
- Output format rendering (text, JSON, Markdown, compact-AI)
- Redaction integration
- Streaming mode selection
- Custom profile loading
- Batch processing with Rayon parallelism
vajra-domain-med
The medical/EDI domain plugin.
- ICD-10-CM and ICD-10-PCS pattern recognizers
- CPT and HCPCS code recognizers
- NDC (National Drug Code) recognizer
- NPI (National Provider Identifier) recognizer with Luhn check
- Denial reason code recognizer (CO, PR, OA, PI, CR)
- Claim, service line, patient, provider, and adjudication relationship hints
- Implements the VajraPlugin trait
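The NPI check mentioned above can be illustrated with a short validator. An NPI is 10 digits, and its check digit is computed with the Luhn algorithm over the digits prefixed by the constant 80840; that prefix always contributes 24 to the Luhn sum, so it is added as a constant here. `is_valid_npi` is a hypothetical name, not the plugin's API.

```rust
/// Validates a 10-digit NPI using the Luhn check with the 80840 prefix.
pub fn is_valid_npi(value: &str) -> bool {
    if value.len() != 10 || !value.bytes().all(|b| b.is_ascii_digit()) {
        return false;
    }
    // Luhn contribution of the fixed "80840" prefix.
    let mut sum: u32 = 24;
    for (i, b) in value.bytes().rev().enumerate() {
        let digit = u32::from(b - b'0');
        sum += if i % 2 == 1 {
            // Every second digit from the right is doubled; two-digit
            // results are reduced by summing their digits (same as -9).
            let doubled = digit * 2;
            if doubled > 9 { doubled - 9 } else { doubled }
        } else {
            digit
        };
    }
    sum % 10 == 0
}
```

A pattern match alone would accept any 10-digit number; the checksum is what lets the recognizer reject most random 10-digit values.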
Core Traits
The trait system is the architectural backbone. Each trait is small, composable, and independently testable.
| Trait | Defined In | Purpose |
|---|---|---|
| Analyzer | vajra-types | DOM-mode analysis: document in, typed output out |
| StreamAnalyzer | vajra-types | Streaming analysis: events in, accumulator maintained, output finalized |
| FeatureExtractor | vajra-types | Extract features into the shared feature store |
| ConcernProfile | vajra-types | Define scoring weights and rendering behavior |
| Fingerprinter | vajra-types | Compute structural fingerprints |
| DriftDetector | vajra-types | Compare two analyzed documents for drift |
| VajraPlugin | vajra-types | Plugin extension point |
| TypeRecognizer | vajra-types | Domain-specific value type recognition |
Navigating the Codebase
“I want to understand how parsing works.”
Start at vajra-core/src/. The input module handles multi-format loading. The parse module handles JSON parsing. The canon module handles canonicalization.
“I want to understand the statistical engine.”
Start at vajra-stats/src/. Each statistical primitive has its own module. StatsAnalyzer composes them.
“I want to add a new profile.”
Look at vajra-essence/src/. The built-in profiles (StaffProfile, EngineerProfile, etc.) implement ConcernProfile. Follow the pattern.
“I want to add a domain plugin.”
Look at vajra-domain-med/ as the reference implementation. Implement VajraPlugin in a new crate.
“I want to add a new command.”
Start at vajra-cli/src/main.rs. Each command is a function (cmd_inspect, cmd_stats, etc.). Add a new variant to the Command enum and implement the handler.
“I want to understand how essences are built.”
Start at vajra-essence/src/. The EssenceBuilder collects observations from stats, anomaly, and motif analyzers, scores them, and renders the result.
Build and Run
# Build the entire workspace
cargo build --release
# Run tests across all crates
cargo test --workspace
# Run the CLI
./target/release/vajra inspect claim.json
# Run benchmarks
cargo bench --workspace
External Dependencies
| Dependency | Version | Purpose |
|---|---|---|
| serde / serde_json | 1.x | Serialization |
| serde_yaml | 0.9 | YAML input format |
| csv | 1.x | CSV/TSV input format |
| blake3 | 1.x | All hashing |
| clap | 4.x | CLI argument parsing |
| ryu | 1.x | Deterministic float formatting |
| unicode-normalization | 0.1 | Unicode NFC normalization |
| toml | 0.8 | Config and profile loading |
| regex | 1.x | Pattern matching (redaction, type recognition) |
| rayon | 1.x | Parallel batch processing |
| thiserror / anyhow | 2.x / 1.x | Error handling |
| flate2 | 1.x | Gzip decompression |
| zstd | 0.13 | Zstd decompression |
| pulldown-cmark | 0.12 | Markdown input parsing |
| pdf-extract | 0.10 | PDF text extraction |
| ureq | 2.x | HTTP URL fetching |
| proptest | 1.x | Property-based testing |
| criterion | 0.5 | Benchmarks |
Most dependencies are pure Rust. The zstd crate is an exception: it builds the bundled C zstd library through zstd-sys, so a C compiler is needed at build time in addition to a standard Rust toolchain.
Lints
The workspace enforces strict Clippy lints:
[workspace.lints.clippy]
pedantic = { level = "warn", priority = -1 }
nursery = { level = "warn", priority = -1 }
unwrap_used = "deny" # No .unwrap() — use Result
expect_used = "deny" # No .expect() — use Result
panic = "deny" # No panic!() — ever
No panics on any input. No unwraps. No expects. Every error path returns a Result.
Testing
Vajra’s test suite is not an afterthought. It is a structural guarantee. 1075 tests across 7 testing strategies ensure that every algorithm, every command, and every output contract works as specified — and continues to work as the codebase evolves.
The Test Philosophy
- Every algorithm has a unit test that verifies it against known inputs with expected outputs. No algorithm ships without a proof that it computes correctly.
- Every property that should hold universally is tested with random inputs. Canonicalization is idempotent. Fingerprints are key-order-independent. Drift detection is symmetric. These are not checked on one example; they are checked on thousands of generated inputs.
- Every failure mode is tested. Malformed JSON, deeply nested documents, pathologically wide objects, adversarial strings. If Vajra can encounter it in the wild, the fuzzer has already thrown it.
- Determinism is tested directly. Same input, same config, 10 runs, byte-identical output. This runs in CI on every commit.
- Streaming and DOM modes are tested against each other. They must agree within documented error bounds. If they diverge, the streaming approximation is broken.
Test Categories
Unit Tests
1075 tests across all 17 crates. Each primitive, each algorithm, each data structure has targeted tests with known inputs and expected outputs. Domain plugins (medical, security, DevOps) each carry their own property tests, determinism tests, and golden corpus validation.
Examples from each crate:
vajra-core:
#![allow(unused)]
fn main() {
#[test]
fn canonicalization_sorts_keys_lexicographically() {
let input = r#"{"b": 2, "a": 1, "c": 3}"#;
let doc = Document::parse_str(input).unwrap();
let canonical = doc.canonical_json();
assert_eq!(canonical, r#"{"a":1,"b":2,"c":3}"#);
}
#[test]
fn path_extraction_normalizes_array_indices() {
let input = r#"{"items": [{"id": 1}, {"id": 2}]}"#;
let doc = Document::parse_str(input).unwrap();
let paths = doc.trie().all_paths();
assert!(paths.iter().any(|p| p.to_string() == "$.items[*].id"));
}
#[test]
fn malformed_json_returns_error_not_panic() {
let input = r#"{"unclosed": "string"#;
let result = Document::parse_str(input);
assert!(result.is_err());
}
}
vajra-stats:
#![allow(unused)]
fn main() {
#[test]
fn shannon_entropy_of_uniform_distribution() {
// 4 equally likely values -> entropy = 2.0 bits
let values = vec!["a", "b", "c", "d"];
let counts: BTreeMap<&str, u64> = values.iter()
.map(|v| (*v, 25u64))
.collect();
let entropy = shannon_entropy(&counts);
assert!((entropy - 2.0).abs() < 1e-10);
}
#[test]
fn mad_of_known_distribution() {
let values = vec![1.0, 2.0, 3.0, 4.0, 5.0, 100.0];
// Median of the six values: (3.0 + 4.0) / 2 = 3.5
let mad = compute_mad(&values);
// MAD = median(|1-3.5|, |2-3.5|, |3-3.5|, |4-3.5|, |5-3.5|, |100-3.5|)
// = median(2.5, 1.5, 0.5, 0.5, 1.5, 96.5) = 1.5
assert!((mad - 1.5).abs() < 1e-10);
}
#[test]
fn ddsketch_quantiles_within_relative_accuracy() {
let mut sketch = DDSketch::new(0.01); // 1% relative accuracy
for v in &known_distribution {
sketch.insert(*v);
}
let estimated_p50 = sketch.quantile(0.5).unwrap();
let true_p50 = exact_median(&known_distribution);
assert!((estimated_p50 - true_p50).abs() <= 0.01 * true_p50.abs());
}
}
vajra-fingerprint:
#![allow(unused)]
fn main() {
#[test]
fn path_set_fingerprint_is_key_order_independent() {
let a = Document::parse_str(r#"{"x": 1, "y": 2}"#).unwrap();
let b = Document::parse_str(r#"{"y": 2, "x": 1}"#).unwrap();
let fp_a = FingerprintAnalyzer.analyze(&a).unwrap();
let fp_b = FingerprintAnalyzer.analyze(&b).unwrap();
assert_eq!(fp_a.path_set, fp_b.path_set);
}
#[test]
fn identical_subtrees_produce_identical_merkle_hashes() {
let input = r#"{"items": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}"#;
let doc = Document::parse_str(input).unwrap();
let fp = FingerprintAnalyzer.analyze(&doc).unwrap();
// Both array elements have the same structure -> same subtree hash
assert_eq!(fp.subtree_hashes[0], fp.subtree_hashes[1]);
}
}
vajra-anomaly:
#![allow(unused)]
fn main() {
#[test]
fn mad_outlier_detection_flags_extreme_values() {
let values: Vec<f64> = (0..100).map(f64::from).collect();
let mut values_with_outlier = values.clone();
values_with_outlier.push(10_000.0);
let report = AnomalyAnalyzer::detect_numeric_outliers(
&values_with_outlier, 3.5
);
assert!(report.outliers.iter().any(|o| o.value == 10_000.0));
}
}
vajra-drift:
#![allow(unused)]
fn main() {
#[test]
fn jsd_is_symmetric() {
let p = distribution_a();
let q = distribution_b();
let jsd_pq = jensen_shannon_divergence(&p, &q);
let jsd_qp = jensen_shannon_divergence(&q, &p);
assert!((jsd_pq - jsd_qp).abs() < 1e-10);
}
#[test]
fn jsd_is_zero_for_identical_distributions() {
let p = distribution_a();
let jsd = jensen_shannon_divergence(&p, &p);
assert!(jsd.abs() < 1e-10);
}
}
Property Tests
Using proptest, Vajra tests invariants that must hold for all valid inputs:
Canonicalization idempotence:
#![allow(unused)]
fn main() {
proptest! {
#[test]
fn canonicalize_is_idempotent(json in arb_json()) {
let once = canonicalize(&json);
let twice = canonicalize(&once);
prop_assert_eq!(once, twice);
}
}
}
Fingerprint stability under key reordering:
#![allow(unused)]
fn main() {
proptest! {
#[test]
fn fingerprint_stable_under_key_reorder(obj in arb_json_object()) {
let original = fingerprint(&obj);
let shuffled = shuffle_keys(&obj);
let recomputed = fingerprint(&shuffled);
prop_assert_eq!(original.path_set, recomputed.path_set);
}
}
}
Merkle hash determinism:
#![allow(unused)]
fn main() {
proptest! {
#[test]
fn merkle_hash_deterministic(json in arb_json()) {
let hash1 = merkle_subtree_hash(&json);
let hash2 = merkle_subtree_hash(&json);
prop_assert_eq!(hash1, hash2);
}
}
}
Drift symmetry:
#![allow(unused)]
fn main() {
proptest! {
#[test]
fn structural_drift_is_symmetric(a in arb_json(), b in arb_json()) {
let drift_ab = structural_drift(&a, &b);
let drift_ba = structural_drift(&b, &a);
prop_assert_eq!(drift_ab.added_paths, drift_ba.removed_paths);
prop_assert_eq!(drift_ab.removed_paths, drift_ba.added_paths);
}
}
}
MinHash accuracy convergence:
#![allow(unused)]
fn main() {
proptest! {
#[test]
fn minhash_jaccard_converges(
a in arb_string_set(1..100),
b in arb_string_set(1..100)
) {
let true_jaccard = exact_jaccard(&a, &b);
let estimated = minhash_jaccard(&a, &b, 128);
// With 128 hashes the standard error is at most sqrt(0.25 / 128) ≈ 0.044, so 0.15 leaves generous headroom
prop_assert!((true_jaccard - estimated).abs() < 0.15);
}
}
}
DDSketch relative error guarantee:
#![allow(unused)]
fn main() {
proptest! {
#[test]
fn ddsketch_quantile_within_bounds(values in arb_f64_vec(10..1000)) {
let mut sketch = DDSketch::new(0.01);
for v in &values { sketch.insert(*v); }
let estimated = sketch.quantile(0.5).unwrap();
let exact = exact_median(&values);
prop_assert!((estimated - exact).abs() <= 0.01 * exact.abs() + 1e-10);
}
}
}
Scoring determinism:
#![allow(unused)]
fn main() {
proptest! {
#[test]
fn scoring_is_deterministic(json in arb_json(), profile in arb_profile()) {
let score1 = compute_scores(&json, &profile);
let score2 = compute_scores(&json, &profile);
prop_assert_eq!(score1, score2);
}
}
}
Chaos Tests (Fuzzing)
Using cargo-fuzz and AFL, the fuzzer throws adversarial inputs at every entry point:
| Input Category | What It Tests |
|---|---|
| Truncated JSON | {"key": "valu — parser graceful failure |
| Unbalanced braces | {{{}} — parser error recovery |
| Invalid UTF-8 | Raw byte sequences — no undefined behavior |
| Depth 10,000+ nesting | [[[[[... — depth limit enforcement |
| 100,000+ keys per object | {"k1":1,"k2":2,...} — performance, memory |
| 1M identical array elements | [1,1,1,...] — motif detection, sketch behavior |
| Type chaos | Same path alternates string/number/null — instability detection |
| Adversarial strings | Null bytes, RTL markers, control characters, multi-byte Unicode |
| Near-max-size documents | At the streaming threshold boundary — mode switching |
Target: Zero panics. Zero undefined behavior. Graceful error on every input.
# Run the fuzzer
cd vajra-core
cargo fuzz run parse_json -- -max_total_time=3600
Differential Tests
Two implementations of the same analysis must agree within documented bounds:
DOM vs. Streaming:
#![allow(unused)]
fn main() {
#[test]
fn dom_and_streaming_stats_agree() {
let doc = Document::parse_file("corpus/claim.json").unwrap();
let dom_stats = StatsAnalyzer.analyze(&doc).unwrap();
let mut acc = StreamingStatsAccumulator::default();
for event in stream_events("corpus/claim.json") {
acc.on_event(&event.unwrap()).unwrap();
}
let stream_stats = acc.finalize().unwrap();
// Path sets must be identical
assert_eq!(dom_stats.paths.keys().collect::<Vec<_>>(),
stream_stats.paths.keys().collect::<Vec<_>>());
// CMS estimates within error bounds
for (path, dom_ps) in &dom_stats.paths {
let stream_ps = &stream_stats.paths[path];
for (value, &exact_count) in &dom_ps.value_frequencies {
let estimated = stream_ps.estimated_frequency(value);
assert!(exact_count <= estimated);
assert!(estimated <= exact_count + EPSILON * stream_stats.total_values);
}
}
}
}
Exact quantiles vs. DDSketch:
#![allow(unused)]
fn main() {
#[test]
fn ddsketch_within_relative_accuracy() {
let values = load_test_values("corpus/charge_amounts.json");
let mut sketch = DDSketch::new(0.01);
for v in &values { sketch.insert(*v); }
for q in &[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99] {
let exact = exact_quantile(&values, *q);
let estimated = sketch.quantile(*q).unwrap();
let relative_error = (estimated - exact).abs() / exact.abs();
assert!(relative_error <= 0.01,
"q={}: exact={}, estimated={}, error={}",
q, exact, estimated, relative_error);
}
}
}
Determinism Tests
#![allow(unused)]
fn main() {
#[test]
fn ten_run_determinism() {
let corpus = load_corpus("corpus/");
for file in &corpus {
let mut outputs = Vec::new();
for _ in 0..10 {
let output = run_vajra(&["essence", file, "--profile", "engineer", "--format", "json"]);
outputs.push(output);
}
for i in 1..outputs.len() {
assert_eq!(outputs[0], outputs[i],
"Determinism violation on run {} for file {}", i, file);
}
}
}
#[test]
fn different_seeds_may_differ() {
let _output_seed0 = run_vajra(&["cluster", "corpus/", "--seed", "0", "--format", "json"]);
let _output_seed42 = run_vajra(&["cluster", "corpus/", "--seed", "42", "--format", "json"]);
// Outputs for different seeds may differ, so nothing is asserted here.
// Within a single seed the output must be deterministic (see same_seed_is_deterministic).
}
#[test]
fn same_seed_is_deterministic() {
for seed in &["0", "42", "12345"] {
let mut outputs = Vec::new();
for _ in 0..10 {
let output = run_vajra(&["cluster", "corpus/", "--seed", seed, "--format", "json"]);
outputs.push(output);
}
for i in 1..outputs.len() {
assert_eq!(outputs[0], outputs[i]);
}
}
}
}
Golden Tests
For each profile-format combination, golden output files are committed to the repository:
tests/golden/
├── claim_staff_text.golden
├── claim_staff_json.golden
├── claim_engineer_text.golden
├── claim_engineer_json.golden
├── claim_auditor_markdown.golden
├── claim_ai_compact.golden
├── claim_fraud_text.golden
├── drift_engineer_text.golden
├── anomalies_text.golden
└── ...
CI asserts byte-exact match between current output and golden files:
#![allow(unused)]
fn main() {
#[test]
fn golden_staff_text() {
let output = run_vajra(&["essence", "corpus/claim.json", "--profile", "staff"]);
let golden = std::fs::read_to_string("tests/golden/claim_staff_text.golden").unwrap();
assert_eq!(output, golden, "Golden test failed: staff/text");
}
}
Golden files are updated explicitly — never auto-updated. When output changes intentionally (algorithm improvement, rendering change), the developer updates the golden files and the diff is reviewed in the PR.
This catches: rendering regressions, ordering instabilities, score drift from algorithm changes.
Benchmark Tests
Using criterion, tracking performance across commits:
#![allow(unused)]
fn main() {
fn bench_parse_1mb(c: &mut Criterion) {
let input = std::fs::read("benches/fixtures/1mb.json").unwrap();
c.bench_function("parse_1mb", |b| {
b.iter(|| Document::parse_bytes(black_box(&input)))
});
}
fn bench_stats_1mb(c: &mut Criterion) {
let doc = Document::parse_file("benches/fixtures/1mb.json").unwrap();
c.bench_function("stats_1mb", |b| {
b.iter(|| StatsAnalyzer.analyze(black_box(&doc)))
});
}
fn bench_fingerprint_comparison(c: &mut Criterion) {
let fp_a = /* precomputed */;
let fp_b = /* precomputed */;
c.bench_function("fingerprint_compare", |b| {
b.iter(|| minhash_jaccard(black_box(&fp_a), black_box(&fp_b)))
});
}
}
Performance targets validated in CI:
| Scenario | Target | Test |
|---|---|---|
| 1 MB JSON, full analysis | < 100 ms | bench_full_1mb |
| 100 MB JSON, full analysis | < 5 s | bench_full_100mb |
| 10,000 document batch | < 30 s | bench_batch_10k |
| Fingerprint comparison | < 1 µs per pair | bench_fingerprint_compare |
Regressions > 10% fail the build.
Running Everything
# All unit and integration tests
cargo test --workspace
# Property tests (may run longer)
cargo test --workspace -- --include-ignored proptest
# Benchmarks
cargo bench --workspace
# Fuzzing (runs until stopped)
cd vajra-core && cargo fuzz run parse_json
# Determinism check (manual)
for i in $(seq 1 10); do
vajra essence test/claim.json --format json > "/tmp/run_$i.json"
done
md5sum /tmp/run_*.json
# All hashes must be identical
The Invariant Catalog
These properties are tested across the suite. If any is violated, the build fails.
| Invariant | Test Type |
|---|---|
| Canonicalization is idempotent | Property test |
| Fingerprints are key-order-independent | Property test |
| Identical subtrees produce identical Merkle hashes | Property test |
| Structural drift is symmetric (with direction inversion) | Property test |
| MinHash Jaccard converges to true Jaccard | Property test |
| DDSketch quantiles within relative accuracy | Property + differential |
| CMS estimates within proven error bounds | Differential test |
| DOM and streaming produce consistent results | Differential test |
| 10 runs produce byte-identical output | Determinism test |
| No panics on any input | Fuzz test |
| No undefined behavior | Fuzz test |
| Golden output is byte-stable | Golden test |
| Performance within 10% of baseline | Benchmark |
| Mutation score > 85% | Mutation test |