VAJRA

Deterministic Semantic Reduction Engine

Break noise. Preserve truth.

761 Tests

12 Crates

11 Commands

22K Lines of Rust

0 Failures

What Vajra Does

Feed it any structured data. Get back shape, signal, anomalies, and truth.

Vajra analyzes JSON, YAML, CSV, NDJSON, Markdown, and PDF. It extracts structural fingerprints, computes entropy and statistical profiles, detects anomalies and schema drift, discovers cross-field relationships, and renders deterministic essences tuned for humans, auditors, or AI pipelines.

Inspect

vajra inspect claim.json

Full structural analysis — paths, types, fingerprints, domain recognition.

Essence

vajra essence data.json --profile staff

Concern-oriented reduction. 7 profiles. Token budgets. Compact-AI output for LLMs.

Drift

vajra drift v1.json v2.json

Schema drift detection with JSD, Wasserstein distance, severity classification.

Anomalies

vajra anomalies batch.ndjson

MAD-based outliers, rarity scoring, type instability. Deterministic. Explainable.

Query

vajra query data.json 'entropy($.status) > 0.5'

Path expressions with analysis functions. Entropy, rarity, null rate, instability.

Cluster

vajra cluster batch/*.json

MinHash + LSH similarity clustering. Finds payload families in seconds.

Forged for the Agent Gods

Vajra was not designed for casual use. It was forged as a weapon — an instrument of precision for AI systems that need to understand structured data at scale.

The compact-ai output compresses a 1000-node JSON document into a token-efficient essence that preserves every anomaly, every structural motif, every statistical signal — in a format an LLM can parse in a single pass.

The chain-ready drill section tells the downstream model exactly which paths have deeper analysis available, enabling multi-turn investigation without re-processing.

The determinism guarantee means the same input always produces the same output. No drift. No randomness. No surprises. An AI pipeline that depends on Vajra can depend on Vajra.

vajra essence massive.json --profile ai --format compact-ai --budget 500

{
  "v": "vajra/1",
  "doc": {"nodes": 847, "paths": 23, "depth": 6},
  "anomalies": [
    {"p": "$.claims[*].allowed", "t": "type_instability", "s": 0.4},
    {"p": "$.claims[*].charge", "t": "numeric_outlier", "v": 350, "z": 4.2}
  ],
  "drill": [
    {"path": "$.claims[*].service_lines", "available": ["stats", "anomalies", "motifs"]}
  ],
  "meta": {"profile": "ai", "truncated": false}
}

The Engine

BLAKE3 Fingerprinting

Merkle subtree hashing. Path set signatures. Motif detection falls out for free. O(n).

Shannon Entropy

Distinguishes boilerplate from signal without domain knowledge. The strongest universal primitive.

MAD Outliers

50% breakdown point. Just under half the data can be corrupted before MAD gives a misleading result.

Jensen-Shannon Divergence

Symmetric. Bounded. A proper metric via sqrt. The right way to measure distribution drift.

DDSketch

Relative-error quantile estimation. Mergeable. O(1) per insert. Streams terabytes in megabytes of RAM.

MinHash + LSH

Sublinear similarity search. Cluster 10K documents in seconds. No O(n^2) anywhere.
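
A minimal sketch of the MinHash half of this idea, using only the standard library. The names `seeded_hash`, `minhash`, `estimate_jaccard`, and `exact_jaccard` are hypothetical and not Vajra's API; the hash is a hand-rolled stand-in; the LSH banding step that makes search sublinear is omitted. It shows only how a fixed set of seeds yields a reproducible signature whose slot agreement estimates Jaccard similarity.

```rust
use std::collections::BTreeSet;

/// Seeded 64-bit hash: FNV-1a followed by a splitmix64-style finalizer.
/// A stand-in for whatever hash family a production implementation uses.
fn seeded_hash(item: &str, seed: u64) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325 ^ seed.wrapping_mul(0x9e3779b97f4a7c15);
    for b in item.as_bytes() {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h ^= h >> 30;
    h = h.wrapping_mul(0xbf58476d1ce4e5b9);
    h ^= h >> 27;
    h = h.wrapping_mul(0x94d049bb133111eb);
    h ^ (h >> 31)
}

/// MinHash signature: for each of `k` seeds, keep the smallest hash in the set.
fn minhash(set: &BTreeSet<&str>, k: u64) -> Vec<u64> {
    (0..k)
        .map(|seed| {
            set.iter()
                .map(|item| seeded_hash(item, seed))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Fraction of signature slots that agree; an estimate of Jaccard similarity.
fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    same as f64 / a.len() as f64
}

/// Exact Jaccard similarity, for comparison on small inputs.
fn exact_jaccard(a: &BTreeSet<&str>, b: &BTreeSet<&str>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 1.0;
    }
    a.intersection(b).count() as f64 / union as f64
}

fn main() {
    // Hypothetical wildcard path sets for two payloads.
    let v1 = BTreeSet::from(["$.id", "$.status", "$.items[*].sku", "$.items[*].qty"]);
    let v2 = BTreeSet::from(["$.id", "$.status", "$.items[*].sku", "$.meta.version"]);
    let (s1, s2) = (minhash(&v1, 128), minhash(&v2, 128));
    println!("exact     = {:.2}", exact_jaccard(&v1, &v2));
    println!("estimated = {:.2}", estimate_jaccard(&s1, &s2));
}
```

Because the seeds are fixed, the signature for a given set is identical on every run, which is what keeps a randomized technique compatible with a determinism guarantee.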

Install

cargo install vajra-cli

Or from source:

git clone https://github.com/copyleftdev/vajra
cd vajra
cargo build --release

First useful output in under 30 seconds:

echo '{"hello": "world"}' | vajra inspect -

Quickstart

You have 60 seconds. Let us not waste them.

Install

From crates.io:

cargo install vajra-cli

From source:

git clone https://github.com/copyleftdev/vajra
cd vajra
cargo build --release
# Binary lands at ./target/release/vajra

Verify:

vajra --help

Four Commands That Prove the Point

1. Inspect a JSON document

Feed Vajra a medical claim. Get back its skeleton — every path, every type, every fingerprint.

vajra inspect claim.json

=== Document Metadata ===
  Total nodes:    847
  Max depth:      6
  Distinct paths: 23
  Raw size:       14208 bytes

=== Wildcard Paths ===
  PATH                                           TYPE      COUNT  INSTABILITY  NULL_RATE
  $                                              object        1       0.0000     0.0000
  $.claims                                       array         1       0.0000     0.0000
  $.claims[*]                                    object        1       0.0000     0.0000
  $.claims[*].patient.id                         string        1       0.0000     0.0000
  $.claims[*].diagnosis[*].code                  string        2       0.0000     0.0000
  $.claims[*].service_lines[*].procedure_code    string       14       0.0000     0.0000
  $.claims[*].service_lines[*].charge_amount     number       14       0.0000     0.0000
  $.claims[*].service_lines[*].allowed_amount    number       11       0.0000     0.2143
  $.claims[*].service_lines[*].status            string       14       0.0000     0.0000

=== Fingerprints ===
  Path set:    a1b2c3d4e5f6...
  Typed path:  f7e8d9c0b1a2...
  Shape:       1234abcd5678...

=== Domain Type Recognition ===
  $.claims[*].diagnosis[*].code           E11.9      ICD-10-CM
  $.claims[*].service_lines[*].procedure_code  99213  CPT

Every path. Every type. Every structural fingerprint. Domain-specific codes recognized automatically. Zero configuration.

2. Generate an essence

Compress the entire document into what matters, shaped for a specific audience.

vajra essence claim.json --profile staff

=== Essence (staff profile) ===

Document Summary:
  1 claim with 14 service lines, 1 patient, 2 diagnosis codes.
  Primary status: partially adjudicated.

What Stands Out:
  - 3 service lines are missing allowed amounts (lines 2, 7, 11).
    This field is present in 79% of service lines — its absence is notable.
  - Adjustment reason code "CO-45" repeats across 8 of 14 lines.
    Repetition at this frequency suggests a systematic pattern, not random variation.
  - 1 diagnosis structure differs from the other.
    The second diagnosis carries an extra "qualifier" field.

What This Likely Means:
  - Most of the claim is consistent and well-formed.
  - A subset of service lines appears incomplete or differently processed.
  - The repeated adjustment code points to a systematic issue.

Same command, different audience:

vajra essence claim.json --profile ai --format json --budget 500

{
  "vajra_essence": {
    "version": "0.1.0",
    "profile": "ai",
    "structure": {
      "root_type": "object",
      "total_nodes": 847,
      "distinct_paths": 23,
      "max_depth": 6
    },
    "dominant_motif": {
      "path": "$.claims[0].service_lines[*]",
      "count": 14,
      "shape_hash": "f2c1..."
    },
    "anomalies": [
      {"path": "$.claims[0].service_lines[2,7,11].allowed_amount", "type": "missing", "severity": 4.2},
      {"path": "$.claims[0].diagnosis[1]", "type": "structural_deviation", "severity": 3.1}
    ]
  }
}

3. Detect drift between versions

Compare yesterday’s API response to today’s. Find what changed and how much it changed.

vajra drift baseline.json current.json

Drift Report: baseline.json -> current.json
Structural similarity: 0.94 (Jaccard)

Added paths (2):
  $.response.metadata.processing_flags    [array of strings]
  $.response.metadata.api_version         [string]

Removed paths (0): none

Type changes (1):
  $.response.items[*].quantity            string -> number

Distribution shifts (1):
  $.response.items[*].status              JSD: 0.34
    before: {"active": 0.82, "pending": 0.15, "error": 0.03}
    after:  {"active": 0.61, "pending": 0.12, "error": 0.27}
    note: "error" rate increased 9x

Overall severity: MEDIUM

Two paths added. One type migrated. The error rate in status jumped ninefold. Vajra found all of it in one pass.
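
For readers who want to see what a distribution-shift score involves, here is a minimal sketch of Jensen-Shannon divergence using base-2 logarithms. The function names are hypothetical, the two distributions in `main` are made up for illustration, and whether Vajra reports the divergence or its square root (and in which log base) is not stated here, so treat this as the textbook definition rather than Vajra's implementation.

```rust
/// Kullback-Leibler divergence KL(p || q) in bits.
/// Terms with p_i = 0 contribute nothing.
fn kl_bits(p: &[f64], q: &[f64]) -> f64 {
    p.iter()
        .zip(q)
        .filter(|(pi, _)| **pi > 0.0)
        .map(|(pi, qi)| pi * (pi / qi).log2())
        .sum()
}

/// Jensen-Shannon divergence in bits. With base-2 logs it lies in [0, 1].
fn jsd_bits(p: &[f64], q: &[f64]) -> f64 {
    assert_eq!(p.len(), q.len(), "distributions must share a support");
    // m is the midpoint distribution; m_i > 0 wherever p_i or q_i is.
    let m: Vec<f64> = p.iter().zip(q).map(|(a, b)| 0.5 * (a + b)).collect();
    0.5 * kl_bits(p, &m) + 0.5 * kl_bits(q, &m)
}

/// Jensen-Shannon distance: the square root of the divergence, which is a metric.
fn js_distance(p: &[f64], q: &[f64]) -> f64 {
    jsd_bits(p, q).max(0.0).sqrt()
}

fn main() {
    // Hypothetical category shares over the same three values, before and after.
    let before = [0.9, 0.1, 0.0];
    let after = [0.6, 0.1, 0.3];
    println!("divergence = {:.4} bits", jsd_bits(&before, &after));
    println!("distance   = {:.4}", js_distance(&before, &after));
    // Symmetric: swapping the arguments does not change the score.
    assert!((jsd_bits(&before, &after) - jsd_bits(&after, &before)).abs() < 1e-12);
}
```

Identical distributions score 0 and distributions with disjoint support score 1, so the number is comparable across fields regardless of how many categories each one has.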

4. Surface anomalies

Find what deviates from the population — without defining what “normal” looks like.

vajra anomalies claims_batch.ndjson

=== Anomaly Report ===
Records analyzed: 1,247
Anomalies found:  8

Numeric outliers:
  $.claims[*].service_lines[*].charge_amount
    Record 834: value 47,250.00 (z_MAD = 6.3, median = 285.00, MAD = 195.00)
    Record 1102: value 0.01 (z_MAD = -4.8)

Rarity outliers:
  $.claims[*].status
    Record 419: value "voided" (self-information = 10.3 bits, seen 1/1247)

Structural deviations:
  Record 662: missing 4 paths present in 99%+ of records
    - $.claims[*].subscriber.group_number
    - $.claims[*].subscriber.member_id
    - $.claims[*].provider.npi
    - $.claims[*].provider.taxonomy

Type instability:
  $.claims[*].service_lines[*].quantity
    Records 88, 204, 917: string where number expected (instability = 0.002)

Eight anomalies across four dimensions. Every one carries its score, its evidence, and the statistical context that makes it interpretable.
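
The rarity score in the report above is ordinary self-information: the negative base-2 log of a value's observed frequency. A minimal sketch, with a hypothetical function name:

```rust
/// Self-information in bits of a value seen `seen` times in `total` observations.
/// Returns None when the counts cannot describe a frequency.
fn self_information_bits(seen: u64, total: u64) -> Option<f64> {
    if seen == 0 || seen > total {
        return None;
    }
    // -log2(seen / total), written as log2(total / seen).
    Some((total as f64 / seen as f64).log2())
}

fn main() {
    // "voided" appears once in 1,247 records.
    if let Some(bits) = self_information_bits(1, 1247) {
        println!("{:.1} bits", bits); // prints "10.3 bits"
    }
}
```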

What Just Happened

You did not configure a schema. You did not define rules. You did not train a model.

Vajra read the raw structure, computed its statistical profile, and surfaced what deviates from the population — deterministically, explainably, in seconds.

That is the point.

Next Steps

Philosophy — why Vajra exists and what it refuses to be
Commands — all 11 commands at a glance
Profiles — tune the lens for your audience
Algorithms — the mathematics behind every score

Philosophy

Vajra exists because JSON is lying to you — not about its content, but about its complexity.

A 14,000-line medical claim is not 14,000 lines of information. It is a handful of structural motifs repeated dozens of times, wrapped in representational noise, carrying a few critical signals buried at unpredictable depths. The humans who depend on this data cannot see the signal. The AI systems consuming it waste tokens on the noise. The auditors verifying it have no tools that operate at the right level of abstraction.

Vajra was forged to solve this. Not by transforming the data. Not by summarizing it probabilistically. By analyzing it — deterministically, mathematically, and completely — and rendering the result as a compressed, faithful essence tuned to the concern of whoever is reading it.

The Three Views of JSON

This is the foundational insight. Every JSON document is three things simultaneously.

A Tree

The literal parse tree. Parent-child relationships, nesting depth, sibling structure, array indices. This is what JSON.parse() gives you. It is necessary but not sufficient.

The tree tells you what is here. It does not tell you what matters.

A Graph

Repeated structures create implicit references. Co-occurring keys form relationships. A diagnosis[*].code that appears alongside a diagnosis[*].system and a diagnosis[*].display is not three independent strings — it is a coded concept. A subscriber.id that functionally determines a subscriber.name is a dependency edge, invisible in the tree but real in the data.

The graph tells you how things relate. It reveals structure that the tree hides.
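
One way to picture a dependency edge: a field determines another when no value of the first is ever observed with two different values of the second. The sketch below checks exactly that over a list of observed pairs. The `determines` helper and the sample values are hypothetical, and this is the strict check only; it says nothing about how Vajra scores partial or noisy dependencies.

```rust
use std::collections::BTreeMap;

/// True when each distinct `x` value maps to exactly one `y` value,
/// i.e. `x` functionally determines `y` in the observed records.
fn determines(pairs: &[(&str, &str)]) -> bool {
    let mut seen: BTreeMap<&str, &str> = BTreeMap::new();
    for (x, y) in pairs {
        // `insert` returns the value previously stored under this key, if any.
        if let Some(previous) = seen.insert(*x, *y) {
            if previous != *y {
                return false;
            }
        }
    }
    true
}

fn main() {
    // Hypothetical (subscriber.id, subscriber.name) observations.
    let consistent = [("S1", "Ada"), ("S2", "Grace"), ("S1", "Ada")];
    let conflicting = [("S1", "Ada"), ("S1", "Alan")];
    println!("id -> name holds: {}", determines(&consistent));
    println!("id -> name holds: {}", determines(&conflicting));
}
```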

A Distribution

Every key name, every value, every type, every path, every null, every length: all form measurable statistical distributions. Shannon entropy distinguishes boilerplate from signal. Frequency reveals what is common and what is rare. MAD scores expose outliers that standard deviation would mask. The distribution of leading digits (Benford’s Law) flags financial figures whose digits deviate from what naturally occurring data produces, a common sign of fabricated numbers.

The distribution tells you what is normal and what deviates. It does this without rules, without schemas, without training data.

Raw JSON exposes only the tree. Vajra reads all three simultaneously.

The Six Design Principles

These are not aspirations. They are constraints. Every design decision in Vajra was tested against all six. Anything that violated even one was cut.

1. Universal

Any JSON. Any size. Any schema. Any nesting depth. No required schema definition, no required domain knowledge, no assumption about structure. If it parses as JSON, Vajra handles it.

This means: the core engine cannot contain a single line of code that assumes the data is a medical claim, or a financial transaction, or an API response. Domain intelligence enters only through plugins and profiles — never through the engine.

2. Deterministic

Same input + same config + same version = same output. Always. Fingerprints, scores, orderings, essence text, anomaly rankings — all reproducible to the byte.

This is not a nice-to-have. It is the foundation that makes Vajra trustworthy in pipelines, audits, and CI. An AI system that depends on Vajra can depend on Vajra. A compliance team that runs it twice gets the same answer twice.

The cost of this constraint is real: HashMap is banned from all externally visible orderings (replaced by BTreeMap). Floating-point formatting uses ryu for platform independence. Every randomized algorithm is seeded. These costs are paid gladly.
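
A minimal sketch of why the map type matters, assuming a hypothetical `render_counts` helper that is not part of Vajra's API. A `BTreeMap` iterates in key order, so the rendered text is the same no matter how the observations arrive; `HashMap` iteration order is unspecified and can differ between runs.

```rust
use std::collections::BTreeMap;

/// Hypothetical helper: count type names and render them in a stable order.
fn render_counts(observed: &[&str]) -> String {
    // BTreeMap iterates in sorted key order, so the output never depends
    // on hash seeds or on the order in which values were inserted.
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for ty in observed {
        *counts.entry(*ty).or_insert(0) += 1;
    }
    counts
        .iter()
        .map(|(ty, n)| format!("{}={}", ty, n))
        .collect::<Vec<_>>()
        .join(" ")
}

fn main() {
    let a = render_counts(&["string", "number", "string", "null"]);
    let b = render_counts(&["null", "string", "number", "string"]);
    // Same observations, different arrival order, identical bytes out.
    assert_eq!(a, b);
    println!("{}", a); // prints "null=1 number=1 string=2"
}
```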

3. Honest

Every inference is labeled as inference. Every score is decomposable. Every anomaly is explainable. Vajra never silently asserts a heuristic conclusion as truth.

When Vajra infers that a string is a date, it tells you the confidence level: definite (100% of values matched the DFA), dominant (>80%), heuristic (entropy-based), or unclassified (no inference applied). When it flags an anomaly, it shows the z-score, the median, the MAD, and the path. When it ranks an observation in an essence, --explain decomposes the score into its six contributing dimensions.

Magic is the enemy of trust. Vajra does not do magic.

4. Fast

Operational speed. Not batch-overnight speed. Seconds on typical payloads, minutes on gigabyte-scale files. Fast enough to use interactively in a terminal. Fast enough to gate a CI pipeline. Fast enough that reaching for Vajra is faster than opening the file.

The engine achieves this through simd-json for 2+ GB/s parsing throughput, O(n) single-pass analysis wherever possible, arena allocation for ephemeral analysis memory, and Rayon-based parallelism for batch operations.

5. Composable

The CLI, the Rust library, and the plugin system are each independently useful. Analyzers compose. Outputs chain. Profiles combine with formats and budgets.

vajra stats feeds vajra essence. vajra fingerprint feeds vajra drift. vajra anomalies can read from stdin in a pipeline. The library API exposes the same analyzers as the CLI, composable in Rust code without the CLI overhead.

6. Minimal Assumption

The core engine assumes nothing about the domain, the schema, or the purpose of the data. It analyzes structure, statistics, and deviation from population norms. It does not know what a “claim” is. It does not know what “E11.9” means. It does not know that allowed_amount should never be null.

Domain intelligence is real and valuable — but it enters through plugins (vajra-domain-med) and concern profiles (--profile auditor), never through hardcoded logic in the analysis pipeline.

This separation is what makes Vajra universal. The same engine that analyzes medical claims also analyzes IoT sensor payloads, financial transactions, API responses, and configuration files — because it never assumed it was analyzing any of them.

What Vajra Is NOT

Precision requires boundaries. Vajra is not:

A replacement for jq. jq transforms JSON. Vajra analyzes and reduces it. They are complementary, not competitive. Use jq to reshape; use Vajra to understand.
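A probabilistic summarizer. Every reduction Vajra performs is deterministic and explainable. There is no language model in the pipeline. There is no unseeded sampling. Where streaming mode substitutes sketch-based approximations, they are reproducible: the same input yields the same estimate.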
A database or data store. Vajra is ephemeral. It reads, analyzes, and emits. It does not persist data, cache results, or maintain state between runs.
A schema registry. Vajra infers schema characteristics — it does not define or enforce them. It tells you what shape the data has, not what shape it should have.
A GUI or BI platform. Vajra is a CLI and a library. It renders text, JSON, Markdown, and compact-AI output. Visualization is left to tools that specialize in it.
A data transformation tool. Vajra never rewrites source data. It reads. It analyzes. It emits results. The input is sacred.
A validator or linter. Vajra does not check against rules you define. It discovers what the data is and what deviates from what the data normally is. The difference is fundamental.

The Category Vajra Creates

There is no existing category that accurately describes Vajra. The closest neighbors are:

Structured-data observability. Like application observability (metrics, traces, logs) but for the data itself. What is the shape of this payload? What changed since yesterday? What is anomalous in this batch?

Semantic reduction. Not summarization (which loses information probabilistically) but reduction (which compresses information deterministically, preserving all signal above a configurable threshold).

Operational cognition tooling. Tools that make the shape of complex data legible to the humans and AI systems that depend on it.

Vajra sits at the intersection of these three. It is the first tool built specifically to occupy this space.

The Mantra

Break noise. Preserve truth.

Every decision in Vajra flows from these four words. Noise is representational redundancy, structural boilerplate, repeated motifs, and cognitive overhead. Truth is anomalies, deviations, relationships, and operational signal.

The essence is what remains when the noise is broken and the truth is preserved.

Commands

Vajra ships 11 commands. Each does one thing. They compose.

Reference Table

Command	Purpose	Input	Key Output
`inspect`	Full structural analysis	Single document	Paths, types, fingerprints, domain hints
`stats`	Statistical summary	Single document	Entropy, frequency, distributions, null rates
`anomalies`	Anomaly detection	Single or batch	Outliers, rarity, structural deviations
`fingerprint`	Structural fingerprints	Single document	BLAKE3 hashes, MinHash signature
`essence`	Concern-oriented reduction	Single document	Compressed, ranked, profile-shaped output
`drift`	Schema drift detection	Two documents	Added/removed paths, type changes, JSD
`cluster`	Similarity clustering	Multiple documents	Cluster assignments, centroids, outliers
`invariants`	Cross-field relationships	Single or batch	Conditional entropy, PMI, dependencies
`query`	Path-based query with analysis functions	Single document	Filtered analysis results
`batch`	Parallel batch analysis	Directory	Aggregated stats, per-file summaries
`profiles`	List available profiles	None	Built-in and custom profile descriptions

Global Flags

Every command accepts these flags:

--format <text|json|markdown|compact-ai>   Output format (default: text)
--profile <name>                           Concern profile (default: engineer)
--config <path>                            Path to TOML config with custom profiles
--budget <N>                               Token budget for essence output
--streaming                                Force streaming mode (bounded memory)
--input-format <format>                    Override input format auto-detection
--redact                                   Apply built-in redaction patterns
--quiet                                    Suppress progress output
--explain                                  Include score decomposition in output

Quick Examples

Inspect

vajra inspect claim.json
vajra inspect claim.json --format json
cat payload.json | vajra inspect -

Stats

vajra stats claim.json
vajra stats claim.json --format json

Anomalies

vajra anomalies claim.json
vajra anomalies claims_batch.ndjson --format json

Fingerprint

vajra fingerprint claim.json
vajra fingerprint claim.json --format json

Essence

vajra essence claim.json --profile staff
vajra essence claim.json --profile ai --format compact-ai --budget 500
vajra essence claim.json --profile auditor --format markdown

Drift

vajra drift v1.json v2.json
vajra drift baseline.json candidate.json --format json

Cluster

vajra cluster batch/*.json
vajra cluster file1.json file2.json file3.json --format json

Invariants

vajra invariants claims_batch.ndjson
vajra invariants claims_batch.ndjson --top-k 100

Query

vajra query claim.json 'entropy($.claims[*].status) > 0.5'
vajra query claim.json '$.claims[*].service_lines[*].charge_amount'

Batch

vajra batch ./claims_directory/
vajra batch ./claims_directory/ --format json --profile auditor

Profiles

vajra profiles
vajra profiles --config custom.toml

Input Conventions

All commands that accept <input> understand:

File path: claim.json, ./data/payload.yaml
Stdin: - (pipe data in)
Directory: ./batch/ (processes all supported files)
Compressed: .json.gz, .json.zst (auto-decompressed)
HTTP URL: https://api.example.com/data.json (fetched, then analyzed)

Format is auto-detected from extension and content. Override with --input-format.

See Input Formats for the full list.

Output Conventions

All commands emit to stdout. All commands support --format json for machine-readable output. Diagnostics and errors go to stderr.

The --explain flag adds score decomposition to essence and anomaly output — showing exactly which dimensions contributed to each observation’s ranking.

The --redact flag applies built-in pattern redaction (SSN, email, phone, credit card) before any output is rendered. The essence never sees unredacted values.

inspect

The foundational command. inspect performs full structural analysis of a JSON document and reports every path, every type, every fingerprint, and every domain-recognized value it finds.

This is the command you reach for first. Before you know what you are looking for, inspect tells you what is there.

Usage

vajra inspect <input> [flags]

Arguments:

Argument	Description
`<input>`	Path to a JSON file, `-` for stdin, or an HTTP URL

Flags:

Flag	Description	Default
`--format <fmt>`	Output format: `text`, `json`, `markdown`, `compact-ai`	`text`
`--input-format <fmt>`	Override auto-detected input format	auto
`--streaming`	Force streaming mode (bounded memory)	off
`--redact`	Apply built-in redaction before output	off
`--quiet`	Suppress progress output	off

What It Reports

Document Metadata

Total node count, maximum nesting depth, number of distinct wildcard paths, raw byte size.

Wildcard Path Table

Every distinct path in the document, normalized with [*] for array indices. For each path:

Dominant type — the most common JSON type at that path
Count — how many times that path appears across the document
Type instability — fraction of observations where the type differs from the dominant type (0.0 = perfectly stable)
Null rate — fraction of observations that are null
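
The path table rests on two small operations: collapsing concrete array indices into `[*]`, and measuring how often a path's type departs from its dominant type. A minimal sketch of both, with hypothetical function names and type labels passed as plain strings; how Vajra treats nulls when picking the dominant type is not stated here, so this sketch simply counts every label it is given.

```rust
use std::collections::BTreeMap;

/// Replace every numeric array index such as `[13]` with `[*]`.
fn normalize_path(path: &str) -> String {
    let mut out = String::with_capacity(path.len());
    let mut chars = path.chars().peekable();
    while let Some(c) = chars.next() {
        out.push(c);
        if c == '[' {
            // Collect everything up to (not including) the closing bracket.
            let mut inner = String::new();
            while let Some(&next) = chars.peek() {
                if next == ']' {
                    break;
                }
                inner.push(next);
                chars.next();
            }
            if !inner.is_empty() && inner.chars().all(|d| d.is_ascii_digit()) {
                out.push('*');
            } else {
                out.push_str(&inner);
            }
        }
    }
    out
}

/// Fraction of observations whose type differs from the most common type.
fn type_instability(types: &[&str]) -> f64 {
    if types.is_empty() {
        return 0.0;
    }
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for ty in types {
        *counts.entry(*ty).or_insert(0) += 1;
    }
    let dominant = counts.values().copied().max().unwrap_or(0);
    (types.len() - dominant) as f64 / types.len() as f64
}

fn main() {
    println!("{}", normalize_path("$.claims[0].service_lines[13].status"));
    // Three numbers and one string at the same path.
    println!("{:.4}", type_instability(&["number", "number", "string", "number"]));
}
```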

Structural Fingerprints

Three BLAKE3-based fingerprints:

Path set fingerprint — hash of the sorted set of distinct wildcard paths. Captures what fields exist.
Typed path fingerprint — hash of sorted (path, dominant_type) pairs. Captures what fields exist and what types they carry.
Shape fingerprint — Merkle subtree hash computed bottom-up. Captures the full structural shape including nesting.
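
A minimal sketch of the first and third fingerprints, under loud assumptions: FNV-1a (64-bit) stands in for BLAKE3 so the example needs no crates, the `Node` type is a hypothetical reduction of a JSON tree to its types, and sorting object keys before hashing is this sketch's choice rather than a documented property of Vajra.

```rust
use std::collections::BTreeSet;

/// FNV-1a, 64-bit. A stand-in for BLAKE3 so the sketch needs no crates.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Hypothetical JSON-like tree in which values are reduced to their type name.
enum Node {
    Leaf(&'static str),
    Array(Vec<Node>),
    Object(Vec<(&'static str, Node)>),
}

/// Path set fingerprint: hash of the sorted, de-duplicated wildcard paths.
fn path_set_fingerprint(paths: &[&str]) -> u64 {
    let sorted: BTreeSet<&str> = paths.iter().copied().collect();
    let joined = sorted.into_iter().collect::<Vec<_>>().join("\n");
    fnv1a(joined.as_bytes())
}

/// Shape fingerprint: each node's hash is built from its children's hashes.
fn shape_hash(node: &Node) -> u64 {
    let mut buf: Vec<u8> = Vec::new();
    match node {
        Node::Leaf(ty) => {
            buf.push(b'L');
            buf.extend_from_slice(ty.as_bytes());
        }
        Node::Array(items) => {
            buf.push(b'A');
            for item in items {
                buf.extend_from_slice(&shape_hash(item).to_le_bytes());
            }
        }
        Node::Object(fields) => {
            buf.push(b'O');
            // Sort by key so field order in the source does not matter.
            let mut sorted: Vec<&(&'static str, Node)> = fields.iter().collect();
            sorted.sort_by_key(|field| field.0);
            for (key, child) in sorted {
                buf.extend_from_slice(key.as_bytes());
                buf.push(0);
                buf.extend_from_slice(&shape_hash(child).to_le_bytes());
            }
        }
    }
    fnv1a(&buf)
}

fn main() {
    let a = Node::Object(vec![
        ("id", Node::Leaf("string")),
        ("lines", Node::Array(vec![Node::Leaf("number"), Node::Leaf("number")])),
    ]);
    // Same fields in a different order: the shape hash is unchanged.
    let b = Node::Object(vec![
        ("lines", Node::Array(vec![Node::Leaf("number"), Node::Leaf("number")])),
        ("id", Node::Leaf("string")),
    ]);
    assert_eq!(shape_hash(&a), shape_hash(&b));
    println!("path set: {:016x}", path_set_fingerprint(&["$.lines[*]", "$.id", "$"]));
    println!("shape:    {:016x}", shape_hash(&a));
}
```

Because every subtree gets its own hash on the way up, two subtrees with equal hashes have the same shape, which is how repeated motifs can be spotted without a second pass.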

Domain Type Recognition

Values matched against domain-specific type recognizers (e.g., the medical plugin recognizes ICD-10-CM codes, CPT codes, NPI numbers). Each match reports the path, the value, and the recognized type.

Example: Text Output

vajra inspect claim.json

=== Document Metadata ===
  Total nodes:    847
  Max depth:      6
  Distinct paths: 23
  Raw size:       14208 bytes

=== Wildcard Paths ===
  PATH                                              TYPE      COUNT  INSTABILITY  NULL_RATE
  $                                                 object        1       0.0000     0.0000
  $.claims                                          array         1       0.0000     0.0000
  $.claims[*]                                       object        1       0.0000     0.0000
  $.claims[*].claim_id                              string        1       0.0000     0.0000
  $.claims[*].patient                               object        1       0.0000     0.0000
  $.claims[*].patient.id                            string        1       0.0000     0.0000
  $.claims[*].patient.name                          string        1       0.0000     0.0000
  $.claims[*].diagnosis                             array         1       0.0000     0.0000
  $.claims[*].diagnosis[*]                          object        2       0.0000     0.0000
  $.claims[*].diagnosis[*].code                     string        2       0.0000     0.0000
  $.claims[*].diagnosis[*].system                   string        2       0.0000     0.0000
  $.claims[*].service_lines                         array         1       0.0000     0.0000
  $.claims[*].service_lines[*]                      object       14       0.0000     0.0000
  $.claims[*].service_lines[*].procedure_code       string       14       0.0000     0.0000
  $.claims[*].service_lines[*].charge_amount        number       14       0.0000     0.0000
  $.claims[*].service_lines[*].allowed_amount       number       11       0.0000     0.2143
  $.claims[*].service_lines[*].status               string       14       0.0000     0.0000
  $.claims[*].service_lines[*].service_date         string       14       0.0000     0.0000
  $.claims[*].service_lines[*].adjustment           object       14       0.0000     0.0000
  $.claims[*].service_lines[*].adjustment.reason    string       14       0.0000     0.0000
  $.claims[*].service_lines[*].adjustment.amount    number       14       0.0000     0.0000
  $.claims[*].provider.npi                          string        1       0.0000     0.0000
  $.claims[*].subscriber.member_id                  string        1       0.0000     0.0000

=== Fingerprints ===
  Path set:    a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2
  Typed path:  f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8
  Shape:       1234abcd56781234abcd56781234abcd56781234abcd56781234abcd56781234

=== Domain Type Recognition ===
  $.claims[*].diagnosis[*].code                  E11.9      ICD-10-CM
  $.claims[*].diagnosis[*].code                  J44.1      ICD-10-CM
  $.claims[*].service_lines[*].procedure_code    99213      CPT
  $.claims[*].service_lines[*].procedure_code    99214      CPT
  $.claims[*].provider.npi                       1234567890 NPI

Example: JSON Output

vajra inspect claim.json --format json

{
  "metadata": {
    "total_nodes": 847,
    "max_depth": 6,
    "distinct_paths": 23,
    "raw_size_bytes": 14208
  },
  "paths": [
    {
      "path": "$.claims[*].service_lines[*].charge_amount",
      "dominant_type": "number",
      "count": 14,
      "type_instability": 0.0,
      "null_rate": 0.0
    },
    {
      "path": "$.claims[*].service_lines[*].allowed_amount",
      "dominant_type": "number",
      "count": 11,
      "type_instability": 0.0,
      "null_rate": 0.2143
    }
  ],
  "fingerprints": {
    "path_set": "a1b2c3d4...",
    "typed_path": "f7e8d9c0...",
    "shape": "1234abcd..."
  },
  "domain_hints": [
    {
      "path": "$.claims[*].diagnosis[*].code",
      "value": "E11.9",
      "recognized_type": "ICD-10-CM"
    }
  ]
}

When to Use It

First contact with unfamiliar data. You just received a JSON payload and need to know its shape.
Schema exploration. What paths exist? What types do they carry? How stable are those types?
Domain validation. Does the medical plugin recognize the codes in this claim? Is the NPI present?
Regression gating. Fingerprint the output of an API endpoint. If the fingerprint changes, the schema changed.

Pairs Well With

stats — once you know the structure, stats tells you the statistical profile
fingerprint — if you only need the fingerprints (faster, less output)
drift — compare two inspect snapshots to find what changed
essence — when you want the compressed version, not the full inventory

stats

stats computes the statistical profile of a JSON document. Entropy, frequency distributions, numeric summaries, null rates, cardinality — the quantitative foundation that every other analysis depends on.

Where inspect tells you what exists, stats tells you how it behaves.

Usage

vajra stats <input> [flags]

Arguments:

Argument	Description
`<input>`	Path to a JSON file, `-` for stdin, or an HTTP URL

Flags:

Flag	Description	Default
`--format <fmt>`	Output format: `text`, `json`, `markdown`, `compact-ai`	`text`
`--input-format <fmt>`	Override auto-detected input format	auto
`--streaming`	Force streaming mode (sketch-based approximations)	off
`--redact`	Apply built-in redaction before output	off
`--quiet`	Suppress progress output	off
`--window <period>`	Temporal windowing: `month`, `week`, or `day`	off
`--time-field <path>`	JSONPath to timestamp field (e.g., `'$.date'`). Auto-detected if omitted.	auto

Temporal Windowing

When --window is specified, stats partitions records by time period and computes per-window statistics. Cross-window trend lines are included in the output, showing how distributions shift over time.

The --time-field flag tells Vajra which field contains the timestamp. If omitted, Vajra auto-detects by scanning for fields with date/time patterns (ISO 8601, Unix timestamps, common date formats).

vajra stats commits.ndjson --window month --time-field '$.date'

=== Statistical Summary (windowed: month) ===
Document: commits.ndjson (947 records, 8 paths)

--- Window: 2026-01 (312 records) ---
  $.files_changed
    Mean: 4.2   Median: 3.0   p95: 12.0

--- Window: 2026-02 (298 records) ---
  $.files_changed
    Mean: 5.1   Median: 4.0   p95: 15.0

--- Window: 2026-03 (337 records) ---
  $.files_changed
    Mean: 6.8   Median: 5.0   p95: 19.0

--- Cross-Window Trends ---
  $.files_changed  mean: 4.2 -> 5.1 -> 6.8 (upward, +62% over 3 months)
  $.type           "fix" share: 0.18 -> 0.24 -> 0.31 (increasing)

Windowing works with any multi-record input: NDJSON, CSV, multi-document YAML, or directories.
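
A minimal sketch of month windowing, assuming every timestamp is an ISO 8601 string whose first seven characters are `YYYY-MM`. The `monthly_means` helper and the sample records are hypothetical; Vajra's own detection covers more timestamp formats than this.

```rust
use std::collections::BTreeMap;

/// Group (ISO 8601 date, value) records by "YYYY-MM" and return per-month means.
/// Records whose date is shorter than seven bytes are skipped; a longer
/// non-date string would be grouped by its prefix, so validate upstream.
fn monthly_means(records: &[(&str, f64)]) -> BTreeMap<String, f64> {
    let mut acc: BTreeMap<String, (f64, usize)> = BTreeMap::new();
    for (date, value) in records {
        if let Some(month) = date.get(..7) {
            let slot = acc.entry(month.to_string()).or_insert((0.0, 0));
            slot.0 += *value;
            slot.1 += 1;
        }
    }
    acc.into_iter()
        .map(|(month, (sum, n))| (month, sum / n as f64))
        .collect()
}

fn main() {
    // Hypothetical (date, files_changed) records.
    let records = [
        ("2026-01-03", 2.0),
        ("2026-01-20", 4.0),
        ("2026-02-01", 5.0),
    ];
    // BTreeMap keeps the windows in chronological order.
    for (month, mean) in monthly_means(&records) {
        println!("{}  mean: {:.1}", month, mean);
    }
}
```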

What It Reports

For every wildcard path in the document:

Frequency and Cardinality

Count — total observations at this path
Cardinality — number of distinct values
Top values — the most frequent values with their counts

Entropy

Shannon entropy — H(X) in bits. Measures information content.
Normalized entropy — H(X) / log2(|support|). Scales to [0, 1] regardless of cardinality.

The entropy pair is one of the most powerful signals in the system:

Entropy	Normalized	Interpretation
0	0	Constant — single value, pure boilerplate
Low	Low	Enum-like — few distinct states
Low	High	Near-uniform over tiny support
High	Moderate	Meaningful variation — identifiers, dates, codes
High	High	Near-uniform over large support — free text, UUIDs
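
Both numbers can be computed from value counts alone. A minimal sketch with hypothetical function names; normalized entropy is taken to be 0 when only one distinct value is present, which matches the "Constant" row above and avoids dividing by log2(1) = 0.

```rust
/// Shannon entropy H(X) in bits, computed from per-value counts.
fn entropy_bits(counts: &[u64]) -> f64 {
    let total: u64 = counts.iter().sum();
    if total == 0 {
        return 0.0;
    }
    counts
        .iter()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total as f64;
            // p * log2(1 / p), written to avoid a negative zero.
            p * (total as f64 / c as f64).log2()
        })
        .sum()
}

/// H(X) / log2(|support|); defined as 0 for a support of one value or fewer.
fn normalized_entropy(counts: &[u64]) -> f64 {
    let support = counts.iter().filter(|&&c| c > 0).count();
    if support <= 1 {
        return 0.0;
    }
    entropy_bits(counts) / (support as f64).log2()
}

fn main() {
    // Two distinct values seen once each: one full bit, maximally uniform.
    let two_values = [1, 1];
    println!(
        "{:.2} bits (normalized: {:.2})",
        entropy_bits(&two_values),
        normalized_entropy(&two_values)
    ); // prints "1.00 bits (normalized: 1.00)"

    // A constant field carries no information.
    let constant = [14];
    println!(
        "{:.2} bits (normalized: {:.2})",
        entropy_bits(&constant),
        normalized_entropy(&constant)
    ); // prints "0.00 bits (normalized: 0.00)"
}
```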

Missingness

Null rate — fraction of observations that are JSON null
Absent rate — fraction of parent records where this path does not appear
Empty rate — fraction of values that are empty strings, empty arrays, or empty objects

Numeric Distributions (for numeric paths)

Min, max, mean, median
Percentiles — p01, p05, p25, p50, p75, p95, p99
MAD — Median Absolute Deviation (robust dispersion)
Skewness proxy — (mean - median) / MAD
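
A minimal sketch of the robust statistics listed above, with hypothetical function names and made-up sample values. The `modified_z` helper uses the common 0.6745 scaling (the Iglewicz-Hoaglin modified z-score); that constant is an assumption here, since the scaling Vajra applies to its MAD-based scores is not stated.

```rust
/// Median of a slice; None when the slice is empty.
fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// Median Absolute Deviation: the median of |x - median(x)|.
fn mad(values: &[f64]) -> Option<f64> {
    let med = median(values)?;
    let deviations: Vec<f64> = values.iter().map(|x| (x - med).abs()).collect();
    median(&deviations)
}

/// Skewness proxy from the list above: (mean - median) / MAD.
fn skew_proxy(values: &[f64]) -> Option<f64> {
    let (med, spread) = (median(values)?, mad(values)?);
    if spread == 0.0 {
        return None;
    }
    let mean = values.iter().sum::<f64>() / values.len() as f64;
    Some((mean - med) / spread)
}

/// Modified z-score with the conventional 0.6745 scaling (an assumption).
fn modified_z(x: f64, med: f64, spread: f64) -> f64 {
    0.6745 * (x - med) / spread
}

fn main() {
    // Hypothetical values with one extreme observation.
    let values = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 500.0];
    let med = median(&values).unwrap(); // 12
    let spread = mad(&values).unwrap(); // 1
    println!("median = {}, MAD = {}", med, spread);
    println!("score for 500 = {:.1}", modified_z(500.0, med, spread));

    // Corrupt three of seven values: median and MAD move, but stay small.
    let corrupted = [10.0, 12.0, 11.0, 13.0, 9999.0, 9999.0, 9999.0];
    println!(
        "median = {}, MAD = {}",
        median(&corrupted).unwrap(), // 13
        mad(&corrupted).unwrap()     // 3
    );
}
```

The second data set is the breakdown point in action: with fewer than half the values corrupted, neither statistic is dragged toward 9999, whereas the mean and standard deviation would be.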

Type Distribution

Breakdown of JSON types observed at each path (e.g., 98% number, 2% string)
Type instability score — fraction of observations deviating from the dominant type

Example: Text Output

vajra stats claim.json

=== Statistical Summary ===
Document: claim.json (847 nodes, 23 paths)

--- $.claims[*].service_lines[*].charge_amount ---
  Count:       14
  Cardinality: 12
  Entropy:     3.41 bits (normalized: 0.88)
  Type:        number (100%)
  Min:         45.00    Max: 1250.00
  Mean:        312.50   Median: 285.00
  MAD:         195.00
  p25:         125.00   p75: 425.00
  p95:         890.00   p99: 1125.00

--- $.claims[*].service_lines[*].status ---
  Count:       14
  Cardinality: 3
  Entropy:     1.09 bits (normalized: 0.69)
  Type:        string (100%)
  Top values:
    "adjudicated"  10 (71.4%)
    "pending"        3 (21.4%)
    "denied"         1 (7.1%)

--- $.claims[*].service_lines[*].allowed_amount ---
  Count:       11
  Cardinality: 9
  Entropy:     3.12 bits (normalized: 0.93)
  Type:        number (100%)
  Null rate:   0.000
  Absent rate: 0.214  ** notable: missing in 3 of 14 service lines **
  Min:         32.00    Max: 875.00
  Mean:        245.30   Median: 210.00
  MAD:         142.00

--- $.claims[*].diagnosis[*].code ---
  Count:       2
  Cardinality: 2
  Entropy:     1.00 bits (normalized: 1.00)
  Type:        string (100%)
  Top values:
    "E11.9"  1 (50.0%)
    "J44.1"  1 (50.0%)

--- $.claims[*].service_lines[*].adjustment.reason ---
  Count:       14
  Cardinality: 4
  Entropy:     1.56 bits (normalized: 0.78)
  Type:        string (100%)
  Top values:
    "CO-45"   8 (57.1%)
    "CO-97"   3 (21.4%)
    "PR-1"    2 (14.3%)
    "OA-23"   1 (7.1%)

Example: JSON Output

vajra stats claim.json --format json

{
  "document": "claim.json",
  "total_nodes": 847,
  "distinct_paths": 23,
  "paths": {
    "$.claims[*].service_lines[*].charge_amount": {
      "count": 14,
      "cardinality": 12,
      "entropy": 3.41,
      "normalized_entropy": 0.88,
      "types": {"number": 14},
      "null_rate": 0.0,
      "absent_rate": 0.0,
      "numeric": {
        "min": 45.0,
        "max": 1250.0,
        "mean": 312.5,
        "median": 285.0,
        "mad": 195.0,
        "percentiles": {
          "p01": 45.0, "p05": 52.0, "p25": 125.0,
          "p50": 285.0, "p75": 425.0, "p95": 890.0, "p99": 1125.0
        }
      },
      "top_values": [
        {"value": "285.00", "count": 2},
        {"value": "125.00", "count": 2}
      ]
    },
    "$.claims[*].service_lines[*].status": {
      "count": 14,
      "cardinality": 3,
      "entropy": 1.22,
      "normalized_entropy": 0.77,
      "types": {"string": 14},
      "null_rate": 0.0,
      "absent_rate": 0.0,
      "top_values": [
        {"value": "adjudicated", "count": 10},
        {"value": "pending", "count": 3},
        {"value": "denied", "count": 1}
      ]
    }
  }
}

When to Use It

Understanding data distributions. What does the charge_amount field actually look like? What are the common status values? How much entropy does this field carry?
Finding hidden nulls and absences. A field with 21% absent rate across service lines is operationally significant — stats surfaces this.
Establishing baselines. Run stats on today’s batch. Run it again tomorrow. Compare the distributions manually or feed them to drift.
Identifying enum-like fields. Low cardinality + low entropy = enum. High cardinality + high entropy = identifier. stats makes this distinction quantitative.

Pairs Well With

inspect — structural overview before statistical deep dive
anomalies — stats computes the distributions; anomalies flags what deviates from them
essence — the essence builder uses stats internally to score observation importance
invariants — cross-field analysis builds on per-field statistics

anomalies

anomalies surfaces records, fields, and structural elements that deviate meaningfully from the population. It does this across four dimensions — numeric outliers, rarity, structural deviation, and type instability — using only deterministic, interpretable methods.

No training data. No labeled examples. No rules to configure. Feed it cold data and it finds what deviates from what the data says is normal.

Usage

vajra anomalies <input> [flags]

Arguments:

Argument	Description
`<input>`	Path to a JSON file, NDJSON batch, `-` for stdin, or directory

Flags:

Flag	Description	Default
`--format <fmt>`	Output format: `text`, `json`, `markdown`, `compact-ai`	`text`
`--input-format <fmt>`	Override auto-detected input format	auto
`--streaming`	Force streaming mode	off
`--redact`	Apply built-in redaction before output	off
`--explain`	Include score decomposition for each anomaly	off
`--quiet`	Suppress progress output	off

The Four Dimensions

Dimension 1: Numeric Outliers

Method: MAD-based modified z-scores.

For every numeric path, Vajra computes the median and the Median Absolute Deviation (MAD). Values where the modified z-score exceeds the threshold (default 3.5) are flagged.

z_MAD = 0.6745 * (value - median) / MAD

MAD has a 50% breakdown point: nearly half the data can be arbitrarily corrupted before the estimate becomes meaningless. Standard deviation has a 0% breakdown point, so a single extreme value can distort it. This distinction matters when the data you are analyzing might contain the very outliers you are trying to detect.
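The modified z-score rule can be sketched as follows. This is a minimal Python illustration of the formula above; the function name and the handling of a zero MAD are assumptions.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Return (index, value, z_MAD) for values whose |z_MAD| exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # assumption: no dispersion means nothing can be scored
    flagged = []
    for i, v in enumerate(values):
        z = 0.6745 * (v - med) / mad
        if abs(z) > threshold:
            flagged.append((i, v, z))
    return flagged

charges = [10, 12, 11, 13, 12, 11, 500]
print(mad_outliers(charges))
```

Because the median and MAD barely move when 500 is added, the extreme value stands out clearly instead of inflating its own baseline.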

Dimension 2: Rarity Outliers

Method: self-information scoring.

For each (path, value) pair:

rarity = -log2(frequency / total)

A value seen once in 10,000 records scores ~13.3 bits. A value seen in half the records scores 1 bit. The threshold adapts per path: values exceeding mean_rarity + 2 * MAD_of_rarity are flagged.
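A sketch of rarity scoring with the adaptive cutoff. The text does not say whether the mean and MAD are taken over distinct values or over observations; this example assumes distinct values, and the function names are hypothetical.

```python
import math
import statistics
from collections import Counter

def rarity_bits(frequency, total):
    """Self-information of a value seen `frequency` times in `total` observations."""
    return -math.log2(frequency / total)

def rarity_outliers(values):
    counts = Counter(values)
    scores = {v: rarity_bits(c, len(values)) for v, c in counts.items()}
    per_value = list(scores.values())  # assumption: one score per distinct value
    med = statistics.median(per_value)
    mad = statistics.median([abs(s - med) for s in per_value])
    cutoff = statistics.fmean(per_value) + 2 * mad
    return {v: s for v, s in scores.items() if s > cutoff}

statuses = ["a"] * 50 + ["b"] * 40 + ["c"] * 9 + ["d"]
print(round(rarity_bits(1, 10_000), 1))
print(rarity_outliers(statuses))
```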

Dimension 3: Structural Deviations

Method: Jaccard distance from the dominant path set.

In batch analysis, Vajra computes the most common set of paths (the structural mode). Each document is compared:

structural_anomaly = 1 - Jaccard(doc_paths, mode_paths)

Documents with structural anomaly > 0.2 are flagged, with the specific missing and extra paths listed.
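A minimal sketch of the structural check, assuming each document has already been reduced to its set of wildcard paths. The function name and the report layout are illustrative, not Vajra's API.

```python
from collections import Counter

def structural_anomalies(docs, threshold=0.2):
    """docs maps a record id to its set of wildcard paths."""
    # Structural mode: the most common path set in the batch
    mode_paths, _ = Counter(frozenset(p) for p in docs.values()).most_common(1)[0]
    report = {}
    for rid, paths in docs.items():
        paths = set(paths)
        union = paths | mode_paths
        distance = 1 - len(paths & mode_paths) / len(union) if union else 0.0
        if distance > threshold:
            report[rid] = {
                "distance": distance,
                "missing": sorted(mode_paths - paths),
                "extra": sorted(paths - mode_paths),
            }
    return report

base = {"$.a", "$.b", "$.c", "$.d", "$.e"}
print(structural_anomalies({1: base, 2: base, 3: {"$.a", "$.b", "$.c"}}))
```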

Dimension 4: Type Instability

Method: per-path type instability score.

instability = 1 - (dominant_type_count / total_observations)

Paths with instability > 0.01 are flagged. Individual records contributing the minority type are identified.
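The instability score is simple to compute once values are mapped to JSON types. This sketch assumes values have been parsed into Python objects; `json_type` and `type_instability` are hypothetical helper names.

```python
from collections import Counter

def json_type(v):
    if v is None:
        return "null"
    if isinstance(v, bool):  # check bool before int: True is an int in Python
        return "boolean"
    if isinstance(v, (int, float)):
        return "number"
    if isinstance(v, str):
        return "string"
    return "array" if isinstance(v, list) else "object"

def type_instability(values):
    """Return (dominant type, 1 - dominant_count / total)."""
    counts = Counter(json_type(v) for v in values)
    dominant, dominant_count = counts.most_common(1)[0]
    return dominant, 1 - dominant_count / len(values)

quantities = [1] * 98 + ["2"] * 2
print(type_instability(quantities))
```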

Example: Text Output

vajra anomalies claims_batch.ndjson

=== Anomaly Report ===
Records analyzed: 1,247
Anomalies found:  8

--- Numeric Outliers ---
  $.claims[*].service_lines[*].charge_amount
    Record 834: 47,250.00 (z_MAD = 6.3, median = 285.00, MAD = 195.00)
    Record 1102: 0.01 (z_MAD = -4.8, median = 285.00, MAD = 195.00)

  $.claims[*].service_lines[*].allowed_amount
    Record 834: 45,000.00 (z_MAD = 5.9, median = 210.00, MAD = 142.00)

--- Rarity Outliers ---
  $.claims[*].status
    Record 419: "voided" (10.3 bits, 1 of 1,247 records)

  $.claims[*].service_lines[*].adjustment.reason
    Record 77: "N-832" (13.1 bits, 2 of 17,458 service lines)

--- Structural Deviations ---
  Record 662: Jaccard distance 0.31 from structural mode
    Missing paths:
      $.claims[*].subscriber.group_number
      $.claims[*].subscriber.member_id
      $.claims[*].provider.npi
      $.claims[*].provider.taxonomy

--- Type Instability ---
  $.claims[*].service_lines[*].quantity
    Records 88, 204, 917: string where number expected
    Instability: 0.002 (3 of 1,247 records)

Example: JSON Output

vajra anomalies claims_batch.ndjson --format json

{
  "records_analyzed": 1247,
  "anomaly_count": 8,
  "numeric_outliers": [
    {
      "path": "$.claims[*].service_lines[*].charge_amount",
      "record": 834,
      "value": 47250.0,
      "z_mad": 6.3,
      "median": 285.0,
      "mad": 195.0
    },
    {
      "path": "$.claims[*].service_lines[*].charge_amount",
      "record": 1102,
      "value": 0.01,
      "z_mad": -4.8,
      "median": 285.0,
      "mad": 195.0
    }
  ],
  "rarity_outliers": [
    {
      "path": "$.claims[*].status",
      "record": 419,
      "value": "voided",
      "self_information_bits": 10.3,
      "frequency": 1,
      "total": 1247
    }
  ],
  "structural_deviations": [
    {
      "record": 662,
      "jaccard_distance": 0.31,
      "missing_paths": [
        "$.claims[*].subscriber.group_number",
        "$.claims[*].subscriber.member_id",
        "$.claims[*].provider.npi",
        "$.claims[*].provider.taxonomy"
      ],
      "extra_paths": []
    }
  ],
  "type_instability": [
    {
      "path": "$.claims[*].service_lines[*].quantity",
      "records": [88, 204, 917],
      "expected_type": "number",
      "actual_type": "string",
      "instability": 0.002
    }
  ]
}

Example: With --explain

vajra anomalies claims_batch.ndjson --explain

--- Numeric Outliers ---
  $.claims[*].service_lines[*].charge_amount
    Record 834: 47,250.00
      z_MAD:       6.3
      median:      285.00
      MAD:         195.00
      threshold:   3.5
      score decomposition:
        rarity:             0.82
        instability:        0.00
        entropy_signal:     0.34
        structural_coverage: 0.15
        anomaly_strength:   0.95
        concern_relevance:  0.40
        composite:          0.71

When to Use It

Cold data triage. You received a batch of claims and need to know what is unusual before reading any of them.
Fraud screening. The --profile fraud variant amplifies rarity and numeric outlier weights. Unusual charge amounts, rare status values, and missing provider fields all surface.
Data quality monitoring. Run anomalies on each day’s batch in CI. If the anomaly count spikes, something changed upstream.
Pre-audit preparation. Give auditors the anomaly report alongside the raw data. They know where to look.

Pairs Well With

stats — anomalies are scored against the statistical baseline that stats computes
essence — anomalies feed into the essence as high-priority observations
drift — anomalies detect deviations within a batch; drift detects changes between batches
cluster — structural deviations often indicate documents that belong to different clusters

fingerprint

fingerprint computes structural fingerprints for a JSON document — cryptographic hashes that capture what the document looks like independently of its values.

Two documents with the same fingerprint have the same structure. If the fingerprint changes, the schema changed. This is the fastest possible regression check.

Usage

vajra fingerprint <input> [flags]

Arguments:

Argument	Description
`<input>`	Path to a JSON file, `-` for stdin, or an HTTP URL

Flags:

Flag	Description	Default
`--format <fmt>`	Output format: `text`, `json`, `markdown`, `compact-ai`	`text`
`--input-format <fmt>`	Override auto-detected input format	auto
`--streaming`	Force streaming mode	off
`--redact`	Apply built-in redaction before output	off
`--quiet`	Suppress progress output	off

Fingerprint Types

Path Set Fingerprint

BLAKE3 hash of the sorted set of distinct wildcard paths. Captures what fields exist, ignoring their types and values.

Two documents with the same path set fingerprint have identical field structures — the same keys at the same nesting levels, even if every value differs.

Typed Path Fingerprint

BLAKE3 hash of sorted (path, dominant_type) pairs. Captures what fields exist and what types they carry.

This is strictly more specific than the path set fingerprint. A type migration (e.g., quantity changing from string to number) changes the typed path fingerprint but not the path set fingerprint.
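To make the difference between the two fingerprints concrete, here is a rough Python sketch. It substitutes SHA-256 from the standard library for BLAKE3 (which Python does not ship), and the path syntax, separator, and helper names are assumptions rather than Vajra's exact encoding.

```python
import hashlib
from collections import Counter, defaultdict

def json_type(v):
    if v is None:
        return "null"
    if isinstance(v, bool):
        return "boolean"
    if isinstance(v, (int, float)):
        return "number"
    if isinstance(v, str):
        return "string"
    return "array" if isinstance(v, list) else "object"

def collect(node, path="$", acc=None):
    """Map each wildcard path to a Counter of the types seen there."""
    acc = defaultdict(Counter) if acc is None else acc
    acc[path][json_type(node)] += 1
    if isinstance(node, dict):
        for key, child in node.items():
            collect(child, f"{path}.{key}", acc)
    elif isinstance(node, list):
        for child in node:
            collect(child, f"{path}[*]", acc)
    return acc

def fingerprints(doc):
    acc = collect(doc)
    paths = sorted(acc)
    typed = [f"{p}:{acc[p].most_common(1)[0][0]}" for p in paths]
    digest = lambda items: hashlib.sha256("\n".join(items).encode()).hexdigest()
    return digest(paths), digest(typed)  # SHA-256 stands in for BLAKE3 here

a = fingerprints({"qty": "3", "id": 1})
b = fingerprints({"id": 99, "qty": 7})
print(a[0] == b[0], a[1] == b[1])
```

The two documents share a path set fingerprint because they have the same keys, but the typed path fingerprint differs because `qty` moved from string to number.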

Shape Fingerprint (Merkle)

Bottom-up hash computed via Merkle subtree hashing:

Leaf nodes hash their type
Objects hash the sorted concatenation of (key, child_hash) pairs
Arrays hash the concatenation of child hashes

The root hash is the shape fingerprint. This captures the full structural shape including nesting hierarchy.

A critical secondary benefit: subtree hashes at every node enable motif detection as a byproduct. Identical subtrees produce identical hashes. This falls out of a single O(n) traversal.
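The bottom-up hashing and the motif counting that falls out of it can be sketched as below. SHA-256 is used as a stand-in hash, and the payload encoding (`obj{...}`, `arr[...]`, `leaf:`) is an illustrative assumption.

```python
import hashlib
from collections import Counter

def leaf_type(v):
    if v is None:
        return "null"
    if isinstance(v, bool):
        return "boolean"
    return "number" if isinstance(v, (int, float)) else "string"

def shape_hash(node, motifs):
    """Hash a subtree bottom-up and count container subtree hashes in `motifs`."""
    if isinstance(node, dict):
        parts = [f"{k}:{shape_hash(node[k], motifs)}" for k in sorted(node)]
        payload = "obj{" + ",".join(parts) + "}"
    elif isinstance(node, list):
        payload = "arr[" + ",".join(shape_hash(c, motifs) for c in node) + "]"
    else:
        payload = "leaf:" + leaf_type(node)  # leaves hash their type, not their value
    digest = hashlib.sha256(payload.encode()).hexdigest()
    if isinstance(node, (dict, list)):
        motifs[digest] += 1
    return digest

doc = {"lines": [{"code": "A", "amt": 1.0}, {"amt": 2.5, "code": "B"}], "note": "x"}
motifs = Counter()
root = shape_hash(doc, motifs)
print(root[:8], motifs.most_common(1)[0][1])
```

Both line objects hash identically despite different values and key order, so the most frequent subtree hash has a count of 2.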

MinHash Signature

A 128-hash MinHash signature over the path set, enabling constant-time Jaccard similarity estimation between documents. Used internally by cluster and drift, but exposed here for direct access.
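A minimal MinHash sketch follows. It derives 128 hash functions by salting BLAKE2b from Python's `hashlib`; the hash family and the function names are assumptions, since the text does not say how Vajra seeds its hashes.

```python
import hashlib

def minhash(paths, k=128):
    """Return a k-value MinHash signature for a non-empty set of path strings."""
    signature = []
    for seed in range(k):
        salt = seed.to_bytes(8, "little")  # one salted hash function per slot
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(p.encode(), digest_size=8, salt=salt).digest(),
                "little",
            )
            for p in paths
        ))
    return signature

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = minhash({"$.id", "$.status", "$.lines[*].amt"})
b = minhash({"$.id", "$.status", "$.lines[*].code"})
print(estimate_jaccard(a, b))
```

Comparing two signatures costs a fixed 128 comparisons regardless of how many paths each document has, which is what makes the estimate constant-time.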

Example: Text Output

vajra fingerprint claim.json

=== Fingerprints ===
  Path set:    a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2
  Typed path:  f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8
  Shape:       1234abcd56781234abcd56781234abcd56781234abcd56781234abcd56781234abcd
  MinHash:     [128 x u64 values]

=== Subtree Motifs ===
  Hash d4e5f6a1... appears 14 times (service line object)
  Hash b2c3d4e5... appears 2 times (diagnosis object)

Example: JSON Output

vajra fingerprint claim.json --format json

{
  "path_set": "a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4e5f6a1b2",
  "typed_path": "f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8d9c0b1a2f7e8",
  "shape": "1234abcd56781234abcd56781234abcd56781234abcd56781234abcd56781234abcd",
  "minhash": [18446744073709551615, 12345678901234567890, "..."],
  "motifs": [
    {
      "hash": "d4e5f6a1...",
      "count": 14,
      "node_count": 8,
      "representative_path": "$.claims[*].service_lines[*]"
    },
    {
      "hash": "b2c3d4e5...",
      "count": 2,
      "node_count": 3,
      "representative_path": "$.claims[*].diagnosis[*]"
    }
  ]
}

Use Cases

CI Regression Check

Store the fingerprint of your API’s response format. On every deploy, compare:

# Capture baseline
vajra fingerprint api_response.json --format json > baseline_fp.json

# On each CI run
vajra fingerprint today_response.json --format json > current_fp.json
diff baseline_fp.json current_fp.json

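If the path set fingerprint changed, fields were added or removed. If the typed path fingerprint changed while the path set did not, a type migrated. If only the shape fingerprint changed, the typed paths are the same but their arrangement differs, for example an array gained or lost elements.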
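For a CI step that reports more than a raw diff, the comparison can be scripted. This sketch assumes the two reports have been loaded (for example with `json.load`) into dicts carrying the `path_set`, `typed_path`, and `shape` keys shown in the JSON output above; `classify_change` is a hypothetical helper.

```python
def classify_change(baseline, current):
    """Return a short description of the most significant fingerprint change, or None."""
    if baseline["path_set"] != current["path_set"]:
        return "path set changed: fields were added or removed"
    if baseline["typed_path"] != current["typed_path"]:
        return "typed path changed: a field's dominant type migrated"
    if baseline["shape"] != current["shape"]:
        return "only the shape fingerprint changed"
    return None

baseline = {"path_set": "aa", "typed_path": "bb", "shape": "cc"}
current = {"path_set": "aa", "typed_path": "b2", "shape": "c2"}
print(classify_change(baseline, current))
```

The checks run from coarsest to finest because a path set change necessarily changes the typed path fingerprint as well.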

Quick Structural Comparison

vajra fingerprint file_a.json --format json | jq .path_set
vajra fingerprint file_b.json --format json | jq .path_set

Same hash? Same structure. Different hash? Feed them to drift for the details.

Motif Discovery

The motif section reveals repeated substructures. In a medical claim, you will see the service line object repeated 14 times with the same hash — proof that those 14 elements are structurally identical.

When to Use It

Schema regression gating. The fastest way to detect structural changes.
Deduplication. Documents with identical shape fingerprints are structurally identical.
Batch pre-screening. Fingerprint a batch before clustering to quickly identify structural families.
Motif identification. What substructures repeat, and how many times?

Pairs Well With

drift — when fingerprints differ, drift tells you exactly what changed
cluster — uses MinHash signatures internally for similarity estimation
inspect — fingerprint is the focused subset of what inspect computes
essence — motif discovery feeds directly into essence compression

essence

essence is the command Vajra was built for. It takes a JSON document, runs the full analysis pipeline, scores every observation against a concern profile’s weight vector, and renders a compressed, ranked, faithful representation — shaped for whoever is reading it.

An essence is not a summary. A summary loses information probabilistically. An essence compresses information deterministically, preserving everything above a configurable importance threshold while collapsing structural noise.

Usage

vajra essence <input> [flags]

Arguments:

Argument	Description
`<input>`	Path to a JSON file, `-` for stdin, directory, or HTTP URL

Flags:

Flag	Description	Default
`--format <fmt>`	Output format: `text`, `json`, `markdown`, `compact-ai`	`text`
`--profile <name>`	Concern profile: `staff`, `engineer`, `auditor`, `ai`, `fraud`, or custom	`engineer`
`--budget <N>`	Approximate token budget for output	unlimited
`--config <path>`	Path to TOML file with custom profile definitions	none
`--input-format <fmt>`	Override auto-detected input format	auto
`--streaming`	Force streaming mode	off
`--redact`	Apply built-in redaction before rendering	off
`--explain`	Include score decomposition for each observation	off
`--quiet`	Suppress progress output	off

How Essence Construction Works

Collect candidates. All observations from the analysis pipeline — notable fields, motifs, anomalies, relationship discoveries — become candidates.
Score each candidate using the active profile’s six-dimensional weight vector:
- rarity — self-information of the observation
- instability — type instability at the path
- entropy_signal — distance from 0.5 normalized entropy (both constants and noise score high)
- structural_coverage — fraction of total nodes under this path
- anomaly_strength — maximum anomaly score across dimensions
- concern_relevance — profile-specific boost for this path or observation type
Collapse motifs. Repeated structural patterns are represented once with a count and specific variations noted.
Rank by composite score with deterministic tie-breaking (shallower paths first, then lexicographic).
Apply token budget (if --budget is set). Greedy selection by score-per-token, the density heuristic borrowed from the fractional knapsack problem.
Render using the profile’s vocabulary and rendering style.
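The scoring and ranking steps can be sketched as follows. The composite is assumed to be a plain weighted sum of the six signals, path depth is approximated by counting `.` and `[` in the path, and the signal values are invented for illustration.

```python
DIMENSIONS = ["rarity", "instability", "entropy_signal",
              "structural_coverage", "anomaly_strength", "concern_relevance"]

def composite(signals, weights):
    """Assumed composite: weighted sum over the six dimensions."""
    return sum(signals[d] * weights[d] for d in DIMENSIONS)

def rank(candidates, weights):
    def key(c):
        depth = c["path"].count(".") + c["path"].count("[")
        # Higher score first; ties go to shallower paths, then lexicographic order
        return (-composite(c["signals"], weights), depth, c["path"])
    return sorted(candidates, key=key)

weights = dict.fromkeys(DIMENSIONS, 0.15) | {"instability": 0.25}  # illustrative
flat = dict.fromkeys(DIMENSIONS, 0.5)
candidates = [
    {"path": "$.a.b", "signals": flat},
    {"path": "$.a", "signals": flat},
    {"path": "$.z", "signals": flat | {"anomaly_strength": 1.0}},
]
print([c["path"] for c in rank(candidates, weights)])
```

The tie between `$.a` and `$.a.b` is broken by depth, so the ordering is the same on every run.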

Profiles at a Glance

Profile	Vocabulary	Rendering	Emphasizes
`staff`	Plain language	Narrative sections	Anomalies, structural coverage
`engineer`	Technical, JSONPath	Tabular, list-based	Type instability, all dimensions balanced
`auditor`	Formal	Completeness-focused	Instability, concern relevance, missingness
`ai`	Compact, terse	Machine-readable	Entropy signal, structural coverage, anomalies
`fraud`	Investigative	Outlier-focused	Rarity, anomaly strength

See Profiles for full weight vectors and customization.

Example: Staff Profile

vajra essence claim.json --profile staff

=== Essence (staff profile) ===

Document Summary:
  1 claim with 14 service lines, 1 patient, 2 diagnosis codes.
  Primary status: partially adjudicated.

What Stands Out:
  - 3 service lines are missing allowed amounts (lines 2, 7, 11).
    This field is present in 79% of service lines — its absence is notable.
  - Adjustment reason code "CO-45" repeats across 8 of 14 lines.
    Repetition at this frequency suggests a systematic pattern, not random variation.
  - 1 diagnosis structure differs from the other.
    The second diagnosis carries an extra "qualifier" field.
  - Provider taxonomy code is absent.
    This field is expected in 94% of claims in typical batches.

What This Likely Means:
  - Most of the claim is consistent and well-formed.
  - A subset of service lines appears incomplete or differently processed.
  - The repeated adjustment code points to a systematic issue.

No JSONPath. No z-scores. No jargon. The staff member gets what they need to act.

Example: Engineer Profile

vajra essence claim.json --profile engineer

=== Essence (engineer profile) ===

Structure: 847 nodes, 23 distinct paths, max depth 6
Fingerprint (path set): a1b2c3d4...
Dominant motif: $.claims[*].service_lines[*] (14 instances, 8 fields each)

Notable paths:
  $.claims[*].service_lines[*].allowed_amount
    absent_rate: 0.214, entropy: 3.12, type: number (100%)
    absent in 3 of 14 service lines (indices 2, 7, 11)

  $.claims[*].service_lines[*].adjustment.reason
    entropy: 1.56, cardinality: 4
    dominant value: "CO-45" (57.1%, 8 of 14)

  $.claims[*].diagnosis[1]
    structural deviation: extra field "qualifier" (not in diagnosis[0])

Type stability: 100% across all paths
Array homogeneity: service_lines 100% (1 shape hash), diagnosis 50% (2 shape hashes)

Example: AI Profile with Token Budget

vajra essence claim.json --profile ai --format json --budget 500

{
  "vajra_essence": {
    "version": "0.1.0",
    "profile": "ai",
    "input_hash": "b3a7f2c1d4e5...",
    "structure": {
      "root_type": "object",
      "total_nodes": 847,
      "distinct_paths": 23,
      "max_depth": 6
    },
    "dominant_motif": {
      "path": "$.claims[0].service_lines[*]",
      "count": 14,
      "shape_hash": "f2c1d4e5...",
      "fields": ["procedure_code", "service_date", "charge_amount", "allowed_amount", "status", "adjustment"]
    },
    "anomalies": [
      {
        "path": "$.claims[0].service_lines[2,7,11].allowed_amount",
        "type": "missing",
        "severity": 4.2
      },
      {
        "path": "$.claims[0].diagnosis[1]",
        "type": "structural_deviation",
        "severity": 3.1
      }
    ],
    "notable": [
      {
        "path": "$.claims[0].service_lines[*].adjustment.reason",
        "observation": "value 'CO-45' in 8/14 instances (57%)"
      }
    ],
    "meta": {
      "budget_tokens": 500,
      "truncated": true,
      "observations_included": 4,
      "observations_total": 7
    }
  }
}

The AI profile collapses aggressively. Motifs are represented once with counts. Observations are sorted by score-per-token. The meta.truncated field tells the downstream model whether anything was cut.

Example: Compact-AI Format

vajra essence claim.json --profile ai --format compact-ai --budget 300

{"v":"vajra/1","n":847,"p":23,"d":6,"motif":{"p":"$.claims[0].service_lines[*]","c":14},"a":[{"p":"$.claims[0].service_lines[2,7,11].allowed_amount","t":"miss","s":4.2},{"p":"$.claims[0].diagnosis[1]","t":"struct","s":3.1}],"drill":[{"p":"$.claims[*].service_lines","avail":["stats","anomalies","motifs"]}]}

Maximum compression. Every key shortened. The drill section tells the LLM which paths have deeper analysis available for follow-up queries.

Example: With --explain

vajra essence claim.json --profile engineer --explain

Notable paths:
  $.claims[*].service_lines[*].allowed_amount
    absent_rate: 0.214, entropy: 3.12
    [score: 0.37]
      rarity:             0.42  x  weight 0.15  =  0.063
      instability:        0.00  x  weight 0.25  =  0.000
      entropy_signal:     0.24  x  weight 0.15  =  0.036
      structural_coverage: 0.18 x  weight 0.15  =  0.027
      anomaly_strength:   0.89  x  weight 0.15  =  0.134
      concern_relevance:  0.75  x  weight 0.15  =  0.113

Every score decomposed into its six dimensions. Nothing hidden. Nothing magic.

The Token Budget

When --budget N is specified, Vajra estimates the token cost of each observation (word count x 1.3) and selects greedily by score-per-token until the budget is exhausted. Greedy selection by density is optimal for the fractional knapsack problem; because observations are indivisible, it serves here as a fast approximation rather than a guaranteed optimum.

The budget is approximate, not exact. It prevents bloated output without requiring precise token counting.
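A sketch of the budgeted selection, under two assumptions the text leaves open: an observation that does not fit is skipped and selection continues with smaller ones, and ties in density are broken by text so the result is deterministic.

```python
def estimate_tokens(text):
    """Approximate token cost: word count x 1.3 (at least one word assumed)."""
    return max(1, len(text.split())) * 1.3

def select_within_budget(observations, budget):
    ranked = sorted(
        observations,
        key=lambda o: (-o["score"] / estimate_tokens(o["text"]), o["text"]),
    )
    chosen, used = [], 0.0
    for obs in ranked:
        cost = estimate_tokens(obs["text"])
        if used + cost <= budget:  # assumption: skip what does not fit, keep going
            chosen.append(obs)
            used += cost
    return chosen, used

observations = [
    {"id": "A", "score": 0.9, "text": " ".join(["w"] * 10)},
    {"id": "B", "score": 0.5, "text": "two words"},
    {"id": "C", "score": 0.8, "text": " ".join(["w"] * 20)},
]
chosen, used = select_within_budget(observations, budget=20)
print([o["id"] for o in chosen], used)
```

B is picked first despite its lower score because it delivers the most score per token; C is dropped because its 26 estimated tokens exceed what remains.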

When to Use It

Non-technical stakeholders. --profile staff translates the data into plain language.
AI pipelines. --profile ai --format compact-ai --budget 500 compresses a 1000-node document into a token-efficient context.
Audits. --profile auditor emphasizes completeness, missingness, and traceability.
Fraud screening. --profile fraud amplifies anomalies and rare patterns.
Documentation. --format markdown renders the essence as publishable documentation.

Pairs Well With

stats — the statistical baseline that feeds scoring
anomalies — anomalies are the highest-priority candidates in most profiles
drift — drift observations appear in the essence when a baseline is available
Profiles — full control over what gets emphasized and how it gets rendered

drift

drift detects and quantifies structural, type, and distributional changes between two JSON documents. It answers the question every engineer asks when something breaks: what changed?

Not what changed in the values — what changed in the shape, types, and statistical behavior of the data.

Usage

vajra drift <baseline> <candidate> [flags]

Arguments:

Argument	Description
`<baseline>`	The reference document (the “before”)
`<candidate>`	The comparison document (the “after”)

Flags:

Flag	Description	Default
`--format <fmt>`	Output format: `text`, `json`, `markdown`, `compact-ai`	`text`
`--profile <name>`	Concern profile for severity weighting	`engineer`
`--input-format <fmt>`	Override auto-detected input format	auto
`--redact`	Apply built-in redaction before output	off
`--quiet`	Suppress progress output	off
`--group-by <path>`	JSONPath for population-level comparison (e.g., `'$.author_type'`)	off

Population-Level Comparison

When --group-by is specified, drift takes a single input, partitions its records by the value at the given path, and computes pairwise drift between all groups. Instead of comparing two documents, you compare two or more subpopulations within the same dataset.

vajra drift prs.ndjson --group-by '$.author_type'

Drift Report (grouped by $.author_type)
Groups: bot (412 records), human (835 records)

Pairwise drift: bot vs human
  Structural similarity: 0.91 (Jaccard)

  Distribution shifts:
    $.files_changed              JSD: 0.42 (high)
      bot:   median 1.0, p95 3.0
      human: median 4.0, p95 18.0

    $.review_comments            JSD: 0.38 (moderate)
      bot:   median 0.0, p95 1.0
      human: median 2.0, p95 8.0

  Overall severity: HIGH (significant distributional divergence)

This is useful for comparing behavioral subgroups — bot vs. human PRs, different teams, production vs. staging, before vs. after a policy change — without needing separate files.

Drift Dimensions

Structural Drift

Path set symmetric difference:

added_paths   = paths(candidate) \ paths(baseline)
removed_paths = paths(baseline) \ paths(candidate)

New fields appearing. Old fields disappearing. The most visible form of schema evolution.

Type Drift

For each path present in both documents, the dominant type is compared. Any path where the type changed (e.g., string to number, array to object) is flagged.
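Structural and type drift reduce to set operations over path-to-type maps. In this sketch each document is assumed to be summarized as a dict from wildcard path to dominant type; the function name is hypothetical.

```python
def structural_and_type_drift(baseline, candidate):
    """Compare two {path: dominant_type} maps."""
    added = sorted(candidate.keys() - baseline.keys())
    removed = sorted(baseline.keys() - candidate.keys())
    type_changes = {
        path: (baseline[path], candidate[path])
        for path in sorted(baseline.keys() & candidate.keys())
        if baseline[path] != candidate[path]
    }
    return added, removed, type_changes

yesterday = {"$.response.items[*].quantity": "string",
             "$.response.items[*].status": "string"}
today = {"$.response.items[*].quantity": "number",
         "$.response.items[*].status": "string",
         "$.response.metadata.api_version": "string"}
print(structural_and_type_drift(yesterday, today))
```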

Distributional Drift

Jensen-Shannon Divergence (JSD) measures how much value distributions shifted between baseline and candidate:

JSD(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M)

where M = 0.5 * (P + Q).

JSD is symmetric, always finite, bounded to [0, 1] when computed with base-2 logarithms, and its square root is a proper metric. This means drift magnitudes can be meaningfully compared and accumulated across paths.

For numeric paths, Vajra also computes the 1D Wasserstein distance (earth mover’s distance), which captures how far values moved, not just that they moved.
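Both measures are short to write down. The sketch below assumes categorical distributions are given as dicts of probabilities and numeric samples as plain lists; the Wasserstein function integrates the gap between the two empirical CDFs, which is one standard way to compute the 1D distance and not necessarily how Vajra does it.

```python
import math

def jsd(p, q):
    """Jensen-Shannon divergence in bits between two {value: probability} maps."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

def wasserstein_1d(a, b):
    """Area between the empirical CDFs of two non-empty samples."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total, i, j = 0.0, 0, 0
    for left, right in zip(points, points[1:]):
        while i < len(a) and a[i] <= left:
            i += 1
        while j < len(b) and b[j] <= left:
            j += 1
        total += abs(i / len(a) - j / len(b)) * (right - left)
    return total

print(jsd({"a": 1.0}, {"b": 1.0}))          # disjoint supports
print(wasserstein_1d([1, 2, 3], [2, 3, 4]))  # every value shifted by 1
```

JSD only sees that probability mass moved between values; the Wasserstein distance also reflects how far it moved along the number line.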

Drift Classification

Each drifted path receives a classification:

Class	Meaning
`additive`	New path appeared in candidate
`subtractive`	Path present in baseline, absent in candidate
`type-mutative`	Dominant type changed
`distributional`	Value distribution shifted (JSD > threshold)
`cardinality-shift`	Array lengths changed significantly
`null-rate-shift`	Null/missing ratio changed significantly

Severity Scoring

The overall drift severity is a weighted sum of drift dimensions, tuned by the active profile:

Auditor profiles weight subtractive drift highest (missing data is critical for compliance)
Engineer profiles weight type-mutative drift highest (breaking changes)
Fraud profiles weight distributional drift highest (behavioral shifts)

Example: Text Output

vajra drift yesterday.json today.json

Drift Report: yesterday.json -> today.json
Structural similarity: 0.94 (Jaccard)

Added paths (2):
  $.response.metadata.processing_flags    [array of strings]
  $.response.metadata.api_version         [string]

Removed paths (0): none

Type changes (1):
  $.response.items[*].quantity            string -> number (clean type migration)

Distribution shifts (1):
  $.response.items[*].status              JSD: 0.34 (moderate)
    before: {"active": 0.82, "pending": 0.15, "error": 0.03}
    after:  {"active": 0.61, "pending": 0.12, "error": 0.27}
    note: "error" rate increased 9x

Null rate changes (0): none

Overall severity: MEDIUM (structural additions + significant distribution shift)

Example: JSON Output

vajra drift yesterday.json today.json --format json

{
  "baseline": "yesterday.json",
  "candidate": "today.json",
  "jaccard_similarity": 0.94,
  "overall_severity": "medium",
  "added_paths": [
    {
      "path": "$.response.metadata.processing_flags",
      "type": "array"
    },
    {
      "path": "$.response.metadata.api_version",
      "type": "string"
    }
  ],
  "removed_paths": [],
  "type_changes": [
    {
      "path": "$.response.items[*].quantity",
      "baseline_type": "string",
      "candidate_type": "number",
      "jsd": 0.0
    }
  ],
  "distribution_shifts": [
    {
      "path": "$.response.items[*].status",
      "jsd": 0.34,
      "baseline_distribution": {
        "active": 0.82,
        "pending": 0.15,
        "error": 0.03
      },
      "candidate_distribution": {
        "active": 0.61,
        "pending": 0.12,
        "error": 0.27
      }
    }
  ],
  "null_rate_changes": []
}

Example: Medical Claim Drift

vajra drift baseline_claim.json updated_claim.json --profile auditor

Drift Report: baseline_claim.json -> updated_claim.json
Structural similarity: 0.87 (Jaccard)

Added paths (3):
  $.claims[*].service_lines[*].modifier_codes     [array of strings]
  $.claims[*].rendering_provider                   [object]
  $.claims[*].rendering_provider.npi               [string]

Removed paths (1):
  $.claims[*].provider.taxonomy                    [string]
    ** SUBTRACTIVE: field present in baseline, absent in candidate **

Type changes (0): none

Distribution shifts (2):
  $.claims[*].service_lines[*].status              JSD: 0.22
    before: {"adjudicated": 0.85, "pending": 0.15}
    after:  {"adjudicated": 0.64, "pending": 0.21, "denied": 0.15}
    note: new value "denied" appeared

  $.claims[*].service_lines[*].charge_amount       Wasserstein: 125.40
    before: median 285.00, p95 890.00
    after:  median 410.00, p95 1350.00
    note: charges shifted upward

Overall severity: HIGH (subtractive drift in auditor profile)

The auditor profile flags the removed taxonomy path as high severity because subtractive drift — data that was present and is now absent — is the most dangerous form of schema evolution for compliance.

When to Use It

API version migration. Compare the response shape before and after a deploy.
Vendor data monitoring. Compare this week’s feed to last week’s. Detect undocumented schema changes before they break your pipeline.
Regulatory compliance. Prove that the data structure has not drifted outside acceptable bounds.
CI integration. Gate deploys on drift severity. If drift exceeds a threshold, fail the build and require review.

Pairs Well With

fingerprint — quick structural same-or-different check before detailed drift analysis
inspect — understand each document’s structure before comparing
anomalies — drift detects changes between versions; anomalies detect deviations within a version
essence — drift observations feed into essence generation when a baseline is provided

cluster

cluster groups similar JSON documents by structural similarity. Feed it a batch of files and it tells you how many structural families exist, which documents belong to each, and which documents are structural outliers that fit nowhere.

No predefined cluster count. No training. The algorithm discovers the natural grouping from the data.

Usage

vajra cluster <inputs...> [flags]

Arguments:

Argument	Description
`<inputs...>`	One or more JSON files, glob patterns, or directories

Flags:

Flag	Description	Default
`--format <fmt>`	Output format: `text`, `json`, `markdown`, `compact-ai`	`text`
`--input-format <fmt>`	Override auto-detected input format	auto
`--redact`	Apply built-in redaction before output	off
`--quiet`	Suppress progress output	off

How It Works

Small Batches (< 1,000 documents)

Exact pairwise Jaccard similarity over wildcard path sets:

J(A, B) = |paths(A) intersection paths(B)| / |paths(A) union paths(B)|

O(n^2) pairwise but tractable at small scale. Results are exact and deterministic.

Large Batches

MinHash + Locality-Sensitive Hashing (LSH).

During fingerprinting, each document receives a 128-hash MinHash signature.
LSH partitions each signature into bands, hashing each band into buckets.
Documents sharing a bucket in any band are candidate pairs.
Connected components in the candidate graph form initial clusters.
Within each component, exact pairwise similarity refines the grouping.

The probability curve is tuned so that documents with Jaccard similarity > 0.5 have > 98% chance of being found as candidates, while documents with similarity < 0.2 have < 2% false positive rate.

This achieves near-linear time clustering: O(n) for MinHash, O(n) amortized for LSH indexing.
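The banding and component steps can be sketched as below. The band count is a free parameter here because the text does not state Vajra's band layout; `candidate_probability` is the standard LSH S-curve for b bands of r rows, and the final exact-similarity refinement is omitted.

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidates(signatures, bands):
    """signatures maps doc id -> MinHash signature; returns candidate pairs."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

def connected_components(ids, pairs):
    parent = {i: i for i in ids}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)
    return sorted(sorted(g) for g in groups.values())

def candidate_probability(s, bands, rows):
    """Chance that a pair with Jaccard similarity s shares at least one band."""
    return 1 - (1 - s ** rows) ** bands

sigs = {"A": [1, 2, 3, 4, 5, 6, 7, 8], "B": [1, 2, 9, 9, 9, 9, 9, 9], "C": [7] * 8}
pairs = lsh_candidates(sigs, bands=4)
print(connected_components(sigs, pairs))
```

A and B agree on the first band only, which is enough to make them candidates; C shares no band with either and forms its own component.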

Example: Text Output

vajra cluster claims_batch/*.json

=== Cluster Report ===
Documents: 247
Clusters:  3

--- Cluster 0 (198 documents, 80.2%) ---
  Representative: claim_001.json
  Distinct paths: 23
  Structural signature: a1b2c3d4...
  Members: claim_001.json, claim_002.json, claim_003.json, ... (+195 more)

--- Cluster 1 (41 documents, 16.6%) ---
  Representative: claim_048.json
  Distinct paths: 27
  Structural signature: e5f6a7b8...
  Additional paths vs Cluster 0:
    $.claims[*].service_lines[*].modifier_codes
    $.claims[*].rendering_provider
    $.claims[*].rendering_provider.npi
    $.claims[*].rendering_provider.taxonomy
  Members: claim_048.json, claim_052.json, claim_067.json, ... (+38 more)

--- Cluster 2 (8 documents, 3.2%) ---
  Representative: claim_199.json
  Distinct paths: 18
  Structural signature: c9d0e1f2...
  Missing paths vs Cluster 0:
    $.claims[*].subscriber.group_number
    $.claims[*].subscriber.member_id
    $.claims[*].provider.taxonomy
    $.claims[*].service_lines[*].adjustment
    $.claims[*].service_lines[*].adjustment.reason
  Members: claim_199.json, claim_201.json, claim_215.json, ... (+5 more)
  ** Potential structural anomalies — missing common fields **

=== Similarity Matrix (cluster centroids) ===
             Cluster 0  Cluster 1  Cluster 2
  Cluster 0      1.000      0.852      0.783
  Cluster 1      0.852      1.000      0.667
  Cluster 2      0.783      0.667      1.000

Example: JSON Output

vajra cluster claims_batch/*.json --format json

{
  "document_count": 247,
  "cluster_count": 3,
  "clusters": [
    {
      "id": 0,
      "size": 198,
      "representative": "claim_001.json",
      "distinct_paths": 23,
      "structural_signature": "a1b2c3d4...",
      "members": ["claim_001.json", "claim_002.json", "..."]
    },
    {
      "id": 1,
      "size": 41,
      "representative": "claim_048.json",
      "distinct_paths": 27,
      "structural_signature": "e5f6a7b8...",
      "additional_paths": [
        "$.claims[*].service_lines[*].modifier_codes",
        "$.claims[*].rendering_provider",
        "$.claims[*].rendering_provider.npi",
        "$.claims[*].rendering_provider.taxonomy"
      ],
      "members": ["claim_048.json", "claim_052.json", "..."]
    },
    {
      "id": 2,
      "size": 8,
      "representative": "claim_199.json",
      "distinct_paths": 18,
      "structural_signature": "c9d0e1f2...",
      "missing_paths": [
        "$.claims[*].subscriber.group_number",
        "$.claims[*].subscriber.member_id",
        "$.claims[*].provider.taxonomy",
        "$.claims[*].service_lines[*].adjustment",
        "$.claims[*].service_lines[*].adjustment.reason"
      ],
      "members": ["claim_199.json", "claim_201.json", "..."]
    }
  ],
  "similarity_matrix": [
    [1.0, 0.852, 0.783],
    [0.852, 1.0, 0.667],
    [0.783, 0.667, 1.0]
  ]
}

Interpreting the Results

Large dominant cluster + small outlier clusters is the most common pattern. It means most documents share a structural template, and the outliers represent schema variants, incomplete records, or data from a different source.

Many clusters of similar size suggests multiple payload families — perhaps different message types, different API versions, or different upstream sources mixed in a single directory.

High similarity between clusters (> 0.8) means the clusters differ by only a few fields. This often indicates optional fields that are sometimes present and sometimes absent.

Low similarity between clusters (< 0.5) means fundamentally different structural families. These probably should not be processed by the same pipeline.

When to Use It

Batch triage. Before analyzing 10,000 claims, cluster them to understand how many structural families you are dealing with.
Schema variant discovery. A vendor says they send one format. Clustering reveals three.
Outlier isolation. The smallest cluster often contains the documents with missing fields or unusual structure — the ones that need manual review.
Pipeline routing. Different structural families may need different processing logic. Clustering reveals the routing keys.

Pairs Well With

fingerprint — clustering uses MinHash signatures from the fingerprinting layer
drift — compare cluster representatives to understand how the families differ
anomalies — documents in small outlier clusters are strong anomaly candidates
batch — batch analysis with clustering to segment results by structural family

invariants

invariants discovers cross-field relationships from observed data. It finds fields that predict other fields, fields that always co-occur, and fields that are functionally dependent — all without prior knowledge of the schema.

This is data archaeology. Vajra examines the statistical co-occurrence of fields and extracts the latent rules that the data obeys.

Usage

vajra invariants <input> [flags]

Arguments:

Argument	Description
`<input>`	Path to a JSON file, NDJSON batch, `-` for stdin, or directory

Flags:

Flag	Description	Default
`--top-k <N>`	Number of most frequent paths whose pairs are analyzed	50
`--format <fmt>`	Output format: `text`, `json`, `markdown`, `compact-ai`	`text`
`--input-format <fmt>`	Override auto-detected input format	auto
`--redact`	Apply built-in redaction before output	off
`--quiet`	Suppress progress output	off

The Mathematics

Conditional Entropy

For field pairs (X, Y):

H(Y|X) = -sum p(x,y) * log2(p(y|x))

Low H(Y|X) means X strongly predicts Y. If H(Y|X) approaches 0, Y is functionally determined by X — knowing X tells you Y with near-certainty.
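
A short Python sketch of the conditional entropy formula, computed from observed (x, y) value pairs. The function name and sample data are illustrative only.

```python
import math
from collections import Counter


def conditional_entropy(pairs: list[tuple[str, str]]) -> float:
    """H(Y|X) in bits, estimated from observed (x, y) pairs."""
    total = len(pairs)
    joint = Counter(pairs)
    marginal_x = Counter(x for x, _ in pairs)
    h = 0.0
    for (x, _y), count in joint.items():
        p_xy = count / total               # p(x, y)
        p_y_given_x = count / marginal_x[x]  # p(y | x)
        h -= p_xy * math.log2(p_y_given_x)
    return h


# X fully determines Y: every id maps to exactly one name.
determined = [("A", "1"), ("A", "1"), ("B", "2"), ("B", "2")]

# X says nothing about Y: each id is seen with both names equally often.
undetermined = [("A", "1"), ("A", "2"), ("B", "1"), ("B", "2")]

h_determined = conditional_entropy(determined)
h_undetermined = conditional_entropy(undetermined)
```

The first case yields zero bits and the second yields one full bit, the two extremes the ranking step sorts between.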

Pointwise Mutual Information (PMI)

PMI(x, y) = log2(P(x, y) / (P(x) * P(y)))

Positive PMI means x and y co-occur more than chance predicts. Negative PMI means they avoid each other. Zero means independence.

PMI is a standard information-theoretic measure of association strength.
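
The PMI formula can be evaluated directly from co-occurrence counts. A minimal sketch, with a hypothetical function name and made-up counts:

```python
import math


def pmi(n_xy: int, n_x: int, n_y: int, n_total: int) -> float:
    """PMI in bits from counts; n_xy must be positive or the logarithm is undefined."""
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))


# 100 records; x and y each appear in 50 of them.
independent = pmi(25, 50, 50, 100)   # co-occur exactly as often as chance predicts
attracted = pmi(50, 50, 50, 100)     # always appear together
repelled = pmi(5, 50, 50, 100)       # rarely appear together
```

Pairs that never co-occur have no finite PMI, so a real implementation needs a convention for that case (for example, reporting them separately).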

Discovery Procedure

Screen: consider only paths with observation count > 30 (configurable). This filters noise.
Compute: for all pairs among the top-k most frequent paths, calculate conditional entropy and PMI.
Rank: ascending H(Y|X) for dependency strength, descending |PMI| for association strength.
Report: the strongest relationships with examples from the data.

With k = 50, this is 1,225 unordered pairs (2,450 ordered evaluations, since H(Y|X) is directional), which is trivial even on large datasets. Unlike general association rule mining (which explores an exponential itemset space), this approach is bounded by design.

Example: Text Output

vajra invariants claims_batch.ndjson

=== Cross-Field Invariants ===
Records analyzed: 1,247
Field pairs screened: 1,225 (top 50 paths)

--- Functional Dependencies (H(Y|X) < 0.1) ---
  $.claims[*].subscriber.id -> $.claims[*].subscriber.name
    H(name|id) = 0.00
    subscriber.id fully determines subscriber.name
    Example: id "SUB-4421" -> name "Martinez, Elena" (47 records)

  $.claims[*].provider.npi -> $.claims[*].provider.name
    H(name|npi) = 0.03
    provider.npi nearly determines provider.name (3 exceptions in 1,247)
    Example: npi "1234567890" -> name "Valley Medical Group" (312 records)

--- Strong Co-occurrence (PMI > 2.0) ---
  $.claims[*].status = "denied" <-> $.claims[*].denial_reason present
    PMI = 3.8
    When status is "denied", denial_reason is present 97% of the time.
    When status is not "denied", denial_reason is present 2% of the time.

  $.claims[*].service_lines[*].procedure_code <-> $.claims[*].service_lines[*].service_date
    PMI = 3.2
    These fields co-occur in 99.8% of service lines. Effectively always together.

--- Conditional Presence ---
  $.claims[*].service_lines[*].modifier_codes
    Present in 100% of records where procedure_code starts with "9921"
    Present in 12% of records where procedure_code starts with "9939"
    Modifier presence is conditionally dependent on procedure type.

--- Anti-Correlation (PMI < -1.0) ---
  $.claims[*].status = "adjudicated" <-> $.claims[*].hold_reason present
    PMI = -2.1
    These rarely co-occur. Adjudicated claims almost never have hold reasons.

Example: JSON Output

vajra invariants claims_batch.ndjson --format json

{
  "records_analyzed": 1247,
  "pairs_screened": 1225,
  "functional_dependencies": [
    {
      "source": "$.claims[*].subscriber.id",
      "target": "$.claims[*].subscriber.name",
      "conditional_entropy": 0.0,
      "strength": "exact",
      "example": {
        "source_value": "SUB-4421",
        "target_value": "Martinez, Elena",
        "count": 47
      }
    },
    {
      "source": "$.claims[*].provider.npi",
      "target": "$.claims[*].provider.name",
      "conditional_entropy": 0.03,
      "strength": "near_exact",
      "exceptions": 3,
      "example": {
        "source_value": "1234567890",
        "target_value": "Valley Medical Group",
        "count": 312
      }
    }
  ],
  "co_occurrences": [
    {
      "field_a": "$.claims[*].status",
      "value_a": "denied",
      "field_b": "$.claims[*].denial_reason",
      "pmi": 3.8,
      "conditional_presence": 0.97
    }
  ],
  "anti_correlations": [
    {
      "field_a": "$.claims[*].status",
      "value_a": "adjudicated",
      "field_b": "$.claims[*].hold_reason",
      "pmi": -2.1
    }
  ]
}

What Invariants Reveal

Functional dependencies are the strongest signal. When subscriber.id fully determines subscriber.name, that is not an accident — it reflects a real-world constraint. If that constraint breaks (a subscriber ID mapping to two different names), you have a data quality issue.

Co-occurrence patterns reveal implicit business rules. “When status is denied, denial_reason is present” is a rule that lives in the data, not in a schema. Vajra discovers it empirically.

Anti-correlations reveal mutual exclusions. Fields that rarely or never co-occur often represent different branches of a state machine: knowing which branch you are on determines which fields exist.

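Conditional presence reveals fields whose existence depends on the value of another field. This is where schemas in practice fall short: JSON Schema can express “this field exists only when that field equals X” through if/then conditionals, but hand-written schemas rarely do, so the rule has to be recovered from the data.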

When to Use It

Schema documentation. Discover the implicit rules that the data already obeys. Document them before they are lost.
Data quality rules. Turn discovered invariants into validation rules. If subscriber.id always determines subscriber.name, alert when it does not.
Onboarding. New to a dataset? invariants shows you the relationships between fields faster than reading documentation (which may not exist).
Audit evidence. Demonstrate that field dependencies are consistent across a batch.

Pairs Well With

stats — invariants build on per-field statistics (entropy, frequency, null rates)
anomalies — broken invariants (a dependency that holds 99% of the time but not in record 662) are anomalies
essence — discovered relationships appear in the essence as notable observations
drift — if an invariant holds in the baseline but breaks in the candidate, that is a significant drift signal

query

query runs path-based expressions with analysis functions against a document. It lets you ask specific questions — what is the entropy at this path? Which values at this path are anomalous? What is the null rate for this field?

Where other commands analyze everything and present results, query lets you target a specific path and a specific measurement.

Usage

vajra query <input> '<expression>' [flags]

Arguments:

Argument	Description
`<input>`	Path to a JSON or NDJSON file, `-` for stdin, or an HTTP URL
`<expression>`	Query expression (path filter or analysis function)

Flags:

Flag	Description	Default
`--format <fmt>`	Output format: `text`, `json`, `markdown`, `compact-ai`	`text`
`--input-format <fmt>`	Override auto-detected input format	auto
`--redact`	Apply built-in redaction before output	off
`--quiet`	Suppress progress output	off

Expression Language

Vajra defines its own expression language inspired by JSONPath with analysis extensions. This is not JSONata; it is a purpose-built query system for structural analysis.

Path Filtering

Select values at a specific path:

vajra query claim.json '$.claims[*].service_lines[*].charge_amount'

Path: $.claims[*].service_lines[*].charge_amount
Values (14):
  125.00, 285.00, 45.00, 890.00, 310.00, 425.00, 285.00,
  1250.00, 175.00, 520.00, 95.00, 680.00, 340.00, 410.00

Analysis Functions

Apply analysis functions to a path:

vajra query claim.json 'entropy($.claims[*].service_lines[*].status)'

entropy($.claims[*].service_lines[*].status)
  Shannon entropy: 1.22 bits
  Normalized entropy: 0.77
  Cardinality: 3
  Interpretation: enum-like, few distinct states
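
The two entropy figures can be reproduced with a few lines of Python. This sketch assumes normalized entropy is Shannon entropy divided by log2(cardinality), which matches the output above (1.22 / log2(3) is about 0.77); the function name and sample values are hypothetical.

```python
import math
from collections import Counter


def entropy_profile(values: list[str]) -> dict[str, float]:
    """Shannon entropy in bits, normalized entropy, and cardinality of a value list."""
    counts = Counter(values)
    total = len(values)
    shannon = sum(-(c / total) * math.log2(c / total) for c in counts.values())
    # A single distinct value has zero entropy; avoid dividing by log2(1) = 0.
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return {
        "shannon": shannon,
        "normalized": shannon / max_entropy,
        "cardinality": len(counts),
    }


# Hypothetical status values for 14 service lines.
statuses = ["adjudicated"] * 8 + ["pending"] * 4 + ["denied"] * 2
profile = entropy_profile(statuses)

# Four equally likely values carry exactly 2 bits.
uniform = entropy_profile(["a", "b", "c", "d"])
```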

Available Functions

Function	Returns	Description
`entropy(path)`	Shannon entropy and normalized entropy	Information content at this path
`rarity(path, value)`	Self-information in bits	How rare a specific value is at this path
`instability(path)`	Type instability ratio	Fraction of values deviating from dominant type
`null_rate(path)`	Null and absent rates	Missingness profile at this path
`stats(path)`	Full statistical summary	Entropy, frequency, numeric distribution
`anomaly_score(path)`	Composite anomaly score	Maximum anomaly strength across dimensions
`motif(path)`	Dominant motif description	Repeated structural pattern at an array path

Conditional Expressions

Filter by analysis thresholds:

vajra query claim.json 'entropy($.claims[*].service_lines[*].status) > 0.5'

entropy($.claims[*].service_lines[*].status) = 1.22
  Condition: > 0.5
  Result: TRUE

vajra query claims_batch.ndjson 'anomaly_score($.claims[*].service_lines[*].charge_amount) > 3.5'

anomaly_score($.claims[*].service_lines[*].charge_amount)
  Max z_MAD across values: 6.3 (at value 47,250.00)
  Condition: > 3.5
  Result: TRUE
  Flagged values:
    47,250.00 (z_MAD = 6.3)
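
A sketch of a MAD-based score in Python. It assumes the common modified z-score, 0.6745 * (x - median) / MAD, paired with the 3.5 threshold used above; whether Vajra applies the 0.6745 scaling is not stated in this document, and the sample values are made up.

```python
from statistics import median


def mad_z_scores(values: list[float]) -> list[float]:
    """Modified z-scores based on the median absolute deviation (MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values equal the median; the score is undefined,
        # so this sketch reports no deviation rather than dividing by zero.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


# Hypothetical charge amounts with one obvious outlier.
charges = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0]
scores = mad_z_scores(charges)
flagged = [v for v, z in zip(charges, scores) if abs(z) > 3.5]
```

Unlike a mean and standard deviation, the median and MAD are barely moved by the outlier itself, which is why the extreme value stands out so clearly.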

Example: Text Output

vajra query claim.json 'stats($.claims[*].service_lines[*].charge_amount)'

stats($.claims[*].service_lines[*].charge_amount)
  Count:       14
  Cardinality: 13
  Entropy:     3.66 bits (normalized: 0.99)
  Type:        number (100%)
  Min:         45.00
  Max:         1250.00
  Mean:        416.79
  Median:      325.00
  MAD:         172.50
  p25:         125.00
  p75:         425.00
  p95:         890.00
  p99:         1125.00

Example: JSON Output

vajra query claim.json 'entropy($.claims[*].status)' --format json

{
  "function": "entropy",
  "path": "$.claims[*].status",
  "result": {
    "shannon_entropy": 1.22,
    "normalized_entropy": 0.77,
    "cardinality": 3,
    "support": ["adjudicated", "pending", "denied"]
  }
}

Example: Rarity Check

vajra query claims_batch.ndjson 'rarity($.claims[*].status, "voided")'

rarity($.claims[*].status, "voided")
  Self-information: 10.3 bits
  Frequency: 1 of 1,247
  Interpretation: extremely rare (> 10 bits)
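
Self-information is the negative base-2 logarithm of a value's observed frequency. A minimal check of the figure above in Python (the function name is illustrative):

```python
import math


def self_information(count: int, total: int) -> float:
    """Bits of surprise for a value seen `count` times in `total` observations."""
    return -math.log2(count / total)


bits = self_information(1, 1247)   # the "voided" status: seen once in 1,247 records
```

A value seen in half of all records carries 1 bit; each halving of the frequency adds one more.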

Example: Null Rate Investigation

vajra query claim.json 'null_rate($.claims[*].service_lines[*].allowed_amount)'

null_rate($.claims[*].service_lines[*].allowed_amount)
  Null rate:   0.000 (0 of 14 are JSON null)
  Absent rate: 0.214 (3 of 14 parent records lack this field)
  Empty rate:  0.000
  Total missingness: 0.214
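
The three kinds of missingness can be told apart with a small Python sketch. It assumes parent records are plain dictionaries and that "empty" means an empty string, list, or object; both the function name and that definition of empty are assumptions.

```python
def missingness(records: list[dict], field: str) -> dict[str, float]:
    """Split missing values into JSON null, absent key, and empty value."""
    total = len(records)
    null = sum(1 for r in records if field in r and r[field] is None)
    absent = sum(1 for r in records if field not in r)
    empty = sum(1 for r in records if field in r and r[field] in ("", [], {}))
    return {
        "null_rate": null / total,
        "absent_rate": absent / total,
        "empty_rate": empty / total,
        "total_missingness": (null + absent + empty) / total,
    }


# 14 hypothetical service lines, 3 of which lack the field entirely.
lines = [{"allowed_amount": 100.0} for _ in range(11)] + [{} for _ in range(3)]
report = missingness(lines, "allowed_amount")
```

Keeping null and absent separate matters because they usually have different causes: a null is written deliberately, while an absent key often means the producer never populated the field.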

When to Use It

Targeted investigation. You saw an anomaly in the essence. Now drill into the specific path.
Threshold checks in CI. vajra query data.json 'instability($.status) > 0.01' — fail the build if type instability exceeds tolerance.
Statistical spot-checks. What is the entropy of this field? What is the null rate? How rare is this value?
Script integration. The --format json output is machine-readable. Parse it in your pipeline.

Pairs Well With

stats — query targets a single path; stats gives you everything
anomalies — query lets you drill into a specific anomaly with anomaly_score(path)
essence — the AI profile’s drill section tells downstream models which paths to query
inspect — inspect reveals the paths; query interrogates them

batch

batch runs parallel analysis across all supported files in a directory. It produces aggregated statistics, per-file summaries, and batch-level observations, processing hundreds or thousands of files in seconds via Rayon-based parallelism.

Where single-document commands analyze one file, batch analyzes the population.

Usage

vajra batch <directory> [flags]

Arguments:

Argument	Description
`<directory>`	Path to a directory containing supported input files

Flags:

Flag	Description	Default
`--format <fmt>`	Output format: `text`, `json`, `markdown`, `compact-ai`	`text`
`--profile <name>`	Concern profile for essence generation	`engineer`
`--input-format <fmt>`	Override auto-detected input format	auto
`--streaming`	Force streaming mode for each file	off
`--redact`	Apply built-in redaction before output	off
`--quiet`	Suppress progress output	off

What It Does

Discovers files. Scans the directory for all supported files (.json, .yaml, .csv, .ndjson, etc.).
Parallel analysis. Each file is analyzed independently using Rayon’s work-stealing thread pool. On an 8-core machine, up to 8 files are analyzed simultaneously.
Per-file statistics. For each file: node count, path count, depth, fingerprint, anomaly count.
Aggregated statistics. Across the entire batch: merged frequency distributions, merged DDSketch quantiles, population-level entropy, cross-file type stability.
Batch-level observations. Structural families (via clustering), population anomalies, files that deviate from the batch norm.

Example: Text Output

vajra batch ./claims/

=== Batch Analysis ===
Directory: ./claims/
Files processed: 247
Total nodes: 208,729
Processing time: 1.4s (148,378 nodes/s)

=== Per-File Summary ===
  FILE                  NODES  PATHS  DEPTH  ANOMALIES  FINGERPRINT
  claim_001.json          847     23      6          0  a1b2c3d4...
  claim_002.json          891     23      6          0  a1b2c3d4...
  claim_003.json          723     23      6          1  a1b2c3d4...
  claim_048.json         1102     27      7          0  e5f6a7b8...
  claim_199.json          412     18      5          3  c9d0e1f2...
  ... (242 more files)

=== Structural Families ===
  Family 1: 198 files (80.2%) — 23 paths, signature a1b2c3d4...
  Family 2:  41 files (16.6%) — 27 paths, signature e5f6a7b8...
  Family 3:   8 files ( 3.2%) — 18 paths, signature c9d0e1f2...

=== Aggregated Statistics ===
  $.claims[*].service_lines[*].charge_amount
    Population median: 285.00
    Population MAD: 195.00
    Population p95: 1,420.00
    Cross-file consistency: high (coefficient of variation = 0.12)

  $.claims[*].service_lines[*].status
    Population entropy: 1.45 bits
    Dominant value: "adjudicated" (72.3%)
    Cardinality: 5 values across batch

=== Batch-Level Anomalies ===
  claim_199.json: structural outlier (Jaccard distance 0.31 from dominant family)
  claim_201.json: structural outlier (Jaccard distance 0.28 from dominant family)
  claim_834.json: contains numeric outlier (charge_amount = 47,250.00, z_MAD = 6.3)

Example: JSON Output

vajra batch ./claims/ --format json

{
  "directory": "./claims/",
  "files_processed": 247,
  "total_nodes": 208729,
  "processing_time_ms": 1400,
  "per_file": [
    {
      "file": "claim_001.json",
      "nodes": 847,
      "paths": 23,
      "depth": 6,
      "anomaly_count": 0,
      "fingerprint": "a1b2c3d4..."
    }
  ],
  "structural_families": [
    {
      "id": 0,
      "count": 198,
      "percentage": 80.2,
      "distinct_paths": 23,
      "signature": "a1b2c3d4..."
    },
    {
      "id": 1,
      "count": 41,
      "percentage": 16.6,
      "distinct_paths": 27,
      "signature": "e5f6a7b8..."
    },
    {
      "id": 2,
      "count": 8,
      "percentage": 3.2,
      "distinct_paths": 18,
      "signature": "c9d0e1f2..."
    }
  ],
  "aggregated_stats": {
    "$.claims[*].service_lines[*].charge_amount": {
      "population_median": 285.0,
      "population_mad": 195.0,
      "population_p95": 1420.0
    }
  },
  "batch_anomalies": [
    {
      "file": "claim_199.json",
      "type": "structural_outlier",
      "jaccard_distance": 0.31
    },
    {
      "file": "claim_834.json",
      "type": "numeric_outlier",
      "path": "$.claims[*].service_lines[*].charge_amount",
      "value": 47250.0,
      "z_mad": 6.3
    }
  ]
}

Parallelism and Performance

Batch uses Rayon’s work-stealing thread pool. The number of threads defaults to the number of CPU cores.

Performance targets:

Batch Size	Target
100 files, ~1 MB each	< 5 seconds
1,000 files, ~1 MB each	< 30 seconds
10,000 files, ~100 KB each	< 30 seconds

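DDSketch instances are computed per-file and merged globally. Merging adds no error of its own, so the merged sketch is identical to one built over the whole batch in a single pass. Quantiles remain estimates within DDSketch's relative-error bound, but parallel processing does not make them any less accurate.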
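
The merge property can be illustrated with a simplified DDSketch-style structure in Python. This is a sketch for positive values only, with a hypothetical class name and an assumed relative accuracy of 1%; it is not Vajra's implementation.

```python
import math
from collections import Counter


class LogSketch:
    """Logarithmic-bucket quantile sketch with relative accuracy alpha."""

    def __init__(self, alpha: float = 0.01):
        self.gamma = (1 + alpha) / (1 - alpha)
        self.buckets = Counter()
        self.count = 0

    def add(self, value: float) -> None:
        # Bucket i covers the interval (gamma^(i-1), gamma^i].
        index = math.ceil(math.log(value, self.gamma))
        self.buckets[index] += 1
        self.count += 1

    def merge(self, other: "LogSketch") -> None:
        # Merging is plain addition of bucket counts, so nothing is lost.
        self.buckets.update(other.buckets)
        self.count += other.count

    def quantile(self, q: float) -> float:
        rank = q * (self.count - 1)
        seen = 0
        for index in sorted(self.buckets):
            seen += self.buckets[index]
            if seen > rank:
                # Midpoint estimate that bounds the relative error by alpha.
                return 2 * self.gamma ** index / (self.gamma + 1)
        raise ValueError("empty sketch")


# Two hypothetical files, sketched independently and then merged.
file_a, file_b = [45.0, 125.0, 285.0], [310.0, 890.0, 1250.0]
sketch_a, sketch_b, single_pass = LogSketch(), LogSketch(), LogSketch()
for v in file_a:
    sketch_a.add(v)
for v in file_b:
    sketch_b.add(v)
for v in file_a + file_b:
    single_pass.add(v)
sketch_a.merge(sketch_b)
```

After the merge, `sketch_a` holds exactly the same buckets as `single_pass`, so every quantile query returns the same answer regardless of how the files were split across threads.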

When to Use It

Daily batch monitoring. Run batch on each day’s incoming data. Track structural families, anomaly counts, and distribution shifts over time.
Pre-processing audit. Before feeding a batch to a downstream system, run batch to verify structural consistency and flag outliers.
Population baselines. Establish population-level statistics (median charge amount, expected null rates, typical structural signature) that individual-file analysis can compare against.
Quick directory survey. “What is in this folder?” — batch answers in seconds.

Pairs Well With

cluster — batch includes lightweight clustering; cluster provides detailed similarity analysis
anomalies — batch flags files with anomalies; drill into specific files for details
drift — compare today’s batch aggregates to yesterday’s for population-level drift
essence — run essence on specific files that batch identified as notable

cascade

cascade detects temporal cause-effect chains in event data. Given a stream of timestamped events grouped by entity, it identifies sequences where one event type triggers another — and measures how reliably that pattern holds.

Where anomalies finds single-record outliers, cascade finds multi-record temporal patterns: event A happens to entity X, then event B follows within a window.

Usage

vajra cascade <input> [flags]

Arguments:

Argument	Description
`<input>`	Path to a JSON/NDJSON file, `-` for stdin, or an HTTP URL

Flags:

Flag	Description	Default
`--entity-field <path>`	JSONPath to the entity identifier (e.g., `'$.author'`)	required
`--time-field <path>`	JSONPath to the timestamp field (e.g., `'$.date'`)	required
`--event-field <path>`	JSONPath to the event type field (e.g., `'$.type'`)	required
`--response-values <vals>`	Comma-separated list of event values that count as responses (e.g., `fix,revert`)	required
`--format <fmt>`	Output format: `text`, `json`, `markdown`, `compact-ai`	`text`
`--input-format <fmt>`	Override auto-detected input format	auto
`--quiet`	Suppress progress output	off

What It Reports

Cascade Rate

The fraction of trigger events that are followed by a response event from the same entity within the detection window. A high cascade rate means the cause-effect pattern is reliable.

Self-Fix Rate

The fraction of cascades where the same entity that caused the trigger also produced the response. Measures whether entities clean up their own problems.

Hot Entities

Entities that appear disproportionately in cascade chains. These are the nexus points — the authors, services, or components that most frequently participate in cause-and-effect sequences.

Cascade Chains

The full chain detail: trigger event, response event, entity, timestamps, and time delta between cause and effect.

Algorithm

O(n log n). Records are grouped by entity using a BTreeMap (ordered map), sorted by timestamp within each group, then scanned linearly to detect trigger-response pairs. The BTreeMap ensures deterministic iteration order regardless of input ordering.
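
The group, sort, and scan steps can be sketched in Python. Several details are assumptions because this document does not specify them: every non-response event is treated as a trigger, the detection window is 7 days, a response pairs with the most recent unanswered trigger, and the record keys `entity`, `time`, and `event` stand in for the fields selected by the flags.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_cascades(records, response_values, window=timedelta(days=7)):
    """Group by entity, sort by time, then scan each group once for trigger -> response pairs."""
    by_entity = defaultdict(list)
    for rec in records:
        when = datetime.fromisoformat(rec["time"])
        by_entity[rec["entity"]].append((when, rec["event"]))

    chains = []
    triggers = 0
    for entity in sorted(by_entity):          # sorted keys mimic BTreeMap iteration order
        pending = None                        # most recent unanswered trigger
        for when, event in sorted(by_entity[entity]):
            if event in response_values:
                if pending is not None and when - pending[0] <= window:
                    chains.append((entity, pending[1], event, when - pending[0]))
                pending = None
            else:
                triggers += 1
                pending = (when, event)
    return chains, triggers


# Hypothetical commit events, deliberately out of order.
records = [
    {"entity": "bob", "time": "2024-03-05T10:00:00", "event": "bug"},
    {"entity": "alice", "time": "2024-03-01T09:00:00", "event": "bug"},
    {"entity": "alice", "time": "2024-03-03T09:00:00", "event": "fix"},
    {"entity": "bob", "time": "2024-04-20T10:00:00", "event": "revert"},
]
chains, triggers = find_cascades(records, {"fix", "revert"})
cascade_rate = len(chains) / triggers
```

The sort dominates the cost, which gives the O(n log n) bound; iterating entities in sorted order makes the output independent of the order in which records arrive.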

Example: Commit Cascade Analysis

vajra cascade commits.ndjson \
  --entity-field '$.author' \
  --time-field '$.date' \
  --event-field '$.type' \
  --response-values 'fix,revert'

=== Cascade Report ===
Records: 1,247
Entities: 34
Trigger events: 312
Response events: 89

Cascade rate:  0.285 (89 of 312 triggers followed by a response)
Self-fix rate: 0.742 (66 of 89 responses by the same entity)

Hot entities:
  alice       23 cascades (25.8%)
  bob         14 cascades (15.7%)
  charlie      9 cascades (10.1%)

Cascade chains (top 5 by frequency):
  bug -> fix        62 occurrences, median delta: 2.3 days
  bug -> revert     18 occurrences, median delta: 0.4 days
  regression -> fix  9 occurrences, median delta: 4.1 days

Example: JSON Output

vajra cascade commits.ndjson \
  --entity-field '$.author' \
  --time-field '$.date' \
  --event-field '$.type' \
  --response-values 'fix,revert' \
  --format json

{
  "records": 1247,
  "entities": 34,
  "trigger_events": 312,
  "response_events": 89,
  "cascade_rate": 0.285,
  "self_fix_rate": 0.742,
  "hot_entities": [
    {"entity": "alice", "cascades": 23, "fraction": 0.258},
    {"entity": "bob", "cascades": 14, "fraction": 0.157},
    {"entity": "charlie", "cascades": 9, "fraction": 0.101}
  ],
  "chains": [
    {"trigger": "bug", "response": "fix", "count": 62, "median_delta_days": 2.3},
    {"trigger": "bug", "response": "revert", "count": 18, "median_delta_days": 0.4},
    {"trigger": "regression", "response": "fix", "count": 9, "median_delta_days": 4.1}
  ]
}

When to Use It

Incident response analysis. Which errors lead to fixes, and how quickly? Which lead to reverts?
Developer workflow. Who introduces bugs and who fixes them? Is there a self-fix pattern?
Service dependency. Event A in service X triggers event B in service Y — cascade reveals the coupling.
Repository health. Measure how reliably bugs get resolved and how long the resolution takes.

Pairs Well With

stats — statistical profile of the event fields before cascade analysis
anomalies — unusual cascade chains (an entity that never self-fixes) are anomaly candidates
invariants — cascade patterns are temporal invariants; invariants discovers structural ones
essence — cascade metrics feed into essence generation for project health assessments

Profiles

Profiles are the lens. They do not change what Vajra analyzes — they change how results are scored, ranked, and rendered.

The same document analyzed with --profile staff and --profile engineer produces the same underlying statistics. The difference is which observations surface, what language describes them, and what gets collapsed as noise.

The Scoring Model

Every observation in the analysis pipeline receives a composite importance score:

score = sum(weight_i * signal_i)

Six signal dimensions, each normalized to [0, 1]:

Dimension	What It Measures
`rarity`	Self-information of the observation. Rare things score high.
`instability`	Type instability at the path. Mixed types score high.
`entropy_signal`	Distance from 0.5 normalized entropy. Constants and noise both score high. Meaningful variation scores low.
`structural_coverage`	Fraction of total nodes under this path. Wide-reaching paths score high.
`anomaly_strength`	Maximum anomaly score across all four dimensions.
`concern_relevance`	Profile-specific boost for certain paths or observation types.

The profile defines the weights. The weights determine what rises to the top.
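
A small Python illustration of how the same observations score under different weight sets. The weights are copied from the staff and fraud tables in this document; the signal values for the two observations are invented for the example.

```python
STAFF = {
    "rarity": 0.10, "instability": 0.05, "entropy_signal": 0.10,
    "structural_coverage": 0.25, "anomaly_strength": 0.30, "concern_relevance": 0.20,
}
FRAUD = {
    "rarity": 0.25, "instability": 0.10, "entropy_signal": 0.10,
    "structural_coverage": 0.05, "anomaly_strength": 0.35, "concern_relevance": 0.15,
}


def importance(signals: dict[str, float], weights: dict[str, float]) -> float:
    """score = sum(weight_i * signal_i), with each signal already in [0, 1]."""
    return sum(weights[name] * signals[name] for name in weights)


# Hypothetical observations: a rare, anomalous value and a wide, unremarkable structure.
rare_value = {
    "rarity": 0.9, "instability": 0.0, "entropy_signal": 0.2,
    "structural_coverage": 0.05, "anomaly_strength": 0.8, "concern_relevance": 0.5,
}
wide_structure = {
    "rarity": 0.1, "instability": 0.0, "entropy_signal": 0.2,
    "structural_coverage": 0.9, "anomaly_strength": 0.1, "concern_relevance": 0.5,
}

staff_gap = importance(rare_value, STAFF) - importance(wide_structure, STAFF)
fraud_gap = importance(rare_value, FRAUD) - importance(wide_structure, FRAUD)
```

Both profiles rank the rare value first here, but the fraud weights separate the two observations far more sharply, which is the effect the profile is designed to have.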

Built-in Profiles

staff

For: Non-technical operations staff who need “what is this and what stands out.”

Dimension	Weight
rarity	0.10
instability	0.05
entropy_signal	0.10
structural_coverage	0.25
anomaly_strength	0.30
concern_relevance	0.20

Rendering: Plain language. No JSONPath. No technical jargon. Anomalies described in terms of business impact. Structural boilerplate hidden.

Section headers: “Document Summary,” “What Stands Out,” “What This Likely Means.”

vajra essence claim.json --profile staff

Document Summary:
  1 claim with 14 service lines, 1 patient, 2 diagnosis codes.

What Stands Out:
  - 3 service lines are missing allowed amounts.
  - Adjustment reason "CO-45" repeats across 8 of 14 lines.

What This Likely Means:
  - A subset of service lines appears incomplete.
  - The repeated adjustment code suggests a systematic issue.

engineer

For: Engineers who need schema details, structural analysis, and regression signals.

Dimension	Weight
rarity	0.15
instability	0.25
entropy_signal	0.15
structural_coverage	0.15
anomaly_strength	0.15
concern_relevance	0.15

Rendering: Technical. JSONPath paths, type annotations, cardinalities. Diff-style output for drift. Fingerprints displayed.

vajra essence claim.json --profile engineer

Structure: 847 nodes, 23 distinct paths, max depth 6
Fingerprint (path set): a1b2c3d4...

Notable paths:
  $.claims[*].service_lines[*].allowed_amount
    null_rate: 0.214, entropy: 3.12, type: number (100%)

  $.claims[*].service_lines[*].adjustment.reason
    entropy: 1.56, cardinality: 4, dominant: "CO-45" (57.1%)

auditor

For: Auditors and compliance teams who need completeness, traceability, and consistency evidence.

Dimension	Weight
rarity	0.10
instability	0.20
entropy_signal	0.10
structural_coverage	0.10
anomaly_strength	0.20
concern_relevance	0.30

Rendering: Formal vocabulary. Missing fields listed with full paths. Type inconsistencies documented with examples. Drift metrics with severity scores.

Concern relevance boosts: completeness, traceability, required-field absence.

vajra essence claim.json --profile auditor --format markdown

## Audit Essence

### Completeness Assessment
- **21.4%** of service lines are missing `allowed_amount`
  (3 of 14 service line records; field path: `$.claims[*].service_lines[*].allowed_amount`)
- Provider `taxonomy` field: absent
  (expected presence rate in comparable data: 94%)

### Type Consistency
- All paths exhibit 100% type stability. No type inconsistencies detected.

### Pattern Observations
- Adjustment reason code `CO-45` appears in 57.1% of service lines (8 of 14).
  This concentration exceeds typical variance for this field.

ai

For: Downstream LLM consumption. Maximum information density per token.

Dimension	Weight
rarity	0.15
instability	0.10
entropy_signal	0.20
structural_coverage	0.20
anomaly_strength	0.20
concern_relevance	0.15

Rendering: Compact, machine-readable. Motifs collapsed aggressively. Repeated structures represented once with count. Explicit caveats on inferences.

vajra essence claim.json --profile ai --format compact-ai --budget 300

{"v":"vajra/1","n":847,"p":23,"d":6,"motif":{"p":"$.claims[0].service_lines[*]","c":14,"f":["procedure_code","charge_amount","allowed_amount","status","adjustment"]},"a":[{"p":"$.claims[0].service_lines[2,7,11].allowed_amount","t":"miss","s":4.2}],"drill":[{"p":"$.claims[*].service_lines","avail":["stats","anomalies"]}]}

fraud

For: Fraud and risk analysts who need suspicious patterns, outliers, and unusual combinations.

Dimension	Weight
rarity	0.25
instability	0.10
entropy_signal	0.10
structural_coverage	0.05
anomaly_strength	0.35
concern_relevance	0.15

Rendering: Investigative framing. Outliers with full context. Benford’s Law departures. Suspicious value repetition. Unusual co-occurrence patterns.

Concern relevance boosts: numeric anomalies, identifier patterns, timing irregularities.

vajra essence claims_batch.ndjson --profile fraud

=== Fraud Screening Essence ===

Flagged Patterns:
  - charge_amount outlier: $47,250.00 in record 834
    (z_MAD = 6.3, population median = $285.00)
    This value is 165x the median. Review recommended.

  - Status value "voided" in record 419
    (seen once in 1,247 records, self-information = 10.3 bits)
    Extremely rare status. May warrant investigation.

  - Benford's Law departure for charge_amount leading digits
    Chi-squared: 14.2 (p = 0.028)
    Observed leading digit "1": 18% (expected: 30%)
    Observed leading digit "5": 22% (expected: 8%)
    Suggestive of non-natural distribution.

  - Identical charge_amount ($285.00) in 47 records from same provider
    Exact-value concentration: 3.8% of population
    Pattern is unusual for this field's typical variance.
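
The expected percentages in the Benford check follow from log10(1 + 1/d) for each leading digit d. A Python sketch of the expected distribution and a chi-squared statistic against it; the function names are illustrative, and how Vajra bins or tests the digits is not documented here.

```python
import math
from collections import Counter


def benford_expected() -> dict[int, float]:
    """Expected frequency of each leading digit 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value: float) -> int:
    """First significant digit of a non-zero number."""
    # Scientific notation always puts the first significant digit first.
    return int(f"{abs(value):e}"[0])


def benford_chi_squared(values: list[float]) -> float:
    """Chi-squared statistic of observed leading digits against Benford's Law."""
    observed = Counter(leading_digit(v) for v in values if v != 0)
    n = sum(observed.values())
    return sum(
        (observed.get(d, 0) - n * p) ** 2 / (n * p)
        for d, p in benford_expected().items()
    )


expected = benford_expected()
```

With nine digit categories the statistic is usually compared against a chi-squared distribution with 8 degrees of freedom.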

health

For: Project and repository health assessment. Identifies risks, governance patterns, and sustainability signals.

Dimension	Weight
entropy_signal	0.25
concern_relevance	0.25
anomaly_strength	0.20
rarity	0.15
instability	0.10
structural_coverage	0.05

Rendering: Assessment-oriented. Sections organized around risk, governance, and sustainability. Designed for repository and project analysis.

Section headers: “Key Risks,” “Governance Signals,” “Sustainability Assessment.”

vajra essence ./my-repo --profile health

Key Risks:
  - Bus factor: 2 contributors account for 78% of commits.
  - Fix rate declining: 31% of bugs fixed in January vs 18% in March.
  - Mean time to fix increasing: 2.3 days -> 4.1 days over 3 months.

Governance Signals:
  - Review coverage: 64% of PRs received at least one review.
  - Bot contribution: 33% of PRs from automated tools.
  - Consistent commit cadence: 4.2 commits/day (low variance).

Sustainability Assessment:
  - Moderate risk. High contributor concentration and declining fix rates
    suggest capacity constraints. Review coverage is below recommended
    thresholds for projects of this activity level.

Custom Profiles

Define custom profiles in TOML. Load with --config path/to/profiles.toml.

Full TOML Example

[profile.claims_review]
name = "claims-review"
description = "Internal review for claims processing teams"

[profile.claims_review.weights]
rarity = 0.15
instability = 0.20
entropy_signal = 0.10
structural_coverage = 0.10
anomaly_strength = 0.25
concern_relevance = 0.20

[profile.claims_review.rendering]
vocabulary = "plain"           # plain | technical | formal
show_paths = false             # hide JSONPath in output
show_scores = false            # hide numeric scores
motif_collapse_threshold = 3   # collapse motifs repeated > N times
anomaly_threshold = 3.5        # MAD z-score threshold for flagging

[profile.claims_review.concern_boosts]
paths_containing = ["denied", "adjustment", "override", "void"]
observation_types = ["missingness", "type_instability"]
boost_factor = 1.5

Loading Custom Profiles

vajra essence claim.json --profile claims-review --config ./profiles.toml

Multiple Custom Profiles in One File

[profile.claims_review]
name = "claims-review"
description = "Internal claims processing review"
# ... weights, rendering, boosts ...

[profile.vendor_audit]
name = "vendor-audit"
description = "Vendor data feed quality assessment"
# ... weights, rendering, boosts ...

[profile.ml_preprocessing]
name = "ml-preprocessing"
description = "Data quality check before ML pipeline ingestion"
# ... weights, rendering, boosts ...

Listing Available Profiles

vajra profiles

=== Built-in Profiles ===
  staff        Plain vocabulary, narrative rendering; emphasizes anomalies and structural coverage
  engineer     Technical vocabulary, list-based rendering; balanced scoring
  auditor      Formal vocabulary, completeness-focused; emphasizes instability and concern relevance
  ai           Compact terse rendering optimized for machine consumption
  fraud        Investigative framing; emphasizes outliers, rarity, and suspicious patterns
  health       Assessment-oriented; emphasizes risks, governance, and sustainability

=== Custom Profiles ===
  claims-review   Internal claims processing review

vajra profiles --config ./profiles.toml --format json

[
  {"name": "staff", "description": "...", "source": "built-in"},
  {"name": "engineer", "description": "...", "source": "built-in"},
  {"name": "auditor", "description": "...", "source": "built-in"},
  {"name": "ai", "description": "...", "source": "built-in"},
  {"name": "fraud", "description": "...", "source": "built-in"},
  {"name": "health", "description": "...", "source": "built-in"},
  {"name": "claims-review", "description": "Internal claims processing review", "source": "custom"}
]

Rendering Vocabulary

Level	Description	Example
`plain`	No jargon, no paths, business-oriented language	“3 service lines are missing allowed amounts”
`technical`	JSONPath, type annotations, statistical measures	“$.claims[*].service_lines[2,7,11].allowed_amount: null_rate=0.21, anomaly_score=4.2”
`formal`	Full sentences, compliance-appropriate language	“Observations 2, 7, and 11 in the service line array exhibit absent allowed_amount fields.”

Deterministic Tie-Breaking

When two observations have identical composite scores, ties are broken by:

Path depth — shallower paths first (broader impact)
Lexicographic path order — alphabetical by wildcard path

This ensures identical scores always resolve in the same order, regardless of platform or run.
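
The tie-breaking rules above amount to a three-key comparator. The sketch below is illustrative only: the `Observation` struct, `path_depth`, and `rank` are hypothetical names, and depth is approximated as the number of dot-separated segments after the root, which is an assumption about how Vajra counts it.

```rust
use std::cmp::Ordering;

/// Hypothetical observation record: a composite score plus its wildcard path.
#[derive(Debug, Clone, PartialEq)]
struct Observation {
    score: f64,
    path: String,
}

/// Depth approximated as the number of '.'-separated segments after the root `$`.
fn path_depth(path: &str) -> usize {
    path.split('.').skip(1).count()
}

/// Higher score first; ties go to the shallower path, then to lexicographic path order.
fn rank_order(a: &Observation, b: &Observation) -> Ordering {
    b.score
        .total_cmp(&a.score)
        .then_with(|| path_depth(&a.path).cmp(&path_depth(&b.path)))
        .then_with(|| a.path.cmp(&b.path))
}

/// Sort observations into their final, fully determined order.
fn rank(mut observations: Vec<Observation>) -> Vec<Observation> {
    observations.sort_by(rank_order);
    observations
}
```

Using `f64::total_cmp` rather than `partial_cmp` gives a total order over scores, so the sort cannot behave differently when a NaN slips in.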

Input Formats

Vajra reads more than JSON. It reads anything that can be interpreted as structured data — and it auto-detects the format so you do not have to tell it.

Supported Formats

Format	Extensions	Detection	Notes
JSON	`.json`	Content starts with `{` or `[`	Primary format. Full DOM and streaming support.
NDJSON	`.ndjson`, `.jsonl`	Multiple JSON objects separated by newlines	Each line is a separate document. Batch analysis native.
YAML	`.yaml`, `.yml`	Content starts with `---` or key-colon pattern	Multi-document YAML supported (separated by `---`).
CSV	`.csv`	Comma-separated with consistent column count	First row treated as headers. Each row becomes a JSON object.
TSV	`.tsv`	Tab-separated with consistent column count	Same as CSV but tab-delimited.
Markdown	`.md`	Markdown structure with tables or code blocks	Tables extracted as arrays of objects. Code blocks parsed if JSON/YAML.
PDF	`.pdf`	PDF magic bytes	Text extracted and parsed for structured content.
Gzip	`.gz`, `.json.gz`	Gzip magic bytes (`1f 8b`)	Decompressed transparently. Inner format auto-detected.
Zstd	`.zst`, `.json.zst`	Zstd magic bytes	Decompressed transparently. Inner format auto-detected.
HTTP URL	`http://`, `https://`	URL scheme prefix	Fetched via blocking HTTP GET. Response body auto-detected.
Source Code	`.rs`, `.py`, `.js`, `.ts`, `.go`, `.java`, `.c`, `.cpp`, `.rb`	File extension matches known language	Parsed via tree-sitter into AST. Requires `vajra-source` feature.
Git Repository	(directory)	Directory contains `.git/`	Reads commit history directly. See flags below.
V8 CPU Profile	`.cpuprofile`	File extension	Parses V8 `.cpuprofile` JSON into analyzable structure.
strace Summary	—	Content contains `% time` header	Parses `strace -c` summary output into structured records.
Stdin	`-`	Explicit `-` argument	Content auto-detected from first bytes.

Auto-Detection Logic

When no --input-format is specified, Vajra detects the format in this order:

Check the argument. If it is -, read from stdin. If it starts with http:// or https://, fetch via HTTP. If it is a directory containing .git/, treat as a git repository.
Check the extension. .json -> JSON. .ndjson/.jsonl -> NDJSON. .yaml/.yml -> YAML. .csv -> CSV. .tsv -> TSV. .md -> Markdown. .pdf -> PDF. .cpuprofile -> V8 CPU Profile. .rs/.py/.js/.go/etc. -> Source Code (via tree-sitter).
Check for compression. If the extension is .gz or .zst, decompress and re-detect the inner format from the next extension (e.g., .json.gz -> decompress -> JSON).
Check content. If the extension is ambiguous or missing, read the first bytes:
- Starts with { or [ after whitespace -> JSON
- Multiple {...}\n sequences -> NDJSON
- Starts with --- or matches key: value pattern -> YAML
- Consistent comma-separated columns -> CSV
- PDF magic bytes (%PDF) -> PDF
- Contains % time column header -> strace summary
Fall back to JSON. If nothing else matches, attempt JSON parsing.
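
The extension and compression steps above can be sketched as a small function over the file name. This is an illustration under stated assumptions, not Vajra's detection code: the `Format` and `Compression` enums and `detect_from_name` are hypothetical, only a subset of formats is mapped, and case-insensitive matching is an assumption.

```rust
/// A subset of the formats from the table above.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Format { Json, Ndjson, Yaml, Csv, Tsv, Markdown, Pdf }

#[derive(Debug, PartialEq, Clone, Copy)]
enum Compression { Plain, Gzip, Zstd }

/// Map a bare extension (without the dot) to a format.
fn format_from_extension(ext: &str) -> Option<Format> {
    match ext {
        "json" => Some(Format::Json),
        "ndjson" | "jsonl" => Some(Format::Ndjson),
        "yaml" | "yml" => Some(Format::Yaml),
        "csv" => Some(Format::Csv),
        "tsv" => Some(Format::Tsv),
        "md" => Some(Format::Markdown),
        "pdf" => Some(Format::Pdf),
        _ => None,
    }
}

/// Peel off a compression suffix, then map the remaining extension.
/// A `None` format means "fall through to content sniffing".
fn detect_from_name(name: &str) -> (Compression, Option<Format>) {
    // Assumption: extensions are matched case-insensitively.
    let lower = name.to_ascii_lowercase();
    let (compression, inner) = if let Some(stem) = lower.strip_suffix(".gz") {
        (Compression::Gzip, stem)
    } else if let Some(stem) = lower.strip_suffix(".zst") {
        (Compression::Zstd, stem)
    } else {
        (Compression::Plain, lower.as_str())
    };
    let format = inner
        .rsplit_once('.')
        .and_then(|(_, ext)| format_from_extension(ext));
    (compression, format)
}
```

For example, `detect_from_name("claims.json.gz")` yields gzip plus JSON, while a name such as `records.log` yields no format and would proceed to the content checks.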

Format Override

Force a specific format with --input-format:

vajra inspect data.txt --input-format json
vajra stats records.log --input-format ndjson
vajra inspect data.bin --input-format yaml

This overrides all auto-detection. Useful when files have nonstandard extensions.

Format Details

JSON

The primary format. Parsed by simd-json in DOM mode (full random access, rich analysis) or streaming mode (bounded memory, SAX-style events).

vajra inspect claim.json

echo '{"patient": "Martinez", "status": "active"}' | vajra inspect -

NDJSON (Newline-Delimited JSON)

Each line is an independent JSON document. Natural format for logs, event streams, and batch data.

vajra anomalies claims.ndjson

NDJSON records are aggregated into a single array for analysis. Commands like stats, anomalies, invariants, and essence compute across all records as a unified population.

Example input:

{"claim_id": "C001", "status": "adjudicated", "amount": 285.00}
{"claim_id": "C002", "status": "denied", "amount": 0.00}
{"claim_id": "C003", "status": "adjudicated", "amount": 47250.00}

YAML

Single-document and multi-document YAML both supported. Parsed via serde_yaml and converted to Vajra’s internal document model.

vajra inspect config.yaml

Multi-document YAML (separated by ---):

---
claim_id: C001
status: adjudicated
amount: 285.00
---
claim_id: C002
status: denied
amount: 0.00

vajra anomalies multi_claims.yaml

CSV

The first row is treated as column headers. Each subsequent row becomes a JSON object with header names as keys.

vajra stats claims.csv

Example input:

claim_id,status,charge_amount,allowed_amount
C001,adjudicated,285.00,210.00
C002,denied,125.00,
C003,adjudicated,890.00,675.00

Vajra converts this to:

[
  {"claim_id": "C001", "status": "adjudicated", "charge_amount": "285.00", "allowed_amount": "210.00"},
  {"claim_id": "C002", "status": "denied", "charge_amount": "125.00", "allowed_amount": ""},
  {"claim_id": "C003", "status": "adjudicated", "charge_amount": "890.00", "allowed_amount": "675.00"}
]

Empty cells are preserved as empty strings, allowing missingness analysis to detect them.
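
The header-to-object conversion can be illustrated with a deliberately naive sketch. It assumes no quoted fields or embedded commas, which a real CSV parser must handle, and `csv_to_records` is a hypothetical name rather than a Vajra function.

```rust
use std::collections::BTreeMap;

/// Convert header-plus-rows CSV text into one map per data row.
/// Assumption: no quoted fields and no embedded commas.
/// Note: BTreeMap orders keys alphabetically, not in header order.
fn csv_to_records(text: &str) -> Vec<BTreeMap<String, String>> {
    let mut lines = text.lines().filter(|l| !l.trim().is_empty());
    let headers: Vec<&str> = match lines.next() {
        Some(h) => h.split(',').map(str::trim).collect(),
        None => return Vec::new(),
    };
    lines
        .map(|line| {
            let cells: Vec<&str> = line.split(',').collect();
            headers
                .iter()
                .enumerate()
                .map(|(i, h)| {
                    // Missing or empty cells stay as "" so missingness analysis can see them.
                    let cell = cells.get(i).copied().unwrap_or("").trim();
                    (h.to_string(), cell.to_string())
                })
                .collect()
        })
        .collect()
}
```

Every value stays a string at this stage, as in the converted output shown above, and the empty `allowed_amount` in the second row survives as `""`.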

TSV

Identical to CSV but tab-delimited. Same header-to-object conversion.

vajra stats data.tsv
vajra stats data.txt --input-format tsv

Markdown

Vajra extracts structured content from Markdown files:

Tables are parsed into arrays of objects (headers become keys)
JSON/YAML code blocks are parsed as embedded documents

vajra inspect report.md

PDF

Text is extracted from PDF files and parsed for any structured content (embedded tables, JSON fragments, structured text patterns).

vajra inspect document.pdf

PDF support depends on the pdf-extract crate. Complex layouts may lose structure during extraction.

Source Code

Vajra can analyze source code from any language supported by tree-sitter. The source file is parsed into a concrete syntax tree (CST), converted to a JSON structure, and analyzed through the full Vajra pipeline — entropy, anomalies, fingerprinting, drift, motifs, and essence all work on code.

vajra inspect main.rs                       # auto-detect Rust
vajra stats app.py                          # auto-detect Python
vajra drift v1/server.go v2/server.go       # code structural drift
vajra essence lib.rs --profile engineer     # code essence
vajra inspect main.rs --lang rust           # explicit language
vajra inspect code.txt --input-format source --lang python  # override format + language

Supported languages (each enabled by a feature flag, all on by default):

Language	Extensions	Feature Flag
Rust	`.rs`	`rust`
Python	`.py`, `.pyi`	`python`
JavaScript	`.js`, `.mjs`, `.cjs`, `.jsx`	`javascript`
TypeScript	`.ts`, `.tsx`, `.mts`, `.cts`	`typescript`
Go	`.go`	`go`
Java	`.java`	`java`
C	`.c`, `.h`	`c`
C++	`.cpp`, `.cc`, `.cxx`, `.hpp`	`cpp`
Ruby	`.rb`	`ruby`

What Vajra reveals on code:

Analysis	What It Finds
Entropy of AST node types	Structural diversity — boilerplate vs complex code
Rarity of node types	Unusual constructs — `goto`, `unsafe`, `eval`
Nesting depth anomalies	Complexity hotspots
Fingerprint comparison	Structural clones across files
Drift between versions	Added functions, removed classes, changed signatures
Motifs	Repeated structural patterns — copy-paste code

Source code analysis requires the vajra-source crate (included by default). The companion vajra-domain-source plugin adds recognizers for naming conventions (snake_case, camelCase, PascalCase) and code structure relationships.

Semantic Paths

The --semantic-paths flag maps tree-sitter node kinds to human-readable labels in the output. Instead of raw AST node names like function_item or impl_item, you see function and implementation.

vajra inspect main.rs --semantic-paths

Without --semantic-paths:

$.program.function_item[0].identifier         "process_record"
$.program.function_item[0].parameters.parameter[0]   "record: &Record"
$.program.impl_item[0].identifier             "Pipeline"

With --semantic-paths:

$.program.function[0].name                    "process_record"
$.program.function[0].parameters.param[0]     "record: &Record"
$.program.implementation[0].name              "Pipeline"

Covers 9 languages: Rust, Python, JavaScript, TypeScript, Go, Java, C, C++, and Ruby.

Git Repository

When the input is a directory containing a .git/ subdirectory, Vajra reads the commit history directly — no export step required.

vajra stats ./my-repo
vajra cascade ./my-repo --entity-field '$.author' --time-field '$.date' --event-field '$.type' --response-values 'fix,revert'

Each commit becomes a JSON record with fields like author, date, message, files_changed, and insertions/deletions.

Flags:

Flag	Description	Default
`--git-limit <N>`	Maximum number of commits to read	500
`--git-branch <branch>`	Branch to read from	current HEAD

vajra stats ./my-repo --git-limit 1000 --git-branch main

Auto-detection is based on the presence of .git/ in the input directory. To override, use --input-format git.

V8 CPU Profile

Vajra parses .cpuprofile files produced by V8-based tools (Chrome DevTools, Node.js --cpu-prof). The profile’s call tree is converted to a flat array of records with function name, source location, hit count, and self/total time.

vajra stats profile.cpuprofile
vajra anomalies profile.cpuprofile

Auto-detected by the .cpuprofile extension.

strace Summary

Vajra parses the summary table produced by strace -c. Each syscall row becomes a record with fields for time percentage, seconds, calls, errors, and syscall name.

strace -c ls 2>&1 | vajra stats -
vajra stats strace_output.txt --input-format strace

Auto-detected when content contains the % time column header characteristic of strace -c output.

Compressed Files (Gzip, Zstd)

Compression is transparent. Vajra decompresses on the fly and auto-detects the inner format.

vajra inspect claims.json.gz
vajra stats archive.json.zst

This works with any inner format — claims.ndjson.gz, data.yaml.zst, report.csv.gz.

HTTP URLs

Vajra fetches the URL via blocking HTTP GET and analyzes the response body.

vajra inspect https://api.example.com/v1/claims/12345
vajra stats https://data.example.com/feed.ndjson

The response content type and body are used for format detection. No authentication headers are supported in the current version — for authenticated endpoints, fetch with curl and pipe to stdin:

curl -H "Authorization: Bearer $TOKEN" https://api.example.com/data | vajra inspect -

Stdin

The - argument reads from standard input. Format is auto-detected from the content.

cat claim.json | vajra inspect -
curl https://api.example.com/data | vajra stats -
jq '.claims[]' data.json | vajra anomalies -
zcat claims.json.gz | vajra inspect -

Multi-Document Formats

NDJSON and multi-document YAML naturally contain multiple documents. NDJSON records are aggregated into a single array, so all commands, including stats, anomalies, invariants, and essence, compute across all records as a unified population.

vajra anomalies claims.ndjson          # analyzes all lines as a batch
vajra stats claims.ndjson              # computes stats across all records

Directory Input

When the input is a directory path, Vajra discovers all supported files:

vajra batch ./claims/                  # processes all files in the directory
vajra cluster ./claims/                # clusters all files in the directory

Subdirectories are not traversed recursively by default.

The Engine

Vajra processes structured data through a six-layer pipeline. Each layer depends on the one before it. Each layer’s outputs are independently useful. The pipeline can exit early at any layer depending on the command.

The Six Layers

Raw Input
  -> [1] Parse + Normalize
  -> [2] Structural Analysis
  -> [3] Statistical Analysis
  -> [4] Semantic Lifting
  -> [5] Concern-Oriented Scoring
  -> [6] Deterministic Essence Rendering

Layer 1: Parse + Normalize

Responsibility: Take raw bytes and produce a traversable document model.

What happens:

Format detection. Auto-detect or apply --input-format override. See Input Formats.
Decompression. Gzip and Zstd payloads are decompressed transparently.
Parsing. JSON via simd-json (DOM mode) or SAX-style streaming. YAML, CSV, TSV, Markdown, PDF converted to JSON-equivalent internal representation.
Canonicalization. RFC 8785 (JSON Canonicalization Scheme) applied: lexicographic key ordering and deterministic number formatting, plus Unicode NFC normalization as a Vajra extension.
Input hardening. Maximum nesting depth enforced (default 256). Maximum string length enforced. Malformed input produces clean errors with byte offset locations.

Output: A Document — the parsed value tree plus metadata (node count, depth, raw size, content hash).

Complexity: O(n) time. O(n) memory in DOM mode, O(1) in streaming.

Commands that stop here: None. Every command needs at least a parsed document.

Layer 2: Structural Analysis

Responsibility: Extract the structural skeleton — every path, every type, every parent-child relationship.

What happens:

Path extraction. DFS traversal computes full JSONPath for every node. Array indices normalized to [*] for wildcard paths.
Path trie construction. Wildcard paths stored in a trie. Each trie node holds aggregated metadata: count, type distribution, depth, parent type, sibling count.
Fingerprinting. BLAKE3 path set hash, typed path hash, and Merkle subtree hashes computed in a single bottom-up traversal.
Motif detection. Subtree hashes that appear more than once identify repeated structural patterns. Ranked by frequency times subtree size.
Array morphology. Per-array cardinality distribution, type homogeneity, element uniqueness, nested shape diversity.

Output: Path trie, fingerprints, motif index, array morphology profiles.

Complexity: O(n) time, O(p) memory where p = distinct wildcard paths.

Commands that exit here: inspect, fingerprint.
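
The index normalization in the path extraction step can be sketched in a few lines. `to_wildcard_path` is a hypothetical helper, and the assumption that only purely numeric bracket contents are rewritten is mine, not something the text specifies.

```rust
/// Replace every numeric array index such as `[12]` with `[*]`.
/// Bracket contents that are not plain digits are copied through unchanged.
fn to_wildcard_path(path: &str) -> String {
    let mut out = String::with_capacity(path.len());
    let mut chars = path.chars();
    while let Some(c) = chars.next() {
        if c != '[' {
            out.push(c);
            continue;
        }
        // Collect everything up to the closing bracket.
        let mut inner = String::new();
        let mut closed = false;
        for d in chars.by_ref() {
            if d == ']' {
                closed = true;
                break;
            }
            inner.push(d);
        }
        if closed && !inner.is_empty() && inner.chars().all(|ch| ch.is_ascii_digit()) {
            out.push_str("[*]");
        } else {
            out.push('[');
            out.push_str(&inner);
            if closed {
                out.push(']');
            }
        }
    }
    out
}
```

Concrete paths that differ only in array position collapse to the same wildcard path, which is what lets the trie aggregate counts and type distributions per path rather than per node.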

Layer 3: Statistical Analysis

Responsibility: Quantify the distribution of every observable quantity in the document.

What happens:

Frequency analysis. Per-path value frequencies via exact counting (or Count-Min Sketch in streaming mode). Top-k values via Space-Saving.
Entropy computation. Shannon entropy and normalized entropy per path. The most informative universal signal in the system.
Missingness profiling. Null rate, absent rate, empty rate, type instability rate per path. Identifies quasi-required fields and suspicious omissions.
Numeric distributions. Min, max, mean, median, MAD, percentiles via DDSketch. Skewness proxy. Heavy-tail indicator.
Co-occurrence. Pointwise Mutual Information (PMI) between field pairs for the top-k most frequent paths.

Output: Per-path feature vectors stored in the feature store. The statistical backbone of everything downstream.

Complexity: O(n) time, O(p + v) memory where v = distinct values per path (bounded by sketches in streaming mode).

Commands that exit here: stats, anomalies.

Layer 4: Semantic Lifting

Responsibility: Infer likely semantic types from raw JSON scalar types and discover cross-field relationships.

What happens:

Type inference. DFA bank runs against values: dates, currency-like values, identifiers, enum-like fields, code tokens, phone numbers, free text. Each inference carries a confidence label (definite, dominant, heuristic, unclassified).
Relationship discovery. Conditional entropy between field pairs identifies functional dependencies. PMI identifies co-occurrence patterns.
Domain plugin integration. Registered plugins contribute additional type recognizers and relationship hints. The medical plugin recognizes ICD-10, CPT, NPI, HCPCS patterns.
Temporal analysis. When date/datetime fields are detected, inter-event intervals, monotonicity, gaps, and chronology violations are analyzed.

Output: Semantic type annotations, relationship graph, temporal observations, domain hints.

Complexity: O(n) for type inference, O(k^2 * n) for relationship discovery where k = top-k field screening threshold (default 50).

Commands that exit here: invariants, query.
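
The relationship discovery step relies on conditional entropy. As a rough illustration (the function name `conditional_entropy` is hypothetical and this is not Vajra's code), H(Y | X) = 0 over the observed pairs means X fully determines Y, which is the signature of a functional dependency.

```rust
use std::collections::BTreeMap;

/// Conditional entropy H(Y | X) in bits over observed (x, y) pairs:
/// H(Y | X) = -sum over (x, y) of p(x, y) * log2(p(y | x)).
fn conditional_entropy(pairs: &[(&str, &str)]) -> f64 {
    if pairs.is_empty() {
        return 0.0;
    }
    let n = pairs.len() as f64;
    let mut joint: BTreeMap<(&str, &str), u64> = BTreeMap::new();
    let mut marginal_x: BTreeMap<&str, u64> = BTreeMap::new();
    for &(x, y) in pairs {
        *joint.entry((x, y)).or_insert(0) += 1;
        *marginal_x.entry(x).or_insert(0) += 1;
    }
    let mut h = 0.0;
    for (&(x, _), &c_xy) in &joint {
        let p_xy = c_xy as f64 / n;
        let p_y_given_x = c_xy as f64 / marginal_x[&x] as f64;
        h -= p_xy * p_y_given_x.log2();
    }
    h
}
```

A result near zero suggests a dependency worth reporting; a result near H(Y) suggests the two fields are unrelated.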

Layer 5: Concern-Oriented Scoring

Responsibility: Score every observation against the active concern profile’s weight vector and select what matters.

What happens:

Candidate collection. Every notable observation from layers 2-4 becomes a candidate: high-entropy fields, anomalies, motifs, relationship discoveries, drift observations.
Signal normalization. Each of the six scoring dimensions normalized to [0, 1].
Composite scoring. Weighted sum using the profile’s weight vector.
Ranking. Candidates sorted by composite score with deterministic tie-breaking (path depth, then lexicographic).
Token budget enforcement. If --budget N is set, greedy knapsack selection by score-per-token.

Output: Ranked, budgeted list of observations ready for rendering.

Complexity: O(c log c) where c = number of candidates (typically a few dozen to a few hundred).

Commands that exit here: None directly — this feeds rendering.
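
Composite scoring and token budget enforcement can be sketched as follows. The struct and function names are hypothetical, per-candidate token costs are assumed to be known, and the greedy rule used here (skip anything that no longer fits and keep scanning) is only one plausible reading of "greedy knapsack selection by score-per-token".

```rust
/// The six scoring dimensions. Used both for normalized signals in [0, 1]
/// and for a profile's weight vector.
#[derive(Debug, Clone, Copy)]
struct Signals {
    rarity: f64,
    instability: f64,
    entropy_signal: f64,
    structural_coverage: f64,
    anomaly_strength: f64,
    concern_relevance: f64,
}

/// Weighted sum of the six dimensions; `w` holds the profile weights.
fn composite_score(s: &Signals, w: &Signals) -> f64 {
    s.rarity * w.rarity
        + s.instability * w.instability
        + s.entropy_signal * w.entropy_signal
        + s.structural_coverage * w.structural_coverage
        + s.anomaly_strength * w.anomaly_strength
        + s.concern_relevance * w.concern_relevance
}

/// Hypothetical candidate: an id, its composite score, and its rendered token cost.
#[derive(Debug, Clone)]
struct Candidate {
    id: usize,
    score: f64,
    tokens: usize,
}

/// Greedy selection by score-per-token under a token budget.
/// Ties on density fall back to the lower id so the result is deterministic.
fn select_within_budget(mut candidates: Vec<Candidate>, budget: usize) -> Vec<usize> {
    candidates.retain(|c| c.tokens > 0);
    candidates.sort_by(|a, b| {
        let da = a.score / a.tokens as f64;
        let db = b.score / b.tokens as f64;
        db.total_cmp(&da).then_with(|| a.id.cmp(&b.id))
    });
    let mut remaining = budget;
    let mut chosen = Vec::new();
    for c in candidates {
        if c.tokens <= remaining {
            remaining -= c.tokens;
            chosen.push(c.id);
        }
    }
    chosen
}
```

Because the weights in each profile sum to 1.0 and every signal lies in [0, 1], the composite score also stays in [0, 1], which keeps scores comparable across profiles.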

Layer 6: Deterministic Essence Rendering

Responsibility: Transform the scored, ranked observations into the final output.

What happens:

Motif collapsing. Repeated structures represented once with count and variation notes.
Template application. The profile’s rendering configuration (vocabulary level, section headers, formatting rules) is applied.
Format rendering. Output produced in the requested format: text, JSON, Markdown, or compact-AI.
Redaction. If --redact is enabled, pattern-based redaction applied before final emission.
Provenance attachment. Every essence includes: Vajra version, profile used, input hash, config hash, timestamp.

Output: The essence — a compressed, prioritized, faithful representation of the input data.

Complexity: O(c) where c = number of included observations.

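Commands that exit here: essence. The drift, cluster, and batch commands also render their results, but they run only the analysis layers listed under Early Exit Points before their own comparison, similarity, or aggregation step.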

Data Flow Diagram

                    +-----------+
                    | Raw Input |
                    +-----+-----+
                          |
                    [1] Parse + Normalize
                          |
                   +------v------+
                   |  Document   |
                   | (value tree |
                   |  + metadata)|
                   +------+------+
                          |
                    [2] Structural Analysis
                          |
         +-------+--------+--------+--------+
         |       |        |        |        |
      Path    Finger-   Motif   Array    Domain
      Trie    prints    Index   Morph.   Hints
         |       |        |        |        |
         +-------+--------+--------+--------+
                          |
                    [3] Statistical Analysis
                          |
                   +------v------+
                   | Feature     |
                   | Store       |
                   | (per-path   |
                   |  vectors)   |
                   +------+------+
                          |
                    [4] Semantic Lifting
                          |
         +-------+--------+--------+
         |       |        |        |
      Type    Relation-  Temporal  Plugin
      Infer.  ships      Patterns  Hints
         |       |        |        |
         +-------+--------+--------+
                          |
                    [5] Scoring + Selection
                          |
                   +------v------+
                   | Ranked      |
                   | Observations|
                   +------+------+
                          |
                    [6] Rendering
                          |
                   +------v------+
                   |   Essence   |
                   +-------------+

Early Exit Points

Not every command runs all six layers. The engine exits as early as possible:

Command	Layers Used
`inspect`	1, 2
`fingerprint`	1, 2
`stats`	1, 2, 3
`anomalies`	1, 2, 3
`invariants`	1, 2, 3, 4
`query`	1, 2, 3, 4
`essence`	1, 2, 3, 4, 5, 6
`drift`	1, 2, 3 (both docs), then comparison
`cluster`	1, 2 (all docs), then similarity
`batch`	1, 2, 3 (all docs), then aggregation

This is why inspect is fast and essence is slower — inspect exits after structural analysis while essence runs the full pipeline.

Deep Dives

Algorithms — every algorithm with provenance, complexity, and what it replaced
Streaming — how the engine handles documents that exceed memory
Determinism — how every source of nondeterminism is eliminated

Algorithms

This is the technical heart of Vajra. Every algorithm here was selected against three gates. Any algorithm that failed any gate was cut.

The Three Gates

Gate 1: Scale

O(n) or O(n log n) time complexity. Bounded or streaming-compatible memory. If an algorithm cannot handle a billion nodes without choking, it does not enter.

Gate 2: Battle-Tested

Published, peer-reviewed, deployed in production systems at scale. No novel algorithms. No research prototypes. No “clever tricks” that have not survived contact with real data.

Gate 3: Deterministic

Same input, same output. If an algorithm requires random sampling without seed control, or produces platform-dependent results, or depends on iteration order of an unordered collection — it does not enter.

The Algorithms

BLAKE3 Hashing

Provenance: O’Connor, Aumasson, Neves, Wilcox-O’Hearn, 2020. Rust-native reference implementation.

What it does: All hashing in Vajra. Path set fingerprints, typed path fingerprints, Merkle subtree hashing, content hashing, MinHash hash functions.

Why BLAKE3 over alternatives:

Contender	Why Rejected
SHA-256	3-7x slower on modern hardware. No parallelism.
SHA-3	Slower than BLAKE3 on all platforms. No parallel tree structure.
xxHash / FNV	Not cryptographic. Collision resistance matters for fingerprinting.
SipHash	Designed for hash tables, not content addressing. Slower for bulk data.

Why BLAKE3 wins: 3-7x faster than SHA-256. Internally parallelizable via Bao tree structure. 256-bit output with cryptographic strength. Rust-native. Deterministic. One algorithm for every hashing need in the system.

Complexity: O(n) time, O(1) memory per hash. Internally parallel for large inputs.

simd-json Parsing

Provenance: Langdale & Lemire, 2019. Based on the simdjson C++ library, ported to Rust.

What it does: DOM-mode JSON parsing at 2+ GB/s using SIMD instructions for structural character classification and string validation.

Why simd-json over alternatives:

Contender	Why Rejected
serde_json	400-800 MB/s. Adequate but not operational speed for large documents.
sonic-rs	Young ecosystem. Less battle-tested.
Manual SAX parser	Necessary for streaming mode, but DOM mode needs random access.

Why simd-json wins: Measured 2+ GB/s on modern hardware. Uses SIMD for structural indexing. Operates on mutable borrowed slices for zero-copy access to string values.

Complexity: O(n) time. O(n) memory for the DOM.

RFC 8785 Canonicalization (JCS)

Provenance: IETF RFC 8785, published 2020. The standard for deterministic JSON serialization.

What it does: Removes irrelevant representational variance. Lexicographic key ordering by UTF-16 code unit sequence. Deterministic number formatting. No whitespace.

Extensions beyond the RFC:

Unicode NFC normalization (UAX #15) for string comparison stability
Null vs. absent distinction preservation (RFC 8785 does not address this; Vajra tracks it explicitly)
Configurable array order policy: preserve (default), set (unordered deduplicated), multiset (unordered with duplicates)

Complexity: O(n log k) where k = maximum keys per object. Memory: O(n).
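
Ordering keys by UTF-16 code units is not the same as Rust's default string ordering, which compares UTF-8 bytes. The sketch below shows the difference; `jcs_sort_keys` is a hypothetical helper name, and it covers only the key-ordering rule, not number formatting or the rest of canonicalization.

```rust
/// Sort object keys the way RFC 8785 requires: by UTF-16 code unit sequence.
fn jcs_sort_keys(keys: &mut Vec<String>) {
    keys.sort_by(|a, b| a.encode_utf16().cmp(b.encode_utf16()));
}
```

The two orderings diverge only for keys containing characters outside the Basic Multilingual Plane: such characters encode as surrogate pairs starting at 0xD800, so they sort before BMP characters in the range U+E000 to U+FFFF, whereas byte-wise comparison puts them after.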

Shannon Entropy

Provenance: Shannon, 1948. The foundation of information theory. More than seven decades of deployment in every field that measures information.

What it does: For each path, measures the information content of observed values.

H(X) = -sum p(x) * log2(p(x))

Normalized:

H_norm(X) = H(X) / log2(|support|)

Why entropy is the strongest universal primitive: It distinguishes boilerplate from signal without domain knowledge. A constant field (H = 0) is noise. A uniform random field (H_norm = 1) is unstructured. Meaningful variation lives in the middle — identifiers, dates, codes, status values.

Streaming computation: Maintained via running counts per value per path. When the value space exceeds memory, entropy is estimated from Count-Min Sketch frequency approximations.

Complexity: O(n) time, O(v) space where v = distinct values per path.
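
Both formulas translate directly into code over a value-count table. This is a minimal sketch with hypothetical function names; treating a path with at most one observed value as having normalized entropy 0 is an assumption made here to avoid dividing by log2(1) = 0.

```rust
use std::collections::BTreeMap;

/// Count occurrences of each distinct value.
fn count_values(values: &[&str]) -> BTreeMap<String, u64> {
    let mut counts = BTreeMap::new();
    for v in values {
        *counts.entry(v.to_string()).or_insert(0) += 1;
    }
    counts
}

/// Shannon entropy in bits: H(X) = -sum p(x) * log2(p(x)).
fn shannon_entropy(counts: &BTreeMap<String, u64>) -> f64 {
    let total: u64 = counts.values().sum();
    if total == 0 {
        return 0.0;
    }
    let mut h = 0.0;
    for &c in counts.values() {
        if c == 0 {
            continue;
        }
        let p = c as f64 / total as f64;
        h -= p * p.log2();
    }
    h
}

/// H(X) / log2(|support|); defined as 0 when at most one value is observed.
fn normalized_entropy(counts: &BTreeMap<String, u64>) -> f64 {
    let support = counts.values().filter(|&&c| c > 0).count();
    if support <= 1 {
        return 0.0;
    }
    shannon_entropy(counts) / (support as f64).log2()
}
```

A field with four equally likely values has H = 2 bits and H_norm = 1, while a constant field has H = 0.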

Count-Min Sketch (CMS) with Conservative Update

Provenance: Cormode & Muthukrishnan, 2005 (original CMS). Conservative update: Estan & Varghese, 2002.

What it does: Streaming frequency estimation when exact counts would exceed memory. Maintains a 2D array of counters with multiple hash functions.

Conservative update improvement: Instead of incrementing all d counters, only increment counters currently equal to the minimum. Provably reduces over-estimation error without changing the data structure.

Parameters:

Width w = ceil(e / epsilon) where epsilon = desired error rate (default 0.001)
Depth d = ceil(ln(1 / delta)) where delta = failure probability (default 0.01)
Default: w = 2,719, d = 5

Guarantees: Estimated count satisfies: true count <= estimate <= true count + epsilon * N with probability >= 1 - delta, where N = total count.

What it replaced:

Contender	Why Rejected
Exact hash maps	Unbounded memory on high-cardinality paths.
Bloom filters	Cannot count — only membership testing.
Count Sketch	Can return negative estimates and can underestimate. CMS estimates are non-negative and never fall below the true count.

Complexity: O(d) per update. O(w * d) memory. Both constants independent of data size.
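
A compact sketch of the data structure and the conservative update rule follows. It is illustrative only: the type and function names are hypothetical, and the seeded FNV-1a hash with a splitmix64-style finalizer is a stand-in chosen to keep the example dependency-free, whereas the text states that Vajra uses BLAKE3 for all hashing.

```rust
/// Width and depth from the error rate epsilon and failure probability delta.
fn cms_dimensions(epsilon: f64, delta: f64) -> (usize, usize) {
    let width = (std::f64::consts::E / epsilon).ceil() as usize;
    let depth = (1.0 / delta).ln().ceil() as usize;
    (width, depth)
}

/// Stand-in seeded hash (FNV-1a plus a splitmix64-style finalizer). Not BLAKE3.
fn seeded_hash(item: &[u8], seed: u64) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325 ^ seed;
    for &b in item {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h ^= h >> 30;
    h = h.wrapping_mul(0xbf58476d1ce4e5b9);
    h ^= h >> 27;
    h = h.wrapping_mul(0x94d049bb133111eb);
    h ^ (h >> 31)
}

/// Count-Min Sketch: `depth` rows of `width` counters, stored row-major.
struct CountMinSketch {
    width: usize,
    depth: usize,
    counters: Vec<u64>,
}

impl CountMinSketch {
    fn new(epsilon: f64, delta: f64) -> Self {
        let (width, depth) = cms_dimensions(epsilon, delta);
        CountMinSketch { width, depth, counters: vec![0; width * depth] }
    }

    /// Index of the counter for `item` in the given row.
    fn slot(&self, row: usize, item: &[u8]) -> usize {
        row * self.width + (seeded_hash(item, row as u64 + 1) % self.width as u64) as usize
    }

    /// Point estimate: the minimum counter across rows.
    fn estimate(&self, item: &[u8]) -> u64 {
        (0..self.depth)
            .map(|r| self.counters[self.slot(r, item)])
            .min()
            .unwrap_or(0)
    }

    /// Conservative update: raise only the counters that sit below estimate + 1.
    fn add(&mut self, item: &[u8]) {
        let target = self.estimate(item) + 1;
        for r in 0..self.depth {
            let i = self.slot(r, item);
            if self.counters[i] < target {
                self.counters[i] = target;
            }
        }
    }
}
```

Counters that already exceed the minimum because of collisions are left alone, which is where the reduction in over-estimation comes from.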

Space-Saving Algorithm

Provenance: Metwally, Agrawal, El Abbadi, 2005.

What it does: Identifies the top-k most frequent elements in a stream using exactly k counters.

Mechanism: When a new element arrives that is not tracked, evict the element with the smallest count, replace it, and increment. Despite its simplicity, every element whose true frequency exceeds N/k is guaranteed to be in the summary.

What it replaced:

Contender	Why Rejected
Frequent algorithm (Misra-Gries)	Weaker error bounds for the same space.
Lossy Counting	Higher space complexity. More complex implementation.
Full sorting	O(n log n) and requires all data in memory.

Complexity: O(1) per update with the Stream-Summary structure from the original paper (O(log k) with a min-heap). O(k) memory.
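
The eviction mechanism is short enough to show in full. The sketch below is a simplified illustration with hypothetical names: it finds the minimum by scanning all k counters, which costs O(k) per eviction, instead of using a faster priority structure.

```rust
use std::collections::BTreeMap;

/// Space-Saving summary with at most `k` tracked items (k must be at least 1).
struct SpaceSaving {
    k: usize,
    counts: BTreeMap<String, u64>,
}

impl SpaceSaving {
    fn new(k: usize) -> Self {
        SpaceSaving { k, counts: BTreeMap::new() }
    }

    fn observe(&mut self, item: &str) {
        if let Some(c) = self.counts.get_mut(item) {
            *c += 1;
            return;
        }
        if self.counts.len() < self.k {
            self.counts.insert(item.to_string(), 1);
            return;
        }
        // Evict the smallest counter. BTreeMap iteration is key-ordered and
        // min_by_key returns the first minimum, so ties resolve deterministically.
        let (victim, min_count) = self
            .counts
            .iter()
            .min_by_key(|entry| *entry.1)
            .map(|(key, &c)| (key.clone(), c))
            .expect("k must be at least 1");
        self.counts.remove(&victim);
        // The newcomer inherits the evicted count plus one, so counts never underestimate.
        self.counts.insert(item.to_string(), min_count + 1);
    }

    /// Tracked items, most frequent first (ties broken by key).
    fn top(&self) -> Vec<(String, u64)> {
        let mut v: Vec<(String, u64)> =
            self.counts.iter().map(|(k, &c)| (k.clone(), c)).collect();
        v.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        v
    }
}
```

Reported counts are upper bounds: an item that replaced an evicted entry may be over-counted by at most the evicted count.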

DDSketch

Provenance: Masson, Rim, Lee, 2019. Developed at Datadog. Deployed in production across billions of data points per second.

What it does: Streaming quantile estimation with relative error guarantees. For any quantile q, the returned value satisfies |estimate - true| <= alpha * |true| where alpha is the relative accuracy parameter.

Why DDSketch over alternatives:

Contender	Why Rejected
t-digest (Dunning, 2019)	No formal error guarantees. Accuracy is empirically good but theoretically unbounded.
Fixed-width histograms	Absolute error is meaningless when values span orders of magnitude (cents to millions).
Random sampling	No guarantees on tail quantiles — precisely where anomalies live.
GK sketch (Greenwald-Khanna)	Bounds rank error, not relative value error. DDSketch adapts to data scale.

Critical property: mergeability. DDSketch instances can be merged exactly, preserving accuracy guarantees. This enables parallel batch processing: analyze partitions independently, merge sketches for global statistics with zero accuracy loss.

Parameters: alpha = 0.01 (1% relative error) by default. Memory: O(log(max/min) / log(1 + alpha)) buckets. For financial data spanning cents to millions (a ratio of 10^8), that works out to roughly one to two thousand buckets.

Complexity: O(1) per insertion. O(1) per quantile query.
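
The core of DDSketch is a logarithmic bucketing rule. The following is a minimal sketch for positive values only, with hypothetical names; it omits the zero and negative stores and the bucket-collapsing logic that a production implementation needs. It uses gamma = (1 + alpha) / (1 - alpha) and the lower-rank quantile definition, both taken from the published algorithm rather than from Vajra's source.

```rust
use std::collections::BTreeMap;

/// Minimal DDSketch for positive finite values.
struct DdSketch {
    gamma: f64,
    buckets: BTreeMap<i32, u64>,
    count: u64,
}

impl DdSketch {
    fn new(alpha: f64) -> Self {
        DdSketch {
            gamma: (1.0 + alpha) / (1.0 - alpha),
            buckets: BTreeMap::new(),
            count: 0,
        }
    }

    /// Bucket i holds values in (gamma^(i-1), gamma^i].
    fn insert(&mut self, x: f64) {
        assert!(x > 0.0 && x.is_finite(), "this sketch only handles positive finite values");
        let i = (x.ln() / self.gamma.ln()).ceil() as i32;
        *self.buckets.entry(i).or_insert(0) += 1;
        self.count += 1;
    }

    /// Estimate the q-quantile as the representative value of the bucket
    /// containing the item at rank floor(q * (n - 1)).
    fn quantile(&self, q: f64) -> Option<f64> {
        if self.count == 0 {
            return None;
        }
        let rank = (q.clamp(0.0, 1.0) * (self.count - 1) as f64).floor() as u64;
        let mut seen = 0u64;
        for (&i, &c) in &self.buckets {
            seen += c;
            if seen > rank {
                return Some(2.0 * self.gamma.powi(i) / (self.gamma + 1.0));
            }
        }
        None
    }

    /// Merge by adding bucket counts; both sketches must share the same alpha.
    fn merge(&mut self, other: &DdSketch) {
        for (&i, &c) in &other.buckets {
            *self.buckets.entry(i).or_insert(0) += c;
        }
        self.count += other.count;
    }
}
```

Merging is plain addition of bucket counts, so a sketch built from two partitions is identical to one built from the whole stream.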

Median Absolute Deviation (MAD)

Provenance: Hampel, 1974. Standard robust statistics.

What it does: Robust measure of dispersion.

MAD = median(|x_i - median(X)|)

Modified z-score:

z_MAD = 0.6745 * (x_i - median(X)) / MAD

Why MAD over standard deviation: Standard deviation has a 0% breakdown point, so a single extreme value can inflate sigma enough to mask every other anomaly. MAD has a 50% breakdown point, so just under half the data can be arbitrarily corrupted before MAD gives a misleading result. This is the highest breakdown point a scale estimator can attain.

Anomaly threshold: |z_MAD| > 3.5 flags an anomaly candidate (Iglewicz & Hoaglin, 1993). Configurable per profile.

Streaming computation: Exact MAD requires sorted data. Running approximate median via DDSketch enables streaming MAD estimation with bounded relative error.

Complexity: O(n log n) with sorting (O(n) expected with linear-time selection), or O(n) streaming via DDSketch.
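
The exact, non-streaming computation is a direct transcription of the two formulas above. This sketch uses hypothetical function names and sorts a copy of the data; returning no outliers when MAD is zero is an assumption made here because the modified z-score is undefined in that case.

```rust
/// Median of a slice (sorts a copy); None for empty input.
fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut v = values.to_vec();
    v.sort_by(|a, b| a.total_cmp(b));
    let mid = v.len() / 2;
    Some(if v.len() % 2 == 0 { (v[mid - 1] + v[mid]) / 2.0 } else { v[mid] })
}

/// MAD = median(|x_i - median(X)|).
fn mad(values: &[f64]) -> Option<f64> {
    let m = median(values)?;
    let deviations: Vec<f64> = values.iter().map(|x| (x - m).abs()).collect();
    median(&deviations)
}

/// Indices whose modified z-score exceeds `threshold` in absolute value.
fn mad_outliers(values: &[f64], threshold: f64) -> Vec<usize> {
    let (m, d) = match (median(values), mad(values)) {
        (Some(m), Some(d)) if d > 0.0 => (m, d),
        _ => return Vec::new(),
    };
    let mut flagged = Vec::new();
    for (i, x) in values.iter().enumerate() {
        let z = 0.6745 * (x - m) / d;
        if z.abs() > threshold {
            flagged.push(i);
        }
    }
    flagged
}
```

For the amounts 210, 285, 285, 300, and 47250, the median is 285 and the MAD is 15, so only the last value crosses the 3.5 threshold.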

MinHash

Provenance: Broder, 1997. The foundation of modern near-duplicate detection and similarity search.

What it does: Estimates Jaccard similarity between sets in constant time per comparison.

Vajra computes MinHash signatures over wildcard path sets using k independent hash functions (k = 128 by default). For memory efficiency, the b-bit MinHash variant (Li & König, 2011) stores only the lowest b bits of each hash value.

What it replaced:

Contender	Why Rejected
Exact pairwise Jaccard	O(n^2) for batch comparison. Fine for < 1,000 docs, breaks above that.
Random projection (SimHash)	Better for cosine similarity. MinHash is optimal for Jaccard.

Complexity: O(n) for signature computation. O(k) for pairwise comparison. O(k) memory per document.
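A minimal sketch of signature computation and comparison. The hash here (FNV-1a followed by a splitmix64-style mix, seeded per hash function) is an assumption chosen so the example needs no external crates. It is not the hash Vajra uses.

```rust
/// Seeded 64-bit hash: FNV-1a over the bytes, then a splitmix64-style finalizer.
/// Assumption for illustration only.
fn seeded_hash(item: &str, seed: u64) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in item.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    let mut z = h ^ seed.wrapping_mul(0x9E3779B97F4A7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
    z ^ (z >> 31)
}

/// MinHash signature: for each of k hash functions, the minimum hash over the set.
pub fn minhash_signature(paths: &[&str], k: usize) -> Vec<u64> {
    (0..k as u64)
        .map(|seed| {
            paths
                .iter()
                .map(|p| seeded_hash(p, seed))
                .min()
                .unwrap_or(u64::MAX)
        })
        .collect()
}

/// Jaccard estimate: fraction of signature positions that agree.
/// Assumes both signatures have the same non-zero length.
pub fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    let matches = a.iter().zip(b).filter(|(x, y)| x == y).count();
    matches as f64 / a.len() as f64
}
```

Comparison cost depends only on k, not on how many paths each document has. The estimate's standard error shrinks as 1/sqrt(k).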

SimHash

Provenance: Charikar, 2002.

What it does: Fixed-width fingerprints where Hamming distance approximates cosine distance. Used for near-motif detection — subtrees that are semantically the same but differ in one or two fields.

SimHash operates over (key, value_type) feature pairs within each subtree. Subtrees whose SimHash values have Hamming distance <= t (default t = 3 out of 64 bits) are grouped as near-motifs.

Complexity: O(n) time. O(1) per fingerprint comparison.

Locality-Sensitive Hashing (LSH)

Provenance: Indyk & Motwani, 1998. Banded variant as described by Leskovec, Rajaraman & Ullman.

What it does: Partitions MinHash signatures into b bands of r rows, hashing each band into buckets. Documents sharing a bucket in any band are candidate pairs.

The S-curve probability: P(candidate) = 1 - (1 - s^r)^b

With k = 128, b = 16, r = 8, the curve's threshold sits near (1/b)^(1/r) ≈ 0.71. Documents with Jaccard similarity of 0.85 or higher have a > 99% chance of becoming candidate pairs. At similarity 0.5 the chance is about 6%, and below 0.2 it is under 0.01%.
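The S-curve is easy to evaluate directly when tuning b and r. A small sketch, with hypothetical function names:

```rust
/// Probability that two documents with Jaccard similarity `s` share at least
/// one LSH bucket, given `bands` bands of `rows` rows: 1 - (1 - s^r)^b.
pub fn candidate_probability(s: f64, bands: u32, rows: u32) -> f64 {
    1.0 - (1.0 - s.powi(rows as i32)).powi(bands as i32)
}

/// Approximate similarity at which the S-curve rises most steeply: (1/b)^(1/r).
pub fn approx_threshold(bands: u32, rows: u32) -> f64 {
    (1.0 / bands as f64).powf(1.0 / rows as f64)
}
```

More bands with fewer rows moves the threshold down (more candidates, more false positives). Fewer bands with more rows moves it up.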

What it replaced:

Contender	Why Rejected
Hierarchical agglomerative clustering	O(n^2 log n) time, O(n^2) memory. Breaks on > 10K documents.
k-means / k-medoids	Requires specifying k in advance. Vajra cannot know the cluster count.

Complexity: O(n) amortized for LSH indexing.

Jensen-Shannon Divergence (JSD)

Provenance: Lin, 1991. Square root metric property: Endres & Schindelin, 2003; Österreicher & Vajda, 2003.

What it does: Measures distributional drift between two value distributions.

JSD(P || Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M)

where M = 0.5 * (P + Q).

Why JSD over alternatives:

Contender	Why Rejected
KL divergence	Asymmetric. Infinite when Q(x) = 0 and P(x) > 0.
Chi-squared test	Sensitive to bin size. Not a proper metric.
Kolmogorov-Smirnov	Measures maximum deviation only, not overall distribution shape.
Total variation distance	Does not account for similarity between nearby values.

Why JSD wins: Symmetric. Always finite. Bounded to [0, 1] with log base 2. Square root is a proper metric satisfying the triangle inequality. Drift magnitudes can be meaningfully compared.

Complexity: O(v) where v = union of value supports.
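A minimal sketch of the computation, assuming both distributions are given as probability vectors aligned over the union of their supports (missing values carry probability 0):

```rust
/// Jensen-Shannon divergence in bits between two discrete distributions
/// given as aligned probability vectors.
pub fn jsd(p: &[f64], q: &[f64]) -> f64 {
    assert_eq!(p.len(), q.len(), "distributions must be aligned over one support");
    // KL(A || M) with M = 0.5 * (A + B); terms with a = 0 contribute nothing.
    let kl_to_mid = |a: &[f64], b: &[f64]| -> f64 {
        a.iter()
            .zip(b)
            .filter(|(x, _)| **x > 0.0)
            .map(|(x, y)| {
                let m = 0.5 * (x + y);
                x * (x / m).log2()
            })
            .sum()
    };
    0.5 * kl_to_mid(p, q) + 0.5 * kl_to_mid(q, p)
}
```

Because M is positive wherever P or Q is, neither KL term can blow up. That is why JSD stays finite where plain KL divergence does not.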

1D Wasserstein Distance (Earth Mover’s Distance)

Provenance: Kantorovich, 1942. Computational formulation: O(n log n) via CDF sorting.

What it does: For numeric paths, measures “how far did values move” — not just that the distribution changed, but the magnitude of the shift.

Why included alongside JSD: JSD measures probability mass redistribution. Wasserstein captures the distance of the shift. A distribution that shifts entirely from $100 to $100.01 has low Wasserstein but potentially high JSD. A distribution that shifts from $100 to $10,000 has high Wasserstein. Both measures together give a complete picture.

Complexity: O(n log n) via sorting.
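For two samples of equal size, the 1D distance reduces to the mean absolute difference between order statistics. A sketch under that equal-size assumption (unequal sizes need the general CDF form):

```rust
/// 1D Wasserstein-1 distance between two equal-size samples.
/// Assumption: equal, non-zero sample sizes.
pub fn wasserstein_1d(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len(), "this sketch assumes equal sample sizes");
    assert!(!a.is_empty(), "samples must be non-empty");
    let mut a = a.to_vec();
    let mut b = b.to_vec();
    a.sort_by(|x, y| x.total_cmp(y));
    b.sort_by(|x, y| x.total_cmp(y));
    a.iter().zip(&b).map(|(x, y)| (x - y).abs()).sum::<f64>() / a.len() as f64
}
```

The result is in the units of the data: a shift from $100 to $100.01 yields about 0.01, a shift from $100 to $10,000 yields 9,900.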

Pointwise Mutual Information (PMI)

Provenance: Church & Hanks, 1989 (in NLP). Rooted in Shannon, 1948.

What it does: Measures association strength between field pairs.

PMI(x, y) = log2(P(x, y) / (P(x) * P(y)))

Positive = co-occur more than chance. Negative = avoid each other. Zero = independent.

Complexity: O(1) per pair given precomputed frequencies.
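Given precomputed counts, PMI is a single expression. A minimal sketch with a hypothetical signature:

```rust
/// PMI in bits from raw counts: `joint` co-occurrences of (x, y),
/// marginal counts for x and y, and the total number of observations.
/// Returns negative infinity when the pair never co-occurs.
pub fn pmi(joint: u64, count_x: u64, count_y: u64, total: u64) -> f64 {
    let n = total as f64;
    let p_xy = joint as f64 / n;
    let p_x = count_x as f64 / n;
    let p_y = count_y as f64 / n;
    (p_xy / (p_x * p_y)).log2()
}
```

PMI is noisy for rare pairs, since a single co-occurrence of two rare values can produce a large positive score. A minimum-count filter before scoring is a common guard.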

Conditional Entropy

Provenance: Shannon, 1948.

What it does: Measures how much knowing field X reduces uncertainty about field Y.

H(Y|X) = -sum p(x,y) * log2(p(y|x))

Low H(Y|X) means X predicts Y. H(Y|X) near 0 means functional dependency.

Complexity: O(n) to compute from co-occurrence counts.

Benford’s Law Analysis

Provenance: Newcomb, 1881. Benford, 1938. Formalized by Hill, 1995. Forensic application: Nigrini, 1996.

What it does: Tests whether leading digit distribution matches the expected logarithmic distribution:

P(d) = log10(1 + 1/d)

Departure measured via chi-squared goodness-of-fit. Effective for financial amounts, counts, and quantities. Not applicable to identifiers, codes, or constrained-range values — Vajra applies this only to paths classified as numeric with high cardinality and range spanning at least one order of magnitude.

Complexity: O(n) — single pass to count leading digits.
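A minimal sketch of the single-pass digit count and the chi-squared statistic. The leading-digit extraction by repeated scaling is an assumption made for brevity. The function names are illustrative.

```rust
/// Leading decimal digit (1-9) of a non-zero finite number.
fn leading_digit(x: f64) -> Option<usize> {
    let mut v = x.abs();
    if !v.is_finite() || v == 0.0 {
        return None;
    }
    while v >= 10.0 {
        v /= 10.0;
    }
    while v < 1.0 {
        v *= 10.0;
    }
    // Clamp guards against rounding at the bucket edges.
    Some((v as usize).clamp(1, 9))
}

/// Chi-squared statistic of observed leading digits against Benford's law.
/// With 8 degrees of freedom the 5% critical value is about 15.51.
/// Assumes at least one non-zero finite value.
pub fn benford_chi_squared(values: &[f64]) -> f64 {
    let mut counts = [0u64; 10];
    for d in values.iter().filter_map(|&x| leading_digit(x)) {
        counts[d] += 1;
    }
    let n: u64 = counts.iter().sum();
    (1..=9)
        .map(|d| {
            let expected = n as f64 * (1.0 + 1.0 / d as f64).log10();
            let diff = counts[d] as f64 - expected;
            diff * diff / expected
        })
        .sum()
}
```

Powers of 2 are a convenient conforming input for a quick check. Values with uniformly distributed leading digits fail the test decisively.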

The Rejection List

These algorithms were evaluated and rejected. Each had a specific reason.

Algorithm	Reason for Rejection
Isolation Forest (Liu et al., 2008)	Non-deterministic without careful seeding. O(n log n) per tree. Contamination parameter requires tuning. MAD + rarity + structural distance cover the same space with stronger interpretability.
Local Outlier Factor (Breunig et al., 2000)	O(n^2) naive, O(n log n) with spatial indexing. Sensitive to k parameter. Breaks universality on large datasets.
t-digest (Dunning, 2019)	No formal error guarantees. Accuracy is empirically good but theoretically unbounded. DDSketch provides provable bounds.
Hierarchical agglomerative clustering	O(n^2 log n) time, O(n^2) memory. Replaced by LSH-based component detection at O(n).
k-means / k-medoids	Requires specifying k in advance. Also O(n^2) per iteration for k-medoids.
Any method requiring training data	Vajra operates on cold data with no prior. Every method must work on a single document or batch with no history.
Any method requiring labeled examples	Same constraint. Unsupervised only.
Any method without deterministic output	The determinism guarantee is non-negotiable.

Information Theory

Vajra’s analytical core is an information-theoretic pipeline. Every measure of diversity, anomaly, drift, similarity, and dependency traces back to a concept from information theory. This chapter covers the full stack — from foundational primitives through composite metrics to the scoring model that turns bits into insights.

Foundation: Shannon Entropy

The starting point. Shannon entropy measures the average surprise per observation:

H(X) = -sum p(x) * log2(p(x))

0 bits: constant field (no information)
log2(k) bits: uniform distribution over k values (maximum information)
Between: the interesting space where identifiers, dates, and codes live

Normalized entropy scales to [0, 1]:

H_norm(X) = H(X) / log2(|support|)

This is the single most important signal in Vajra. A field with H_norm near 0 is nearly constant and carries little information. A field with H_norm near 1 is close to uniform over its support, as random identifiers are. Meaningful variation lives in the middle.
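Both quantities come from one frequency table. A minimal sketch over string values. Reporting H_norm = 0 for a single-value support (where log2(|support|) = 0) is an assumption of this sketch.

```rust
use std::collections::BTreeMap;

/// Returns (H, H_norm) in bits for a sequence of observed values.
pub fn shannon_entropy(values: &[&str]) -> (f64, f64) {
    let mut counts: BTreeMap<&str, u64> = BTreeMap::new();
    for v in values {
        *counts.entry(*v).or_insert(0) += 1;
    }
    let n = values.len() as f64;
    let h: f64 = counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.log2()
        })
        .sum();
    // Assumption: a constant field gets H_norm = 0 instead of 0/0.
    let h_norm = if counts.len() > 1 {
        h / (counts.len() as f64).log2()
    } else {
        0.0
    };
    (h, h_norm)
}
```

Four distinct values seen once each give H = 2 bits and H_norm = 1. Three copies of one value and one of another give about 0.81 bits.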

Files: vajra-stats/src/entropy.rs

The Renyi Spectrum

Shannon entropy is one point on a continuous family parameterized by alpha:

H_alpha(X) = (1 / (1 - alpha)) * log2(sum p(x)^alpha)

alpha	Name	What It Measures
0	Hartley	`log2(support size)` — how many distinct values exist
1	Shannon	average surprise (limit as alpha approaches 1)
2	Collision	`-log2(sum p^2)`, the negative log of the probability that two random draws match
infinity	Min-entropy	`-log2(max p)` — worst-case unpredictability

Why a spectrum? A single entropy number hides the shape of the distribution. The Renyi spectrum reveals it:

High Shannon, low min-entropy: long tail with one dominant value
All orders equal: near-uniform distribution
Large divergence (H0 - H_inf): heavy concentration with many rare values

Security application: Min-entropy is the correct measure for cryptographic key strength — not Shannon. A key with high Shannon but low min-entropy has a predictable most-likely value.

Spectrum divergence (H0 - H_inf) is itself a signal: it quantifies how far the distribution is from uniform. Zero divergence = uniform. High divergence = concentrated.

Complexity: O(n), same as Shannon. Computed from the same frequency counts.
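The four named orders can share one function. A minimal sketch over a probability vector, treating alpha = 1 and alpha = infinity as their limiting forms:

```rust
/// Renyi entropy of order `alpha` in bits for a probability vector `p`.
pub fn renyi_entropy(p: &[f64], alpha: f64) -> f64 {
    let support = p.iter().filter(|&&x| x > 0.0);
    if alpha == 0.0 {
        // Hartley: log2 of the support size.
        (support.count() as f64).log2()
    } else if (alpha - 1.0).abs() < 1e-12 {
        // Shannon: the limit as alpha approaches 1.
        -support.map(|&x| x * x.log2()).sum::<f64>()
    } else if alpha.is_infinite() {
        // Min-entropy: determined by the most likely value alone.
        -p.iter().cloned().fold(0.0_f64, f64::max).log2()
    } else {
        support.map(|&x| x.powf(alpha)).sum::<f64>().log2() / (1.0 - alpha)
    }
}
```

For a uniform distribution every order returns the same value. For a skewed one the values decrease as alpha grows, and H0 - H_inf gives the spectrum divergence.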

Files: vajra-stats/src/renyi.rs

Structural Complexity: Lempel-Ziv

Shannon entropy measures average information per symbol. It cannot distinguish:

Input	Shannon Entropy	LZ Complexity
Random UUIDs	High	High
`PROJ-001`, `PROJ-002`, …	High	Low
Repeated `"active"`	Low	Low

Lempel-Ziv complexity (LZ76) counts the distinct phrases needed to describe a sequence. The algorithm scans left to right, extending the current phrase until it is one that has not appeared in the preceding text, then starts a new phrase. The phrase count is normalized by sequence length:

Normalized C_LZ = phrase_count / (n / log2(n))

The entropy-complexity plane has four quadrants:

	Low LZ	High LZ
High entropy	Structured (patterned identifiers)	Random (UUIDs, hashes)
Low entropy	Constant (repeated values)	Anomalous (theoretically unlikely)

A field in the “structured” quadrant (high entropy, low complexity) is a generated identifier with a pattern. A field in the “random” quadrant is truly unpredictable. Shannon alone cannot tell them apart.

Complexity: O(n) single pass. No external dependencies.
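The parsing rule is easiest to see in a deliberately naive form. The sketch below follows the LZ76 definition directly with a brute-force substring search, so it is far slower than the linear-time pass described above. It is an illustration of the rule, not of Vajra's implementation.

```rust
/// True if `needle` occurs anywhere in `haystack`.
fn contains(haystack: &[u8], needle: &[u8]) -> bool {
    needle.is_empty() || haystack.windows(needle.len()).any(|w| w == needle)
}

/// LZ76 phrase count: each phrase is the shortest substring starting at the
/// current position that does not occur in the text before its last symbol.
/// Naive search, for illustration only.
pub fn lz76_phrase_count(s: &[u8]) -> usize {
    let n = s.len();
    let (mut i, mut count) = (0, 0);
    while i < n {
        let mut l = 1;
        // Extend while the candidate phrase has already been seen.
        while i + l <= n && contains(&s[..i + l - 1], &s[i..i + l]) {
            l += 1;
        }
        count += 1;
        i += l;
    }
    count
}

/// Normalized complexity: phrase_count / (n / log2(n)).
pub fn normalized_lz(s: &[u8]) -> f64 {
    let n = s.len() as f64;
    if n < 2.0 {
        return 0.0;
    }
    lz76_phrase_count(s) as f64 / (n / n.log2())
}
```

The textbook sequence `0001101001000101` parses into 6 phrases (0, 001, 10, 100, 1000, 101). A constant string of any length parses into 2.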

Files: vajra-stats/src/lz_complexity.rs

Relationships: Conditional Entropy and PMI

Conditional Entropy H(Y|X)

How much knowing X reduces uncertainty about Y:

H(Y|X) = -sum p(x,y) * log2(p(y|x))

H(Y|X) = 0: X completely determines Y (functional dependency)
H(Y|X) = H(Y): X tells you nothing about Y (independence)

Relationship strength normalizes this:

strength = 1 - H(Y|X) / H(Y)

Clamped to [0, 1]. Zero = independent. One = deterministic.
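A minimal sketch computing H(Y), H(Y|X), and the normalized strength from observed pairs. It uses the chain rule H(Y|X) = H(X,Y) - H(X), which is equivalent to the summation above. The function names and the tuple return shape are illustrative.

```rust
use std::collections::BTreeMap;

/// Entropy in bits from a set of counts that sum to `total`.
fn entropy_of_counts<'a>(counts: impl Iterator<Item = &'a u64>, total: f64) -> f64 {
    counts
        .map(|&c| {
            let p = c as f64 / total;
            -p * p.log2()
        })
        .sum()
}

/// Returns (H(Y), H(Y|X), strength) for observed (x, y) pairs.
pub fn relationship(pairs: &[(&str, &str)]) -> (f64, f64, f64) {
    let n = pairs.len() as f64;
    let mut joint: BTreeMap<(&str, &str), u64> = BTreeMap::new();
    let mut x_counts: BTreeMap<&str, u64> = BTreeMap::new();
    let mut y_counts: BTreeMap<&str, u64> = BTreeMap::new();
    for &(x, y) in pairs {
        *joint.entry((x, y)).or_insert(0) += 1;
        *x_counts.entry(x).or_insert(0) += 1;
        *y_counts.entry(y).or_insert(0) += 1;
    }
    let h_y = entropy_of_counts(y_counts.values(), n);
    let h_x = entropy_of_counts(x_counts.values(), n);
    let h_xy = entropy_of_counts(joint.values(), n);
    // Chain rule; clamp tiny negative values caused by rounding.
    let h_y_given_x = (h_xy - h_x).max(0.0);
    // Assumption: a constant Y (H(Y) = 0) is reported as strength 0.
    let strength = if h_y > 0.0 {
        (1.0 - h_y_given_x / h_y).clamp(0.0, 1.0)
    } else {
        0.0
    };
    (h_y, h_y_given_x, strength)
}
```

When every city maps to exactly one country, strength is 1. When the two fields vary independently, strength is 0.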

Pointwise Mutual Information

Measures co-occurrence strength between specific value pairs:

PMI(x, y) = log2(P(x,y) / (P(x) * P(y)))

Positive = co-occur more than chance. Negative = avoid each other. Zero = independent.

Total Correlation

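Pairwise measures miss higher-order structure. Fields can look independent in every pair and still be jointly constrained (the classic case is Z = X XOR Y for two independent fair bits), and redundancy shared across a group such as city + state + zip is spread over more than one pair. Total correlation captures this: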

TC(X1,...,Xn) = sum H(Xi) - H(X1,...,Xn)

TC = 0: all fields are independent
High TC: the schema has deep internal structure
TC / sum H(Xi): normalized to [0, 1]

Total correlation answers: “how much redundancy exists across all fields simultaneously?” This is the gap between pairwise analysis and true multivariate dependency.
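A minimal sketch over rows of small discrete values, using exact joint counts instead of the binned estimate. The XOR table is the standard case where every pair of fields is independent yet the three together carry one full bit of redundancy.

```rust
use std::collections::BTreeMap;

/// Entropy in bits of the empirical distribution of `items`.
fn entropy<K: Ord>(items: impl Iterator<Item = K>) -> f64 {
    let mut counts: BTreeMap<K, u64> = BTreeMap::new();
    let mut n = 0u64;
    for k in items {
        *counts.entry(k).or_insert(0) += 1;
        n += 1;
    }
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n as f64;
            -p * p.log2()
        })
        .sum()
}

/// Total correlation in bits: sum of marginal entropies minus joint entropy.
/// Assumes every row has the same number of fields.
pub fn total_correlation(rows: &[Vec<u8>]) -> f64 {
    let fields = rows.first().map_or(0, |r| r.len());
    let marginals: f64 = (0..fields)
        .map(|f| entropy(rows.iter().map(move |r| r[f])))
        .sum();
    marginals - entropy(rows.iter().cloned())
}
```

For the four rows of the XOR truth table, each field has 1 bit of entropy and the joint has 2, so TC = 3 - 2 = 1 bit.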

Complexity: O(n) for marginals. Joint entropy estimated via binning, bounded to 8-field subsets for tractability.

Files: vajra-stats/src/relationships.rs, vajra-stats/src/total_correlation.rs

Distributional Drift: JSD and Wasserstein

Jensen-Shannon Divergence

Symmetric, bounded, and a proper metric (via square root):

JSD(P, Q) = 0.5 * KL(P || M) + 0.5 * KL(Q || M)

where M = 0.5 * (P + Q) and KL is Kullback-Leibler divergence.

JSD in [0, 1] with log base 2
sqrt(JSD) satisfies the triangle inequality (Endres & Schindelin 2003)
Used for categorical distribution drift

1D Wasserstein Distance

For numeric distributions, measures the “earth mover’s distance”:

W1 = integral |CDF_a(x) - CDF_b(x)| dx

JSD tells you the distributions changed. Wasserstein tells you by how much — it captures the magnitude of the shift, not just its existence.

When to use which:

Data Type	Metric	Why
Categorical (strings, enums)	JSD	Probability mass redistribution
Numeric (amounts, counts)	Wasserstein	Shift magnitude in original units

Files: vajra-drift/src/jsd.rs, vajra-drift/src/wasserstein.rs

Directed Information Flow: Transfer Entropy

Transfer entropy measures how much knowing the past of X reduces uncertainty about Y’s future, beyond what Y’s own past already tells you:

TE(X->Y) = H(Y_t | Y_{t-1}^k) - H(Y_t | Y_{t-1}^k, X_{t-1}^l)

Key properties:

Directional: in general TE(X->Y) != TE(Y->X), which indicates the direction of predictive information flow
Non-negative: information can only help prediction
Granger causality generalized: captures nonlinear dependencies

This transforms cascade detection from temporal pattern matching into rigorous directed information flow quantification. Instead of “A happened before B,” transfer entropy says “A’s past carries 2.3 bits of information about B’s future that B’s own history doesn’t contain.”

Net information flow = TE(X->Y) - TE(Y->X). Positive means X drives Y. Negative means Y drives X.

Complexity: O(n * k) where k is history depth. Deterministic with fixed binning.

Files: vajra-stats/src/transfer_entropy.rs

Universal Similarity: NCD

Normalized Compression Distance approximates the normalized information distance — provably the most general similarity metric:

NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))

where C is a real compressor (zstd at fixed level 3).

Why NCD is more general than feature-based similarity: MinHash captures set overlap. SimHash captures angular proximity. Both require choosing features. NCD captures whatever regularities the compressor can exploit (structure, patterns, naming conventions, content) with zero feature engineering. Only in the limit of an ideal compressor does it capture all computable regularities.

Two documents that share structural patterns but zero literal values will have low NCD. Two documents with random shared tokens but different structure will have high NCD.

NCD(x, x) approaches 0 (self-similarity)
NCD(x, random) approaches 1 (dissimilarity)
Symmetric: NCD(x, y) = NCD(y, x)
Deterministic given fixed compressor and level

Complexity: O(n) per compression of n bytes. O(m^2) compressions for an all-pairs matrix over m documents, with each C(x) computed once and cached.

Files: vajra-fingerprint/src/ncd.rs

Anomaly Scoring

Self-Information (Surprisal)

The rarity of a single observation:

I(x) = -log2(p(x))

A value seen once in 10,000 observations carries 13.3 bits of rarity. This is the information-theoretic foundation of rare value detection.

MAD-Based Outlier Detection

Median Absolute Deviation with modified z-scores:

z_MAD = 0.6745 * (x - median) / MAD

Values with |z_MAD| > 3.5 are flagged. MAD has a 50% breakdown point — half the data can be corrupted before it gives misleading results.

Benford’s Law

Leading digit distribution for numeric fields:

P(d) = log10(1 + 1/d)

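Conformity tested via chi-squared and Nigrini's mean absolute deviation score (also abbreviated MAD, but unrelated to the median absolute deviation above). Non-conformity (mean absolute deviation > 0.015) signals potentially fabricated or unusual numeric data.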

The Six-Dimensional Scoring Model

Every observation is scored across six information-theoretic dimensions:

Dimension	Source	Range
rarity	Self-information, cardinality	[0, 1]
instability	Type distribution: 1 - (dominant/total)	[0, 1]
entropy_signal	Normalized Shannon entropy	[0, 1]
structural_coverage	Null rate, enum-like patterns	[0, 1]
anomaly_strength	MAD z-scores, rarity magnitude	[0, 1]
concern_relevance	Domain-specific importance	[0, 1]

The composite score is a weighted sum:

score = sum weight_i * dimension_i

Weights depend on the concern profile:

Profile	Rarity	Instability	Entropy	Coverage	Anomaly	Concern
Engineer	0.15	0.15	0.15	0.15	0.15	0.15
Staff	0.10	0.10	0.10	0.25	0.30	0.15
Fraud	0.25	0.10	0.10	0.10	0.35	0.10
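The weighted sum can be sketched directly from the table. The `Dimensions` struct and the constant names are hypothetical. The weights are copied from the rows above in column order. As listed, the Staff and Fraud rows sum to 1.00 and the Engineer row sums to 0.90, so this sketch does not assume normalized weights.

```rust
/// The six scoring dimensions, each in [0, 1]. Hypothetical type.
#[derive(Clone, Copy)]
pub struct Dimensions {
    pub rarity: f64,
    pub instability: f64,
    pub entropy_signal: f64,
    pub structural_coverage: f64,
    pub anomaly_strength: f64,
    pub concern_relevance: f64,
}

impl Dimensions {
    fn as_array(&self) -> [f64; 6] {
        [
            self.rarity,
            self.instability,
            self.entropy_signal,
            self.structural_coverage,
            self.anomaly_strength,
            self.concern_relevance,
        ]
    }
}

// Weights in table column order: rarity, instability, entropy, coverage, anomaly, concern.
pub const ENGINEER: [f64; 6] = [0.15, 0.15, 0.15, 0.15, 0.15, 0.15];
pub const STAFF: [f64; 6] = [0.10, 0.10, 0.10, 0.25, 0.30, 0.15];
pub const FRAUD: [f64; 6] = [0.25, 0.10, 0.10, 0.10, 0.35, 0.10];

/// score = sum of weight_i * dimension_i
pub fn composite_score(dims: &Dimensions, weights: &[f64; 6]) -> f64 {
    dims.as_array().iter().zip(weights).map(|(d, w)| d * w).sum()
}
```

The same observation scores differently per profile: a pure anomaly signal contributes 0.30 under Staff and 0.35 under Fraud.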

The Integration Pipeline

JSON Document
  |
  v
[Stats Analyzer] --- entropy, Renyi spectrum, LZ complexity, cardinality, rarity
  |
  v
[Anomaly Analyzer] --- rare values (surprisal), type instabilities, MAD outliers
  |
  v
[Relationship Discovery] --- conditional entropy, PMI, total correlation
  |
  v
[Drift Analyzer] --- JSD (categorical), Wasserstein (numeric), severity
  |
  v
[Cascade Analyzer] --- transfer entropy, directed information flow
  |
  v
[Feature Store] --- PathFeatures with all information-theoretic signals
  |
  v
[Essence Builder] --- ScoredObservations across 6 dimensions
  |
  v
[Profile Scorer] --- Weighted composite score
  |
  v
[EssenceData] --- Prioritized findings for humans and AI

Every anomaly signal, every drift measurement, every relationship discovery, and every cascade detection is rooted in information theory. The entire system is fundamentally an information-theoretic lens on structured data.

Streaming

Vajra handles JSON of any size. A 50 KB medical claim and a 10 GB event log enter the same pipeline. The streaming engine is what makes this possible.

Two Modes

DOM Mode

For documents that fit in memory. The parser builds a full in-memory tree with random access to every node. All analysis passes can access any part of the document at any time.

Parser: simd-json at 2+ GB/s.
Memory: O(n) where n = document size.
Activates: By default, for documents below the streaming threshold (default 100 MB, configurable).

Streaming Mode

For documents that exceed available memory. SAX-style event parsing with bounded memory. The parser emits events (start-object, key, value, end-object, start-array, end-array) and the analyzers update their accumulators incrementally.

Memory: O(p + s) where p = distinct paths and s = sum of sketch sizes. For typical JSON with < 1,000 distinct paths: < 10 MB regardless of document size.
Activates: Automatically when document size exceeds the streaming threshold. Force with --streaming.

The Two-Pass Hybrid Strategy

Streaming mode does not mean single-pass-only. Vajra uses a hybrid strategy that balances memory efficiency with analysis depth.

Pass 1: Profile the Document

A single streaming pass collects:

Path extraction. Every wildcard path discovered and registered in the path trie.
Frequency counting. Value frequencies per path via Count-Min Sketch (conservative update).
Top-k identification. Most frequent values per path via Space-Saving algorithm.
Type profiling. Type distribution per path tracked via simple counters.
Numeric sketches. DDSketch accumulators for every numeric path — percentiles, median, MAD.
Null and missingness tracking. Per-path counters for null, absent, empty.
Entropy estimation. Computed from CMS frequency estimates when exact counting exceeds memory.
Fingerprint accumulation. Merkle hashes built incrementally as subtrees complete.

After Pass 1, Vajra has a complete statistical profile of the document without having held more than one event in memory at a time.

Pass 2 (Optional): Selective DOM for High-Signal Subtrees

If the command requires rich analysis that streaming cannot provide (motif analysis, essence generation with deep context), Vajra can selectively parse high-signal subtrees into DOM.

The decision is based on Pass 1 results:

Subtrees with high anomaly density.
Subtrees with high-entropy fields that need value-level analysis.
The representative instance of the dominant motif.

Pass 2 is optional. Commands like stats and fingerprint need only Pass 1. Commands like essence may invoke Pass 2 for targeted depth.

Sketch Data Structures in Streaming Mode

DDSketch

Role: Numeric distribution analysis — percentiles, median, MAD.

One DDSketch per numeric path. Each sketch maintains O(log(max/min) / log(1 + alpha)) buckets. With alpha = 0.01 and financial data spanning $0.01 to $1,000,000, this is roughly 900 buckets (ln(10^8) / ln(1.01 / 0.99) ≈ 921), a few KB of memory per path.

Key property: Mergeability. When processing a batch in parallel, per-file DDSketch instances merge into a global sketch with zero accuracy loss.

// Streaming numeric stats: feed parser events into the accumulator one at a time.
let mut stats = StreamingStatsAccumulator::default();
for event in parser {
    stats.on_event(&event?)?;
}
let result = stats.finalize()?;
// result.numeric_stats contains DDSketch-derived percentiles

Count-Min Sketch (CMS)

Role: Frequency estimation for values, paths, and key names when cardinality exceeds configurable thresholds.

Default configuration: width = 2,718, depth = 5. Total memory: ~54 KB per sketch. Error guarantee: estimated count within 0.1% of total count with 99% probability.

Activation: Exact counting is preferred when it fits in memory. CMS activates as a fallback when distinct value count per path exceeds the threshold (default: 10,000 distinct values).
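The default configuration follows from the standard Count-Min relations epsilon ≈ e / width and delta ≈ e^(-depth). A minimal sketch with conservative update. The 4-byte counters and the seeded FNV-1a row hash are assumptions made so the example is self-contained. The type is not Vajra's.

```rust
/// Seeded FNV-1a hash; each sketch row uses a different seed. Illustrative only.
fn row_hash(item: &str, seed: u64) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325 ^ seed.wrapping_mul(0x9E3779B97F4A7C15);
    for b in item.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h ^ (h >> 32)
}

pub struct CountMinSketch {
    width: usize,
    depth: usize,
    table: Vec<u32>, // depth rows of width counters (assumed 4 bytes each)
}

impl CountMinSketch {
    pub fn new(width: usize, depth: usize) -> Self {
        Self { width, depth, table: vec![0; width * depth] }
    }
    /// Additive error as a fraction of the total count.
    pub fn epsilon(&self) -> f64 { std::f64::consts::E / self.width as f64 }
    /// Probability that an estimate exceeds the error bound.
    pub fn delta(&self) -> f64 { (-(self.depth as f64)).exp() }
    pub fn memory_bytes(&self) -> usize { self.table.len() * std::mem::size_of::<u32>() }

    fn slot(&self, row: usize, item: &str) -> usize {
        row * self.width + (row_hash(item, row as u64) % self.width as u64) as usize
    }
    /// Estimated count: the minimum over the item's counters (never an underestimate).
    pub fn estimate(&self, item: &str) -> u32 {
        (0..self.depth).map(|r| self.table[self.slot(r, item)]).min().unwrap_or(0)
    }
    /// Conservative update: raise each counter only as far as estimate + 1.
    pub fn add(&mut self, item: &str) {
        let target = self.estimate(item) + 1;
        for r in 0..self.depth {
            let i = self.slot(r, item);
            self.table[i] = self.table[i].max(target);
        }
    }
}
```

With width 2,718 and depth 5 this gives 13,590 counters (about 54 KB at 4 bytes each), epsilon of about 0.1%, and a success probability of about 99.3%.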

Space-Saving

Role: Identifying top-k most frequent elements without storing all elements.

Maintains exactly k counters (default k = 100). Guaranteed to include every element whose true frequency exceeds N/k. Memory: k entries, a few KB.
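A minimal sketch of the replacement rule. It scans for the minimum counter on eviction, which costs O(k). Production implementations typically use a stream-summary structure instead. The type and method names are hypothetical.

```rust
use std::collections::BTreeMap;

/// Space-Saving top-k counter. Illustrative, with O(k) eviction.
pub struct SpaceSaving {
    k: usize,
    counters: BTreeMap<String, u64>,
}

impl SpaceSaving {
    pub fn new(k: usize) -> Self {
        assert!(k > 0, "k must be positive");
        Self { k, counters: BTreeMap::new() }
    }

    pub fn add(&mut self, item: &str) {
        if let Some(c) = self.counters.get_mut(item) {
            *c += 1;
            return;
        }
        if self.counters.len() < self.k {
            self.counters.insert(item.to_string(), 1);
            return;
        }
        // Table full: replace the minimum counter and inherit its count + 1.
        // Ties resolve to the smallest key, keeping the result deterministic.
        let (victim, min_count) = self
            .counters
            .iter()
            .min_by_key(|(_, c)| **c)
            .map(|(k, c)| (k.clone(), *c))
            .expect("k > 0, so the table is non-empty here");
        self.counters.remove(&victim);
        self.counters.insert(item.to_string(), min_count + 1);
    }

    /// Tracked items, highest count first, ties by key.
    pub fn top(&self) -> Vec<(String, u64)> {
        let mut v: Vec<_> = self.counters.iter().map(|(k, c)| (k.clone(), *c)).collect();
        v.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
        v
    }
}
```

Reported counts can overestimate (a newcomer inherits the evicted count), but a genuinely frequent element is never dropped.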

Memory Budget

The total streaming memory budget is bounded:

Component	Memory
Path trie	O(p) where p = distinct wildcard paths
DDSketch (per numeric path)	~3 KB per path
CMS (per high-cardinality path)	~54 KB per path
Space-Saving (per path)	~4 KB per path (k=100)
Type counters (per path)	~48 bytes per path
Null/absent counters (per path)	~32 bytes per path
Fingerprint accumulator	O(current depth)

For a document with 500 distinct paths, 100 numeric paths, and 50 high-cardinality paths:

Path trie:           ~100 KB
DDSketch:            ~300 KB  (100 paths x 3 KB)
CMS:                 ~2.7 MB  (50 paths x 54 KB)
Space-Saving:        ~2.0 MB  (500 paths x 4 KB)
Type/null counters:  ~40 KB   (500 paths x 80 bytes)
Fingerprint:         ~10 KB
---
Total:               ~5.2 MB

This budget holds regardless of whether the document is 100 MB or 100 GB. The streaming guarantee: bounded memory independent of input size.

DOM vs. Streaming: What Changes

Capability	DOM Mode	Streaming Mode
Parsing speed	2+ GB/s (simd-json)	~500 MB/s (event parser)
Random access	Full	None (sequential events)
Exact frequency counts	Yes	Only when cardinality fits in memory; CMS otherwise
Exact percentiles	Yes (via sorting)	Approximate (DDSketch, 1% relative error)
Exact entropy	Yes	Approximate (from CMS estimates)
Motif detection	Full (Merkle subtree hashing)	Partial (incremental, no lookback)
Relationship discovery	Full (random access to value pairs)	Partial (co-occurrence counters)
Essence quality	Full	Slightly reduced (no selective subtree re-parse in Pass 1)

Every streaming approximation carries formal error bounds. The output explicitly labels which statistics are exact and which are approximate.

When Each Mode Activates

Document size < streaming_threshold (default 100 MB)
  -> DOM mode

Document size >= streaming_threshold
  -> Streaming mode (automatic)

--streaming flag present
  -> Streaming mode (forced, regardless of size)

The threshold is configurable in the TOML config:

[parsing]
streaming_threshold = 104_857_600  # 100 MB

The StreamAnalyzer Trait

Any analyzer that implements StreamAnalyzer can participate in streaming mode:

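pub trait StreamAnalyzer {
    /// Per-stream state, created fresh via Default for each document.
    type Accumulator: Default;
    /// Final result produced when the stream ends.
    type Output;

    /// Called once per parser event, in document order.
    fn on_event(&self, event: &JsonEvent, acc: &mut Self::Accumulator) -> Result<()>;
    /// Consumes the accumulator and produces the result.
    fn finalize(&self, acc: Self::Accumulator) -> Result<Self::Output>;
}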

The accumulator holds all state. Events arrive one at a time. finalize produces the result when the stream ends.

This trait is the key to extensibility. Custom analyzers that implement it automatically work in both DOM and streaming modes — DOM mode simply feeds all events from the pre-parsed tree.
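A minimal custom analyzer, sketched against stand-in types. The `JsonEvent` enum below and the `String` error type are assumptions made so the example compiles on its own. Vajra's real event and error types may differ. `DepthAnalyzer` is a hypothetical analyzer that tracks maximum nesting depth and counts scalar values.

```rust
/// Stand-in event type (assumed shape, not Vajra's definition).
pub enum JsonEvent {
    StartObject,
    EndObject,
    StartArray,
    EndArray,
    Key(String),
    Value(String),
}

/// The trait from above, restated with a plain String error for self-containment.
pub trait StreamAnalyzer {
    type Accumulator: Default;
    type Output;
    fn on_event(&self, event: &JsonEvent, acc: &mut Self::Accumulator) -> Result<(), String>;
    fn finalize(&self, acc: Self::Accumulator) -> Result<Self::Output, String>;
}

pub struct DepthAnalyzer;

#[derive(Default)]
pub struct DepthAcc {
    depth: usize,
    max_depth: usize,
    values: u64,
}

impl StreamAnalyzer for DepthAnalyzer {
    type Accumulator = DepthAcc;
    type Output = (usize, u64); // (max nesting depth, scalar value count)

    fn on_event(&self, event: &JsonEvent, acc: &mut DepthAcc) -> Result<(), String> {
        match event {
            JsonEvent::StartObject | JsonEvent::StartArray => {
                acc.depth += 1;
                acc.max_depth = acc.max_depth.max(acc.depth);
            }
            JsonEvent::EndObject | JsonEvent::EndArray => {
                acc.depth = acc.depth.checked_sub(1).ok_or("unbalanced end event")?;
            }
            JsonEvent::Value(_) => acc.values += 1,
            JsonEvent::Key(_) => {}
        }
        Ok(())
    }

    fn finalize(&self, acc: DepthAcc) -> Result<(usize, u64), String> {
        if acc.depth != 0 {
            return Err("stream ended inside an open container".to_string());
        }
        Ok((acc.max_depth, acc.values))
    }
}
```

The accumulator is three integers regardless of document size, which is the property that lets an analyzer like this run in streaming mode.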

Differential Testing: DOM vs. Streaming

For every document in the test corpus, Vajra runs both modes and asserts:

CMS frequency estimates are within proven error bounds of exact counts
DDSketch quantile estimates are within relative accuracy of exact quantiles
Path sets are identical
Fingerprints are identical
Type distributions are identical

This ensures streaming mode is not a second-class citizen. It is a formally bounded approximation of DOM mode, not a degraded fallback.

Determinism

Determinism in Vajra is not a feature. It is a structural guarantee.

Given identical input bytes, identical configuration, and identical Vajra version, the output is identical — byte for byte. Fingerprints, scores, orderings, essence text, anomaly rankings. Every run. Every platform. Every time.

This is what makes Vajra trustworthy in CI pipelines, audits, compliance workflows, and AI systems that depend on stable preprocessing.

The Guarantee

Identical:

Input bytes
Configuration (profile, flags, config file)
Vajra version

Produces identical:

Fingerprints
Scores (to floating-point bit-level)
Orderings
Essence text (byte-for-byte)
Anomaly rankings

Sources of Nondeterminism and How Each Is Eliminated

The HashMap Rule

Problem: Rust’s HashMap uses a random seed for its hash function (SipHash with random key by default). Iteration order is nondeterministic. Any code that iterates a HashMap and includes the iteration order in output produces nondeterministic results.

Mitigation: BTreeMap is used for all externally-visible orderings. HashMap is permitted only for internal scratch computations where iteration order is never observed in output.

This is enforced by code review and tested by the determinism test suite. If a HashMap iteration order leaks into output, the 10-run determinism test catches it immediately.
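The rule in miniature: a HashMap may be used as scratch space, but anything rendered goes through a sorted view first. A sketch with a hypothetical `render_counts` helper:

```rust
use std::collections::{BTreeMap, HashMap};

/// Renders path counts in sorted key order, whatever order the scratch
/// HashMap happens to iterate in.
pub fn render_counts(scratch: &HashMap<String, u64>) -> String {
    // Collecting into a BTreeMap fixes the iteration order by key.
    let ordered: BTreeMap<&String, &u64> = scratch.iter().collect();
    ordered
        .iter()
        .map(|(k, v)| format!("{}={}", k, v))
        .collect::<Vec<_>>()
        .join("\n")
}
```

Iterating `scratch` directly would produce the same lines in an order that can change from run to run.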

The Thread Scheduling Rule

Problem: Rayon’s parallel batch processing schedules work across threads nondeterministically. If results are merged in arrival order, the output depends on thread scheduling.

Mitigation: Deterministic merge order. After parallel analysis, results are sorted by input identity (file path or record index) before merging. Parallel execution affects speed, never output.

use rayon::prelude::*;

// Parallel analysis
let mut results: Vec<_> = files.par_iter()
    .map(|f| analyze(f))
    .collect();

// Deterministic merge: sorted by input identity, not arrival order
results.sort_by(|a, b| a.input_path.cmp(&b.input_path));

The Floating-Point Accumulation Rule

Problem: Floating-point addition is not associative. (a + b) + c can differ from a + (b + c) at the bit level. If summation order varies (due to thread scheduling, hash map iteration, or unstable sorting), floating-point results drift.

Mitigation: Fixed traversal order. All traversals are DFS, left-to-right. All summations occur in deterministic order defined by the path trie’s BTreeMap-based key ordering. The traversal order is a function of the input alone.
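The problem and the fix fit in a few lines. The sketch below sums values in key order so the result is a function of the data alone. `ordered_sum` is a hypothetical helper, not a Vajra function.

```rust
/// Sums values in sorted-key order, so the result does not depend on the
/// order in which the items arrived.
pub fn ordered_sum(mut items: Vec<(String, f64)>) -> f64 {
    items.sort_by(|a, b| a.0.cmp(&b.0));
    items.iter().map(|(_, v)| *v).sum()
}
```

For example, `(0.1 + 0.2) + 0.3` and `0.1 + (0.2 + 0.3)` differ in the last bit as f64 values, while `ordered_sum` returns bit-identical results for any arrival order of the same keyed items.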

The Seed Rule

Problem: MinHash and SimHash use hash functions that can be seeded. Different seeds produce different signatures (and different similarity estimates, cluster assignments, etc.).

Mitigation: Default seed is 0. The --seed flag provides explicit control.

vajra cluster batch/*.json              # seed = 0 (default, deterministic)
vajra cluster batch/*.json --seed 42    # seed = 42 (different but still deterministic)

Same seed + same input = same output. Different seed = potentially different output. Both are deterministic within their seed.

The ryu Rule

Problem: Float-to-string conversion is a common source of output drift. Different formatting routines, and different versions of the same routine, can print the same f64 with different digits or notation, so output bytes end up tied to whichever formatter happens to be in use.

Mitigation: All float-to-string conversion uses the ryu crate, an implementation of Ulf Adams' algorithm (2018) for the shortest round-trip-safe decimal representation. ryu is deterministic and platform-independent. The same f64 bit pattern produces the same string on every platform Vajra supports.

// Not this:
let s = format!("{}", value);    // output defined by the standard library's formatter

// This:
let mut buf = ryu::Buffer::new();
let s = buf.format(value);       // shortest round-trip form, pinned by the ryu crate

The Canonicalization Rule

Problem: JSON objects are unordered by specification. {"a": 1, "b": 2} and {"b": 2, "a": 1} are semantically identical but textually different. Any operation that depends on key order (hashing, fingerprinting, rendering) must first impose a deterministic order.

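Mitigation: Canonicalization based on RFC 8785 (JSON Canonicalization Scheme). Keys sorted by UTF-16 code unit sequence (the RFC's specified ordering). Numbers formatted deterministically. Strings are additionally NFC-normalized, a Vajra-specific step beyond RFC 8785, which leaves string content untouched. Applied before any hashing, fingerprinting, or comparison operation.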
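UTF-16 code unit order is not the same as the byte order Rust uses when comparing strings. The two diverge for characters outside the Basic Multilingual Plane. A minimal sketch of the key sort, with a hypothetical function name:

```rust
/// Sorts object keys by their UTF-16 code unit sequences, the ordering
/// RFC 8785 specifies for object members.
pub fn sort_keys_utf16(keys: &mut [String]) {
    keys.sort_by(|a, b| a.encode_utf16().cmp(b.encode_utf16()));
}
```

A key containing U+1F600 (encoded in UTF-16 as the surrogate pair D83D DE00) sorts before a key containing U+FB33 under this ordering, while a plain `keys.sort()` puts it after.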

The Unicode Rule

Problem: The same visual string can have multiple Unicode representations. “e with acute accent” can be a single codepoint (U+00E9) or a base character plus combining mark (U+0065 U+0301). If these are treated as different strings, frequency counts, entropy, and fingerprints diverge.

Mitigation: Unicode NFC normalization (UAX #15) applied during canonicalization. All string comparisons, frequency counting, and hashing operate on NFC-normalized forms.

Verifying Determinism: The 10-Run Test

The determinism test suite runs every command on every document in the test corpus:

Run Vajra N times (N >= 10) with identical configuration.
Assert byte-identical output across all runs.
Run with --seed 0 and --seed 42 — outputs may differ between seeds.
Run each seed N times — assert identical within-seed output.

# Manual verification
for i in $(seq 1 10); do
  vajra essence claim.json --profile engineer --format json > "run_$i.json"
done

# All files must be identical
md5sum run_*.json
# Every line shows the same hash

If any two runs produce different output, the determinism contract is broken. This test runs in CI on every commit.

What Determinism Costs

The determinism guarantee imposes real engineering costs:

Constraint	Cost	Payoff
BTreeMap everywhere	~10-20% slower than HashMap for insertion-heavy code	Deterministic iteration order
Fixed traversal order	Cannot parallelize within-document traversal for speed	Deterministic accumulation
ryu for float formatting	Additional dependency	Platform-independent output
Seeded PRNG for MinHash	Cannot use hardware RNG for “better” randomness	Reproducible signatures
Deterministic merge order	Sorting step after parallel batch processing	Reproducible batch results

Every cost is paid gladly. Determinism is not negotiable. Speed optimizations that violate it are rejected.

What Determinism Does NOT Cover

Determinism applies to the mapping from (input, config, version) to output. It does not mean:

Different versions produce the same output. Algorithm changes, bug fixes, and threshold adjustments may change output between versions. The version is part of the contract.
Different configs produce the same output. Changing the profile, the seed, the budget, or any flag may change output. The config is part of the contract.
Streaming mode matches DOM mode exactly. Streaming mode uses approximate algorithms (DDSketch, CMS) that produce bounded approximations of DOM mode’s exact results. Both modes are internally deterministic. They may differ from each other within the documented error bounds.

For Library Users

The determinism guarantee extends to the Rust library API. If you call the same analyzer with the same Document and the same configuration, you get the same result.

use vajra_core::Document;
use vajra_stats::StatsAnalyzer;
use vajra_types::Analyzer;

let doc = Document::parse_file("claim.json")?;

// Two independent runs over the same parsed document.
let stats1 = StatsAnalyzer.analyze(&doc)?;
let stats2 = StatsAnalyzer.analyze(&doc)?;

// The serialized results are byte-identical.
assert_eq!(
    serde_json::to_string(&stats1)?,
    serde_json::to_string(&stats2)?
);

Domain Plugins

Core Vajra is domain-agnostic. It analyzes structure, statistics, and deviation from norms — without knowing what the data represents. Domain intelligence enters through plugins that extend the engine without contaminating it.

A plugin does not change what Vajra computes. It enriches what Vajra knows.

The Plugin Architecture

Plugins contribute four kinds of extensions:

Type recognizers — pattern matchers that identify domain-specific value types (ICD-10 codes, NPIs, SWIFT codes)
Concern profiles — custom scoring weight vectors and rendering templates
Relationship hints — domain knowledge about which fields form logical groups
Custom renderers — domain-specific essence rendering templates

Plugins cannot modify the core analysis pipeline, access the filesystem beyond their own configuration, make network calls, or mutate the input document. They are additive. They are isolated.

The VajraPlugin Trait

pub trait VajraPlugin: Send + Sync {
    /// Plugin identifier.
    fn name(&self) -> &str;

    /// Plugin version string.
    fn version(&self) -> &str;

    /// Additional type recognizers beyond the core DFA bank.
    /// These run alongside the core recognizers during semantic lifting.
    fn type_recognizers(&self) -> Vec<Box<dyn TypeRecognizer>> {
        vec![]
    }

    /// Additional concern profile definitions.
    /// These appear alongside built-in profiles in `vajra profiles`.
    fn concern_profiles(&self) -> Vec<Box<dyn ConcernProfile>> {
        vec![]
    }

    /// Field relationship heuristics.
    /// Example: "code + description + system = coded concept"
    fn relationship_hints(&self) -> Vec<RelationshipHint> {
        vec![]
    }

    /// Custom rendering templates for essence output.
    fn renderers(&self) -> Vec<Box<dyn EssenceRenderer>> {
        vec![]
    }
}

Every method has a default implementation that returns empty. A plugin can implement only the capabilities it needs.

TypeRecognizer

Type recognizers extend Vajra’s semantic lifting layer. They match raw string values against domain-specific patterns.

pub trait TypeRecognizer: Send + Sync {
    /// The name of the recognized type (e.g., "ICD-10-CM", "CPT", "NPI").
    fn type_name(&self) -> &str;

    /// Returns true if the value matches this type's pattern.
    fn matches(&self, value: &str) -> bool;

    /// Optional confidence level for the match.
    fn confidence(&self, value: &str) -> f64 {
        if self.matches(value) { 1.0 } else { 0.0 }
    }
}

Type recognizers run during Layer 4 (Semantic Lifting) of the engine pipeline. They are evaluated after the core DFA bank, allowing domain-specific patterns to augment — not override — the core type inference.
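As an illustration of the trait, here is a minimal sketch of a recognizer for NPIs. It is not the code shipped in vajra-domain-med: the trait is copied locally so the sketch compiles on its own, and NpiRecognizer is a hypothetical name. The check-digit rule used is the standard NPI one, a Luhn checksum computed as if the 10 digits were prefixed with 80840, which is equivalent to adding 24 to the Luhn sum.

```rust
/// Local copy of the trait from this chapter so the sketch is self-contained.
pub trait TypeRecognizer: Send + Sync {
    fn type_name(&self) -> &str;
    fn matches(&self, value: &str) -> bool;
    fn confidence(&self, value: &str) -> f64 {
        if self.matches(value) { 1.0 } else { 0.0 }
    }
}

/// Hypothetical recognizer: 10 ASCII digits that pass the NPI Luhn check.
pub struct NpiRecognizer;

impl TypeRecognizer for NpiRecognizer {
    fn type_name(&self) -> &str { "NPI" }

    fn matches(&self, value: &str) -> bool {
        let bytes = value.as_bytes();
        if bytes.len() != 10 || !bytes.iter().all(u8::is_ascii_digit) {
            return false;
        }
        // Luhn sum over the 10 digits, plus 24 for the implicit "80840" prefix.
        let mut sum = 24u32;
        for (i, b) in bytes.iter().rev().enumerate() {
            let d = u32::from(b - b'0');
            // Every second digit, counting from the right, is doubled.
            sum += if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            };
        }
        sum % 10 == 0
    }
}
```

Working on bytes rather than string slices keeps the recognizer panic-free on arbitrary input, which matters because recognizers see every string value in a document.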

RelationshipHint

Relationship hints tell Vajra that certain field combinations form logical groups:

pub struct RelationshipHint {
    /// Fields that form a logical group when co-located.
    pub field_patterns: Vec<String>,

    /// Name for this relationship.
    pub name: String,

    /// Description of what the group represents.
    pub description: String,
}

Example from the medical plugin:

RelationshipHint {
    field_patterns: vec![
        "code".to_string(),
        "system".to_string(),
        "display".to_string(),
    ],
    name: "coded-concept".to_string(),
    description: "A coded value with its coding system and human-readable display".to_string(),
}

When Vajra finds code, system, and display as sibling keys in an object, the medical plugin’s relationship hint identifies this as a coded concept — not three independent strings.
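The sibling-key check described above could look roughly like the sketch below. The matching semantics are an assumption: the documentation does not say whether field_patterns are exact key names or patterns, so this version treats them as exact names. hint_applies and coded_concept_hint are hypothetical helpers, and the struct is copied locally so the code stands alone.

```rust
use std::collections::BTreeSet;

/// Local copy of the struct from this chapter.
pub struct RelationshipHint {
    pub field_patterns: Vec<String>,
    pub name: String,
    pub description: String,
}

/// Hypothetical helper: true when every field named by the hint
/// appears among the sibling keys of one object.
pub fn hint_applies(hint: &RelationshipHint, sibling_keys: &BTreeSet<&str>) -> bool {
    !hint.field_patterns.is_empty()
        && hint
            .field_patterns
            .iter()
            .all(|field| sibling_keys.contains(field.as_str()))
}

/// The coded-concept hint from the medical plugin example.
pub fn coded_concept_hint() -> RelationshipHint {
    RelationshipHint {
        field_patterns: vec![
            "code".to_string(),
            "system".to_string(),
            "display".to_string(),
        ],
        name: "coded-concept".to_string(),
        description: "A coded value with its coding system and human-readable display"
            .to_string(),
    }
}
```

Extra sibling keys do not prevent a match in this sketch; an object with code, system, display, and version still counts as a coded concept.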

The Medical Plugin: vajra-domain-med

The medical plugin is the reference implementation. It demonstrates every plugin capability.

Type Recognizers

Recognized Type	Pattern	Example Values
ICD-10-CM	`[A-Z][0-9]{2}(\.[0-9A-Z]{1,4})?`	`E11.9`, `J44.1`, `M54.5`
ICD-10-PCS	`[0-9A-HJ-NP-Z]{7}`	`0SG00ZJ`
CPT	`[0-9]{5}` (with known range validation)	`99213`, `99214`, `27447`
HCPCS	`[A-V][0-9]{4}`	`J0129`, `G0438`
NDC	`[0-9]{4,5}-[0-9]{3,4}-[0-9]{1,2}`	`0069-0770-01`
NPI	`[0-9]{10}` (with Luhn check)	`1234567893`
Denial Reason	`(CO|PR|OA|PI|CR)-[0-9]{1,3}`	`CO-45`, `PR-1`, `OA-23`

Relationship Hints

Hint	Fields	Meaning
Coded Concept	`code`, `system`, `display`	A value from a terminology system
Service Line	`procedure_code`, `charge_amount`, `service_date`, `status`	A line item on a claim
Patient Identity	`patient.id`, `patient.name`, `patient.dob`	Patient demographic group
Provider Identity	`provider.npi`, `provider.name`, `provider.taxonomy`	Provider identification group
Adjudication	`allowed_amount`, `paid_amount`, `status`, `adjustment`	Payment determination group

What It Enables

With the medical plugin loaded, vajra inspect on a medical claim produces:

=== Domain Type Recognition ===
  $.claims[*].diagnosis[*].code           E11.9      ICD-10-CM
  $.claims[*].diagnosis[*].code           J44.1      ICD-10-CM
  $.claims[*].service_lines[*].procedure_code  99213  CPT
  $.claims[*].provider.npi                1234567893 NPI
  $.claims[*].service_lines[*].adjustment.reason  CO-45  Denial Reason

Without the plugin, those values are just strings. With it, they are clinically meaningful codes.

Building Your Own Plugin

Step 1: Create a Crate

cargo new vajra-domain-finance --lib

Step 2: Depend on vajra-types

# Cargo.toml
[dependencies]
vajra-types = { version = "0.1", path = "../vajra-types" }

Step 3: Implement the Trait

use vajra_types::traits::{VajraPlugin, TypeRecognizer, RelationshipHint};

pub struct FinancePlugin;

impl VajraPlugin for FinancePlugin {
    fn name(&self) -> &str { "finance" }
    fn version(&self) -> &str { "0.1.0" }

    fn type_recognizers(&self) -> Vec<Box<dyn TypeRecognizer>> {
        vec![
            Box::new(SwiftCodeRecognizer),
            // IbanRecognizer and CurrencyCodeRecognizer follow the same
            // pattern as SwiftCodeRecognizer below (definitions omitted).
            Box::new(IbanRecognizer),
            Box::new(CurrencyCodeRecognizer),
        ]
    }

    fn relationship_hints(&self) -> Vec<RelationshipHint> {
        vec![
            RelationshipHint {
                field_patterns: vec![
                    "amount".to_string(),
                    "currency".to_string(),
                ],
                name: "monetary-value".to_string(),
                description: "Amount with its currency denomination".to_string(),
            },
        ]
    }
}

struct SwiftCodeRecognizer;

impl TypeRecognizer for SwiftCodeRecognizer {
    fn type_name(&self) -> &str { "SWIFT/BIC" }

    fn matches(&self, value: &str) -> bool {
        // Compare bytes, not string slices: slicing a &str off a character
        // boundary panics, and recognizers see arbitrary input.
        let b = value.as_bytes();
        (b.len() == 8 || b.len() == 11)
            && b[..4].iter().all(u8::is_ascii_uppercase)    // bank code
            && b[4..6].iter().all(u8::is_ascii_uppercase)   // country code
            && b[6..].iter().all(u8::is_ascii_alphanumeric) // location code + optional branch code
    }
}

Step 4: Register the Plugin

Static plugins are compiled into the binary at build time by adding the crate to vajra-cli’s dependencies.

Dynamic plugins are loaded at runtime via libloading from the plugin directory (default: ~/.vajra/plugins/).

Error Isolation

Plugins run in an isolation boundary. If a plugin panics or returns an error:

The panic is caught at the plugin boundary (via std::panic::catch_unwind).
Core analysis continues without the plugin’s contributions.
The plugin failure is recorded in the output’s provenance metadata.
A diagnostic message is emitted to stderr.

vajra: plugin "finance" failed during type recognition: index out of bounds
vajra: continuing analysis without finance plugin contributions

No plugin failure can crash Vajra. No plugin can corrupt the core analysis. The isolation is structural, not aspirational.
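A minimal sketch of what catching a panic at the plugin boundary can look like with std::panic::catch_unwind. This is an illustration under assumptions, not Vajra's actual boundary code: guarded_matches and SliceHappyRecognizer are hypothetical names, and the trait is a trimmed local copy. Note that catch_unwind only works when the binary is built with the default unwinding panic strategy; it cannot intercept panic = "abort" or a stack overflow.

```rust
use std::panic::{catch_unwind, AssertUnwindSafe};

/// Trimmed local copy of the recognizer trait.
pub trait TypeRecognizer: Send + Sync {
    fn type_name(&self) -> &str;
    fn matches(&self, value: &str) -> bool;
}

/// Hypothetical buggy recognizer: slicing panics on values shorter than 4 bytes.
pub struct SliceHappyRecognizer;

impl TypeRecognizer for SliceHappyRecognizer {
    fn type_name(&self) -> &str { "demo" }
    fn matches(&self, value: &str) -> bool {
        value[..4].chars().all(|c| c.is_ascii_uppercase())
    }
}

/// Runs one recognizer call and converts a panic into an error message,
/// so the caller can record the failure and carry on without the plugin.
pub fn guarded_matches(recognizer: &dyn TypeRecognizer, value: &str) -> Result<bool, String> {
    catch_unwind(AssertUnwindSafe(|| recognizer.matches(value))).map_err(|payload| {
        // Panic payloads are usually a &str or a String.
        if let Some(msg) = payload.downcast_ref::<&str>() {
            (*msg).to_string()
        } else if let Some(msg) = payload.downcast_ref::<String>() {
            msg.clone()
        } else {
            "unknown panic".to_string()
        }
    })
}
```

On an Err, a host in the spirit of the steps above would skip that plugin's contributions, note the failure in provenance metadata, and print the diagnostic to stderr.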

Plugin Constraints

A plugin may:

Register type recognizers, profiles, relationship hints, and renderers
Read its own configuration files
Use any safe Rust code internally

A plugin may not:

Modify the core analysis pipeline
Access the filesystem beyond its own config directory
Make network calls
Mutate the input document
Introduce nondeterminism (all plugin methods must be deterministic)

Shipped Plugins

Six domain plugins ship with Vajra, all enabled by default via feature flags:

Domain	Plugin	Type Recognizers	Hints
Medical / EDI	`vajra-domain-med`	ICD-10, CPT, HCPCS, NDC, NPI, Diagnosis Code	6 (claim service line, diagnosis, patient, provider, adjudication, denial)
Security	`vajra-domain-sec`	CVE, IPv4, IPv6, CIDR, MAC, SHA-256, SHA-1, MD5, JWT, MITRE ATT&CK Technique, MITRE Tactic, CVSS	6 (network flow, alert classification, vulnerability, auth, process execution, DNS)
DevOps	`vajra-domain-devops`	Container ID, Semver, Git SHA, Docker Image, AWS ARN, GCP Resource, CIDR, Cron, K8s Namespace, Terraform Resource	6 (K8s pod spec, deployment metadata, service endpoint, Terraform, CI pipeline, container spec)
Source Code	`vajra-domain-source`	snake_case, camelCase, PascalCase, SCREAMING_SNAKE, import paths, source file paths	6 (function definition, class definition, import statement, parameter list, conditional, loop)
Encoding	`vajra-domain-encoding`	Base64, Base64URL, hex, URL-encoded, HTML entities, Unicode escapes, PEM, data URI, quoted-printable, MIME encoded word, Punycode, double-encoded, mixed-encoding	3 (content+encoding, transfer encoding, encoded/decoded pairs)
GitHub	`vajra-domain-github`	PR number, issue number, GitHub username, repo slug, commit SHA, branch name, label, milestone, review state, merge method	7 (pull request, issue, review, commit, release, workflow run, discussion)

Feature Flags

# vajra-cli/Cargo.toml
[features]
default = ["medical", "security", "devops", "source", "encoding", "github"]
medical = ["vajra-domain-med"]
security = ["vajra-domain-sec"]
devops = ["vajra-domain-devops"]
source = ["vajra-source", "vajra-domain-source"]
encoding = ["vajra-domain-encoding"]
github = ["vajra-domain-github"]
all-plugins = ["medical", "security", "devops", "source", "encoding", "github"]

Build without a plugin: cargo build --no-default-features --features security,devops

The Security Plugin: vajra-domain-sec

The security plugin recognizes types commonly found in SIEM events, vulnerability scans, threat intelligence feeds, and network flow data.

Type Recognizers

Recognized Type	Pattern	Example Values
CVE ID	`CVE-YYYY-NNNNN`	`CVE-2024-3400`, `CVE-2023-44487`
IPv4	Dotted-quad, each octet 0-255	`192.168.1.1`, `10.0.0.1`
IPv6	Full, compressed, mixed notation	`2001:db8::1`, `::1`
CIDR	IPv4/prefix (0-32)	`10.0.0.0/8`, `192.168.1.0/24`
MAC Address	Colon or hyphen separated	`aa:bb:cc:dd:ee:ff`
SHA-256	64 lowercase hex chars	`e3b0c44298fc1c14...`
SHA-1	40 lowercase hex chars	`da39a3ee5e6b4b0d...`
MD5	32 lowercase hex chars	`d41d8cd98f00b204...`
JWT	`eyJ...\.eyJ...\.sig`	JSON Web Tokens
MITRE ATT&CK Technique	`T\d{4}(\.\d{3})?`	`T1059`, `T1059.001`
MITRE ATT&CK Tactic	`TA\d{4}`	`TA0001`, `TA0040`
CVSS Vector	`CVSS:3.x/AV:.../...`	Full CVSS v3 vector strings

The DevOps Plugin: vajra-domain-devops

The DevOps plugin recognizes types in Kubernetes manifests, Terraform state, CI/CD pipeline output, Docker configurations, and cloud infrastructure JSON.

Type Recognizers

Recognized Type	Pattern	Example Values
Container ID	12 or 64 lowercase hex chars	`a1b2c3d4e5f6`
Semver	`v?MAJOR.MINOR.PATCH(-pre)?(+build)?`	`v1.2.3`, `1.0.0-beta.1`
Git SHA	7-12 or 40 lowercase hex chars	`a1b2c3d`, full 40-char SHA
Docker Image	`[registry/]repo:tag` or `repo@sha256:digest`	`nginx:latest`, `gcr.io/proj/img:v1`
AWS ARN	`arn:aws:service:region:account:resource`	`arn:aws:s3:::my-bucket`
GCP Resource	`projects/*/...` or `organizations/*/...`	`projects/my-proj/topics/t`
CIDR Block	IPv4/prefix (0-32)	`10.0.0.0/16`
Cron Expression	5-field cron pattern	`0 */6 * * *`
K8s Namespace	DNS-1123 labels, known system namespaces	`kube-system`, `my-app-staging`
Terraform Resource	`provider_type.name`	`aws_instance.web`

The Source Code Plugin: vajra-domain-source

The source code plugin recognizes patterns in the JSON trees produced by vajra-source (tree-sitter CST-to-JSON output). It works alongside vajra-source, which handles the parsing.

Type Recognizers

Recognized Type	Pattern	Example Values
snake_case identifier	`[a-z][a-z0-9]*(_[a-z0-9]+)+`	`my_function`, `get_value`
camelCase identifier	`[a-z]...[A-Z]...`	`myFunction`, `getValue`
PascalCase identifier	`[A-Z][a-zA-Z0-9]+`	`MyClass`, `HttpClient`
SCREAMING_SNAKE_CASE	`[A-Z][A-Z0-9]*(_[A-Z0-9]+)+`	`MAX_SIZE`, `HTTP_STATUS`
Import path	`mod::path` or `pkg.Class` or `@scope/pkg`	`std::collections::HashMap`
Source file path	Path ending in `.rs`, `.py`, `.go`, etc.	`src/main.rs`, `lib/utils.py`

Relationship Hints

Hint	Pattern	Meaning
Function definition	name + parameters + body	A function or method
Class definition	name + body + inheritance	A class or struct
Import statement	path + optional alias	A use/import declaration
Parameter list	type + name pairs	Function parameters
Conditional block	condition + consequence + alternative	An if/else construct
Loop block	condition/iterator + body	A for/while loop

The Encoding Plugin: vajra-domain-encoding

The encoding plugin detects data encodings embedded in JSON string values. It identifies Base64, hex, URL encoding, HTML entities, PEM certificates, and more — including adversarial patterns like double encoding and mixed encoding used for evasion.

Type Recognizers (3 Tiers)

Tier 1 — Definite confidence (structural markers, near-zero false positives):

Recognized Type	Pattern	Example Values
PEM block	`-----BEGIN ...-----` prefix/suffix	Certificates, private keys
Data URI	`data:mime;base64,...`	Embedded images, payloads
MIME encoded word	`=?charset?B/Q?...?=`	Email header encoding
Punycode	`xn--` prefix	Internationalized domain names

Tier 2 — Dominant confidence (strong patterns, low false positives):

Recognized Type	Pattern	Example Values
URL encoded	2+ `%XX` sequences + trial decode	`hello%20world%21`
Quoted-printable	3+ `=XX` sequences	MIME email encoding
HTML entity	2+ `&...;` entities	`&lt;script&gt;`
Unicode escape	2+ `\uXXXX` or `\xNN`	`\u0048\u0065`
Base64URL	16+ chars, URL-safe alphabet	API tokens, URL-safe data

Tier 3 — Heuristic (aggressive false positive gating):

Recognized Type	Detection	Security Signal
Base64	24+ chars, div-by-4, trial decode, entropy gate	Obfuscated payloads, exfiltration
Hex encoded	32+ chars, excludes known hash lengths	Shellcode, binary blobs
Double encoded	Decode reveals another encoding	Evasion technique (`%253C` → `%3C` → `<`)
Mixed encoding	2+ encoding types in one value	Obfuscation, WAF bypass

Layer Peeling API

Beyond type recognition, the plugin provides detect_encoding_layers() for recursive analysis:

use vajra_domain_encoding::detect_encoding_layers;

let layers = detect_encoding_layers("%2548ello%2520world", 5);
// Returns: [url_encoded(depth=0), url_encoded(depth=1)]

Peeling is bounded at depth 5 and each layer's decode is capped at 4 KB, which is enough to catch nested schemes such as base64(url(hex(payload))).
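To make the peeling idea concrete, here is a simplified sketch restricted to URL encoding. It is not the plugin's implementation: peel_url_layers, percent_decode_once, and MAX_LAYER_BYTES are hypothetical names, and the real API also handles the other encodings and reports typed layers rather than decoded strings.

```rust
/// Assumed per-layer cap, mirroring the 4 KB limit mentioned in the text.
const MAX_LAYER_BYTES: usize = 4096;

/// Decodes every `%XX` sequence once. Returns None when nothing was decoded
/// or when the result is not valid UTF-8.
fn percent_decode_once(input: &str) -> Option<String> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut decoded_any = false;
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            let hi = (bytes[i + 1] as char).to_digit(16);
            let lo = (bytes[i + 2] as char).to_digit(16);
            if let (Some(h), Some(l)) = (hi, lo) {
                out.push((h * 16 + l) as u8);
                decoded_any = true;
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    if decoded_any { String::from_utf8(out).ok() } else { None }
}

/// Peels URL-encoding layers until nothing decodes, the depth bound is hit,
/// or a layer exceeds the size cap. Returns the decoded text of each layer.
pub fn peel_url_layers(input: &str, max_depth: usize) -> Vec<String> {
    let mut layers = Vec::new();
    let mut current = input.to_string();
    while layers.len() < max_depth && current.len() <= MAX_LAYER_BYTES {
        match percent_decode_once(&current) {
            Some(next) => {
                layers.push(next.clone());
                current = next;
            }
            None => break,
        }
    }
    layers
}
```

For the input used above, the first pass turns %25 into %, giving %48ello%20world, and the second pass yields Hello world, so two layers are reported. Both bounds guarantee termination on adversarial input.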

The GitHub Plugin: vajra-domain-github

The GitHub plugin recognizes types commonly found in GitHub API responses, webhook payloads, and exported repository data (PRs, issues, commits, reviews, releases, workflow runs).

Type Recognizers

Recognized Type	Pattern	Priority	Confidence	Example Values
PR Number	`#\d+` or bare integer in PR context	10	0.90	`#142`, `1587`
Issue Number	`#\d+` or bare integer in issue context	10	0.90	`#23`, `456`
GitHub Username	`[a-zA-Z0-9](-?[a-zA-Z0-9]){0,38}`	20	0.75	`copyleftdev`, `octocat`
Repo Slug	`owner/repo` pattern	15	0.85	`copyleftdev/vajra`, `rust-lang/rust`
Commit SHA	7-40 hex chars in commit context	10	0.95	`a1b2c3d`, full 40-char SHA
Branch Name	Ref-like strings with `/` separators	25	0.70	`main`, `feature/cascade-cmd`
Label	Known label patterns (bug, enhancement, etc.)	30	0.65	`bug`, `good first issue`
Milestone	Version-like or sprint-like strings	30	0.60	`v1.0`, `Sprint 12`
Review State	One of: approved, changes_requested, commented, dismissed	5	1.00	`approved`, `changes_requested`
Merge Method	One of: merge, squash, rebase	5	1.00	`squash`, `rebase`

Relationship Hints

Hint	Field Patterns	Meaning
Pull Request	`number`, `title`, `state`, `author`, `base`, `head`	A pull request record
Issue	`number`, `title`, `state`, `labels`, `assignees`	An issue record
Review	`author`, `state`, `body`, `submitted_at`	A PR review
Commit	`sha`, `message`, `author`, `date`	A commit record
Release	`tag_name`, `name`, `published_at`, `assets`	A release record
Workflow Run	`name`, `status`, `conclusion`, `run_number`	A CI workflow run
Discussion	`title`, `author`, `category`, `answer`	A GitHub discussion

Future Plugin Domains

The architecture supports any domain:

Domain	Plugin	Type Recognizers
Financial	`vajra-domain-finance`	SWIFT, IBAN, CUSIP, currency codes
Telecom	`vajra-domain-telecom`	E.164 numbers, IMSI, CDR fields
IoT / Sensor	`vajra-domain-iot`	Sensor types, unit patterns, device IDs

Architecture

Vajra is a Rust workspace of 17 crates. Each crate has a single responsibility. Dependencies flow downward. Nothing cycles.

The 17-Crate Workspace

vajra/
├── vajra-types/          Shared types, traits, contracts
├── vajra-core/           Parsing, traversal, canonicalization, path extraction
├── vajra-fingerprint/    BLAKE3 hashing, Merkle trees, MinHash, SimHash, LSH
├── vajra-stats/          CMS, Space-Saving, DDSketch, MAD, entropy, frequency
├── vajra-anomaly/        Outlier scoring, instability, rarity, structural anomaly
├── vajra-drift/          JSD, Wasserstein, path diff, drift classification
├── vajra-motif/          Motif counting, near-motif grouping, motif compression
├── vajra-essence/        Profiles, scoring, ranking, rendering, templates
├── vajra-query/          Expression parsing, path filtering, analysis functions
├── vajra-source/         Source code parsing via tree-sitter (Rust, Python, Go, JS, +5)
├── vajra-cli/            CLI argument parsing, command dispatch, output formatting
├── vajra-domain-med/     Medical/EDI type recognizers (ICD-10, CPT, NPI, NDC, HCPCS)
├── vajra-domain-sec/     Security type recognizers (CVE, MITRE ATT&CK, IPs, hashes, JWT)
├── vajra-domain-devops/  DevOps type recognizers (K8s, Docker, Terraform, ARN, semver)
├── vajra-domain-source/  Source code recognizers (naming conventions, import paths)
├── vajra-domain-encoding/ Encoding detection (Base64, hex, URL, PEM, layers)
├── vajra-domain-github/  GitHub type recognizers (PRs, issues, commits, reviews)
└── Cargo.toml            Workspace root

Dependency Graph

                    vajra-types
                   /     |     \
                  /      |      \
           vajra-core    |    vajra-domain-{med,sec,devops}
            /    \       |       /
           /      \      |      /
  vajra-fingerprint  vajra-stats
       |          \   /   |
       |           \ /    |
       |      vajra-anomaly
       |           |
       |      vajra-drift
       |           |
       |      vajra-motif
       |         / |
       |        /  |
       vajra-essence
            |
       vajra-query
            |
       vajra-cli

Root crate (no internal dependencies):

vajra-types — shared types, trait definitions, result contracts
The next level up is vajra-core, which depends only on vajra-types.

Leaf crate (depends on everything):

vajra-cli — the binary. It orchestrates all other crates.

Crate Responsibilities

vajra-types

The foundation. Shared types that every crate depends on.

Document — the parsed document model (value tree + path trie + metadata)
WildcardPath — normalized path representation with [*] array indices
PathTrie — trie data structure for efficient path storage and lookup
FeatureStore — per-path feature vectors
JsonType — enum of JSON types (object, array, string, number, boolean, null)
Core traits: Analyzer, StreamAnalyzer, FeatureExtractor, ConcernProfile, Fingerprinter, DriftDetector

pub trait Analyzer {
    type Output;
    fn analyze(&self, doc: &Document) -> Result<Self::Output>;
}

pub trait StreamAnalyzer {
    type Accumulator: Default;
    type Output;
    fn on_event(&self, event: &JsonEvent, acc: &mut Self::Accumulator) -> Result<()>;
    fn finalize(&self, acc: Self::Accumulator) -> Result<Self::Output>;
}

vajra-core

Parsing, traversal, and the foundational index.

simd-json integration for DOM-mode parsing
Multi-format input support (JSON, NDJSON, YAML, CSV, TSV, Markdown, PDF)
Compression handling (gzip, zstd)
HTTP URL fetching
RFC 8785 canonicalization
DFS path extraction and path trie construction
Unicode NFC normalization
Redaction engine (vajra_core::redact)
Input hardening (depth limits, string length limits, size limits)
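The DFS path extraction with wildcard array indices can be sketched as follows. This is a simplified illustration, not vajra-core's code: Json is a minimal stand-in for the parsed value tree, wildcard_paths is a hypothetical name, and the result is a plain sorted set instead of a PathTrie.

```rust
use std::collections::BTreeSet;

/// Minimal stand-in for a parsed JSON value (not Vajra's `Document`).
pub enum Json {
    Null,
    Bool(bool),
    Number(f64),
    Str(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Depth-first walk that records every path, writing `[*]` in place of
/// concrete array indices so all elements of an array share one path.
pub fn wildcard_paths(root: &Json) -> BTreeSet<String> {
    fn walk(node: &Json, path: &str, out: &mut BTreeSet<String>) {
        out.insert(path.to_string());
        match node {
            Json::Object(fields) => {
                for (key, child) in fields {
                    walk(child, &format!("{path}.{key}"), out);
                }
            }
            Json::Array(items) => {
                for child in items {
                    walk(child, &format!("{path}[*]"), out);
                }
            }
            _ => {}
        }
    }
    let mut out = BTreeSet::new();
    walk(root, "$", &mut out);
    out
}
```

A BTreeSet keeps the paths sorted, so the result does not depend on key order in the input, in keeping with the determinism guarantee.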

vajra-fingerprint

Structural identity.

BLAKE3 path set fingerprint
BLAKE3 typed path fingerprint
Merkle subtree hashing (shape fingerprint)
MinHash signature computation (k = 128)
SimHash for near-motif detection
LSH bucketing for scalable similarity search
Cluster computation from LSH candidates
StreamingFingerprintAccumulator for streaming mode
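For readers unfamiliar with MinHash, the sketch below shows the core idea: keep the minimum hash of the set under k different hash functions, then estimate Jaccard similarity as the fraction of signature positions that agree. It is an illustration only. It substitutes the standard library's DefaultHasher with a seed mixed in for whatever hash family vajra-fingerprint uses, and minhash_signature and estimate_jaccard are hypothetical names.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One minimum per seed; the seed stands in for an independent hash function.
pub fn minhash_signature(items: &[&str], k: usize) -> Vec<u64> {
    (0..k as u64)
        .map(|seed| {
            items
                .iter()
                .map(|item| {
                    let mut hasher = DefaultHasher::new();
                    seed.hash(&mut hasher);
                    item.hash(&mut hasher);
                    hasher.finish()
                })
                .min()
                .unwrap_or(u64::MAX) // empty set: sentinel value
        })
        .collect()
}

/// Fraction of positions where two equal-length signatures agree.
pub fn estimate_jaccard(a: &[u64], b: &[u64]) -> f64 {
    if a.is_empty() || a.len() != b.len() {
        return 0.0;
    }
    let same = a.iter().zip(b).filter(|(x, y)| x == y).count();
    same as f64 / a.len() as f64
}
```

Because each position takes a minimum over the set, the signature is independent of item order, which is the property that lets path sets be compared without sorting or aligning them first.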

vajra-stats

The statistical engine.

Shannon entropy (exact and CMS-approximate)
Normalized entropy
Count-Min Sketch with conservative update
Space-Saving top-k
DDSketch for streaming quantiles
MAD and modified z-scores
Frequency analysis (key, path, value)
Missingness profiling (null rate, absent rate, empty rate)
Numeric distribution summary (min, max, mean, median, percentiles)
Co-occurrence and PMI computation
Benford’s Law leading digit analysis
StreamingStatsAccumulator for streaming mode
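As a reference point for the entropy primitives, here is a minimal sketch of exact Shannon entropy in bits, H = -Σ p·log2(p), and its normalized form H / log2(k) for k distinct values. The function names are hypothetical and the signature only loosely mirrors the shannon_entropy call used in the tests later in this chapter.

```rust
use std::collections::BTreeMap;

/// Exact Shannon entropy in bits over a table of value counts.
pub fn shannon_entropy_bits(counts: &BTreeMap<&str, u64>) -> f64 {
    let total: u64 = counts.values().sum();
    if total == 0 {
        return 0.0;
    }
    counts
        .values()
        .filter(|&&c| c > 0)
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Entropy divided by its maximum, log2(k), so the result lies in [0, 1].
/// Fewer than two distinct values carry no uncertainty, hence 0.
pub fn normalized_entropy(counts: &BTreeMap<&str, u64>) -> f64 {
    let distinct = counts.values().filter(|&&c| c > 0).count();
    if distinct < 2 {
        return 0.0;
    }
    shannon_entropy_bits(counts) / (distinct as f64).log2()
}
```

Normalizing makes fields with different cardinalities comparable: a uniformly distributed 4-value field and a uniformly distributed 400-value field both score 1.0.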

vajra-anomaly

Deviation detection.

Numeric outlier detection (MAD-based z-scores)
Rarity scoring (self-information)
Structural deviation detection (Jaccard distance from mode)
Type instability detection
Composite anomaly scoring
Anomaly report generation
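MAD-based outlier detection is commonly implemented with the modified z-score, 0.6745 · (x − median) / MAD, flagged when its absolute value exceeds a threshold such as 3.5. The sketch below assumes that formulation; the crate's exact scoring and its handling of MAD = 0 are not documented here, and mad_outliers is a hypothetical name.

```rust
/// Median of an already sorted, non-empty slice.
fn median(sorted: &[f64]) -> f64 {
    let n = sorted.len();
    if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    }
}

/// Returns the values whose modified z-score exceeds `threshold` in magnitude.
pub fn mad_outliers(values: &[f64], threshold: f64) -> Vec<f64> {
    if values.is_empty() {
        return Vec::new();
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(f64::total_cmp);
    let med = median(&sorted);

    // MAD = median of absolute deviations from the median.
    let mut deviations: Vec<f64> = sorted.iter().map(|v| (v - med).abs()).collect();
    deviations.sort_by(f64::total_cmp);
    let mad = median(&deviations);
    if mad == 0.0 {
        // Assumption: with zero spread this sketch reports nothing.
        return Vec::new();
    }

    values
        .iter()
        .copied()
        .filter(|v| (0.6745 * (v - med) / mad).abs() > threshold)
        .collect()
}
```

Using the median and MAD instead of the mean and standard deviation keeps a single extreme value from inflating the spread estimate and hiding itself.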

vajra-drift

Change detection between documents.

Path set symmetric difference (structural drift)
Type drift detection
Jensen-Shannon Divergence for distributional drift
1D Wasserstein distance for numeric drift magnitude
Drift classification (additive, subtractive, type-mutative, distributional, cardinality-shift, null-rate-shift)
Severity scoring with profile-dependent weights
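For reference, Jensen-Shannon Divergence between two discrete distributions P and Q is 0.5 · KL(P‖M) + 0.5 · KL(Q‖M) with M = (P + Q) / 2. With base-2 logarithms the result lies in [0, 1]. The sketch below is a minimal version over aligned probability vectors; the function name matches the one used in the tests later in this chapter, but the signature here is an assumption.

```rust
/// KL(a || m) in bits; terms with a_i = 0 contribute nothing.
fn kl_bits(a: &[f64], m: &[f64]) -> f64 {
    let mut sum = 0.0;
    for (ai, mi) in a.iter().zip(m) {
        if *ai > 0.0 {
            sum += ai * (ai / mi).log2();
        }
    }
    sum
}

/// Jensen-Shannon Divergence in bits. Both slices must list probabilities
/// for the same categories in the same order; returns None otherwise.
pub fn jensen_shannon_divergence(p: &[f64], q: &[f64]) -> Option<f64> {
    if p.len() != q.len() || p.is_empty() {
        return None;
    }
    // Mixture distribution M = (P + Q) / 2.
    let m: Vec<f64> = p.iter().zip(q).map(|(pi, qi)| 0.5 * (pi + qi)).collect();
    Some(0.5 * kl_bits(p, &m) + 0.5 * kl_bits(q, &m))
}
```

Unlike plain KL divergence, the mixture M is nonzero wherever P or Q is, so the measure stays finite when a value appears in only one of the two documents.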

vajra-motif

Repeated structure analysis.

Motif counting from Merkle subtree hash frequencies
Near-motif grouping via SimHash Hamming distance
Motif ranking (frequency × subtree size)
Motif compression for essence generation
Array morphology analysis (homogeneity, uniqueness, shape diversity)

vajra-essence

The rendering engine.

Built-in profiles: StaffProfile, EngineerProfile, AuditorProfile, AiProfile, FraudProfile
Custom profile loading from TOML
Six-dimensional scoring model
Candidate collection and ranking
Token budget enforcement (greedy knapsack)
Text, JSON, Markdown, and compact-AI renderers
Motif collapsing
--explain score decomposition
Provenance metadata attachment
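Token budget enforcement by greedy knapsack can be sketched as below. The details are assumptions: Candidate and select_within_budget are hypothetical, and the real engine may rank by a different key. This version sorts by score per token and breaks ties by path so that the selection is deterministic.

```rust
/// Hypothetical essence candidate: an observation with a score and a token cost.
#[derive(Debug, Clone, PartialEq)]
pub struct Candidate {
    pub path: String,
    pub score: f64,
    pub tokens: usize,
}

/// Greedy knapsack: take candidates in order of score density while they fit.
pub fn select_within_budget(mut candidates: Vec<Candidate>, budget: usize) -> Vec<Candidate> {
    // Highest score-per-token first; ties broken by path for a stable order.
    candidates.sort_by(|a, b| {
        let density_a = a.score / a.tokens.max(1) as f64;
        let density_b = b.score / b.tokens.max(1) as f64;
        density_b
            .total_cmp(&density_a)
            .then_with(|| a.path.cmp(&b.path))
    });

    let mut remaining = budget;
    let mut chosen = Vec::new();
    for candidate in candidates {
        if candidate.tokens <= remaining {
            remaining -= candidate.tokens;
            chosen.push(candidate);
        }
    }
    chosen
}
```

Greedy selection is not guaranteed to be optimal for knapsack, but it is fast, easy to explain, and fully reproducible once ties are broken deterministically.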

vajra-query

Path-based query engine.

Expression parser for path filters and analysis functions
entropy(path), rarity(path, value), instability(path), null_rate(path), stats(path), anomaly_score(path), motif(path)
Conditional expression evaluation (e.g., entropy($.status) > 0.5)
Integration with stats, anomaly, and motif analyzers

vajra-cli

The command-line interface.

Clap-based argument parsing
Command dispatch (inspect, stats, anomalies, fingerprint, essence, drift, cluster, invariants, query, batch, profiles)
Output format rendering (text, JSON, Markdown, compact-AI)
Redaction integration
Streaming mode selection
Custom profile loading
Batch processing with Rayon parallelism

vajra-domain-med

The medical/EDI domain plugin.

ICD-10-CM and ICD-10-PCS pattern recognizers
CPT and HCPCS code recognizers
NDC (National Drug Code) recognizer
NPI (National Provider Identifier) recognizer with Luhn check
Denial reason code recognizer (CO, PR, OA, PI, CR)
Claim, service line, patient, provider, and adjudication relationship hints
Implements VajraPlugin trait

Core Traits

The trait system is the architectural backbone. Each trait is small, composable, and independently testable.

Trait	Defined In	Purpose
`Analyzer`	vajra-types	DOM-mode analysis: document in, typed output out
`StreamAnalyzer`	vajra-types	Streaming analysis: events in, accumulator maintained, output finalized
`FeatureExtractor`	vajra-types	Extract features into the shared feature store
`ConcernProfile`	vajra-types	Define scoring weights and rendering behavior
`Fingerprinter`	vajra-types	Compute structural fingerprints
`DriftDetector`	vajra-types	Compare two analyzed documents for drift
`VajraPlugin`	vajra-types	Plugin extension point
`TypeRecognizer`	vajra-types	Domain-specific value type recognition

Navigating the Codebase

“I want to understand how parsing works.”
Start at vajra-core/src/. The input module handles multi-format loading. The parse module handles JSON parsing. The canon module handles canonicalization.

“I want to understand the statistical engine.”
Start at vajra-stats/src/. Each statistical primitive has its own module. StatsAnalyzer composes them.

“I want to add a new profile.”
Look at vajra-essence/src/. The built-in profiles (StaffProfile, EngineerProfile, etc.) implement ConcernProfile. Follow the pattern.

“I want to add a domain plugin.”
Look at vajra-domain-med/ as the reference implementation. Implement VajraPlugin in a new crate.

“I want to add a new command.”
Start at vajra-cli/src/main.rs. Each command is a function (cmd_inspect, cmd_stats, etc.). Add a new variant to the Command enum and implement the handler.

“I want to understand how essences are built.”
Start at vajra-essence/src/. The EssenceBuilder collects observations from stats, anomaly, and motif analyzers, scores them, and renders the result.

Build and Run

# Build the entire workspace
cargo build --release

# Run tests across all crates
cargo test --workspace

# Run the CLI
./target/release/vajra inspect claim.json

# Run benchmarks
cargo bench --workspace

External Dependencies

Dependency	Version	Purpose
`serde` / `serde_json`	1.x	Serialization
`serde_yaml`	0.9	YAML input format
`csv`	1.x	CSV/TSV input format
`blake3`	1.x	All hashing
`clap`	4.x	CLI argument parsing
`ryu`	1.x	Deterministic float formatting
`unicode-normalization`	0.1	Unicode NFC normalization
`toml`	0.8	Config and profile loading
`regex`	1.x	Pattern matching (redaction, type recognition)
`rayon`	1.x	Parallel batch processing
`thiserror` / `anyhow`	2.x / 1.x	Error handling
`flate2`	1.x	Gzip decompression
`zstd`	0.13	Zstd decompression
`pulldown-cmark`	0.12	Markdown input parsing
`pdf-extract`	0.10	PDF text extraction
`ureq`	2.x	HTTP URL fetching
`proptest`	1.x	Property-based testing
`criterion`	0.5	Benchmarks

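Every dependency is a crates.io package that builds with a standard Rust toolchain. Most are pure Rust; the zstd crate is the notable exception, since it binds the C zstd library and compiles it from bundled source by default, which requires a C compiler at build time.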

Lints

The workspace enforces strict Clippy lints:

[workspace.lints.clippy]
pedantic = { level = "warn", priority = -1 }
nursery = { level = "warn", priority = -1 }
unwrap_used = "deny"    # No .unwrap() — use Result
expect_used = "deny"    # No .expect() — use Result
panic = "deny"          # No panic!() — ever

No panics on any input. No unwraps. No expects. Every error path returns a Result.

Testing

Vajra’s test suite is not an afterthought. It is a structural guarantee. 1075 tests across 7 testing strategies ensure that every algorithm, every command, and every output contract works as specified — and continues to work as the codebase evolves.

The Test Philosophy

Every algorithm has a unit test that verifies it against known inputs with expected outputs. No algorithm ships without a proof that it computes correctly.
Every property that should hold universally is tested with random inputs. Canonicalization is idempotent. Fingerprints are key-order-independent. Drift detection is symmetric. These are not checked on one example — they are checked on thousands of generated inputs.
Every failure mode is tested. Malformed JSON, deeply nested documents, pathologically wide objects, adversarial strings. If Vajra can encounter it in the wild, the fuzzer has already thrown it.
Determinism is tested directly. Same input, same config, 10 runs, byte-identical output. This runs in CI on every commit.
Streaming and DOM modes are tested against each other. They must agree within documented error bounds. If they diverge, the streaming approximation is broken.

Test Categories

Unit Tests

1075 tests across all 17 crates. Each primitive, each algorithm, each data structure has targeted tests with known inputs and expected outputs. Domain plugins (medical, security, DevOps) each carry their own property tests, determinism tests, and golden corpus validation.

Examples from each crate:

vajra-core:

#[test]
fn canonicalization_sorts_keys_lexicographically() {
    let input = r#"{"b": 2, "a": 1, "c": 3}"#;
    let doc = Document::parse_str(input).unwrap();
    let canonical = doc.canonical_json();
    assert_eq!(canonical, r#"{"a":1,"b":2,"c":3}"#);
}

#[test]
fn path_extraction_normalizes_array_indices() {
    let input = r#"{"items": [{"id": 1}, {"id": 2}]}"#;
    let doc = Document::parse_str(input).unwrap();
    let paths = doc.trie().all_paths();
    assert!(paths.iter().any(|p| p.to_string() == "$.items[*].id"));
}

#[test]
fn malformed_json_returns_error_not_panic() {
    let input = r#"{"unclosed": "string"#;
    let result = Document::parse_str(input);
    assert!(result.is_err());
}

vajra-stats:

#[test]
fn shannon_entropy_of_uniform_distribution() {
    // 4 equally likely values -> entropy = 2.0 bits
    let values = vec!["a", "b", "c", "d"];
    let counts: BTreeMap<&str, u64> = values.iter()
        .map(|v| (*v, 25u64))
        .collect();
    let entropy = shannon_entropy(&counts);
    assert!((entropy - 2.0).abs() < 1e-10);
}

#[test]
fn mad_of_known_distribution() {
    let values = vec![1.0, 2.0, 3.0, 4.0, 5.0, 100.0];
    let median = 3.5;
    let mad = compute_mad(&values);
    // MAD = median(|1-3.5|, |2-3.5|, |3-3.5|, |4-3.5|, |5-3.5|, |100-3.5|)
    //     = median(2.5, 1.5, 0.5, 0.5, 1.5, 96.5) = 1.5
    assert!((mad - 1.5).abs() < 1e-10);
}

#[test]
fn ddsketch_quantiles_within_relative_accuracy() {
    let mut sketch = DDSketch::new(0.01); // 1% relative accuracy
    for v in &known_distribution {
        sketch.insert(*v);
    }
    let estimated_p50 = sketch.quantile(0.5).unwrap();
    let true_p50 = exact_median(&known_distribution);
    assert!((estimated_p50 - true_p50).abs() <= 0.01 * true_p50.abs());
}

vajra-fingerprint:

#[test]
fn path_set_fingerprint_is_key_order_independent() {
    let a = Document::parse_str(r#"{"x": 1, "y": 2}"#).unwrap();
    let b = Document::parse_str(r#"{"y": 2, "x": 1}"#).unwrap();
    let fp_a = FingerprintAnalyzer.analyze(&a).unwrap();
    let fp_b = FingerprintAnalyzer.analyze(&b).unwrap();
    assert_eq!(fp_a.path_set, fp_b.path_set);
}

#[test]
fn identical_subtrees_produce_identical_merkle_hashes() {
    let input = r#"{"items": [{"a": 1, "b": 2}, {"a": 3, "b": 4}]}"#;
    let doc = Document::parse_str(input).unwrap();
    let fp = FingerprintAnalyzer.analyze(&doc).unwrap();
    // Both array elements have the same structure -> same subtree hash
    assert_eq!(fp.subtree_hashes[0], fp.subtree_hashes[1]);
}

vajra-anomaly:

#[test]
fn mad_outlier_detection_flags_extreme_values() {
    let values: Vec<f64> = (0..100).map(|i| i as f64).collect();
    let mut values_with_outlier = values.clone();
    values_with_outlier.push(10_000.0);
    let report = AnomalyAnalyzer::detect_numeric_outliers(
        &values_with_outlier, 3.5
    );
    assert!(report.outliers.iter().any(|o| o.value == 10_000.0));
}

vajra-drift:

#[test]
fn jsd_is_symmetric() {
    let p = distribution_a();
    let q = distribution_b();
    let jsd_pq = jensen_shannon_divergence(&p, &q);
    let jsd_qp = jensen_shannon_divergence(&q, &p);
    assert!((jsd_pq - jsd_qp).abs() < 1e-10);
}

#[test]
fn jsd_is_zero_for_identical_distributions() {
    let p = distribution_a();
    let jsd = jensen_shannon_divergence(&p, &p);
    assert!(jsd.abs() < 1e-10);
}

Property Tests

Using proptest, Vajra tests invariants that must hold for all valid inputs:

Canonicalization idempotence:

proptest! {
    #[test]
    fn canonicalize_is_idempotent(json in arb_json()) {
        let once = canonicalize(&json);
        let twice = canonicalize(&once);
        prop_assert_eq!(once, twice);
    }
}

Fingerprint stability under key reordering:

proptest! {
    #[test]
    fn fingerprint_stable_under_key_reorder(obj in arb_json_object()) {
        let original = fingerprint(&obj);
        let shuffled = shuffle_keys(&obj);
        let recomputed = fingerprint(&shuffled);
        prop_assert_eq!(original.path_set, recomputed.path_set);
    }
}

Merkle hash determinism:

proptest! {
    #[test]
    fn merkle_hash_deterministic(json in arb_json()) {
        let hash1 = merkle_subtree_hash(&json);
        let hash2 = merkle_subtree_hash(&json);
        prop_assert_eq!(hash1, hash2);
    }
}

Drift symmetry:

proptest! {
    #[test]
    fn structural_drift_is_symmetric(a in arb_json(), b in arb_json()) {
        let drift_ab = structural_drift(&a, &b);
        let drift_ba = structural_drift(&b, &a);
        prop_assert_eq!(drift_ab.added_paths, drift_ba.removed_paths);
        prop_assert_eq!(drift_ab.removed_paths, drift_ba.added_paths);
    }
}
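The symmetry holds by construction when added and removed paths are computed as two set differences. The sketch below works on precomputed path sets rather than documents; the `PathDrift` struct and the set-based signature are assumptions made for this example.

```rust
use std::collections::BTreeSet;

/// Paths added and removed when moving from `old` to `new`.
#[derive(Debug, PartialEq)]
struct PathDrift {
    added_paths: BTreeSet<String>,
    removed_paths: BTreeSet<String>,
}

/// Swapping the arguments swaps the two fields, which is exactly
/// the symmetry the property test asserts.
fn structural_drift(old: &BTreeSet<String>, new: &BTreeSet<String>) -> PathDrift {
    PathDrift {
        added_paths: new.difference(old).cloned().collect(),
        removed_paths: old.difference(new).cloned().collect(),
    }
}
```

For path sets `{$.id, $.status}` and `{$.id, $.amount}`, the forward drift adds `$.amount` and removes `$.status`; the reverse drift reports the opposite.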

MinHash accuracy convergence:

proptest! {
    #[test]
    fn minhash_jaccard_converges(
        a in arb_string_set(1..100),
        b in arb_string_set(1..100)
    ) {
        let true_jaccard = exact_jaccard(&a, &b);
        let estimated = minhash_jaccard(&a, &b, 128);
        // With 128 hashes the standard error is at most 0.5 / sqrt(128) ≈ 0.044,
        // so the error is < 0.1 at 95% confidence. The looser 0.15 bound
        // (about 3.4 standard errors) keeps the test from failing spuriously.
        prop_assert!((true_jaccard - estimated).abs() < 0.15);
    }
}
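The estimate converges because, for a well-mixed hash function, the probability that two sets share the same minimum hash equals their Jaccard similarity. The sketch below is a minimal version that derives its hash functions by mixing a seed into the standard library's `DefaultHasher`; that choice, and the `BTreeSet<String>` input type, are assumptions for illustration only.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeSet;
use std::hash::{Hash, Hasher};

/// Hash `item` under the `seed`-th hash function by mixing the seed in first.
fn seeded_hash(item: &str, seed: u64) -> u64 {
    let mut hasher = DefaultHasher::new();
    seed.hash(&mut hasher);
    item.hash(&mut hasher);
    hasher.finish()
}

/// One minimum per hash function; an empty set yields u64::MAX in every slot.
fn minhash_signature(set: &BTreeSet<String>, num_hashes: u64) -> Vec<u64> {
    (0..num_hashes)
        .map(|seed| set.iter().map(|s| seeded_hash(s, seed)).min().unwrap_or(u64::MAX))
        .collect()
}

/// Fraction of signature slots on which the two sets agree.
fn minhash_jaccard(a: &BTreeSet<String>, b: &BTreeSet<String>, num_hashes: u64) -> f64 {
    let sig_a = minhash_signature(a, num_hashes);
    let sig_b = minhash_signature(b, num_hashes);
    let matches = sig_a.iter().zip(&sig_b).filter(|(x, y)| x == y).count();
    matches as f64 / num_hashes as f64
}

/// Exact Jaccard similarity; two empty sets are treated as identical.
fn exact_jaccard(a: &BTreeSet<String>, b: &BTreeSet<String>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 1.0;
    }
    a.intersection(b).count() as f64 / union as f64
}
```

Each slot behaves like an independent coin flip with success probability J, so with k hashes the standard error is sqrt(J(1 - J) / k), which peaks at J = 0.5.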

DDSketch relative error guarantee:

proptest! {
    #[test]
    fn ddsketch_quantile_within_bounds(values in arb_f64_vec(10..1000)) {
        let mut sketch = DDSketch::new(0.01);
        for v in &values { sketch.insert(*v); }
        let estimated = sketch.quantile(0.5).unwrap();
        let exact = exact_median(&values);
        prop_assert!((estimated - exact).abs() <= 0.01 * exact.abs() + 1e-10);
    }
}
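The relative-error guarantee comes from DDSketch's logarithmic bucketing: with gamma = (1 + alpha) / (1 - alpha), every value in a bucket is within a factor alpha of that bucket's representative. The sketch below shows only this mapping for positive values; `LogMapping` and its method names are hypothetical and not taken from Vajra.

```rust
/// Logarithmic bucket mapping with relative accuracy `alpha` (0 < alpha < 1).
struct LogMapping {
    gamma: f64,
}

impl LogMapping {
    fn new(alpha: f64) -> Self {
        assert!(alpha > 0.0 && alpha < 1.0, "alpha must be in (0, 1)");
        Self { gamma: (1.0 + alpha) / (1.0 - alpha) }
    }

    /// Bucket i covers (gamma^(i-1), gamma^i]. Only positive values are handled here.
    fn bucket_index(&self, value: f64) -> i32 {
        assert!(value > 0.0, "this sketch only maps positive values");
        (value.ln() / self.gamma.ln()).ceil() as i32
    }

    /// Representative chosen so that both bucket edges are within `alpha` of it.
    fn representative(&self, index: i32) -> f64 {
        2.0 * self.gamma.powi(index) / (self.gamma + 1.0)
    }
}
```

At either edge of a bucket the relative error is (gamma - 1) / (gamma + 1), which simplifies to alpha. A quantile answered from bucket counts therefore inherits the same bound.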

Scoring determinism:

proptest! {
    #[test]
    fn scoring_is_deterministic(json in arb_json(), profile in arb_profile()) {
        let score1 = compute_scores(&json, &profile);
        let score2 = compute_scores(&json, &profile);
        prop_assert_eq!(score1, score2);
    }
}

Chaos Tests (Fuzzing)

Fuzzing with cargo-fuzz and AFL throws adversarial inputs at every entry point:

Input Category	What It Tests
Truncated JSON	`{"key": "valu` — parser graceful failure
Unbalanced braces	`{{{}}` — parser error recovery
Invalid UTF-8	Raw byte sequences — no undefined behavior
Depth 10,000+ nesting	`[[[[[...` — depth limit enforcement
100,000+ keys per object	`{"k1":1,"k2":2,...}` — performance, memory
1M identical array elements	`[1,1,1,...]` — motif detection, sketch behavior
Type chaos	Same path alternates string/number/null — instability detection
Adversarial strings	Null bytes, RTL markers, control characters, multi-byte Unicode
Near-max-size documents	At the streaming threshold boundary — mode switching

Target: Zero panics. Zero undefined behavior. Graceful error on every input.

# Run the fuzzer
cd vajra-core
cargo fuzz run parse_json -- -max_total_time=3600
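Depth-limit enforcement is the kind of check that lets a parser reject `[[[[[...` with an error instead of overflowing the stack. The sketch below is one possible pre-scan, written without recursion; the function name `check_depth` and its error message are assumptions for illustration and do not describe Vajra's parser.

```rust
/// Reject input whose bracket nesting exceeds `max_depth`.
/// Brackets inside JSON strings are ignored and backslash escapes are honored.
/// Returns the deepest nesting level seen; the scan does not validate the JSON.
fn check_depth(input: &[u8], max_depth: usize) -> Result<usize, String> {
    let mut depth = 0usize;
    let mut deepest = 0usize;
    let mut in_string = false;
    let mut escaped = false;

    for (offset, &byte) in input.iter().enumerate() {
        if in_string {
            if escaped {
                escaped = false; // the escaped byte is consumed as-is
            } else if byte == b'\\' {
                escaped = true;
            } else if byte == b'"' {
                in_string = false;
            }
            continue;
        }
        match byte {
            b'"' => in_string = true,
            b'[' | b'{' => {
                depth += 1;
                if depth > max_depth {
                    return Err(format!("nesting deeper than {max_depth} at byte {offset}"));
                }
                deepest = deepest.max(depth);
            }
            // Unbalanced closers must not underflow the counter.
            b']' | b'}' => depth = depth.saturating_sub(1),
            _ => {}
        }
    }
    Ok(deepest)
}
```

Because the scan is a single loop over bytes, truncated or unbalanced input cannot make it panic; it simply reports the depth reached before the input ended.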

Differential Tests

Two implementations of the same analysis must agree within documented bounds:

DOM vs. Streaming:

#[test]
fn dom_and_streaming_stats_agree() {
    let doc = Document::parse_file("corpus/claim.json").unwrap();
    let dom_stats = StatsAnalyzer.analyze(&doc).unwrap();

    let mut acc = StreamingStatsAccumulator::default();
    for event in stream_events("corpus/claim.json") {
        acc.on_event(&event.unwrap()).unwrap();
    }
    let stream_stats = acc.finalize().unwrap();

    // Path sets must be identical
    assert_eq!(dom_stats.paths.keys().collect::<Vec<_>>(),
               stream_stats.paths.keys().collect::<Vec<_>>());

    // CMS estimates within error bounds
    for (path, dom_ps) in &dom_stats.paths {
        let stream_ps = &stream_stats.paths[path];
        for (value, &exact_count) in &dom_ps.value_frequencies {
            let estimated = stream_ps.estimated_frequency(value);
            // A Count-Min Sketch never undercounts.
            assert!(exact_count <= estimated);
            // Overcount is bounded by EPSILON times the number of inserted values.
            assert!(estimated <= exact_count + EPSILON * stream_stats.total_values);
        }
    }
}
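The two assertions on the estimate mirror the standard Count-Min Sketch guarantees: the estimate never falls below the true count, and it exceeds it by more than epsilon times the total only with small probability, which depends on the sketch's width and depth. The sketch below is a minimal std-only version; the struct layout and the use of `DefaultHasher` with the row index as a seed are assumptions for illustration.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Count-Min Sketch: `depth` rows of `width` counters each.
struct CountMinSketch {
    width: usize,
    rows: Vec<Vec<u64>>,
    total: u64,
}

impl CountMinSketch {
    fn new(width: usize, depth: usize) -> Self {
        assert!(width > 0 && depth > 0);
        Self { width, rows: vec![vec![0; width]; depth], total: 0 }
    }

    /// Counter position for `item` in the given row (row index acts as the hash seed).
    fn slot(&self, row: usize, item: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        row.hash(&mut hasher);
        item.hash(&mut hasher);
        (hasher.finish() % self.width as u64) as usize
    }

    fn insert(&mut self, item: &str) {
        for row in 0..self.rows.len() {
            let slot = self.slot(row, item);
            self.rows[row][slot] += 1;
        }
        self.total += 1;
    }

    /// Minimum over rows: collisions can only inflate a counter, never shrink it.
    fn estimated_frequency(&self, item: &str) -> u64 {
        (0..self.rows.len())
            .map(|row| self.rows[row][self.slot(row, item)])
            .min()
            .unwrap_or(0)
    }
}
```

In the usual analysis, a width of about e / epsilon and a depth of about ln(1 / delta) give an overcount of at most epsilon times the total with probability at least 1 - delta.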

Exact quantiles vs. DDSketch:

#[test]
fn ddsketch_within_relative_accuracy() {
    let values = load_test_values("corpus/charge_amounts.json");
    let mut sketch = DDSketch::new(0.01);
    for v in &values { sketch.insert(*v); }

    for q in &[0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99] {
        let exact = exact_quantile(&values, *q);
        let estimated = sketch.quantile(*q).unwrap();
        let relative_error = (estimated - exact).abs() / exact.abs();
        assert!(relative_error <= 0.01,
            "q={}: exact={}, estimated={}, error={}",
            q, exact, estimated, relative_error);
    }
}

Determinism Tests

#[test]
fn ten_run_determinism() {
    let corpus = load_corpus("corpus/");
    for file in &corpus {
        let mut outputs = Vec::new();
        for _ in 0..10 {
            let output = run_vajra(&["essence", file, "--profile", "engineer", "--format", "json"]);
            outputs.push(output);
        }
        for i in 1..outputs.len() {
            assert_eq!(outputs[0], outputs[i],
                "Determinism violation on run {} for file {}", i, file);
        }
    }
}

#[test]
fn different_seeds_may_differ() {
    // Intentionally asserts nothing: this only checks that both seeds run to completion.
    let _output_seed0 = run_vajra(&["cluster", "corpus/", "--seed", "0", "--format", "json"]);
    let _output_seed42 = run_vajra(&["cluster", "corpus/", "--seed", "42", "--format", "json"]);
    // The two outputs may differ, and that is fine. Within each seed, output must be deterministic.
}

#[test]
fn same_seed_is_deterministic() {
    for seed in &["0", "42", "12345"] {
        let mut outputs = Vec::new();
        for _ in 0..10 {
            let output = run_vajra(&["cluster", "corpus/", "--seed", seed, "--format", "json"]);
            outputs.push(output);
        }
        for i in 1..outputs.len() {
            assert_eq!(outputs[0], outputs[i]);
        }
    }
}
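Seeded determinism of this kind usually means every random-looking choice is derived from the user's seed by a fixed, pure function. The sketch below shows one way that could look, using the well-known SplitMix64 generator to expand a single seed into a list of hash seeds. It is an assumption about how a `--seed` flag might be wired up, not a description of Vajra's internals, and `derive_hash_seeds` is a hypothetical helper.

```rust
/// SplitMix64: a small deterministic generator, used here only for illustration.
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    /// Advance the state and return the next 64-bit output.
    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Derive `n` hash seeds from one user-supplied seed.
/// No clock, thread id, or OS randomness is consulted, so the result
/// depends only on the arguments.
fn derive_hash_seeds(seed: u64, n: usize) -> Vec<u64> {
    let mut rng = SplitMix64::new(seed);
    (0..n).map(|_| rng.next_u64()).collect()
}
```

Calling `derive_hash_seeds(42, 8)` twice returns the same eight values, while seeds 0 and 42 produce different lists.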

Golden Tests

For each profile-format combination, golden output files are committed to the repository:

tests/golden/
├── claim_staff_text.golden
├── claim_staff_json.golden
├── claim_engineer_text.golden
├── claim_engineer_json.golden
├── claim_auditor_markdown.golden
├── claim_ai_compact.golden
├── claim_fraud_text.golden
├── drift_engineer_text.golden
├── anomalies_text.golden
└── ...

CI asserts byte-exact match between current output and golden files:

#[test]
fn golden_staff_text() {
    let output = run_vajra(&["essence", "corpus/claim.json", "--profile", "staff"]);
    let golden = std::fs::read_to_string("tests/golden/claim_staff_text.golden").unwrap();
    assert_eq!(output, golden, "Golden test failed: staff/text");
}

Golden files are updated explicitly — never auto-updated. When output changes intentionally (algorithm improvement, rendering change), the developer updates the golden files and the diff is reviewed in the PR.

This catches: rendering regressions, ordering instabilities, score drift from algorithm changes.
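When a byte-exact comparison fails on a long output, the first differing line is usually what a reviewer needs to see. The sketch below is a hypothetical helper, `first_mismatch`, that a golden test could use in its failure message; it is not part of Vajra's test suite as documented here.

```rust
/// First line (1-based) at which `actual` departs from `golden`, if any.
/// Each side is `None` when that input has already run out of lines.
/// Splitting on '\n' (not `lines()`) keeps trailing-newline and '\r'
/// differences visible, matching a byte-exact comparison.
fn first_mismatch<'a>(
    golden: &'a str,
    actual: &'a str,
) -> Option<(usize, Option<&'a str>, Option<&'a str>)> {
    let mut golden_lines = golden.split('\n');
    let mut actual_lines = actual.split('\n');
    let mut line_no = 1;
    loop {
        match (golden_lines.next(), actual_lines.next()) {
            (None, None) => return None,
            (g, a) if g == a => line_no += 1,
            (g, a) => return Some((line_no, g, a)),
        }
    }
}
```

The function returns `None` exactly when the two strings are byte-identical, so it can replace the equality check and supply the failure message at the same time.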

Benchmark Tests

Benchmarks use criterion to track performance across commits:

fn bench_parse_1mb(c: &mut Criterion) {
    let input = std::fs::read("benches/fixtures/1mb.json").unwrap();
    c.bench_function("parse_1mb", |b| {
        b.iter(|| Document::parse_bytes(black_box(&input)))
    });
}

fn bench_stats_1mb(c: &mut Criterion) {
    let doc = Document::parse_file("benches/fixtures/1mb.json").unwrap();
    c.bench_function("stats_1mb", |b| {
        b.iter(|| StatsAnalyzer.analyze(black_box(&doc)))
    });
}

// Named to match the benchmark id and the performance-target table.
fn bench_fingerprint_compare(c: &mut Criterion) {
    let fp_a = /* precomputed */;
    let fp_b = /* precomputed */;
    c.bench_function("fingerprint_compare", |b| {
        b.iter(|| minhash_jaccard(black_box(&fp_a), black_box(&fp_b)))
    });
}

Performance targets validated in CI:

Scenario	Target	Test
1 MB JSON, full analysis	< 100 ms	`bench_full_1mb`
100 MB JSON, full analysis	< 5 s	`bench_full_100mb`
10,000 document batch	< 30 s	`bench_batch_10k`
Fingerprint comparison	< 1 µs per pair	`bench_fingerprint_compare`

A regression of more than 10% against the baseline fails the build.
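The 10% gate reduces to a single relative comparison between a stored baseline and the current measurement. The sketch below shows that check in isolation; `is_regression` is a hypothetical helper, and how baselines are stored and loaded in CI is not covered here.

```rust
/// Returns true when `current` is slower than `baseline` by more than `tolerance`
/// (0.10 means 10%). Both times must use the same unit.
fn is_regression(baseline: f64, current: f64, tolerance: f64) -> bool {
    assert!(baseline > 0.0, "baseline must be positive");
    (current - baseline) / baseline > tolerance
}
```

With a 100 ms baseline and a 10% tolerance, a 111 ms run counts as a regression while 110 ms and 90 ms runs do not.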

Running Everything

# All unit and integration tests
cargo test --workspace

# Property tests (may run longer)
cargo test --workspace -- --include-ignored proptest

# Benchmarks
cargo bench --workspace

# Fuzzing (runs until stopped)
cd vajra-core && cargo fuzz run parse_json

# Determinism check (manual)
for i in $(seq 1 10); do
  vajra essence test/claim.json --format json > "/tmp/run_$i.json"
done
md5sum /tmp/run_*.json
# All hashes must be identical

The Invariant Catalog

These properties are tested across the suite. If any is violated, the build fails.

Invariant	Test Type
Canonicalization is idempotent	Property test
Fingerprints are key-order-independent	Property test
Identical subtrees produce identical Merkle hashes	Property test
Structural drift is symmetric (with direction inversion)	Property test
MinHash Jaccard converges to true Jaccard	Property test
DDSketch quantiles within relative accuracy	Property + differential
CMS estimates within proven error bounds	Differential test
DOM and streaming produce consistent results	Differential test
10 runs produce byte-identical output	Determinism test
No panics on any input	Fuzz test
No undefined behavior	Fuzz test
Golden output is byte-stable	Golden test
Performance within 10% of baseline	Benchmark
Mutation score > 85%	Mutation test
