Streaming
Vajra handles JSON of any size. A 50 KB medical claim and a 10 GB event log enter the same pipeline. The streaming engine is what makes this possible.
Two Modes
DOM Mode
For documents that fit in memory. The parser builds a full in-memory tree with random access to every node. All analysis passes can access any part of the document at any time.
Parser: simd-json at 2+ GB/s.
Memory: O(n) where n = document size.
Activates: By default, for documents below the streaming threshold (default 100 MB, configurable).
Streaming Mode
For documents that exceed available memory. SAX-style event parsing with bounded memory. The parser emits events (start-object, key, value, end-object, start-array, end-array) and the analyzers update their accumulators incrementally.
Memory: O(p + s) where p = distinct paths and s = sum of sketch sizes. For typical JSON with < 1,000 distinct paths: < 10 MB regardless of document size.
Activates: Automatically when document size exceeds the streaming threshold. Force with --streaming.
The Two-Pass Hybrid Strategy
Streaming mode does not mean single-pass-only. Vajra uses a hybrid strategy that balances memory efficiency with analysis depth.
Pass 1: Profile the Document
A single streaming pass collects:
- Path extraction. Every wildcard path discovered and registered in the path trie.
- Frequency counting. Value frequencies per path via Count-Min Sketch (conservative update).
- Top-k identification. Most frequent values per path via Space-Saving algorithm.
- Type profiling. Type distribution per path tracked via simple counters.
- Numeric sketches. DDSketch accumulators for every numeric path — percentiles, median, MAD.
- Null and missingness tracking. Per-path counters for null, absent, empty.
- Entropy estimation. Computed from CMS frequency estimates when exact counting exceeds memory.
- Fingerprint accumulation. Merkle hashes built incrementally as subtrees complete.
After Pass 1, Vajra has a complete statistical profile of the document without having held more than one event in memory at a time.
Pass 2 (Optional): Selective DOM for High-Signal Subtrees
If the command requires rich analysis that streaming cannot provide (motif analysis, essence generation with deep context), Vajra can selectively parse high-signal subtrees into DOM.
The decision is based on Pass 1 results:
- Subtrees with high anomaly density are candidates for DOM parsing.
- Subtrees with high entropy fields that need value-level analysis.
- The dominant motif’s representative instance.
Pass 2 is optional. Commands like stats and fingerprint need only Pass 1. Commands like essence may invoke Pass 2 for targeted depth.
Sketch Data Structures in Streaming Mode
DDSketch
Role: Numeric distribution analysis — percentiles, median, MAD.
One DDSketch per numeric path. Each sketch maintains O(log(max/min) / log(1 + alpha)) buckets. With alpha = 0.01 and financial data spanning $0.01 to $1,000,000, this is roughly 700 buckets — a few KB of memory per path.
Key property: Mergeability. When processing a batch in parallel, per-file DDSketch instances merge into a global sketch with zero accuracy loss.
#![allow(unused)]
fn main() {
// Streaming numeric stats
let mut stats = StreamingStatsAccumulator::default();
for event in parser {
stats.on_event(&event?)?;
}
let result = stats.finalize()?;
// result.numeric_stats contains DDSketch-derived percentiles
}
Count-Min Sketch (CMS)
Role: Frequency estimation for values, paths, and key names when cardinality exceeds configurable thresholds.
Default configuration: width = 2,718, depth = 5. Total memory: ~54 KB per sketch. Error guarantee: estimated count within 0.1% of total count with 99% probability.
Activation: Exact counting is preferred when it fits in memory. CMS activates as a fallback when distinct value count per path exceeds the threshold (default: 10,000 distinct values).
Space-Saving
Role: Identifying top-k most frequent elements without storing all elements.
Maintains exactly k counters (default k = 100). Guaranteed to include every element whose true frequency exceeds N/k. Memory: k entries, a few KB.
Memory Budget
The total streaming memory budget is bounded:
| Component | Memory |
|---|---|
| Path trie | O(p) where p = distinct wildcard paths |
| DDSketch (per numeric path) | ~3 KB per path |
| CMS (per high-cardinality path) | ~54 KB per path |
| Space-Saving (per path) | ~4 KB per path (k=100) |
| Type counters (per path) | ~48 bytes per path |
| Null/absent counters (per path) | ~32 bytes per path |
| Fingerprint accumulator | O(current depth) |
For a document with 500 distinct paths, 100 numeric paths, and 50 high-cardinality paths:
Path trie: ~100 KB
DDSketch: ~300 KB (100 paths x 3 KB)
CMS: ~2.7 MB (50 paths x 54 KB)
Space-Saving: ~2.0 MB (500 paths x 4 KB)
Type/null counters: ~40 KB (500 paths x 80 bytes)
Fingerprint: ~10 KB
---
Total: ~5.2 MB
This budget holds regardless of whether the document is 100 MB or 100 GB. The streaming guarantee: bounded memory independent of input size.
DOM vs. Streaming: What Changes
| Capability | DOM Mode | Streaming Mode |
|---|---|---|
| Parsing speed | 2+ GB/s (simd-json) | ~500 MB/s (event parser) |
| Random access | Full | None (sequential events) |
| Exact frequency counts | Yes | Only when cardinality fits in memory; CMS otherwise |
| Exact percentiles | Yes (via sorting) | Approximate (DDSketch, 1% relative error) |
| Exact entropy | Yes | Approximate (from CMS estimates) |
| Motif detection | Full (Merkle subtree hashing) | Partial (incremental, no lookback) |
| Relationship discovery | Full (random access to value pairs) | Partial (co-occurrence counters) |
| Essence quality | Full | Slightly reduced (no selective subtree re-parse in Pass 1) |
Every streaming approximation carries formal error bounds. The output explicitly labels which statistics are exact and which are approximate.
When Each Mode Activates
Document size < streaming_threshold (default 100 MB)
-> DOM mode
Document size >= streaming_threshold
-> Streaming mode (automatic)
--streaming flag present
-> Streaming mode (forced, regardless of size)
The threshold is configurable in the TOML config:
[parsing]
streaming_threshold = 104_857_600 # 100 MB
The StreamAnalyzer Trait
Any analyzer that implements StreamAnalyzer can participate in streaming mode:
#![allow(unused)]
fn main() {
pub trait StreamAnalyzer {
type Accumulator: Default;
type Output;
fn on_event(&self, event: &JsonEvent, acc: &mut Self::Accumulator) -> Result<()>;
fn finalize(&self, acc: Self::Accumulator) -> Result<Self::Output>;
}
}
The accumulator holds all state. Events arrive one at a time. finalize produces the result when the stream ends.
This trait is the key to extensibility. Custom analyzers that implement it automatically work in both DOM and streaming modes — DOM mode simply feeds all events from the pre-parsed tree.
Differential Testing: DOM vs. Streaming
For every document in the test corpus, Vajra runs both modes and asserts:
- CMS frequency estimates are within proven error bounds of exact counts
- DDSketch quantile estimates are within relative accuracy of exact quantiles
- Path sets are identical
- Fingerprints are identical
- Type distributions are identical
This ensures streaming mode is not a second-class citizen. It is a formally bounded approximation of DOM mode, not a degraded fallback.