Input Formats
Vajra reads more than JSON. It reads anything that can be interpreted as structured data — and it auto-detects the format so you do not have to tell it.
Supported Formats
| Format | Extensions | Detection | Notes |
|---|---|---|---|
| JSON | .json | Content starts with { or [ | Primary format. Full DOM and streaming support. |
| NDJSON | .ndjson, .jsonl | Multiple JSON objects separated by newlines | Each line is a separate document. Batch analysis native. |
| YAML | .yaml, .yml | Content starts with --- or key-colon pattern | Multi-document YAML supported (separated by ---). |
| CSV | .csv | Comma-separated with consistent column count | First row treated as headers. Each row becomes a JSON object. |
| TSV | .tsv | Tab-separated with consistent column count | Same as CSV but tab-delimited. |
| Markdown | .md | Markdown structure with tables or code blocks | Tables extracted as arrays of objects. Code blocks parsed if JSON/YAML. |
| PDF | .pdf | PDF magic bytes (%PDF) | Text extracted and parsed for structured content. |
| Gzip | .gz, .json.gz | Gzip magic bytes (1f 8b) | Decompressed transparently. Inner format auto-detected. |
| Zstd | .zst, .json.zst | Zstd magic bytes (28 b5 2f fd) | Decompressed transparently. Inner format auto-detected. |
| HTTP URL | http://, https:// | URL scheme prefix | Fetched via blocking HTTP GET. Response body auto-detected. |
| Source Code | .rs, .py, .js, .ts, .go, .java, .c, .cpp, .rb | File extension matches known language | Parsed via tree-sitter into AST. Requires vajra-source feature. |
| Git Repository | (directory) | Directory contains .git/ | Reads commit history directly. See flags below. |
| V8 CPU Profile | .cpuprofile | File extension | Parses V8 .cpuprofile JSON into analyzable structure. |
| strace Summary | — | Content contains % time header | Parses strace -c summary output into structured records. |
| Stdin | - | Explicit - argument | Content auto-detected from first bytes. |
Auto-Detection Logic
When no --input-format is specified, Vajra detects the format in this order:
1. Check the argument. If it is -, read from stdin. If it starts with http:// or https://, fetch via HTTP. If it is a directory containing .git/, treat as a git repository.
2. Check the extension. .json -> JSON. .ndjson/.jsonl -> NDJSON. .yaml/.yml -> YAML. .csv -> CSV. .tsv -> TSV. .md -> Markdown. .pdf -> PDF. .cpuprofile -> V8 CPU Profile. .rs/.py/.js/.go/etc. -> Source Code (via tree-sitter).
3. Check for compression. If the extension is .gz or .zst, decompress and re-detect the inner format from the next extension (e.g., .json.gz -> decompress -> JSON).
4. Check content. If the extension is ambiguous or missing, read the first bytes:
   - Starts with { or [ after whitespace -> JSON
   - Multiple {...}\n sequences -> NDJSON
   - Starts with --- or matches key: value pattern -> YAML
   - Consistent comma-separated columns -> CSV
   - PDF magic bytes (%PDF) -> PDF
   - Contains % time column header -> strace summary
5. Fall back to JSON. If nothing else matches, attempt JSON parsing.
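The content checks in the detection order can be sketched in Python as follows. This is an illustrative approximation, not Vajra's implementation: the function name `sniff_format`, the YAML regular expression, and the rule that "several lines that each parse on their own" means NDJSON are all assumptions made for the example.

```python
import json
import re


def _is_json_line(line: str) -> bool:
    """True if the line parses as a complete JSON value on its own."""
    try:
        json.loads(line)
        return True
    except ValueError:
        return False


def sniff_format(text: str) -> str:
    """Guess a format name from content alone (hypothetical helper)."""
    stripped = text.lstrip()
    lines = [ln for ln in text.splitlines() if ln.strip()]

    if stripped.startswith(("{", "[")):
        # Assumption: multiple lines that each parse independently mean NDJSON.
        if len(lines) > 1 and all(_is_json_line(ln) for ln in lines):
            return "ndjson"
        return "json"
    if stripped.startswith("---") or re.match(r"[A-Za-z_][\w-]*:\s", stripped):
        return "yaml"
    if len(lines) > 1:
        comma_counts = {ln.count(",") for ln in lines}
        if len(comma_counts) == 1 and comma_counts.pop() > 0:
            return "csv"
    if stripped.startswith("%PDF"):
        return "pdf"
    if any("% time" in ln for ln in lines):
        return "strace"
    return "json"  # documented fallback when nothing else matches


print(sniff_format('{"a": 1}\n{"a": 2}\n'))
print(sniff_format("claim_id,status\nC001,denied\n"))
```

A real detector would read only the first bytes of the input rather than the whole text; the sketch trades that for brevity.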
Format Override
Force a specific format with --input-format:
vajra inspect data.txt --input-format json
vajra stats records.log --input-format ndjson
vajra inspect data.bin --input-format yaml
This overrides all auto-detection. Useful when files have nonstandard extensions.
Format Details
JSON
The primary format. Parsed by simd-json in DOM mode (full random access, rich analysis) or streaming mode (bounded memory, SAX-style events).
vajra inspect claim.json
echo '{"patient": "Martinez", "status": "active"}' | vajra inspect -
NDJSON (Newline-Delimited JSON)
Each line is an independent JSON document. Natural format for logs, event streams, and batch data.
vajra anomalies claims.ndjson
NDJSON records are aggregated into a single array for analysis. Commands like stats, anomalies, invariants, and essence compute across all records as a unified population.
Example input:
{"claim_id": "C001", "status": "adjudicated", "amount": 285.00}
{"claim_id": "C002", "status": "denied", "amount": 0.00}
{"claim_id": "C003", "status": "adjudicated", "amount": 47250.00}
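To make the aggregation concrete, here is a minimal Python sketch of turning NDJSON lines into one array and computing over it. The helper name `load_ndjson` is hypothetical, and the sketch only mirrors the behavior described above; it is not how Vajra parses the input internally.

```python
import json


def load_ndjson(text: str) -> list:
    """Parse each non-blank line as its own JSON document and collect them."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


sample = """\
{"claim_id": "C001", "status": "adjudicated", "amount": 285.00}
{"claim_id": "C002", "status": "denied", "amount": 0.00}
{"claim_id": "C003", "status": "adjudicated", "amount": 47250.00}
"""

records = load_ndjson(sample)
amounts = [r["amount"] for r in records]

# All three lines are now one population, so a statistic spans every record.
print(len(records), max(amounts))  # 3 47250.0
```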
YAML
Single-document and multi-document YAML both supported. Parsed via serde_yaml and converted to Vajra’s internal document model.
vajra inspect config.yaml
Multi-document YAML (separated by ---):
---
claim_id: C001
status: adjudicated
amount: 285.00
---
claim_id: C002
status: denied
amount: 0.00
vajra anomalies multi_claims.yaml
CSV
The first row is treated as column headers. Each subsequent row becomes a JSON object with header names as keys.
vajra stats claims.csv
Example input:
claim_id,status,charge_amount,allowed_amount
C001,adjudicated,285.00,210.00
C002,denied,125.00,
C003,adjudicated,890.00,675.00
Vajra converts this to:
[
{"claim_id": "C001", "status": "adjudicated", "charge_amount": "285.00", "allowed_amount": "210.00"},
{"claim_id": "C002", "status": "denied", "charge_amount": "125.00", "allowed_amount": ""},
{"claim_id": "C003", "status": "adjudicated", "charge_amount": "890.00", "allowed_amount": "675.00"}
]
Empty cells are preserved as empty strings, allowing missingness analysis to detect them.
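The same header-to-object conversion can be reproduced with Python's standard `csv` module. This is only a sketch of the documented behavior, using the example rows above; it says nothing about Vajra's own CSV reader.

```python
import csv
import io
import json

raw = """\
claim_id,status,charge_amount,allowed_amount
C001,adjudicated,285.00,210.00
C002,denied,125.00,
C003,adjudicated,890.00,675.00
"""

# DictReader uses the first row as keys and keeps every value as a string.
rows = list(csv.DictReader(io.StringIO(raw)))

# An empty cell stays "" rather than being dropped, so it can be counted.
missing = [r["claim_id"] for r in rows if r["allowed_amount"] == ""]

print(json.dumps(rows[1]))
print(missing)  # ['C002']
```

Note that numeric-looking cells such as 285.00 remain strings after conversion, matching the output shown above.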
TSV
Identical to CSV but tab-delimited. Same header-to-object conversion.
vajra stats data.tsv
vajra stats data.txt --input-format tsv
Markdown
Vajra extracts structured content from Markdown files:
- Tables are parsed into arrays of objects (headers become keys)
- JSON/YAML code blocks are parsed as embedded documents
vajra inspect report.md
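As a rough illustration of table extraction, the following Python sketch converts a pipe table into an array of objects with headers as keys. The function `parse_markdown_table` is a hypothetical helper and ignores edge cases such as escaped pipes or alignment markers; Vajra's extractor may behave differently.

```python
def parse_markdown_table(lines: list[str]) -> list[dict]:
    """Convert a pipe table (header, separator, rows) into a list of dicts."""

    def cells(line: str) -> list[str]:
        # Drop the outer pipes, then split on the inner ones.
        return [c.strip() for c in line.strip().strip("|").split("|")]

    header = cells(lines[0])
    # lines[1] is the |---|---| separator row and carries no data.
    return [dict(zip(header, cells(row))) for row in lines[2:] if row.strip()]


table = [
    "| Language | Extensions |",
    "|---|---|",
    "| Rust | .rs |",
    "| Go | .go |",
]

print(parse_markdown_table(table))
```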
PDF
Text is extracted from PDF files and parsed for any structured content (embedded tables, JSON fragments, structured text patterns).
vajra inspect document.pdf
PDF support depends on the pdf-extract crate. Complex layouts may lose structure during extraction.
Source Code
Vajra can analyze source code in any of the tree-sitter languages listed below. The source file is parsed into a concrete syntax tree (CST), converted to a JSON structure, and analyzed through the full Vajra pipeline: entropy, anomalies, fingerprinting, drift, motifs, and essence all work on code.
vajra inspect main.rs # auto-detect Rust
vajra stats app.py # auto-detect Python
vajra drift v1/server.go v2/server.go # code structural drift
vajra essence lib.rs --profile engineer # code essence
vajra inspect main.rs --lang rust # explicit language
vajra inspect code.txt --input-format source --lang python # override format + language
Supported languages (each enabled by a feature flag, all on by default):
| Language | Extensions | Feature Flag |
|---|---|---|
| Rust | .rs | rust |
| Python | .py, .pyi | python |
| JavaScript | .js, .mjs, .cjs, .jsx | javascript |
| TypeScript | .ts, .tsx, .mts, .cts | typescript |
| Go | .go | go |
| Java | .java | java |
| C | .c, .h | c |
| C++ | .cpp, .cc, .cxx, .hpp | cpp |
| Ruby | .rb | ruby |
What Vajra reveals on code:
| Analysis | What It Finds |
|---|---|
| Entropy of AST node types | Structural diversity — boilerplate vs complex code |
| Rarity of node types | Unusual constructs — goto, unsafe, eval |
| Nesting depth anomalies | Complexity hotspots |
| Fingerprint comparison | Structural clones across files |
| Drift between versions | Added functions, removed classes, changed signatures |
| Motifs | Repeated structural patterns — copy-paste code |
Source code analysis requires the vajra-source crate (included by default). The companion vajra-domain-source plugin adds recognizers for naming conventions (snake_case, camelCase, PascalCase) and code structure relationships.
Semantic Paths
The --semantic-paths flag maps tree-sitter node kinds to human-readable labels in the output. Instead of raw AST node names like function_item or impl_item, you see function and implementation.
vajra inspect main.rs --semantic-paths
Without --semantic-paths:
$.program.function_item[0].identifier "process_record"
$.program.function_item[0].parameters.parameter[0] "record: &Record"
$.program.impl_item[0].identifier "Pipeline"
With --semantic-paths:
$.program.function[0].name "process_record"
$.program.function[0].parameters.param[0] "record: &Record"
$.program.implementation[0].name "Pipeline"
Covers 9 languages: Rust, Python, JavaScript, TypeScript, Go, Java, C, C++, and Ruby.
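Conceptually, the flag applies a lookup from node kinds to friendlier labels on each path segment. The Python sketch below uses only the four mappings visible in the Rust example above; the `LABELS` table and the `to_semantic` function are hypothetical, and the real mapping covers far more node kinds per language.

```python
import re

# Only the mappings that appear in the example output; assumed, not exhaustive.
LABELS = {
    "function_item": "function",
    "impl_item": "implementation",
    "identifier": "name",
    "parameter": "param",
}


def to_semantic(path: str) -> str:
    """Rewrite each path segment whose node kind has a friendlier label."""

    def swap(match: re.Match) -> str:
        kind = match.group(0)
        return LABELS.get(kind, kind)

    # Segments are matched whole, so "parameters" is left alone
    # while "parameter" becomes "param".
    return re.sub(r"[A-Za-z_]+", swap, path)


print(to_semantic("$.program.function_item[0].parameters.parameter[0]"))
# $.program.function[0].parameters.param[0]
```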
Git Repository
When the input is a directory containing a .git/ subdirectory, Vajra reads the commit history directly — no export step required.
vajra stats ./my-repo
vajra cascade ./my-repo --entity-field '$.author' --time-field '$.date' --event-field '$.type' --response-values 'fix,revert'
Each commit becomes a JSON record with fields like author, date, message, files_changed, and insertions/deletions.
Flags:
| Flag | Description | Default |
|---|---|---|
| --git-limit <N> | Maximum number of commits to read | 500 |
| --git-branch <branch> | Branch to read from | current HEAD |
vajra stats ./my-repo --git-limit 1000 --git-branch main
Auto-detection is based on the presence of .git/ in the input directory. To override, use --input-format git.
V8 CPU Profile
Vajra parses .cpuprofile files produced by V8-based tools (Chrome DevTools, Node.js --cpu-prof). The profile’s call tree is converted to a flat array of records with function name, source location, hit count, and self/total time.
vajra stats profile.cpuprofile
vajra anomalies profile.cpuprofile
Auto-detected by the .cpuprofile extension.
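A simplified Python sketch of the flattening step is shown below. It relies on the standard .cpuprofile layout (`nodes` with `callFrame` and `hitCount`, plus parallel `samples` and `timeDeltas` arrays). The output field names and the self-time attribution are assumptions for illustration, and total time is omitted to keep the example short.

```python
from collections import defaultdict


def flatten_profile(profile: dict) -> list[dict]:
    """Turn a parsed .cpuprofile into flat per-node records (hypothetical shape)."""
    # Rough attribution: credit each sampling interval to the sampled node.
    self_us = defaultdict(int)
    for node_id, delta in zip(profile.get("samples", []), profile.get("timeDeltas", [])):
        self_us[node_id] += delta

    records = []
    for node in profile["nodes"]:
        frame = node["callFrame"]
        records.append({
            "function": frame["functionName"] or "(anonymous)",
            "url": frame["url"],
            "line": frame["lineNumber"],
            "hit_count": node.get("hitCount", 0),
            "self_us": self_us[node["id"]],
        })
    return records


profile = {
    "nodes": [
        {"id": 1, "callFrame": {"functionName": "(root)", "url": "", "lineNumber": -1, "columnNumber": -1},
         "hitCount": 1, "children": [2]},
        {"id": 2, "callFrame": {"functionName": "parse", "url": "file:///app.js", "lineNumber": 10, "columnNumber": 2},
         "hitCount": 2},
    ],
    "samples": [2, 2, 1],
    "timeDeltas": [100, 150, 50],
}

records = flatten_profile(profile)
print(records[1]["function"], records[1]["self_us"])  # parse 250
```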
strace Summary
Vajra parses the summary table produced by strace -c. Each syscall row becomes a record with fields for time percentage, seconds, calls, errors, and syscall name.
strace -c ls 2>&1 | vajra stats -
vajra stats strace_output.txt --input-format strace
Auto-detected when content contains the % time column header characteristic of strace -c output.
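For orientation, this Python sketch parses a strace -c table into records with the five fields named above. The record keys and the function `parse_strace_summary` are hypothetical; the sketch skips the usecs/call column and the total row, and treats a blank errors cell as 0.

```python
def parse_strace_summary(text: str) -> list[dict]:
    """Parse `strace -c` summary rows into dicts (illustrative only)."""
    records = []
    in_table = False
    for line in text.splitlines():
        if line.lstrip().startswith("% time"):
            in_table = True  # header row marks the start of the table
            continue
        if not in_table or line.lstrip().startswith("---"):
            continue  # skip preamble and rule lines
        parts = line.split()
        if len(parts) not in (5, 6) or parts[-1] == "total":
            continue
        records.append({
            "time_pct": float(parts[0]),
            "seconds": float(parts[1]),
            "calls": int(parts[3]),
            # The errors column is blank when a syscall never failed.
            "errors": int(parts[4]) if len(parts) == 6 else 0,
            "syscall": parts[-1],
        })
    return records


sample = """\
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 60.00    0.000120          24         5           mmap
 40.00    0.000080          40         2         1 access
------ ----------- ----------- --------- --------- ----------------
100.00    0.000200                     7         1 total
"""

records = parse_strace_summary(sample)
print([r["syscall"] for r in records])  # ['mmap', 'access']
```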
Compressed Files (Gzip, Zstd)
Compression is transparent. Vajra decompresses on the fly and auto-detects the inner format.
vajra inspect claims.json.gz
vajra stats archive.json.zst
This works with any inner format — claims.ndjson.gz, data.yaml.zst, report.csv.gz.
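Transparent decompression boils down to checking the leading bytes before parsing. The Python sketch below shows the gzip case with the standard library; `read_maybe_gzip` is a hypothetical helper, and Zstd is left out because it would need a third-party package.

```python
import gzip
import json

GZIP_MAGIC = b"\x1f\x8b"


def read_maybe_gzip(data: bytes) -> bytes:
    """Return decompressed bytes if data is gzip, otherwise the input unchanged."""
    if data[:2] == GZIP_MAGIC:
        return gzip.decompress(data)
    return data


payload = gzip.compress(b'{"claim_id": "C001"}')
inner = read_maybe_gzip(payload)

# After decompression the inner bytes go through normal format detection.
print(json.loads(inner))  # {'claim_id': 'C001'}
```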
HTTP URLs
Vajra fetches the URL via blocking HTTP GET and analyzes the response body.
vajra inspect https://api.example.com/v1/claims/12345
vajra stats https://data.example.com/feed.ndjson
The response content type and body are used for format detection. No authentication headers are supported in the current version — for authenticated endpoints, fetch with curl and pipe to stdin:
curl -H "Authorization: Bearer $TOKEN" https://api.example.com/data | vajra inspect -
Stdin
The - argument reads from standard input. Format is auto-detected from the content.
cat claim.json | vajra inspect -
curl https://api.example.com/data | vajra stats -
jq '.claims[]' data.json | vajra anomalies -
zcat claims.json.gz | vajra inspect -
Multi-Document Formats
NDJSON and multi-document YAML naturally contain multiple documents. NDJSON records are aggregated into a single array, so all commands, including stats, anomalies, invariants, and essence, compute across all records as a unified population.
vajra anomalies claims.ndjson # analyzes all lines as a batch
vajra stats claims.ndjson # computes stats across all records
Directory Input
When the input is a directory path, Vajra discovers all supported files:
vajra batch ./claims/ # processes all files in the directory
vajra cluster ./claims/ # clusters all files in the directory
Subdirectories are not traversed recursively by default.
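The non-recursive behavior can be pictured with the short Python sketch below. The `SUPPORTED` set is an assumed subset of extensions chosen for the example, and `discover` is a hypothetical helper rather than Vajra's discovery logic.

```python
import tempfile
from pathlib import Path

# Assumed subset of extensions, for illustration only.
SUPPORTED = {".json", ".ndjson", ".jsonl", ".yaml", ".yml", ".csv", ".tsv", ".md"}


def discover(directory: Path) -> list[str]:
    """List supported files directly inside directory, ignoring subdirectories."""
    return sorted(
        p.name for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.json").write_text("{}")
    (root / "notes.bin").write_text("")
    (root / "nested").mkdir()
    (root / "nested" / "b.json").write_text("{}")
    found = discover(root)

print(found)  # ['a.json']
```

The file in `nested/` is not picked up because only the top level of the directory is listed.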