anomalyx

License: MIT OR Apache-2.0

Contract-first anomaly detection over arbitrary corpora.

anomalyx is a deterministic Rust CLI built on the thesis of AI Tools Need Contracts, Not Prompts: the executable is the contract. Point it at ~30 formats — logs, security telemetry, packet captures, flow records, observability streams, spreadsheets, and data-lake files (the full set) — and it normalizes each into one typed record model, runs a battery of typed anomaly detectors, and returns a dense, versioned, machine-readable envelope an agent (or a human) can trust — not pretty text that has to be scraped.

$ printf 'id,amount\n1,10\n2,11\n3,9\n4,10\n5,12\n6,11\n7,10\n8,9\n9,9999\n' | anomalyx scan
{"protocol":"anomalyx/tq1",...,"rows":[[0,1,2,1.0,3,4544.43,4]],...,"exit":1}

$ ... | anomalyx explain cell:amount:8
{"evidence":{"kind":"cell","column":"amount","row":8,"value":{"t":"int","v":9999}},"findings":[...]}

Why it exists

Humans paper over vague tools with context and memory; agents can’t. A tool whose behavior lives in prose, convention, and tribal knowledge is one an agent will eventually step on. anomalyx is shaped as an executable contract:

  • A minimal, discoverable surface — four verbs: describe, schema, scan, explain.
  • Typed, dense output — a versioned tq1 JSON envelope with a dictionary-pinned string table and stable evidence handles, not prose.
  • Determinism as UX — same input + same config fingerprint yields byte-identical output. No wall-clock, no RNG in the measurement path.
  • Honest absence — a detector that can’t run says so; it never fabricates a clean result. Exit codes are committed: 0 clean, 1 anomalies, 2 error.

What makes it trustworthy

  • Nine detectors across a seven-class taxonomy — point, distributional, structural, multivariate, contextual, collective, and cadence anomalies.
  • Any corpus: CSV, TSV, NDJSON, JSON, Parquet, Arrow IPC, and the rest of the supported formats, all lowered to one engine-independent record model.
  • Proven correct — the statistical core is validated against the NIST Statistical Reference Datasets (certified to 15 digits), and every crate passes a 0-surviving-mutant mutation gate on top of property-based tests.

Start with Install, then the four-verb contract.

Install

From crates.io

cargo install anomalyx

This installs the anomalyx binary. It pulls in the library crates (anomalyx-core, anomalyx-normalize, anomalyx-detect) automatically.

From source

git clone https://github.com/copyleftdev/anomalyx
cd anomalyx
cargo install --path crates/anomalyx

Feature flags

Binary columnar formats (Parquet, Arrow IPC) are read through the Polars backbone, behind the default-on polars feature of anomalyx-normalize. A lean, text-only build drops that heavy dependency:

# text formats only (CSV / TSV / NDJSON / JSON), no Polars
cargo build -p anomalyx-normalize --no-default-features

Without the feature, a Parquet/Arrow input fails cleanly with an explicit “requires the ‘polars’ feature” error rather than misbehaving — honest absence at the build level.

Using the libraries

The detection engine is usable as a library. The crates.io packages are namespaced (anomalyx-*) but expose conventional module names:

[dependencies]
anomalyx-core = "0.1"
anomalyx-detect = "0.1"
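anomalyx-normalize = "0.1"

use ax_detect::{Registry, ScanContext, DetectConfig};

// `bytes` holds the raw contents of data.csv (for example from std::fs::read);
// the `?` assumes an enclosing function that returns a compatible Result.
let rs = ax_normalize::normalize("data.csv", &bytes)?;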
let report = Registry::default_set()
    .run(&ScanContext::single(&rs), &DetectConfig::default());

The four-verb contract

anomalyx exposes a deliberately small, discoverable surface. An agent can answer “what is this, what does it produce, what did it find, and why” with four verbs.

anomalyx describe                                     Protocol metadata
anomalyx schema                                       JSON Schema of scan output
anomalyx scan [--baseline B] [--period N] [--cadence COL] [PATH]
anomalyx explain <HANDLE> [--baseline B] [--period N] [--cadence COL] [PATH]

Input is a PATH or stdin (-). Exit codes are part of the contract:

code  meaning
0     clean (no anomalies)
1     anomalies found
2     tool error (bad input, unresolved handle, …)

describe — what this is

Emits protocol metadata: the supported input formats, the registered detectors, the anomaly classes, the exit-code semantics, and the current deterministic config fingerprint. Everything is derived from the same registries scan uses, so the description can’t drift from behavior.

schema — the shape of the output

Emits a JSON Schema (draft 2020-12) pinning the tq1 envelope. Validate against it instead of reverse-engineering field names. See The tq1 envelope.

scan — normalize, then detect

Reads the corpus, normalizes it to the internal record model, runs every detector, and prints one dense tq1 envelope.

$ anomalyx scan sales.csv
{"protocol":"anomalyx/tq1", ... ,"exit":1}

explain — drill into a finding

Findings carry a stable handle (e.g. cell:amount:8, dist:score, row:42, range:ts:20:40). explain resolves one back to its underlying evidence, and re-attaches any findings pointing at it. An unresolvable handle fails cleanly with exit 2 — never a fabricated hit.

$ anomalyx explain cell:amount:8 sales.csv
{"protocol":"anomalyx/tq1","handle":"cell:amount:8",
 "evidence":{"kind":"cell","column":"amount","row":8,"value":{"t":"int","v":9999}},
 "findings":[{"detector":"point.modz","class":"point","confidence":1.0, ... }]}

Stability (1.0)

As of 1.0, the tq1 contract is stable and committed. An agent can rely on these without pinning a patch version:

  • the protocol id anomalyx/tq1 (envelope::PROTOCOL);
  • the exit codes: 0 clean, 1 anomalies found, 2 error;
  • the dense finding-row layout ([detector, class, handle, confidence, severity, score, reason]) and the dictionary-pinned string table;
  • the handle forms (column: / cell: / row: / range: / dist:) and their canonical string shapes;
  • the envelope’s required fields and the severity ladder (info < low < medium < high < critical).

Breaking any of these requires a major bump and a PROTOCOL change — they will not change quietly under 1.x.

What still evolves additively under 1.x: new detectors, new input formats, new optional CLI flags, and new optional envelope fields (consumers must ignore unknown fields). Anything that changes detector output for a given input — a new threshold default, a recalibration — moves the config_version fingerprint, so “the tool changed” stays distinguishable from “the data changed.” Determinism remains absolute: same input + same config_version ⇒ byte-identical output.

Anomaly taxonomy

“Anomaly” is not one thing. anomalyx classifies every finding into one of seven classes, so you reason about the kind of deviation, not just that “something is off.” Nine detectors implement the taxonomy today.

Class           What it catches                                        Detector(s)
point           a single value far from its column’s distribution      point.modz
distributional  the distribution shifted vs. a baseline                dist.ks, dist.psi, dist.chi2
structural      schema / type / null-rate / cardinality violations     struct.schema
multivariate    a row that breaks the joint structure across columns   mv.mahalanobis
contextual      a value anomalous only in context (seasonal)           ctx.seasonal
collective      a subsequence that is jointly anomalous (level shift)  coll.cusum
cadence         timing too regular to be organic (automation)          cad.regularity

Every detector is deterministic — no RNG, no wall-clock — which is what lets anomalyx meet its byte-reproducibility guarantee. Where an off-the-shelf method would fight that (an isolation forest’s RNG, for instance), anomalyx uses a deterministic equivalent.

point — point.modz

Per-column univariate outliers via the Iglewicz–Hoaglin modified z-score, M = 0.6745·(x − median)/MAD. MAD (median absolute deviation) is robust: a few wild values don’t inflate the spread and mask each other. Falls back to mean/σ when MAD collapses; a truly constant column flags nothing. Emits a cell handle.
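The modified z-score above can be sketched in a few lines of std-only Rust. This is an illustration of the stated formula, not anomalyx's implementation: the function names are hypothetical, the 3.5 cutoff is the one Iglewicz and Hoaglin suggest (whether it is the tool's default is an assumption), and the mean/σ fallback is left out, so a column whose MAD is zero simply flags nothing here.

```rust
/// Median of a slice, or None if it is empty.
fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut v = values.to_vec();
    v.sort_by(|a, b| a.total_cmp(b)); // total order: deterministic even with NaN
    let mid = v.len() / 2;
    Some(if v.len() % 2 == 0 { (v[mid - 1] + v[mid]) / 2.0 } else { v[mid] })
}

/// Modified z-scores M = 0.6745 * (x - median) / MAD.
/// Returns None when MAD is zero, where the formula is undefined.
fn modified_z_scores(values: &[f64]) -> Option<Vec<f64>> {
    let med = median(values)?;
    let abs_dev: Vec<f64> = values.iter().map(|x| (x - med).abs()).collect();
    let mad = median(&abs_dev)?;
    if mad == 0.0 {
        return None;
    }
    Some(values.iter().map(|x| 0.6745 * (x - med) / mad).collect())
}

/// Row indices whose |M| exceeds `threshold`.
fn flag_outliers(values: &[f64], threshold: f64) -> Vec<usize> {
    match modified_z_scores(values) {
        Some(scores) => scores
            .iter()
            .enumerate()
            .filter(|(_, m)| m.abs() > threshold)
            .map(|(i, _)| i)
            .collect(),
        // Sketch only: the mean/σ fallback described above is not implemented.
        None => Vec::new(),
    }
}
```

Because the median and MAD barely move when one value explodes, a single wild cell is flagged without dragging its neighbours over the threshold.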

distributional — dist.ks / dist.psi / dist.chi2

Compare the current corpus against a --baseline:

  • dist.ks — two-sample Kolmogorov–Smirnov on numeric columns (shape/location shift), with an asymptotic p-value.
  • dist.psi — Population Stability Index over baseline-quantile bins (how much mass moved); the binned cousin of KL divergence.
  • dist.chi2 — chi-square over category frequencies for categorical columns; also surfaces brand-new categories.

Without a baseline these report honest absence. Emit dist handles.
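To make the first of these concrete, here is a minimal sketch of the two-sample Kolmogorov–Smirnov statistic D, the largest gap between the two empirical CDFs. The function name is hypothetical, dropping non-finite values is an assumption of this sketch, and the asymptotic p-value is not shown.

```rust
/// Two-sample KS statistic D = max over t of |F_a(t) - F_b(t)|.
/// Returns None when either sample has no finite values.
fn ks_statistic(a: &[f64], b: &[f64]) -> Option<f64> {
    // Keep finite values only and sort them (assumption of this sketch).
    let clean = |s: &[f64]| {
        let mut v: Vec<f64> = s.iter().copied().filter(|x| x.is_finite()).collect();
        v.sort_by(|x, y| x.total_cmp(y));
        v
    };
    let (a, b) = (clean(a), clean(b));
    if a.is_empty() || b.is_empty() {
        return None;
    }
    let (n, m) = (a.len(), b.len());
    let (mut i, mut j, mut d) = (0usize, 0usize, 0.0_f64);
    while i < n && j < m {
        // Step both empirical CDFs past the next distinct value t,
        // consuming ties on both sides before comparing.
        let t = a[i].min(b[j]);
        while i < n && a[i] <= t {
            i += 1;
        }
        while j < m && b[j] <= t {
            j += 1;
        }
        let gap = (i as f64 / n as f64 - j as f64 / m as f64).abs();
        d = d.max(gap);
    }
    Some(d)
}
```

D is 0 for identical samples and 1 for samples that do not overlap at all, and swapping the arguments leaves it unchanged.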

structural — struct.schema

Shape, not values. Single-corpus: columns with conflicting cell types (Mixed) and columns whose null fraction exceeds a threshold. With a --baseline: a schema diff — columns added, dropped, or whose inferred type changed. Emits col handles.

multivariate — mv.mahalanobis

A row can be unremarkable on every axis yet a glaring joint outlier, e.g. it breaks the correlation the rest of the data obeys. The Mahalanobis distance measures distance from the centroid in units that account for each feature’s spread and the covariance between features. Under multivariate normality the squared distance follows χ²(d) for d features, so a principled per-row p-value falls out. The detector uses its own deterministic Cholesky solve, no RNG. Emits a row handle.
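A compact sketch of the idea, assuming rectangular rows and a sample covariance with an n − 1 denominator (both assumptions; the names are hypothetical and this is not the crate's code). It factors the covariance with Cholesky and gets each squared distance from one forward substitution, so no matrix inverse is formed. The χ² p-value is not shown.

```rust
/// Lower-triangular Cholesky factor of a symmetric positive-definite matrix.
/// Returns None if the matrix is not positive definite.
fn cholesky(a: &[Vec<f64>]) -> Option<Vec<Vec<f64>>> {
    let d = a.len();
    let mut l = vec![vec![0.0; d]; d];
    for i in 0..d {
        for j in 0..=i {
            let s: f64 = (0..j).map(|k| l[i][k] * l[j][k]).sum();
            if i == j {
                let v = a[i][i] - s;
                if v <= 0.0 {
                    return None;
                }
                l[i][j] = v.sqrt();
            } else {
                l[i][j] = (a[i][j] - s) / l[j][j];
            }
        }
    }
    Some(l)
}

/// Squared Mahalanobis distance of every row from the centroid.
/// Assumes all rows have the same length.
fn mahalanobis_sq(rows: &[Vec<f64>]) -> Option<Vec<f64>> {
    let n = rows.len();
    let d = rows.first()?.len();
    if n < 2 || d == 0 {
        return None;
    }
    let mean: Vec<f64> = (0..d)
        .map(|j| rows.iter().map(|r| r[j]).sum::<f64>() / n as f64)
        .collect();
    // Sample covariance matrix.
    let mut cov = vec![vec![0.0; d]; d];
    for r in rows {
        for i in 0..d {
            for j in 0..d {
                cov[i][j] += (r[i] - mean[i]) * (r[j] - mean[j]);
            }
        }
    }
    for row in cov.iter_mut() {
        for c in row.iter_mut() {
            *c /= (n - 1) as f64;
        }
    }
    let l = cholesky(&cov)?;
    // D² = |z|² where L z = (x - mean), solved by forward substitution.
    Some(
        rows.iter()
            .map(|r| {
                let mut z = vec![0.0; d];
                for i in 0..d {
                    let s: f64 = (0..i).map(|k| l[i][k] * z[k]).sum();
                    z[i] = (r[i] - mean[i] - s) / l[i][i];
                }
                z.iter().map(|v| v * v).sum::<f64>()
            })
            .collect(),
    )
}
```

A point such as (2, 8) in data that otherwise follows y ≈ x on 1..8 is ordinary on each axis but receives by far the largest distance. A useful sanity check is that the squared distances sum to d·(n − 1) when the sample covariance is used.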

contextual — ctx.seasonal

A daytime traffic level at 3am; a weekday volume on a Sunday. Given a period --period N, each point is scored only against its own phase (row mod N) — its seasonal peers — using the same robust modified z-score. Seasonality is never guessed: without a period it reports honest absence.
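The phase-peer scoring can be sketched as follows, assuming the same modified z-score as the point detector and leaving a phase unscored when its MAD is zero (an assumption of this sketch; the names are hypothetical).

```rust
/// Median of a slice, or None if it is empty.
fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut v = values.to_vec();
    v.sort_by(|a, b| a.total_cmp(b));
    let mid = v.len() / 2;
    Some(if v.len() % 2 == 0 { (v[mid - 1] + v[mid]) / 2.0 } else { v[mid] })
}

/// Score each point against its phase peers (row index mod `period`).
/// Returns None when no usable period is declared (period < 2).
fn seasonal_scores(series: &[f64], period: usize) -> Option<Vec<Option<f64>>> {
    if period < 2 {
        return None;
    }
    let mut scores = vec![None; series.len()];
    for phase in 0..period {
        // Rows phase, phase + period, phase + 2*period, ... are peers.
        let idx: Vec<usize> = (phase..series.len()).step_by(period).collect();
        let peers: Vec<f64> = idx.iter().map(|&i| series[i]).collect();
        let Some(med) = median(&peers) else { continue };
        let dev: Vec<f64> = peers.iter().map(|x| (x - med).abs()).collect();
        let Some(mad) = median(&dev) else { continue };
        if mad == 0.0 {
            continue; // sketch: a flat phase stays unscored
        }
        for &i in &idx {
            scores[i] = Some(0.6745 * (series[i] - med) / mad);
        }
    }
    Some(scores)
}
```

With period 2, a value of 50 in a phase that normally sits near 0 scores far above any reasonable cutoff, while the same 50 in the phase that normally sits near 50 scores close to zero.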

collective — coll.cusum

A sustained shift in level is the canonical collective anomaly. CUSUM finds the change point that maximizes the cumulative deviation from the mean; when the standardized two-segment shift is large, the post-change segment is flagged as a range handle.
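A rough sketch of that procedure: locate the prefix whose cumulative deviation from the overall mean is largest in magnitude, then compare the two segments. The pooled-standard-deviation standardization is an assumption, since the exact statistic is not spelled out here, and the function names are hypothetical.

```rust
/// Index at which the post-change segment starts, or None if the series
/// shows no deviation from its mean (or is too short).
fn cusum_change_point(x: &[f64]) -> Option<usize> {
    let n = x.len();
    if n < 2 {
        return None;
    }
    let mean = x.iter().sum::<f64>() / n as f64;
    let mut s = 0.0_f64;
    let mut best = (0.0_f64, 0usize);
    // The full-length prefix always sums to ~0, so stop one short of the end.
    for k in 0..n - 1 {
        s += x[k] - mean;
        if s.abs() > best.0 {
            best = (s.abs(), k + 1);
        }
    }
    if best.0 == 0.0 { None } else { Some(best.1) }
}

/// |mean_after - mean_before| divided by the pooled standard deviation.
fn standardized_shift(x: &[f64], split: usize) -> Option<f64> {
    if split == 0 || split >= x.len() || x.len() < 3 {
        return None;
    }
    let (before, after) = x.split_at(split);
    let mean = |s: &[f64]| s.iter().sum::<f64>() / s.len() as f64;
    let ss = |s: &[f64], m: f64| s.iter().map(|v| (v - m).powi(2)).sum::<f64>();
    let (mb, ma) = (mean(before), mean(after));
    let pooled = ((ss(before, mb) + ss(after, ma)) / (x.len() - 2) as f64).sqrt();
    if pooled == 0.0 {
        // A noiseless step: infinitely large shift, or no shift at all.
        return if ma != mb { Some(f64::INFINITY) } else { None };
    }
    Some((ma - mb).abs() / pooled)
}
```

The returned split index marks the start of the post-change segment, which corresponds to the start of the range a collective finding would cover.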

cadence — cad.regularity

The inverse of every other detector: timing too regular to be organic — the metronomic signature of automation. On a column named by --cadence COL, the inter-arrival intervals’ coefficient of variation (CV = σ/μ) near zero is the tell. Opt-in, because which column means “time” is never guessed.

Input & normalization

“Given any corpus of information regardless of its format, we’ll normalize it.”

anomalyx meets your data where it already lives. Every supported format — whether a packet capture, a SIEM event stream, a Kubernetes manifest, or a data-lake file — is lowered to one engine-independent record model, a RecordSet of named, typed columns, and the detectors only ever see that. The contract stays stable while the backend underneath it changes.

Supported formats

32 built-in parsers across five domains. Each is an independent plugin (crates/ax-normalize/src/parsers/); adding one doesn’t touch the others.

Tabular & structured data

Format         Extensions               Notes
CSV / TSV      .csv, .tsv, .tab         lean deterministic reader
NDJSON / JSON  .ndjson, .jsonl, .json   array, object, or one-record-per-line
YAML           .yaml, .yml              Kubernetes / CI manifests; multi-document
TOML / INI     .toml, .ini, .cfg, .conf config drift via struct.schema
XML            .xml, .nessus            Nessus/OpenVAS, SOAP; repeated element → rows

Columnar, data-lake & databases

Format       Extensions                     Backend
Parquet      .parquet, .pq                  Polars / Arrow
Arrow IPC    .arrow, .ipc, .feather         Polars / Arrow
Avro         .avro                          apache-avro
ORC          .orc                           orc-rust → Arrow
Excel / ODS  .xlsx, .xls, .xlsb, .ods       calamine (first sheet)
SQLite       .db, .sqlite, .sqlite3, .db3   rusqlite (first table, in-memory deserialize)

Logs & observability

Format                             Detected by               Anomaly angle
logfmt                             key=value shape           structured app logs
Web access logs (Combined/Common)  [time] "request" status   status-mix dist, latency point, bursts coll
syslog (RFC 3164 / 5424)           <PRI> header              event-rate dist, off-hours contextual
systemd journal                    journalctl -o json        event-rate cadence/coll, rare-unit dist
Prometheus / OpenMetrics           exposition lines          per-series point spikes, dist drift
OpenTelemetry (OTLP/JSON)          resourceSpans             span-duration point, error-rate dist, emit cadence

Security telemetry

Format                    Detected by                         Anomaly angle
Zeek (conn.log family)    #separator header                   connection analytics
CEF / LEEF                CEF: / LEEF: prefix                 signature/category mix shift via dist.chi2
auditd                    msg=audit(                          exec/syscall mix dist, bursty activity coll
EVTX (Windows Event Log)  ElfFile magic                       rare event-ID point, logon dist, off-hours contextual
Suricata/Zeek EVE         event_type + timestamp              alert-type drift via dist.chi2; new classes surface
osquery results           hostIdentifier + columns/snapshot   fleet-posture drift via structural/dist
AWS CloudTrail            Records[].eventName                 off-hours contextual/cadence, rare-API dist

Network

Format                        Detected by                      Anomaly angle
PCAP / PCAPNG                 libpcap / SHB magic              beaconing/C2 via cadence on inter-arrival times
NetFlow / IPFIX (nfdump CSV)  nfdump header                    exfil via mv.mahalanobis on (bytes, packets, duration)
AWS VPC Flow Logs             srcaddr dstaddr dstport header   same flow anomalies, zero new infra
DNS query logs (dnsmasq)      query[TYPE] … from               DGA/exfil via point on name entropy/length + cadence

Several parsers compute the features the detectors want rather than just extracting fields: DNS query-name Shannon entropy and length, flow duration (end - start), span durationNanos, and normalized epoch timestamps. They also rename cryptic source fields to a canonical schema (e.g. nfdump ibyt → bytes, td → duration).

Resolution

Format is resolved by file extension first, then by content sniff — binary magic numbers (PAR1, ORC, SQLite format 3\0, …) are checked at high confidence, then distinctive text signatures, then a CSV last-resort fallback. Resolution is deterministic: the highest-confidence match wins, ties break by registration order. An unrecognized stream is an explicit error, never a silent guess.
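The resolution order can be pictured with a small sketch. Everything here is hypothetical and for illustration only: the `Parser` struct, the sniff functions, and the confidence numbers are not the crate's FormatParser API. What it does mirror is the stated order: extension first, then the highest-confidence sniff with ties going to the earlier registration, and no match meaning an explicit failure.

```rust
/// Hypothetical parser descriptor (not the crate's real trait).
struct Parser {
    id: &'static str,
    extensions: &'static [&'static str],
    /// Confidence if the bytes look like this format (scale is made up).
    sniff: fn(&[u8]) -> Option<u8>,
}

fn sniff_parquet(b: &[u8]) -> Option<u8> {
    b.starts_with(b"PAR1").then_some(90)
}

fn sniff_sqlite(b: &[u8]) -> Option<u8> {
    b.starts_with(b"SQLite format 3\0").then_some(90)
}

/// Last-resort text fallback: first line contains a comma.
fn sniff_csv(b: &[u8]) -> Option<u8> {
    let text = std::str::from_utf8(b).ok()?;
    text.lines().next()?.contains(',').then_some(10)
}

fn demo_registry() -> Vec<Parser> {
    vec![
        Parser { id: "parquet", extensions: &["parquet", "pq"], sniff: sniff_parquet },
        Parser { id: "sqlite", extensions: &["db", "sqlite", "sqlite3", "db3"], sniff: sniff_sqlite },
        Parser { id: "csv", extensions: &["csv"], sniff: sniff_csv },
    ]
}

/// Extension first, then the highest-confidence sniff.
/// None means "unrecognized": the caller reports an error, never a guess.
fn resolve<'a>(parsers: &'a [Parser], path: Option<&str>, bytes: &[u8]) -> Option<&'a Parser> {
    let ext = path
        .and_then(|p| p.rsplit_once('.'))
        .map(|(_, e)| e.to_ascii_lowercase());
    if let Some(ext) = ext {
        if let Some(p) = parsers.iter().find(|p| p.extensions.iter().any(|e| *e == ext)) {
            return Some(p);
        }
    }
    let mut best: Option<(u8, &'a Parser)> = None;
    for p in parsers {
        if let Some(conf) = (p.sniff)(bytes) {
            // Strict `>` keeps the earlier-registered parser on a tie.
            if best.map_or(true, |(top, _)| conf > top) {
                best = Some((conf, p));
            }
        }
    }
    best.map(|(_, p)| p)
}
```

Because the loop only replaces the current best on a strictly higher confidence, the outcome depends on the registry order and the bytes alone, which keeps resolution deterministic.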

Several formats deliberately claim no extension (Zeek, syslog content, journald, EVE, osquery, auditd, DNS, NetFlow, VPC) because their files are generically *.log/*.json; pipe them on stdin and the content signature routes them.

Feature flags & the lean build

The binary and heavyweight parsers sit behind default-on feature flags, so a default build reads everything but a --no-default-features build is a lean, text-only normalizer with no binary dependencies:

Feature   Parsers
polars    Parquet, Arrow IPC
evtx      EVTX
pcap      PCAP / PCAPNG
xlsx      Excel / ODS
sqlite    SQLite
datalake  Avro, ORC

The record model

A RecordSet is named columns of equal length, each with an inferred type: Int, Float, Bool, Str, Unknown, or Mixed (conflicting concrete types — itself a structural signal). Values collapse into a small closed set, and absence is explicit: a missing cell is Null, never a sentinel 0.0 that would skew a mean.

amount,tier        →   column "amount": Int   [10, 11, 9, …]
10,a                   column "tier":   Str   ["a", "b", "c", …]
11,b
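As an illustration of the model described above, here is a minimal sketch of a closed value set with explicit Null and a column-type inference over it. The type and function names are hypothetical, and treating any two different concrete types (including Int next to Float) as Mixed is an assumption of this sketch rather than documented behavior.

```rust
/// Closed set of cell values; absence is explicit.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Float(f64),
    Bool(bool),
    Str(String),
    Null,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Int,
    Float,
    Bool,
    Str,
    Unknown,
    Mixed,
}

/// Infer a column type: Nulls carry no evidence, one concrete type wins,
/// and conflicting concrete types collapse to Mixed.
fn infer_type(cells: &[Value]) -> ColumnType {
    let mut seen: Option<ColumnType> = None;
    for cell in cells {
        let t = match cell {
            Value::Int(_) => ColumnType::Int,
            Value::Float(_) => ColumnType::Float,
            Value::Bool(_) => ColumnType::Bool,
            Value::Str(_) => ColumnType::Str,
            Value::Null => continue,
        };
        match seen {
            None => seen = Some(t),
            Some(prev) if prev != t => return ColumnType::Mixed,
            _ => {}
        }
    }
    seen.unwrap_or(ColumnType::Unknown)
}

/// Mean over present numeric cells only; a Null is skipped, not read as 0.0.
fn mean_present(cells: &[Value]) -> Option<f64> {
    let nums: Vec<f64> = cells
        .iter()
        .filter_map(|c| match c {
            Value::Int(i) => Some(*i as f64),
            Value::Float(f) => Some(*f),
            _ => None,
        })
        .collect();
    if nums.is_empty() {
        None
    } else {
        Some(nums.iter().sum::<f64>() / nums.len() as f64)
    }
}
```

For the cells 10, Null, 20 the mean is 15; had the missing cell been stored as a sentinel 0.0, it would have been 10.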

Binary and library-backed formats live entirely behind this boundary: a Polars DataFrame, an Arrow RecordBatch, a calamine sheet, or a SQLite row is converted to a RecordSet (integers fold to i64, floats to f64 with non-finite → Null, unsupported logical types preserved as their string form), so no library type ever reaches a detector. Text formats touch none of it.

Scan modes

A plain scan runs the single-corpus detectors (point, structural shape checks). Three flags activate the rest; when a flag is absent, the detectors it would enable report honest absence rather than guessing. A further pair of flags, --columns / --exclude, narrows which columns are analyzed at all.

--baseline B — drift & schema diff

Compares the current corpus against baseline B. Activates the distributional detectors (dist.ks, dist.psi, dist.chi2) and the schema-diff half of struct.schema.

$ anomalyx scan --baseline last_week.parquet this_week.parquet
# flags columns whose distribution shifted, plus added/dropped/type-changed columns

The envelope gains a baseline field recording the comparison source.

--period N — seasonal / contextual

Treats rows as an ordered time series of period N and runs ctx.seasonal, comparing each point to its phase peers (row mod N).

$ anomalyx scan --period 7 daily_metrics.csv     # weekly seasonality

A value can be perfectly ordinary globally yet wrong for its phase — e.g. a 50 where phase 0 normally sits near 0. Without --period, ctx.seasonal is honestly absent; seasonality is never inferred.

--cadence COL — metronomic timing

Reads column COL as event times and runs cad.regularity, flagging suspiciously regular inter-arrival intervals (automation).

$ anomalyx scan --cadence ts events.csv
# flags COL if its inter-arrival coefficient of variation is near zero

Organic streams are ragged; a metronome is a tell. Opt-in, because which column means “time” is never guessed.

The regularity bar is the inter-arrival coefficient of variation (CV = stddev / mean); cad.regularity fires when CV is below a threshold (default 0.05). Tune it with --cad-max-cv F:

$ anomalyx scan --cadence timestamp beacon.pcap                    # default 0.05
$ anomalyx scan --cadence timestamp --cad-max-cv 0.15 beacon.pcap  # catch jittered beacons

A perfectly periodic beacon has CV ≈ 0; real C2 channels add timing jitter to evade exactly this kind of test. A ~10% jitter (CV ≈ 0.10) slips past the default but is caught at --cad-max-cv 0.15 — at the cost of flagging more merely-regular traffic. The threshold is folded into the envelope’s config_version (cdcv=), so a non-default bar is a versioned, reproducible choice, never a hidden one.
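The threshold trade-off is easy to reproduce with a short sketch. It assumes numeric, already-ordered event times and a population standard deviation (the σ convention is an assumption), and the function names are hypothetical.

```rust
/// Coefficient of variation (σ/μ) of the gaps between consecutive event times.
/// Returns None with fewer than two gaps or a non-positive mean gap.
fn inter_arrival_cv(times: &[f64]) -> Option<f64> {
    if times.len() < 3 {
        return None;
    }
    let gaps: Vec<f64> = times.windows(2).map(|w| w[1] - w[0]).collect();
    let n = gaps.len() as f64;
    let mean = gaps.iter().sum::<f64>() / n;
    if mean <= 0.0 {
        return None;
    }
    // Population variance of the gaps (assumed convention).
    let var = gaps.iter().map(|g| (g - mean).powi(2)).sum::<f64>() / n;
    Some(var.sqrt() / mean)
}

/// True when the stream is more regular than `max_cv` allows.
fn is_metronomic(times: &[f64], max_cv: f64) -> bool {
    matches!(inter_arrival_cv(times), Some(cv) if cv < max_cv)
}
```

Gaps of exactly 60 give CV = 0 and trip the default bar of 0.05. Gaps alternating between 54 and 66 give CV = 0.10, which passes at 0.05 but is flagged once the bar is raised to 0.15.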

Rows are treated in their given order as the time axis. If your data isn’t already time-ordered, sort it first.

Column roles (and --no-column-roles)

Every scanned column is classified into a role (measurement, identifier, categorical, sequence, or constant), and the full map ships in the envelope’s roles array. Detectors consult it to skip columns where their statistic is meaningless: the point detector ignores identifier and sequence columns, because a “large process-id” or the endpoint of a monotonic counter is not an anomaly.

$ anomalyx scan app.log            # roles on (default)
$ anomalyx scan --no-column-roles app.log   # report roles, but skip nothing

Identifiers are recognized by name (*_id, uid, gid, pid, tid, session, uuid, …) — the only reliable signal, since a process-id column is statistically indistinguishable from a discrete measurement. A continuous measurement (fare, durationNanos, DAYS_LOST) is never named like an id, so it is never skipped. Cardinality is deliberately not used to call a numeric column categorical — a column that is one value with a few wild outliers has low cardinality yet is exactly what point detection should catch.

This is heuristic, but never silent: the role of every column is in the envelope (audit it), and --no-column-roles disables the skipping entirely. On a real 20k-entry journald capture it cuts point findings from ~12,500 to ~240 (the _PID/_UID/JOB_ID/timestamp columns) while leaving genuine measurements untouched. The setting is part of config_version (cr=).

--set KEY=VALUE — tune detector config

Every detector threshold is a field of the config that describe reports. --set overrides any of them by name (repeatable):

$ anomalyx scan --set point_threshold=4.0 --set dist_alpha=0.01 data.csv
$ anomalyx describe | jq .config        # the settable keys + their defaults

An unknown key or a value that doesn’t fit the field is a hard error (exit 2). Overrides flow into config_version, so a tuned run is just as reproducible and self-describing as a default one — the knob is never hidden. (The common knobs also have dedicated flags: --fdr, --cad-max-cv, --period, --cadence.)

--top N / --min-severity S — output scoping

Detection can surface tens of thousands of findings on a large corpus. These two flags scope what scan emits without touching what it detects:

$ anomalyx scan --top 50 big.parquet              # the 50 most severe
$ anomalyx scan --min-severity high big.parquet   # only high/critical
$ anomalyx scan --fdr 0.01 --min-severity high --top 25 big.parquet   # compose

--top N keeps the N most severe findings (the row list is already sorted severity-first); --min-severity S keeps findings at or above S (info < low < medium < high < critical).

The scoping is honest. summary (total, by_class, max_severity) and the exit code always describe everything detected, so filtering the view can never make anomalies look absent or flip exit 1 → 0. When findings are withheld, the envelope gains a scope block recording the filter and the detected / emitted / dropped counts; rows carries only the emitted subset. Without these flags the block is absent and rows is complete.

This is the volume complement to --fdr (which controls correctness): FDR makes findings statistically defensible, output scoping makes the list consumable. Together: “the top N, ≥ severity S, among the FDR-significant set.”

--fdr Q — false-discovery-rate control (point detector)

By default the point detector flags every cell whose modified z-score clears a fixed cutoff. With thousands of cells, a fixed cutoff has no notion of how many cells were tested. --fdr Q converts each cell’s score to a two-sided p-value and applies the Benjamini–Hochberg procedure within each column, bounding the expected proportion of false flags at Q:

$ anomalyx scan --fdr 0.05 events.parquet     # ≤5% expected false discoveries

This is principled, not arbitrary: a column that is really just noise stops contributing chance flags, and the same outlier can be significant in a small column yet not in a large one (the per-rank bar (k/m)·Q shrinks with the number of cells m). The threshold is folded into config_version (pfdr=), so a non-default level is a versioned, reproducible choice.
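The Benjamini–Hochberg step-up rule itself is short. The sketch below takes a column's p-values as given (the score-to-p-value conversion is not shown) and returns the indices to flag; the function name is hypothetical.

```rust
/// Benjamini–Hochberg: indices of the p-values rejected at FDR level `q`.
fn benjamini_hochberg(p_values: &[f64], q: f64) -> Vec<usize> {
    let m = p_values.len();
    // Rank the tests by ascending p-value; index breaks ties deterministically.
    let mut order: Vec<usize> = (0..m).collect();
    order.sort_by(|&a, &b| p_values[a].total_cmp(&p_values[b]).then(a.cmp(&b)));
    // Find the largest 1-based rank k with p_(k) <= (k/m) * q.
    let mut cutoff = 0;
    for (rank0, &idx) in order.iter().enumerate() {
        let k = rank0 + 1;
        if p_values[idx] <= (k as f64 / m as f64) * q {
            cutoff = k;
        }
    }
    // Everything ranked at or below the cutoff is rejected.
    let mut rejected: Vec<usize> = order[..cutoff].to_vec();
    rejected.sort_unstable();
    rejected
}
```

Note the step-up behavior: a p-value that misses its own per-rank bar is still rejected if some larger-ranked p-value clears its bar, which is why the loop keeps the largest qualifying rank rather than stopping at the first miss.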

--fdr controls correctness, not output volume. On genuinely heavy-tailed data it can flag more cells than the fixed cutoff; those cells really are significant at Q. To cap volume, pair it with --columns/--exclude or with the --top / --min-severity output scoping.

--columns C,.. / --exclude C,.. — column scope

Restrict detection to a chosen set of columns (--columns, an allowlist) or to everything but a set (--exclude, a denylist). The two are mutually exclusive. The projection is applied before any detector runs, and to the --baseline too, so drift comparison stays consistent.

# focus a wide log on the columns that carry signal
$ journalctl -o json | anomalyx scan --columns PRIORITY,_SYSTEMD_UNIT

# or keep everything except journald's identifier/counter/timestamp noise
$ journalctl -o json | anomalyx scan \
    --exclude JOB_ID,_PID,__MONOTONIC_TIMESTAMP,__REALTIME_TIMESTAMP,N_RESTARTS

This is the answer to identifier noise on wide corpora. The point detector will dutifully flag statistical outliers in every numeric column — including JOB_ID, PIDs, monotonic timestamps and restart counters, where an “outlier” is real but meaningless. On a raw 20k-entry journald capture that’s ~10k findings of noise; excluding those fields collapses it to a couple hundred that matter.

The scope is explicit, never heuristic. anomalyx will not auto-guess which columns are “interesting” — that would be a guess, and the obvious guess (drop near-unique columns) would wrongly discard exactly the near-unique numeric measurements the marquee detectors depend on (packet durationNanos, span durations, latencies). You name the scope; the result stays deterministic.

A column named in --columns/--exclude that doesn’t exist in the corpus is a hard error (exit 2), so a typo can’t silently scope a scan down to nothing and read as “clean”. (The baseline is projected leniently — it’s a different corpus and need not carry every scoped column.)

The tq1 envelope

scan emits a single JSON object — the tq1 envelope. It is dense and typed, not pretty text: a dictionary-pinned string table with findings encoded as fixed-shape rows that reference it. Changing any field is an API change and is guarded by a contract test. Run anomalyx schema for the machine-readable JSON Schema.

{
  "protocol": "anomalyx/tq1",
  "config_version": "anomalyx-cfg/5;pt=3.5000;...",
  "source": "sales.csv",
  "format": "csv",
  "baseline": "last_week.csv",        // present only in --baseline mode
  "rows_scanned": 9,
  "dict": ["point.modz", "point", "cell:amount:8", "critical", "amount = 9999 …"],
  "columns": ["detector","class","handle","confidence","severity","score","reason"],
  "rows": [ [0, 1, 2, 1.0, 3, 4544.43, 4] ],
  "absent": [ {"detector":"dist.ks","reason":"no baseline provided …"} ],
  "summary": { "total": 1, "max_severity": "critical", "by_class": [ … ] },
  "exit": 1
}

Fields

  • protocol: "anomalyx/tq1". Bumps on any breaking envelope change.
  • config_version — a fingerprint of every setting that affects output. Same input + same fingerprint ⇒ byte-identical output. Lets you tell “the data changed” from “the configuration changed.”
  • dict — the string table. Every repeated string (detector ids, class tokens, handles, severities, reasons) appears once here; rows reference it by index. No magic constants.
  • columns — the fixed column order of each dense finding row.
  • rows — one array per finding, aligned to columns: [detector_idx, class_idx, handle_idx, confidence, severity_idx, score, reason_idx]. confidence is calibrated to one scale across every detector: a logistic of how far the detector’s statistic sits past its firing threshold, measured relatively (so units cancel) — 0.5 at the threshold, rising toward 1.0. A finding “2× past threshold” earns the same confidence whether it came from a modified z-score, a KS p-value, a PSI, or a cadence CV, so severity (derived from confidence) ranks findings from different detectors on one scale. score is the detector’s raw statistic (uncalibrated), for drill-down.
  • absent — detectors that declined to run, each with a machine-readable reason. See honest absence.
  • summary — total count, max severity, and per-class counts for at-a-glance triage.
  • exit — the committed exit code, mirrored into the envelope.
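To show how the dense encoding unpacks, here is a small consumer-side sketch. It assumes the envelope has already been parsed from JSON by whatever library the consumer uses; the `DenseRow` and `Finding` types are hypothetical stand-ins, and only the column order comes from the text above.

```rust
/// One dense finding row, in the documented column order.
struct DenseRow {
    detector: usize,
    class: usize,
    handle: usize,
    confidence: f64,
    severity: usize,
    score: f64,
    reason: usize,
}

/// The same finding with every dictionary index resolved.
#[derive(Debug, PartialEq)]
struct Finding<'a> {
    detector: &'a str,
    class: &'a str,
    handle: &'a str,
    confidence: f64,
    severity: &'a str,
    score: f64,
    reason: &'a str,
}

/// Resolve a row against the string table.
/// Returns None if any index falls outside `dict`.
fn decode<'a>(dict: &[&'a str], row: &DenseRow) -> Option<Finding<'a>> {
    let lookup = |i: usize| dict.get(i).copied();
    Some(Finding {
        detector: lookup(row.detector)?,
        class: lookup(row.class)?,
        handle: lookup(row.handle)?,
        confidence: row.confidence,
        severity: lookup(row.severity)?,
        score: row.score,
        reason: lookup(row.reason)?,
    })
}
```

Applied to the example envelope, the row [0, 1, 2, 1.0, 3, 4544.43, 4] resolves to detector point.modz, class point, handle cell:amount:8, and severity critical. An out-of-range index is treated as a decoding failure rather than papered over.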

Handles

Findings are compact but drill-able. Each carries a stable handle whose canonical string is consistent across runs, so an agent can cache it and later explain it:

Handle  Form                           Used by
column  col:<name>                     structural
cell    cell:<column>:<row>            point
range   range:<column>:<start>:<end>   collective
dist    dist:<column>                  distributional
row     row:<n>                        multivariate
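An agent that caches handles may want to parse them back into structure. The sketch below follows the forms in the table; the `Handle` enum and `parse_handle` are hypothetical, and splitting numeric parts from the right (so a column name containing a colon still parses) is an assumption rather than documented behavior.

```rust
#[derive(Debug, PartialEq)]
enum Handle {
    Column(String),
    Cell { column: String, row: usize },
    Range { column: String, start: usize, end: usize },
    Dist(String),
    Row(usize),
}

/// Reject empty names so "col:" is not accepted as a handle.
fn non_empty(s: &str) -> Option<String> {
    if s.is_empty() { None } else { Some(s.to_string()) }
}

/// Parse a canonical handle string; None means "not a handle".
fn parse_handle(s: &str) -> Option<Handle> {
    let (kind, rest) = s.split_once(':')?;
    match kind {
        "col" => non_empty(rest).map(Handle::Column),
        "dist" => non_empty(rest).map(Handle::Dist),
        "row" => rest.parse().ok().map(Handle::Row),
        "cell" => {
            // Split the row index off the right-hand end.
            let (column, row) = rest.rsplit_once(':')?;
            Some(Handle::Cell { column: non_empty(column)?, row: row.parse().ok()? })
        }
        "range" => {
            let (head, end) = rest.rsplit_once(':')?;
            let (column, start) = head.rsplit_once(':')?;
            Some(Handle::Range {
                column: non_empty(column)?,
                start: start.parse().ok()?,
                end: end.parse().ok()?,
            })
        }
        _ => None,
    }
}
```

Returning None for anything malformed mirrors the tool's own stance: an unresolvable handle is an error, not a best guess.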

Findings are sorted deterministically (severity desc, then class, handle, detector), so the envelope is stable regardless of the order detectors ran.

Determinism & honest absence

Two principles run through the whole tool. Both exist because the primary consumer is an agent, and an agent can’t paper over surprises the way a human can.

Determinism is UX

“Determinism is not just a testing preference. It is user experience for agents.”

Same input + same config_version ⇒ byte-identical output. Concretely:

  • Order-independent reductions. Floating-point addition is commutative but not associative, so a naïve sum depends on the order in which values are accumulated. Every reduction (mean, variance, MAD, quantiles, PSI, …) sorts its inputs by total order and accumulates with compensated (Neumaier) summation, so the same multiset of values yields the same bits regardless of arrangement. This is exercised on real NIST data under reversal and rotation.
  • No wall-clock, no RNG in the measurement path. Detectors that elsewhere rely on randomness (e.g. isolation forests) are replaced with deterministic equivalents (Mahalanobis distance).
  • Stable interning and sorting. The envelope’s string table and finding order are deterministic, so two runs diff cleanly.
  • A config fingerprint. Any threshold that could change output also changes config_version, so you can always tell “the data changed” from “the tool’s configuration changed.”
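The first bullet can be sketched directly: sort by total order, then accumulate with Neumaier's compensated summation. This is an illustration of the technique under a hypothetical name, not the crate's det_sum.

```rust
/// Order-independent sum: sort by total order, then Neumaier summation.
fn sorted_neumaier_sum(values: &[f64]) -> f64 {
    let mut v = values.to_vec();
    // Any permutation of the same multiset sorts to the same sequence.
    v.sort_by(|a, b| a.total_cmp(b));
    let (mut sum, mut comp) = (0.0_f64, 0.0_f64);
    for x in v {
        let t = sum + x;
        if sum.abs() >= x.abs() {
            comp += (sum - t) + x; // low-order bits of x were rounded away
        } else {
            comp += (x - t) + sum; // low-order bits of sum were rounded away
        }
        sum = t;
    }
    sum + comp
}
```

Sorting makes the accumulation sequence a function of the multiset alone, and the compensation term recovers bits a plain running sum would drop: for the inputs 1.0, 1e100, 1.0, -1e100 this returns 2.0, whereas a plain left-to-right sum of the same inputs returns 0.0.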

Honest absence

“An AI-first instrument should not try to sound intelligent.”

A detector that cannot meaningfully run says so — it never fabricates a clean result. Absences are first-class, recorded in the envelope’s absent array with a machine-readable reason:

"absent": [
  {"detector":"dist.ks","reason":"no baseline provided; distributional drift requires --baseline"},
  {"detector":"ctx.seasonal","reason":"contextual detection needs a declared period ≥ 2 (pass --period N)"},
  {"detector":"mv.mahalanobis","reason":"needs at least 2 numeric columns for a multivariate distance"}
]

The same honesty appears at every level:

  • A missing cell is Null, never 0.0.
  • An unavailable detector contributes nothing, not an implied “looks fine.”
  • An unresolved explain handle fails with exit 2, not a fabricated hit.
  • A build without Polars support rejects Parquet explicitly.

Validation against NIST

Every detector rests on a small set of deterministic reductions (mean, standard deviation, …). “anomalyx is mathematically correct” is therefore a checked claim, not an assertion: those reductions are validated against the NIST Statistical Reference Datasets (StRD) — the canonical, certified-to-15-digits truth for univariate summary statistics. The datasets are vendored offline, so validation is reproducible with no network.

Results are scored by NIST’s own metric, the log relative error (the number of correct significant digits):

  • mean reproduces every certified value to ≥ 15 digits.
  • std_dev reaches ≥ 13 digits on well-conditioned data.

The precision proof

The NumAcc3 / NumAcc4 datasets are torture tests: a mean near 10⁶–10⁷ with a standard deviation of exactly 0.1. The textbook one-pass variance, (Σx² − (Σx)²/n)/(n − 1), suffers catastrophic cancellation here. anomalyx’s compensated two-pass reduction does not:

dataset    anomalyx std (correct digits)  naïve one-pass
NumAcc3    9.46                           1.14
NumAcc4    8.25                           0.00 (zero correct digits)
Michelson  13.84                          8.28

On NumAcc4 the textbook formula gets nothing right; anomalyx tracks NIST to ~8 digits — the ceiling imposed by the f64 representation of the inputs themselves, which is all NIST expects. This is a checked demonstration that the determinism-and-precision design is load-bearing, not decorative.
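The cancellation is easy to reproduce. The sketch below contrasts the textbook one-pass formula with a corrected two-pass variance on synthetic data in the spirit of NumAcc4 (values near 10⁷ with a true standard deviation of 0.1). The data generator is an assumption rather than the vendored dataset, and the corrected two-pass form is one common way to compensate, not necessarily the exact reduction anomalyx uses.

```rust
/// Textbook one-pass standard deviation: sqrt((Σx² − (Σx)²/n) / (n − 1)).
/// May return NaN if cancellation drives the numerator negative.
fn naive_std(x: &[f64]) -> f64 {
    let n = x.len() as f64;
    let (s, s2) = x
        .iter()
        .fold((0.0_f64, 0.0_f64), |(s, s2), v| (s + v, s2 + v * v));
    ((s2 - s * s / n) / (n - 1.0)).sqrt()
}

/// Two-pass standard deviation with a correction term that cancels the
/// rounding error in the computed mean.
fn two_pass_std(x: &[f64]) -> f64 {
    let n = x.len() as f64;
    let mean = x.iter().sum::<f64>() / n;
    let (ss, comp) = x.iter().fold((0.0_f64, 0.0_f64), |(ss, c), v| {
        let d = v - mean;
        (ss + d * d, c + d)
    });
    ((ss - comp * comp / n) / (n - 1.0)).sqrt()
}

/// Synthetic torture data: one centre value, then 500 pairs at ±0.1 around it.
fn torture_data() -> Vec<f64> {
    let mut v = vec![10_000_000.2];
    for _ in 0..500 {
        v.push(10_000_000.1);
        v.push(10_000_000.3);
    }
    v
}
```

On this data Σx² is about 10¹⁷, where adjacent f64 values are 16 apart, while the quantity being recovered is about 10. The one-pass numerator therefore cannot resolve it, whereas the two-pass form subtracts the mean first and works with deviations of about 0.1.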

Stress tests

Beyond certified values, the harness verifies behavior against known ground truth:

  • Ground-truth recovery — planted outliers are flagged exactly, with no false positives or negatives.
  • Order independence: det_sum is bit-identical under reversal and rotation on real 5000-point NIST data.
  • Reproducibility at scale — a 40k-row scan serializes identically across runs.

Architecture

A small workspace of focused crates. The guiding rule: the contract is engine-independent, so the heavy machinery can change without the output shape moving.

crates/
  ax-core        contract types: RecordSet, the anomaly taxonomy, the tq1
                 envelope, evidence handles, deterministic reductions.
                 Deliberately no heavy deps — keeps the contract independent
                 and the mutation gate fast. (crate: anomalyx-core)
  ax-normalize   any input format → RecordSet. CSV/TSV/NDJSON/JSON via a lean
                 deterministic reader; Parquet/Arrow IPC via the Polars
                 backbone, behind the default-on `polars` feature.
                 (crate: anomalyx-normalize)
  ax-detect      the Detector trait + registry; the nine detectors and their
                 math (assembled from statrs, not reinvented).
                 (crate: anomalyx-detect)
  anomalyx       the four-verb CLI surface — the installable binary.
  ax-validate    NIST StRD validation + stress harness (publish = false).

Engine independence

Polars lives only inside ax-normalize’s binary-format reader. It reads a DataFrame and lowers it to a RecordSet; no Polars type ever reaches a detector, the envelope, or the contract. That’s what lets the text-only build drop Polars entirely, and what keeps ax-core — where the taxonomy and envelope live — a tiny, dependency-light crate that the mutation gate can sweep quickly.

Adding a format (the parser plugin system)

ax-normalize is a parser-plugin registry. Each format is an independent FormatParser (id, extensions, content sniff, parse) living in its own file under crates/ax-normalize/src/parsers/. The ParserRegistry resolves a byte stream by file extension first, then by the highest-confidence sniff (deterministic: ties break by registration order). Adding a format is a new parsers/<fmt>.rs plus one register(...) line in default_registry, with no central match to edit. See the open format issues for the backlog.

The detector contract

A Detector is itself a contract. Given a ScanContext { current, baseline } it either runs and emits Findings, or declares honest Absence. The Registry runs the set deterministically and merges everything into one Report, which the CLI turns into a tq1 envelope. Adding a detector is: implement the trait, register it, and gate it.
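
As a rough illustration of "findings or honest absence", the sketch below models the two outcomes as distinct types and merges them into one report. It is written in Python for brevity and does not reproduce the Rust trait; `Finding`, `Absence`, `Report`, `drift_detector`, and `run_all` are hypothetical names, and the toy drift rule is invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence, Union


@dataclass(frozen=True)
class Finding:
    detector: str
    handle: str


@dataclass(frozen=True)
class Absence:
    detector: str
    reason: str


@dataclass
class Report:
    findings: List[Finding] = field(default_factory=list)
    absences: List[Absence] = field(default_factory=list)


def drift_detector(current: Sequence[float],
                   baseline: Optional[Sequence[float]]) -> Union[List[Finding], Absence]:
    """Toy drift check: it needs a baseline, and says so when there is none."""
    if baseline is None:
        return Absence("toy.drift", "no baseline supplied")
    # Invented rule for illustration: means differ by more than 1.0.
    shifted = abs(sum(current) / len(current) - sum(baseline) / len(baseline)) > 1.0
    return [Finding("toy.drift", "column:amount")] if shifted else []


def run_all(detectors, current, baseline=None) -> Report:
    """Run detectors in registration order and merge the outcomes into one report."""
    report = Report()
    for detect in detectors:
        outcome = detect(current, baseline)
        if isinstance(outcome, Absence):
            report.absences.append(outcome)  # "could not run" is recorded, never dropped
        else:
            report.findings.extend(outcome)
    return report
```

The point of the separate `Absence` type is that an empty findings list and "did not run" can never be confused by a consumer.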

Naming

The crates.io packages are namespaced under the brand (anomalyx-core, anomalyx-normalize, anomalyx-detect) because the short ax-* names were taken; the in-source module/import names remain ax_core etc. via Cargo’s dependency-rename, so the code reads cleanly.

Quality gates

Two load-bearing test gates back every change. scripts/gates.sh runs both locally; CI runs the fast checks on every push and pull request, while the mutation gate stays local (see CI below).

Property-based testing

Invariants are pinned across all inputs with proptest, not just hand-picked cases — for example:

  • the point detector is shift-, scale-, and permutation-invariant;
  • KS is symmetric and lies in [0, 1]; PSI is non-negative;
  • Mahalanobis flagging is translation-invariant;
  • reductions are order-independent and reproducible.
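
To make the first invariant concrete, here is a hand-rolled check on one fixed input. The project pins these properties with proptest over generated inputs; this Python sketch only shows what the property says. `modz_flags` is a hypothetical helper using the textbook Iglewicz-Hoaglin form, and the 3.5 threshold is the commonly cited default, assumed here rather than taken from anomalyx.

```python
from statistics import median
from typing import List, Sequence


def modz_flags(values: Sequence[float], threshold: float = 3.5) -> List[int]:
    """Indices whose Iglewicz-Hoaglin modified z-score exceeds the threshold."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # simplification: this sketch does not handle a degenerate scale
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - center) / mad) > threshold]


data = [10, 12, 11, 13, 12, 11, 95]
base = modz_flags(data)

# Shift and scale: an affine map moves median and MAD together, so flags are unchanged.
affine = modz_flags([3 * v + 100 for v in data])

# Permutation: reversing the input flags the same values (indices are mirrored).
rev = modz_flags(list(reversed(data)))
flagged_values_rev = sorted(list(reversed(data))[i] for i in rev)
```

A property test generalizes this by drawing the data, the shift, the scale, and the permutation at random and asserting the same equalities.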

Mutation testing

Property tests are only as good as their teeth. cargo-mutants mutates the source and checks that some test fails for each change. The gate is zero surviving mutants across the workspace.

Getting there surfaced — and killed — real test gaps, and forced exact-value pins (e.g. validating reductions against NIST). A handful of mutants are genuinely equivalent (they cannot change observable behavior for any input — a measure-zero p == α boundary, or a sign flip that the Σ(deviations) == 0 identity cancels); those are documented individually in .cargo/mutants.toml, never blanket-suppressed. Loop-bound mutations that hang are detected as timeouts (a hang is caught, not a survivor), so the gate is precisely “no mutant survives.”

CI

.github/workflows/ci.yml runs the fast gates on every push and pull request: cargo fmt --check, cargo clippy -D warnings, the full test suite, and the text-only --no-default-features build.

The mutation gate runs locally, not in CI: cargo mutants consumes far too many minutes on hosted runners. It is enforced before pushing via:

./scripts/gates.sh    # fmt · clippy · test · mutation (0 surviving mutants)

and is the contributor’s responsibility (the gate workflow can fan it out per-crate). Treat a green local mutation run as part of “done.”

Worked examples

The repository’s examples/ directory holds small, runnable programs that use anomalyx on real data. They exist to demonstrate one thing the contract makes possible: an agent (or a 30-line script) can consume the tq1 envelope directly — parse the dense finding rows and the dict-pinned string table, then map each handle back to a row, cell, or timestamp — rather than scraping human-readable text.

They live outside the Cargo workspace and shell out to the installed anomalyx binary, so they have no effect on the build or the quality gates. Each mirrors anomalyx’s exit code (0 clean, 1 anomalies, 2 error).
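
A minimal sketch of the consuming side, assuming only what the README's sample output shows: a `protocol` field equal to `anomalyx/tq1`, a `rows` array, and an `exit` field. `read_envelope` and `EXIT_MEANING` are hypothetical names, the dense row layout is deliberately left undecoded, and treating exit 2 as "no envelope to parse" is an assumption.

```python
import json
from typing import Any, Dict

EXIT_MEANING = {0: "clean", 1: "anomalies", 2: "error"}


def read_envelope(stdout: str, returncode: int) -> Dict[str, Any]:
    """Parse a scan's stdout as a tq1 envelope and cross-check the exit code."""
    if returncode not in EXIT_MEANING:
        raise ValueError(f"unexpected exit code {returncode}")
    if returncode == 2:
        # Assumption: on error there is no envelope worth parsing.
        raise RuntimeError("anomalyx reported an error")
    envelope = json.loads(stdout)
    if envelope.get("protocol") != "anomalyx/tq1":
        raise ValueError("not a tq1 envelope")
    return envelope


# In a real script stdout/returncode would come from something like
# subprocess.run(["anomalyx", "scan"], input=csv_text, capture_output=True, text=True).
sample = '{"protocol": "anomalyx/tq1", "rows": [], "exit": 0}'
envelope = read_envelope(sample, 0)
status = EXIT_MEANING[envelope["exit"]]
```

A script built this way can finish with `sys.exit(returncode)` to mirror anomalyx's exit code, as the examples do.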

The examples

| Example | Data | What it surfaces |
|---|---|---|
| stock_anomalies.py | Yahoo Finance daily history | anomalous trading days; distributional drift vs. another ticker |
| journal_anomalies.py | journalctl -o json (systemd) | rare priorities, bursts, per-unit content spikes; drift between two windows |
| polymarket_anomalies.py | Polymarket public APIs | information shocks (point/mv) and odds regime shifts (coll.cusum) |
| synergy_market.py | Yahoo Finance + agent-calc | anomalyx finds; the exact-math kernel proves (tail probability, a t-test across the regime break, exact correlations) |

Each maps the handle in every finding back to a calendar date / timestamp, so the output reads as “this day, this column, this kind of deviation”.
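
The mapping step might look like the sketch below. It handles only the `cell:<column>:<row>` form seen in the README's `explain cell:amount:8` example, where the row index is zero-based over data rows; the layouts of the other handle forms are not assumed. `parse_cell_handle`, `describe`, and the sample dates are hypothetical.

```python
from typing import List, Tuple


def parse_cell_handle(handle: str) -> Tuple[str, int]:
    """Split a 'cell:<column>:<row>' handle into (column, row index)."""
    kind, _, rest = handle.partition(":")
    if kind != "cell":
        raise ValueError(f"not a cell handle: {handle!r}")
    column, _, row = rest.rpartition(":")  # rpartition tolerates ':' in the column name
    if not column or not row.isdigit():
        raise ValueError(f"malformed cell handle: {handle!r}")
    return column, int(row)


def describe(handle: str, dates: List[str]) -> str:
    """Render a finding's handle as 'date: column' using the script's own date list."""
    column, row = parse_cell_handle(handle)
    return f"{dates[row]}: {column}"


# Illustrative dates; a real script keeps the list it built while fetching the data.
dates = ["2024-01-02", "2024-01-03", "2024-01-04"]
label = describe("cell:close:2", dates)
```

Because the handle carries the row index, the script needs no fuzzy matching: it indexes straight into the date column it fed to the scan.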

Contracts composing with contracts

synergy_market.py is the clearest illustration of why a machine-readable contract matters. anomalyx is descriptive and assumption-free — it reports which days and regimes broke the pattern (point.modz, mv.mahalanobis, coll.cusum), never assuming a distribution. Its findings then flow, as typed JSON, straight into agent-calc — a sibling contract-first CLI that does exact statistics: the return distribution’s fat-tailed kurtosis, the worst day’s tail probability under a fitted Gaussian (routinely one-in-millions — i.e. the naive risk model is what is broken), a two-sample t-test across the detected regime break (a real shift in the mean, or only the trajectory?), and exact correlations across a basket.

Two executables, two contracts, no prose and no float drift in between — which is the whole thesis: the executable is the contract.

See examples/README.md for the exact commands and prerequisites.

Changelog

All notable changes to this project are documented here. The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Unreleased

1.1.2 - 2026-06-01

No library or contract changes — the tq1 envelope, exit codes, and config_version are byte-for-byte identical to 1.1.1. This is a documentation/examples release; it also wires the anomalyx crate’s README so the crates.io landing page finally renders it.

Examples

  • examples/synergy_market.py — pairs anomalyx with agent-calc (a sibling contract-first exact math CLI) on the live market: anomalyx finds the anomalous days and the price regime shift, then agent-calc computes the exact return distribution, the worst day’s tail probability under a fitted Gaussian, a two-sample t-test across the detected CUSUM break, and exact Pearson r of each basket name to the market. Two typed-JSON contracts chained end to end.
  • examples/polymarket_anomalies.py — find information shocks in a Polymarket prediction market: pulls a market’s price history from Polymarket’s public APIs (read-only, no key), enriches with the per-step probability change, and scans — sharp probability jumps (point / mv) and sustained regime shifts in the odds (coll.cusum), each mapped back to its UTC timestamp.

Documentation

  • README’s Examples section now lists all four worked examples (stock, journal, polymarket, synergy) with the agent-calc synergy called out; the journal example is also listed in examples/README.md.
  • New mdbook page “Worked examples” (docs/src/examples.md) framing the examples as consuming the tq1 contract.
  • The anomalyx binary crate now sets readme = "../../README.md", so the crates.io page renders the project README (it had none before).

1.1.1 - 2026-06-01

Fixed

  • Timestamp columns are now recognized as sequences and skipped by the value detectors. Role::Sequence required strict monotonicity, but real clock columns (journald’s __REALTIME_TIMESTAMP/__MONOTONIC_TIMESTAMP, a pcap timestamp) tie or regress just often enough to fail it — so they were treated as measurements, and coll.cusum flagged their “level shift” (time advancing) and point their jumps. A timestamp / ts name token now classifies a column as sequence, kept deliberately narrow so response_time-style measurements (which you do want outliers on) are unaffected. Surfaced by the new journald example. No config_version change — a classifier refinement, like 1.0.1’s procid.

Examples

  • examples/journal_anomalies.py — find anomalies in the systemd journal: point / structural / collective within one capture (e.g. CPU-usage spikes per unit), or distributional drift of _SYSTEMD_UNIT / PRIORITY between two windows (--baseline-since). Pipes journald JSON on stdin (so it sniffs as journal, not plain JSON) and maps findings back to timestamp / unit / message.
  • examples/stock_anomalies.py — fetch a ticker’s daily history from Yahoo Finance and find its anomalous trading days (point / multivariate / collective), or its distributional drift against another ticker (--baseline). A worked example of consuming the tq1 envelope: it parses the dense JSON contract and maps each finding’s handle back to a calendar date.
  • Both live outside the Cargo workspace (they shell out to the installed binary), so they don’t affect the build or gates.

1.1.0 - 2026-06-01

Changed

  • Column roles now gate every value-distribution detector, not just point. ctx.seasonal, coll.cusum, dist.ks / dist.psi / dist.chi2, and mv.mahalanobis now skip identifier and sequence columns (and exclude them from the Mahalanobis feature space). A seasonal subseries, level-shift, drift test, or joint distance over arbitrary ids or a monotonic ramp is noise, not signal — this fixes, e.g., coll.cusum flagging a shift in a syslog procid. A shared Role::skips_value_detection() keeps the rule in one place. (struct.schema stays role-agnostic — null-rate/schema-diff are meaningful for any column; cad.regularity only ever uses the explicit --cadence column.)
  • This changes detector output when column_roles = true, so the config_version fingerprint is bumped (anomalyx-cfg/9). Envelope shape and PROTOCOL are unchanged; --no-column-roles restores the pre-roles behavior across all detectors.

Testing

  • Scoped the parser-robustness harness’s magic-prefixed fuzz test to formats whose decode allocation anomalyx bounds (sqlite). The binary container decoders (parquet/arrow, avro, orc, evtx, pcap) delegate to crates that trust the file’s internal length fields and can attempt a large allocation on adversarial input — a property of binary-format parsing, now documented rather than asserted (it surfaced as an intermittent CI OOM). Those parsers are still fuzzed with arbitrary bytes (rejected at the magic check).

1.0.1 - 2026-06-01

Fixed

  • Syslog: the PRI-less file format now parses. rsyslog/syslog-ng write /var/log/syslog without the <PRI> wire header (an ISO-8601 or BSD timestamp, then host and tag), but the parser’s sniff required a <PRI> — so a real /var/log/syslog was misdetected as ini and collapsed to a single garbage row. It is now recognized (timestamp + host + app) and parses one row per line; facility/severity are present only when a <PRI> is. Found by dogfooding the host’s real syslog (50k lines → ini/1 row, now → syslog/50k rows).
  • Column roles: procid is recognized as an identifier. The syslog procid (process id) column was classed a measurement, so PIDs were flagged as point outliers (~18.5k noise findings on a 50k-line syslog). procid joins the identifier name set, so it is skipped like other ids (→ 1 finding).

1.0.0 - 2026-06-01

First stable release. No code changes from 0.9.0 — this commits the contract.

Stable

  • The tq1 contract is now stable: the protocol id anomalyx/tq1, the exit codes (0/1/2), the dense finding-row layout, the handle forms (column:/cell:/row:/range:/dist:), the required envelope fields, and the severity ladder. Breaking any of these requires a major bump and a PROTOCOL change — they will not change quietly under 1.x. See the contract’s Stability section.
  • Continues to evolve additively under 1.x: new detectors, formats, optional flags, and optional envelope fields. Output-affecting config changes move the config_version fingerprint; determinism (same input + same config_version ⇒ byte-identical output) is absolute. The golden-envelope tests guard all of this against accidental drift.

0.9.0 - 2026-06-01

Added

  • scan / explain gain --set KEY=VALUE (repeatable) — override any detector-config field by name (--set point_threshold=4.0, --set dist_alpha=0.01, --set column_roles=false, …). The settable keys and their defaults are exactly what describe’s config object lists. An unknown key, or a value that doesn’t fit the field’s type, is a hard error (exit 2). Overrides flow into config_version, so a tuned run stays reproducible and self-describing — tuning is never silent. (The common knobs keep their dedicated flags: --fdr, --cad-max-cv, --period, --cadence.)
  • Implemented as a JSON round-trip over the serialized DetectConfig, so every field is settable with no per-field code; no envelope/PROTOCOL change.
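
The JSON round-trip idea can be illustrated as follows. This is a Python sketch of the technique, not the Rust implementation: the three field names come from the `--set` examples above, but their default values, the `apply_set` helper, and the restriction to float and bool fields are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DetectConfig:
    # Field names appear in the --set examples; the defaults here are placeholders.
    point_threshold: float = 3.5
    dist_alpha: float = 0.05
    column_roles: bool = True


def apply_set(cfg: DetectConfig, assignment: str) -> DetectConfig:
    """Apply one KEY=VALUE override by round-tripping the config through JSON."""
    key, sep, raw = assignment.partition("=")
    fields = json.loads(json.dumps(asdict(cfg)))  # serialize, then edit the plain dict
    if not sep or key not in fields:
        raise ValueError(f"unknown config key: {key!r}")
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        raise ValueError(f"bad value for {key}: {raw!r}") from None
    current = fields[key]
    if isinstance(current, bool):
        ok = isinstance(value, bool)
    else:  # this sketch only models float fields besides bool
        ok = isinstance(value, (int, float)) and not isinstance(value, bool)
    if not ok:
        raise ValueError(f"value {raw!r} does not fit the type of {key}")
    fields[key] = value if isinstance(current, bool) else float(value)
    return DetectConfig(**fields)
```

Because the override edits the serialized form, a new config field becomes settable without any per-field code, which is the property the entry above describes.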

Testing

  • Golden-envelope snapshot tests (anomalyx/tests/golden.rs). Run the actual binary and pin its byte-exact stdout for schema, describe, and a representative scan envelope against committed goldens — so any accidental contract drift (renamed field, changed dense-row layout, shifted config_version, recalibrated confidence) fails CI as a visible diff. Regenerate intentional changes with BLESS=1.
  • Million-row scale test (ax-validate): a 1,000,000-row scan must be byte-identical across runs and recover exactly the injected outliers — determinism and correctness verified at scale, not just on toy inputs.

0.8.0 - 2026-06-01

Changed

  • Unified confidence calibration across all detectors. Confidence was computed three incompatible ways (1 − p for the distributional/multivariate detectors, a logistic-over-threshold for point/contextual/collective/PSI, and a linear map for cadence), so a 0.9 meant different things depending on which detector produced it — and severity (and --top / --min-severity) couldn’t rank across detectors. Now every detector routes through one shared function: confidence is a logistic of how far its statistic sits past its firing threshold, measured relatively so units cancel. At the threshold → 0.5, rising toward 1.0; a finding “2× past threshold” earns the same confidence on any detector. New ax_detect::calibrate module (from_exceedance / from_undercut); the duplicated shift_confidence / psi_confidence / robustz::confidence helpers are gone.
  • This recalibrates every published confidence and severity. The config_version fingerprint is bumped (anomalyx-cfg/8) so the change is visible to agents. The envelope shape and PROTOCOL are unchanged.
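
One way to read "a logistic of how far the statistic sits past its threshold, measured relatively" is sketched below. The function names mirror the `from_exceedance` / `from_undercut` helpers named above, but the exact functional form and the `steepness` constant are assumptions; only the stated properties (0.5 at the threshold, rising toward 1.0, units cancelling) are taken from the entry.

```python
import math


def from_exceedance(stat: float, threshold: float, steepness: float = 4.0) -> float:
    """Confidence for detectors that fire when stat rises past threshold."""
    # steepness is an assumed constant, not anomalyx's calibrated value
    relative = stat / threshold - 1.0  # unitless: 0 exactly at the threshold
    return 1.0 / (1.0 + math.exp(-steepness * relative))


def from_undercut(stat: float, threshold: float, steepness: float = 4.0) -> float:
    """Mirror image for detectors that fire when stat falls below threshold."""
    relative = 1.0 - stat / threshold
    return 1.0 / (1.0 + math.exp(-steepness * relative))
```

Under this form a modified z-score of 7.0 against a threshold of 3.5 and a PSI of 0.5 against 0.25 both sit at a ratio of 2 and therefore receive the same confidence, which is what makes ranking across detectors meaningful.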

Testing

  • Parser robustness harness (ax-normalize/tests/robustness.rs). Property tests assert that no parser panics, hangs, or over-allocates on arbitrary, magic-prefixed-garbage, or truncated byte streams — fed both through auto-detection and straight to every registered parser — and that normalization is deterministic over fuzz inputs. Untrusted-input hardening: a malformed file must fail cleanly, never crash.

0.7.0 - 2026-06-01

Added

  • Column roles. Every scanned column is classified into a role — measurement / identifier / categorical / sequence / constant — and the full map ships in the envelope’s new roles array. The point detector skips identifier and sequence columns (a “large process-id” or a counter’s endpoint is not an anomaly), attacking noise at the detection layer. On a real 20k journald capture this cuts point findings from ~12,500 to ~240 while leaving genuine measurements (e.g. a parquet’s heavily-skewed DAYS_LOST) untouched.
  • --no-column-roles disables role-based skipping (roles are still reported). The setting is part of the config_version fingerprint (cr=).

Design

  • Identifiers are recognized by name (*_id, uid, gid, pid, tid, session, uuid, …) — the only reliable signal, since a process-id column is statistically indistinguishable from a discrete measurement. Cardinality is deliberately not used to call a numeric column categorical (a near-constant column with a few outliers has low cardinality yet is exactly what point detection should catch). Heuristic, but never silent: every role is in the envelope and the skipping is one flag away from off.
  • New ax_core::roles module (Role, ColumnRole, Column::role); roles added to the envelope and schema. Additive; PROTOCOL unchanged.
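
Name-based identifier recognition could be sketched like this. The token set is the one listed in the Design note (the original list is open-ended); matching on whole name tokens rather than substrings is an assumption, and `role_from_name` is a hypothetical helper, not the `ax_core::roles` API.

```python
import re
from typing import Optional

# Name tokens taken from the Design note; the real set is longer.
IDENTIFIER_TOKENS = {"id", "uid", "gid", "pid", "tid", "session", "uuid"}


def role_from_name(column: str) -> Optional[str]:
    """Return 'identifier' when a name token marks the column as an id, else None."""
    # Assumption: names are split on non-alphanumeric characters into tokens.
    tokens = [t for t in re.split(r"[^a-z0-9]+", column.lower()) if t]
    if any(t in IDENTIFIER_TOKENS for t in tokens):
        return "identifier"
    return None  # undecided by name; other evidence would assign the role
```

Token matching keeps a column such as `paid_amount` out of the identifier set even though it contains the letters `id`, so genuine measurements stay visible to the point detector.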

0.6.0 - 2026-06-01

Added

  • scan gains output scoping: --top N and --min-severity S. --top N emits only the N most severe findings; --min-severity S emits only findings at or above S (info/low/medium/high/critical). This is the volume complement to --fdr — on a large corpus it shrinks the envelope dramatically (a real 127k-row parquet: ~3 MB → ~5.6 KB with --top 25) while keeping the full picture in summary.
  • Honest truncation. summary (total, by_class, max_severity) and the exit code always describe everything detected, never the scoped view, so filtering can’t make anomalies look absent or flip exit 1 to 0. When findings are withheld, the envelope gains a scope block with the applied filter and detected / emitted / dropped counts; rows carries only the emitted subset. The scope block is absent when no scoping was applied (default output unchanged).
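
The separation between "what is emitted" and "what was detected" can be sketched as below. The field names `summary`, `total`, `by_class`, `max_severity`, `rows`, `scope`, `detected`, `emitted`, and `dropped` come from the entries above; the `(class, severity)` tuple standing in for a dense row, the `scoped_envelope` helper, and emitting `scope` whenever a filter is supplied are simplifying assumptions.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

LADDER = ["info", "low", "medium", "high", "critical"]
Finding = Tuple[str, str]  # (class, severity): a simplified stand-in for a dense row


def scoped_envelope(findings: List[Finding], top: Optional[int] = None,
                    min_severity: Optional[str] = None) -> Dict:
    """Filter what is emitted while summarizing everything that was detected."""
    rank = {name: i for i, name in enumerate(LADDER)}
    emitted = list(findings)
    if min_severity is not None:
        emitted = [f for f in emitted if rank[f[1]] >= rank[min_severity]]
    if top is not None:
        # sorted() is stable, so equally severe findings keep their original order
        emitted = sorted(emitted, key=lambda f: rank[f[1]], reverse=True)[:top]
    envelope = {
        "summary": {
            "total": len(findings),
            "by_class": dict(Counter(cls for cls, _ in findings)),
            "max_severity": max((sev for _, sev in findings), key=rank.get, default=None),
        },
        "rows": emitted,
        "exit": 1 if findings else 0,  # from everything detected, not the scoped view
    }
    if top is not None or min_severity is not None:
        envelope["scope"] = {"detected": len(findings), "emitted": len(emitted),
                             "dropped": len(findings) - len(emitted)}
    return envelope
```

Note that a filter strict enough to emit zero rows still leaves `summary.total` and the exit code reporting the anomalies.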

Changed

  • The envelope summary.total now reports the number of findings detected (unchanged when no output scoping is applied, since detected == emitted then). rows.len() equals scope.emitted when scoping is active. The scope field and updated schema are additive; PROTOCOL is unchanged.

0.5.0 - 2026-05-31

Added

  • scan / explain gain --fdr Q — false-discovery-rate control for the point detector via the Benjamini–Hochberg procedure, applied per column. When set, each cell’s modified z-score is converted to a two-sided p-value and the fixed point_threshold is replaced by a multiplicity-aware cutoff that bounds the expected proportion of false flags at Q (e.g. --fdr 0.05). Opt-in: omitted, the detector behaves exactly as before. The level is part of the config_version fingerprint (pfdr=), so it is a versioned, reproducible choice.
  • New ax_detect::fdr module: two_sided_p (normal-tail p-value via erfc) and benjamini_hochberg (deterministic step-up cutoff), each property/exact tested and mutation-gated.
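
For readers unfamiliar with the procedure, the two pieces can be sketched in Python as follows. The function names echo the `two_sided_p` and `benjamini_hochberg` helpers named above, but the signatures and the return convention (the cutoff p-value, or `None` when nothing passes) are assumptions, and the p-values are made up for illustration.

```python
import math
from typing import Optional, Sequence


def two_sided_p(z: float) -> float:
    """Two-sided normal-tail p-value: P(|Z| >= |z|) = erfc(|z| / sqrt(2))."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def benjamini_hochberg(p_values: Sequence[float], q: float) -> Optional[float]:
    """Step-up cutoff: the largest sorted p_(k) with p_(k) <= k * q / m, else None."""
    m = len(p_values)
    cutoff = None
    for k, p in enumerate(sorted(p_values), start=1):
        if p <= k * q / m:
            cutoff = p  # keep scanning: step-up takes the last k that passes
    return cutoff


# Illustrative p-values for one column of ten cells.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
cutoff = benjamini_hochberg(p_values, q=0.05)
flagged = [p for p in p_values if cutoff is not None and p <= cutoff]
```

Here only the two smallest p-values survive at Q = 0.05: 0.039 would pass a naive per-cell 0.05 test but fails its rank-adjusted bound of 0.015, which is the multiplicity correction at work.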

Notes

  • FDR is a correctness control, not a volume knob. It replaces an arbitrary fixed cutoff with a principled error-rate guarantee and adapts to how many cells were tested (a noise column stops contributing chance flags; the same outlier can be significant in a small column yet not a large one). On genuinely heavy-tailed data it may flag more cells than the old fixed threshold — those cells really are significant at the chosen Q; the fixed cutoff was simply stringent in an uncalibrated way. To cap output volume, combine with column scoping (--columns/--exclude) and the planned severity / top-N output scoping.
  • The p-value uses the consistent-σ standardized deviation (x − center)/scale (≈ N(0, 1) under the null), not robustz’s display-scaled modified z-score.

0.4.1 - 2026-05-31

Fixed

  • SQLite: WAL-mode databases now read. The parser loads a database from its main-file byte image via SQLite’s read-only deserialize. A database in WAL journal mode carries read-version 2 in its file header (byte 19), and SQLite refuses to open such an image read-only without the -wal companion (which never travels in a byte stream) — failing with unable to open database file (SQLITE_CANTOPEN). Since the main image of a checkpointed WAL database is a complete, valid database, the parser now reinterprets it as legacy (read-version 1) on a private copy and reads its checkpointed state. This unblocks the common case: most production .db files (browsers, peewee, and countless apps) default to WAL. Found by dogfooding real on-disk databases.
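
The header rewrite can be sketched as a pure byte operation. This is an illustration of the idea, not the parser's code: `as_legacy_image` is a hypothetical helper, and it relies on the SQLite file-format layout in which the first 16 bytes are the magic string and byte 19 holds the read version (1 for legacy, 2 for WAL).

```python
SQLITE_MAGIC = b"SQLite format 3\x00"
READ_VERSION_OFFSET = 19  # header byte 19: 1 = legacy rollback journal, 2 = WAL


def as_legacy_image(image: bytes) -> bytes:
    """Return a copy of a SQLite main-file image with read-version 2 rewritten to 1."""
    if len(image) < 100 or not image.startswith(SQLITE_MAGIC):
        raise ValueError("not a SQLite database image")
    patched = bytearray(image)  # private copy; the caller's bytes stay untouched
    if patched[READ_VERSION_OFFSET] == 2:
        patched[READ_VERSION_OFFSET] = 1
    return bytes(patched)
```

Only the checkpointed state is visible through such an image; anything still sitting in an unshipped `-wal` file is, by construction, not part of the byte stream.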

0.4.0 - 2026-05-31

Added

  • scan / explain gain --cad-max-cv F — the maximum inter-arrival coefficient of variation below which cad.regularity flags a column as metronomic (automated) timing. Defaults to 0.05 (unchanged behavior). Raise it to catch jittered beacons: a C2 channel with ~10% timing jitter (CV ≈ 0.10) slips past the default but is caught at --cad-max-cv 0.15.
  • The threshold is part of the config_version fingerprint (cdcv=), so overriding it is a visible, versioned change in the envelope — not a silent knob. Same input + same config_version still yields byte-identical output.

Notes

  • Validated against a deterministic jitter sweep: at the default 0.05 the detector fires up to CV ≈ 0.0494 and goes quiet at ≈ 0.0504 (it uses the sample/Bessel-corrected standard deviation); raising the threshold shifts that boundary exactly as expected.
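
The statistic itself is small enough to show. The sketch below computes the inter-arrival coefficient of variation with the sample (Bessel-corrected) standard deviation, as the note above describes; `inter_arrival_cv` and `is_metronomic` are hypothetical names, the timestamps are assumed sorted, and using a strict `<` at the boundary is an assumption.

```python
from statistics import mean, stdev
from typing import Sequence


def inter_arrival_cv(timestamps: Sequence[float]) -> float:
    """Coefficient of variation of the gaps between consecutive sorted timestamps."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if len(gaps) < 2:
        raise ValueError("need at least three timestamps")
    return stdev(gaps) / mean(gaps)  # stdev is the sample (n - 1) estimator


def is_metronomic(timestamps: Sequence[float], max_cv: float = 0.05) -> bool:
    """True when the timing is regular enough to fall below the CV threshold."""
    return inter_arrival_cv(timestamps) < max_cv  # strict '<' is an assumption


steady = [0, 10, 20, 30, 40, 50]
jittered = [0, 10, 21, 30, 40, 52, 60]  # gaps 10, 11, 9, 10, 12, 8 -> CV about 0.14
```

The jittered series slips past the default 0.05 but is caught once the threshold is raised to 0.15, mirroring the beacon scenario described for `--cad-max-cv`.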

0.3.0 - 2026-05-31

Column scoping — focus detection on the columns that matter in a wide corpus, deterministically and without guessing.

Added

  • scan / explain gain --columns C,.. (analyze only these columns) and --exclude C,.. (analyze every column except these). The two are mutually exclusive. Projection is applied before detection and to the baseline as well, so drift comparison stays consistent.
  • Column scoping is explicit, never heuristic. anomalyx will not guess which columns are “interesting” — a silent auto-skip would itself be a guess, and would wrongly drop exactly the near-unique numeric measurements the marquee detectors rely on (packet durationNanos, span durations, latencies). You name the scope; the result stays deterministic and reproducible.
  • An unknown column name in --columns/--exclude on the primary corpus is a hard error (exit 2) — a typo can never silently scope a scan down to nothing and read as “clean”. The baseline is projected leniently (it is a different corpus and need not carry every scoped column).
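
The three rules above (mutual exclusion, strict on the primary corpus, lenient on the baseline) can be sketched as one projection function. `Table` is a plain dict standing in for a RecordSet, and `project` with its `lenient` parameter is a hypothetical helper, not the `RecordSet::select` / `RecordSet::without` API.

```python
from typing import Dict, List, Optional, Sequence

Table = Dict[str, List]  # column name -> values; a stand-in for a RecordSet


def project(table: Table, columns: Optional[Sequence[str]] = None,
            exclude: Optional[Sequence[str]] = None, lenient: bool = False) -> Table:
    """Apply --columns / --exclude style scoping; strict unless lenient (baseline)."""
    if columns is not None and exclude is not None:
        raise ValueError("--columns and --exclude are mutually exclusive")
    named = list(columns if columns is not None else exclude or [])
    unknown = [c for c in named if c not in table]
    if unknown and not lenient:
        # A typo must fail loudly instead of scoping the scan down to nothing.
        raise ValueError(f"unknown column(s): {', '.join(unknown)}")
    if columns is not None:
        return {c: table[c] for c in table if c in columns}  # keeps source order
    return {c: v for c, v in table.items() if c not in named}
```

Calling `project(primary, columns=[...])` and then `project(baseline, columns=[...], lenient=True)` with the same list keeps the drift comparison on matching columns without requiring the baseline to carry all of them.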

Notes

  • This directly tames wide, identifier-heavy corpora. On a real 20k-entry journalctl -o json capture, scan emits ~10k mostly-noise point findings across journald’s many ID/counter/timestamp fields; scan --exclude of those fields (or --columns of the meaningful ones) collapses that to a couple hundred focused findings without touching detector configuration.
  • New RecordSet::select / RecordSet::without projection primitives in ax-core. No envelope or config_version change — column scope is an input-side projection, so the determinism contract is unchanged.

0.2.2 - 2026-05-31

Fixed

  • A plain-text stream that merely starts with [ or { (e.g. an Apache error_log) was grabbed by the JSON parser’s cheap content sniff and then failed with a misleading failed to parse json input. Now a parse failure under a weak (TEXT/FALLBACK) content guess is reported honestly as UnknownFormat — “I don’t recognize this” rather than “your JSON is broken”. A format identified confidently (by file extension, or a MAGIC/STRONG signature) still surfaces a genuine malformed-file parse error as before.
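
The error-reporting rule can be sketched as follows. The confidence names (FALLBACK, TEXT, STRONG, MAGIC) come from the entry above, but their numeric ordering, the `parse_guessed` helper, and both exception classes are assumptions made for illustration.

```python
import json
from enum import IntEnum


class Confidence(IntEnum):
    # Names come from the changelog entry; the numeric ordering is an assumption.
    FALLBACK = 0
    TEXT = 1
    STRONG = 2
    MAGIC = 3


class UnknownFormat(Exception):
    """The input was never confidently identified as any format."""


class ParseError(Exception):
    """The input was identified, but its content is malformed."""


def parse_guessed(parse, data: bytes, confidence: Confidence, by_extension: bool = False):
    """Run a parser and report failure according to how sure the guess was."""
    try:
        return parse(data)
    except ValueError as err:
        if by_extension or confidence >= Confidence.STRONG:
            raise ParseError(str(err)) from err  # confident guess: the file is broken
        raise UnknownFormat("unrecognized input format") from err  # weak guess


def parse_json(data: bytes):
    """Toy JSON parser; json.JSONDecodeError is a subclass of ValueError."""
    return json.loads(data.decode("utf-8"))
```

An Apache error-log line such as `[Mon Jan 01] [error] boom` sniffs weakly as JSON because of its leading bracket, so its failure surfaces as `UnknownFormat`; the same bytes in a file named `*.json` surface as a parse error.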

0.2.1 - 2026-05-31

Fixed

  • describe advertised only the original six input_formats (csv/tsv/ndjson/json/parquet/arrow) — a stale literal that never tracked the 26 parsers added since. It now derives the list from the live parser registry, so it reflects exactly what the build reads (all 32 with default features; fewer under --no-default-features). A guard test asserts describe’s formats equal the registry, so it can’t drift again.

Added

  • anomalyx --version (-V / version) prints the crate version.

0.2.0 - 2026-05-31

Format explosion — anomalyx now normalizes ~30 formats spanning logs, security telemetry, network captures, observability streams, spreadsheets, and data-lake files, all behind the same record-model boundary and detector taxonomy.

Added

  • Logs & observability parsers: logfmt, web access logs (Combined/Common), syslog (RFC 3164/5424), systemd journal (journalctl -o json), Prometheus/OpenMetrics, and OpenTelemetry (OTLP/JSON traces).
  • Security telemetry parsers: CEF/LEEF, Linux auditd, EVTX (Windows Event Log), Suricata/Zeek EVE JSON, osquery results, and AWS CloudTrail.
  • Network parsers: PCAP/PCAPNG (beaconing/C2 via cadence), NetFlow/IPFIX (nfdump CSV), AWS VPC Flow Logs, and DNS query logs (DGA/exfil via point on query-name entropy/length).
  • Structured-data parsers: YAML, TOML/INI, and XML (Nessus/OpenVAS/SOAP).
  • Columnar, data-lake & database parsers: Avro, ORC, Excel/ODS (xlsx/xls/xlsb), and SQLite — joining the existing Parquet/Arrow.
  • Several parsers compute detection features (DNS name entropy/length, flow duration, span durations, normalized epoch timestamps) and rename source fields to a canonical schema.
  • Binary/heavyweight parsers sit behind default-on feature flags (evtx, pcap, xlsx, sqlite, datalake, polars), so --no-default-features is a lean text-only normalizer.

Notes

  • 32 parser plugins total; each ships its own property/exact tests and passes the workspace-wide 0-surviving-mutant gate.

0.1.0 - 2026-05-30

Initial release — a contract-first anomaly-detection CLI over arbitrary corpora.

Added

  • Contract surface (anomalyx): the four discoverable verbs describe, schema, scan, explain; a dense, versioned tq1 JSON envelope with a dictionary-pinned string table and stable evidence handles; committed exit codes (0 clean / 1 anomalies / 2 error); honest absence for detectors that cannot run.
  • Normalization (ax-normalize): CSV, TSV, NDJSON and JSON via a lean deterministic reader; Parquet and Arrow IPC via the Polars backbone (behind the default-on polars feature). Every format is lowered to one engine-independent RecordSet, so detectors never see a Polars type.
  • Detectors (ax-detect) — nine across the full seven-class taxonomy:
    • point.modz — Iglewicz–Hoaglin modified z-score (robust MAD).
    • dist.ks — two-sample Kolmogorov–Smirnov drift.
    • dist.psi — Population Stability Index over baseline-quantile bins.
    • dist.chi2 — chi-square over category frequencies (surfaces new categories).
    • struct.schema — mixed-type and high-null-rate columns; added / dropped / type-changed columns against a baseline.
    • mv.mahalanobis — multivariate Mahalanobis distance (own deterministic Cholesky solve; chi-square p-value).
    • ctx.seasonal — contextual seasonal-subseries modified z-score (--period).
    • coll.cusum — collective CUSUM level-shift detection.
    • cad.regularity — metronomic-cadence (inter-arrival CV) detection (--cadence).
  • Modes: single-corpus scan; --baseline B for distributional drift and schema diff; --period N for seasonal/contextual; --cadence COL for timing.
  • Determinism: order-independent (Neumaier-compensated) reductions, no RNG or wall-clock in the measurement path, and a config-version fingerprint — same input + same fingerprint yields byte-identical output.
  • Validation (ax-validate): the math core is checked against the NIST Statistical Reference Datasets (certified to 15 digits), plus stress tests for ground-truth anomaly recovery and reproducibility at scale.
  • Quality gates: property-based tests (proptest) and a cargo-mutants 0-surviving-mutant gate across the workspace; GitHub Actions CI runs the same gates on every push.
  • Dual-licensed under MIT OR Apache-2.0.
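
The Neumaier-compensated reduction named in the Determinism item above is a standard technique, and its core step can be sketched in Python. This shows the compensation only: `neumaier_sum` is a hypothetical helper, not the ax-core reduction, and whatever anomalyx adds on top to guarantee order independence is not reproduced here.

```python
from typing import Iterable


def neumaier_sum(values: Iterable[float]) -> float:
    """Sum floats while carrying the rounding error of each addition."""
    total = 0.0
    compensation = 0.0
    for x in values:
        t = total + x
        if abs(total) >= abs(x):
            compensation += (total - t) + x  # low-order bits of x were lost
        else:
            compensation += (x - t) + total  # low-order bits of total were lost
        total = t
    return total + compensation


values = [1.0, 1e100, 1.0, -1e100]

naive = 0.0
for v in values:
    naive += v  # plain left-to-right accumulation loses both 1.0 terms

compensated = neumaier_sum(values)
reordered = neumaier_sum(sorted(values))
```

On this input the plain loop returns 0.0 while the compensated sum returns 2.0 in either order, which is the kind of cancellation error the reduction is meant to absorb.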