Input & normalization

“Given any corpus of information regardless of its format, we’ll normalize it.”

anomalyx meets your data where it already lives. Every supported format — whether a packet capture, a SIEM event stream, a Kubernetes manifest, or a data-lake file — is lowered to one engine-independent record model, a RecordSet of named, typed columns, and the detectors only ever see that. The contract stays stable while the backend underneath it changes.

Supported formats

32 built-in parsers across five domains. Each is an independent plugin (crates/ax-normalize/src/parsers/); adding one doesn’t touch the others.

Tabular & structured data

Format	Extensions	Notes
CSV / TSV	`.csv`, `.tsv`, `.tab`	lean deterministic reader
NDJSON / JSON	`.ndjson`, `.jsonl`, `.json`	array, object, or one-record-per-line
YAML	`.yaml`, `.yml`	Kubernetes / CI manifests; multi-document
TOML / INI	`.toml`, `.ini`, `.cfg`, `.conf`	config drift via `struct.schema`
XML	`.xml`, `.nessus`	Nessus/OpenVAS, SOAP; repeated element → rows

Columnar, data-lake & databases

Format	Extensions	Backend
Parquet	`.parquet`, `.pq`	Polars / Arrow
Arrow IPC	`.arrow`, `.ipc`, `.feather`	Polars / Arrow
Avro	`.avro`	`apache-avro`
ORC	`.orc`	`orc-rust` → Arrow
Excel / ODS	`.xlsx`, `.xls`, `.xlsb`, `.ods`	`calamine` (first sheet)
SQLite	`.db`, `.sqlite`, `.sqlite3`, `.db3`	`rusqlite` (first table, in-memory deserialize)

Logs & observability

Format	Detected by	Anomaly angle
logfmt	`key=value` shape	structured app logs
Web access logs (Combined/Common)	`[time] "request" status`	status-mix `dist`, latency `point`, bursts `coll`
syslog (RFC 3164 / 5424)	`<PRI>` header	event-rate `dist`, off-hours `contextual`
systemd journal	`journalctl -o json`	event-rate `cadence`/`coll`, rare-unit `dist`
Prometheus / OpenMetrics	exposition lines	per-series `point` spikes, `dist` drift
OpenTelemetry (OTLP/JSON)	`resourceSpans`	span-duration `point`, error-rate `dist`, emit `cadence`

Security telemetry

Format	Detected by	Anomaly angle
Zeek (`conn.log` family)	`#separator` header	connection analytics
CEF / LEEF	`CEF:` / `LEEF:` prefix	signature/category mix shift via `dist.chi2`
auditd	`msg=audit(`	exec/syscall mix `dist`, bursty activity `coll`
EVTX (Windows Event Log)	`ElfFile` magic	rare event-ID `point`, logon `dist`, off-hours `contextual`
Suricata/Zeek EVE	`event_type` + `timestamp`	alert-type drift via `dist.chi2`; new classes surface
osquery results	`hostIdentifier` + `columns`/`snapshot`	fleet-posture drift via `structural`/`dist`
AWS CloudTrail	`Records[].eventName`	off-hours `contextual`/`cadence`, rare-API `dist`

Network

Format	Detected by	Anomaly angle
PCAP / PCAPNG	libpcap / SHB magic	beaconing/C2 via `cadence` on inter-arrival times
NetFlow / IPFIX (nfdump CSV)	nfdump header	exfil via `mv.mahalanobis` on (bytes, packets, duration)
AWS VPC Flow Logs	`srcaddr dstaddr dstport` header	same flow anomalies, zero new infra
DNS query logs (dnsmasq)	`query[TYPE] … from`	DGA/exfil via `point` on name entropy/length + `cadence`

Several parsers compute the features the detectors want rather than just extracting fields — DNS query-name Shannon entropy and length, flow duration (end - start), span durationNanos, normalized epoch timestamps — and rename cryptic source fields to a canonical schema (e.g. nfdump ibyt→bytes, td→duration).

Resolution

Format is resolved by file extension first, then by content sniff — binary magic numbers (PAR1, ORC, SQLite format 3\0, …) are checked at high confidence, then distinctive text signatures, then a CSV last-resort fallback. Resolution is deterministic: the highest-confidence match wins, ties break by registration order. An unrecognized stream is an explicit error, never a silent guess.

Several formats deliberately claim no extension (Zeek, syslog content, journald, EVE, osquery, auditd, DNS, NetFlow, VPC) because their files are generically *.log/*.json; pipe them on stdin and the content signature routes them.

Feature flags & the lean build

The binary and heavyweight parsers sit behind default-on feature flags, so a default build reads everything but a --no-default-features build is a lean, text-only normalizer with no binary dependencies:

Feature	Parsers
`polars`	Parquet, Arrow IPC
`evtx`	EVTX
`pcap`	PCAP / PCAPNG
`xlsx`	Excel / ODS
`sqlite`	SQLite
`datalake`	Avro, ORC

The record model

A RecordSet is named columns of equal length, each with an inferred type: Int, Float, Bool, Str, Unknown, or Mixed (conflicting concrete types — itself a structural signal). Values collapse into a small closed set, and absence is explicit: a missing cell is Null, never a sentinel 0.0 that would skew a mean.

amount,tier        →   column "amount": Int   [10, 11, 9, …]
10,a                   column "tier":   Str   ["a", "b", "c", …]
11,b

Binary and library-backed formats live entirely behind this boundary: a Polars DataFrame, an Arrow RecordBatch, a calamine sheet, or a SQLite row is converted to a RecordSet (integers fold to i64, floats to f64 with non-finite → Null, unsupported logical types preserved as their string form), so no library type ever reaches a detector. Text formats touch none of it.

Keyboard shortcuts

anomalyx