Input & normalization

“Given any corpus of information regardless of its format, we’ll normalize it.”

anomalyx meets your data where it already lives. Every supported format — whether a packet capture, a SIEM event stream, a Kubernetes manifest, or a data-lake file — is lowered to one engine-independent record model, a RecordSet of named, typed columns, and the detectors only ever see that. The contract stays stable while the backend underneath it changes.

Supported formats

32 built-in parsers across five domains. Each is an independent plugin (crates/ax-normalize/src/parsers/); adding one doesn’t touch the others.

Tabular & structured data

| Format | Extensions | Notes |
|---|---|---|
| CSV / TSV | .csv, .tsv, .tab | lean deterministic reader |
| NDJSON / JSON | .ndjson, .jsonl, .json | array, object, or one-record-per-line |
| YAML | .yaml, .yml | Kubernetes / CI manifests; multi-document |
| TOML / INI | .toml, .ini, .cfg, .conf | config drift via struct.schema |
| XML | .xml, .nessus | Nessus/OpenVAS, SOAP; repeated element → rows |

Columnar, data-lake & databases

| Format | Extensions | Backend |
|---|---|---|
| Parquet | .parquet, .pq | Polars / Arrow |
| Arrow IPC | .arrow, .ipc, .feather | Polars / Arrow |
| Avro | .avro | apache-avro |
| ORC | .orc | orc-rust → Arrow |
| Excel / ODS | .xlsx, .xls, .xlsb, .ods | calamine (first sheet) |
| SQLite | .db, .sqlite, .sqlite3, .db3 | rusqlite (first table, in-memory deserialize) |

Logs & observability

| Format | Detected by | Anomaly angle |
|---|---|---|
| logfmt | key=value shape | structured app logs |
| Web access logs (Combined/Common) | [time] "request" status | status-mix dist, latency point, bursts coll |
| syslog (RFC 3164 / 5424) | <PRI> header | event-rate dist, off-hours contextual |
| systemd journal | journalctl -o json | event-rate cadence/coll, rare-unit dist |
| Prometheus / OpenMetrics | exposition lines | per-series point spikes, dist drift |
| OpenTelemetry (OTLP/JSON) | resourceSpans | span-duration point, error-rate dist, emit cadence |

Security telemetry

| Format | Detected by | Anomaly angle |
|---|---|---|
| Zeek (conn.log family) | #separator header | connection analytics |
| CEF / LEEF | CEF: / LEEF: prefix | signature/category mix shift via dist.chi2 |
| auditd | msg=audit( | exec/syscall mix dist, bursty activity coll |
| EVTX (Windows Event Log) | ElfFile magic | rare event-ID point, logon dist, off-hours contextual |
| Suricata/Zeek EVE | event_type + timestamp | alert-type drift via dist.chi2; new classes surface |
| osquery results | hostIdentifier + columns/snapshot | fleet-posture drift via structural/dist |
| AWS CloudTrail | Records[].eventName | off-hours contextual/cadence, rare-API dist |

Network

| Format | Detected by | Anomaly angle |
|---|---|---|
| PCAP / PCAPNG | libpcap / SHB magic | beaconing/C2 via cadence on inter-arrival times |
| NetFlow / IPFIX (nfdump CSV) | nfdump header | exfil via mv.mahalanobis on (bytes, packets, duration) |
| AWS VPC Flow Logs | srcaddr dstaddr dstport header | same flow anomalies, zero new infra |
| DNS query logs (dnsmasq) | query[TYPE] … from | DGA/exfil via point on name entropy/length + cadence |

Several parsers compute the features the detectors want rather than just extracting fields: DNS query-name Shannon entropy and length, flow duration (end - start), span durationNanos, and normalized epoch timestamps. They also rename cryptic source fields to a canonical schema (e.g. nfdump ibyt → bytes, td → duration).
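As a rough illustration of parse-time feature computation, the sketch below derives the Shannon entropy and length of a DNS query name and applies the nfdump field renames mentioned above. The function names `shannon_entropy` and `canonical_name` are hypothetical and chosen for this example; they are not taken from the anomalyx source, and entropy is assumed here to be measured in bits per character.

```rust
use std::collections::HashMap;

/// Shannon entropy of a string, in bits per character.
/// Hypothetical helper; the real parser may normalize the name differently.
fn shannon_entropy(name: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for ch in name.chars() {
        *counts.entry(ch).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0; // an empty name carries no information
    }
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Map a cryptic nfdump field name to its canonical column name.
/// Only the two renames named in the text are covered; others pass through.
fn canonical_name(field: &str) -> &str {
    match field {
        "ibyt" => "bytes",
        "td" => "duration",
        other => other,
    }
}

fn main() {
    for name in ["example.com", "x7f3k9q2zt1v.example.com"] {
        println!(
            "{name}: length={} entropy={:.3}",
            name.chars().count(),
            shannon_entropy(name)
        );
    }
    println!("ibyt -> {}", canonical_name("ibyt"));
}
```

Computing these values once in the parser means a detector can treat entropy and length as ordinary numeric columns, with no knowledge of DNS.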

Resolution

Format is resolved by file extension first, then by content sniff — binary magic numbers (PAR1, ORC, SQLite format 3\0, …) are checked at high confidence, then distinctive text signatures, then a CSV last-resort fallback. Resolution is deterministic: the highest-confidence match wins, ties break by registration order. An unrecognized stream is an explicit error, never a silent guess.
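The resolution order can be sketched roughly as follows. This is a minimal illustration, not the project's resolver: the `Parser` struct, the numeric confidence scale, the `registry` contents, and the CSV fallback heuristic (valid UTF-8 with a comma on the first line) are all assumptions made for the example. It also assumes that "registration order" means the earlier-registered parser wins a tie.

```rust
/// A content sniffer returns a confidence score, or None when the bytes do not match.
type Sniffer = fn(&[u8]) -> Option<u8>;

struct Parser {
    name: &'static str,
    extensions: &'static [&'static str],
    sniff: Sniffer,
}

// Binary magic numbers: checked at high confidence.
fn sniff_parquet(b: &[u8]) -> Option<u8> { b.starts_with(b"PAR1").then_some(100) }
fn sniff_orc(b: &[u8]) -> Option<u8> { b.starts_with(b"ORC").then_some(100) }
fn sniff_sqlite(b: &[u8]) -> Option<u8> { b.starts_with(b"SQLite format 3\0").then_some(100) }
// Distinctive text signature: lower confidence than a magic number.
fn sniff_cef(b: &[u8]) -> Option<u8> { b.starts_with(b"CEF:").then_some(80) }
// Last-resort fallback (assumed heuristic): UTF-8 text with a comma on line one.
fn sniff_csv(b: &[u8]) -> Option<u8> {
    let text = std::str::from_utf8(b).ok()?;
    let first = text.lines().next()?;
    first.contains(',').then_some(10)
}

/// Hypothetical registry; the order of entries is the registration order.
fn registry() -> Vec<Parser> {
    vec![
        Parser { name: "parquet", extensions: &["parquet", "pq"], sniff: sniff_parquet },
        Parser { name: "orc", extensions: &["orc"], sniff: sniff_orc },
        Parser { name: "sqlite", extensions: &["db", "sqlite", "sqlite3", "db3"], sniff: sniff_sqlite },
        Parser { name: "cef", extensions: &[], sniff: sniff_cef },
        Parser { name: "csv", extensions: &["csv", "tsv", "tab"], sniff: sniff_csv },
    ]
}

fn resolve(reg: &[Parser], ext: Option<&str>, head: &[u8]) -> Result<&'static str, String> {
    // 1. A known file extension decides immediately.
    if let Some(ext) = ext {
        if let Some(p) = reg.iter().find(|p| p.extensions.iter().any(|e| *e == ext)) {
            return Ok(p.name);
        }
    }
    // 2. Otherwise sniff the content and keep the highest-confidence match.
    let mut best: Option<(u8, &Parser)> = None;
    for p in reg {
        if let Some(conf) = (p.sniff)(head) {
            // Strict '>' keeps the earlier-registered parser on a tie.
            if best.map_or(true, |(c, _)| conf > c) {
                best = Some((conf, p));
            }
        }
    }
    // 3. No match is an explicit error, never a guess.
    best.map(|(_, p)| p.name)
        .ok_or_else(|| "unrecognized input format".to_string())
}

fn main() {
    let reg = registry();
    println!("{:?}", resolve(&reg, Some("pq"), b""));
    println!("{:?}", resolve(&reg, None, b"SQLite format 3\0"));
    println!("{:?}", resolve(&reg, None, b"\x89PNG"));
}
```

Because the loop visits parsers in a fixed order and only replaces the current best on a strictly higher score, the same input always resolves to the same parser.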

Several formats deliberately claim no extension (Zeek, syslog content, journald, EVE, osquery, auditd, DNS, NetFlow, VPC) because their files are generically *.log/*.json; pipe them on stdin and the content signature routes them.

Feature flags & the lean build

The binary and heavyweight parsers sit behind default-on feature flags, so a default build reads everything, while a --no-default-features build is a lean, text-only normalizer with no binary dependencies:

| Feature | Parsers |
|---|---|
| polars | Parquet, Arrow IPC |
| evtx | EVTX |
| pcap | PCAP / PCAPNG |
| xlsx | Excel / ODS |
| sqlite | SQLite |
| datalake | Avro, ORC |

The record model

A RecordSet is named columns of equal length, each with an inferred type: Int, Float, Bool, Str, Unknown, or Mixed (conflicting concrete types — itself a structural signal). Values collapse into a small closed set, and absence is explicit: a missing cell is Null, never a sentinel 0.0 that would skew a mean.

amount,tier        →   column "amount": Int   [10, 11, 9, …]
10,a                   column "tier":   Str   ["a", "b", "c", …]
11,b
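The mapping above can be approximated in a few lines. This is a sketch under stated assumptions, not the actual RecordSet implementation: the `Value`, `ColType`, `Column`, and `RecordSet` definitions, the naive comma split (no quoting), and the rule that Int next to Float counts as Mixed rather than widening are all choices made for this illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Value { Null, Int(i64), Float(f64), Bool(bool), Str(String) }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ColType { Int, Float, Bool, Str, Unknown, Mixed }

struct Column { name: String, ty: ColType, values: Vec<Value> }
struct RecordSet { columns: Vec<Column> }

/// Collapse one raw cell into the closed value set; empty means Null.
fn parse_cell(raw: &str) -> Value {
    let s = raw.trim();
    if s.is_empty() { return Value::Null; }
    if let Ok(i) = s.parse::<i64>() { return Value::Int(i); }
    if let Ok(f) = s.parse::<f64>() {
        // Assumption: non-finite floats become Null rather than NaN or inf.
        return if f.is_finite() { Value::Float(f) } else { Value::Null };
    }
    match s {
        "true" => Value::Bool(true),
        "false" => Value::Bool(false),
        _ => Value::Str(s.to_string()),
    }
}

/// Infer a column type: Null cells are ignored, conflicting types yield Mixed.
fn infer_type(cells: &[Value]) -> ColType {
    let mut seen: Option<ColType> = None;
    for v in cells {
        let t = match v {
            Value::Null => continue, // absence carries no type information
            Value::Int(_) => ColType::Int,
            Value::Float(_) => ColType::Float,
            Value::Bool(_) => ColType::Bool,
            Value::Str(_) => ColType::Str,
        };
        match seen {
            None => seen = Some(t),
            Some(prev) if prev == t => {}
            Some(_) => return ColType::Mixed,
        }
    }
    seen.unwrap_or(ColType::Unknown)
}

/// Naive header + rows reader (no quoting), enough to show the model.
fn from_simple_csv(text: &str) -> RecordSet {
    let mut lines = text.lines();
    let header: Vec<&str> = lines.next().unwrap_or("").split(',').collect();
    let mut cols: Vec<Vec<Value>> = vec![Vec::new(); header.len()];
    for line in lines {
        let mut fields = line.split(',');
        for col in cols.iter_mut() {
            // Short rows pad with Null so every column keeps the same length.
            col.push(fields.next().map_or(Value::Null, parse_cell));
        }
    }
    let columns = header
        .iter()
        .zip(cols)
        .map(|(name, values)| Column {
            name: name.trim().to_string(),
            ty: infer_type(&values),
            values,
        })
        .collect();
    RecordSet { columns }
}

fn main() {
    let rs = from_simple_csv("amount,tier\n10,a\n11,b\n9,c\n");
    for c in &rs.columns {
        println!("column {:?}: {:?} {:?}", c.name, c.ty, c.values);
    }
}
```

Keeping Null as its own variant means a mean over "amount" can skip missing cells instead of averaging in a placeholder zero.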

Binary and library-backed formats live entirely behind this boundary: a Polars DataFrame, an Arrow RecordBatch, a calamine sheet, or a SQLite row is converted to a RecordSet (integers fold to i64, floats to f64 with non-finite → Null, unsupported logical types preserved as their string form), so no library type ever reaches a detector. Text formats touch none of it.
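The folding rules in that paragraph might look like the sketch below. `Foreign` is a hypothetical stand-in for a backend's native cell type (it is not a Polars, Arrow, calamine, or rusqlite type), and `Other` assumes the backend can already render an unsupported logical type as a string.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Value { Null, Int(i64), Float(f64), Bool(bool), Str(String) }

/// Hypothetical library-side cell; real backends expose many more variants.
#[derive(Debug)]
enum Foreign {
    Missing,
    I32(i32),
    U16(u16),
    F32(f32),
    F64(f64),
    Bool(bool),
    Utf8(String),
    /// An unsupported logical type, already rendered to text by the backend.
    Other(String),
}

/// Floats fold to f64; NaN and infinities become Null.
fn lower_float(v: f64) -> Value {
    if v.is_finite() { Value::Float(v) } else { Value::Null }
}

/// Convert one library cell into the engine-independent value set.
fn lower(cell: Foreign) -> Value {
    match cell {
        Foreign::Missing => Value::Null,
        // Narrower integers widen losslessly to i64.
        Foreign::I32(v) => Value::Int(i64::from(v)),
        Foreign::U16(v) => Value::Int(i64::from(v)),
        Foreign::F32(v) => lower_float(f64::from(v)),
        Foreign::F64(v) => lower_float(v),
        Foreign::Bool(b) => Value::Bool(b),
        Foreign::Utf8(s) => Value::Str(s),
        // Preserve the string form rather than dropping the cell.
        Foreign::Other(rendered) => Value::Str(rendered),
    }
}

fn main() {
    let cells = vec![
        Foreign::I32(-7),
        Foreign::F64(f64::NAN),
        Foreign::Other("2024-01-01".to_string()),
    ];
    for c in cells {
        println!("{:?}", lower(c));
    }
}
```

Since `lower` is the only place a `Foreign` value is inspected, swapping the backend changes this function and nothing downstream of it.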