tailx
The live system cognition engine.
tailx reimagines tail from “show me lines” to “what’s happening, what matters, and why?”
47,000 log lines → 92 groups → 38 templates → 2 root causes → 1 diagnosis
In 3.1 seconds. Zero config.
What it does
You point tailx at log files or pipe data in. Without any configuration, it:
- Auto-detects the log format — JSON, logfmt, syslog, or unstructured text
- Parses every line — extracts severity, service, trace ID, structured fields
- Fingerprints messages using the Drain algorithm — collapses thousands of repetitive lines into structural templates
- Groups events by template — ranked by severity × frequency × trend
- Detects anomalies — EWMA rate baselines, CUSUM change-point detection, 3σ threshold
- Correlates signals — temporal proximity analysis linking related anomalies
- Outputs the result — colorized terminal display or structured JSON for AI agents
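The anomaly-detection stage named above (EWMA rate baselines with a 3σ threshold) can be illustrated with a minimal sketch. This models the idea, not tailx's Zig implementation; the smoothing factor alpha is an assumption:

```python
# Illustrative sketch: an EWMA tracks the baseline event rate and its
# variance; a sample more than 3 sigma from the pre-update baseline is
# flagged. Not tailx's actual code; alpha=0.3 is an assumed value.
class EwmaBaseline:
    def __init__(self, alpha: float = 0.3):
        self.alpha = alpha
        self.mean = None   # EWMA of the rate
        self.var = 0.0     # EWMA of the squared deviation

    def update(self, rate: float) -> bool:
        """Feed one rate sample; True if it deviates > 3 sigma."""
        if self.mean is None:
            self.mean = rate   # first sample seeds the baseline
            return False
        sigma = self.var ** 0.5
        spike = sigma > 0 and abs(rate - self.mean) > 3 * sigma
        # Standard incremental EWMA mean/variance update.
        diff = rate - self.mean
        incr = self.alpha * diff
        self.mean += incr
        self.var = (1 - self.alpha) * (self.var + diff * incr)
        return spike
```

A rate far outside the learned band (say 450 ev/s against a ~120 ev/s baseline) trips the detector immediately; slower, sustained shifts are what the CUSUM change-point detector is for.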
The proof
We pointed tailx at a production web stack’s logs — 47,000 lines across four services. Without any configuration, rules, or prior knowledge of the system, it identified that a database connection pool exhaustion was the root cause of 71% of all error volume, cascading through the API gateway → payment service → background worker.
Without tailx: manually reading logs, mentally correlating timestamps, recognizing patterns by eye. A 30-minute task for an experienced SRE.
With tailx: one command.
tailx --json -s -n app.log | tail -1
The numbers
| Metric | Value |
|---|---|
| Binary size (stripped) | 144 KB |
| Throughput | 69,000 events/sec |
| Memory (statistical engine) | < 1 MiB |
| Startup time | < 1ms |
| External dependencies | 0 |
| Lines of Zig | 8,347 |
| Tests | 219 |
| Config files required | 0 |
Design principles
- Zero config to start. Point it at a file. It works.
- Local-first. No cloud. No telemetry. No network calls.
- Statistical-first. No LLM in the hot path. Math is fast, deterministic, explainable.
- Zero dependencies. Zig standard library only.
- 144 KB binary. Fits in L2 cache. Starts in microseconds.
Installation
Requirements
- Zig 0.14.0 (no other dependencies)
- Any POSIX system (Linux, macOS)
- No libc required. No runtime. No garbage collector.
Build from source
git clone https://github.com/your-org/tailx.git
cd tailx
zig build -Doptimize=ReleaseSafe
The binary lands in zig-out/bin/tailx. Copy it wherever you like:
cp zig-out/bin/tailx ~/.local/bin/
Build variants
| Mode | Command | Binary size | Notes |
|---|---|---|---|
| Debug | zig build | ~3 MB | Safety checks, slow |
| ReleaseSafe | zig build -Doptimize=ReleaseSafe | 3.1 MB | Safety checks, fast |
| ReleaseSmall | zig build -Doptimize=ReleaseSmall | 144 KB | Stripped, production |
| ReleaseFast | zig build -Doptimize=ReleaseFast | ~2.8 MB | Max speed, no safety |
For production use, ReleaseSafe is recommended. For resource-constrained environments (containers, embedded), ReleaseSmall produces a 144 KB binary that fits in L2 cache.
Run tests
zig build test
This runs all 219 tests across every module: core types, parsers, statistical structures, anomaly detectors, correlation engine, filters, and renderers. All tests pass in under 2 seconds.
Verify installation
tailx --version
# tailx v1.0
tailx --help
# Shows usage, modes, filters, options
No dependencies
tailx uses the Zig standard library exclusively. There are zero external dependencies – no PCRE, no libc (where avoidable), no vendored C code. The entire binary is self-contained.
Quick Start
Basic usage
Tail a file with automatic pattern grouping:
tailx app.log
This follows the file (like tail -f), auto-detects the log format, parses every line, groups events by structural template, and prints a ranked pattern summary when done or periodically during follow mode.
Pipe from stdin
cat app.log | tailx
Any command that produces log lines works:
journalctl -u myservice | tailx
docker logs myapp | tailx
kubectl logs pod/api-7f8b9 | tailx
Multiple files with globs
tailx /var/log/*.log
tailx expands globs, opens all matching files, and merges events across sources. When multiple files are open, each event line is prefixed with the source file path.
Read a full file (no follow)
tailx --from-start --no-follow file.log
# Short form:
tailx -s -n file.log
Reads the entire file from the beginning, processes every line through the full pipeline, prints the events and pattern summary, then exits.
Filter by severity
dmesg | tailx --severity warn
# Short form:
dmesg | tailx -l warn
Only displays events at warn level or above (warn, error, fatal). Events below the threshold are still processed internally – they feed the pattern groups and anomaly detectors. Filtering is display-only.
What the output looks like
In default pattern mode, tailx prints events line-by-line as they arrive, then a pattern summary:
INF [nginx] GET /api/health 200 0.003s
INF [nginx] GET /api/users 200 0.045s
WRN [payments] Connection pool exhausted, waiting
ERR [payments] Connection refused to db-primary:5432
ERR [payments] Transaction failed: connection timeout
INF [nginx] GET /api/health 200 0.002s
──────────────────────────────────────────────────────────────
Pattern Summary 847 events 12 groups 8 templates 4231 ev/s 0.2s
──────────────────────────────────────────────────────────────
✗ [payments] Connection refused to <*> (x34) ↑ rising
⚠ [payments] Connection pool exhausted, waiting (x28) ↑ rising
● [nginx] GET <*> <*> <*> (x612) → stable
● [auth] Token refreshed for user <*> (x89) → stable
● [nginx] GET /api/health <*> <*> (x84) ↓ falling
──────────────────────────────────────────────────────────────
tailx: 847 events, 12 groups, 8 templates, 0 drops
Each group line shows:
- Severity icon: ● info, ⚠ warn, ✗ error, 🔥 fatal
- Service name in brackets (if detected)
- Template with <*> wildcards replacing variable parts
- Count in parentheses
- Trend: ↑ rising, → stable, ↓ falling, or ✨ new
Your First Triage
This walkthrough demonstrates tailx against a typical production web stack — mixed JSON and syslog logs from an API gateway, payment service, database, and background worker.
The command
tailx -s -n app.log api.log db.log worker.log
- -s (--from-start): read from the beginning of each file
- -n (--no-follow): read to EOF and stop (don’t tail)
What happened
In 3.1 seconds, tailx processed 47,000 events across four files:
tailx: 47283 events, 92 groups, 38 templates, 0 drops
That is over 15,000 events per second on a single core, with full parsing, template extraction, grouping, anomaly detection, and correlation.
The pattern summary
The pattern summary ranked 92 groups by severity, frequency, and trend. The top groups told the story immediately:
──────────────────────────────────────────────────────────────
Pattern Summary 47283 events 92 groups 38 templates 15252 ev/s 3.1s
──────────────────────────────────────────────────────────────
✗ [db] connection pool exhausted, <*> connections available (x8241) ↑ rising
✗ [payments] connection timeout to <*> (x6102) ↑ rising
⚠ [worker] retry queue depth exceeding threshold (x2847) ↑ rising
🔥 [payments] circuit breaker opened for <*> (x312) ✨ new
● [api] GET <*> <*> (x18420) → stable
● [auth] token validated for user <*> (x9102) → stable
...
──────────────────────────────────────────────────────────────
The root cause
Look at the top groups. They form a cascade:
1. Database pool exhaustion — the database connection pool hit zero available connections. This is the highest-severity rising group: 8,241 events.
2. Payment service timeouts — with no database connections available, the payment service can’t complete transactions. Downstream calls to Stripe start timing out. 6,102 events.
3. Worker retry storm — failed payments get queued for retry. The retry queue grows past threshold. 2,847 events.
4. Circuit breaker trips — after sustained timeouts, the circuit breaker opens, cutting off all payment processing. 312 events — low count but FATAL severity.
Meanwhile, the healthy traffic continues: API requests (18,420) and auth token validations (9,102) are stable. The problem is isolated to the database → payment → worker path.
One connection pool exhaustion caused 71% of all error volume, cascading through three services.
The “aha” moment
Without tailx, you would read 47,000 lines across four files. Manually. You would notice the timeout messages are frequent. You might eventually connect them to the database errors. After 30 minutes, you might piece together the cascade.
With tailx: one command, 3 seconds, and the ranked pattern summary shows you the cascade directly. The highest-count error groups are all related. The database pool is the root cause. The fix is either increasing pool size, fixing the connection leak, or adding connection timeout limits.
Getting the JSON triage
For programmatic access to the same analysis:
tailx --json -s -n app.log db.log | tail -1
The last line of JSON output is always the triage_summary object — the full structured analysis including stats, top groups, anomalies, hypotheses, and traces. See JSON Output for the full schema.
Modes
tailx has five display modes. The default is pattern mode.
Pattern mode (default)
tailx app.log
Events are printed line-by-line as they arrive. At the end (or periodically every 500 events in follow mode), a ranked pattern summary is displayed showing the top groups by severity, frequency, and trend.
This is the mode you want for most triage work. It answers: “what patterns exist in these logs and which ones matter?”
ERR [payments] Connection refused to db-primary:5432
ERR [payments] Connection refused to db-primary:5432
INF [nginx] GET /api/health 200 0.002s
──────────────────────────────────────────────────────────────
Pattern Summary 847 events 12 groups 8 templates 4231 ev/s 0.2s
──────────────────────────────────────────────────────────────
✗ [payments] Connection refused to <*> (x34) ↑ rising
● [nginx] GET <*> <*> <*> (x612) → stable
──────────────────────────────────────────────────────────────
Raw mode
tailx --raw app.log
Classic tail behavior. Events are printed line-by-line with severity badges and service names, but no pattern summary, no anomaly alerts, no group rankings. The full pipeline still runs internally (parsing, grouping, anomaly detection), but nothing beyond the event lines is displayed.
Use this when you just want to watch logs scroll by with basic formatting.
Trace mode
tailx --trace app.log
Groups events by trace_id and displays them as request flow trees. Each trace shows its events connected with tree connectors, the total duration, and the outcome (success, failure, timeout, or unknown).
TRACE req-abc-123 245ms FAILURE
├─ INF [gateway] Received POST /api/checkout
├─ INF [auth] Token validated for user-42
├─ INF [payments] Processing payment $49.99
├─ ERR [payments] Connection refused to db-primary:5432
└─ ERR [gateway] 500 Internal Server Error
TRACE req-def-456 12ms success
├─ INF [gateway] Received GET /api/health
└─ INF [gateway] 200 OK
(2 traces)
Events without a trace_id are not shown in trace mode. The pattern summary is still displayed at the end.
Incident mode
tailx --incident app.log
Suppresses all normal event output. Only displays:
- Active anomaly alerts (rate spikes, rate drops, change points)
- The pattern summary with top groups
This is the “pager duty” mode. No noise, just the signals that something changed.
!! ANOMALY: rate spike — observed 450.0 vs expected 120.3 (deviation: 4.2)
──────────────────────────────────────────────────────────────
Pattern Summary 47283 events 92 groups 38 templates 15252 ev/s 3.1s
──────────────────────────────────────────────────────────────
✗ [payments] Connection refused to <*> (x1204) ↑ rising
⚠ [payments] Connection pool exhausted (x891) ↑ rising
──────────────────────────────────────────────────────────────
JSON mode
tailx --json app.log
Outputs JSONL (one JSON object per line). Two types of objects:
- Event objects – one per processed event
- Triage summary – always the last line, contains the full analysis
{"type":"event","severity":"ERROR","message":"Connection refused","service":"payments","template_hash":8234567891234}
{"type":"event","severity":"INFO","message":"GET /api/health 200","service":"nginx","template_hash":1234567890123}
{"type":"triage_summary","stats":{...},"top_groups":[...],"anomalies":[...],"hypotheses":[...],"traces":[...]}
JSON mode is designed for machine consumption – pipe it to jq, feed it to an AI agent, or integrate it as an MCP tool. See JSON Output for the full schema.
Filters & Queries
All filters are display-only. Filtered events still feed the pattern groups, anomaly detectors, and correlation engine. This is a deliberate design decision: you always get the full statistical picture, even when displaying a subset.
Filters combine with AND by default. Every clause must match for an event to be displayed.
Severity filter
tailx --severity warn app.log
tailx -l error app.log
Sets a minimum severity threshold. Only events at or above the given level are displayed. Severity levels in order:
| Level | Numeric | Typical meaning |
|---|---|---|
| trace | 0 | Detailed debug tracing |
| debug | 1 | Debug information |
| info | 2 | Normal operations |
| warn | 3 | Potential issues |
| error | 4 | Failures |
| fatal | 5 | Unrecoverable errors |
Example: --severity warn shows warn, error, and fatal events. Debug and info events are hidden but still processed.
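The threshold check reduces to a numeric comparison over the table above. A minimal sketch (illustrative, not tailx's Zig code; `passes_severity` is a hypothetical helper):

```python
# Display-only severity filter: level names and numeric ranks are taken
# from the table above.
LEVELS = {"trace": 0, "debug": 1, "info": 2, "warn": 3, "error": 4, "fatal": 5}

def passes_severity(event_level: str, threshold: str) -> bool:
    """True if the event is at or above the display threshold."""
    return LEVELS[event_level] >= LEVELS[threshold]
```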
Message substring filter
tailx --grep timeout app.log
tailx -g "connection refused" app.log
Filters events whose message contains the given substring. Uses Boyer-Moore-Horspool for fast matching. Case-sensitive.
# Only events mentioning "OOM"
tailx -g OOM app.log
# Combine with severity
tailx -l error -g timeout app.log
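For illustration, here is a minimal Boyer-Moore-Horspool substring search in Python, the algorithm family the --grep filter is described as using. A sketch, not tailx's implementation:

```python
def bmh_contains(haystack: str, needle: str) -> bool:
    """Case-sensitive substring search via Boyer-Moore-Horspool."""
    h, n = haystack.encode(), needle.encode()
    m = len(n)
    if m == 0:
        return True
    # Bad-character table: shift distance for each byte that occurs in
    # the needle (excluding its final byte).
    shift = {b: m - i - 1 for i, b in enumerate(n[:-1])}
    i = m - 1                       # align needle end at position i
    while i < len(h):
        j = 0
        while j < m and h[i - j] == n[m - 1 - j]:
            j += 1                  # compare backwards from the end
        if j == m:
            return True
        i += shift.get(h[i], m)     # skip by the bad-character rule
    return False
```

The skip table lets the search jump over stretches of the haystack that cannot contain the needle, which is why it beats naive scanning on long log lines.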
Service filter
tailx --service payments app.log
Exact match on the service name. The service name is extracted automatically by the parser:
- JSON: from
service,service_name,app,application, orcomponentfields - Syslog: from the app name before the PID (
nginx[1234]->nginx) - Unstructured: from bracketed text (
[PaymentService]->PaymentService)
Trace ID filter
tailx --trace-id req-abc-123 app.log
Exact match on the trace ID field. Combined with --trace mode, this lets you inspect a single request flow:
tailx --trace --trace-id req-abc-123 app.log
Field equality filter
tailx --field status=500 app.log
tailx --field user_id=42 app.log
Matches events with a specific field value. Supports both string and integer comparison – if the field contains an integer and the filter value parses as an integer, numeric comparison is used.
# Filter by HTTP status code
tailx --field status=500 access.log
# Filter by host
tailx --field hostname=web01 app.log
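The string-vs-integer comparison rule can be sketched as follows (illustrative; `field_matches` is a hypothetical helper, not tailx's API):

```python
def field_matches(field_value, filter_value: str) -> bool:
    """--field comparison sketch: numeric compare when the field holds an
    integer and the filter value parses as one; string compare otherwise."""
    if isinstance(field_value, int) and not isinstance(field_value, bool):
        try:
            return field_value == int(filter_value)   # numeric comparison
        except ValueError:
            pass
    return str(field_value) == filter_value           # string comparison
```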
Time window filter
tailx --last 5m app.log
tailx --last 1h app.log
tailx --last 30s app.log
tailx --last 2d app.log
Only displays events from within the given time window relative to now. Supported units:
| Suffix | Unit |
|---|---|
| s | seconds |
| m | minutes |
| h | hours |
| d | days |
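Parsing the window spec is a suffix lookup against the table above. A sketch; the error handling is an assumption:

```python
def parse_window_seconds(spec: str) -> int:
    """Parse a --last window like '5m' or '2d' into seconds."""
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    value, suffix = spec[:-1], spec[-1]
    if suffix not in units or not value.isdigit():
        raise ValueError(f"bad time window: {spec!r}")
    return int(value) * units[suffix]
```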
Combining filters
All filters are ANDed together. An event must pass every filter to be displayed:
# Errors from payments service in the last hour
tailx -l error --service payments --last 1h app.log
# Timeout errors from any service
tailx -l error -g timeout app.log
# Specific field value with severity threshold
tailx -l warn --field region=us-east-1 app.log
Important: filtering does not affect counting
This bears repeating: filtered events are still fully processed. They feed template extraction, pattern grouping, anomaly detection, and correlation. The pattern summary reflects all events, not just displayed ones.
This means you can filter the display to errors while still getting accurate group counts and anomaly detection based on the full event stream.
Intent Queries
Intent queries let you describe what you are looking for in natural language, as a positional argument. If the argument is not an existing file path, tailx treats it as an intent query and translates it into filter predicates.
How it works
tailx "errors related to payments" app.log
tailx tokenizes the query, strips filler words, maps keywords to filters, and applies basic stemming.
The above becomes: severity >= error AND message contains “payment” (stemmed from “payments”).
Examples
Severity keywords
tailx "errors related to payments" app.log
# → severity >= error, message contains "payment"
tailx "warnings from nginx" app.log
# → severity >= warn, service = "nginx"
tailx "5xx from nginx" app.log
# → severity >= error, service = "nginx"
tailx "4xx errors" app.log
# → severity >= warn (4xx maps to warn)
The following words are recognized as severity keywords: error/errors (maps to error), warning/warnings (maps to warn), fatal/critical (maps to fatal), 5xx (maps to error), 4xx (maps to warn).
Service targeting with “from”
tailx "5xx from nginx" app.log
# → severity >= error, service = "nginx"
tailx "errors from payments" app.log
# → severity >= error, service = "payments"
The word from followed by a non-filler word creates a service filter.
You can also use the service: prefix:
tailx "timeouts service:payments" app.log
# → message contains "timeout", service = "payments"
Implicit error detection
tailx "why are payments failing" app.log
Even without explicit severity keywords, certain words imply errors: fail, crash, down, broken, bug. When detected, tailx automatically adds a severity >= error filter.
The above becomes: severity >= error AND message contains “payment” AND message contains “failing”.
Simple keyword search
tailx "timeout" app.log
# → message contains "timeout"
tailx "connection refused" app.log
# → message contains "connection" AND message contains "refused"
Any word that is not a filler word, severity keyword, or service pattern becomes a message substring filter.
Filler words
The following words are stripped from queries before processing:
the, a, an, is, are, was, were, in, on, at, to, for, of, with, and, or, but, not, related, about, why, what, how, when, where, show, me, find, get, all, any, some, that, this, those, requests, logs, events, messages
This means "show me all timeout errors" reduces to: severity >= error, message contains “timeout”.
Basic stemming
Trailing s is removed from keywords longer than 3 characters. This handles simple plurals:
- payments -> payment
- errors -> recognized as severity keyword (not stemmed as a message filter)
- timeouts -> timeout
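Putting the pieces together (filler stripping, severity keywords, "from" targeting, implicit error words, and stemming), the translation can be sketched like this. `parse_intent` is a hypothetical illustration, not tailx's parser, and details such as substring-matching the implicit-error words are assumptions:

```python
# Word lists taken from the sections above.
FILLER = {"the", "a", "an", "is", "are", "was", "were", "in", "on", "at",
          "to", "for", "of", "with", "and", "or", "but", "not", "related",
          "about", "why", "what", "how", "when", "where", "show", "me",
          "find", "get", "all", "any", "some", "that", "this", "those",
          "requests", "logs", "events", "messages"}
SEVERITY = {"error": "error", "errors": "error", "warning": "warn",
            "warnings": "warn", "fatal": "fatal", "critical": "fatal",
            "5xx": "error", "4xx": "warn"}
IMPLY_ERROR = {"fail", "crash", "down", "broken", "bug"}

def parse_intent(query: str) -> dict:
    """Translate an intent query into filter predicates (sketch)."""
    filters = {"severity": None, "service": None, "contains": []}
    words = query.lower().split()
    i = 0
    while i < len(words):
        w = words[i]
        # "from <word>" creates a service filter.
        if w == "from" and i + 1 < len(words) and words[i + 1] not in FILLER:
            filters["service"] = words[i + 1]
            i += 2
            continue
        if w in SEVERITY:
            if filters["severity"] is None:       # first keyword wins
                filters["severity"] = SEVERITY[w]
        elif w.startswith("service:"):
            filters["service"] = w.split(":", 1)[1]
        elif w not in FILLER:
            # Trailing "s" stripped from words longer than 3 characters.
            stem = w[:-1] if len(w) > 3 and w.endswith("s") else w
            filters["contains"].append(stem)
            # Assumed detail: error-implying stems match as substrings.
            if filters["severity"] is None and any(k in w for k in IMPLY_ERROR):
                filters["severity"] = "error"
        i += 1
    return filters
```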
File vs. query detection
tailx checks whether a positional argument is an existing file path. If the file exists, it is opened as a log source. If the file does not exist, it is treated as an intent query.
tailx app.log # file exists → open as source
tailx "timeout errors" app.log # "timeout errors" doesn't exist → intent query
tailx timeout app.log # "timeout" doesn't exist → intent query
Trace Reconstruction
tailx reconstructs request flows by grouping events that share a trace_id. In --trace mode, these are displayed as tree views showing the full lifecycle of each request.
How traces work
When an event has a trace_id field (extracted from JSON, logfmt, or any supported format), tailx assigns it to a trace in the TraceStore. All events with the same trace_id are grouped into a single Trace object.
Trace IDs are detected from these known field names:
- trace_id
- traceId
- trace
- x-trace-id
- request_id
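Detection can be sketched as a first-match lookup over those field names (the precedence order shown here is an assumption):

```python
# Known trace-id field names, from the list above.
TRACE_ID_KEYS = ("trace_id", "traceId", "trace", "x-trace-id", "request_id")

def extract_trace_id(fields: dict):
    """Return the first matching trace-id field value, or None."""
    for key in TRACE_ID_KEYS:
        if key in fields:
            return fields[key]
    return None
```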
Viewing traces
tailx --trace app.log
Each trace is displayed as a tree with connectors showing the event sequence:
TRACE req-abc-123 245ms FAILURE
├─ INF [gateway] Received POST /api/checkout
├─ INF [auth] Token validated for user-42
├─ INF [inventory] Reserved 3 items
├─ INF [payments] Processing payment $49.99
├─ ERR [payments] Connection refused to db-primary:5432
└─ ERR [gateway] 500 Internal Server Error
TRACE req-def-456 12ms success
├─ INF [gateway] Received GET /api/health
└─ INF [gateway] 200 OK
TRACE req-ghi-789 31002ms TIMEOUT
├─ INF [gateway] Received POST /api/export
├─ INF [export] Starting bulk export job
└─ WRN [export] Job still running after 30s
(3 traces)
Trace properties
Each trace tracks:
- trace_id: the explicit ID from the log events
- event_count: number of events in the trace (up to 64 per trace)
- duration: time from the first event to the last event (in milliseconds)
- outcome: determined automatically from the events
Outcome detection
Trace outcomes are determined by the severity of events within the trace:
| Outcome | Condition | Display |
|---|---|---|
| success | No error or fatal events, trace finalized | success (green) |
| failure | Any event with severity >= error | FAILURE (red, bold) |
| timeout | Trace expired without completing | TIMEOUT (yellow, bold) |
| unknown | Trace still active, no errors yet | unknown (dim) |
Outcome escalation is one-way: once a trace sees an error/fatal event, its outcome is permanently set to failure.
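The outcome rules in the table reduce to a small decision function. A sketch, where the finalized/expired flags stand in for tailx's internal trace lifecycle state:

```python
ERROR_RANK = 4  # numeric rank of "error"; fatal ranks higher

def trace_outcome(severity_ranks, finalized: bool, expired: bool) -> str:
    """Determine a trace outcome from its events' severity ranks."""
    if any(r >= ERROR_RANK for r in severity_ranks):
        return "failure"      # one-way escalation: errors always win
    if expired:
        return "timeout"      # trace expired without completing
    if finalized:
        return "success"      # completed cleanly, no errors
    return "unknown"          # still active, no errors yet
```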
Trace lifecycle
- Created when the first event with a given
trace_idis processed - Active while events continue arriving for that
trace_id - Finalized after 30 seconds of inactivity (no new events with that
trace_id)
Finalized traces are moved from the active store (256 slots) to a finalized ring buffer (512 slots). Both active and finalized traces are displayed in --trace mode.
Filtering traces
View a single trace by ID:
tailx --trace --trace-id req-abc-123 app.log
Combine with other filters:
# Only failed traces from payments service
tailx --trace --service payments -l error app.log
Traces in JSON mode
In --json mode, traces appear in the triage_summary object’s traces array. Each trace includes its ID, event count, duration, outcome, and the full list of events with their severity, message, and service. See Triage Summary Schema for details.
JSON Output
The --json flag switches tailx to JSONL output mode. Every line is a valid JSON object. This is the primary integration point for AI agents, scripts, and tooling.
Two object types
1. Event objects
One per processed event, emitted as events arrive:
{
"type": "event",
"severity": "ERROR",
"message": "Connection refused to db-primary:5432",
"service": "payments",
"trace_id": "req-abc-123",
"template_hash": 8234567891234,
"fields": {
"latency_ms": 240,
"hostname": "web01",
"pid": 1234
}
}
Fields present in an event object:
| Field | Type | Always present | Description |
|---|---|---|---|
| type | string | yes | Always "event" |
| severity | string | yes | TRACE, DEBUG, INFO, WARN, ERROR, FATAL, or UNKNOWN |
| message | string | yes | The log message (parsed or raw) |
| service | string | no | Service name, if detected |
| trace_id | string | no | Trace ID, if detected |
| template_hash | integer | no | Drain template hash (omitted if 0) |
| fields | object | no | Extracted structured fields (omitted if empty) |
Field values in the fields object can be strings, integers, floats, booleans, or null.
2. Triage summary
Always the last line of output. Contains the full analysis:
{
"type": "triage_summary",
"stats": {
"events": 47283,
"groups": 92,
"templates": 38,
"drops": 0,
"events_per_sec": 15252.0,
"elapsed_ms": 3100
},
"top_groups": [...],
"anomalies": [...],
"hypotheses": [...],
"traces": [...]
}
The triage summary is the “money shot” for AI integration. It contains everything the engine computed, structured for machine reasoning. See Triage Summary Schema for the full schema.
Usage patterns
Read full file to JSON
tailx --json -s -n app.log
- --json: JSONL output
- -s (--from-start): start at beginning of file
- -n (--no-follow): read to EOF and stop
Get just the triage summary
tailx --json -s -n app.log | tail -1
The last line is always the triage_summary. Use tail -1 to extract it.
Filter events in JSON mode
tailx --json -l error --service payments -s -n app.log
Filters work the same in JSON mode. Only matching events are emitted as event objects, but the triage summary still reflects the full pipeline (all events, not just filtered ones).
Stream processing with jq
# Extract all error messages
tailx --json -s -n app.log | jq -r 'select(.type=="event" and .severity=="ERROR") | .message'
# Get top group exemplars from the triage summary
tailx --json -s -n app.log | tail -1 | jq '.top_groups[].exemplar'
# Count events per service
tailx --json -s -n app.log | jq -r 'select(.type=="event") | .service // "unknown"' | sort | uniq -c | sort -rn
Real triage summary example
From the production log test (47,283 events):
{
"type": "triage_summary",
"stats": {
"events": 47283,
"groups": 92,
"templates": 38,
"drops": 0,
"events_per_sec": 15252.0,
"elapsed_ms": 3100
},
"top_groups": [
{
"exemplar": "Connection pool exhausted, waiting for available connection",
"count": 5765,
"severity": "WARN",
"trend": "rising",
"service": "db"
},
{
"exemplar": "<*> carrier <*> ...",
"count": 4812,
"severity": "WARN",
"trend": "rising",
"service": "NetworkManager"
}
],
"anomalies": [],
"hypotheses": [],
"traces": []
}
Every event goes through the full pipeline
Whether you filter by severity, service, or grep – every event is always:
- Parsed (format detection, field extraction)
- Template-fingerprinted (Drain algorithm)
- Grouped (pattern table)
- Assigned to traces (if trace_id present)
- Fed to anomaly detectors
- Fed to the correlation engine
Filters only control what gets emitted as event objects. The triage summary always reflects the complete picture.
Triage Summary Schema
The triage_summary is always the last line of --json output. It contains everything tailx computed about the log stream, structured for machine consumption.
Top-level structure
{
"type": "triage_summary",
"stats": { ... },
"top_groups": [ ... ],
"anomalies": [ ... ],
"hypotheses": [ ... ],
"traces": [ ... ]
}
stats object
Processing statistics for the entire run.
{
"events": 47283,
"groups": 92,
"templates": 38,
"drops": 0,
"events_per_sec": 15252.0,
"elapsed_ms": 3100
}
| Field | Type | Description |
|---|---|---|
| events | integer | Total events processed |
| groups | integer | Active pattern groups |
| templates | integer | Drain template clusters |
| drops | integer | Events dropped (arena OOM) |
| events_per_sec | float | Processing throughput |
| elapsed_ms | integer | Wall-clock processing time |
top_groups[] array
Up to 20 pattern groups, ranked by score (severity × frequency × trend). Each group represents a cluster of structurally similar log messages.
{
"exemplar": "Connection refused to <*>",
"count": 34,
"severity": "ERROR",
"trend": "rising",
"service": "payments",
"source_count": 3
}
| Field | Type | Always present | Description |
|---|---|---|---|
| exemplar | string | yes | Representative message for this group |
| count | integer | yes | Total event count in this group |
| severity | string | yes | Highest severity seen in the group |
| trend | string | yes | rising, stable, falling, new, or gone |
| service | string | no | Service name, if all events share one |
| source_count | integer | no | Number of distinct sources (omitted if 1) |
Trend values
| Trend | Meaning |
|---|---|
| rising | Rate is increasing compared to previous window |
| stable | Rate is approximately constant |
| falling | Rate is decreasing |
| new | Group appeared in the current window |
| gone | No events in the current window (previously active) |
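A window-over-window rate comparison is enough to produce these labels. A sketch; the 25% tolerance band is an assumed illustrative threshold, not tailx's actual cutoff:

```python
def classify_trend(prev_rate: float, curr_rate: float,
                   tolerance: float = 0.25) -> str:
    """Label a group's trend from two consecutive window rates."""
    if prev_rate == 0:
        return "new" if curr_rate > 0 else "gone"
    if curr_rate == 0:
        return "gone"
    ratio = curr_rate / prev_rate
    if ratio > 1 + tolerance:
        return "rising"
    if ratio < 1 - tolerance:
        return "falling"
    return "stable"
```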
anomalies[] array
Active anomaly alerts from the rate detector and CUSUM detector.
{
"kind": "rate_spike",
"score": 0.823,
"observed": 450.0,
"expected": 120.3,
"deviation": 4.2,
"fire_count": 3
}
| Field | Type | Description |
|---|---|---|
| kind | string | Anomaly type (see table below) |
| score | float | Severity score, 0.0 to 1.0 |
| observed | float | The actual measured value |
| expected | float | The baseline expected value |
| deviation | float | Z-score or normalized deviation |
| fire_count | integer | Number of times this alert has fired |
Anomaly kinds
| Kind | Source | Description |
|---|---|---|
| rate_spike | RateDetector | Event rate significantly above baseline |
| rate_drop | RateDetector | Event rate significantly below baseline |
| change_point_up | CusumDetector | Sustained upward shift in event rate |
| change_point_down | CusumDetector | Sustained downward shift in event rate |
| latency_spike | (reserved) | Latency above baseline |
| distribution_shift | (reserved) | Statistical distribution change |
| cardinality_spike | (reserved) | Sudden increase in unique values |
| new_pattern_burst | (reserved) | Burst of previously unseen templates |
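The change_point_up kind corresponds to a one-sided CUSUM over the event rate. A minimal sketch; the slack k and decision threshold h are assumed illustrative parameters:

```python
class Cusum:
    """One-sided CUSUM change-point sketch for upward rate shifts."""
    def __init__(self, target: float, k: float = 0.5, h: float = 5.0):
        self.target = target   # expected baseline rate
        self.k = k             # slack: tolerated deviation per sample
        self.h = h             # decision threshold
        self.s_hi = 0.0        # accumulated positive deviation

    def update(self, x: float) -> bool:
        # Accumulate deviations above target + k, floored at zero, so
        # small fluctuations decay but sustained shifts pile up.
        self.s_hi = max(0.0, self.s_hi + (x - self.target - self.k))
        return self.s_hi > self.h   # sustained upward shift detected
```

Unlike a 3σ spike check, CUSUM fires on many small consecutive deviations, catching slow drifts a per-sample threshold would miss.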
hypotheses[] array
Causal hypotheses from the correlation engine. Each hypothesis explains an anomaly by linking it to temporally proximate signals.
{
"causes": [
{
"label": "DB latency spike",
"strength": 0.742,
"lag_ms": 5000
},
{
"label": "deploy detected",
"strength": 0.381,
"lag_ms": 15000
}
],
"confidence": 0.742
}
| Field | Type | Description |
|---|---|---|
| causes[] | array | Candidate causes, ordered by strength |
| causes[].label | string | Description of the candidate cause |
| causes[].strength | float | Cause strength, 0.0 to 1.0 (closer in time + higher magnitude = stronger) |
| causes[].lag_ms | integer | Time between cause and effect in milliseconds |
| confidence | float | Overall hypothesis confidence (max cause strength) |
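The strength description above ("closer in time + higher magnitude = stronger") suggests a score like the following. This formula is purely an assumption for illustration; tailx's actual scoring is not specified here:

```python
def cause_strength(lag_ms: int, magnitude: float,
                   max_lag_ms: int = 30_000) -> float:
    """Hypothetical strength score: temporal proximity times magnitude.
    max_lag_ms (the correlation window) is an assumed parameter."""
    if lag_ms < 0 or lag_ms > max_lag_ms:
        return 0.0                              # outside the window
    proximity = 1.0 - lag_ms / max_lag_ms       # 1.0 at zero lag
    mag = min(1.0, magnitude)                   # clamp to [0, 1]
    return proximity * mag
```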
traces[] array
Reconstructed request flows from explicit trace_id matching.
{
"trace_id": "req-abc-123",
"event_count": 5,
"duration_ms": 245,
"outcome": "failure",
"events": [
{
"severity": "INFO",
"message": "Received POST /api/checkout",
"service": "gateway"
},
{
"severity": "ERROR",
"message": "Connection refused to db-primary:5432",
"service": "payments"
}
]
}
| Field | Type | Description |
|---|---|---|
| trace_id | string | The trace identifier |
| event_count | integer | Number of events in this trace |
| duration_ms | integer | Time from first to last event |
| outcome | string | success, failure, timeout, or unknown |
| events[] | array | Events in the trace, in order |
| events[].severity | string | Event severity level |
| events[].message | string | Event message |
| events[].service | string | Service name (if present) |
MCP & Agent Integration
tailx is designed to be a tool for AI agents. The --json output provides structured triage data that an LLM can reason over directly, without parsing raw log text.
The key insight: the AI does not parse logs. tailx parses logs. The AI reasons over structured triage output.
Subprocess integration
The simplest integration is calling tailx as a subprocess and reading the last line of output.
Python example
import subprocess
import json
result = subprocess.run(
["tailx", "--json", "-s", "-n", "--last", "5m", "app.log"],
capture_output=True,
text=True
)
# The last line is always the triage_summary
lines = result.stdout.strip().split("\n")
triage = json.loads(lines[-1])
print(f"Events: {triage['stats']['events']}")
print(f"Groups: {triage['stats']['groups']}")
print(f"Top issue: {triage['top_groups'][0]['exemplar']}")
Shell example
# Get triage summary as JSON
TRIAGE=$(tailx --json -s -n --last 5m app.log | tail -1)
# Extract top group with jq
echo "$TRIAGE" | jq -r '.top_groups[0].exemplar'
MCP tool definition
tailx can be exposed as an MCP (Model Context Protocol) tool. Here is a tool definition:
{
"name": "tailx_triage",
"description": "Analyze log files for patterns, anomalies, and root causes. Returns structured triage with event groups ranked by severity/frequency, anomaly alerts, causal hypotheses, and request traces. Use this when investigating system issues, outages, or performance problems.",
"input_schema": {
"type": "object",
"properties": {
"files": {
"type": "array",
"items": { "type": "string" },
"description": "Log file paths to analyze (e.g., [\"app.log\", \"db.log\"])"
},
"time_window": {
"type": "string",
"description": "How far back to look (e.g., \"5m\", \"1h\", \"30s\")"
},
"severity": {
"type": "string",
"enum": ["trace", "debug", "info", "warn", "error", "fatal"],
"description": "Minimum severity to include in event output"
},
"grep": {
"type": "string",
"description": "Filter events by message substring"
},
"service": {
"type": "string",
"description": "Filter events by service name"
}
},
"required": ["files"]
}
}
MCP tool implementation
def tailx_triage(files, time_window=None, severity=None, grep=None, service=None):
cmd = ["tailx", "--json", "-s", "-n"]
if time_window:
cmd.extend(["--last", time_window])
if severity:
cmd.extend(["--severity", severity])
if grep:
cmd.extend(["--grep", grep])
if service:
cmd.extend(["--service", service])
cmd.extend(files)
result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
lines = result.stdout.strip().split("\n")
# Return just the triage summary for the AI to reason over
return json.loads(lines[-1])
What the AI receives
When an agent calls tailx_triage(files=["app.log"], time_window="5m"), it receives a structured object like:
{
"type": "triage_summary",
"stats": {
"events": 847,
"groups": 12,
"templates": 8,
"drops": 0,
"events_per_sec": 4231.0,
"elapsed_ms": 200
},
"top_groups": [
{
"exemplar": "Connection refused to <*>",
"count": 34,
"severity": "ERROR",
"trend": "rising",
"service": "payments"
}
],
"anomalies": [
{
"kind": "rate_spike",
"score": 0.823,
"observed": 450.0,
"expected": 120.3,
"deviation": 4.2,
"fire_count": 3
}
],
"hypotheses": [
{
"causes": [
{"label": "DB latency spike", "strength": 0.742, "lag_ms": 5000}
],
"confidence": 0.742
}
],
"traces": []
}
The AI can now reason: “The top pattern group is rising connection refused errors from the payments service (34 occurrences). There’s a rate spike anomaly. The correlation engine suggests a DB latency spike 5 seconds earlier as a likely cause.”
Design rationale
Why not have the AI read raw logs?
- Volume: 47,000 lines of logs would consume an entire context window. The triage summary is a few hundred tokens.
- Signal-to-noise: most production logs are repetitive noise. The AI would waste tokens on irrelevant repetition. tailx collapses 47,000 lines into 38 templates.
- Speed: tailx processes 69,000 events/sec. The pipeline runs in seconds, not minutes.
- Determinism: statistical analysis (z-scores, CUSUM, EWMA) is reproducible. LLM pattern matching is not.
- Cost: one subprocess call is effectively free. Feeding 47,000 lines to an LLM costs tokens and time.
The AI’s job is to interpret the structured triage, suggest fixes, and communicate findings to humans – not to count log lines.
Processing Pipeline
Every log line that enters tailx passes through a 12-stage pipeline. The pipeline is synchronous and single-threaded – no locks, no channels, no thread pools.
Pipeline stages
raw bytes
│
├─ 1. ReadBuffer 64 KiB per-source, in-place line splitting
├─ 2. QuickTimestamp Fast timestamp extraction
├─ 3. MultiLineDetector Continuation line detection
├─ 4. Merger Arena-dupe + push to EventRing
├─ 5. FormatDetector Vote on format, lock after 8 samples
├─ 6. Parser dispatch JSON / KV / Syslog / Fallback
├─ 7. SchemaInferer Track field types/frequencies (first 64 events)
├─ 8. DrainTree Template fingerprinting → template_hash
├─ 9. GroupTable Classify into groups, update counts/trend
├─ 10. TraceStore Assign to trace via trace_id
├─ 11. Anomaly tick RateDetector + CusumDetector (every 1s)
├─ 12. Correlation Feed signals, build hypotheses
│
└─ Event (in ring buffer, ready for rendering)
Stage details
1. ReadBuffer
Each file source gets a 64 KiB ReadBuffer. Raw bytes from read() are appended to the buffer. The buffer yields complete lines (terminated by \n), handling \r\n line endings and partial lines across reads. If the buffer fills without a newline, the entire buffer is yielded as a single long line.
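The buffering contract can be sketched in a few lines. This is a minimal illustration of the behavior described above, not the Zig implementation; the `LineBuffer` name and `feed` method are hypothetical:

```python
# Minimal sketch of the ReadBuffer line-splitting contract (names hypothetical):
# append raw read() chunks, yield complete lines, keep the partial tail.
class LineBuffer:
    def __init__(self, capacity=64 * 1024):
        self.capacity = capacity
        self.buf = b""

    def feed(self, chunk: bytes):
        """Append a raw chunk and yield each complete line it completes."""
        self.buf += chunk
        while True:
            nl = self.buf.find(b"\n")
            if nl == -1:
                break
            line = self.buf[:nl]
            self.buf = self.buf[nl + 1:]
            # Handle \r\n endings by stripping the trailing \r.
            yield line.rstrip(b"\r")
        # If the buffer fills without a newline, flush it as one long line.
        if len(self.buf) >= self.capacity:
            line, self.buf = self.buf, b""
            yield line
```

A line split across two reads (`b"hello\r\nwor"` then `b"ld\n"`) yields `b"hello"` from the first feed and `b"world"` from the second.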
2. QuickTimestamp
Before any parsing, QuickTimestamp.extract() does a fast scan for timestamps at the beginning of the line. Supports:
- ISO 8601: 2024-03-15T14:23:01.123Z
- Epoch milliseconds: 1710510181123
- Epoch seconds: 1710510181
If no timestamp is found, the current wall clock time is used.
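The three forms above can be sketched as a prefix scan. This is a simplified illustration (the regexes and the milliseconds-vs-seconds cutoff of 946684800000, i.e. 2000-01-01, are assumptions modeled on the parsing rules described later in this document):

```python
import re
from datetime import datetime, timezone

ISO_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?)Z?")
EPOCH_RE = re.compile(r"^(\d{10,13})\b")

def quick_timestamp(line: str):
    """Return epoch milliseconds for a leading timestamp, else None."""
    m = ISO_RE.match(line)
    if m:
        dt = datetime.fromisoformat(m.group(1)).replace(tzinfo=timezone.utc)
        return int(dt.timestamp() * 1000)
    m = EPOCH_RE.match(line)
    if m:
        n = int(m.group(1))
        # Values above 2000-01-01 in milliseconds are already ms; else seconds.
        return n if n > 946_684_800_000 else n * 1000
    return None
```

In the full pipeline, a `None` result falls back to the current wall clock time.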
3. MultiLineDetector
Checks if a line is a continuation of the previous message (stack traces, indented text). Continuation lines are skipped – they do not become new events. This prevents stack trace frames from inflating event counts.
4. Merger (Ingest)
The raw line is copied into the current arena (EventArena) and an Event struct is pushed onto the EventRing. The event starts with the raw line as its message, the extracted timestamp, and the source ID.
5. FormatDetector
Per-source format detection. Each source has its own FormatDetector that votes on the format based on simple heuristics. After 8 samples, the format locks and all future lines from that source use the same parser.
Detection rules:
- JSON: starts with {, ends with }
- Syslog BSD: starts with <digits>
- CLF: IP followed by - and [date] and "
- Logfmt: 3+ key=value pairs AND has level= AND msg=/message=
- KV pairs: 3+ key=value pairs
- Unstructured: everything else
On tie, the more structured format wins.
6. Parser dispatch
Based on the detected format, one of four parsers extracts structured fields from the raw line:
- JsonParser – hand-written JSON scanner with known field mapping
- KvParser – key=value pair extraction with quoting support
- SyslogBsdParser – PRI, BSD timestamp, hostname, app[pid], message
- FallbackParser – timestamp prefix skip, severity extraction, bracketed service
Each parser populates the event’s severity, message, service, trace_id, and fields.
7. SchemaInferer
Per-source schema inference from the first 64 events. Tracks field names, types, and frequencies. This information is available for downstream consumers (e.g., adaptive parsing).
8. DrainTree
The Drain algorithm extracts a structural template from the event’s message. Variable parts (tokens containing digits, quoted strings) become <*> wildcards. The template is hashed with FNV-1a to produce a template_hash. Events with the same template hash are structurally identical despite different parameters.
9. GroupTable
The event is classified into a pattern group based on its template_hash. The group’s count, severity, trend, and score are updated. Groups are ranked by a composite score of severity, frequency, and trend direction.
10. TraceStore
If the event has a trace_id, it is assigned to an active trace in the TraceStore. The trace tracks event references (ring buffer indices), duration, and outcome. Active traces expire after 30 seconds of inactivity and are moved to the finalized store.
11. Anomaly tick (periodic)
Every 1 second (by wall clock), the pipeline ticks the anomaly detectors:
- RateDetector: feeds the current event rate to a dual EWMA (10s fast, 5min slow) and computes a z-score against historical statistics. Fires if z-score >= 3.0 and absolute delta exceeds threshold.
- CusumDetector: accumulates normalized deviations. Fires on sustained shifts that z-scores miss. 30-tick cooldown after firing.
Detector results are processed by the SignalAggregator (deduplication, resolution, eviction) and fed to the correlation engine.
12. Correlation
Rising groups and anomaly alerts are recorded as CorrelationSignal objects. The TemporalProximity analyzer finds signals that co-occur within a 5-minute window and ranks them by proximity and magnitude to build Hypothesis objects.
Periodic maintenance
Every 60 seconds, the pipeline runs a window rotation:
- GroupTable.windowRotate() – updates trend calculations
- TraceStore.expireSweep() – finalizes inactive traces
- ArenaPool.maybeRotate() – rotates arena generations for bulk memory freeing
Pipeline state
The Pipeline struct owns all mutable state:
- EventRing (ring buffer of events)
- ArenaPool (generation-tagged arena allocators)
- FormatDetector[64] (one per source)
- SchemaInferer[64] (one per source)
- DrainTree (template extraction)
- GroupTable (pattern grouping)
- RateDetector + CusumDetector (anomaly detection)
- SignalAggregator (alert management)
- TraceStore (trace reconstruction)
- TemporalProximity (correlation engine)
All state is allocated once at startup. No allocations occur in the per-event hot path after initialization.
Parsing & Format Detection
tailx auto-detects the log format for each source independently and dispatches to the appropriate parser. No configuration required.
Format detection
The FormatDetector examines lines using simple heuristics. Each source gets its own detector. After 8 samples, the format locks – all subsequent lines from that source use the same parser without re-detection.
Detection rules
| Format | Heuristic |
|---|---|
| JSON | Line starts with { and ends with } (after trimming whitespace) |
| Syslog BSD | Line starts with < followed by digits and > |
| Syslog IETF | Syslog prefix + version digit after > |
| CLF | IP/hostname, then -, then [, then " within first 80 bytes |
| Logfmt | 3+ key=value pairs AND contains level=/lvl= AND msg=/message= |
| KV pairs | 3+ key=value pairs (without logfmt-specific keys) |
| Unstructured | Everything else |
On ties (equal vote counts), the more structured format wins. Structuredness ranking: JSON (6) > logfmt (5) > KV (4) > syslog/CLF (3) > unstructured (0).
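A simplified sketch of these heuristics (it omits CLF, Syslog IETF, and the 8-sample voting/locking machinery, and the regexes are assumptions):

```python
import re

def detect_format(line: str) -> str:
    """Classify a single line using simplified versions of the rules above."""
    s = line.strip()
    if s.startswith("{") and s.endswith("}"):
        return "json"
    if re.match(r"^<\d+>", s):
        return "syslog"
    keys = set(re.findall(r"(\w+)=", s))
    if len(keys) >= 3:
        # Logfmt is KV plus the logfmt-specific level/message keys.
        if ({"level", "lvl"} & keys) and ({"msg", "message"} & keys):
            return "logfmt"
        return "kv"
    return "unstructured"
```

The real detector runs this kind of check per line, tallies votes per source, and locks after 8 samples so later ambiguous lines cannot flip the parser.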
JSON parser
Hand-written scanner (no std.json dependency). Parses objects one key-value pair at a time, mapping known keys to Event fields and collecting the rest into the FieldMap.
Known field mapping
| JSON key | Maps to |
|---|---|
| timestamp, ts, time, @timestamp, datetime, t | event.timestamp |
| level, severity, lvl, loglevel, log_level | event.severity |
| message, msg, log, text, body | event.message |
| trace_id, traceId, trace, x-trace-id, request_id | event.trace_id |
| service, service_name, app, application, component | event.service |
All other keys become entries in the event’s FieldMap with their parsed values.
Value types
The JSON parser handles all JSON value types:
- Strings: extracted with escape sequence handling (\", \\, \n, \r, \t, \uXXXX)
- Integers: parsed as i64
- Floats: parsed as f64
- Booleans: true/false
- Null: null
Timestamp handling
Timestamp values can be:
- String: parsed as ISO 8601 (2024-03-15T14:23:01.123Z)
- Integer > 946684800000: interpreted as epoch milliseconds
- Integer > 946684800: interpreted as epoch seconds
- Float: interpreted as epoch seconds with fractional part
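The thresholds work because 946684800 is the epoch-seconds value of 2000-01-01T00:00:00Z: any integer larger than 946684800000 would be an implausibly distant future date if read as seconds, so it must be milliseconds. A direct sketch of the rule:

```python
def classify_epoch(value):
    """Normalize a numeric JSON timestamp to epoch milliseconds, or None."""
    if isinstance(value, float):
        return int(value * 1000)          # epoch seconds with fractional part
    if value > 946_684_800_000:
        return value                      # already milliseconds
    if value > 946_684_800:
        return value * 1000               # seconds -> milliseconds
    return None                           # too small to be a plausible timestamp
```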
Example
Input:
{"level":"error","msg":"Connection refused","service":"payments","latency_ms":240,"trace_id":"req-001"}
Result:
- event.severity = ERROR
- event.message = "Connection refused"
- event.service = "payments"
- event.trace_id = "req-001"
- event.fields = {"latency_ms": 240}
KV parser
Parses key=value pairs separated by whitespace. Values can be bare words or double-quoted strings.
Known field mapping
Same known keys as the JSON parser. The KV parser also applies:
- Numeric inference: bare values that parse as integers become i64, as floats become f64
- Quote stripping: msg="hello world" extracts hello world
Example
Input:
ts=2024-03-15T14:23:01Z level=error msg="Connection refused" service=payments latency_ms=240
Result:
- event.timestamp = 2024-03-15T14:23:01Z
- event.severity = ERROR
- event.message = "Connection refused"
- event.service = "payments"
- event.fields = {"latency_ms": 240}
Syslog BSD parser
Parses RFC 3164 syslog format. Also handles journalctl output (which omits the PRI).
Format
<PRI>Mon DD HH:MM:SS hostname app[pid]: message
PRI to severity mapping
The PRI value encodes facility and severity per RFC 3164. The severity component (PRI mod 8) maps to:
| PRI mod 8 | Syslog severity | tailx severity |
|---|---|---|
| 0 | Emergency | fatal |
| 1 | Alert | fatal |
| 2 | Critical | fatal |
| 3 | Error | error |
| 4 | Warning | warn |
| 5 | Notice | info |
| 6 | Informational | info |
| 7 | Debug | debug |
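The decode itself is two integer operations: the facility is `PRI // 8` and the severity is `PRI % 8`, which indexes the mapping table above. A direct sketch:

```python
# PRI mod 8 -> tailx severity, per the RFC 3164 mapping table above.
SEVERITY_MAP = ["fatal", "fatal", "fatal", "error", "warn", "info", "info", "debug"]

def decode_pri(pri: int):
    """Split a syslog PRI value into (facility, tailx severity)."""
    return pri // 8, SEVERITY_MAP[pri % 8]
```

For the example below, PRI 134 decodes to facility 16 (local0) and severity "info".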
Fields extracted
- severity: from PRI value, or inferred from message content
- service: from the app name (e.g., nginx from nginx[1234])
- hostname: stored as a field
- pid: stored as a field (integer if parseable)
- message: everything after app[pid]:
Severity inference
If no PRI is present (e.g., journalctl output), the parser infers severity from message content by looking for keywords like error, warn, info, debug, critical, and fatal – both bare and in brackets (e.g., [ERROR]).
Example
Input:
<134>Mar 15 14:23:01 web01 nginx[1234]: GET /api 200 0.012
Result:
- event.severity = INFO (PRI 134 mod 8 = 6 = informational)
- event.service = "nginx"
- event.message = "GET /api 200 0.012"
- event.fields = {"hostname": "web01", "pid": 1234}
Fallback parser
Handles unstructured text logs by extracting what it can.
Extraction order
- Timestamp prefix: skip ISO 8601 or similar date/time prefix
- Severity: look for bare keywords (ERROR, WARN, etc.) or bracketed ([ERROR], [WARN])
- Service: extract from brackets ([PaymentService] -> "PaymentService")
- Message: everything remaining after extraction
Example
Input:
2024-03-15 14:23:01 ERROR [PaymentService] Connection refused to db:5432
Result:
- event.severity = ERROR
- event.service = "PaymentService"
- event.message = "Connection refused to db:5432"
Multi-line detection
Before parsing, the MultiLineDetector checks if a line is a continuation of a previous message (e.g., stack trace frames, indented continuation lines). Continuation lines are skipped and do not create new events.
This prevents a 50-line Java stack trace from becoming 50 separate events – only the first line (the exception) becomes an event.
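A plausible continuation heuristic looks something like the following. This is an illustrative sketch only; the real detector's exact rules are not documented here, and the keyword list is an assumption:

```python
def is_continuation(line: str) -> bool:
    """Guess whether a line continues the previous message (sketch, assumed rules)."""
    if line.startswith((" ", "\t")):
        return True                          # indented continuation / stack frame
    stripped = line.strip()
    # Common Java/Python stack-trace continuation markers.
    return stripped.startswith(("at ", "Caused by:", "..."))
```

Under these rules, a Java exception line starts a new event, while its `at com.example...` frames and `Caused by:` chains attach to it instead of becoming events of their own.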
Drain Template Extraction
Drain is the algorithm that collapses thousands of repetitive log lines into a handful of structural templates. It is the foundation of pattern grouping – without it, every unique log message would be its own group.
The problem
These three log lines are structurally identical:
Connection to 10.0.0.1 timed out after 30s
Connection to 10.0.0.2 timed out after 45s
Connection to 10.0.0.3 timed out after 12s
They differ only in the IP address and timeout duration. A human sees “connection timeout” immediately. Drain teaches tailx to see the same thing.
How it works
1. Tokenize
Split the message by whitespace into tokens.
["Connection", "to", "10.0.0.1", "timed", "out", "after", "30s"]
2. Classify tokens
Each token is classified as either a literal or a wildcard (<*>):
- Contains any digit -> wildcard. This catches IPs, ports, durations, counts, UUIDs, timestamps.
- Quoted string (starts and ends with ") -> wildcard.
- Everything else -> literal.
["Connection", "to", "<*>", "timed", "out", "after", "<*>"]
3. Match against existing clusters
Search existing clusters for one with:
- The same token count
- Similarity >= 0.5 (the sim_threshold)
Similarity is computed as the fraction of positions where both tokens match (both are wildcards, or both are the same literal):
similarity = matching_positions / total_positions
4. Merge or create
If a match is found: merge the new tokens into the existing cluster. Any position where the existing template has a literal but the new line has a different literal gets generalized to <*>.
If no match is found: create a new cluster with the classified tokens.
5. Hash the template
The final template tokens are hashed with FNV-1a to produce a u64 template_hash. All events that map to the same template get the same hash.
"Connection to <*> timed out after <*>" → hash: 0x3a7f...
Example walkthrough
Line 1: Connection to 10.0.0.1 timed out after 30s
Classified: ["Connection", "to", "<*>", "timed", "out", "after", "<*>"]
No existing clusters. Create cluster #0.
Line 2: Connection to 10.0.0.2 timed out after 45s
Classified: ["Connection", "to", "<*>", "timed", "out", "after", "<*>"]
Cluster #0 has 7 tokens, this has 7 tokens. Similarity = 7/7 = 1.0 >= 0.5. Match. All positions agree. Cluster #0 count becomes 2.
Line 3: User logged in from 10.0.0.1 at 14:00
Classified: ["User", "logged", "in", "from", "<*>", "at", "<*>"]
Cluster #0 has 7 tokens, this has 7 tokens. But similarity: position 0 “Connection” vs “User” = mismatch, position 1 “to” vs “logged” = mismatch… similarity < 0.5. No match. Create cluster #1.
Line 4: Error 500 on server web01
Classified: ["Error", "<*>", "on", "server", "<*>"]
Only 5 tokens. Cluster #0 has 7, cluster #1 has 7. Token count mismatch for both. Create cluster #2.
Line 5: Error 404 on server web02
Classified: ["Error", "<*>", "on", "server", "<*>"]
Cluster #2 has 5 tokens, this has 5 tokens. Similarity = 5/5 = 1.0. Match. Cluster #2 count becomes 2.
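The walkthrough above can be reproduced with a compact sketch of the matching loop: digit-bearing or quoted tokens become `<*>`, a line joins the first same-length cluster whose positional similarity is at least 0.5, and mismatched literal positions are generalized on merge. (This is a linear-scan simplification; class and method names are hypothetical.)

```python
def classify(message: str):
    """Tokenize and replace variable-looking tokens with <*>."""
    out = []
    for tok in message.split():
        if any(c.isdigit() for c in tok) or (tok.startswith('"') and tok.endswith('"')):
            out.append("<*>")
        else:
            out.append(tok)
    return out

class DrainSketch:
    def __init__(self, sim_threshold=0.5):
        self.sim_threshold = sim_threshold
        self.clusters = []                      # list of (template_tokens, count)

    def insert(self, message: str) -> int:
        """Return the cluster index the message maps to, creating one if needed."""
        tokens = classify(message)
        for i, (tmpl, count) in enumerate(self.clusters):
            if len(tmpl) != len(tokens):
                continue                        # token count must match exactly
            same = sum(a == b for a, b in zip(tmpl, tokens))
            if same / len(tokens) >= self.sim_threshold:
                # Merge: generalize disagreeing literal positions to <*>.
                merged = [a if a == b else "<*>" for a, b in zip(tmpl, tokens)]
                self.clusters[i] = (merged, count + 1)
                return i
        self.clusters.append((tokens, 1))
        return len(self.clusters) - 1
```

Feeding the five walkthrough lines in order produces clusters #0, #0, #1, #2, #2, with clusters #0 and #2 each at count 2.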
Configuration
The DrainTree is initialized with:
- max_depth: 4 (controls the depth of the classification tree – in this implementation, used as a parameter but matching is linear across clusters)
- sim_threshold: 0.5 (minimum similarity to match an existing cluster)
- max_clusters: 4096 (hard limit on the number of distinct templates)
When the cluster limit is reached, new messages that don’t match an existing cluster are still hashed (from their classified tokens) but don’t create new clusters.
Why these rules work
The “contains any digit -> wildcard” rule is surprisingly effective because most variable parts in log messages contain digits:
- IP addresses: 10.0.0.1
- Ports: 5432
- Durations: 30s, 250ms
- Counts: 42 items
- HTTP status codes: 200, 500
- UUIDs: 550e8400-e29b-41d4-a716-446655440000
- Timestamps: 14:23:01
- PIDs: [1234]
The few variable tokens without digits (usernames, hostnames) may not get wildcarded, but they will either match literally (same user) or cause a new cluster (different user). Over time, if both forms appear, the merge step generalizes the position to <*>.
Template hash
The hash function is FNV-1a over the concatenated template tokens (with space separators). This is a fast, well-distributed hash that produces a u64 – the template_hash stored on every event.
Events with the same template_hash are grouped together in the GroupTable. The hash is the primary grouping key for all downstream analysis.
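FNV-1a over the space-joined tokens is a direct transcription: XOR each byte into the state, multiply by the FNV prime, and mask to 64 bits. The constants below are the standard FNV-1a 64-bit offset basis and prime:

```python
FNV_OFFSET = 0xcbf29ce484222325   # standard FNV-1a 64-bit offset basis
FNV_PRIME = 0x100000001b3         # standard FNV-1a 64-bit prime

def template_hash(tokens) -> int:
    """FNV-1a over the space-joined template tokens, masked to u64."""
    h = FNV_OFFSET
    for byte in " ".join(tokens).encode():
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h
```

Identical templates always hash identically, which is what makes the hash usable as the grouping key.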
Anomaly Detection
tailx uses two complementary anomaly detectors that tick every second. Together they catch both sudden spikes and sustained shifts in event rate.
RateDetector
Dual EWMA (Exponentially Weighted Moving Average) with z-score thresholding.
Architecture
event rate (events/sec)
│
├─ EWMA fast (10s halflife) → "current" rate
├─ EWMA slow (5min halflife) → "baseline" rate
└─ StreamingStats (Welford) → historical mean/variance → z-score
How it works
- Each tick (1 second), the current event count is fed as a sample.
- The sample’s z-score is computed against the running historical statistics (before updating them).
- Both EWMAs are updated with the sample.
- After the warmup period (30 samples), if the z-score >= 3.0 AND the absolute delta between fast and slow EWMA exceeds the minimum threshold (1.0), an anomaly fires.
Spike vs. drop
- z-score >= 3.0: rate_spike – the event rate is significantly above the historical norm.
- z-score <= -3.0 (and baseline > minimum threshold): rate_drop – the event rate has significantly dropped. Only fires when the baseline is meaningful (above the minimum absolute delta).
Warmup
The first 30 samples are used to build the baseline. No anomalies fire during warmup, preventing false positives from cold start.
Score normalization
The raw z-score is mapped to a 0.0 - 1.0 severity score using a logistic-like function:
score = 1.0 - 1.0 / (1.0 + 0.1 * z^2)
This gives:
- z = 3.0 -> score ~0.47
- z = 5.0 -> score ~0.71
- z = 10.0 -> score ~0.91
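The mapping can be checked numerically against the table above:

```python
def normalize_score(z: float) -> float:
    """Map a raw z-score to a 0.0-1.0 severity score (formula from the text)."""
    return 1.0 - 1.0 / (1.0 + 0.1 * z * z)
```

The function is flat near z = 0 and saturates toward 1.0, so routine fluctuations score near zero while extreme deviations are compressed into the top of the range rather than growing unboundedly.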
CusumDetector
Cumulative Sum (CUSUM) change-point detector. Catches sustained shifts that individual z-scores miss.
The problem CUSUM solves
Imagine the event rate gradually climbs from 100/s to 200/s over 30 seconds. No single tick has a z-score >= 3.0 because each increase is small. But the cumulative shift is significant. CUSUM catches this.
How it works
- Each tick, the sample is normalized: (sample - mean) / stddev.
- Two cumulative sums are maintained:
  - s_high: accumulates upward deviations minus an allowance (0.5)
  - s_low: accumulates downward deviations minus the same allowance
- Both sums are clamped to >= 0 (they cannot go negative).
- If s_high exceeds the threshold (5.0 standard deviations), fire change_point_up and reset s_high to 0.
- If s_low exceeds the threshold, fire change_point_down and reset s_low to 0.
Cooldown
After firing, a 30-tick cooldown prevents re-firing on the same shift. This avoids alert storms when a new baseline is establishing.
Score
The CUSUM score is:
score = min(1.0, cumulative_sum / (threshold * 2.0))
Capped at 1.0. Higher cumulative sums (larger or longer shifts) produce higher scores.
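The update rule above is small enough to sketch directly (cooldown omitted for brevity; class and method names are hypothetical):

```python
class CusumSketch:
    """Minimal CUSUM change-point sketch following the rules described above."""
    def __init__(self, allowance=0.5, threshold=5.0):
        self.allowance = allowance
        self.threshold = threshold
        self.s_high = 0.0
        self.s_low = 0.0

    def tick(self, sample, mean, stddev):
        """Feed one sample; return the fired change-point kind or None."""
        z = (sample - mean) / stddev
        # Accumulate deviations minus the allowance, clamped at zero.
        self.s_high = max(0.0, self.s_high + z - self.allowance)
        self.s_low = max(0.0, self.s_low - z - self.allowance)
        if self.s_high > self.threshold:
            self.s_high = 0.0
            return "change_point_up"
        if self.s_low > self.threshold:
            self.s_low = 0.0
            return "change_point_down"
        return None
```

With mean 100 and stddev 10, a sustained rate of 120/s (z = 2 per tick, below any 3σ threshold) accumulates 1.5 per tick and fires on the fourth tick, which is exactly the gradual-shift case single-tick z-scores miss.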
SignalAggregator
The SignalAggregator manages anomaly alerts across both detectors.
Deduplication
If a detector fires with the same method (e.g., rate_spike) as an existing active alert, the existing alert is updated instead of creating a new one:
- last_fired_ns is updated
- fire_count is incremented
- score is set to the max of old and new
Resolution
An active alert transitions to resolved after 30 seconds of not being re-fired. This means the anomalous condition has ended.
Eviction
Resolved alerts are evicted after 5 minutes. This keeps the alert table clean while retaining recent history for the triage summary.
Capacity
The aggregator holds up to 128 alerts simultaneously.
Correlation Engine
The TemporalProximity analyzer connects anomaly signals to possible causes.
Signal sources
Three types of signals feed the correlation engine:
- Anomaly alerts from the RateDetector and CusumDetector
- Rising groups – pattern groups whose trend is rising in the current window
- Rate changes from detector results
Finding causes
For each active anomaly alert, the engine searches for signals that occurred within a 5-minute window before the anomaly. Candidate causes are ranked by:
strength = (1.0 - normalized_lag) * magnitude
Where normalized_lag is the time lag as a fraction of the 5-minute window. Closer signals with higher magnitude rank higher.
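The ranking formula transcribes directly (the 5-minute window in nanoseconds is taken from the text; note the worked example below uses slightly different rounded strengths, so treat this as the idealized formula rather than an exact reproduction):

```python
FIVE_MINUTES_NS = 300_000_000_000

def cause_strength(lag_ns: int, magnitude: float, window_ns: int = FIVE_MINUTES_NS) -> float:
    """Rank a candidate cause: closer in time and larger in magnitude scores higher."""
    normalized_lag = lag_ns / window_ns
    return (1.0 - normalized_lag) * magnitude
```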
Hypothesis building
The ranked causes form a Hypothesis with:
- causes[]: up to 8 candidate causes, ordered by strength
- confidence: the maximum cause strength (a measure of how strongly correlated the top cause is)
Example
t=10s: DB latency spike (anomaly_alert, magnitude=0.8)
t=12s: "Connection refused" group rising (group_spike, magnitude=0.6)
t=15s: Error rate spike (anomaly_alert, magnitude=0.9) ← the effect
The hypothesis for the error rate spike would include:
- DB latency spike (5s lag, strength = 0.73) – closest and high magnitude
- “Connection refused” rising (3s lag, strength = 0.57)
This tells the operator (or AI agent): “The error rate spike is likely related to the DB latency spike that started 5 seconds earlier.”
Statistical Structures
All statistical data structures in tailx are O(1) memory and O(1) per-event update. The total statistical engine uses less than 1 MiB of memory.
CountMinSketch
Probabilistic frequency estimator. Answers “how many times have I seen this key?” without storing every key.
Structure
A depth x width matrix of u32 counters. Each row uses a different hash function (wyhash with different seeds). To estimate the count of a key, hash it with each row’s function, look up the counter, and return the minimum across all rows.
Properties
- Memory: fixed at depth * width * 4 bytes
- Update: O(depth) – hash and increment one counter per row
- Query: O(depth) – hash and read one counter per row, return min
- Error: overestimates only, never undercounts
- Decay: supports multiplicative decay for sliding window expiry
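A minimal sketch of the structure (using Python's seeded `hash` in place of seeded wyhash, which is an assumption made for brevity):

```python
class CountMinSketch:
    """depth x width counter matrix; estimate = min across rows (overestimates only)."""
    def __init__(self, depth=4, width=1024):
        self.depth = depth
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, key, row):
        # Stand-in for wyhash with a per-row seed.
        return hash((row, key)) % self.width

    def add(self, key, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(key, r)] += count

    def estimate(self, key):
        # The minimum across rows bounds the true count from above.
        return min(self.rows[r][self._index(key, r)] for r in range(self.depth))
```

Because counters are only ever shared (never split), collisions inflate estimates but can never shrink them, which is the "overestimates only" guarantee.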
Usage
Used internally for frequency tracking in the pattern grouping layer.
HyperLogLog
Probabilistic cardinality estimator. Answers “how many distinct values have I seen?” using ~16 KiB of memory.
Configuration
- Precision: p = 14
- Registers: 2^14 = 16,384
- Memory: exactly 16,384 bytes (~16 KiB)
- Standard error: ~3%
Algorithm
- Hash the input key with wyhash -> 64-bit hash
- Upper 14 bits select the register index
- Count leading zeros of the remaining bits + 1
- Store the max of (current register value, leading zeros count)
- Estimate: harmonic mean of 2^(-register) values, with bias correction
Merge
Two HyperLogLog sketches merge by taking the register-wise maximum. This makes it composable across sources.
Small range correction
When many registers are still zero, the standard HLL formula overestimates. Linear counting is used instead: m * ln(m / zeros).
TDigest
Streaming percentile estimator. Computes approximate p50, p95, p99 from a stream without storing all values.
Configuration
- Max centroids: 256
- Memory: ~4 KiB (256 centroids x 16 bytes each)
- Compression parameter: 100
How it works
The TDigest maintains a sorted list of (mean, weight) centroids. New values are merged into the nearest centroid, subject to a compression constraint that keeps more centroids at the tails (for accurate extreme percentiles) and fewer in the middle.
Supported queries
- quantile(0.50) – median
- quantile(0.95) – 95th percentile
- quantile(0.99) – 99th percentile
- Any quantile between 0.0 and 1.0
Accuracy
Higher accuracy at the tails (p1, p99) where it matters most for latency monitoring. The compression parameter (100) trades memory for accuracy – higher values retain more centroids.
EWMA
Exponentially Weighted Moving Average. Tracks a smoothed rate that adapts to changes.
Configuration
// Fast EWMA: 10-second halflife, 1-second tick interval
EWMA.initWithHalflife(10 * std.time.ns_per_s, std.time.ns_per_s)
// Slow EWMA: 5-minute halflife, 1-second tick interval
EWMA.initWithHalflife(300 * std.time.ns_per_s, std.time.ns_per_s)
Alpha computation
The smoothing factor alpha is computed from the halflife:
alpha = 1 - exp(-tick_interval / halflife * ln(2))
A 10-second halflife means after 10 seconds, the influence of old values has decayed by 50%.
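The halflife property can be verified directly: with alpha computed this way, the retained weight per tick is 2^(-tick/halflife), so after one halflife's worth of ticks exactly 50% of the old value's influence remains. A sketch (the `Ewma` class is illustrative, not the Zig API):

```python
import math

def alpha_from_halflife(halflife_s: float, tick_s: float = 1.0) -> float:
    """alpha = 1 - exp(-tick/halflife * ln 2), per the formula above."""
    return 1.0 - math.exp(-tick_s / halflife_s * math.log(2))

class Ewma:
    def __init__(self, halflife_s: float, tick_s: float = 1.0):
        self.alpha = alpha_from_halflife(halflife_s, tick_s)
        self.value = None

    def update(self, sample: float) -> float:
        # Standard EWMA recurrence; the first sample seeds the average.
        if self.value is None:
            self.value = sample
        else:
            self.value += self.alpha * (sample - self.value)
        return self.value
```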
Time-weighted updates
The EWMA handles irregular update intervals by adjusting the effective alpha based on the actual elapsed time since the last update. This prevents drift when ticks are not perfectly regular.
Dual EWMA in anomaly detection
The RateDetector uses two EWMAs:
- Fast (10s halflife): tracks the “current” rate – responds quickly to changes
- Slow (5min halflife): tracks the “baseline” – represents the normal rate
When the fast EWMA diverges significantly from the slow EWMA, something has changed.
StreamingStats
Welford’s online algorithm for running mean, variance, standard deviation, and z-score.
What it computes
- Mean: running average
- Variance: running population variance
- Standard deviation: sqrt(variance)
- Z-score: (value - mean) / stddev
Properties
- Single-pass, numerically stable
- O(1) memory (stores count, mean, M2)
- O(1) per update
- No stored samples – cannot compute percentiles (use TDigest for that)
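Welford's recurrence is short enough to show in full; this sketch mirrors the (count, mean, M2) state described above:

```python
import math

class StreamingStats:
    """Welford's online algorithm: O(1) state, single pass, numerically stable."""
    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0      # sum of squared deviations from the running mean

    def push(self, x: float):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        return self.m2 / self.count if self.count else 0.0   # population variance

    def zscore(self, x: float) -> float:
        sd = math.sqrt(self.variance())
        return (x - self.mean) / sd if sd > 0 else 0.0
```

For the samples 2, 4, 4, 4, 5, 5, 7, 9 this gives mean 5, population variance 4, and z-score 2.0 for a new sample of 9.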
Usage
Used by both the RateDetector and CusumDetector to compute z-scores of event rate samples against their historical distribution.
TimeWindow
Circular bucket array for time-bucketed statistics.
Structure
TimeWindow {
buckets: []Bucket, // circular array
bucket_count: u16, // number of buckets
duration_ns: i128, // total window span
bucket_duration_ns: i128, // duration per bucket
head: u16, // current bucket index
}
Each Bucket stores:
- count: number of records
- sum: sum of values
- min: minimum value
- max: maximum value
- start_ns: bucket start time
Operations
- advance(now_ns): advance the head to the bucket covering now_ns, clearing expired buckets
- record(value): add a value to the current bucket
- rate(): compute the overall rate across all buckets
Usage
Used for time-windowed rate calculations and trend detection in the pattern grouping layer.
Memory budget
| Structure | Size | Count | Total |
|---|---|---|---|
| CountMinSketch (per instance) | depth x width x 4 bytes | varies | < 64 KiB |
| HyperLogLog | 16,384 bytes | 1 | 16 KiB |
| TDigest | ~4 KiB | varies | < 16 KiB |
| EWMA | 48 bytes | 2 (rate detector) | 96 bytes |
| StreamingStats | 32 bytes | 2 (detectors) | 64 bytes |
| TimeWindow | varies by bucket count | varies | < 32 KiB |
| Total statistical engine | | | < 1 MiB |
CLI Reference
tailx [OPTIONS] [FILES...] [QUERY]
tailx processes log files or stdin, auto-detects formats, extracts structure, groups patterns, detects anomalies, and outputs results to the terminal or as JSON.
Modes
(default) – Pattern mode
tailx app.log
Events are printed line-by-line with severity badges and service names. A ranked pattern summary is displayed at the end (batch mode) or every 500 events (follow mode). This is the mode for interactive triage.
--raw
tailx --raw app.log
Classic tail output. Events are printed with basic formatting (severity badge, service name, message) but no pattern summary, no anomaly alerts, no group rankings. The full pipeline still runs internally.
--trace
tailx --trace app.log
Groups events by trace_id and displays them as tree views with duration and outcome. Events without a trace_id are not shown. The pattern summary is still displayed at the end.
--incident
tailx --incident app.log
Suppresses normal event output. Only displays active anomaly alerts and the pattern summary. Use this for alerting and on-call scenarios where you only want to see signals.
--json
tailx --json app.log
Outputs JSONL (one JSON object per line). Event objects are emitted as events arrive. The triage summary is always the last line. Designed for AI agents and scripts.
Filters
-l, --severity <level>
tailx --severity warn app.log
tailx -l error app.log
Minimum severity threshold for display. Valid levels: trace, debug, info, warn, error, fatal.
Events below the threshold are still processed by the pipeline – filtering is display-only.
-g, --grep <string>
tailx --grep timeout app.log
tailx -g "connection refused" app.log
Filter events whose message contains the given substring. Uses Boyer-Moore-Horspool for fast matching. Case-sensitive.
--service <name>
tailx --service payments app.log
tailx --service nginx app.log
Filter events by exact service name match. The service is auto-detected from the log format (JSON service key, syslog app name, bracketed text in unstructured logs).
--trace-id <id>
tailx --trace-id req-abc-123 app.log
Filter events by exact trace ID match. Best combined with --trace mode to inspect a single request flow.
--field <key=value>
tailx --field status=500 app.log
tailx --field hostname=web01 app.log
tailx --field user_id=42 app.log
Filter events by field value. Supports string and integer comparison – if the event field is an integer and the filter value parses as an integer, numeric comparison is used.
--last <duration>
tailx --last 5m app.log
tailx --last 1h app.log
tailx --last 30s app.log
tailx --last 2d app.log
Only display events from within the given time window. Supported suffixes: s (seconds), m (minutes), h (hours), d (days).
Options
-f, --follow
tailx -f app.log
tailx --follow app.log
Follow files for new data (default behavior). tailx uses poll() to efficiently wait for new data. Detects file truncation (copytruncate) and rotation (new inode at same path).
-n, --no-follow
tailx -n app.log
tailx --no-follow app.log
Read to EOF and stop. Do not wait for new data. Use this for batch analysis of complete files.
-s, --from-start
tailx -s app.log
tailx --from-start app.log
Start reading from the beginning of the file. By default, tailx seeks to the end and only shows new data (like tail -f). Combine with -n for full file analysis:
tailx -s -n app.log
--no-color
tailx --no-color app.log
Disable ANSI color codes in output. Color is also automatically disabled when stdout is not a terminal (piped to a file or another command) or when using --json mode.
--ring-size <n>
tailx --ring-size 131072 app.log
Set the event ring buffer capacity. Default: 65536 (64K events). Must be a power of 2 for efficient bitwise modulo indexing. Larger values retain more history but use more memory.
-h, --help
tailx --help
Display usage information with all modes, filters, options, and examples.
-V, --version
tailx --version
# tailx v1.0
Display the version string.
Positional arguments
Files
tailx app.log
tailx /var/log/*.log
tailx access.log error.log
One or more file paths. Glob patterns (*, ?) are expanded. Multiple files are merged into a single event stream, with source names displayed when more than one file is open.
Intent queries
tailx "errors related to payments" app.log
tailx "5xx from nginx" app.log
tailx "timeout" app.log
If a positional argument is not an existing file path, it is treated as a natural language intent query. Keywords are mapped to filters (severity thresholds, service names, message substrings). See Intent Queries.
Stdin
cat app.log | tailx
journalctl -u myservice | tailx
dmesg | tailx --severity warn
When no files are specified and stdin is not a terminal, tailx reads from stdin. All modes and filters work with stdin input.
Examples
# Tail a file with pattern grouping
tailx app.log
# Full file analysis
tailx -s -n app.log
# Only errors from the payments service
tailx -l error --service payments app.log
# Kernel warnings from dmesg
dmesg | tailx -l warn
# Anomaly-only view across multiple files
tailx --incident *.log
# Trace a specific request
tailx --trace --trace-id req-abc-123 app.log
# JSON output for AI consumption
tailx --json -s -n --last 5m app.log
# Natural language query
tailx "why are payments failing" app.log
# Multiple files with severity filter
tailx -l warn access.log error.log system.log
Supported Formats
tailx auto-detects the log format for each source independently. Detection locks after 8 samples. No configuration required.
JSON / JSONL
{"level":"error","msg":"Connection refused","service":"payments","latency_ms":240,"trace_id":"req-001"}
Detection: line starts with { and ends with } (after trimming whitespace).
Known field keys
| JSON key | Maps to |
|---|---|
| timestamp, ts, time, @timestamp, datetime, t | event timestamp |
| level, severity, lvl, loglevel, log_level | event severity |
| message, msg, log, text, body | event message |
| trace_id, traceId, trace, x-trace-id, request_id | event trace_id |
| service, service_name, app, application, component | event service |
All other keys become structured fields on the event. Values are parsed as their JSON types: strings, integers (i64), floats (f64), booleans, and null.
Timestamp handling
- String values: parsed as ISO 8601
- Integers > 946684800000: epoch milliseconds
- Integers > 946684800: epoch seconds
- Floats: epoch seconds with fractional part
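The integer heuristic above works because 946684800 is the Unix epoch value for 2000-01-01T00:00:00Z, so any integer above 946684800000 is far too large to be a plausible post-2000 epoch-seconds value. A minimal sketch of that rule (illustrative Python, not tailx's Zig code):

```python
# Heuristic for numeric timestamps, as described above.
# 946_684_800 is the Unix epoch-seconds value for 2000-01-01T00:00:00Z.
EPOCH_2000_S = 946_684_800
EPOCH_2000_MS = 946_684_800_000

def parse_numeric_timestamp(value):
    """Return epoch seconds as a float, following the rules above."""
    if isinstance(value, float):
        return value              # epoch seconds with fractional part
    if value > EPOCH_2000_MS:
        return value / 1000.0     # too large for seconds: epoch milliseconds
    if value > EPOCH_2000_S:
        return float(value)       # epoch seconds
    raise ValueError("too small to be a post-2000 epoch timestamp")

# 2024-03-15T14:23:01Z in both encodings resolves to the same instant:
assert parse_numeric_timestamp(1710512581000) == parse_numeric_timestamp(1710512581)
```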
Logfmt
ts=2024-03-15T14:23:01Z level=error msg="Connection refused" service=payments latency_ms=240
Detection: 3+ key=value pairs AND contains level=/lvl= AND msg=/message=.
Same known field keys as JSON. Values can be bare words (level=error) or double-quoted strings (msg="hello world"). Bare values that parse as numbers are stored as integers or floats.
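The tokenization described above can be sketched as follows (an illustrative Python approximation, not tailx's actual parser; the regex and the `parse_logfmt` name are assumptions for the example):

```python
import re

# key=value pairs: either a double-quoted string or a bare word.
PAIR = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')

def parse_logfmt(line):
    out = {}
    for m in PAIR.finditer(line):
        key = m.group(1)
        if m.group(2) is not None:
            out[key] = m.group(2)     # quoted string, taken verbatim
        else:
            raw = m.group(3)          # bare word: apply numeric inference
            try:
                out[key] = int(raw)
            except ValueError:
                try:
                    out[key] = float(raw)
                except ValueError:
                    out[key] = raw
    return out
```

For the example line above, latency_ms=240 is stored as the integer 240 while msg="Connection refused" keeps its quoted string value.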
Syslog BSD (RFC 3164)
<134>Mar 15 14:23:01 web01 nginx[1234]: GET /api 200 0.012
Detection: line starts with < followed by digits and >.
PRI decoding
The PRI value (0-191) encodes facility and severity. Severity = PRI mod 8:
| PRI mod 8 | Syslog severity | tailx severity |
|---|---|---|
| 0 | Emergency | fatal |
| 1 | Alert | fatal |
| 2 | Critical | fatal |
| 3 | Error | error |
| 4 | Warning | warn |
| 5 | Notice | info |
| 6 | Informational | info |
| 7 | Debug | debug |
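The decode itself is two integer operations: facility = PRI div 8, severity = PRI mod 8. A sketch of the table above (illustrative Python; the function name is an assumption):

```python
# Index i holds the tailx severity for syslog severity i, per the table above.
SYSLOG_TO_TAILX = ["fatal", "fatal", "fatal", "error", "warn", "info", "info", "debug"]

def decode_pri(pri):
    facility, severity = divmod(pri, 8)
    return facility, SYSLOG_TO_TAILX[severity]

# <134> = facility 16 (local0), syslog severity 6 (Informational) -> info
assert decode_pri(134) == (16, "info")
```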
Extracted fields
- severity: from PRI, or inferred from message keywords
- service: from the app name (nginx from nginx[1234])
- hostname: stored as a structured field
- pid: stored as a structured field (integer if parseable)
- message: everything after app[pid]:
Journalctl compatibility
Journalctl output omits the PRI prefix but follows the same BSD syslog structure:
Mar 15 14:23:01 web01 nginx[1234]: GET /api 200 0.012
The parser handles this by treating the PRI as optional. When no PRI is present, severity is inferred from message content (keywords like error, warn, [ERROR], etc.).
Key-Value pairs
host=db01 cpu=0.85 memory=0.72 disk=0.45
Detection: 3+ key=value pairs (without the logfmt-specific level= and msg= keys).
Same known field keys as JSON. Values are bare words or quoted strings. Numeric inference applies to bare values.
CLF (Common Log Format)
10.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326
Detection: IP/hostname, then -, then [, then " within first 80 bytes.
CLF lines are parsed by the fallback parser, which extracts what it can from the structure.
Unstructured text
2024-03-15 14:23:01 ERROR [PaymentService] Connection refused to db:5432
Detection: everything that does not match the above formats.
The fallback parser extracts:
- Timestamp prefix: ISO 8601 or similar date/time at the start of the line (skipped)
- Severity: bare keywords (ERROR, WARN, INFO, DEBUG, TRACE, FATAL) or bracketed forms ([ERROR], [WARN])
- Service: text in brackets ([PaymentService])
- Message: the remainder after extracting the above
Format mixing
Different sources can have different formats. A single tailx invocation can process JSON from one file and syslog from another:
tailx app.log api.json.log
Each source locks to its detected format independently after 8 lines.
Detection priority on ties
When two formats have equal votes after 8 samples, the more structured format wins:
- JSON / JSONL (highest priority)
- Logfmt
- Key-Value pairs
- Syslog BSD / Syslog IETF / CLF
- Unstructured (lowest priority)
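The tie-break can be sketched as a vote over the first 8 samples with a fixed priority order (illustrative Python; the format names and the `pick_format` helper are assumptions, not tailx internals):

```python
# Lower index = more structured = wins ties.
PRIORITY = ["json", "logfmt", "keyvalue", "syslog", "clf", "unstructured"]

def pick_format(votes):
    """votes: dict mapping format name -> vote count over the 8 samples."""
    best = max(votes.values())
    tied = [fmt for fmt, n in votes.items() if n == best]
    return min(tied, key=PRIORITY.index)

assert pick_format({"json": 4, "logfmt": 4}) == "json"        # tie: JSON wins
assert pick_format({"syslog": 5, "unstructured": 3}) == "syslog"
```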
Performance
tailx is built for speed. The entire processing pipeline – parsing, template extraction, grouping, anomaly detection, correlation – runs in the per-event hot path with zero heap allocation after initialization.
Throughput
| Metric | Value |
|---|---|
| End-to-end throughput | 69,000 events/sec (single core) |
| Measured on | 47,000 mixed-format lines in 3.1s |
| Full pipeline | parse + Drain + group + trace + anomaly + correlation |
This is not a synthetic benchmark. It is measured throughput on real production log data through the complete 12-stage pipeline.
Binary size
| Build mode | Size |
|---|---|
| ReleaseSmall (stripped) | 144 KB |
| ReleaseSafe | 3.1 MB |
The 144 KB ReleaseSmall binary fits in L2 cache on most modern CPUs. It contains zero external dependencies – no PCRE, no libc (where avoidable), no vendored C code.
Startup time
Cold start is under 1 millisecond. There is no runtime to initialize, no JIT to warm up, no garbage collector to configure. The first event is processed within microseconds of launch.
Memory
Event storage
| Structure | Memory | Notes |
|---|---|---|
| Event struct | 256 bytes | Fixed size, cache-line friendly |
| EventRing (default) | 16 MB | 65,536 events x 256 bytes |
| ArenaPool | 64 MB max | 16 arenas x 4 MB, generation-tagged |
The EventRing uses power-of-2 capacity for bitwise modulo indexing (index & (capacity - 1) instead of index % capacity). This eliminates a division instruction in the per-event hot path.
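The identity only holds when capacity is a power of two, which is why --ring-size enforces that constraint. A quick sketch of the equivalence (illustrative Python, not the Zig implementation):

```python
# With capacity = 2^k, the low k bits of the sequence number are exactly
# the wrapped ring index, so a bitwise AND replaces the modulo division.
CAPACITY = 65_536                      # default ring size; must be a power of 2

def ring_index(sequence):
    return sequence & (CAPACITY - 1)   # identical to sequence % CAPACITY

for seq in (0, 65_535, 65_536, 1_000_000):
    assert ring_index(seq) == seq % CAPACITY
```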
Statistical engine
| Structure | Memory |
|---|---|
| CountMinSketch | < 64 KiB |
| HyperLogLog | 16 KiB (exactly 16,384 registers) |
| TDigest | ~4 KiB (256 centroids) |
| EWMA (x2) | 96 bytes |
| StreamingStats (x2) | 64 bytes |
| TimeWindow | < 32 KiB |
| Total | < 1 MiB |
Pattern grouping
| Structure | Memory | Notes |
|---|---|---|
| GroupTable | scales with unique templates | typically 1-5 MiB |
| DrainTree | 4,096 cluster slots | fixed allocation |
Anomaly detection
| Structure | Memory |
|---|---|
| RateDetector | ~200 bytes |
| CusumDetector | ~200 bytes |
| SignalAggregator (128 slots) | ~32 KiB |
Trace reconstruction
| Structure | Memory |
|---|---|
| TraceStore active (256 traces) | ~512 KiB |
| TraceStore finalized (512 traces) | ~1 MiB |
Correlation
| Structure | Memory |
|---|---|
| TemporalProximity (256 signals) | ~64 KiB |
Allocation strategy
tailx uses three allocation strategies:
- Arena allocation for event data (messages, fields, strings). Generation-tagged arenas allow bulk free on window expiry. Zero per-event free calls.
- General-purpose allocation for long-lived singletons (EventRing, DrainTree, GroupTable, TraceStore). Allocated once at startup, freed at shutdown.
- Stack allocation for small fixed-size buffers (< 4 KiB). No heap involvement.
After initialization, the per-event hot path performs zero heap allocations. All event data is copied into the current arena, which is a bump allocator (pointer increment only).
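A minimal sketch of a generation-tagged bump arena, to make the "bulk free" idea concrete (illustrative Python; the `Arena` class and its handle format are assumptions, not tailx's implementation):

```python
class Arena:
    """Bump allocator: alloc is a pointer increment, free is a generation bump."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.offset = 0
        self.generation = 0

    def alloc(self, data: bytes):
        """Copy data into the arena; return a (generation, start, len) handle."""
        start = self.offset
        if start + len(data) > len(self.buf):
            raise MemoryError("arena full")
        self.buf[start:start + len(data)] = data
        self.offset += len(data)
        return (self.generation, start, len(data))

    def reset(self):
        """Bulk free on window expiry: one pointer reset plus a generation bump."""
        self.offset = 0
        self.generation += 1

    def get(self, handle):
        gen, start, n = handle
        if gen != self.generation:
            raise ValueError("stale handle: arena was recycled")
        return bytes(self.buf[start:start + n])
```

The generation tag is what makes the bulk free safe: after reset(), every handle issued against the old generation fails loudly instead of silently reading recycled memory.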
Per-operation targets
| Operation | Target | Achieved |
|---|---|---|
| Event struct size | 256 bytes | 256 bytes |
| EventRing push+get | 1M events correct | Tested |
| Drain template extraction | 0.5 us/line | On target |
| Filter evaluation (3 predicates) | 100 ns/event | On target |
| Group classify (hash lookup) | O(1) | O(1) |
| Anomaly detector tick | 10 ms/tick | < 1 ms |
| Correlation engine tick | 10 ms/tick | < 1 ms |
What makes it fast
- No GC, no runtime: Zig compiles to native code with no runtime overhead. No stop-the-world pauses.
- Arena allocation: event data is bump-allocated (pointer increment). No per-event malloc/free.
- Power-of-2 ring buffer: bitwise AND instead of modulo division for index wrapping.
- Fixed-size Event struct: 256 bytes, fits in 4 cache lines. No pointer chasing for common fields.
- Boyer-Moore-Horspool: substring search for --grep uses a bad-character table for O(n/m) average-case matching.
- FNV-1a template hash: fast, well-distributed hash for template fingerprinting.
- Inline everything: hot path functions are small enough for the compiler to inline. No virtual dispatch.
- No external dependencies: the entire binary is self-contained Zig code. No FFI overhead, no dynamic linking.
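The FNV-1a hash mentioned above is a fold of XOR-then-multiply over each byte. A sketch of the 64-bit variant (illustrative Python; whether tailx uses the 32- or 64-bit variant is an assumption here, and the constants are the standard published FNV-1a offset basis and prime):

```python
FNV_OFFSET = 0xcbf29ce484222325   # standard 64-bit FNV-1a offset basis
FNV_PRIME = 0x100000001b3         # standard 64-bit FNV prime

def fnv1a_64(data: bytes) -> int:
    h = FNV_OFFSET
    for byte in data:
        h ^= byte                                # XOR the byte in first...
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF # ...then multiply, mod 2^64
    return h

# The empty input hashes to the offset basis by definition:
assert fnv1a_64(b"") == FNV_OFFSET
```

XOR-before-multiply (the "1a" ordering) gives better avalanche behavior on short keys than the original FNV-1, which matters for short template fingerprints.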