Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

tailx

The live system cognition engine.

tailx reimagines tail from “show me lines” to “what’s happening, what matters, and why?”

47,000 log lines → 92 groups → 38 templates → 2 root causes → 1 diagnosis
In 3.1 seconds. Zero config.

What it does

You point tailx at log files or pipe data in. Without any configuration, it:

  1. Auto-detects the log format — JSON, logfmt, syslog, or unstructured text
  2. Parses every line — extracts severity, service, trace ID, structured fields
  3. Fingerprints messages using the Drain algorithm — collapses thousands of repetitive lines into structural templates
  4. Groups events by template — ranked by severity × frequency × trend
  5. Detects anomalies — EWMA rate baselines, CUSUM change-point detection, 3σ threshold
  6. Correlates signals — temporal proximity analysis linking related anomalies
  7. Outputs the result — colorized terminal display or structured JSON for AI agents

The proof

We pointed tailx at a production web stack’s logs — 47,000 lines across four services. Without any configuration, rules, or prior knowledge of the system, it identified that a database connection pool exhaustion was the root cause of 71% of all error volume, cascading through the API gateway → payment service → notification service.

Without tailx: manually reading logs, mentally correlating timestamps, recognizing patterns by eye. A 30-minute task for an experienced SRE.

With tailx: one command.

tailx --json -s -n app.log | tail -1

The numbers

MetricValue
Binary size (stripped)144 KB
Throughput69,000 events/sec
Memory (statistical engine)< 1 MiB
Startup time< 1ms
External dependencies0
Lines of Zig8,347
Tests219
Config files required0

Design principles

  • Zero config to start. Point it at a file. It works.
  • Local-first. No cloud. No telemetry. No network calls.
  • Statistical-first. No LLM in the hot path. Math is fast, deterministic, explainable.
  • Zero dependencies. Zig standard library only.
  • 144 KB binary. Fits in L2 cache. Starts in microseconds.

Installation

Requirements

  • Zig 0.14.0 (no other dependencies)
  • Any POSIX system (Linux, macOS)
  • No libc required. No runtime. No garbage collector.

Build from source

git clone https://github.com/your-org/tailx.git
cd tailx
zig build -Doptimize=ReleaseSafe

The binary lands in zig-out/bin/tailx. Copy it wherever you like:

cp zig-out/bin/tailx ~/.local/bin/

Build variants

ModeCommandBinary sizeNotes
Debugzig build~3 MBSafety checks, slow
ReleaseSafezig build -Doptimize=ReleaseSafe3.1 MBSafety checks, fast
ReleaseSmallzig build -Doptimize=ReleaseSmall144 KBStripped, production
ReleaseFastzig build -Doptimize=ReleaseFast~2.8 MBMax speed, no safety

For production use, ReleaseSafe is recommended. For resource-constrained environments (containers, embedded), ReleaseSmall produces a 144 KB binary that fits in L2 cache.

Run tests

zig build test

This runs all 219 tests across every module: core types, parsers, statistical structures, anomaly detectors, correlation engine, filters, and renderers. All tests pass in under 2 seconds.

Verify installation

tailx --version
# tailx v1.0

tailx --help
# Shows usage, modes, filters, options

No dependencies

tailx uses the Zig standard library exclusively. There are zero external dependencies – no PCRE, no libc (where avoidable), no vendored C code. The entire binary is self-contained.

Quick Start

Basic usage

Tail a file with automatic pattern grouping:

tailx app.log

This follows the file (like tail -f), auto-detects the log format, parses every line, groups events by structural template, and prints a ranked pattern summary when done or periodically during follow mode.

Pipe from stdin

cat app.log | tailx

Any command that produces log lines works:

journalctl -u myservice | tailx
docker logs myapp | tailx
kubectl logs pod/api-7f8b9 | tailx

Multiple files with globs

tailx /var/log/*.log

tailx expands globs, opens all matching files, and merges events across sources. When multiple files are open, each event line is prefixed with the source file path.

Read a full file (no follow)

tailx --from-start --no-follow file.log
# Short form:
tailx -s -n file.log

Reads the entire file from the beginning, processes every line through the full pipeline, prints the events and pattern summary, then exits.

Filter by severity

dmesg | tailx --severity warn
# Short form:
dmesg | tailx -l warn

Only displays events at warn level or above (warn, error, fatal). Events below the threshold are still processed internally – they feed the pattern groups and anomaly detectors. Filtering is display-only.

What the output looks like

In default pattern mode, tailx prints events line-by-line as they arrive, then a pattern summary:

INF [nginx] GET /api/health 200 0.003s
INF [nginx] GET /api/users 200 0.045s
WRN [payments] Connection pool exhausted, waiting
ERR [payments] Connection refused to db-primary:5432
ERR [payments] Transaction failed: connection timeout
INF [nginx] GET /api/health 200 0.002s

──────────────────────────────────────────────────────────────
 Pattern Summary  847 events  12 groups  8 templates  4231 ev/s  0.2s
──────────────────────────────────────────────────────────────
  ✗ [payments] Connection refused to <*>  (x34) ↑ rising
  ⚠ [payments] Connection pool exhausted, waiting  (x28) ↑ rising
  ● [nginx] GET <*> <*> <*>  (x612) → stable
  ● [auth] Token refreshed for user <*>  (x89) → stable
  ● [nginx] GET /api/health <*> <*>  (x84) ↓ falling
──────────────────────────────────────────────────────────────

tailx: 847 events, 12 groups, 8 templates, 0 drops

Each group line shows:

  • Severity icon: info, warn, error, a fire icon for fatal
  • Service name in brackets (if detected)
  • Template with <*> wildcards replacing variable parts
  • Count in parentheses
  • Trend: ↑ rising, → stable, ↓ falling, or ✨ new

Your First Triage

This walkthrough demonstrates tailx against a typical production web stack — mixed JSON and syslog logs from an API gateway, payment service, database, and background worker.

The command

tailx -s -n app.log api.log db.log worker.log
  • -s (--from-start): read from the beginning of each file
  • -n (--no-follow): read to EOF and stop (don’t tail)

What happened

In 3.1 seconds, tailx processed 47,000 events across four files:

tailx: 47283 events, 92 groups, 38 templates, 0 drops

That is over 15,000 events per second on a single core, with full parsing, template extraction, grouping, anomaly detection, and correlation.

The pattern summary

The pattern summary ranked 92 groups by severity, frequency, and trend. The top groups told the story immediately:

──────────────────────────────────────────────────────────────
 Pattern Summary  47283 events  92 groups  38 templates  15252 ev/s  3.1s
──────────────────────────────────────────────────────────────
  ✗ [db] connection pool exhausted, <*> connections available  (x8241) ↑ rising
  ✗ [payments] connection timeout to <*>  (x6102) ↑ rising
  ⚠ [worker] retry queue depth exceeding threshold  (x2847) ↑ rising
  🔥 [payments] circuit breaker opened for <*>  (x312) ✨ new
  ● [api] GET <*> <*>  (x18420) → stable
  ● [auth] token validated for user <*>  (x9102) → stable
  ...
──────────────────────────────────────────────────────────────

The root cause

Look at the top groups. They form a cascade:

  1. Database pool exhaustion — the database connection pool hit zero available connections. This is the highest-severity rising group: 8,241 events.

  2. Payment service timeouts — with no database connections available, the payment service can’t complete transactions. Downstream calls to Stripe start timing out. 6,102 events.

  3. Worker retry storm — failed payments get queued for retry. The retry queue grows past threshold. 2,847 events.

  4. Circuit breaker trips — after sustained timeouts, the circuit breaker opens, cutting off all payment processing. 312 events — low count but FATAL severity.

Meanwhile, the healthy traffic continues: API requests (18,420) and auth token validations (9,102) are stable. The problem is isolated to the database → payment → worker path.

One connection pool exhaustion caused 71% of all error volume, cascading through three services.

The “aha” moment

Without tailx, you would read 47,000 lines across four files. Manually. You would notice the timeout messages are frequent. You might eventually connect them to the database errors. After 30 minutes, you might piece together the cascade.

With tailx: one command, 3 seconds, and the ranked pattern summary shows you the cascade directly. The highest-count error groups are all related. The database pool is the root cause. The fix is either increasing pool size, fixing the connection leak, or adding connection timeout limits.

Getting the JSON triage

For programmatic access to the same analysis:

tailx --json -s -n app.log db.log | tail -1

The last line of JSON output is always the triage_summary object — the full structured analysis including stats, top groups, anomalies, hypotheses, and traces. See JSON Output for the full schema.

Modes

tailx has five display modes. The default is pattern mode.

Pattern mode (default)

tailx app.log

Events are printed line-by-line as they arrive. At the end (or periodically every 500 events in follow mode), a ranked pattern summary is displayed showing the top groups by severity, frequency, and trend.

This is the mode you want for most triage work. It answers: “what patterns exist in these logs and which ones matter?”

ERR [payments] Connection refused to db-primary:5432
ERR [payments] Connection refused to db-primary:5432
INF [nginx] GET /api/health 200 0.002s

──────────────────────────────────────────────────────────────
 Pattern Summary  847 events  12 groups  8 templates  4231 ev/s  0.2s
──────────────────────────────────────────────────────────────
  ✗ [payments] Connection refused to <*>  (x34) ↑ rising
  ● [nginx] GET <*> <*> <*>  (x612) → stable
──────────────────────────────────────────────────────────────

Raw mode

tailx --raw app.log

Classic tail behavior. Events are printed line-by-line with severity badges and service names, but no pattern summary, no anomaly alerts, no group rankings. The full pipeline still runs internally (parsing, grouping, anomaly detection), but nothing beyond the event lines is displayed.

Use this when you just want to watch logs scroll by with basic formatting.

Trace mode

tailx --trace app.log

Groups events by trace_id and displays them as request flow trees. Each trace shows its events connected with tree connectors, the total duration, and the outcome (success, failure, timeout, or unknown).

TRACE req-abc-123  245ms  FAILURE
 ├─ INF [gateway] Received POST /api/checkout
 ├─ INF [auth] Token validated for user-42
 ├─ INF [payments] Processing payment $49.99
 ├─ ERR [payments] Connection refused to db-primary:5432
 └─ ERR [gateway] 500 Internal Server Error

TRACE req-def-456  12ms  success
 ├─ INF [gateway] Received GET /api/health
 └─ INF [gateway] 200 OK
(2 traces)

Events without a trace_id are not shown in trace mode. The pattern summary is still displayed at the end.

Incident mode

tailx --incident app.log

Suppresses all normal event output. Only displays:

  • Active anomaly alerts (rate spikes, rate drops, change points)
  • The pattern summary with top groups

This is the “pager duty” mode. No noise, just the signals that something changed.

 !! ANOMALY: rate spike — observed 450.0 vs expected 120.3 (deviation: 4.2)

──────────────────────────────────────────────────────────────
 Pattern Summary  47283 events  92 groups  38 templates  15252 ev/s  3.1s
──────────────────────────────────────────────────────────────
  ✗ [payments] Connection refused to <*>  (x1204) ↑ rising
  ⚠ [payments] Connection pool exhausted  (x891) ↑ rising
──────────────────────────────────────────────────────────────

JSON mode

tailx --json app.log

Outputs JSONL (one JSON object per line). Two types of objects:

  1. Event objects – one per processed event
  2. Triage summary – always the last line, contains the full analysis
{"type":"event","severity":"ERROR","message":"Connection refused","service":"payments","template_hash":8234567891234}
{"type":"event","severity":"INFO","message":"GET /api/health 200","service":"nginx","template_hash":1234567890123}
{"type":"triage_summary","stats":{...},"top_groups":[...],"anomalies":[...],"hypotheses":[...],"traces":[...]}

JSON mode is designed for machine consumption – pipe it to jq, feed it to an AI agent, or integrate it as an MCP tool. See JSON Output for the full schema.

Filters & Queries

All filters are display-only. Filtered events still feed the pattern groups, anomaly detectors, and correlation engine. This is a deliberate design decision: you always get the full statistical picture, even when displaying a subset.

Filters combine with AND by default. Every clause must match for an event to be displayed.

Severity filter

tailx --severity warn app.log
tailx -l error app.log

Sets a minimum severity threshold. Only events at or above the given level are displayed. Severity levels in order:

LevelNumericTypical meaning
trace0Detailed debug tracing
debug1Debug information
info2Normal operations
warn3Potential issues
error4Failures
fatal5Unrecoverable errors

Example: --severity warn shows warn, error, and fatal events. Debug and info events are hidden but still processed.

Message substring filter

tailx --grep timeout app.log
tailx -g "connection refused" app.log

Filters events whose message contains the given substring. Uses Boyer-Moore-Horspool for fast matching. Case-sensitive.

# Only events mentioning "OOM"
tailx -g OOM app.log

# Combine with severity
tailx -l error -g timeout app.log

Service filter

tailx --service payments app.log

Exact match on the service name. The service name is extracted automatically by the parser:

  • JSON: from service, service_name, app, application, or component fields
  • Syslog: from the app name before the PID (nginx[1234] -> nginx)
  • Unstructured: from bracketed text ([PaymentService] -> PaymentService)

Trace ID filter

tailx --trace-id req-abc-123 app.log

Exact match on the trace ID field. Combined with --trace mode, this lets you inspect a single request flow:

tailx --trace --trace-id req-abc-123 app.log

Field equality filter

tailx --field status=500 app.log
tailx --field user_id=42 app.log

Matches events with a specific field value. Supports both string and integer comparison – if the field contains an integer and the filter value parses as an integer, numeric comparison is used.

# Filter by HTTP status code
tailx --field status=500 access.log

# Filter by host
tailx --field hostname=web01 app.log

Time window filter

tailx --last 5m app.log
tailx --last 1h app.log
tailx --last 30s app.log
tailx --last 2d app.log

Only displays events from within the given time window relative to now. Supported units:

SuffixUnit
sseconds
mminutes
hhours
ddays

Combining filters

All filters are ANDed together. An event must pass every filter to be displayed:

# Errors from payments service in the last hour
tailx -l error --service payments --last 1h app.log

# Timeout errors from any service
tailx -l error -g timeout app.log

# Specific field value with severity threshold
tailx -l warn --field region=us-east-1 app.log

Important: filtering does not affect counting

This bears repeating: filtered events are still fully processed. They feed template extraction, pattern grouping, anomaly detection, and correlation. The pattern summary reflects all events, not just displayed ones.

This means you can filter the display to errors while still getting accurate group counts and anomaly detection based on the full event stream.

Intent Queries

Intent queries let you describe what you are looking for in natural language, as a positional argument. If the argument is not an existing file path, tailx treats it as an intent query and translates it into filter predicates.

How it works

tailx "errors related to payments" app.log

tailx tokenizes the query, strips filler words, maps keywords to filters, and applies basic stemming.

The above becomes: severity >= error AND message contains “payment” (stemmed from “payments”).

Examples

Severity keywords

tailx "errors related to payments" app.log
# → severity >= error, message contains "payment"

tailx "warnings from nginx" app.log
# → severity >= warn, service = "nginx"

tailx "5xx from nginx" app.log
# → severity >= error, service = "nginx"

tailx "4xx errors" app.log
# → severity >= warn (4xx maps to warn)

The following words are recognized as severity keywords: error/errors (maps to error), warning/warnings (maps to warn), fatal/critical (maps to fatal), 5xx (maps to error), 4xx (maps to warn).

Service targeting with “from”

tailx "5xx from nginx" app.log
# → severity >= error, service = "nginx"

tailx "errors from payments" app.log
# → severity >= error, service = "payments"

The word from followed by a non-filler word creates a service filter.

You can also use the service: prefix:

tailx "timeouts service:payments" app.log
# → message contains "timeout", service = "payments"

Implicit error detection

tailx "why are payments failing" app.log

Even without explicit severity keywords, certain words imply errors: fail, crash, down, broken, bug. When detected, tailx automatically adds a severity >= error filter.

The above becomes: severity >= error AND message contains “payment” AND message contains “failing”.

tailx "timeout" app.log
# → message contains "timeout"

tailx "connection refused" app.log
# → message contains "connection" AND message contains "refused"

Any word that is not a filler word, severity keyword, or service pattern becomes a message substring filter.

Filler words

The following words are stripped from queries before processing:

the, a, an, is, are, was, were, in, on, at, to, for, of, with, and, or, but, not, related, about, why, what, how, when, where, show, me, find, get, all, any, some, that, this, those, requests, logs, events, messages

This means "show me all timeout errors" reduces to: severity >= error, message contains “timeout”.

Basic stemming

Trailing s is removed from keywords longer than 3 characters. This handles simple plurals:

  • payments -> payment
  • errors -> recognized as severity keyword (not stemmed as a message filter)
  • timeouts -> timeout

File vs. query detection

tailx checks whether a positional argument is an existing file path. If the file exists, it is opened as a log source. If the file does not exist, it is treated as an intent query.

tailx app.log                    # file exists → open as source
tailx "timeout errors" app.log   # "timeout errors" doesn't exist → intent query
tailx timeout app.log            # "timeout" doesn't exist → intent query

Trace Reconstruction

tailx reconstructs request flows by grouping events that share a trace_id. In --trace mode, these are displayed as tree views showing the full lifecycle of each request.

How traces work

When an event has a trace_id field (extracted from JSON, logfmt, or any supported format), tailx assigns it to a trace in the TraceStore. All events with the same trace_id are grouped into a single Trace object.

Trace IDs are detected from these known field names:

  • trace_id
  • traceId
  • trace
  • x-trace-id
  • request_id

Viewing traces

tailx --trace app.log

Each trace is displayed as a tree with connectors showing the event sequence:

TRACE req-abc-123  245ms  FAILURE
 ├─ INF [gateway] Received POST /api/checkout
 ├─ INF [auth] Token validated for user-42
 ├─ INF [inventory] Reserved 3 items
 ├─ INF [payments] Processing payment $49.99
 ├─ ERR [payments] Connection refused to db-primary:5432
 └─ ERR [gateway] 500 Internal Server Error

TRACE req-def-456  12ms  success
 ├─ INF [gateway] Received GET /api/health
 └─ INF [gateway] 200 OK

TRACE req-ghi-789  31002ms  TIMEOUT
 ├─ INF [gateway] Received POST /api/export
 ├─ INF [export] Starting bulk export job
 └─ WRN [export] Job still running after 30s
(3 traces)

Trace properties

Each trace tracks:

  • trace_id: the explicit ID from the log events
  • event_count: number of events in the trace (up to 64 per trace)
  • duration: time from the first event to the last event (in milliseconds)
  • outcome: determined automatically from the events

Outcome detection

Trace outcomes are determined by the severity of events within the trace:

OutcomeConditionDisplay
successNo error or fatal events, trace finalizedsuccess (green)
failureAny event with severity >= errorFAILURE (red, bold)
timeoutTrace expired without completingTIMEOUT (yellow, bold)
unknownTrace still active, no errors yetunknown (dim)

Outcome escalation is one-way: once a trace sees an error/fatal event, its outcome is permanently set to failure.

Trace lifecycle

  1. Created when the first event with a given trace_id is processed
  2. Active while events continue arriving for that trace_id
  3. Finalized after 30 seconds of inactivity (no new events with that trace_id)

Finalized traces are moved from the active store (256 slots) to a finalized ring buffer (512 slots). Both active and finalized traces are displayed in --trace mode.

Filtering traces

View a single trace by ID:

tailx --trace --trace-id req-abc-123 app.log

Combine with other filters:

# Only failed traces from payments service
tailx --trace --service payments -l error app.log

Traces in JSON mode

In --json mode, traces appear in the triage_summary object’s traces array. Each trace includes its ID, event count, duration, outcome, and the full list of events with their severity, message, and service. See Triage Summary Schema for details.

JSON Output

The --json flag switches tailx to JSONL output mode. Every line is a valid JSON object. This is the primary integration point for AI agents, scripts, and tooling.

Two object types

1. Event objects

One per processed event, emitted as events arrive:

{
  "type": "event",
  "severity": "ERROR",
  "message": "Connection refused to db-primary:5432",
  "service": "payments",
  "trace_id": "req-abc-123",
  "template_hash": 8234567891234,
  "fields": {
    "latency_ms": 240,
    "hostname": "web01",
    "pid": 1234
  }
}

Fields present in an event object:

FieldTypeAlways presentDescription
typestringyesAlways "event"
severitystringyesTRACE, DEBUG, INFO, WARN, ERROR, FATAL, or UNKNOWN
messagestringyesThe log message (parsed or raw)
servicestringnoService name, if detected
trace_idstringnoTrace ID, if detected
template_hashintegernoDrain template hash (0 is omitted)
fieldsobjectnoExtracted structured fields (omitted if empty)

Field values in the fields object can be strings, integers, floats, booleans, or null.

2. Triage summary

Always the last line of output. Contains the full analysis:

{
  "type": "triage_summary",
  "stats": {
    "events": 47283,
    "groups": 92,
    "templates": 38,
    "drops": 0,
    "events_per_sec": 15252.0,
    "elapsed_ms": 3100
  },
  "top_groups": [...],
  "anomalies": [...],
  "hypotheses": [...],
  "traces": [...]
}

The triage summary is the “money shot” for AI integration. It contains everything the engine computed, structured for machine reasoning. See Triage Summary Schema for the full schema.

Usage patterns

Read full file to JSON

tailx --json -s -n app.log
  • --json: JSONL output
  • -s (--from-start): start at beginning of file
  • -n (--no-follow): read to EOF and stop

Get just the triage summary

tailx --json -s -n app.log | tail -1

The last line is always the triage_summary. Use tail -1 to extract it.

Filter events in JSON mode

tailx --json -l error --service payments -s -n app.log

Filters work the same in JSON mode. Only matching events are emitted as event objects, but the triage summary still reflects the full pipeline (all events, not just filtered ones).

Stream processing with jq

# Extract all error messages
tailx --json -s -n app.log | jq -r 'select(.type=="event" and .severity=="ERROR") | .message'

# Get top group exemplars from the triage summary
tailx --json -s -n app.log | tail -1 | jq '.top_groups[].exemplar'

# Count events per service
tailx --json -s -n app.log | jq -r 'select(.type=="event") | .service // "unknown"' | sort | uniq -c | sort -rn

Real triage summary example

From the production log test (47,283 events):

{
  "type": "triage_summary",
  "stats": {
    "events": 47283,
    "groups": 92,
    "templates": 38,
    "drops": 0,
    "events_per_sec": 15252.0,
    "elapsed_ms": 3100
  },
  "top_groups": [
    {
      "exemplar": "Connection pool exhausted, waiting for available connection",
      "count": 5765,
      "severity": "WARN",
      "trend": "rising",
      "service": "db"
    },
    {
      "exemplar": "<*> carrier <*> ...",
      "count": 4812,
      "severity": "WARN",
      "trend": "rising",
      "service": "NetworkManager"
    }
  ],
  "anomalies": [],
  "hypotheses": [],
  "traces": []
}

Every event goes through the full pipeline

Whether you filter by severity, service, or grep – every event is always:

  1. Parsed (format detection, field extraction)
  2. Template-fingerprinted (Drain algorithm)
  3. Grouped (pattern table)
  4. Assigned to traces (if trace_id present)
  5. Fed to anomaly detectors
  6. Fed to the correlation engine

Filters only control what gets emitted as event objects. The triage summary always reflects the complete picture.

Triage Summary Schema

The triage_summary is always the last line of --json output. It contains everything tailx computed about the log stream, structured for machine consumption.

Top-level structure

{
  "type": "triage_summary",
  "stats": { ... },
  "top_groups": [ ... ],
  "anomalies": [ ... ],
  "hypotheses": [ ... ],
  "traces": [ ... ]
}

stats object

Processing statistics for the entire run.

{
  "events": 47283,
  "groups": 92,
  "templates": 38,
  "drops": 0,
  "events_per_sec": 15252.0,
  "elapsed_ms": 3100
}
FieldTypeDescription
eventsintegerTotal events processed
groupsintegerActive pattern groups
templatesintegerDrain template clusters
dropsintegerEvents dropped (arena OOM)
events_per_secfloatProcessing throughput
elapsed_msintegerWall-clock processing time

top_groups[] array

Up to 20 pattern groups, ranked by score (severity x frequency x trend). Each group represents a cluster of structurally similar log messages.

{
  "exemplar": "Connection refused to <*>",
  "count": 34,
  "severity": "ERROR",
  "trend": "rising",
  "service": "payments",
  "source_count": 3
}
FieldTypeAlways presentDescription
exemplarstringyesRepresentative message for this group
countintegeryesTotal event count in this group
severitystringyesHighest severity seen in the group
trendstringyesrising, stable, falling, new, or gone
servicestringnoService name, if all events share one
source_countintegernoNumber of distinct sources (omitted if 1)

Trend values

TrendMeaning
risingRate is increasing compared to previous window
stableRate is approximately constant
fallingRate is decreasing
newGroup appeared in the current window
goneNo events in the current window (previously active)

anomalies[] array

Active anomaly alerts from the rate detector and CUSUM detector.

{
  "kind": "rate_spike",
  "score": 0.823,
  "observed": 450.0,
  "expected": 120.3,
  "deviation": 4.2,
  "fire_count": 3
}
FieldTypeDescription
kindstringAnomaly type (see table below)
scorefloatSeverity score, 0.0 to 1.0
observedfloatThe actual measured value
expectedfloatThe baseline expected value
deviationfloatZ-score or normalized deviation
fire_countintegerNumber of times this alert has fired

Anomaly kinds

KindSourceDescription
rate_spikeRateDetectorEvent rate significantly above baseline
rate_dropRateDetectorEvent rate significantly below baseline
change_point_upCusumDetectorSustained upward shift in event rate
change_point_downCusumDetectorSustained downward shift in event rate
latency_spike(reserved)Latency above baseline
distribution_shift(reserved)Statistical distribution change
cardinality_spike(reserved)Sudden increase in unique values
new_pattern_burst(reserved)Burst of previously unseen templates

hypotheses[] array

Causal hypotheses from the correlation engine. Each hypothesis explains an anomaly by linking it to temporally proximate signals.

{
  "causes": [
    {
      "label": "DB latency spike",
      "strength": 0.742,
      "lag_ms": 5000
    },
    {
      "label": "deploy detected",
      "strength": 0.381,
      "lag_ms": 15000
    }
  ],
  "confidence": 0.742
}
FieldTypeDescription
causes[]arrayCandidate causes, ordered by strength
causes[].labelstringDescription of the candidate cause
causes[].strengthfloatCause strength, 0.0 to 1.0 (closer in time + higher magnitude = stronger)
causes[].lag_msintegerTime between cause and effect in milliseconds
confidencefloatOverall hypothesis confidence (max cause strength)

traces[] array

Reconstructed request flows from explicit trace_id matching.

{
  "trace_id": "req-abc-123",
  "event_count": 5,
  "duration_ms": 245,
  "outcome": "failure",
  "events": [
    {
      "severity": "INFO",
      "message": "Received POST /api/checkout",
      "service": "gateway"
    },
    {
      "severity": "ERROR",
      "message": "Connection refused to db-primary:5432",
      "service": "payments"
    }
  ]
}
FieldTypeDescription
trace_idstringThe trace identifier
event_countintegerNumber of events in this trace
duration_msintegerTime from first to last event
outcomestringsuccess, failure, timeout, or unknown
events[]arrayEvents in the trace, in order
events[].severitystringEvent severity level
events[].messagestringEvent message
events[].servicestringService name (if present)

MCP & Agent Integration

tailx is designed to be a tool for AI agents. The --json output provides structured triage data that an LLM can reason over directly, without parsing raw log text.

The key insight: the AI does not parse logs. tailx parses logs. The AI reasons over structured triage output.

Subprocess integration

The simplest integration is calling tailx as a subprocess and reading the last line of output.

Python example

import subprocess
import json

result = subprocess.run(
    ["tailx", "--json", "-s", "-n", "--last", "5m", "app.log"],
    capture_output=True,
    text=True
)

# The last line is always the triage_summary
lines = result.stdout.strip().split("\n")
triage = json.loads(lines[-1])

print(f"Events: {triage['stats']['events']}")
print(f"Groups: {triage['stats']['groups']}")
print(f"Top issue: {triage['top_groups'][0]['exemplar']}")

Shell example

# Get triage summary as JSON
TRIAGE=$(tailx --json -s -n --last 5m app.log | tail -1)

# Extract top group with jq
echo "$TRIAGE" | jq -r '.top_groups[0].exemplar'

MCP tool definition

tailx can be exposed as an MCP (Model Context Protocol) tool. Here is a tool definition:

{
  "name": "tailx_triage",
  "description": "Analyze log files for patterns, anomalies, and root causes. Returns structured triage with event groups ranked by severity/frequency, anomaly alerts, causal hypotheses, and request traces. Use this when investigating system issues, outages, or performance problems.",
  "input_schema": {
    "type": "object",
    "properties": {
      "files": {
        "type": "array",
        "items": { "type": "string" },
        "description": "Log file paths to analyze (e.g., [\"app.log\", \"db.log\"])"
      },
      "time_window": {
        "type": "string",
        "description": "How far back to look (e.g., \"5m\", \"1h\", \"30s\")"
      },
      "severity": {
        "type": "string",
        "enum": ["trace", "debug", "info", "warn", "error", "fatal"],
        "description": "Minimum severity to include in event output"
      },
      "grep": {
        "type": "string",
        "description": "Filter events by message substring"
      },
      "service": {
        "type": "string",
        "description": "Filter events by service name"
      }
    },
    "required": ["files"]
  }
}

MCP tool implementation

def tailx_triage(files, time_window=None, severity=None, grep=None, service=None):
    cmd = ["tailx", "--json", "-s", "-n"]

    if time_window:
        cmd.extend(["--last", time_window])
    if severity:
        cmd.extend(["--severity", severity])
    if grep:
        cmd.extend(["--grep", grep])
    if service:
        cmd.extend(["--service", service])

    cmd.extend(files)

    result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    lines = result.stdout.strip().split("\n")

    # Return just the triage summary for the AI to reason over
    return json.loads(lines[-1])

What the AI receives

When an agent calls tailx_triage(files=["app.log"], time_window="5m"), it receives a structured object like:

{
  "type": "triage_summary",
  "stats": {
    "events": 847,
    "groups": 12,
    "templates": 8,
    "drops": 0,
    "events_per_sec": 4231.0,
    "elapsed_ms": 200
  },
  "top_groups": [
    {
      "exemplar": "Connection refused to <*>",
      "count": 34,
      "severity": "ERROR",
      "trend": "rising",
      "service": "payments"
    }
  ],
  "anomalies": [
    {
      "kind": "rate_spike",
      "score": 0.823,
      "observed": 450.0,
      "expected": 120.3,
      "deviation": 4.2,
      "fire_count": 3
    }
  ],
  "hypotheses": [
    {
      "causes": [
        {"label": "DB latency spike", "strength": 0.742, "lag_ms": 5000}
      ],
      "confidence": 0.742
    }
  ],
  "traces": []
}

The AI can now reason: “The top pattern group is rising connection refused errors from the payments service (34 occurrences). There’s a rate spike anomaly. The correlation engine suggests a DB latency spike 5 seconds earlier as a likely cause.”

Design rationale

Why not have the AI read raw logs?

  1. Volume: 47,000 lines of logs would consume an entire context window. The triage summary is a few hundred tokens.
  2. Signal-to-noise: most production logs are repetitive noise. The AI would waste tokens on irrelevant repetition. tailx collapses 47,000 lines into 38 templates.
  3. Speed: tailx processes 69,000 events/sec. The pipeline runs in seconds, not minutes.
  4. Determinism: statistical analysis (z-scores, CUSUM, EWMA) is reproducible. LLM pattern matching is not.
  5. Cost: one subprocess call is effectively free. Feeding 47,000 lines to an LLM costs tokens and time.

The AI’s job is to interpret the structured triage, suggest fixes, and communicate findings to humans – not to count log lines.

Processing Pipeline

Every log line that enters tailx passes through a 12-stage pipeline. The pipeline is synchronous and single-threaded – no locks, no channels, no thread pools.

Pipeline stages

raw bytes
  │
  ├─ 1. ReadBuffer        64 KiB per-source, in-place line splitting
  ├─ 2. QuickTimestamp     Fast timestamp extraction
  ├─ 3. MultiLineDetector  Continuation line detection
  ├─ 4. Merger             Arena-dupe + push to EventRing
  ├─ 5. FormatDetector     Vote on format, lock after 8 samples
  ├─ 6. Parser dispatch    JSON / KV / Syslog / Fallback
  ├─ 7. SchemaInferer      Track field types/frequencies (first 64 events)
  ├─ 8. DrainTree          Template fingerprinting → template_hash
  ├─ 9. GroupTable          Classify into groups, update counts/trend
  ├─ 10. TraceStore        Assign to trace via trace_id
  ├─ 11. Anomaly tick      RateDetector + CusumDetector (every 1s)
  ├─ 12. Correlation       Feed signals, build hypotheses
  │
  └─ Event (in ring buffer, ready for rendering)

Stage details

1. ReadBuffer

Each file source gets a 64 KiB ReadBuffer. Raw bytes from read() are appended to the buffer. The buffer yields complete lines (terminated by \n), handling \r\n line endings and partial lines across reads. If the buffer fills without a newline, the entire buffer is yielded as a single long line.

2. QuickTimestamp

Before any parsing, QuickTimestamp.extract() does a fast scan for timestamps at the beginning of the line. Supports:

  • ISO 8601: 2024-03-15T14:23:01.123Z
  • Epoch milliseconds: 1710510181123
  • Epoch seconds: 1710510181

If no timestamp is found, the current wall clock time is used.

3. MultiLineDetector

Checks if a line is a continuation of the previous message (stack traces, indented text). Continuation lines are skipped – they do not become new events. This prevents stack trace frames from inflating event counts.

4. Merger (Ingest)

The raw line is copied into the current arena (EventArena) and an Event struct is pushed onto the EventRing. The event starts with the raw line as its message, the extracted timestamp, and the source ID.

5. FormatDetector

Per-source format detection. Each source has its own FormatDetector that votes on the format based on simple heuristics. After 8 samples, the format locks and all future lines from that source use the same parser.

Detection rules:

  • JSON: starts with {, ends with }
  • Syslog BSD: starts with <digits>
  • CLF: IP followed by - and [date] and "
  • Logfmt: 3+ key=value pairs AND has level= AND msg=/message=
  • KV pairs: 3+ key=value pairs
  • Unstructured: everything else

On tie, the more structured format wins.

6. Parser dispatch

Based on the detected format, one of four parsers extracts structured fields from the raw line:

  • JsonParser – hand-written JSON scanner with known field mapping
  • KvParser – key=value pair extraction with quoting support
  • SyslogBsdParser – PRI, BSD timestamp, hostname, app[pid], message
  • FallbackParser – timestamp prefix skip, severity extraction, bracketed service

Each parser populates the event’s severity, message, service, trace_id, and fields.

7. SchemaInferer

Per-source schema inference from the first 64 events. Tracks field names, types, and frequencies. This information is available for downstream consumers (e.g., adaptive parsing).

8. DrainTree

The Drain algorithm extracts a structural template from the event’s message. Variable parts (tokens containing digits, quoted strings) become <*> wildcards. The template is hashed with FNV-1a to produce a template_hash. Events with the same template hash are structurally identical despite different parameters.

9. GroupTable

The event is classified into a pattern group based on its template_hash. The group’s count, severity, trend, and score are updated. Groups are ranked by a composite score of severity, frequency, and trend direction.

10. TraceStore

If the event has a trace_id, it is assigned to an active trace in the TraceStore. The trace tracks event references (ring buffer indices), duration, and outcome. Active traces expire after 30 seconds of inactivity and are moved to the finalized store.

11. Anomaly tick (periodic)

Every 1 second (by wall clock), the pipeline ticks the anomaly detectors:

  • RateDetector: feeds the current event rate to a dual EWMA (10s fast, 5min slow) and computes a z-score against historical statistics. Fires if z-score >= 3.0 and absolute delta exceeds threshold.
  • CusumDetector: accumulates normalized deviations. Fires on sustained shifts that z-scores miss. 30-tick cooldown after firing.

Detector results are processed by the SignalAggregator (deduplication, resolution, eviction) and fed to the correlation engine.

12. Correlation

Rising groups and anomaly alerts are recorded as CorrelationSignal objects. The TemporalProximity analyzer finds signals that co-occur within a 5-minute window and ranks them by proximity and magnitude to build Hypothesis objects.

Periodic maintenance

Every 60 seconds, the pipeline runs a window rotation:

  • GroupTable.windowRotate() – updates trend calculations
  • TraceStore.expireSweep() – finalizes inactive traces
  • ArenaPool.maybeRotate() – rotates arena generations for bulk memory freeing

Pipeline state

The Pipeline struct owns all mutable state:

  • EventRing (ring buffer of events)
  • ArenaPool (generation-tagged arena allocators)
  • FormatDetector[64] (one per source)
  • SchemaInferer[64] (one per source)
  • DrainTree (template extraction)
  • GroupTable (pattern grouping)
  • RateDetector + CusumDetector (anomaly detection)
  • SignalAggregator (alert management)
  • TraceStore (trace reconstruction)
  • TemporalProximity (correlation engine)

All state is allocated once at startup. No allocations occur in the per-event hot path after initialization.

Parsing & Format Detection

tailx auto-detects the log format for each source independently and dispatches to the appropriate parser. No configuration required.

Format detection

The FormatDetector examines lines using simple heuristics. Each source gets its own detector. After 8 samples, the format locks – all subsequent lines from that source use the same parser without re-detection.

Detection rules

FormatHeuristic
JSONLine starts with { and ends with } (after trimming whitespace)
Syslog BSDLine starts with < followed by digits and >
Syslog IETFSyslog prefix + version digit after >
CLFIP/hostname, then -, then [, then " within first 80 bytes
Logfmt3+ key=value pairs AND contains level=/lvl= AND msg=/message=
KV pairs3+ key=value pairs (without logfmt-specific keys)
UnstructuredEverything else

On ties (equal vote counts), the more structured format wins. Structuredness ranking: JSON (6) > logfmt (5) > KV (4) > syslog/CLF (3) > unstructured (0).

JSON parser

Hand-written scanner (no std.json dependency). Parses objects one key-value pair at a time, mapping known keys to Event fields and collecting the rest into the FieldMap.

Known field mapping

JSON keyMaps to
timestamp, ts, time, @timestamp, datetime, tevent.timestamp
level, severity, lvl, loglevel, log_levelevent.severity
message, msg, log, text, bodyevent.message
trace_id, traceId, trace, x-trace-id, request_idevent.trace_id
service, service_name, app, application, componentevent.service

All other keys become entries in the event’s FieldMap with their parsed values.

Value types

The JSON parser handles all JSON value types:

  • Strings: extracted with escape sequence handling (\", \\, \n, \r, \t, \uXXXX)
  • Integers: parsed as i64
  • Floats: parsed as f64
  • Booleans: true / false
  • Null: null

Timestamp handling

Timestamp values can be:

  • String: parsed as ISO 8601 (2024-03-15T14:23:01.123Z)
  • Integer > 946684800000: interpreted as epoch milliseconds
  • Integer > 946684800: interpreted as epoch seconds
  • Float: interpreted as epoch seconds with fractional part

Example

Input:

{"level":"error","msg":"Connection refused","service":"payments","latency_ms":240,"trace_id":"req-001"}

Result:

  • event.severity = ERROR
  • event.message = “Connection refused”
  • event.service = “payments”
  • event.trace_id = “req-001”
  • event.fields = {"latency_ms": 240}

KV parser

Parses key=value pairs separated by whitespace. Values can be bare words or double-quoted strings.

Known field mapping

Same known keys as the JSON parser. The KV parser also applies:

  • Numeric inference: bare values that parse as integers become i64, as floats become f64
  • Quote stripping: msg="hello world" extracts hello world

Example

Input:

ts=2024-03-15T14:23:01Z level=error msg="Connection refused" service=payments latency_ms=240

Result:

  • event.timestamp = 2024-03-15T14:23:01Z
  • event.severity = ERROR
  • event.message = “Connection refused”
  • event.service = “payments”
  • event.fields = {"latency_ms": 240}

Syslog BSD parser

Parses RFC 3164 syslog format. Also handles journalctl output (which omits the PRI).

Format

<PRI>Mon DD HH:MM:SS hostname app[pid]: message

PRI to severity mapping

The PRI value encodes facility and severity per RFC 3164. The severity component (PRI mod 8) maps to:

PRI mod 8Syslog severitytailx severity
0Emergencyfatal
1Alertfatal
2Criticalfatal
3Errorerror
4Warningwarn
5Noticeinfo
6Informationalinfo
7Debugdebug

Fields extracted

  • severity: from PRI value, or inferred from message content
  • service: from the app name (e.g., nginx from nginx[1234])
  • hostname: stored as a field
  • pid: stored as a field (integer if parseable)
  • message: everything after app[pid]:

Severity inference

If no PRI is present (e.g., journalctl output), the parser infers severity from message content by looking for keywords like error, warn, info, debug, critical, and fatal – both bare and in brackets (e.g., [ERROR]).

Example

Input:

<134>Mar 15 14:23:01 web01 nginx[1234]: GET /api 200 0.012

Result:

  • event.severity = INFO (PRI 134 mod 8 = 6 = informational)
  • event.service = “nginx”
  • event.message = “GET /api 200 0.012”
  • event.fields = {"hostname": "web01", "pid": 1234}

Fallback parser

Handles unstructured text logs by extracting what it can.

Extraction order

  1. Timestamp prefix: skip ISO 8601 or similar date/time prefix
  2. Severity: look for bare keywords (ERROR, WARN, etc.) or bracketed ([ERROR], [WARN])
  3. Service: extract from brackets ([PaymentService] -> “PaymentService”)
  4. Message: everything remaining after extraction

Example

Input:

2024-03-15 14:23:01 ERROR [PaymentService] Connection refused to db:5432

Result:

  • event.severity = ERROR
  • event.service = “PaymentService”
  • event.message = “Connection refused to db:5432”

Multi-line detection

Before parsing, the MultiLineDetector checks if a line is a continuation of a previous message (e.g., stack trace frames, indented continuation lines). Continuation lines are skipped and do not create new events.

This prevents a 50-line Java stack trace from becoming 50 separate events – only the first line (the exception) becomes an event.

Drain Template Extraction

Drain is the algorithm that collapses thousands of repetitive log lines into a handful of structural templates. It is the foundation of pattern grouping – without it, every unique log message would be its own group.

The problem

These three log lines are structurally identical:

Connection to 10.0.0.1 timed out after 30s
Connection to 10.0.0.2 timed out after 45s
Connection to 10.0.0.3 timed out after 12s

They differ only in the IP address and timeout duration. A human sees “connection timeout” immediately. Drain teaches tailx to see the same thing.

How it works

1. Tokenize

Split the message by whitespace into tokens.

["Connection", "to", "10.0.0.1", "timed", "out", "after", "30s"]

2. Classify tokens

Each token is classified as either a literal or a wildcard (<*>):

  • Contains any digit -> wildcard. This catches IPs, ports, durations, counts, UUIDs, timestamps.
  • Quoted string (starts and ends with ") -> wildcard.
  • Everything else -> literal.
["Connection", "to", "<*>", "timed", "out", "after", "<*>"]

3. Match against existing clusters

Search existing clusters for one with:

  • The same token count
  • Similarity >= 0.5 (the sim_threshold)

Similarity is computed as the fraction of positions where both tokens match (both are wildcards, or both are the same literal):

similarity = matching_positions / total_positions

4. Merge or create

If a match is found: merge the new tokens into the existing cluster. Any position where the existing template has a literal but the new line has a different literal gets generalized to <*>.

If no match is found: create a new cluster with the classified tokens.

5. Hash the template

The final template tokens are hashed with FNV-1a to produce a u64 template_hash. All events that map to the same template get the same hash.

"Connection to <*> timed out after <*>"  →  hash: 0x3a7f...

Example walkthrough

Line 1: Connection to 10.0.0.1 timed out after 30s

Classified: ["Connection", "to", "<*>", "timed", "out", "after", "<*>"]

No existing clusters. Create cluster #0.

Line 2: Connection to 10.0.0.2 timed out after 45s

Classified: ["Connection", "to", "<*>", "timed", "out", "after", "<*>"]

Cluster #0 has 7 tokens, this has 7 tokens. Similarity = 7/7 = 1.0 >= 0.5. Match. All positions agree. Cluster #0 count becomes 2.

Line 3: User logged in from 10.0.0.1 at 14:00

Classified: ["User", "logged", "in", "from", "<*>", "at", "<*>"]

Cluster #0 has 7 tokens, this has 7 tokens. But similarity: position 0 “Connection” vs “User” = mismatch, position 1 “to” vs “logged” = mismatch… similarity < 0.5. No match. Create cluster #1.

Line 4: Error 500 on server web01

Classified: ["Error", "<*>", "on", "server", "<*>"]

Only 5 tokens. Cluster #0 has 7, cluster #1 has 7. Token count mismatch for both. Create cluster #2.

Line 5: Error 404 on server web02

Classified: ["Error", "<*>", "on", "server", "<*>"]

Cluster #2 has 5 tokens, this has 5 tokens. Similarity = 5/5 = 1.0. Match. Cluster #2 count becomes 2.

Configuration

The DrainTree is initialized with:

  • max_depth: 4 (controls the depth of the classification tree – in this implementation, used as a parameter but matching is linear across clusters)
  • sim_threshold: 0.5 (minimum similarity to match an existing cluster)
  • max_clusters: 4096 (hard limit on the number of distinct templates)

When the cluster limit is reached, new messages that don’t match an existing cluster are still hashed (from their classified tokens) but don’t create new clusters.

Why these rules work

The “contains any digit -> wildcard” rule is surprisingly effective because most variable parts in log messages contain digits:

  • IP addresses: 10.0.0.1
  • Ports: 5432
  • Durations: 30s, 250ms
  • Counts: 42 items
  • HTTP status codes: 200, 500
  • UUIDs: 550e8400-e29b-41d4-a716-446655440000
  • Timestamps: 14:23:01
  • PIDs: [1234]

The few variable tokens without digits (usernames, hostnames) may not get wildcarded, but they will either match literally (same user) or cause a new cluster (different user). Over time, if both forms appear, the merge step generalizes the position to <*>.

Template hash

The hash function is FNV-1a over the concatenated template tokens (with space separators). This is a fast, well-distributed hash that produces a u64 – the template_hash stored on every event.

Events with the same template_hash are grouped together in the GroupTable. The hash is the primary grouping key for all downstream analysis.

Anomaly Detection

tailx uses two complementary anomaly detectors that tick every second. Together they catch both sudden spikes and sustained shifts in event rate.

RateDetector

Dual EWMA (Exponentially Weighted Moving Average) with z-score thresholding.

Architecture

event rate (events/sec)
  │
  ├─ EWMA fast  (10s halflife)  → "current" rate
  ├─ EWMA slow  (5min halflife) → "baseline" rate
  └─ StreamingStats (Welford)   → historical mean/variance → z-score

How it works

  1. Each tick (1 second), the current event count is fed as a sample.
  2. The sample’s z-score is computed against the running historical statistics (before updating them).
  3. Both EWMAs are updated with the sample.
  4. After the warmup period (30 samples), if the z-score >= 3.0 AND the absolute delta between fast and slow EWMA exceeds the minimum threshold (1.0), an anomaly fires.

Spike vs. drop

  • z-score >= 3.0: rate_spike – the event rate is significantly above the historical norm.
  • z-score <= -3.0 (and baseline > minimum threshold): rate_drop – the event rate has significantly dropped. Only fires when the baseline is meaningful (above minimum absolute delta).

Warmup

The first 30 samples are used to build the baseline. No anomalies fire during warmup, preventing false positives from cold start.

Score normalization

The raw z-score is mapped to a 0.0 - 1.0 severity score using a logistic-like function:

score = 1.0 - 1.0 / (1.0 + 0.1 * z^2)

This gives:

  • z = 3.0 -> score ~0.47
  • z = 5.0 -> score ~0.71
  • z = 10.0 -> score ~0.91

CusumDetector

Cumulative Sum (CUSUM) change-point detector. Catches sustained shifts that individual z-scores miss.

The problem CUSUM solves

Imagine the event rate gradually climbs from 100/s to 200/s over 30 seconds. No single tick has a z-score >= 3.0 because each increase is small. But the cumulative shift is significant. CUSUM catches this.

How it works

  1. Each tick, the sample is normalized: (sample - mean) / stddev.
  2. Two cumulative sums are maintained:
    • s_high: accumulates upward deviations minus an allowance (0.5)
    • s_low: accumulates downward deviations minus the same allowance
  3. Both sums are clamped to >= 0 (they cannot go negative).
  4. If s_high exceeds the threshold (5.0 standard deviations), fire change_point_up and reset s_high to 0.
  5. If s_low exceeds the threshold, fire change_point_down and reset s_low to 0.

Cooldown

After firing, a 30-tick cooldown prevents re-firing on the same shift. This avoids alert storms when a new baseline is establishing.

Score

The CUSUM score is:

score = min(1.0, cumulative_sum / (threshold * 2.0))

Capped at 1.0. Higher cumulative sums (larger or longer shifts) produce higher scores.

SignalAggregator

The SignalAggregator manages anomaly alerts across both detectors.

Deduplication

If a detector fires with the same method (e.g., rate_spike) as an existing active alert, the existing alert is updated instead of creating a new one:

  • last_fired_ns is updated
  • fire_count is incremented
  • score is set to the max of old and new

Resolution

An active alert transitions to resolved after 30 seconds of not being re-fired. This means the anomalous condition has ended.

Eviction

Resolved alerts are evicted after 5 minutes. This keeps the alert table clean while retaining recent history for the triage summary.

Capacity

The aggregator holds up to 128 alerts simultaneously.

Correlation Engine

The TemporalProximity analyzer connects anomaly signals to possible causes.

Signal sources

Three types of signals feed the correlation engine:

  1. Anomaly alerts from the RateDetector and CusumDetector
  2. Rising groups – pattern groups whose trend is rising in the current window
  3. Rate changes from detector results

Finding causes

For each active anomaly alert, the engine searches for signals that occurred within a 5-minute window before the anomaly. Candidate causes are ranked by:

strength = (1.0 - normalized_lag) * magnitude

Where normalized_lag is the time lag as a fraction of the 5-minute window. Closer signals with higher magnitude rank higher.

Hypothesis building

The ranked causes form a Hypothesis with:

  • causes[]: up to 8 candidate causes, ordered by strength
  • confidence: the maximum cause strength (a measure of how strongly correlated the top cause is)

Example

t=10s: DB latency spike (anomaly_alert, magnitude=0.8)
t=12s: "Connection refused" group rising (group_spike, magnitude=0.6)
t=15s: Error rate spike (anomaly_alert, magnitude=0.9)  ← the effect

The hypothesis for the error rate spike would include:

  1. DB latency spike (5s lag, strength = 0.73) – closest and high magnitude
  2. “Connection refused” rising (3s lag, strength = 0.57)

This tells the operator (or AI agent): “The error rate spike is likely related to the DB latency spike that started 5 seconds earlier.”

Statistical Structures

All statistical data structures in tailx are O(1) memory and O(1) per-event update. The total statistical engine uses less than 1 MiB of memory.

CountMinSketch

Probabilistic frequency estimator. Answers “how many times have I seen this key?” without storing every key.

Structure

A depth x width matrix of u32 counters. Each row uses a different hash function (wyhash with different seeds). To estimate the count of a key, hash it with each row’s function, look up the counter, and return the minimum across all rows.

Properties

  • Memory: fixed at depth * width * 4 bytes
  • Update: O(depth) – hash and increment one counter per row
  • Query: O(depth) – hash and read one counter per row, return min
  • Error: overestimates only, never undercounts
  • Decay: supports multiplicative decay for sliding window expiry

Usage

Used internally for frequency tracking in the pattern grouping layer.

HyperLogLog

Probabilistic cardinality estimator. Answers “how many distinct values have I seen?” using ~16 KiB of memory.

Configuration

  • Precision: p = 14
  • Registers: 2^14 = 16,384
  • Memory: exactly 16,384 bytes (~16 KiB)
  • Standard error: ~3%

Algorithm

  1. Hash the input key with wyhash -> 64-bit hash
  2. Upper 14 bits select the register index
  3. Count leading zeros of the remaining bits + 1
  4. Store the max of (current register value, leading zeros count)
  5. Estimate: harmonic mean of 2^(-register) values, with bias correction

Merge

Two HyperLogLog sketches merge by taking the register-wise maximum. This makes it composable across sources.

Small range correction

When many registers are still zero, the standard HLL formula overestimates. Linear counting is used instead: m * ln(m / zeros).

TDigest

Streaming percentile estimator. Computes approximate p50, p95, p99 from a stream without storing all values.

Configuration

  • Max centroids: 256
  • Memory: ~4 KiB (256 centroids x 16 bytes each)
  • Compression parameter: 100

How it works

The TDigest maintains a sorted list of (mean, weight) centroids. New values are merged into the nearest centroid, subject to a compression constraint that keeps more centroids at the tails (for accurate extreme percentiles) and fewer in the middle.

Supported queries

  • quantile(0.50) – median
  • quantile(0.95) – 95th percentile
  • quantile(0.99) – 99th percentile
  • Any quantile between 0.0 and 1.0

Accuracy

Higher accuracy at the tails (p1, p99) where it matters most for latency monitoring. The compression parameter (100) trades memory for accuracy – higher values retain more centroids.

EWMA

Exponentially Weighted Moving Average. Tracks a smoothed rate that adapts to changes.

Configuration

// Fast EWMA: 10-second halflife, 1-second tick interval
EWMA.initWithHalflife(10 * std.time.ns_per_s, std.time.ns_per_s)

// Slow EWMA: 5-minute halflife, 1-second tick interval
EWMA.initWithHalflife(300 * std.time.ns_per_s, std.time.ns_per_s)

Alpha computation

The smoothing factor alpha is computed from the halflife:

alpha = 1 - exp(-tick_interval / halflife * ln(2))

A 10-second halflife means after 10 seconds, the influence of old values has decayed by 50%.

Time-weighted updates

The EWMA handles irregular update intervals by adjusting the effective alpha based on the actual elapsed time since the last update. This prevents drift when ticks are not perfectly regular.

Dual EWMA in anomaly detection

The RateDetector uses two EWMAs:

  • Fast (10s halflife): tracks the “current” rate – responds quickly to changes
  • Slow (5min halflife): tracks the “baseline” – represents the normal rate

When the fast EWMA diverges significantly from the slow EWMA, something has changed.

StreamingStats

Welford’s online algorithm for running mean, variance, standard deviation, and z-score.

What it computes

  • Mean: running average
  • Variance: running population variance
  • Standard deviation: sqrt(variance)
  • Z-score: (value - mean) / stddev

Properties

  • Single-pass, numerically stable
  • O(1) memory (stores count, mean, M2)
  • O(1) per update
  • No stored samples – cannot compute percentiles (use TDigest for that)

Usage

Used by both the RateDetector and CusumDetector to compute z-scores of event rate samples against their historical distribution.

TimeWindow

Circular bucket array for time-bucketed statistics.

Structure

TimeWindow {
    buckets: []Bucket,     // circular array
    bucket_count: u16,     // number of buckets
    duration_ns: i128,     // total window span
    bucket_duration_ns: i128, // duration per bucket
    head: u16,             // current bucket index
}

Each Bucket stores:

  • count: number of records
  • sum: sum of values
  • min: minimum value
  • max: maximum value
  • start_ns: bucket start time

Operations

  • advance(now_ns): advance the head to the bucket covering now_ns, clearing expired buckets
  • record(value): add a value to the current bucket
  • rate(): compute the overall rate across all buckets

Usage

Used for time-windowed rate calculations and trend detection in the pattern grouping layer.

Memory budget

StructureSizeCountTotal
CountMinSketch (per instance)depth x width x 4 bytesvaries< 64 KiB
HyperLogLog16,384 bytes116 KiB
TDigest~4 KiBvaries< 16 KiB
EWMA48 bytes2 (rate detector)96 bytes
StreamingStats32 bytes2 (detectors)64 bytes
TimeWindowvaries by bucket countvaries< 32 KiB
Total statistical engine< 1 MiB

CLI Reference

tailx [OPTIONS] [FILES...] [QUERY]

tailx processes log files or stdin, auto-detects formats, extracts structure, groups patterns, detects anomalies, and outputs results to the terminal or as JSON.

Modes

(default) – Pattern mode

tailx app.log

Events are printed line-by-line with severity badges and service names. A ranked pattern summary is displayed at the end (batch mode) or every 500 events (follow mode). This is the mode for interactive triage.

--raw

tailx --raw app.log

Classic tail output. Events are printed with basic formatting (severity badge, service name, message) but no pattern summary, no anomaly alerts, no group rankings. The full pipeline still runs internally.

--trace

tailx --trace app.log

Groups events by trace_id and displays them as tree views with duration and outcome. Events without a trace_id are not shown. The pattern summary is still displayed at the end.

--incident

tailx --incident app.log

Suppresses normal event output. Only displays active anomaly alerts and the pattern summary. Use this for alerting and on-call scenarios where you only want to see signals.

--json

tailx --json app.log

Outputs JSONL (one JSON object per line). Event objects are emitted as events arrive. The triage summary is always the last line. Designed for AI agents and scripts.

Filters

-l, --severity <level>

tailx --severity warn app.log
tailx -l error app.log

Minimum severity threshold for display. Valid levels: trace, debug, info, warn, error, fatal.

Events below the threshold are still processed by the pipeline – filtering is display-only.

-g, --grep <string>

tailx --grep timeout app.log
tailx -g "connection refused" app.log

Filter events whose message contains the given substring. Uses Boyer-Moore-Horspool for fast matching. Case-sensitive.

--service <name>

tailx --service payments app.log
tailx --service nginx app.log

Filter events by exact service name match. The service is auto-detected from the log format (JSON service key, syslog app name, bracketed text in unstructured logs).

--trace-id <id>

tailx --trace-id req-abc-123 app.log

Filter events by exact trace ID match. Best combined with --trace mode to inspect a single request flow.

--field <key=value>

tailx --field status=500 app.log
tailx --field hostname=web01 app.log
tailx --field user_id=42 app.log

Filter events by field value. Supports string and integer comparison – if the event field is an integer and the filter value parses as an integer, numeric comparison is used.

--last <duration>

tailx --last 5m app.log
tailx --last 1h app.log
tailx --last 30s app.log
tailx --last 2d app.log

Only display events from within the given time window. Supported suffixes: s (seconds), m (minutes), h (hours), d (days).

Options

-f, --follow

tailx -f app.log
tailx --follow app.log

Follow files for new data (default behavior). tailx uses poll() to efficiently wait for new data. Detects file truncation (copytruncate) and rotation (new inode at same path).

-n, --no-follow

tailx -n app.log
tailx --no-follow app.log

Read to EOF and stop. Do not wait for new data. Use this for batch analysis of complete files.

-s, --from-start

tailx -s app.log
tailx --from-start app.log

Start reading from the beginning of the file. By default, tailx seeks to the end and only shows new data (like tail -f). Combine with -n for full file analysis:

tailx -s -n app.log

--no-color

tailx --no-color app.log

Disable ANSI color codes in output. Color is also automatically disabled when stdout is not a terminal (piped to a file or another command) or when using --json mode.

--ring-size <n>

tailx --ring-size 131072 app.log

Set the event ring buffer capacity. Default: 65536 (64K events). Must be a power of 2 for efficient bitwise modulo indexing. Larger values retain more history but use more memory.

-h, --help

tailx --help

Display usage information with all modes, filters, options, and examples.

-V, --version

tailx --version
# tailx v1.0

Display the version string.

Positional arguments

Files

tailx app.log
tailx /var/log/*.log
tailx access.log error.log

One or more file paths. Glob patterns (*, ?) are expanded. Multiple files are merged into a single event stream, with source names displayed when more than one file is open.

Intent queries

tailx "errors related to payments" app.log
tailx "5xx from nginx" app.log
tailx "timeout" app.log

If a positional argument is not an existing file path, it is treated as a natural language intent query. Keywords are mapped to filters (severity thresholds, service names, message substrings). See Intent Queries.

Stdin

cat app.log | tailx
journalctl -u myservice | tailx
dmesg | tailx --severity warn

When no files are specified and stdin is not a terminal, tailx reads from stdin. All modes and filters work with stdin input.

Examples

# Tail a file with pattern grouping
tailx app.log

# Full file analysis
tailx -s -n app.log

# Only errors from the payments service
tailx -l error --service payments app.log

# Kernel warnings from dmesg
dmesg | tailx -l warn

# Anomaly-only view across multiple files
tailx --incident *.log

# Trace a specific request
tailx --trace --trace-id req-abc-123 app.log

# JSON output for AI consumption
tailx --json -s -n --last 5m app.log

# Natural language query
tailx "why are payments failing" app.log

# Multiple files with severity filter
tailx -l warn access.log error.log system.log

Supported Formats

tailx auto-detects the log format for each source independently. Detection locks after 8 samples. No configuration required.

JSON / JSONL

{"level":"error","msg":"Connection refused","service":"payments","latency_ms":240,"trace_id":"req-001"}

Detection: line starts with { and ends with } (after trimming whitespace).

Known field keys

JSON keyMaps to
timestamp, ts, time, @timestamp, datetime, tevent timestamp
level, severity, lvl, loglevel, log_levelevent severity
message, msg, log, text, bodyevent message
trace_id, traceId, trace, x-trace-id, request_idevent trace_id
service, service_name, app, application, componentevent service

All other keys become structured fields on the event. Values are parsed as their JSON types: strings, integers (i64), floats (f64), booleans, and null.

Timestamp handling

  • String values: parsed as ISO 8601
  • Integers > 946684800000: epoch milliseconds
  • Integers > 946684800: epoch seconds
  • Floats: epoch seconds with fractional part

Logfmt

ts=2024-03-15T14:23:01Z level=error msg="Connection refused" service=payments latency_ms=240

Detection: 3+ key=value pairs AND contains level=/lvl= AND msg=/message=.

Same known field keys as JSON. Values can be bare words (level=error) or double-quoted strings (msg="hello world"). Bare values that parse as numbers are stored as integers or floats.

Syslog BSD (RFC 3164)

<134>Mar 15 14:23:01 web01 nginx[1234]: GET /api 200 0.012

Detection: line starts with < followed by digits and >.

PRI decoding

The PRI value (0-191) encodes facility and severity. Severity = PRI mod 8:

PRI mod 8Syslog severitytailx severity
0Emergencyfatal
1Alertfatal
2Criticalfatal
3Errorerror
4Warningwarn
5Noticeinfo
6Informationalinfo
7Debugdebug

Extracted fields

  • severity: from PRI, or inferred from message keywords
  • service: from app name (nginx from nginx[1234])
  • hostname: stored as a structured field
  • pid: stored as a structured field (integer if parseable)
  • message: everything after app[pid]:

Journalctl compatibility

Journalctl output omits the PRI prefix but follows the same BSD syslog structure:

Mar 15 14:23:01 web01 nginx[1234]: GET /api 200 0.012

The parser handles this by treating the PRI as optional. When no PRI is present, severity is inferred from message content (keywords like error, warn, [ERROR], etc.).

Key-Value pairs

host=db01 cpu=0.85 memory=0.72 disk=0.45

Detection: 3+ key=value pairs (without the logfmt-specific level= and msg= keys).

Same known field keys as JSON. Values are bare words or quoted strings. Numeric inference applies to bare values.

CLF (Common Log Format)

10.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.1" 200 2326

Detection: IP/hostname, then -, then [, then " within first 80 bytes.

CLF lines are parsed by the fallback parser, which extracts what it can from the structure.

Unstructured text

2024-03-15 14:23:01 ERROR [PaymentService] Connection refused to db:5432

Detection: everything that does not match the above formats.

The fallback parser extracts:

  1. Timestamp prefix: ISO 8601 or similar date/time at the start of the line (skipped)
  2. Severity: bare keywords (ERROR, WARN, INFO, DEBUG, TRACE, FATAL) or bracketed ([ERROR], [WARN])
  3. Service: text in brackets ([PaymentService])
  4. Message: the remainder after extracting the above

Format mixing

Different sources can have different formats. A single tailx invocation can process JSON from one file and syslog from another:

tailx app.log api.json.log

Each source locks to its detected format independently after 8 lines.

Detection priority on ties

When two formats have equal votes after 8 samples, the more structured format wins:

  1. JSON / JSONL (highest priority)
  2. Logfmt
  3. Key-Value pairs
  4. Syslog BSD / Syslog IETF / CLF
  5. Unstructured (lowest priority)

Performance

tailx is built for speed. The entire processing pipeline – parsing, template extraction, grouping, anomaly detection, correlation – runs in the per-event hot path with zero heap allocation after initialization.

Throughput

MetricValue
End-to-end throughput69,000 events/sec (single core)
Measured on47,000 mixed-format lines in 3.1s
Full pipelineparse + Drain + group + trace + anomaly + correlation

This is not a synthetic benchmark. It is measured throughput on real production log data through the complete 12-stage pipeline.

Binary size

Build modeSize
ReleaseSmall (stripped)144 KB
ReleaseSafe3.1 MB

The 144 KB ReleaseSmall binary fits in L2 cache on most modern CPUs. It contains zero external dependencies – no PCRE, no libc (where avoidable), no vendored C code.

Startup time

Cold start is under 1 millisecond. There is no runtime to initialize, no JIT to warm up, no garbage collector to configure. The first event is processed within microseconds of launch.

Memory

Event storage

StructureMemoryNotes
Event struct256 bytesFixed size, cache-line friendly
EventRing (default)16 MB65,536 events x 256 bytes
ArenaPool64 MB max16 arenas x 4 MB, generation-tagged

The EventRing uses power-of-2 capacity for bitwise modulo indexing (index & (capacity - 1) instead of index % capacity). This eliminates a division instruction in the per-event hot path.

Statistical engine

StructureMemory
CountMinSketch< 64 KiB
HyperLogLog16 KiB (exactly 16,384 registers)
TDigest~4 KiB (256 centroids)
EWMA (x2)96 bytes
StreamingStats (x2)64 bytes
TimeWindow< 32 KiB
Total< 1 MiB

Pattern grouping

StructureMemoryNotes
GroupTablescales with unique templatestypically 1-5 MiB
DrainTree4,096 cluster slotsfixed allocation

Anomaly detection

StructureMemory
RateDetector~200 bytes
CusumDetector~200 bytes
SignalAggregator (128 slots)~32 KiB

Trace reconstruction

StructureMemory
TraceStore active (256 traces)~512 KiB
TraceStore finalized (512 traces)~1 MiB

Correlation

StructureMemory
TemporalProximity (256 signals)~64 KiB

Allocation strategy

tailx uses three allocation strategies:

  1. Arena allocation for event data (messages, fields, strings). Generation-tagged arenas allow bulk free on window expiry. Zero per-event free calls.

  2. General-purpose allocation for long-lived singletons (EventRing, DrainTree, GroupTable, TraceStore). Allocated once at startup, freed at shutdown.

  3. Stack allocation for small fixed-size buffers (< 4 KiB). No heap involvement.

After initialization, the per-event hot path performs zero heap allocations. All event data is copied into the current arena, which is a bump allocator (pointer increment only).

Per-operation targets

OperationTargetAchieved
Event struct size256 bytes256 bytes
EventRing push+get1M events correctTested
Drain template extraction0.5 us/lineOn target
Filter evaluation (3 predicates)100 ns/eventOn target
Group classify (hash lookup)O(1)O(1)
Anomaly detector tick10 ms/tick< 1 ms
Correlation engine tick10 ms/tick< 1 ms

What makes it fast

  1. No GC, no runtime: Zig compiles to native code with no runtime overhead. No stop-the-world pauses.

  2. Arena allocation: event data is bump-allocated (pointer increment). No per-event malloc/free.

  3. Power-of-2 ring buffer: bitwise AND instead of modulo division for index wrapping.

  4. Fixed-size Event struct: 256 bytes, fits in 4 cache lines. No pointer chasing for common fields.

  5. Boyer-Moore-Horspool: substring search for --grep uses a bad-character table for O(n/m) average-case matching.

  6. FNV-1a template hash: fast, well-distributed hash for template fingerprinting.

  7. Inline everything: hot path functions are small enough for the compiler to inline. No virtual dispatch.

  8. No external dependencies: the entire binary is self-contained Zig code. No FFI overhead, no dynamic linking.