Drain Template Extraction

Drain is the algorithm that collapses thousands of repetitive log lines into a handful of structural templates. It is the foundation of pattern grouping – without it, every unique log message would be its own group.

The problem

These three log lines are structurally identical:

Connection to 10.0.0.1 timed out after 30s
Connection to 10.0.0.2 timed out after 45s
Connection to 10.0.0.3 timed out after 12s

They differ only in the IP address and timeout duration. A human sees “connection timeout” immediately. Drain teaches tailx to see the same thing.

How it works

1. Tokenize

Split the message by whitespace into tokens.

["Connection", "to", "10.0.0.1", "timed", "out", "after", "30s"]

2. Classify tokens

Each token is classified as either a literal or a wildcard (<*>):

Contains any digit -> wildcard. This catches IPs, ports, durations, counts, UUIDs, timestamps.
Quoted string (starts and ends with ") -> wildcard.
Everything else -> literal.

["Connection", "to", "<*>", "timed", "out", "after", "<*>"]

3. Match against existing clusters

Search existing clusters for one with:

The same token count
Similarity >= 0.5 (the sim_threshold)

Similarity is computed as the fraction of positions where both tokens match (both are wildcards, or both are the same literal):

similarity = matching_positions / total_positions

4. Merge or create

If a match is found: merge the new tokens into the existing cluster. Any position where the existing template has a literal but the new line has a different literal gets generalized to <*>.

If no match is found: create a new cluster with the classified tokens.

5. Hash the template

The final template tokens are hashed with FNV-1a to produce a u64 template_hash. All events that map to the same template get the same hash.

"Connection to <*> timed out after <*>"  →  hash: 0x3a7f...

Example walkthrough

Line 1: Connection to 10.0.0.1 timed out after 30s

Classified: ["Connection", "to", "<*>", "timed", "out", "after", "<*>"]

No existing clusters. Create cluster #0.

Line 2: Connection to 10.0.0.2 timed out after 45s

Classified: ["Connection", "to", "<*>", "timed", "out", "after", "<*>"]

Cluster #0 has 7 tokens, this has 7 tokens. Similarity = 7/7 = 1.0 >= 0.5. Match. All positions agree. Cluster #0 count becomes 2.

Line 3: User logged in from 10.0.0.1 at 14:00

Classified: ["User", "logged", "in", "from", "<*>", "at", "<*>"]

Cluster #0 has 7 tokens, this has 7 tokens. But similarity: position 0 “Connection” vs “User” = mismatch, position 1 “to” vs “logged” = mismatch… similarity < 0.5. No match. Create cluster #1.

Line 4: Error 500 on server web01

Classified: ["Error", "<*>", "on", "server", "<*>"]

Only 5 tokens. Cluster #0 has 7, cluster #1 has 7. Token count mismatch for both. Create cluster #2.

Line 5: Error 404 on server web02

Classified: ["Error", "<*>", "on", "server", "<*>"]

Cluster #2 has 5 tokens, this has 5 tokens. Similarity = 5/5 = 1.0. Match. Cluster #2 count becomes 2.

Configuration

The DrainTree is initialized with:

max_depth: 4 (controls the depth of the classification tree – in this implementation, used as a parameter but matching is linear across clusters)
sim_threshold: 0.5 (minimum similarity to match an existing cluster)
max_clusters: 4096 (hard limit on the number of distinct templates)

When the cluster limit is reached, new messages that don’t match an existing cluster are still hashed (from their classified tokens) but don’t create new clusters.

Why these rules work

The “contains any digit -> wildcard” rule is surprisingly effective because most variable parts in log messages contain digits:

IP addresses: 10.0.0.1
Ports: 5432
Durations: 30s, 250ms
Counts: 42 items
HTTP status codes: 200, 500
UUIDs: 550e8400-e29b-41d4-a716-446655440000
Timestamps: 14:23:01
PIDs: [1234]

The few variable tokens without digits (usernames, hostnames) may not get wildcarded, but they will either match literally (same user) or cause a new cluster (different user). Over time, if both forms appear, the merge step generalizes the position to <*>.

Template hash

The hash function is FNV-1a over the concatenated template tokens (with space separators). This is a fast, well-distributed hash that produces a u64 – the template_hash stored on every event.

Events with the same template_hash are grouped together in the GroupTable. The hash is the primary grouping key for all downstream analysis.

Keyboard shortcuts

tailx Documentation