Monitoring
This page covers how to monitor a Ripple cluster: Prometheus scraping configuration, the full list of pre-registered metrics, recommended alert thresholds, and structured logging.
Prometheus Scraping
Every Ripple worker and coordinator exposes metrics on an HTTP endpoint:
| Component | Port | Path | Content-Type |
|---|---|---|---|
| Worker | 9102 | /metrics | text/plain; version=0.0.4 |
| Coordinator | 9202 | /metrics | text/plain; version=0.0.4 |
Kubernetes Auto-Discovery
Workers and coordinators have Prometheus annotations in their pod templates:
```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9102"
    prometheus.io/path: "/metrics"
```
With Prometheus Kubernetes service discovery, no additional scrape configuration is needed. Prometheus will automatically discover and scrape all Ripple pods.
Manual Scrape Configuration
If not using Kubernetes service discovery:
```yaml
scrape_configs:
  - job_name: 'ripple-workers'
    static_configs:
      - targets:
          - 'worker-0:9102'
          - 'worker-1:9102'
          - 'worker-2:9102'
    scrape_interval: 5s

  - job_name: 'ripple-coordinator'
    static_configs:
      - targets:
          - 'coordinator-0:9202'
    scrape_interval: 10s
```
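Either way, the endpoints serve the standard Prometheus text exposition format (text/plain; version=0.0.4), so a response body can be sanity-checked by hand outside Prometheus. A minimal Python sketch with a hardcoded sample payload (HELP/TYPE metadata and full label escaping are ignored for brevity; the sample values are illustrative):

```python
import re

# Sample of what a worker's /metrics endpoint might return (illustrative values).
SAMPLE = """\
# TYPE ripple_graph_node_count gauge
ripple_graph_node_count{worker="w0"} 4001
ripple_transport_backpressure_level{worker="w0"} 0.12
"""

# One metric line: name, optional {label="value",...} block, then the value.
LINE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')

def parse_metrics(text):
    """Parse exposition-format text into {(name, sorted_labels): value}."""
    out = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip comments and HELP/TYPE metadata
        m = LINE.match(line)
        if not m:
            continue
        name, labels, value = m.group(1), m.group(2) or "", m.group(3)
        label_d = dict(re.findall(r'(\w+)="([^"]*)"', labels))
        out[(name, tuple(sorted(label_d.items())))] = float(value)
    return out

metrics = parse_metrics(SAMPLE)
```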
Pre-Registered Metrics
Graph Engine Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| ripple_graph_stabilizations_total | Counter | worker | Total stabilization cycles executed |
| ripple_graph_cutoff_hits_total | Counter | worker | Times cutoff prevented unnecessary propagation |
| ripple_graph_node_count | Gauge | worker | Total nodes in the computation graph |
| ripple_graph_dirty_count | Gauge | worker | Nodes currently marked dirty |
| ripple_graph_stabilization_ns | Histogram | worker | Stabilization cycle duration (nanoseconds) |
| ripple_graph_recompute_count | Histogram | worker | Nodes recomputed per stabilization |
Key metric: ripple_graph_stabilization_ns. The p99 of this histogram should stay below 10,000,000 ns (10 ms). If it exceeds this threshold, the worker is likely processing too many symbols per partition.
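Prometheus histograms expose cumulative `_bucket` series, and histogram_quantile estimates a quantile by interpolating linearly inside the bucket that contains the target rank. A Python sketch of that computation (bucket boundaries here are illustrative, not Ripple's actual layout):

```python
def histogram_quantile(q, buckets):
    """Approximate a quantile from cumulative histogram buckets, mirroring
    Prometheus's linear interpolation within the matching bucket.
    buckets: list of (upper_bound, cumulative_count), sorted by bound."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:
                return bound
            # Interpolate between the bucket's lower and upper bound.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Illustrative stabilization-duration buckets in nanoseconds (not Ripple's
# real boundaries): 950 of 1000 cycles finished within 10 us.
buckets = [(1_000, 120), (10_000, 950), (100_000, 990), (1_000_000, 1000)]
p99 = histogram_quantile(0.99, buckets)
```

The same interpolation is what the StabilizationSlow alert below relies on, which is why bucket boundaries should bracket the threshold you care about.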
Transport Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| ripple_transport_deltas_sent_total | Counter | worker, output | Deltas sent to downstream workers |
| ripple_transport_deltas_received_total | Counter | worker, source | Deltas received from upstream workers |
| ripple_transport_delta_bytes_total | Counter | worker | Total bytes transmitted as deltas |
| ripple_transport_rpc_latency_ns | Histogram | worker | RPC round-trip latency (nanoseconds) |
| ripple_transport_backpressure_level | Gauge | worker | Current backpressure (0.0 to 1.0) |
Key metric: ripple_transport_backpressure_level. Values above 0.8 trigger a Critical alert. Sustained values above 0.95 cause the worker to pause input consumption.
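The pause behavior can be pictured as a small state machine. This Python sketch is illustrative only: the 0.95 pause level comes from the description above, but the 0.80 resume level is an assumed hysteresis band, not Ripple's actual implementation:

```python
class BackpressureGate:
    """Sketch of pause/resume on backpressure. PAUSE_AT matches the
    documented 0.95 cutoff; RESUME_AT is an assumption added here so the
    gate does not flap around a single threshold."""
    PAUSE_AT = 0.95
    RESUME_AT = 0.80  # assumed hysteresis, not from the Ripple docs

    def __init__(self):
        self.paused = False

    def observe(self, level):
        """Feed one backpressure sample; return whether input is paused."""
        if not self.paused and level > self.PAUSE_AT:
            self.paused = True
        elif self.paused and level < self.RESUME_AT:
            self.paused = False
        return self.paused
```

Hysteresis of this kind is a common design choice: without it, a worker hovering near the cutoff would pause and resume input on every sample.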
Window Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| ripple_window_active_windows | Gauge | worker | Number of open (not yet triggered) windows |
| ripple_window_late_events_total | Counter | worker | Mildly late events (within allowed lateness) |
| ripple_window_very_late_events_total | Counter | worker | Very late events (dropped) |
| ripple_window_retractions_total | Counter | worker | Window result retractions issued |
Key metric: ripple_window_very_late_events_total. A non-zero rate indicates either clock skew in the data source or insufficient allowed lateness.
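The split between the two lateness counters follows from simple watermark arithmetic. A hypothetical Python classifier mirroring the semantics described above (illustrative only, not Ripple's API):

```python
def classify(event_time_ns, watermark_ns, allowed_lateness_ns):
    """Classify an event relative to the current watermark.
    'late' events are still counted into windows (and increment
    ripple_window_late_events_total); 'very_late' events are dropped
    (and increment ripple_window_very_late_events_total)."""
    if event_time_ns >= watermark_ns:
        return "on_time"
    if watermark_ns - event_time_ns <= allowed_lateness_ns:
        return "late"
    return "very_late"
```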
System Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
ripple_system_event_lag_ns | Histogram | worker | Event time lag (now - event_time) |
ripple_system_heap_words | Gauge | worker | OCaml GC heap size in words |
Key metric: ripple_system_heap_words. If this grows continuously, the worker is likely leaking memory. A healthy worker should show < 0.1% heap growth per checkpoint interval.
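That guideline is easy to check mechanically against a series of heap samples. An illustrative Python helper (the function name and one-sample-per-checkpoint-interval convention are assumptions, not a Ripple tool):

```python
def heap_growth_ok(samples, max_growth_per_interval=0.001):
    """Return True if no consecutive pair of heap samples (in words, one
    per checkpoint interval) grows by more than the 0.1% guideline."""
    return all(cur <= prev * (1.0 + max_growth_per_interval)
               for prev, cur in zip(samples, samples[1:]))
```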
Alert Thresholds
Recommended Alert Rules
```yaml
groups:
  - name: ripple
    rules:
      - alert: StabilizationSlow
        # histogram_quantile must be fed per-bucket rates, aggregated by le.
        expr: >
          histogram_quantile(0.99,
            sum by (le, worker) (rate(ripple_graph_stabilization_ns_bucket[5m])))
          > 10000000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Stabilization p99 > 10ms on {{ $labels.worker }}"

      - alert: BackpressureHigh
        expr: ripple_transport_backpressure_level > 0.8
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Backpressure at {{ $value | humanizePercentage }} on {{ $labels.worker }}"

      - alert: LateEventsExcessive
        expr: >
          rate(ripple_window_very_late_events_total[5m])
            / rate(ripple_graph_stabilizations_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Very late events exceed 1% on {{ $labels.worker }}"

      - alert: HeapGrowth
        expr: deriv(ripple_system_heap_words[10m]) > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Continuous heap growth on {{ $labels.worker }}"

      - alert: WorkerDown
        expr: up{job="ripple-workers"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Worker {{ $labels.instance }} is down"

      - alert: NoStabilizations
        # Match on instance so the extra labels on the rate() side don't
        # prevent the set intersection with up.
        expr: >
          rate(ripple_graph_stabilizations_total[5m]) == 0
          and on (instance) up{job="ripple-workers"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Worker {{ $labels.worker }} has not stabilized in 5 minutes"
```
Threshold Summary
| Condition | Threshold | Severity | Action |
|---|---|---|---|
| Stabilization p99 > 10ms | > 10_000_000 ns | Warning | Check partition size, consider splitting |
| Backpressure > 80% | > 0.8 | Critical | Scale out workers or increase parallelism |
| Backpressure > 95% | > 0.95 | Critical | Worker pauses input – immediate action needed |
| Very late events > 1% | > 0.01 ratio | Warning | Increase allowed_lateness_sec or fix source clocks |
| Checkpoint age > 20s | > 2x interval | Critical | Check checkpoint store (S3) availability |
| Worker heartbeat > 30s | > 30s | Critical | Worker is dead, coordinator will rebalance |
| Heap growth > 0.1%/interval | deriv > 10000 | Warning | Possible memory leak, investigate with GC stats |
Structured Logging
All Ripple log entries are machine-parseable s-expressions:
```ocaml
type entry =
  { level : level                  (* Debug | Info | Warn | Error *)
  ; timestamp_ns : Int64.t
  ; worker_id : string option
  ; component : string             (* "graph", "transport", "checkpoint", etc. *)
  ; message : string
  ; context : (string * string) list
  }
[@@deriving sexp]
```
Example log output:
```
((level Info)(timestamp_ns 1709000000000000000)(worker_id (w0))(component graph)(message stabilized)(context ((nodes 4001)(recomputed 3)(dirty 0))))
```
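The fixed shape of these entries makes ad-hoc parsing straightforward even without an OCaml toolchain. A minimal Python s-expression reader, enough for this log format (sketch: bare atoms and parentheses only, no quoted-string escaping):

```python
def parse_sexp(s):
    """Read one s-expression into nested Python lists of string atoms."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()

    def read(i):
        if tokens[i] == "(":
            lst, i = [], i + 1
            while tokens[i] != ")":
                node, i = read(i)
                lst.append(node)
            return lst, i + 1
        return tokens[i], i + 1

    node, _ = read(0)
    return node

line = ('((level Info)(timestamp_ns 1709000000000000000)(worker_id (w0))'
        '(component graph)(message stabilized)'
        '(context ((nodes 4001)(recomputed 3)(dirty 0))))')

# A log entry is a list of (key value) pairs, so it maps directly to a dict.
entry = dict(parse_sexp(line))
```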
Log Levels
| Level | Usage |
|---|---|
| Debug | Per-stabilization diagnostics, heartbeat details |
| Info | Worker lifecycle transitions, checkpoint writes |
| Warn | Late events, slow stabilization, clock skew |
| Error | Schema mismatch, checkpoint failure, crash recovery |
Parsing Logs
Because logs are s-expressions, they can be parsed with any sexp library or with standard Unix tools:
```sh
# Extract all Error-level entries
grep '(level Error)' ripple.log | sexp query '(field message)'

# Count stabilizations per worker
grep '(message stabilized)' ripple.log | grep -o '(worker_id [^)]*' | sort | uniq -c
```
In production, ship logs to a centralized system (ELK, Loki, Datadog) and parse the sexp structure for indexing and alerting.