Monitoring

This page covers how to monitor a Ripple cluster: Prometheus scraping configuration, the full list of pre-registered metrics, recommended alert thresholds, and structured logging.

Prometheus Scraping

Every Ripple worker and coordinator exposes metrics on an HTTP endpoint:

| Component   | Port | Path     | Content-Type              |
|-------------|------|----------|---------------------------|
| Worker      | 9102 | /metrics | text/plain; version=0.0.4 |
| Coordinator | 9202 | /metrics | text/plain; version=0.0.4 |
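In a live cluster, a target can be spot-checked with `curl http://worker-0:9102/metrics`. The payload is the standard Prometheus text exposition format, where each sample line is `name{labels} value`; a parsing sketch against an illustrative sample (the metric value shown is made up):

```shell
# A sample of the text exposition format served at /metrics
sample='# TYPE ripple_graph_node_count gauge
ripple_graph_node_count{worker="w0"} 4001'
# Extract the value from a sample line ("name{labels} value")
nodes=$(printf '%s\n' "$sample" | grep '^ripple_graph_node_count' | awk '{print $2}')
echo "$nodes"
```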

Kubernetes Auto-Discovery

Workers and coordinators have Prometheus annotations in their pod templates:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9102"
    prometheus.io/path: "/metrics"

With Prometheus Kubernetes service discovery, no additional scrape configuration is needed. Prometheus will automatically discover and scrape all Ripple pods.
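For reference, a typical relabeling block that honors these annotations under pod-role service discovery. This is the standard Prometheus pattern, not Ripple-specific configuration; adjust the job name and any filtering to your setup:

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods that opt in via the annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Honor a custom metrics path if annotated
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the target address to the annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```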

Manual Scrape Configuration

If not using Kubernetes service discovery:

scrape_configs:
  - job_name: 'ripple-workers'
    static_configs:
      - targets:
        - 'worker-0:9102'
        - 'worker-1:9102'
        - 'worker-2:9102'
    scrape_interval: 5s

  - job_name: 'ripple-coordinator'
    static_configs:
      - targets:
        - 'coordinator-0:9202'
    scrape_interval: 10s

Pre-Registered Metrics

Graph Engine Metrics

| Metric                            | Type      | Labels | Description                                   |
|-----------------------------------|-----------|--------|-----------------------------------------------|
| ripple_graph_stabilizations_total | Counter   | worker | Total stabilization cycles executed           |
| ripple_graph_cutoff_hits_total    | Counter   | worker | Times cutoff prevented unnecessary propagation |
| ripple_graph_node_count           | Gauge     | worker | Total nodes in the computation graph          |
| ripple_graph_dirty_count          | Gauge     | worker | Nodes currently marked dirty                  |
| ripple_graph_stabilization_ns     | Histogram | worker | Stabilization cycle duration (nanoseconds)    |
| ripple_graph_recompute_count      | Histogram | worker | Nodes recomputed per stabilization            |

Key metric: ripple_graph_stabilization_ns. The p99 of this histogram should stay below 10,000,000 ns (10 ms). If it exceeds this, the worker is processing too many symbols per partition.
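On a dashboard, the same p99 can be charted per worker. Note that Prometheus client libraries expose a histogram as `_bucket` time series with an `le` label, so the quantile is taken over bucket rates (a sketch, assuming that standard naming):

```promql
# p99 stabilization latency per worker, in nanoseconds
histogram_quantile(
  0.99,
  sum by (worker, le) (rate(ripple_graph_stabilization_ns_bucket[5m]))
)
```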

Transport Metrics

| Metric                                 | Type      | Labels         | Description                           |
|----------------------------------------|-----------|----------------|---------------------------------------|
| ripple_transport_deltas_sent_total     | Counter   | worker, output | Deltas sent to downstream workers     |
| ripple_transport_deltas_received_total | Counter   | worker, source | Deltas received from upstream workers |
| ripple_transport_delta_bytes_total     | Counter   | worker         | Total bytes transmitted as deltas     |
| ripple_transport_rpc_latency_ns        | Histogram | worker         | RPC round-trip latency (nanoseconds)  |
| ripple_transport_backpressure_level    | Gauge     | worker         | Current backpressure (0.0 to 1.0)     |

Key metric: ripple_transport_backpressure_level. Values above 0.8 trigger a Critical alert. Sustained values above 0.95 cause the worker to pause input consumption.
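Two dashboard query sketches for spotting trouble before the alert fires:

```promql
# Current backpressure per worker
max by (worker) (ripple_transport_backpressure_level)

# Change over the last 5 minutes; a steady climb predicts saturation
delta(ripple_transport_backpressure_level[5m])
```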

Window Metrics

| Metric                               | Type    | Labels | Description                                  |
|--------------------------------------|---------|--------|----------------------------------------------|
| ripple_window_active_windows         | Gauge   | worker | Number of open (not yet triggered) windows   |
| ripple_window_late_events_total      | Counter | worker | Mildly late events (within allowed lateness) |
| ripple_window_very_late_events_total | Counter | worker | Very late events (dropped)                   |
| ripple_window_retractions_total      | Counter | worker | Window result retractions issued             |

Key metric: ripple_window_very_late_events_total. A non-zero rate indicates either clock skew in the data source or insufficient allowed lateness.
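Query sketches for this metric. The second is one possible diagnostic ratio (not the alert expression used below): the fraction of late arrivals that were dropped rather than merged:

```promql
# Very-late (dropped) events per second, per worker
sum by (worker) (rate(ripple_window_very_late_events_total[5m]))

# Fraction of late arrivals that were dropped rather than merged
rate(ripple_window_very_late_events_total[5m])
  / (rate(ripple_window_late_events_total[5m])
     + rate(ripple_window_very_late_events_total[5m]))
```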

System Metrics

| Metric                     | Type      | Labels | Description                      |
|----------------------------|-----------|--------|----------------------------------|
| ripple_system_event_lag_ns | Histogram | worker | Event time lag (now - event_time) |
| ripple_system_heap_words   | Gauge     | worker | OCaml GC heap size in words      |

Key metric: ripple_system_heap_words. If this grows continuously, there is a memory leak. A healthy worker should show < 0.1% heap growth per checkpoint interval.
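The growth rate can be charted directly; the relative form below is a sketch for comparing against the 0.1%-per-interval guideline:

```promql
# Absolute heap growth, in words per second
deriv(ripple_system_heap_words[10m])

# Relative growth rate (fraction of the current heap, per second)
deriv(ripple_system_heap_words[10m]) / ripple_system_heap_words
```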

Alert Thresholds

groups:
  - name: ripple
    rules:
      - alert: StabilizationSlow
        expr: histogram_quantile(0.99, rate(ripple_graph_stabilization_ns_bucket[5m])) > 10000000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Stabilization p99 > 10ms on {{ $labels.worker }}"

      - alert: BackpressureHigh
        expr: ripple_transport_backpressure_level > 0.8
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Backpressure at {{ $value | humanizePercentage }} on {{ $labels.worker }}"

      - alert: LateEventsExcessive
        expr: >
          rate(ripple_window_very_late_events_total[5m])
          / rate(ripple_graph_stabilizations_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Very late events exceed 1% on {{ $labels.worker }}"

      - alert: HeapGrowth
        expr: >
          deriv(ripple_system_heap_words[10m]) > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Continuous heap growth on {{ $labels.worker }}"

      - alert: WorkerDown
        expr: up{job="ripple-workers"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Worker {{ $labels.instance }} is down"

      - alert: NoStabilizations
        expr: >
          rate(ripple_graph_stabilizations_total[5m]) == 0
          and on (instance) up{job="ripple-workers"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Worker {{ $labels.worker }} has not stabilized in 5 minutes"

Threshold Summary

| Condition                   | Threshold       | Severity | Action                                            |
|-----------------------------|-----------------|----------|---------------------------------------------------|
| Stabilization p99 > 10ms    | > 10_000_000 ns | Warning  | Check partition size, consider splitting          |
| Backpressure > 80%          | > 0.8           | Critical | Scale out workers or increase parallelism         |
| Backpressure > 95%          | > 0.95          | Critical | Worker pauses input – immediate action needed     |
| Very late events > 1%       | > 0.01 ratio    | Warning  | Increase allowed_lateness_sec or fix source clocks |
| Checkpoint age > 20s        | > 2x interval   | Critical | Check checkpoint store (S3) availability          |
| Worker heartbeat > 30s      | > 30s           | Critical | Worker is dead, coordinator will rebalance        |
| Heap growth > 0.1%/interval | deriv > 10000   | Warning  | Possible memory leak, investigate with GC stats   |

Structured Logging

All Ripple log entries are machine-parseable s-expressions:

type entry =
  { level : level          (* Debug | Info | Warn | Error *)
  ; timestamp_ns : Int64.t
  ; worker_id : string option
  ; component : string     (* "graph", "transport", "checkpoint", etc. *)
  ; message : string
  ; context : (string * string) list
  }
[@@deriving sexp]

Example log output:

((level Info)(timestamp_ns 1709000000000000000)(worker_id (w0))(component graph)(message stabilized)(context ((nodes 4001)(recomputed 3)(dirty 0))))
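Individual fields can be pulled out of an entry without a full sexp parser; a quick sketch against the example line above:

```shell
line='((level Info)(timestamp_ns 1709000000000000000)(worker_id (w0))(component graph)(message stabilized)(context ((nodes 4001)(recomputed 3)(dirty 0))))'
# Pull a field out of the flat part of the entry
level=$(printf '%s' "$line" | grep -o '(level [A-Za-z]*' | awk '{print $2}')
# Pull a counter out of the nested context
recomputed=$(printf '%s' "$line" | grep -o '(recomputed [0-9]*' | awk '{print $2}')
echo "$level $recomputed"
```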

Log Levels

| Level | Usage                                               |
|-------|-----------------------------------------------------|
| Debug | Per-stabilization diagnostics, heartbeat details    |
| Info  | Worker lifecycle transitions, checkpoint writes     |
| Warn  | Late events, slow stabilization, clock skew         |
| Error | Schema mismatch, checkpoint failure, crash recovery |

Parsing Logs

Because logs are s-expressions, they can be parsed with any sexp library or with standard Unix tools:

# Extract all Error-level entries
grep '(level Error)' ripple.log | sexp query '(field message)'

# Count stabilizations per worker
grep '(message stabilized)' ripple.log | grep -o '(worker_id [^)]*' | sort | uniq -c

In production, ship logs to a centralized system (ELK, Loki, Datadog) and parse the sexp structure for indexing and alerting.