Monitoring

This page covers how to monitor a Ripple cluster: Prometheus scraping configuration, the full list of pre-registered metrics, recommended alert thresholds, and structured logging.

Prometheus Scraping

Every Ripple worker and coordinator exposes metrics on an HTTP endpoint:

| Component   | Port | Path     | Content-Type              |
|-------------|------|----------|---------------------------|
| Worker      | 9102 | /metrics | text/plain; version=0.0.4 |
| Coordinator | 9202 | /metrics | text/plain; version=0.0.4 |
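In a live cluster, a target can be spot-checked with `curl http://worker-0:9102/metrics`. The payload is the standard Prometheus text exposition format, where each sample line is `name{labels} value`; a parsing sketch against an illustrative sample (the metric value shown is made up):

```shell
# A sample of the text exposition format served at /metrics
sample='# TYPE ripple_graph_node_count gauge
ripple_graph_node_count{worker="w0"} 4001'
# Extract the value from a sample line ("name{labels} value")
nodes=$(printf '%s\n' "$sample" | grep '^ripple_graph_node_count' | awk '{print $2}')
echo "$nodes"
```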

Kubernetes Auto-Discovery

Workers and coordinators have Prometheus annotations in their pod templates:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9102"
    prometheus.io/path: "/metrics"

With Prometheus Kubernetes service discovery, no additional scrape configuration is needed. Prometheus will automatically discover and scrape all Ripple pods.
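For reference, a typical relabeling block that honors these annotations under pod-role service discovery. This is the standard Prometheus pattern, not Ripple-specific configuration; adjust the job name and any filtering to your setup:

```yaml
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods that opt in via the annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Honor a custom metrics path if annotated
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # Rewrite the target address to the annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
```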

Manual Scrape Configuration

If not using Kubernetes service discovery:

scrape_configs:
  - job_name: 'ripple-workers'
    static_configs:
      - targets:
        - 'worker-0:9102'
        - 'worker-1:9102'
        - 'worker-2:9102'
    scrape_interval: 5s

  - job_name: 'ripple-coordinator'
    static_configs:
      - targets:
        - 'coordinator-0:9202'
    scrape_interval: 10s

Pre-Registered Metrics

Graph Engine Metrics

| Metric                            | Type      | Labels | Description                                   |
|-----------------------------------|-----------|--------|-----------------------------------------------|
| ripple_graph_stabilizations_total | Counter   | worker | Total stabilization cycles executed           |
| ripple_graph_cutoff_hits_total    | Counter   | worker | Times cutoff prevented unnecessary propagation |
| ripple_graph_node_count           | Gauge     | worker | Total nodes in the computation graph          |
| ripple_graph_dirty_count          | Gauge     | worker | Nodes currently marked dirty                  |
| ripple_graph_stabilization_ns     | Histogram | worker | Stabilization cycle duration (nanoseconds)    |
| ripple_graph_recompute_count      | Histogram | worker | Nodes recomputed per stabilization            |

Key metric: ripple_graph_stabilization_ns. The p99 of this histogram should stay below 10,000,000 ns (10 ms). If it exceeds this, the worker is processing too many symbols per partition.
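On a dashboard, the same p99 can be charted per worker. Note that Prometheus client libraries expose a histogram as `_bucket` time series with an `le` label, so the quantile is taken over bucket rates (a sketch, assuming that standard naming):

```promql
# p99 stabilization latency per worker, in nanoseconds
histogram_quantile(
  0.99,
  sum by (worker, le) (rate(ripple_graph_stabilization_ns_bucket[5m]))
)
```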

Transport Metrics

| Metric                                 | Type      | Labels         | Description                           |
|----------------------------------------|-----------|----------------|---------------------------------------|
| ripple_transport_deltas_sent_total     | Counter   | worker, output | Deltas sent to downstream workers     |
| ripple_transport_deltas_received_total | Counter   | worker, source | Deltas received from upstream workers |
| ripple_transport_delta_bytes_total     | Counter   | worker         | Total bytes transmitted as deltas     |
| ripple_transport_rpc_latency_ns        | Histogram | worker         | RPC round-trip latency (nanoseconds)  |
| ripple_transport_backpressure_level    | Gauge     | worker         | Current backpressure (0.0 to 1.0)     |

Key metric: ripple_transport_backpressure_level. Values above 0.8 trigger a Critical alert. Sustained values above 0.95 cause the worker to pause input consumption.
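Two dashboard query sketches for spotting trouble before the alert fires:

```promql
# Current backpressure per worker
max by (worker) (ripple_transport_backpressure_level)

# Change over the last 5 minutes; a steady climb predicts saturation
delta(ripple_transport_backpressure_level[5m])
```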

Window Metrics

| Metric                               | Type    | Labels | Description                                  |
|--------------------------------------|---------|--------|----------------------------------------------|
| ripple_window_active_windows         | Gauge   | worker | Number of open (not yet triggered) windows   |
| ripple_window_late_events_total      | Counter | worker | Mildly late events (within allowed lateness) |
| ripple_window_very_late_events_total | Counter | worker | Very late events (dropped)                   |
| ripple_window_retractions_total      | Counter | worker | Window result retractions issued             |

Key metric: ripple_window_very_late_events_total. A non-zero rate indicates either clock skew in the data source or insufficient allowed lateness.
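Query sketches for this metric. The second is one possible diagnostic ratio (not the alert expression used below): the fraction of late arrivals that were dropped rather than merged:

```promql
# Very-late (dropped) events per second, per worker
sum by (worker) (rate(ripple_window_very_late_events_total[5m]))

# Fraction of late arrivals that were dropped rather than merged
rate(ripple_window_very_late_events_total[5m])
  / (rate(ripple_window_late_events_total[5m])
     + rate(ripple_window_very_late_events_total[5m]))
```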

System Metrics

| Metric                     | Type      | Labels | Description                      |
|----------------------------|-----------|--------|----------------------------------|
| ripple_system_event_lag_ns | Histogram | worker | Event time lag (now - event_time) |
| ripple_system_heap_words   | Gauge     | worker | OCaml GC heap size in words      |

Key metric: ripple_system_heap_words. If this grows continuously, there is a memory leak. A healthy worker should show < 0.1% heap growth per checkpoint interval.
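The growth rate can be charted directly; the relative form below is a sketch for comparing against the 0.1%-per-interval guideline:

```promql
# Absolute heap growth, in words per second
deriv(ripple_system_heap_words[10m])

# Relative growth rate (fraction of the current heap, per second)
deriv(ripple_system_heap_words[10m]) / ripple_system_heap_words
```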

Alert Thresholds

groups:
  - name: ripple
    rules:
      - alert: StabilizationSlow
        expr: histogram_quantile(0.99, rate(ripple_graph_stabilization_ns_bucket[5m])) > 10000000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Stabilization p99 > 10ms on {{ $labels.worker }}"

      - alert: BackpressureHigh
        expr: ripple_transport_backpressure_level > 0.8
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Backpressure at {{ $value | humanizePercentage }} on {{ $labels.worker }}"

      - alert: LateEventsExcessive
        expr: >
          rate(ripple_window_very_late_events_total[5m])
          / rate(ripple_graph_stabilizations_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Very late events exceed 1% on {{ $labels.worker }}"

      - alert: HeapGrowth
        expr: >
          deriv(ripple_system_heap_words[10m]) > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Continuous heap growth on {{ $labels.worker }}"

      - alert: WorkerDown
        expr: up{job="ripple-workers"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Worker {{ $labels.instance }} is down"

      - alert: NoStabilizations
        expr: >
          rate(ripple_graph_stabilizations_total[5m]) == 0
          and on (instance) up{job="ripple-workers"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Worker {{ $labels.worker }} has not stabilized in 5 minutes"

Threshold Summary

| Condition                   | Threshold       | Severity | Action                                            |
|-----------------------------|-----------------|----------|---------------------------------------------------|
| Stabilization p99 > 10ms    | > 10_000_000 ns | Warning  | Check partition size, consider splitting          |
| Backpressure > 80%          | > 0.8           | Critical | Scale out workers or increase parallelism         |
| Backpressure > 95%          | > 0.95          | Critical | Worker pauses input – immediate action needed     |
| Very late events > 1%       | > 0.01 ratio    | Warning  | Increase allowed_lateness_sec or fix source clocks |
| Checkpoint age > 20s        | > 2x interval   | Critical | Check checkpoint store (S3) availability          |
| Worker heartbeat > 30s      | > 30s           | Critical | Worker is dead, coordinator will rebalance        |
| Heap growth > 0.1%/interval | deriv > 10000   | Warning  | Possible memory leak, investigate with GC stats   |

Structured Logging

All Ripple log entries are machine-parseable s-expressions:

type entry =
  { level : level          (* Debug | Info | Warn | Error *)
  ; timestamp_ns : Int64.t
  ; worker_id : string option
  ; component : string     (* "graph", "transport", "checkpoint", etc. *)
  ; message : string
  ; context : (string * string) list
  }
[@@deriving sexp]

Example log output:

((level Info)(timestamp_ns 1709000000000000000)(worker_id (w0))(component graph)(message stabilized)(context ((nodes 4001)(recomputed 3)(dirty 0))))
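Individual fields can be pulled out of an entry without a full sexp parser; a quick sketch against the example line above:

```shell
line='((level Info)(timestamp_ns 1709000000000000000)(worker_id (w0))(component graph)(message stabilized)(context ((nodes 4001)(recomputed 3)(dirty 0))))'
# Pull a field out of the flat part of the entry
level=$(printf '%s' "$line" | grep -o '(level [A-Za-z]*' | awk '{print $2}')
# Pull a counter out of the nested context
recomputed=$(printf '%s' "$line" | grep -o '(recomputed [0-9]*' | awk '{print $2}')
echo "$level $recomputed"
```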

Log Levels

| Level | Usage                                               |
|-------|-----------------------------------------------------|
| Debug | Per-stabilization diagnostics, heartbeat details    |
| Info  | Worker lifecycle transitions, checkpoint writes     |
| Warn  | Late events, slow stabilization, clock skew         |
| Error | Schema mismatch, checkpoint failure, crash recovery |

Parsing Logs

Because logs are s-expressions, they can be parsed with any sexp library or with standard Unix tools:

# Extract all Error-level entries
grep '(level Error)' ripple.log | sexp query '(field message)'

# Count stabilizations per worker
grep '(message stabilized)' ripple.log | grep -o '(worker_id [^)]*' | sort | uniq -c

In production, ship logs to a centralized system (ELK, Loki, Datadog) and parse the sexp structure for indexing and alerting.