Incremental Computation

Incremental computation is the foundational idea behind Ripple. This page explains what it is, why it makes Ripple fundamentally faster than re-computation-based systems, and how it compares to existing stream processing frameworks.

The Core Insight

Most data processing systems react to a change by recomputing everything from scratch. If you have 10,000 symbols and one price changes, a naive system recomputes all 10,000 VWAPs.

Incremental computation takes a different approach: recompute only what is affected by the change. If one price changes, recompute that symbol’s VWAP and any downstream aggregation – nothing else.

This is the difference between O(N) and O(R), where R is the number of nodes reachable from the changed input. In the typical case of a single trade arriving, R = 3 (leaf + map + fold), regardless of whether N is 1,000 or 100,000.
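A toy sketch makes the contrast concrete. This is illustrative Python, not Ripple's API; the symbol names and the `vwap` helper are invented for the example:

```python
# Illustrative sketch (not Ripple's API): a new trade for one symbol
# touches O(R) state instead of recomputing all N symbols.

trades = {
    "AAPL": [(150.0, 100), (151.0, 200)],  # (price, quantity) fills
    "GOOG": [(2800.0, 50)],
}

def vwap(fills):
    """Volume-weighted average price over a list of fills."""
    notional = sum(p * q for p, q in fills)
    volume = sum(q for _, q in fills)
    return notional / volume

# Full recompute: O(N) work on every change.
vwaps = {sym: vwap(fills) for sym, fills in trades.items()}

# Incremental: a new AAPL trade updates only AAPL's VWAP
# and the downstream aggregate -- two nodes, not N.
trades["AAPL"].append((152.0, 300))
vwaps["AAPL"] = vwap(trades["AAPL"])
portfolio_total = sum(vwaps.values())
```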

Self-Adjusting Computation

Ripple is built on the theory of self-adjusting computation, originally developed by Umut Acar and later implemented in Jane Street’s Incremental library. The key properties:

  1. Memoization of intermediate results. Every node in the computation graph caches its last computed value. A node is only recomputed if at least one of its inputs has changed.

  2. Change propagation, not re-execution. When an input changes, the system identifies the minimal set of nodes that need updating and processes them in topological order.

  3. Cutoff optimization. Even when a node is recomputed, if its new value equals its old value (by the user-provided equality function), propagation stops. This prevents phantom updates from rippling through the graph.
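The three properties can be sketched in a few dozen lines. This is a hypothetical `Node` type for illustration, not Ripple's internals; real self-adjusting systems schedule by height rather than taking a precomputed topological order:

```python
# Sketch of memoization, change propagation, and cutoff.
# Assumes nodes are passed to stabilize in topological order.

class Node:
    def __init__(self, compute, inputs=(), equal=lambda a, b: a == b):
        self.compute, self.inputs, self.equal = compute, list(inputs), equal
        self.dependents, self.value, self.dirty = [], None, True
        for i in self.inputs:
            i.dependents.append(self)

    def recompute(self):
        new = self.compute(*(i.value for i in self.inputs))
        changed = self.value is None or not self.equal(new, self.value)
        self.value, self.dirty = new, False
        if changed:                 # cutoff: unchanged value stops propagation
            for d in self.dependents:
                d.dirty = True
        return changed

def stabilize(nodes_in_topo_order):
    recomputed = []
    for n in nodes_in_topo_order:
        if n.dirty:                 # memoization: clean nodes are skipped
            n.recompute()
            recomputed.append(n)
    return recomputed

# Usage: price -> rounded -> doubled.
price = Node(lambda: 150.4)
rounded = Node(lambda p: round(p), [price])
doubled = Node(lambda r: 2 * r, [rounded])
stabilize([price, rounded, doubled])

price.compute = lambda: 150.2       # the set_leaf equivalent
price.dirty = True
touched = stabilize([price, rounded, doubled])
# rounded recomputes to 150 again, so cutoff fires: doubled is never touched.
```

The cutoff in `recompute` is what prevents phantom updates: `rounded` changes its input but not its output, so `doubled` stays clean.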

Comparison with Stream Processing Frameworks

Apache Flink

Flink processes events through a DAG of operators. Each operator receives an event, processes it, and emits results downstream. This is a push-based dataflow model.

Aspect          Flink                                    Ripple
Model           Push-based dataflow                      Self-adjusting computation
Recomputation   Per-event through full operator chain    Only affected subgraph
State           Per-operator keyed state (RocksDB)       Node values in array (in-memory)
Checkpointing   Chandy-Lamport barriers                  Snapshot leaf values + input offsets
Language        Java/Scala                               OCaml
Latency floor   Network + serialization overhead         Array index + function call

Flink’s strength is horizontal scalability across commodity clusters. Ripple’s strength is per-node efficiency – sub-microsecond stabilization for the common case.

Kafka Streams

Kafka Streams is a client library (no separate cluster) that processes records from Kafka topics. It uses a topology of processors connected by internal topics.

Aspect          Kafka Streams                          Ripple
Deployment      Embedded in application                Dedicated worker processes
State           RocksDB state stores                   In-memory incremental graph
Windowing       Time-based with wall-clock triggers    Event-time with watermarks
Exactly-once    Kafka transactions                     Idempotent processing + dedup
Recomputation   Full reprocess on state restore        Checkpoint leaf values, recompute derived

Materialize / Differential Dataflow

Materialize (built on Timely/Differential Dataflow) is the closest conceptual relative to Ripple. Both maintain materialized views that update incrementally.

Aspect                  Materialize                     Ripple
Foundation              Differential Dataflow (Rust)    Incremental (OCaml)
Interface               SQL                             OCaml module signatures
Change representation   (data, time, diff) triples      Delta type (Set/Patch/Remove)
Deployment              Standalone database             Embeddable library + workers

Ripple deliberately targets the same niche but within the Jane Street OCaml ecosystem, where interop with Core, Async, bin_prot, and ppx_* is essential.

Why O(R) Matters

Consider a VWAP pipeline with 2,000 symbols (the maximum per worker):

Graph structure:
  2000 leaf nodes (one per symbol)
  2000 map nodes (vwap_price extraction)
  1 incr_fold node (portfolio total)
  ─────────────────────────
  4001 total nodes

When a single trade arrives:

Approach                Nodes touched    Time (measured)
Full recompute          4,001            ~27 µs
Incremental (Ripple)    3                ~250 ns

The incremental approach is roughly 100x faster for the single-change case, and the difference compounds: at 100K events/sec, full recomputation needs 2.7 seconds of compute per wall-clock second (it cannot keep up), while the incremental approach needs 25 milliseconds.
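The compounding claim is a direct back-of-envelope calculation from the per-event times in the table above (the absolute figures are hardware-dependent; only the arithmetic is shown here):

```python
# Compute budget per wall-clock second at a given event rate,
# using the measured per-event costs from the table above.

events_per_sec = 100_000
full_recompute_per_event = 27e-6     # ~27 µs: touch all 4,001 nodes
incremental_per_event = 250e-9       # ~250 ns: touch 3 nodes

busy_full = events_per_sec * full_recompute_per_event   # 2.7 s/s: overloaded
busy_incr = events_per_sec * incremental_per_event      # 0.025 s/s: 2.5% busy
```

A load factor above 1.0 means the system falls behind unboundedly; the incremental version sits at 2.5% utilization on the same stream.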

The Stabilization Cycle

Every change in Ripple follows the same pattern:

1. set_leaf(var, new_value)    -- marks leaf dirty, pushes to heap
2. stabilize(graph)            -- processes dirty heap in height order
3. watch(graph, node)          -- reads current value (O(1) array lookup)

Between steps 1 and 2, you can set multiple leaves. Stabilization processes all of them in a single pass, which is more efficient than stabilizing after each individual change.

set_leaf(price_AAPL, 150.0)    -- AAPL leaf -> dirty
set_leaf(price_GOOG, 2800.0)   -- GOOG leaf -> dirty
set_leaf(price_MSFT, 420.0)    -- MSFT leaf -> dirty
stabilize(g)                    -- processes 3 leaves + their dependents
                                -- in one pass, height-ordered

This batching property is exploited by the VWAP demo, which processes 1,000 trades before stabilizing.
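The batching behavior can be sketched with a recompute heap keyed by node height. The `G` class below is a hypothetical stand-in, not Ripple's implementation; it exists to show why three `set_leaf` calls followed by one `stabilize` recompute a shared dependent only once:

```python
# Sketch of batched stabilization: dirty nodes go into a heap ordered
# by height, and each node is recomputed at most once per stabilize.
import heapq

class G:
    def __init__(self):
        self.values, self.height = {}, {}
        self.deps, self.fns = {}, {}
        self.heap, self.in_heap = [], set()

    def leaf(self, name, v):
        self.values[name], self.height[name] = v, 0
        self.deps.setdefault(name, [])

    def node(self, name, fn, inputs):
        self.fns[name] = (fn, inputs)
        self.height[name] = 1 + max(self.height[i] for i in inputs)
        self.deps.setdefault(name, [])
        for i in inputs:
            self.deps[i].append(name)
        self.values[name] = fn(*(self.values[i] for i in inputs))

    def set_leaf(self, name, v):
        self.values[name] = v
        for d in self.deps[name]:
            self._push(d)                      # mark dependents dirty

    def _push(self, name):
        if name not in self.in_heap:           # dedup: one heap entry per node
            self.in_heap.add(name)
            heapq.heappush(self.heap, (self.height[name], name))

    def stabilize(self):
        touched = 0
        while self.heap:                       # lowest height first
            _, name = heapq.heappop(self.heap)
            self.in_heap.discard(name)
            fn, inputs = self.fns[name]
            new = fn(*(self.values[i] for i in inputs))
            if new != self.values[name]:       # cutoff on unchanged values
                self.values[name] = new
                for d in self.deps[name]:
                    self._push(d)
            touched += 1
        return touched

# Three leaf updates, one stabilize: "total" recomputes once, not three times.
g = G()
for sym, p in [("AAPL", 150.0), ("GOOG", 2800.0), ("MSFT", 420.0)]:
    g.leaf(sym, p)
g.node("total", lambda a, b, c: a + b + c, ["AAPL", "GOOG", "MSFT"])

g.set_leaf("AAPL", 151.0)
g.set_leaf("GOOG", 2801.0)
g.set_leaf("MSFT", 421.0)
touched = g.stabilize()
```

Height ordering guarantees a node is only recomputed after all of its inputs have settled, which is what makes the single-pass batch correct.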

Deterministic Replay

Because the incremental graph is a pure function from inputs to outputs (given a deterministic clock), Ripple supports deterministic replay: load a checkpoint, replay the input log from the checkpoint’s offset, and arrive at exactly the same state as before the crash.

This property requires that no module in Ripple calls Time_ns.now() or Random.int directly. All non-determinism flows through the EFFECT module interface (see Effect Injection).
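The recovery invariant can be demonstrated with a toy pure fold over an input log. The state shape and `apply_event` function are invented for the example; the point is that a checkpoint is just leaf values plus an offset, and replay from that offset reproduces the pre-crash state exactly:

```python
# Sketch of deterministic replay: state is a pure fold over the log,
# so checkpoint + replay-from-offset equals the original run.

def apply_event(state, event):
    sym, price = event
    new = dict(state)          # pure update: no mutation of the old state
    new[sym] = price
    return new

log = [("AAPL", 150.0), ("GOOG", 2800.0), ("AAPL", 151.0), ("MSFT", 420.0)]

# Live run, checkpointing after offset 2.
state = {}
for e in log[:2]:
    state = apply_event(state, e)
checkpoint = (dict(state), 2)  # snapshot leaf values + input offset

for e in log[2:]:
    state = apply_event(state, e)   # state just before the "crash"

# Recovery: load the checkpoint, replay the log from its offset.
restored, offset = dict(checkpoint[0]), checkpoint[1]
for e in log[offset:]:
    restored = apply_event(restored, e)

assert restored == state            # bit-for-bit identical state
```

This only holds because `apply_event` is deterministic; a hidden call to a wall clock or RNG inside the fold would break the equality, which is exactly why Ripple routes all non-determinism through the EFFECT interface.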