Incremental Computation
Incremental computation is the foundational idea behind Ripple. This page explains what it is, why it makes Ripple fundamentally faster than recomputation-based systems, and how it compares to existing stream processing frameworks.
The Core Insight
Most data processing systems react to a change by recomputing everything from scratch. If you have 10,000 symbols and one price changes, a naive system recomputes all 10,000 VWAPs.
Incremental computation takes a different approach: recompute only what is affected by the change. If one price changes, recompute that symbol’s VWAP and any downstream aggregation – nothing else.
This is the difference between O(N) and O(R), where R is the number of nodes reachable from the changed input. In the typical case of a single trade arriving, R = 3 (leaf + map + fold), regardless of whether N is 1,000 or 100,000.
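To make the counts concrete, here is a back-of-envelope sketch in Python (Ripple itself is OCaml; the helper names here are illustrative, not Ripple's API). It assumes the graph shape described later on this page: one leaf and one map node per symbol, plus a single shared fold.

```python
# Count nodes touched per single price change, under the assumed graph shape:
# one leaf + one map per symbol, and one fold shared by all symbols.

N_SYMBOLS = 10_000

def nodes_touched_naive(n_symbols: int) -> int:
    # Full recompute: every leaf, every map, and the fold.
    return 2 * n_symbols + 1

def nodes_touched_incremental() -> int:
    # Only the changed leaf, its map node, and the fold that depends on it.
    return 3  # R = 3, independent of n_symbols

print(nodes_touched_naive(N_SYMBOLS))   # grows with N
print(nodes_touched_incremental())      # constant
```

The point is simply that R is fixed by the graph's shape around the changed input, not by the total number of symbols.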
Self-Adjusting Computation
Ripple is built on the theory of self-adjusting computation, originally developed by Umut Acar and later implemented in Jane Street’s Incremental library. The key properties:
- **Memoization of intermediate results.** Every node in the computation graph caches its last computed value. A node is recomputed only if at least one of its inputs has changed.
- **Change propagation, not re-execution.** When an input changes, the system identifies the minimal set of nodes that need updating and processes them in topological order.
- **Cutoff optimization.** Even when a node is recomputed, if its new value equals its old value (by the user-provided equality function), propagation stops. This prevents phantom updates from rippling through the graph.
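The three properties above can be sketched in a toy engine. This is a minimal Python sketch, not Ripple's OCaml implementation: all class and method names are hypothetical, the "heap" is a re-sorted list, and equality cutoff uses `!=` in place of a user-provided function.

```python
# Toy self-adjusting computation: cached values, height-ordered propagation,
# and an equality cutoff that stops propagation when a value is unchanged.

class Node:
    def __init__(self, fn, inputs=(), value=None):
        self.fn, self.inputs, self.value = fn, tuple(inputs), value
        self.height = 1 + max((i.height for i in self.inputs), default=0)
        self.dependents = []
        for i in self.inputs:
            i.dependents.append(self)

class Graph:
    def __init__(self):
        self.dirty = []       # stand-in for the dirty min-heap
        self.recomputed = 0   # instrumentation for this example

    def set_leaf(self, leaf, value):
        if leaf.value != value:          # cutoff at the leaf itself
            leaf.value = value
            self.dirty.extend(leaf.dependents)

    def stabilize(self):
        seen = set()
        self.dirty.sort(key=lambda n: n.height)   # height order
        while self.dirty:
            node = self.dirty.pop(0)
            if id(node) in seen:
                continue                 # memoized: already recomputed this pass
            seen.add(id(node))
            self.recomputed += 1
            new = node.fn(*(i.value for i in node.inputs))
            if new != node.value:        # cutoff: unchanged => stop propagating
                node.value = new
                self.dirty.extend(node.dependents)
                self.dirty.sort(key=lambda n: n.height)

# 1000 leaves -> 1000 map nodes -> 1 fold, mirroring the VWAP shape.
leaves = [Node(None, value=100.0) for _ in range(1000)]
doubled = [Node(lambda x: 2 * x, [l]) for l in leaves]
total = Node(lambda *xs: sum(xs), doubled)

g = Graph()
for l in leaves:                  # initial pass computes everything once
    g.dirty.extend(l.dependents)
g.stabilize()

g.recomputed = 0
g.set_leaf(leaves[0], 101.0)      # one change...
g.stabilize()
print(g.recomputed)               # ...only 2 recomputes (map + fold);
                                  # with the leaf write, that is the R = 3 case
```

The second stabilization recomputes two nodes regardless of how many leaves exist, which is the O(R) behavior the text describes.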
Comparison with Stream Processing Frameworks
Apache Flink
Flink processes events through a DAG of operators. Each operator receives an event, processes it, and emits results downstream. This is a push-based dataflow model.
| Aspect | Flink | Ripple |
|---|---|---|
| Model | Push-based dataflow | Self-adjusting computation |
| Recomputation | Per-event through full operator chain | Only affected subgraph |
| State | Per-operator keyed state (RocksDB) | Node values in array (in-memory) |
| Checkpointing | Chandy-Lamport barriers | Snapshot leaf values + input offsets |
| Language | Java/Scala | OCaml |
| Latency floor | Network + serialization overhead | Array index + function call |
Flink’s strength is horizontal scalability across commodity clusters. Ripple’s strength is per-node efficiency – sub-microsecond stabilization for the common case.
Kafka Streams
Kafka Streams is a client library (no separate cluster) that processes records from Kafka topics. It uses a topology of processors connected by internal topics.
| Aspect | Kafka Streams | Ripple |
|---|---|---|
| Deployment | Embedded in application | Dedicated worker processes |
| State | RocksDB state stores | In-memory incremental graph |
| Windowing | Time-based with wall-clock triggers | Event-time with watermarks |
| Exactly-once | Kafka transactions | Idempotent processing + dedup |
| Recomputation | Full reprocess on state restore | Checkpoint leaf values, recompute derived |
Materialize / Differential Dataflow
Materialize (built on Timely/Differential Dataflow) is the closest conceptual relative to Ripple. Both maintain materialized views that update incrementally.
| Aspect | Materialize | Ripple |
|---|---|---|
| Foundation | Differential Dataflow (Rust) | Incremental (OCaml) |
| Interface | SQL | OCaml module signatures |
| Change representation | (data, time, diff) triples | Delta type (Set/Patch/Remove) |
| Deployment | Standalone database | Embeddable library + workers |
Ripple deliberately targets the same niche but within the Jane Street OCaml ecosystem, where interop with Core, Async, bin_prot, and ppx_* is essential.
Why O(R) Matters
Consider a VWAP pipeline with 2,000 symbols (the maximum per worker):
Graph structure:

```
2000 leaf nodes      (one per symbol)
2000 map nodes       (vwap_price extraction)
   1 incr_fold node  (portfolio total)
─────────────────────────────────────
4001 total nodes
```
When a single trade arrives:
| Approach | Nodes touched | Time (measured) |
|---|---|---|
| Full recompute | 4,001 | ~27 µs |
| Incremental (Ripple) | 3 | ~250 ns |
The incremental approach is roughly 100x faster for the single-change case. This difference compounds: at 100K events/sec, the full-recompute approach spends 2.7 seconds per second on computation alone, while the incremental approach spends 25 milliseconds.
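The compounding claim is pure arithmetic, using the measured per-event times from the table above:

```python
# Compute budget per wall-clock second at 100K events/sec,
# using the measured ~27 µs (full recompute) and ~250 ns (incremental) figures.

EVENTS_PER_SEC = 100_000
FULL_RECOMPUTE_S = 27e-6    # ~27 µs per event
INCREMENTAL_S = 250e-9      # ~250 ns per event

full_budget = EVENTS_PER_SEC * FULL_RECOMPUTE_S   # ≈ 2.7 s of compute/sec
incr_budget = EVENTS_PER_SEC * INCREMENTAL_S      # ≈ 0.025 s (25 ms)/sec

print(full_budget, incr_budget)
```

A budget above 1 second of compute per second of wall clock means the full-recompute approach cannot keep up at this rate on a single core; the incremental approach uses 2.5% of one core.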
The Stabilization Cycle
Every change in Ripple follows the same pattern:
```
1. set_leaf(var, new_value)   -- marks leaf dirty, pushes to heap
2. stabilize(graph)           -- processes dirty heap in height order
3. watch(graph, node)         -- reads current value (O(1) array lookup)
```
Between steps 1 and 2, you can set multiple leaves. Stabilization processes all of them in a single pass, which is more efficient than stabilizing after each individual change.
```
set_leaf(price_AAPL, 150.0)   -- AAPL leaf -> dirty
set_leaf(price_GOOG, 2800.0)  -- GOOG leaf -> dirty
set_leaf(price_MSFT, 420.0)   -- MSFT leaf -> dirty
stabilize(g)                  -- processes 3 leaves + their dependents
                              -- in one pass, height-ordered
```
This batching property is exploited by the VWAP demo, which processes 1,000 trades before stabilizing.
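The batching win can be isolated in a few lines of Python (hypothetical names, not Ripple's API): marking many leaves dirty before one stabilization recomputes a shared downstream node once, rather than once per change.

```python
# Batching: N leaf updates before one stabilize() cost one fold recompute.

class Sum:
    """A single fold node over n leaves, recomputed only at stabilize()."""
    def __init__(self, n):
        self.leaves = [0.0] * n
        self.dirty = set()
        self.value = 0.0
        self.recomputes = 0

    def set_leaf(self, i, v):
        self.leaves[i] = v
        self.dirty.add(i)        # mark dirty; no recompute yet

    def stabilize(self):
        if self.dirty:           # all pending changes, one pass
            self.value = sum(self.leaves)
            self.recomputes += 1
            self.dirty.clear()

g = Sum(1000)
for i in range(1000):            # 1000 changes before stabilizing...
    g.set_leaf(i, float(i))
g.stabilize()
print(g.recomputes)              # ...cost 1 fold recompute, not 1000
```

Stabilizing after every individual `set_leaf` would instead recompute the fold 1,000 times, which is exactly the cost the VWAP demo's batch-of-1,000 avoids.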
Deterministic Replay
Because the incremental graph is a pure function from inputs to outputs (given a deterministic clock), Ripple supports deterministic replay: load a checkpoint, replay the input log from the checkpoint’s offset, and arrive at exactly the same state as before the crash.
This property requires that no module in Ripple calls Time_ns.now() or Random.int directly. All non-determinism flows through the EFFECT module interface (see Effect Injection).
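The replay guarantee can be illustrated with a small Python sketch (illustrative names only; Ripple's real mechanism is the OCaml EFFECT module interface). Because the clock is injected rather than read from the environment, two runs over the same log from the same starting state produce identical results.

```python
# Deterministic replay: all non-determinism (here, a clock) is injected,
# so folding the same input log twice yields bit-identical state.

class ReplayClock:
    """Deterministic clock: yields recorded timestamps in order."""
    def __init__(self, timestamps):
        self._ts = iter(timestamps)
    def __call__(self):
        return next(self._ts)

def run(events, clock):
    """Fold an event log into state, timestamping via the injected clock."""
    state = {"last_ts": 0, "total": 0.0}
    for price in events:
        state = {"last_ts": clock(), "total": state["total"] + price}
    return state

log = [150.0, 2800.0, 420.0]     # input log replayed from a checkpoint offset
recorded_ts = [1, 2, 3]          # timestamps captured on the original run

a = run(log, ReplayClock(recorded_ts))   # original run
b = run(log, ReplayClock(recorded_ts))   # replay after "crash"
print(a == b)                            # identical state both times
```

Had `run` called a wall-clock function directly, the `last_ts` fields would differ between runs and the replayed state would diverge from the pre-crash state.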