# Why It’s Fast
Ripple achieves sub-microsecond stabilization through three key optimizations: heap-based dirty propagation, incremental fold, and cutoff. This page explains each optimization, the memory layout that supports them, and the zero-allocation discipline on the hot path.
## The Three Key Optimizations

### 1. Heap-Based Dirty Propagation
Problem: When a leaf changes, which nodes need recomputation?
Naive approach (dirty flags): Scan all N nodes, check if dirty, recompute if so. This is O(N) per stabilization regardless of how many nodes actually changed.
Ripple’s approach (min-heap): Push dirty nodes onto a min-heap keyed by (height, node_id). Pop nodes in order and recompute. This is O(R log R) where R is the number of dirty nodes.
Benchmark B-05 proved this matters:
- Linear scan at 10K nodes: ~27 µs per stabilization
- Heap-based at 10K nodes: ~250 ns per stabilization
- Speedup: ~100x
The heap gives O(R log R), but in the common case R is very small (1 leaf + 2-3 dependents), so the practical complexity is O(1) with a small constant.
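The height-ordered pop can be sketched with a small array-backed binary min-heap keyed by (height, node_id). This is an illustrative standalone sketch, not Ripple's actual `Dirty_heap` module; OCaml's polymorphic comparison orders the pairs lexicographically, so lower heights always pop first.

```ocaml
(* Minimal min-heap over (height, node_id) keys. Illustrative only. *)
type heap = {
  mutable entries : (int * int) array;  (* (height, node_id) pairs *)
  mutable size : int;
}

let create cap = { entries = Array.make cap (0, 0); size = 0 }

let swap h i j =
  let t = h.entries.(i) in
  h.entries.(i) <- h.entries.(j);
  h.entries.(j) <- t

let push h key =
  h.entries.(h.size) <- key;
  h.size <- h.size + 1;
  (* Sift up: restore the min-heap property along the insertion path. *)
  let i = ref (h.size - 1) in
  while !i > 0 && h.entries.((!i - 1) / 2) > h.entries.(!i) do
    swap h !i ((!i - 1) / 2);
    i := (!i - 1) / 2
  done

(* Precondition: size > 0. *)
let pop_min h =
  let min = h.entries.(0) in
  h.size <- h.size - 1;
  h.entries.(0) <- h.entries.(h.size);
  (* Sift down: push the moved entry below any smaller child. *)
  let i = ref 0 and continue = ref true in
  while !continue do
    let l = (2 * !i) + 1 and r = (2 * !i) + 2 in
    let s = ref !i in
    if l < h.size && h.entries.(l) < h.entries.(!s) then s := l;
    if r < h.size && h.entries.(r) < h.entries.(!s) then s := r;
    if !s <> !i then (swap h !i !s; i := !s) else continue := false
  done;
  min

let () =
  let h = create 16 in
  (* Mark three nodes dirty, out of order. *)
  push h (2, 9);   (* height 2 *)
  push h (0, 3);   (* height 0: a leaf *)
  push h (1, 7);   (* height 1 *)
  (* Nodes come back in height order, so parents are always
     recomputed before their dependents. *)
  assert (pop_min h = (0, 3));
  assert (pop_min h = (1, 7));
  assert (pop_min h = (2, 9))
```

Ordering by height before node_id is what guarantees a node is recomputed only after all of its parents in the same stabilization pass.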
### 2. Incremental Fold (Incr_fold)
Problem: A portfolio VWAP aggregates 2,000 symbol prices. When one price changes, how do we update the sum?
Naive fold: Re-fold all 2,000 prices: sum = fold (+) 0 [p0; p1; ...; p1999]. This is O(N) per stabilization.
Incr_fold: Track which parents changed. For each changed parent, remove the old value and add the new value:
```
acc = old_sum
acc = acc - old_price_AAPL   (remove)
acc = acc + new_price_AAPL   (add)
return acc
```
This is O(changed_parents), which is O(1) for the single-trade case.
Performance comparison (2,000 symbols, 1 change):
- Fold: O(2000), ~5 µs
- Incr_fold: O(1), ~50 ns
- Speedup: 100x
The Incr_fold node maintains a snapshot of all parent values from the previous cycle. During dirty propagation, when a parent of an Incr_fold is about to be marked dirty, the parent’s index is recorded. During recomputation, only those recorded indices are processed.
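The remove-then-add update can be sketched as follows, assuming an Incr_fold-style node that snapshots its parent values. The record fields and function names here are illustrative, not Ripple's actual API.

```ocaml
(* Incremental sum over parent slots. Illustrative sketch. *)
type incr_sum = {
  snapshot : float array;   (* parent values from the previous cycle *)
  mutable acc : float;      (* current fold result *)
}

let create prices = {
  snapshot = Array.copy prices;
  acc = Array.fold_left ( +. ) 0.0 prices;  (* full fold, once *)
}

(* O(1) per changed parent: remove the old value, add the new one. *)
let update t ~index ~new_value =
  t.acc <- t.acc -. t.snapshot.(index) +. new_value;
  t.snapshot.(index) <- new_value

let () =
  let prices = Array.make 2000 100.0 in
  let sum = create prices in
  assert (sum.acc = 200_000.0);
  (* One trade changes one price; the other 1,999 are never re-read. *)
  update sum ~index:42 ~new_value:101.5;
  assert (sum.acc = 200_001.5)
```

Note that repeatedly adding and subtracting floats can drift; a production aggregate would periodically re-fold or use a compensated sum, which is an orthogonal concern to the O(changed_parents) structure shown here.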
### 3. Cutoff Optimization
Problem: A node recomputes but produces the same value as before. Should its dependents recompute?
Without cutoff: Yes – every recomputation propagates to all dependents, creating phantom updates.
With cutoff: No – if the new value equals the old value (by the user-provided equality function), propagation stops.
```
[leaf: raw_price=250]
        |
        v
[map: clamp(100)]   recomputes: min(250, 100) = 100
        |           old value was 100
        |           cutoff: 100 == 100, STOP
        v
[map: format]       NOT recomputed (saved ~100ns + string allocation)
```
Cutoff is especially effective for:
- Threshold/clamp functions
- Boolean predicates (changes in input that don’t flip the predicate)
- Rounding functions
- Any function with a small output domain
In the VWAP pipeline, cutoff prevents propagation when a trade arrives but the VWAP (weighted average) does not change meaningfully due to rounding.
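The cutoff decision itself is a single comparison against the previous value. A minimal sketch, with illustrative names rather than Ripple's actual node type:

```ocaml
(* A node carrying its current value and a user equality function. *)
type 'a node = {
  mutable value : 'a;
  equal : 'a -> 'a -> bool;
}

(* Returns [true] iff dependents must be recomputed. *)
let recompute node new_value =
  if node.equal node.value new_value then
    false                       (* cutoff: equal, stop propagation *)
  else begin
    node.value <- new_value;    (* changed: store and propagate *)
    true
  end

let () =
  let clamped = { value = 100; equal = Int.equal } in
  (* raw_price jumps 250 -> 300, but min (_, 100) is still 100 ... *)
  assert (not (recompute clamped (min 300 100)));
  (* ... whereas a drop to 80 does change the clamped value. *)
  assert (recompute clamped (min 80 100))
```

Because equality is user-provided, a cutoff can also encode tolerance (e.g. "unchanged within half a tick") rather than strict equality.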
## Memory Layout

### Node Array
Nodes are stored in a contiguous array, not a linked structure:
```
graph.nodes:
┌──────┬──────┬──────┬──────┬──────┬──────┐
│ n[0] │ n[1] │ n[2] │ n[3] │ n[4] │ ...  │
└──────┴──────┴──────┴──────┴──────┴──────┘
   ^                                  ^
   sequential memory           grows by 2x
```
Array access is O(1) by node ID: graph.nodes.(node_id). No hash table lookup, no pointer chase.
### Dirty Heap
The heap is also array-based:
```
dirty_heap.entries:
┌─────────┬─────────┬─────────┬─────────┐
│ (0, n3) │ (1, n7) │ (1, n9) │ unused  │
└─────────┴─────────┴─────────┴─────────┘
  index 1   index 2   index 3   (1-indexed)

dirty_heap.in_heap:
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ F │ F │ F │ T │ F │ F │ F │ T │ F │ T │
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
  n0  n1  n2  n3  n4  n5  n6  n7  n8  n9
```
in_heap provides an O(1) membership check by node_id. This prevents duplicate insertion and is essential for deduplication during dirty propagation.
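The check-before-push pattern can be sketched as follows; the counter stands in for the real heap insertion, and all names are illustrative:

```ocaml
(* O(1) dedup on push via an [in_heap] bitmap indexed by node_id. *)
let in_heap = Array.make 10 false

(* Stand-in for the real heap push; here we just count insertions. *)
let pushes = ref 0

let push_dirty node_id =
  if not in_heap.(node_id) then begin
    in_heap.(node_id) <- true;   (* mark before inserting *)
    incr pushes                  (* real code: Dirty_heap.push ... *)
  end                            (* already queued: skip duplicate *)

let () =
  push_dirty 3;
  push_dirty 7;
  push_dirty 3;   (* duplicate: ignored *)
  assert (!pushes = 2)
```

On pop, the real implementation would clear the node's in_heap flag so it can be marked dirty again in a later stabilization.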
### Cache Friendliness
With 2,000 nodes at ~200 bytes each:
- Node array: 400 KB (fits in L2 cache on most CPUs)
- Dirty heap: ~80 KB (fits in L1 cache)
- Total working set: < 500 KB
During stabilization, the access pattern is:
1. Pop from heap (sequential access to heap array)
2. Read node from node array (indexed access, likely cached)
3. Read parent values from node array (indexed access)
4. Write new value to node array
5. Push dependents to heap
Steps 2-4 access a small number of nodes (typically 3-5) that are likely in the same cache line or adjacent lines. There is no pointer chasing through a linked data structure.
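The indexed access in steps 2-4 can be sketched as follows. The node record and recompute helper are illustrative, not Ripple's actual types; the point is that parents are integer ids into the same contiguous array, not pointers into a linked structure:

```ocaml
(* Nodes in one contiguous array; parents are indices, not pointers. *)
type node = {
  mutable value : float;
  parents : int array;   (* ids into [nodes] *)
}

let nodes = [|
  { value = 250.0; parents = [||] };        (* n0: leaf *)
  { value = 0.0;   parents = [| 0 |] };     (* n1: map over n0 *)
  { value = 0.0;   parents = [| 0; 1 |] };  (* n2: combines n0, n1 *)
|]

(* Recompute one node: gather parent values by index, apply f, write
   the result back into the same array. No pointer chasing. *)
let recompute id f =
  let args = Array.map (fun p -> nodes.(p).value) nodes.(id).parents in
  nodes.(id).value <- f args

let () =
  recompute 1 (fun a -> a.(0) *. 0.5);
  recompute 2 (fun a -> a.(0) +. a.(1));
  assert (nodes.(2).value = 375.0)
```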
## Zero Allocation on the Hot Path
The stabilization loop allocates nothing during the common case:
| Operation | Allocation? | Why |
|---|---|---|
| Dirty_heap.pop_min | No | Returns struct from pre-allocated array |
| Dirty_heap.push | No | Writes to pre-allocated array slot |
| Node recompute (Map) | Depends on f | User function may allocate |
| Cutoff check | No | Calls user equality function |
| Set dependent dirty | No | Sets boolean flag + heap push |
The heap is pre-allocated to capacity. The node array is pre-allocated and grows only at construction time. The in_heap array is pre-allocated.
Allocation can occur in user-provided functions (f in Map, add/remove in Incr_fold). For numeric work, the OCaml native compiler typically keeps intermediate float arithmetic unboxed within a function, so purely numeric update functions allocate nothing. String operations or record construction do allocate, but OCaml minor-heap allocation is a cheap pointer bump and such short-lived values are reclaimed quickly by the minor GC.
### Measuring Hot Path Allocations
Use OCaml’s Gc module to verify zero allocation (Gc.quick_stat is preferable to Gc.stat here, since Gc.stat walks the whole heap):

```ocaml
let before = Gc.quick_stat () in
let _ = Graph.stabilize g in
let after = Gc.quick_stat () in
let allocs = after.Gc.minor_words -. before.Gc.minor_words in
Printf.printf "words allocated: %.0f\n" allocs
(* For numeric-only stabilization: 0 *)
```
## Performance Summary
| Operation | Time | Complexity |
|---|---|---|
| set_leaf | ~10 ns | O(1) |
| stabilize (1 change, numeric) | ~250 ns | O(R log R), R typically 3 |
| watch | ~5 ns | O(1) array index |
| Incr_fold update | ~50 ns | O(changed_parents) |
| Cutoff check (float equality) | ~2 ns | O(1) |
| Dirty heap push | ~20 ns | O(log N) |
| Dirty heap pop | ~20 ns | O(log N) |
End-to-end for a single trade through the VWAP pipeline:
- set_leaf: 10 ns
- stabilize (leaf + map + incr_fold): 250 ns
- watch: 5 ns
- Total: ~265 ns per trade
At 265 ns/trade, the theoretical maximum throughput is ~3.8 million trades/sec on a single core.