Why It’s Fast

Ripple achieves sub-microsecond stabilization through three key optimizations: heap-based dirty propagation, incremental fold, and cutoff. This page explains each optimization, the memory layout that supports them, and the zero-allocation discipline on the hot path.

The Three Key Optimizations

1. Heap-Based Dirty Propagation

Problem: When a leaf changes, which nodes need recomputation?

Naive approach (dirty flags): Scan all N nodes, check if dirty, recompute if so. This is O(N) per stabilization regardless of how many nodes actually changed.

Ripple’s approach (min-heap): Push dirty nodes onto a min-heap keyed by (height, node_id). Pop nodes in order and recompute. This is O(R log R) where R is the number of dirty nodes.

Benchmark B-05 proved this matters:
  Linear scan at 10K nodes: ~27 µs per stabilization
  Heap-based at 10K nodes:  ~250 ns per stabilization
  Speedup: 100x

The heap gives O(R log R), but in the common case R is very small (1 leaf + 2-3 dependents), so the practical complexity is O(1) with a small constant.
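The propagation loop described above can be sketched in OCaml. The node type and the list-based priority queue below are illustrative stand-ins, not Ripple's actual types (the real heap is array-backed, as shown under "Dirty Heap"); what matters is the pop-by-(height, id) discipline, which guarantees a node recomputes only after all of its parents have:

```ocaml
(* Minimal sketch of heap-based dirty propagation. Hypothetical types;
   Ripple's real heap is an array-backed binary heap. *)
module Sketch = struct
  type node = {
    id : int;
    height : int;                       (* distance from the leaves *)
    mutable value : float;
    compute : float array -> float;     (* recompute from parent values *)
    parents : int array;
    dependents : int array;
  }

  (* Toy priority queue: a sorted, deduplicated list keyed by (height, id). *)
  let pop = function [] -> None | x :: rest -> Some (x, rest)
  let push heap key = List.sort_uniq compare (key :: heap)

  (* Pop dirty nodes in (height, id) order, recompute each, and push its
     dependents. Only the R dirty nodes are touched: O(R log R). *)
  let stabilize nodes dirty =
    let heap = ref (List.fold_left push [] dirty) in
    let recomputed = ref 0 in
    let rec loop () =
      match pop !heap with
      | None -> ()
      | Some ((_h, id), rest) ->
        heap := rest;
        let n = nodes.(id) in
        let inputs = Array.map (fun p -> nodes.(p).value) n.parents in
        n.value <- n.compute inputs;
        incr recomputed;
        Array.iter
          (fun d -> heap := push !heap (nodes.(d).height, d))
          n.dependents;
        loop ()
    in
    loop ();
    !recomputed
end
```

Because nodes are keyed by height, a diamond-shaped graph still recomputes each node at most once per stabilization: both parents of a join node pop before the join itself does.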

2. Incremental Fold (Incr_fold)

Problem: A portfolio VWAP aggregates 2,000 symbol prices. When one price changes, how do we update the sum?

Naive fold: Re-fold all 2,000 prices: sum = fold (+) 0 [p0; p1; ...; p1999]. This is O(N) per stabilization.

Incr_fold: Track which parents changed. For each changed parent, remove the old value and add the new value:

acc = old_sum
acc = acc - old_price_AAPL     (remove)
acc = acc + new_price_AAPL     (add)
return acc

This is O(changed_parents), which is O(1) for the single-trade case.

Performance comparison (2,000 symbols, 1 change):
  Fold:      O(2000) = ~5 µs
  Incr_fold: O(1)    = ~50 ns
  Speedup:   100x

The Incr_fold node maintains a snapshot of all parent values from the previous cycle. During dirty propagation, when a parent of an Incr_fold is about to be marked dirty, the parent’s index is recorded. During recomputation, only those recorded indices are processed.
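The remove/add step above can be written as a small helper. The function and field names here are illustrative, not Ripple's API; `snapshot` plays the role of the previous-cycle parent values and `changed` the recorded indices:

```ocaml
(* Sketch of the Incr_fold delta update (hypothetical helper). For each
   changed parent index, subtract the snapshotted old value, refresh the
   snapshot, and add the new value: O(changed_parents), not O(N). *)
let incr_fold_update ~snapshot ~current ~changed old_acc =
  List.fold_left
    (fun acc i ->
      let acc = acc -. snapshot.(i) in   (* remove old value *)
      snapshot.(i) <- current.(i);       (* refresh snapshot *)
      acc +. current.(i))                (* add new value *)
    old_acc changed
```

With 2,000 symbol prices and a single trade, `changed` has one element, so the sum updates with one subtraction and one addition regardless of portfolio size.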

3. Cutoff Optimization

Problem: A node recomputes but produces the same value as before. Should its dependents recompute?

Without cutoff: Yes – every recomputation propagates to all dependents, creating phantom updates.

With cutoff: No – if the new value equals the old value (by the user-provided equality function), propagation stops.

[leaf: raw_price=250]
       |
       v
[map: clamp(100)]  recomputes: min(250, 100) = 100
       |            old value was 100
       |            cutoff: 100 == 100, STOP
       v
[map: format]       NOT recomputed (saved ~100ns + string allocation)

Cutoff is especially effective for:

  • Threshold/clamp functions
  • Boolean predicates (changes in input that don’t flip the predicate)
  • Rounding functions
  • Any function with a small output domain

In the VWAP pipeline, cutoff prevents propagation when a trade arrives but the VWAP (weighted average) does not change meaningfully due to rounding.
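The cutoff decision itself is a single comparison against the previous value. The sketch below is illustrative (the names are not Ripple's exact code), using the clamp example from the diagram above:

```ocaml
(* Sketch of the cutoff check. [cutoff] is the user-provided equality;
   dependents are only marked dirty when the value actually changed. *)
let recompute_with_cutoff ~cutoff ~old_value ~new_value
    ~mark_dependents_dirty =
  if cutoff old_value new_value then
    `Cutoff                      (* same value: propagation stops here *)
  else begin
    mark_dependents_dirty ();    (* value changed: dependents recompute *)
    `Propagated
  end

(* A clamp node: raw prices 250.0 and 999.0 both clamp to 100.0, so the
   second update is cut off and downstream nodes never recompute. *)
let clamp x = Float.min x 100.0
```

This is why functions with small output domains benefit most: the more raw inputs that map to the same output, the more phantom updates the single equality check eliminates.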

Memory Layout

Node Array

Nodes are stored in a contiguous array, not a linked structure:

graph.nodes:
  ┌──────┬──────┬──────┬──────┬──────┬──────┐
  │ n[0] │ n[1] │ n[2] │ n[3] │ n[4] │ ...  │
  └──────┴──────┴──────┴──────┴──────┴──────┘
  ^                                          ^
  sequential memory                          grows by 2x

Array access is O(1) by node ID: graph.nodes.(node_id). No hash table lookup, no pointer chase.

Dirty Heap

The heap is also array-based:

dirty_heap.entries:
  ┌─────────┬─────────┬─────────┬─────────┐
  │ (0, n3) │ (1, n7) │ (1, n9) │ unused  │
  └─────────┴─────────┴─────────┴─────────┘
  index 1    index 2    index 3    (1-indexed)

dirty_heap.in_heap:
  ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
  │ F │ F │ F │ T │ F │ F │ F │ T │ F │ T │
  └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
  n0   n1   n2   n3   n4   n5   n6   n7   n8   n9

in_heap provides an O(1) membership check by node_id, so a node that is already scheduled is never inserted twice during dirty propagation.
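The push-side deduplication can be sketched as follows. The type and field names mirror the diagrams above but are illustrative, and the sift-up step that restores heap order is elided:

```ocaml
(* Sketch of O(1) dedup on push (illustrative names). [in_heap] is a
   pre-allocated bool array indexed by node_id. *)
type dirty_heap = {
  mutable size : int;
  entries : (int * int) array;   (* (height, node_id), 1-indexed *)
  in_heap : bool array;          (* membership by node_id *)
}

let push h ~height ~node_id =
  if not h.in_heap.(node_id) then begin
    h.in_heap.(node_id) <- true;
    h.size <- h.size + 1;
    h.entries.(h.size) <- (height, node_id)
    (* the real implementation then sifts the new entry up by
       (height, node_id) to restore the heap invariant *)
  end
  (* already scheduled: drop the duplicate in O(1) *)
```

Without this check, a node with many dirty parents would be pushed once per parent and recomputed repeatedly in the same cycle.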

Cache Friendliness

With 2,000 nodes at ~200 bytes each:

  • Node array: 400 KB (fits in L2 cache on most CPUs)
  • Dirty heap: ~80 KB (fits in L1 cache)
  • Total working set: < 500 KB

During stabilization, the access pattern is:

  1. Pop from heap (sequential access to heap array)
  2. Read node from node array (indexed access, likely cached)
  3. Read parent values from node array (indexed access)
  4. Write new value to node array
  5. Push dependents to heap

Steps 2-4 access a small number of nodes (typically 3-5) that are likely in the same cache line or adjacent lines. There is no pointer chasing through a linked data structure.

Zero Allocation on the Hot Path

The stabilization loop allocates nothing during the common case:

Operation             Allocation?    Why
Dirty_heap.pop_min    No             Returns struct from pre-allocated array
Dirty_heap.push       No             Writes to pre-allocated array slot
Node recompute (Map)  Depends on f   User function may allocate
Cutoff check          No             Calls user equality function
Set dependent dirty   No             Sets boolean flag + heap push

The heap is pre-allocated to capacity. The node array is pre-allocated and grows only at construction time. The in_heap array is pre-allocated.

Allocation can occur in user-provided functions (f in Map, add/remove in Incr_fold). For numeric operations (float arithmetic), OCaml unboxes intermediate doubles and stores floats flat in float arrays and all-float records, so no allocation occurs. For string operations or record construction, allocation does occur, but it lands on the minor heap, where an allocation is a pointer bump and short-lived values are reclaimed cheaply.

Measuring Hot Path Allocations

Use OCaml’s Gc.stat() to verify zero allocation:

let before = Gc.stat () in
let _ = Graph.stabilize g in
let after = Gc.stat () in
let allocs = after.Gc.minor_words -. before.Gc.minor_words in
Printf.printf "words allocated: %.0f\n" allocs
(* For numeric-only stabilization: 0. Note that Gc.stat itself allocates
   its result record; for a tighter measurement, sample the counter
   directly with Gc.minor_words (). *)

Performance Summary

Operation                      Time     Complexity
set_leaf                       ~10 ns   O(1)
stabilize (1 change, numeric)  ~250 ns  O(R log R), R typically 3
watch                          ~5 ns    O(1) array index
Incr_fold update               ~50 ns   O(changed_parents)
Cutoff check (float equality)  ~2 ns    O(1)
Dirty heap push                ~20 ns   O(log N)
Dirty heap pop                 ~20 ns   O(log N)

End-to-end for a single trade through the VWAP pipeline:

  • set_leaf: 10 ns
  • stabilize (leaf + map + incr_fold): 250 ns
  • watch: 5 ns
  • Total: ~265 ns per trade

At 265 ns/trade, the theoretical maximum throughput is ~3.8 million trades/sec on a single core.