Why It’s Fast

Ripple achieves sub-microsecond stabilization through three key optimizations: heap-based dirty propagation, incremental fold, and cutoff. This page explains each optimization, the memory layout that supports them, and the zero-allocation discipline on the hot path.

The Three Key Optimizations

1. Heap-Based Dirty Propagation

Problem: When a leaf changes, which nodes need recomputation?

Naive approach (dirty flags): Scan all N nodes, check if dirty, recompute if so. This is O(N) per stabilization regardless of how many nodes actually changed.

Ripple’s approach (min-heap): Push dirty nodes onto a min-heap keyed by (height, node_id). Pop nodes in order and recompute. This is O(R log R) where R is the number of dirty nodes.

Benchmark B-05 proved this matters:
  Linear scan at 10K nodes: ~27 µs per stabilization
  Heap-based at 10K nodes:  ~250 ns per stabilization
  Speedup: 100x

The heap gives O(R log R), but in the common case R is very small (1 leaf + 2-3 dependents), so the practical complexity is O(1) with a small constant.
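The propagation loop described above can be sketched in OCaml. The node type and the list-based priority queue below are illustrative stand-ins, not Ripple's actual types (the real heap is array-backed, as shown under "Dirty Heap"); what matters is the pop-by-(height, id) discipline, which guarantees a node recomputes only after all of its parents have:

```ocaml
(* Minimal sketch of heap-based dirty propagation. Hypothetical types;
   Ripple's real heap is an array-backed binary heap. *)
module Sketch = struct
  type node = {
    id : int;
    height : int;                       (* distance from the leaves *)
    mutable value : float;
    compute : float array -> float;     (* recompute from parent values *)
    parents : int array;
    dependents : int array;
  }

  (* Toy priority queue: a sorted, deduplicated list keyed by (height, id). *)
  let pop = function [] -> None | x :: rest -> Some (x, rest)
  let push heap key = List.sort_uniq compare (key :: heap)

  (* Pop dirty nodes in (height, id) order, recompute each, and push its
     dependents. Only the R dirty nodes are touched: O(R log R). *)
  let stabilize nodes dirty =
    let heap = ref (List.fold_left push [] dirty) in
    let recomputed = ref 0 in
    let rec loop () =
      match pop !heap with
      | None -> ()
      | Some ((_h, id), rest) ->
        heap := rest;
        let n = nodes.(id) in
        let inputs = Array.map (fun p -> nodes.(p).value) n.parents in
        n.value <- n.compute inputs;
        incr recomputed;
        Array.iter
          (fun d -> heap := push !heap (nodes.(d).height, d))
          n.dependents;
        loop ()
    in
    loop ();
    !recomputed
end
```

Because nodes are keyed by height, a diamond-shaped graph still recomputes each node at most once per stabilization: both parents of a join node pop before the join itself does.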

2. Incremental Fold (Incr_fold)

Problem: A portfolio VWAP aggregates 2,000 symbol prices. When one price changes, how do we update the sum?

Naive fold: Re-fold all 2,000 prices: sum = fold (+) 0 [p0; p1; ...; p1999]. This is O(N) per stabilization.

Incr_fold: Track which parents changed. For each changed parent, remove the old value and add the new value:

acc = old_sum
acc = acc - old_price_AAPL     (remove)
acc = acc + new_price_AAPL     (add)
return acc

This is O(changed_parents), which is O(1) for the single-trade case.

Performance comparison (2,000 symbols, 1 change):
  Fold:      O(2000) = ~5 µs
  Incr_fold: O(1)    = ~50 ns
  Speedup:   100x

The Incr_fold node maintains a snapshot of all parent values from the previous cycle. During dirty propagation, when a parent of an Incr_fold is about to be marked dirty, the parent’s index is recorded. During recomputation, only those recorded indices are processed.
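The remove/add step above can be written as a small helper. The function and field names here are illustrative, not Ripple's API; `snapshot` plays the role of the previous-cycle parent values and `changed` the recorded indices:

```ocaml
(* Sketch of the Incr_fold delta update (hypothetical helper). For each
   changed parent index, subtract the snapshotted old value, refresh the
   snapshot, and add the new value: O(changed_parents), not O(N). *)
let incr_fold_update ~snapshot ~current ~changed old_acc =
  List.fold_left
    (fun acc i ->
      let acc = acc -. snapshot.(i) in   (* remove old value *)
      snapshot.(i) <- current.(i);       (* refresh snapshot *)
      acc +. current.(i))                (* add new value *)
    old_acc changed
```

With 2,000 symbol prices and a single trade, `changed` has one element, so the sum updates with one subtraction and one addition regardless of portfolio size.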

3. Cutoff Optimization

Problem: A node recomputes but produces the same value as before. Should its dependents recompute?

Without cutoff: Yes – every recomputation propagates to all dependents, creating phantom updates.

With cutoff: No – if the new value equals the old value (by the user-provided equality function), propagation stops.

[leaf: raw_price=250]
       |
       v
[map: clamp(100)]  recomputes: min(250, 100) = 100
       |            old value was 100
       |            cutoff: 100 == 100, STOP
       v
[map: format]       NOT recomputed (saved ~100ns + string allocation)

Cutoff is especially effective for:

  • Threshold/clamp functions
  • Boolean predicates (changes in input that don’t flip the predicate)
  • Rounding functions
  • Any function with a small output domain

In the VWAP pipeline, cutoff prevents propagation when a trade arrives but the VWAP (weighted average) does not change meaningfully due to rounding.
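The cutoff decision itself is a single comparison against the previous value. The sketch below is illustrative (the names are not Ripple's exact code), using the clamp example from the diagram above:

```ocaml
(* Sketch of the cutoff check. [cutoff] is the user-provided equality;
   dependents are only marked dirty when the value actually changed. *)
let recompute_with_cutoff ~cutoff ~old_value ~new_value
    ~mark_dependents_dirty =
  if cutoff old_value new_value then
    `Cutoff                      (* same value: propagation stops here *)
  else begin
    mark_dependents_dirty ();    (* value changed: dependents recompute *)
    `Propagated
  end

(* A clamp node: raw prices 250.0 and 999.0 both clamp to 100.0, so the
   second update is cut off and downstream nodes never recompute. *)
let clamp x = Float.min x 100.0
```

This is why functions with small output domains benefit most: the more raw inputs that map to the same output, the more phantom updates the single equality check eliminates.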

Memory Layout

Node Array

Nodes are stored in a contiguous array, not a linked structure:

graph.nodes:
  ┌──────┬──────┬──────┬──────┬──────┬──────┐
  │ n[0] │ n[1] │ n[2] │ n[3] │ n[4] │ ...  │
  └──────┴──────┴──────┴──────┴──────┴──────┘
  ^                                          ^
  sequential memory                          grows by 2x

Array access is O(1) by node ID: graph.nodes.(node_id). No hash table lookup, no pointer chase.

Dirty Heap

The heap is also array-based:

dirty_heap.entries:
  ┌─────────┬─────────┬─────────┬─────────┐
  │ (0, n3) │ (1, n7) │ (1, n9) │ unused  │
  └─────────┴─────────┴─────────┴─────────┘
  index 1    index 2    index 3    (1-indexed)

dirty_heap.in_heap:
  ┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
  │ F │ F │ F │ T │ F │ F │ F │ T │ F │ T │
  └───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
  n0   n1   n2   n3   n4   n5   n6   n7   n8   n9

in_heap provides an O(1) membership check by node_id, so a node that is already scheduled is never inserted twice during dirty propagation.
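The push-side deduplication can be sketched as follows. The type and field names mirror the diagrams above but are illustrative, and the sift-up step that restores heap order is elided:

```ocaml
(* Sketch of O(1) dedup on push (illustrative names). [in_heap] is a
   pre-allocated bool array indexed by node_id. *)
type dirty_heap = {
  mutable size : int;
  entries : (int * int) array;   (* (height, node_id), 1-indexed *)
  in_heap : bool array;          (* membership by node_id *)
}

let push h ~height ~node_id =
  if not h.in_heap.(node_id) then begin
    h.in_heap.(node_id) <- true;
    h.size <- h.size + 1;
    h.entries.(h.size) <- (height, node_id)
    (* the real implementation then sifts the new entry up by
       (height, node_id) to restore the heap invariant *)
  end
  (* already scheduled: drop the duplicate in O(1) *)
```

Without this check, a node with many dirty parents would be pushed once per parent and recomputed repeatedly in the same cycle.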

Cache Friendliness

With 2,000 nodes at ~200 bytes each:

  • Node array: 400 KB (fits in L2 cache on most CPUs)
  • Dirty heap: ~80 KB (fits in L1 cache)
  • Total working set: < 500 KB

During stabilization, the access pattern is:

  1. Pop from heap (sequential access to heap array)
  2. Read node from node array (indexed access, likely cached)
  3. Read parent values from node array (indexed access)
  4. Write new value to node array
  5. Push dependents to heap

Steps 2-4 access a small number of nodes (typically 3-5) that are likely in the same cache line or adjacent lines. There is no pointer chasing through a linked data structure.

Zero Allocation on the Hot Path

The stabilization loop allocates nothing during the common case:

Operation             Allocation?    Why
Dirty_heap.pop_min    No             Returns struct from pre-allocated array
Dirty_heap.push       No             Writes to pre-allocated array slot
Node recompute (Map)  Depends on f   User function may allocate
Cutoff check          No             Calls user equality function
Set dependent dirty   No             Sets boolean flag + heap push

The heap is pre-allocated to capacity. The node array is pre-allocated and grows only at construction time. The in_heap array is pre-allocated.

Allocation can occur in user-provided functions (f in Map, add/remove in Incr_fold). For numeric operations (float arithmetic), OCaml unboxes intermediate doubles and stores floats flat in float arrays and all-float records, so no allocation occurs. For string operations or record construction, allocation does occur, but it lands on the minor heap, where an allocation is a pointer bump and short-lived values are reclaimed cheaply.

Measuring Hot Path Allocations

Use OCaml’s Gc.stat() to verify zero allocation:

let before = Gc.stat () in
let _ = Graph.stabilize g in
let after = Gc.stat () in
let allocs = after.Gc.minor_words -. before.Gc.minor_words in
Printf.printf "words allocated: %.0f\n" allocs
(* For numeric-only stabilization: 0. Note that Gc.stat itself allocates
   its result record; for a tighter measurement, sample the counter
   directly with Gc.minor_words (). *)

Performance Summary

Operation                      Time     Complexity
set_leaf                       ~10 ns   O(1)
stabilize (1 change, numeric)  ~250 ns  O(R log R), R typically 3
watch                          ~5 ns    O(1) array index
Incr_fold update               ~50 ns   O(changed_parents)
Cutoff check (float equality)  ~2 ns    O(1)
Dirty heap push                ~20 ns   O(log N)
Dirty heap pop                 ~20 ns   O(log N)

End-to-end for a single trade through the VWAP pipeline:

  • set_leaf: 10 ns
  • stabilize (leaf + map + incr_fold): 250 ns
  • watch: 5 ns
  • Total: ~265 ns per trade

At 265 ns/trade, the theoretical maximum throughput is ~3.8 million trades/sec on a single core.