
Docker & Kubernetes

This page covers Ripple's deployment infrastructure: the Docker image, the Docker Compose development stack, and the Kubernetes production manifests.

Dockerfile

Ripple uses a multi-stage build for minimal production images:

# Stage 1: Build
FROM ocaml/opam:ubuntu-22.04-ocaml-5.3 AS builder

RUN sudo apt-get update && sudo apt-get install -y \
    librdkafka-dev \
    libssl-dev \
    pkg-config \
    && sudo rm -rf /var/lib/apt/lists/*

WORKDIR /home/opam/ripple
COPY --chown=opam:opam ripple.opam dune-project ./
RUN opam install . --deps-only --yes

COPY --chown=opam:opam . .
RUN eval $(opam env) && dune build bin/worker/main.exe

# Stage 2: Runtime
FROM ubuntu:22.04 AS runtime

RUN apt-get update && apt-get install -y \
    librdkafka1 \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

COPY --from=builder \
  /home/opam/ripple/_build/default/bin/worker/main.exe \
  /usr/local/bin/ripple-worker

# Health (9100), RPC (9101), metrics (9102).
# Note: comments must be on their own lines -- Docker does not support
# inline comments after instruction arguments.
EXPOSE 9100
EXPOSE 9101
EXPOSE 9102

ENTRYPOINT ["/usr/local/bin/ripple-worker"]

The opam install . --deps-only layer is cached separately from the source copy: a change to the dependency list rebuilds that layer, while source-only changes reuse the cache and go straight to dune build.

Build the image:

docker build -f infra/docker/Dockerfile.worker -t ripple/worker:latest .

Docker Compose (Development)

The development stack runs Redpanda (Kafka-compatible) and MinIO (S3-compatible) alongside Ripple workers:

services:
  # Redpanda: Kafka-compatible broker, no ZooKeeper
  redpanda:
    image: docker.redpanda.com/redpandadata/redpanda:v24.1.1
    command:
      - redpanda start
      - --smp 1
      - --memory 512M
      - --overprovisioned
      - --kafka-addr internal://0.0.0.0:9092,external://0.0.0.0:19092
      - --advertise-kafka-addr internal://redpanda:9092,external://localhost:19092
    ports:
      - "19092:19092"   # Kafka API
      - "18082:18082"   # HTTP Proxy
    healthcheck:
      test: ["CMD", "rpk", "cluster", "health"]
      interval: 5s

  # MinIO: S3-compatible checkpoint storage
  minio:
    image: minio/minio:latest
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: ripple
      MINIO_ROOT_PASSWORD: ripplepass
    ports:
      - "9000:9000"     # S3 API
      - "9001:9001"     # Web console
    volumes:
      - minio-data:/data
    # Required: create-bucket waits on service_healthy, which only works
    # if a healthcheck is defined here.
    healthcheck:
      test: ["CMD", "mc", "ready", "local"]
      interval: 5s
      timeout: 5s
      retries: 5

  # Init job: create Kafka topics
  create-infra:
    image: docker.redpanda.com/redpandadata/redpanda:v24.1.1
    depends_on:
      redpanda: { condition: service_healthy }
    entrypoint: >
      bash -c "
        rpk topic create trades --brokers redpanda:9092 --partitions 8 &&
        rpk topic create vwap-output --brokers redpanda:9092 --partitions 8
      "

  # Init job: create S3 bucket
  create-bucket:
    image: minio/mc:latest
    depends_on:
      minio: { condition: service_healthy }
    entrypoint: >
      bash -c "
        mc alias set local http://minio:9000 ripple ripplepass &&
        mc mb local/ripple-checkpoints --ignore-existing
      "

volumes:
  minio-data:

Usage

cd infra/compose

# Start infrastructure
docker compose up -d

# Wait for health checks
docker compose ps

# Run integration test
./run-integration-test.sh

# Teardown
docker compose down -v

Accessing Services

Service         URL                     Purpose
Kafka API       localhost:19092         Produce/consume trades
MinIO Console   http://localhost:9001   Browse checkpoint bucket
MinIO S3 API    http://localhost:9000   S3-compatible endpoint

Kubernetes (Production)

Namespace

apiVersion: v1
kind: Namespace
metadata:
  name: ripple
  labels:
    app: ripple

ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: ripple-config
  namespace: ripple
data:
  checkpoint_bucket: "s3://ripple-checkpoints/prod"
  kafka_brokers: "kafka-0.kafka.svc:9092,kafka-1.kafka.svc:9092"
  ripple.sexp: |
    ((cluster
      ((name prod)
       (defaults
        ((num_partitions 128)
         (max_keys_per_partition 2000)
         (checkpoint_interval_sec 10)
         (heartbeat_interval_sec 5)
         (failure_detection_timeout_sec 30))))))
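
The ConfigMap is typically consumed by mounting it into the worker pods. The following wiring is a sketch: the /etc/ripple mount path is an assumed convention, not taken from the manifests shown here.

```yaml
# Hypothetical pod-spec fragment: mount ripple-config so the worker can
# read /etc/ripple/ripple.sexp (mountPath is an assumption).
spec:
  volumes:
    - name: config
      configMap:
        name: ripple-config
  containers:
    - name: worker
      volumeMounts:
        - name: config
          mountPath: /etc/ripple
          readOnly: true
```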

Worker StatefulSet

Workers use a StatefulSet (not Deployment) because they need:

  • Stable network identity for partition assignment
  • Stable storage for local checkpoint cache
  • Ordered, graceful scaling

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ripple-worker
  namespace: ripple
spec:
  serviceName: ripple-worker
  replicas: 10
  podManagementPolicy: Parallel
  # selector is a required field and must match the template labels,
  # which also match the headless service selector below.
  selector:
    matchLabels:
      app: ripple
      component: worker
  template:
    metadata:
      labels:
        app: ripple
        component: worker
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9102"
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: worker
          image: ripple/worker:latest
          ports:
            - { name: health, containerPort: 9100 }
            - { name: rpc,    containerPort: 9101 }
            - { name: metrics, containerPort: 9102 }
          env:
            - name: RIPPLE_WORKER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          resources:
            requests: { cpu: "1", memory: "512Mi" }
            limits:   { cpu: "2", memory: "1Gi" }
          livenessProbe:
            httpGet: { path: /health, port: health }
            initialDelaySeconds: 10
            periodSeconds: 10
          readinessProbe:
            httpGet: { path: /ready, port: health }
            initialDelaySeconds: 5
            periodSeconds: 5
          volumeMounts:
            - name: checkpoint-cache
              mountPath: /var/lib/ripple/checkpoints
  volumeClaimTemplates:
    - metadata:
        name: checkpoint-cache
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi

Key configuration:

  • podManagementPolicy: Parallel – workers start simultaneously, no sequential ordering needed
  • terminationGracePeriodSeconds: 30 – allows time for drain + checkpoint on shutdown
  • Worker ID is derived from the pod name (ripple-worker-0, ripple-worker-1, etc.)
  • Local checkpoint cache on PVC for fast recovery without S3 round-trip
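
Because the worker ID is injected verbatim from the pod name (via the fieldRef in the StatefulSet), the ordinal can be recovered with plain shell parameter expansion. A small sketch, with the ID hardcoded for illustration:

```shell
# Sketch: recover the StatefulSet ordinal from the pod-name-derived worker ID.
RIPPLE_WORKER_ID="ripple-worker-7"   # set via fieldRef in the real pod
ORDINAL="${RIPPLE_WORKER_ID##*-}"    # strip everything up to the last '-'
echo "worker ordinal: $ORDINAL"      # → worker ordinal: 7
```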

Coordinator Deployment

The coordinator is stateless, so it uses a Deployment (not StatefulSet):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ripple-coordinator
  namespace: ripple
spec:
  replicas: 2   # HA -- active/standby
  # selector is required and must match the template labels
  # (component label follows the worker convention).
  selector:
    matchLabels:
      app: ripple
      component: coordinator
  template:
    metadata:
      labels:
        app: ripple
        component: coordinator
    spec:
      containers:
        - name: coordinator
          image: ripple/coordinator:latest
          ports:
            - { name: grpc,    containerPort: 9200 }
            - { name: health,  containerPort: 9201 }
            - { name: metrics, containerPort: 9202 }
          resources:
            requests: { cpu: "500m", memory: "256Mi" }
            limits:   { cpu: "1",    memory: "512Mi" }

Headless Service

apiVersion: v1
kind: Service
metadata:
  name: ripple-worker
  namespace: ripple
spec:
  clusterIP: None   # Headless for StatefulSet DNS
  selector:
    app: ripple
    component: worker
  ports:
    - { name: rpc, port: 9101 }
    - { name: metrics, port: 9102 }

The headless service gives each worker a stable DNS name: ripple-worker-0.ripple-worker.ripple.svc.cluster.local.
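
Given that naming scheme, any pod can enumerate its peers' RPC endpoints without a separate discovery service. A sketch; in practice the replica count would come from configuration rather than a literal:

```shell
# Sketch: build the peer RPC endpoint list from the stable DNS pattern
# <pod>.<service>.<namespace>.svc.cluster.local:<rpc-port>.
REPLICAS=3
PEERS=""
for i in $(seq 0 $((REPLICAS - 1))); do
  PEERS="${PEERS}ripple-worker-${i}.ripple-worker.ripple.svc.cluster.local:9101 "
done
echo "$PEERS"
```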

Scaling

Horizontal Scaling

# Scale workers
kubectl scale statefulset ripple-worker --replicas=20 -n ripple

The coordinator detects new workers through heartbeat registration and automatically rebalances partitions across the consistent hash ring.
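
Ripple's actual ring lives in the coordinator and is not shown in this document; the following is only a toy shell illustration of the successor-lookup idea behind consistent hashing, using cksum as a stand-in hash:

```shell
# Toy consistent-hash lookup (illustration only, not Ripple's implementation).
# Each worker owns a point on the ring; a partition belongs to the first
# worker at or after its own hash point, wrapping around at the top.
hashpoint() { printf '%s' "$1" | cksum | cut -d' ' -f1; }

ring=$(for w in 0 1 2; do
  printf '%s ripple-worker-%s\n' "$(hashpoint "ripple-worker-$w")" "$w"
done | sort -n)

ph=$(hashpoint "partition-5")
owner=$(printf '%s\n' "$ring" | awk -v h="$ph" '$1+0 >= h+0 {print $2; exit}')
[ -n "$owner" ] || owner=$(printf '%s\n' "$ring" | head -n 1 | awk '{print $2}')
echo "partition-5 -> $owner"
```

Adding a worker inserts one more point on the ring, so only the partitions whose successor changes move; that incremental movement is what makes rebalancing cheap compared to modulo-based assignment.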

Resource Guidelines

Component     CPU Request   Memory Request   Rationale
Worker        1 core        512 Mi           Graph engine is CPU-bound, 500 KB working set
Coordinator   500m          256 Mi           Lightweight, mostly heartbeat tracking

Workers are CPU-bound (stabilization loop). Memory usage is predictable: ~200 bytes/node * 4,001 nodes = ~800 KB for the graph, plus input buffers and GC overhead.
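
The graph-footprint figure is easy to sanity-check with shell arithmetic:

```shell
# Back-of-envelope check of the graph footprint quoted above.
bytes=$(( 200 * 4001 ))
echo "${bytes} bytes"   # → 800200 bytes, i.e. roughly 800 KB
```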