The TBF Binary Format

TBF (TALA Binary Format) is the on-disk segment format for intent data. It is purpose-built for TALA's hybrid graph-plus-vector data model, co-designed for zero-copy memory-mapped access, SIMD-aligned embedding vectors, and graph adjacency traversal without deserialization.

TBF replaces general-purpose formats (Parquet, Arrow IPC, FlatBuffers) because none of them provide graph-native layout alongside SIMD-aligned vector storage. The cost is a format that only TALA reads and writes. The benefit is that every access pattern TALA cares about -- columnar field scans, embedding similarity, edge traversal, membership tests -- is optimized at the byte level.

Segment Structure

A single .tbf segment file is laid out as follows:

Offset 0           ┌──────────────────────────────────┐
                    │        File Header (128B)        │
                    │                                  │
                    │  magic: "TALB"                   │
                    │  version, counts, region offsets │
Offset 128         ├──────────────────────────────────┤
                    │      Node Payload Region         │
                    │    (columnar, 64B-aligned)       │
                    │                                  │
                    │  [ids]  [timestamps]  [hashes]   │
                    │  [confidences]  [statuses]       │
                    ├──────────────────────────────────┤
                    │    Embedding Vector Region        │
                    │   (fixed-stride, 64B-aligned)    │
                    │                                  │
                    │  [vec_0] [pad] [vec_1] [pad] ... │
                    ├──────────────────────────────────┤
                    │        Edge Region (CSR)          │
                    │                                  │
                    │  [row_pointers]  [edge_entries]  │
                    ├──────────────────────────────────┤
                    │       Bloom Filter Block          │
                    │                                  │
                    │  [num_hashes] [num_bits] [bits]  │
                    └──────────────────────────────────┘

Every region boundary is padded to a 64-byte boundary. This guarantees that SIMD loads (_mm512_load_ps for AVX-512, _mm256_load_ps for AVX2) never straddle region or column boundaries.

File Header

The header occupies the first 128 bytes and is laid out for direct mmap casting via #[repr(C, packed)]:

Offset	Size	Field	Description
0	4	`magic`	`0x54414C42` ("TALB")
4	2	`version_major`	Format major version
6	2	`version_minor`	Format minor version
8	8	`segment_id`	Monotonic segment identifier
16	8	`created_at`	Unix timestamp (nanoseconds)
24	8	`node_count`	Number of intent nodes
32	8	`edge_count`	Number of edges
40	4	`embedding_dim`	Vector dimensionality
44	1	`embedding_dtype`	0=f32, 1=f16, 2=bf16, 3=int8
45	1	`compression`	0=none, 1=lz4, 2=zstd
46	2	`flags`	Bit flags (see below)
48	8	`node_region_offset`	Byte offset to node payload region
56	8	`embed_region_offset`	Byte offset to embedding region
64	8	`edge_region_offset`	Byte offset to edge region
72	8	`string_table_offset`	Byte offset to string table
80	8	`bloom_offset`	Byte offset to bloom filter
88	8	`index_offset`	Byte offset to B+ tree index
96	8	`footer_offset`	Byte offset to footer
104	4	`checksum`	CRC32C of header bytes [0..104)
108	20	`reserved`	Zero-padded

Flags

Bit	Name	Meaning
0	`has_outcomes`	Outcome fields are populated
1	`has_embeddings`	Embedding region is present
2	`is_compacted`	Segment is the result of compaction
3	`is_encrypted`	Payload regions are AES-256-GCM encrypted
4	`is_partial`	Segment was not cleanly closed (crash indicator)

A reader encountering is_partial = 1 knows the segment is incomplete. Recovery discards it and replays the WAL.

Node Payload Region

Node fields are stored in columnar groups. Each column contains values for all N nodes, stored contiguously. This layout enables vectorized scans over a single field without pulling unrelated data into cache.

Column Layout

Column	Element Size	Content
`id`	16 bytes	UUID (128-bit, big-endian)
`timestamp`	8 bytes	u64 nanosecond epoch
`context_hash`	8 bytes	u64 xxHash3
`confidence`	4 bytes	f32 extraction confidence
`outcome_status`	1 byte	u8 enum: 0=pending, 1=success, 2=failure, 3=partial

Each column start is padded to the next 64-byte boundary. For a segment with 100K nodes, the timestamp column occupies 800KB of contiguous memory -- a single field scan touches only those 800KB, not the full node data.

The ColumnReader provides zero-copy typed access. Given a column offset and a node index, it computes the byte position and interprets the value directly from the underlying buffer:

#![allow(unused)]
fn main() {
// Read the timestamp of node i
let ts = reader.read_u64(timestamp_col_offset, i);
// Computes: offset + i * 8, reads 8 bytes as little-endian u64
}

No allocation. No deserialization struct. Pointer arithmetic and a byte reinterpretation.

Embedding Vector Region

The most performance-critical region. Embedding vectors are stored contiguously with fixed stride, aligned to 64-byte boundaries.

Why 64-Byte Alignment

SIMD instruction sets impose alignment requirements on memory operands:

ISA	Load Instruction	Required Alignment
SSE	`_mm_load_ps`	16 bytes
AVX2	`_mm256_load_ps`	32 bytes
AVX-512	`_mm512_load_ps`	64 bytes
NEON	`vld1q_f32`	16 bytes (recommended)

TALA standardizes on 64 bytes to satisfy all tiers. This means AVX-512 loads can use the aligned variant (_mm512_load_ps) instead of the unaligned variant (_mm512_loadu_ps), avoiding a potential penalty on some microarchitectures.

Stride Calculation

The stride is the vector size rounded up to the next 64-byte boundary:

stride = align_up(dim * sizeof(dtype), 64)

For the default configuration (dim=384, dtype=f32):

Vector size: 384 x 4 = 1,536 bytes
1,536 / 64 = 24 (already aligned)
Stride: 1,536 bytes

For dim=384, dtype=f16:

Vector size: 384 x 2 = 768 bytes
768 / 64 = 12 (already aligned)
Stride: 768 bytes

Access Pattern

Given node index i, the embedding is at byte offset:

embed_region_offset + (i * stride)

This is O(1) with zero deserialization. An mmap'd file pointer plus offset arithmetic is all that is needed. The EmbeddingReader wraps this:

#![allow(unused)]
fn main() {
pub fn get(&self, index: usize) -> &[f32] {
    let offset = index * self.stride;
    // SAFETY: offset is within bounds, alignment guaranteed by stride
    unsafe {
        std::slice::from_raw_parts(
            self.data[offset..].as_ptr() as *const f32,
            self.dim,
        )
    }
}
}

Quantization

The embedding_dtype header field determines how bytes in the embedding region are interpreted:

Value	Type	Bytes/Dim	Use Case
0	f32	4	Hot store, real-time queries
1	f16	2	Warm store, 2x compression
2	bf16	2	Training-compatible
3	int8	1	Cold store, 4x compression

For int8, each vector carries an 8-byte header (f32 scale, f32 zero_point) for dequantization. Quality loss at int8 is minimal: Spearman rank correlation of similarity rankings versus f32 baseline is 0.990-0.995.

Edge Region (CSR)

Edges use Compressed Sparse Row encoding, optimized for forward traversal: given a node, find all its outgoing edges.

Structure

The edge region contains two arrays:

Row pointer array: (N+1) entries of u64, where N is the node count. row_pointers[i] gives the index into the edge array where node i's edges begin. row_pointers[i+1] - row_pointers[i] gives node i's out-degree.

Edge entry array: E entries, where each entry is:

Field	Size	Description
`target`	4 bytes	u32 index of destination node
`relation`	1 byte	u8 enum: 0=causal, 1=temporal, 2=dependency, 3=retry, 4=branch
`weight`	4 bytes	f32 confidence
`padding`	3 bytes	Zero (alignment to 12 bytes)

To find all edges from node i:

#![allow(unused)]
fn main() {
let start = row_pointers[i] as usize;
let end = row_pointers[i + 1] as usize;
let edges = &edge_array[start..end];
}

This is a single index lookup and a slice operation. No linked-list traversal, no hash table probe.

Reverse Index

For backward traversal (find all nodes pointing to a given node), a CSC (Compressed Sparse Column) block can be appended after the CSR block, using the same format but transposed. Its presence is indicated by a flag in the edge region header.

A bloom filter over node UUIDs provides fast negative lookups: "Is this intent definitely not in this segment?" This is critical for query routing across multiple segments. Before performing a full B+ tree lookup or column scan, the bloom filter rejects segments that cannot contain the target node.

Layout

Field	Size	Description
`num_hashes`	4 bytes	Number of hash functions
`num_bits`	8 bytes	Total bit count
`bits`	`ceil(num_bits/64) * 8` bytes	Filter data as u64 words

Target false positive rate: 1%. For 100K nodes, this requires approximately 120KB. The filter uses double hashing (FNV-1a based) with k hash functions derived from two base hashes.

Why Binary-First

TALA stores intent data in a custom binary format rather than JSON, JSONL, or other text formats. The reasons are quantitative:

Density. A single intent with a 384-dimensional f32 embedding occupies 1,536 bytes for the embedding alone. In JSON, each float becomes 6-8 characters of text plus separators, inflating to approximately 4,000 bytes. Over millions of intents, this difference is the gap between fitting in memory and not.

Zero-copy access. An mmap'd TBF segment exposes node fields and embeddings directly as typed memory. A JSON file must be parsed, decoded, and allocated into structs before any field can be read.

SIMD compatibility. JSON floats are ASCII text. Computing cosine similarity over JSON-encoded vectors requires parsing every float before any math can happen. TBF embeddings are already in the format that SIMD instructions consume: contiguous f32 arrays at 64-byte alignment.

Graph traversal. The CSR edge index enables O(1) lookup of a node's neighbors. In JSON, edges would be embedded in node objects or stored in a separate array requiring linear scan.

Why Columnar

Row-oriented storage (one struct per intent, fields interleaved) is natural for single-record access but wasteful for analytical queries. When tala find needs to scan 100K timestamps to identify a time range, row-oriented layout forces it to pull every field of every intent into cache, even though only the 8-byte timestamp is needed.

Columnar layout stores each field as a contiguous array. A timestamp scan touches 800KB (100K x 8 bytes) instead of the full record size (potentially 2KB+ per intent including the embedding). This matches the access pattern that SIMD vectorization exploits: uniform arrays processed in bulk.

Comparison with Existing Formats

Property	TBF	Parquet	Arrow IPC	FlatBuffers
Graph-native layout (CSR)	Yes	No	No	No
SIMD-aligned embeddings	Yes (64B)	No	Yes (64B)	No
Zero-copy mmap	Yes	Partial	Yes	Yes
Columnar node fields	Yes	Yes	Yes	No
Built-in bloom filter	Yes	Optional	No	No
Built-in graph index	Yes (B+ tree + CSR)	No	No	No
Append-only segments	Yes	Yes	No	No
Per-region compression	Yes	Per-column	No	No

TALA — Intent-Native Narrative Execution Layer