palimpsest-extract

HTML-to-text extraction and RAG chunking with full provenance tracking. Every chunk carries its source URL, capture timestamp, content hash, and character offset.

ExtractedDocument

pub struct ExtractedDocument {
    pub url: String,
    pub title: Option<String>,
    pub description: Option<String>,
    pub text: String,
    pub text_length: usize,
    pub chunks: Vec<ContentChunk>,
    pub text_hash: String,
    pub source_hash: String,
    pub captured_at: String,
}

extract_document

pub fn extract_document(
    raw_response: &[u8],
    source_url: &Url,
    captured_at: CaptureInstant,
    source_hash: ContentHash,
    chunk_config: &ChunkConfig,
) -> ExtractedDocument

Pipeline: raw HTTP response -> strip headers -> HTML to clean text -> chunk with provenance.
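The "strip headers" step can be pictured with a small sketch. The function name strip_headers is hypothetical and not part of the documented API; it assumes the raw response separates headers from the body with the first CRLF CRLF sequence and does not handle chunked transfer encoding or compression.

```rust
/// Hypothetical helper (not the crate's API): return the body of a raw
/// HTTP response, i.e. everything after the first blank line (CRLF CRLF).
/// If no header terminator is found, the input is returned unchanged.
pub fn strip_headers(raw_response: &[u8]) -> &[u8] {
    const SEPARATOR: &[u8] = b"\r\n\r\n";
    raw_response
        .windows(SEPARATOR.len())
        .position(|window| window == SEPARATOR)
        .map(|index| &raw_response[index + SEPARATOR.len()..])
        .unwrap_or(raw_response)
}
```

Returning a borrowed slice avoids copying the body before it is handed to the HTML-to-text step.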

ContentChunk

pub struct ContentChunk {
    pub text: String,
    pub source_url: Url,
    pub captured_at: CaptureInstant,
    pub source_hash: ContentHash,
    pub chunk_hash: ContentHash,       // BLAKE3 of chunk text
    pub chunk_index: usize,
    pub total_chunks: usize,
    pub char_offset: usize,            // Position in source text
}

ChunkConfig

pub struct ChunkConfig {
    pub target_size: usize,  // Default: 1000 characters
    pub overlap: usize,      // Default: 200 characters
}
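As a rough illustration of the documented defaults, the sketch below redeclares ChunkConfig locally and gives it a Default implementation. Whether the crate actually implements Default is an assumption, and is_valid is a hypothetical helper added here only to show why overlap should stay below target_size.

```rust
// Local redeclaration for illustration; the field names match the docs above.
#[derive(Debug, Clone, PartialEq)]
pub struct ChunkConfig {
    pub target_size: usize,
    pub overlap: usize,
}

// Assumption: the documented defaults are exposed through `Default`.
impl Default for ChunkConfig {
    fn default() -> Self {
        Self { target_size: 1000, overlap: 200 }
    }
}

/// Hypothetical check: a chunker can only make forward progress when
/// each chunk is longer than the overlap it shares with the next one.
pub fn is_valid(config: &ChunkConfig) -> bool {
    config.target_size > 0 && config.overlap < config.target_size
}
```

A caller who wants smaller chunks could then write ChunkConfig { target_size: 500, ..ChunkConfig::default() } and keep the default overlap.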

Chunking Strategy

Splitting respects natural boundaries in priority order:

1. Paragraph boundaries (double newline)
2. Sentence boundaries (period/question/exclamation + space)
3. Word boundaries (space)
4. Character boundary (last resort)

Each chunk overlaps the next by ChunkConfig.overlap characters (200 by default), so context is preserved across chunk boundaries.
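The boundary priority and overlap described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the crate's implementation: SketchChunk, last_pair, find_cut, and chunk_text are hypothetical names, offsets are counted in Unicode scalar values, and a boundary is only accepted if it leaves the chunk longer than the overlap, so that every step moves forward.

```rust
/// Hypothetical chunk type: just the text and its offset in characters.
#[derive(Debug, Clone, PartialEq)]
pub struct SketchChunk {
    pub text: String,
    pub char_offset: usize,
}

/// Find the last adjacent pair (a, b) matching `pred` and return the cut
/// position just after `b`, provided the cut is longer than `min_cut`.
fn last_pair(window: &[char], min_cut: usize, pred: fn(char, char) -> bool) -> Option<usize> {
    (1..window.len())
        .rev()
        .find(|&i| pred(window[i - 1], window[i]))
        .map(|i| i + 1)
        .filter(|&cut| cut > min_cut)
}

/// Pick a cut inside `window`: paragraph, then sentence, then word
/// boundary, falling back to a hard cut at the end of the window.
fn find_cut(window: &[char], min_cut: usize) -> usize {
    last_pair(window, min_cut, |a, b| a == '\n' && b == '\n')
        .or_else(|| last_pair(window, min_cut, |a, b| matches!(a, '.' | '?' | '!') && b == ' '))
        .or_else(|| last_pair(window, min_cut, |_, b| b == ' '))
        .unwrap_or(window.len())
}

/// Split `text` into chunks of at most `target_size` characters, where
/// each chunk starts `overlap` characters before the previous one ended.
pub fn chunk_text(text: &str, target_size: usize, overlap: usize) -> Vec<SketchChunk> {
    assert!(overlap < target_size, "overlap must be smaller than target_size");
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let limit = (start + target_size).min(chars.len());
        // The final window takes everything that is left.
        let end = if limit == chars.len() {
            limit
        } else {
            start + find_cut(&chars[start..limit], overlap)
        };
        chunks.push(SketchChunk {
            text: chars[start..end].iter().collect(),
            char_offset: start,
        });
        if end == chars.len() {
            break;
        }
        // Every cut is longer than `overlap`, so this always advances.
        start = end - overlap;
    }
    chunks
}
```

Calling chunk_text("abcdefghij", 4, 1) yields "abcd", "defg", and "ghij" at offsets 0, 3, and 6: no natural boundary exists, so each cut falls back to the character boundary, and each chunk repeats the last character of its predecessor.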

Key Invariant

Extraction is deterministic: the same input always produces the same chunks and the same hashes. Every chunk’s provenance chain is complete: chunk_hash -> source_hash -> source_url + captured_at.
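One way to picture the invariant is a check that recomputes both hashes and compares them with the stored values. The sketch below is illustrative only: it uses a hand-written 64-bit FNV-1a hash as a stand-in for BLAKE3 so that it needs no external crate, and ProvenancedChunk, make_chunk, and verify_chunk are hypothetical names, not the crate's API.

```rust
/// Stand-in hash (64-bit FNV-1a). The crate documents BLAKE3; FNV-1a is
/// used here only to keep the sketch dependency-free.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &byte in bytes {
        hash ^= byte as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Hypothetical chunk carrying the provenance chain from the text above.
#[derive(Debug, Clone, PartialEq)]
pub struct ProvenancedChunk {
    pub text: String,
    pub chunk_hash: u64,
    pub source_hash: u64,
    pub source_url: String,
    pub captured_at: String,
}

/// Build a chunk whose hashes are pure functions of its inputs.
pub fn make_chunk(text: &str, source: &[u8], source_url: &str, captured_at: &str) -> ProvenancedChunk {
    ProvenancedChunk {
        text: text.to_string(),
        chunk_hash: fnv1a64(text.as_bytes()),
        source_hash: fnv1a64(source),
        source_url: source_url.to_string(),
        captured_at: captured_at.to_string(),
    }
}

/// Recompute both hashes and confirm they match the stored values.
pub fn verify_chunk(chunk: &ProvenancedChunk, source: &[u8]) -> bool {
    fnv1a64(chunk.text.as_bytes()) == chunk.chunk_hash && fnv1a64(source) == chunk.source_hash
}
```

Because nothing in make_chunk depends on the clock or on randomness, building the same chunk twice gives equal values, and any change to the chunk text or to the source bytes makes verify_chunk return false.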
