palimpsest-extract
HTML-to-text extraction and RAG chunking with full provenance tracking. Every chunk carries its source URL, capture timestamp, content hash, and character offset.
ExtractedDocument
pub struct ExtractedDocument {
    pub url: String,
    pub title: Option<String>,
    pub description: Option<String>,
    pub text: String,
    pub text_length: usize,
    pub chunks: Vec<ContentChunk>,
    pub text_hash: String,
    pub source_hash: String,
    pub captured_at: String,
}
extract_document
pub fn extract_document(
    raw_response: &[u8],
    source_url: &Url,
    captured_at: CaptureInstant,
    source_hash: ContentHash,
    chunk_config: &ChunkConfig,
) -> ExtractedDocument
Pipeline: raw HTTP response -> strip headers -> HTML to clean text -> chunk with provenance.
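The "strip headers" step of the pipeline can be sketched as follows. This is a minimal illustration, not the crate's implementation: the helper name strip_http_headers is hypothetical, and it assumes the raw response separates headers from body with the standard CRLF CRLF blank line.

```rust
/// Hypothetical helper (not part of the documented API): returns the body of
/// a raw HTTP response, i.e. everything after the first CRLF CRLF sequence.
/// If no header terminator is found, the input is returned unchanged.
fn strip_http_headers(raw_response: &[u8]) -> &[u8] {
    const TERMINATOR: &[u8] = b"\r\n\r\n";
    raw_response
        // Slide a 4-byte window over the response to find the blank line.
        .windows(TERMINATOR.len())
        .position(|window| window == TERMINATOR)
        // Keep only the bytes after the terminator.
        .map(|start| &raw_response[start + TERMINATOR.len()..])
        .unwrap_or(raw_response)
}
```

The returned slice borrows from the input, so no bytes are copied before the HTML-to-text step runs.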
ContentChunk
pub struct ContentChunk {
    pub text: String,
    pub source_url: Url,
    pub captured_at: CaptureInstant,
    pub source_hash: ContentHash,
    pub chunk_hash: ContentHash, // BLAKE3 of chunk text
    pub chunk_index: usize,
    pub total_chunks: usize,
    pub char_offset: usize,      // Position in source text
}
ChunkConfig
pub struct ChunkConfig {
    pub target_size: usize, // Default: 1000 characters
    pub overlap: usize,     // Default: 200 characters
}
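The documented defaults could be expressed through the standard Default trait, as in the sketch below. Whether the crate actually implements Default for ChunkConfig is an assumption; the struct is redeclared here only so the example stands alone.

```rust
/// Redeclared for a self-contained example; mirrors the struct above.
pub struct ChunkConfig {
    pub target_size: usize,
    pub overlap: usize,
}

/// Assumption: the crate may or may not provide this impl. The values are
/// the defaults stated in the field comments.
impl Default for ChunkConfig {
    fn default() -> Self {
        Self {
            target_size: 1000,
            overlap: 200,
        }
    }
}
```

With such an impl, a caller could override one field and keep the other default using struct update syntax, for example ChunkConfig { target_size: 500, ..ChunkConfig::default() }.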
Chunking Strategy
Splitting respects natural boundaries in priority order:
- Paragraph boundaries (double newline)
- Sentence boundaries (period/question/exclamation + space)
- Word boundaries (space)
- Character boundary (last resort)
Each chunk overlaps the next by ChunkConfig.overlap characters, so context that straddles a boundary appears in both chunks.
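The boundary priority and overlap described above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the crate's algorithm: the function names are hypothetical, offsets are counted in Unicode scalar values (chars), and a boundary is only accepted if it lies past the overlap region so that every chunk adds new text.

```rust
/// Hypothetical helper: position just after the last two-character boundary
/// in `window`, if that position lies beyond `min_len`.
fn last_boundary(
    window: &[char],
    min_len: usize,
    is_boundary: impl Fn(char, char) -> bool,
) -> Option<usize> {
    window
        .windows(2)
        .rposition(|pair| is_boundary(pair[0], pair[1]))
        .map(|i| i + 2)
        .filter(|&split| split > min_len)
}

/// Chooses how many chars of `window` to keep, trying boundaries in priority
/// order: paragraph, sentence, word, then a plain character cut.
fn find_split(window: &[char], min_len: usize) -> usize {
    last_boundary(window, min_len, |a, b| a == '\n' && b == '\n')
        .or_else(|| {
            last_boundary(window, min_len, |a, b| {
                matches!(a, '.' | '?' | '!') && b == ' '
            })
        })
        .or_else(|| {
            window
                .iter()
                .rposition(|&c| c == ' ')
                .map(|i| i + 1)
                .filter(|&split| split > min_len)
        })
        .unwrap_or(window.len())
}

/// Splits `text` into (char_offset, chunk_text) pairs of at most
/// `target_size` chars, each sharing `overlap` chars with its predecessor.
fn chunk_text(text: &str, target_size: usize, overlap: usize) -> Vec<(usize, String)> {
    assert!(overlap < target_size, "overlap must be smaller than target_size");
    let chars: Vec<char> = text.chars().collect();
    let mut chunks: Vec<(usize, String)> = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let max_end = (start + target_size).min(chars.len());
        // The final chunk takes whatever text remains.
        let end = if max_end == chars.len() {
            max_end
        } else {
            start + find_split(&chars[start..max_end], overlap)
        };
        chunks.push((start, chars[start..end].iter().collect()));
        if end == chars.len() {
            break;
        }
        // Step back by `overlap` so the next chunk repeats the tail of this one.
        start = end - overlap;
    }
    chunks
}
```

Because the split position depends only on the text and the two size parameters, calling chunk_text twice on the same input yields identical offsets and chunk contents.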
Key Invariant
Extraction is deterministic: the same input always produces the same chunks and the same hashes. Every chunk’s provenance chain is complete: chunk_hash -> source_hash -> source_url + captured_at.
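The invariant can be illustrated with a small sketch. Everything here is hypothetical: the Provenance struct is a simplified stand-in for the chunk's provenance fields, and FNV-1a replaces BLAKE3 only so the example needs no external crate.

```rust
/// FNV-1a (64-bit), used here only as a dependency-free stand-in for BLAKE3.
fn stand_in_hash(bytes: &[u8]) -> String {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    format!("{:016x}", hash)
}

/// Hypothetical, simplified provenance record with plain string fields.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Provenance {
    chunk_hash: String,
    source_hash: String,
    source_url: String,
    captured_at: String,
}

/// Builds the chain chunk_hash -> source_hash -> source_url + captured_at.
/// No clock, randomness, or global state is consulted, so the result
/// depends only on the arguments.
fn provenance_for(
    chunk_text: &str,
    raw_source: &[u8],
    source_url: &str,
    captured_at: &str,
) -> Provenance {
    Provenance {
        chunk_hash: stand_in_hash(chunk_text.as_bytes()),
        source_hash: stand_in_hash(raw_source),
        source_url: source_url.to_string(),
        captured_at: captured_at.to_string(),
    }
}
```

Two calls with identical arguments return equal records, while changing a single character of the chunk text changes chunk_hash and leaves source_hash untouched.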