tala-intent

Converts raw shell command strings into structured Intent objects. The crate implements the IntentExtractor trait from tala-core through the IntentPipeline, which orchestrates four stages: tokenization, embedding generation, classification, and intent assembly.

Key Types

| Type | Description |
|------|-------------|
| Token | A parsed token from a raw shell command |
| IntentPipeline | The full extraction pipeline (implements IntentExtractor) |

Key Functions

| Function | Description |
|----------|-------------|
| tokenize(raw) | Parse a raw command string into structured tokens |
| hash_context(ctx) | FNV-1a hash of a Context for the intent's context_hash field |

Token

An enum representing a single parsed element from a shell command. The tokenizer handles pipes, redirects, flags, quoted strings, and backslash escapes.

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Token {
    /// The command name (first word, or first word after a pipe).
    Command(String),
    /// A positional argument.
    Arg(String),
    /// A flag (short `-x` or long `--foo`).
    Flag(String),
    /// Pipe operator `|`.
    Pipe,
    /// Input redirect `<`.
    RedirectIn,
    /// Output redirect `>` or append `>>`.
    RedirectOut { append: bool },
}

tokenize

/// Tokenize a raw shell command into structured tokens.
///
/// Handles:
/// - Pipes (`|`)
/// - Redirects (`<`, `>`, `>>`)
/// - Flags (`-x`, `--flag`)
/// - Quoted strings (single and double)
/// - Backslash escapes within double quotes
///
/// This is a simplified shell splitter, not a full POSIX parser.
pub fn tokenize(raw: &str) -> Vec<Token>;

Examples

use tala_intent::{tokenize, Token};

// Simple command with flags and arguments
let tokens = tokenize("ls -la /tmp");
assert_eq!(tokens, vec![
    Token::Command("ls".into()),
    Token::Flag("-la".into()),
    Token::Arg("/tmp".into()),
]);

// Pipeline
let tokens = tokenize("cat file.txt | grep error");
assert_eq!(tokens, vec![
    Token::Command("cat".into()),
    Token::Arg("file.txt".into()),
    Token::Pipe,
    Token::Command("grep".into()),
    Token::Arg("error".into()),
]);

// Redirects
let tokens = tokenize("sort < input.txt >> output.txt");
assert_eq!(tokens, vec![
    Token::Command("sort".into()),
    Token::RedirectIn,
    Token::Arg("input.txt".into()),
    Token::RedirectOut { append: true },
    Token::Arg("output.txt".into()),
]);

// Quoted strings
let tokens = tokenize("echo \"hello world\"");
assert_eq!(tokens, vec![
    Token::Command("echo".into()),
    Token::Arg("hello world".into()),
]);

// Empty input yields no tokens
assert!(tokenize("").is_empty());
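
Sketch: a simplified splitter

To make the documented behavior concrete, here is a minimal sketch of how a simplified splitter with these rules could be written. It is not the crate's implementation: `sketch_tokenize` and the `Splitter` helper are hypothetical names, the `Token` enum is redeclared locally so the block compiles on its own, and details such as treating a quoted word as an argument rather than a flag are assumptions.

```rust
// Illustrative sketch only; the real tokenizer in tala-intent may differ.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Token {
    Command(String),
    Arg(String),
    Flag(String),
    Pipe,
    RedirectIn,
    RedirectOut { append: bool },
}

/// Hypothetical helper holding the splitter state.
struct Splitter {
    out: Vec<Token>,
    word: String,
    quoted: bool,     // the current word contained a quoted section
    at_command: bool, // the next word is a command name
}

impl Splitter {
    /// Classify the word collected so far and append it to the output.
    fn flush(&mut self) {
        if self.word.is_empty() && !self.quoted {
            return;
        }
        let word = std::mem::take(&mut self.word);
        let token = if self.at_command {
            self.at_command = false;
            Token::Command(word)
        } else if !self.quoted && word.len() > 1 && word.starts_with('-') {
            Token::Flag(word)
        } else {
            Token::Arg(word)
        };
        self.out.push(token);
        self.quoted = false;
    }
}

pub fn sketch_tokenize(raw: &str) -> Vec<Token> {
    let mut s = Splitter {
        out: Vec::new(),
        word: String::new(),
        quoted: false,
        at_command: true,
    };
    let mut chars = raw.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            c if c.is_whitespace() => s.flush(),
            '|' => {
                s.flush();
                s.out.push(Token::Pipe);
                s.at_command = true; // the first word after a pipe is a command
            }
            '<' => {
                s.flush();
                s.out.push(Token::RedirectIn);
            }
            '>' => {
                s.flush();
                let append = chars.peek() == Some(&'>');
                if append {
                    chars.next(); // consume the second '>'
                }
                s.out.push(Token::RedirectOut { append });
            }
            '\'' => {
                // Single quotes: copy everything literally up to the closing quote.
                s.quoted = true;
                for q in chars.by_ref() {
                    if q == '\'' {
                        break;
                    }
                    s.word.push(q);
                }
            }
            '"' => {
                // Double quotes: a backslash escapes the next character.
                s.quoted = true;
                while let Some(q) = chars.next() {
                    match q {
                        '"' => break,
                        '\\' => {
                            if let Some(escaped) = chars.next() {
                                s.word.push(escaped);
                            }
                        }
                        other => s.word.push(other),
                    }
                }
            }
            other => s.word.push(other),
        }
    }
    s.flush();
    s.out
}
```

The sketch keeps a single "next word is a command" flag that is set at the start and again after every pipe, which is enough to reproduce the Command/Arg distinction described for Token.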

IntentPipeline

The main extraction engine. Implements IntentExtractor from tala-core. Construction pre-computes exemplar embeddings for all intent categories, making classification a pure cosine-similarity lookup with no runtime model loading.

The pipeline is Send + Sync and can be shared across threads.
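
As an illustration of what Send + Sync allows, the following sketch shares one instance across worker threads through an `Arc`. `StandInPipeline` and its `word_count` method are hypothetical stand-ins so the block compiles without the crate; with the real type you would wrap an `IntentPipeline` in the `Arc` instead.

```rust
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for `IntentPipeline`; not part of tala-intent.
pub struct StandInPipeline;

impl StandInPipeline {
    /// Placeholder for real work such as `extract`: counts whitespace-separated words.
    pub fn word_count(&self, raw: &str) -> usize {
        raw.split_whitespace().count()
    }
}

/// Compiles only when `T` is both `Send` and `Sync`.
fn assert_send_sync<T: Send + Sync>() {}

/// Process each command on its own thread using one shared pipeline.
pub fn share_across_threads(commands: Vec<String>) -> Vec<usize> {
    assert_send_sync::<StandInPipeline>();
    let pipeline = Arc::new(StandInPipeline);
    let handles: Vec<_> = commands
        .into_iter()
        .map(|cmd| {
            let shared = Arc::clone(&pipeline);
            thread::spawn(move || shared.word_count(&cmd))
        })
        .collect();
    // Joining in spawn order keeps results aligned with the input order.
    handles
        .into_iter()
        .map(|handle| handle.join().expect("worker thread panicked"))
        .collect()
}
```

Because construction pre-computes the exemplar embeddings, building the pipeline once and cloning the `Arc` is likely cheaper than constructing one pipeline per thread.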

pub struct IntentPipeline { /* private */ }

impl IntentPipeline {
    /// Create a new pipeline. Pre-computes exemplar embeddings
    /// for intent classification.
    pub fn new() -> Self;

    /// Tokenize a raw command string.
    pub fn tokenize(&self, raw: &str) -> Vec<Token>;

    /// Generate a 384-dimensional embedding for a raw command.
    /// Uses a deterministic hash-based bag-of-characters approach
    /// with L2 normalization to unit length.
    pub fn embed(&self, raw: &str) -> Vec<f32>;

    /// Classify a command given its embedding.
    /// Compares against pre-computed exemplars using cosine similarity
    /// and returns the category with the highest average match.
    pub fn classify(&self, embedding: &[f32]) -> IntentCategory;
}

impl Default for IntentPipeline {
    fn default() -> Self { Self::new() }
}

IntentExtractor Implementation

impl IntentExtractor for IntentPipeline {
    /// Extract a structured Intent from a raw command string and context.
    ///
    /// Pipeline:
    /// 1. Validate and tokenize the raw command
    /// 2. Generate a 384-dim embedding
    /// 3. Classify the intent category
    /// 4. Hash the execution context
    /// 5. Assemble the Intent with a random ID and current timestamp
    ///
    /// Returns `TalaError::ExtractionFailed` if the command is empty
    /// or produces no tokens.
    fn extract(&self, raw: &str, context: &Context) -> Result<Intent, TalaError>;
}

Example

use tala_core::{Context, IntentExtractor};
use tala_intent::IntentPipeline;

let pipeline = IntentPipeline::new();

let ctx = Context {
    cwd: "/home/user/project".to_string(),
    env_hash: 42,
    session_id: 1,
    shell: "zsh".to_string(),
    user: "ops".to_string(),
};

let intent = pipeline.extract("cargo build --release", &ctx).unwrap();

assert_eq!(intent.raw_command, "cargo build --release");
assert_eq!(intent.embedding.len(), 384);
assert!(intent.context_hash != 0);
assert!(intent.outcome.is_none());
assert!((intent.confidence - 1.0).abs() < f32::EPSILON);

IntentCategory

Classification of an intent's purpose. The classifier uses pre-computed exemplar embeddings for each category and finds the best match by average cosine similarity.

// Defined in tala-core, used by tala-intent
pub enum IntentCategory {
    Build,      // cargo build, make, gcc, npm run build
    Deploy,     // kubectl apply, docker push, terraform apply
    Debug,      // gdb, strace, perf record, valgrind
    Configure,  // vim ~/.bashrc, chmod, chown, git config
    Query,      // grep, find, ps aux, df -h, curl
    Navigate,   // cd, ls, pwd, tree
    Other(String),
}

The classifier falls back to Other if the best average similarity score is below 0.05.
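
The selection rule described above can be sketched as follows. This is an assumption-laden illustration, not the crate's code: `classify_sketch`, `cosine`, and `FALLBACK_THRESHOLD` are hypothetical names, the exemplar table is passed in explicitly rather than pre-computed, and the string stored in `Other` is a made-up placeholder because the source does not say what the real payload is.

```rust
// Local copy of the category enum so the sketch compiles on its own.
#[derive(Clone, Debug, PartialEq)]
pub enum IntentCategory {
    Build,
    Deploy,
    Debug,
    Configure,
    Query,
    Navigate,
    Other(String),
}

/// Threshold taken from the text: below this, fall back to `Other`.
const FALLBACK_THRESHOLD: f32 = 0.05;

/// Cosine similarity; returns 0.0 when either vector has zero length.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Pick the category whose exemplars have the highest average similarity.
pub fn classify_sketch(
    embedding: &[f32],
    exemplars: &[(IntentCategory, Vec<Vec<f32>>)],
) -> IntentCategory {
    let mut best: Option<(f32, &IntentCategory)> = None;
    for (category, vectors) in exemplars {
        if vectors.is_empty() {
            continue; // avoid dividing by zero for an empty exemplar set
        }
        let total: f32 = vectors.iter().map(|v| cosine(embedding, v)).sum();
        let average = total / vectors.len() as f32;
        if best.map_or(true, |(score, _)| average > score) {
            best = Some((average, category));
        }
    }
    match best {
        Some((score, category)) if score >= FALLBACK_THRESHOLD => category.clone(),
        // Placeholder payload; the real contents of `Other` are not documented here.
        _ => IntentCategory::Other("unclassified".to_string()),
    }
}
```

Averaging over all exemplars of a category, rather than taking the single closest exemplar, makes the result less sensitive to one unusually similar example.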


hash_context

/// Hash a Context into a u64 for the Intent's context_hash field.
///
/// Uses FNV-1a over all context fields: cwd, env_hash, session_id,
/// shell, and user. The hash is deterministic for the same input.
pub fn hash_context(ctx: &Context) -> u64;

Example

use tala_core::Context;
use tala_intent::hash_context;

let ctx = Context {
    cwd: "/home/user".to_string(),
    env_hash: 0,
    session_id: 1,
    shell: "bash".to_string(),
    user: "ops".to_string(),
};

let h1 = hash_context(&ctx);
let h2 = hash_context(&ctx);
assert_eq!(h1, h2); // deterministic
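
Sketch: FNV-1a over the context fields

For readers unfamiliar with FNV-1a, the sketch below shows the 64-bit variant applied to a context-like struct. The offset basis and prime are the standard published FNV constants. Everything else is an assumption made for illustration: `SketchContext` stands in for `tala_core::Context` with guessed integer widths, and the field order, little-endian integer encoding, and `0xff` string delimiter are choices of this sketch, so its output will not necessarily match `hash_context`.

```rust
// Standard 64-bit FNV-1a parameters.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Stand-in for `tala_core::Context`; the integer field types are assumed.
pub struct SketchContext {
    pub cwd: String,
    pub env_hash: u64,
    pub session_id: u64,
    pub shell: String,
    pub user: String,
}

/// Fold `bytes` into a running FNV-1a state: XOR the byte in, then multiply.
pub fn fnv1a_update(mut hash: u64, bytes: &[u8]) -> u64 {
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hash every field of the context into one u64 (field order is an assumption).
pub fn hash_context_sketch(ctx: &SketchContext) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for text in [&ctx.cwd, &ctx.shell, &ctx.user] {
        hash = fnv1a_update(hash, text.as_bytes());
        // 0xff never occurs in valid UTF-8, so it separates adjacent strings
        // unambiguously ("ab" + "c" hashes differently from "a" + "bc").
        hash = fnv1a_update(hash, &[0xff]);
    }
    hash = fnv1a_update(hash, &ctx.env_hash.to_le_bytes());
    hash = fnv1a_update(hash, &ctx.session_id.to_le_bytes());
    hash
}
```

FNV-1a has no random seed, which is what makes the result deterministic across calls and across processes, unlike Rust's default `HashMap` hasher.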

Embedding Details

The embedding generator produces 384-dimensional unit-length vectors using a deterministic hash-based approach. Each byte of the input command contributes to three positions in the vector via multiplicative hashing (FNV-like), with sign determined by hash bits and amplitude decayed by position (earlier characters contribute more). The result is L2-normalized.

This is a placeholder for real ML embeddings. Because it works at the character level, command strings that share many characters produce similar vectors, which keeps cosine-similarity comparisons meaningful.

| Property | Value |
|----------|-------|
| Dimensionality | 384 |
| Normalization | L2 (unit length) |
| Determinism | Same input always produces same output |
| Similarity preservation | Similar strings yield similar vectors |
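
The scheme described in this section can be sketched roughly as below. The source specifies only the overall shape (three positions per byte, hash-derived sign, position decay, L2 normalization), so the concrete choices here are assumptions: `embed_sketch` is a hypothetical name, the FNV-style mixing step, the index and sign bits, and the `1 / (1 + 0.1 * pos)` decay curve are illustrative, and the vectors it produces will not match `IntentPipeline::embed`.

```rust
const DIM: usize = 384;
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Deterministic hash-based embedding sketch, L2-normalized to unit length.
pub fn embed_sketch(raw: &str) -> Vec<f32> {
    let mut vector = vec![0.0_f32; DIM];
    for (pos, &byte) in raw.as_bytes().iter().enumerate() {
        // Assumed decay curve: earlier bytes get a larger amplitude.
        let amplitude = 1.0 / (1.0 + 0.1 * pos as f32);
        // The hash depends only on the byte value (bag-of-characters),
        // so the same character always lands on the same three slots.
        let mut hash = FNV_OFFSET_BASIS ^ u64::from(byte);
        for round in 0..3_u64 {
            hash = (hash ^ round).wrapping_mul(FNV_PRIME);
            let index = (hash >> 32) as usize % DIM;
            let sign = if (hash >> 63) == 0 { 1.0 } else { -1.0 };
            vector[index] += sign * amplitude;
        }
    }
    // L2-normalize; an all-zero vector (empty input) is left as is.
    let norm = vector.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in &mut vector {
            *x /= norm;
        }
    }
    vector
}
```

Since position affects only the amplitude and not the target slots, two commands that use the same characters accumulate weight in the same dimensions, which is where the similarity-preservation property comes from.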