A Truth Filter for AI Output
An experiment: pointing property-based testing at a specification instead of code, and writing down what happened
Twenty-eight claims through the filter. One came out falsified. Six small structural ingredients the prose hadn't named.
An AI wrote me a 36-kilobyte paper on how to build a second brain. It had theorems, proof sketches, and citation chains, and it read like the real thing. I wanted to know which parts of it actually were. So I took every falsifiable claim in it and ran it through a property-based testing harness — the same kind of tool Jepsen, TigerBeetle, and the Hypothesis ecosystem use to break distributed systems — to see what would break. Twenty-seven of the 28 encoded claims held up under random inputs. One — a universal-quantifier encoding of "replay always improves recall" — was falsified by a minimal shrunk counterexample and re-encoded as a statistical claim, which passed. Along the way, six small structural ingredients surfaced — things the synthesis hadn't spelled out, not because the AI was wrong, but because prose doesn't naturally name every structural requirement a working implementation needs. What follows is how it went. It's one experiment, one artifact, shared in case the method is useful to someone else.
The starting observation
AI systems produce plausible-looking ideas quickly — output with the surface properties of the thing it's imitating. Research syntheses with citation chains. Architectural proposals with flowcharts. Code with conventions. Reasoning traces that locally look sound. It reads internally consistent and professionally styled. Whether any given claim inside it holds up under implementation is a separate question the prose doesn't usually address.
This isn't a criticism of AI output, and the same thing is true of human writing: prose describes, implementation tests. What got me curious was whether property-based testing — a tool most engineers associate with verifying code — could be pointed at the specification layer instead, and what it would catch if it could.
So I tried it. One synthesis, every falsifiable claim turned into a property, a couple of sessions of careful work. Here's what I saw.
Prose describes; implementation tests.
The tool
The toolchain is small; the pedigree is deep.
Hegel (hegel.dev) is a property-based testing framework built for cross-language use. Its Rust bindings (hegeltest) speak a protocol to a server descended from Hypothesis — David R. MacIver's Python framework, which in turn descended from John Hughes's QuickCheck for Haskell. Hegel makes this testing lineage available to Rust and Go, with additional languages on the roadmap. You write a property in your language of choice; Hegel generates random inputs, runs the property, and — critically — when it finds a failing input, it shrinks the counterexample to the smallest input that still fails.
This family of tools has been quietly holding the floor on some of the hardest problems in software engineering for two decades. Hypothesis validates the Python standard library and is used by AstraZeneca, Stripe, Mozilla, and countless production teams. QuickCheck and its descendants verify compilers, databases, and distributed systems. Jepsen has used the same discipline of randomized adversarial testing to find consensus bugs in Postgres, Redis, MongoDB, and a generation of distributed data stores. TigerBeetle's deterministic simulation testing is built on the same foundation. Antithesis applies it autonomously at scale to customer software. These are not incidental users — they are systems where being wrong is expensive, and the reason they all converge on property-based testing is that, when correctness matters, you do not want a test that confirms your assumptions; you want a framework whose job is to try to break them.
Hegel belongs in that lineage. Nothing in this experiment changes the tool. What changes is the layer it's pointed at.
Property-based testing inverts the question most tests ask. Example-based testing asks: "does this example work?" PBT asks: "what example could break this?" The framework's job is not to confirm a claim but to try to falsify it. When falsification succeeds, the shrunk counterexample is the signal — a minimal input exhibiting the failure.
Deterministic Simulation Testing is the broader discipline: randomized simulation under controlled conditions, reproducible by seed, designed to surface emergent bugs. Property-based testing is its coordinate system.
For this experiment I applied the same tool, unchanged, to a different target — not code, but the specification itself. An AI's generated prose describes a conjecture. PBT describes a conjecture plus a mechanism for trying to disprove it. The conjecture alone is unfalsifiable until you wire the mechanism.
Here is how it looks in practice. Research.md states the Hopfield descent theorem with a proof sketch: asynchronous single-neuron updates monotonically decrease energy. The Rust test:
```rust
#[hegel::test]
fn descent_under_async_update(tc: TestCase) {
    let seed = tc.draw(gs::integers::<u64>().min_value(1).max_value(1u64 << 40));
    let i = tc.draw(gs::integers::<usize>().min_value(0).max_value(N - 1));
    let mut rng = Rng::new(seed);
    let t = build_symmetric_weights(&mut rng);
    let theta = build_thresholds(&mut rng);
    let mut v = build_bipolar_state(&mut rng);

    let e_before = energy(&t, &theta, &v);
    async_update(&t, &theta, &mut v, i);
    let e_after = energy(&t, &theta, &v);

    assert!(e_after <= e_before + 1e-9); // monotone descent, within float tolerance
}
```
Hegel runs this function a hundred times with different seed and i. Every pass is a specific symmetric weight matrix, threshold vector, binary state, and index where the energy did in fact decrease. A failure would mean the proof transcription is wrong — and Hegel would shrink to the minimal input making it so.
Five classes of claim
Not every claim in a paper is directly testable the same way. We classified each into one of five buckets as a planning step.
| Class | Shape | Example |
|---|---|---|
| A | Directly provable over random finite inputs. | Hopfield descent theorem; Oja's rule convergence to top eigenvector. |
| B | Requires simulation plus tolerance. | Echo-state property; attractor basin completion; Hill-function stability. |
| B-stat | Claim is about average behavior over a distribution, not pointwise. Needs inner Monte Carlo + confidence interval. | "Replay improves recall on average"; capacity scales with K; natural gradient is invariant under reparameterization. |
| C | Needs heavy external tooling (eigensolver, LP, TDA, Gillespie). Document and defer. | CRN semilinearity; higher-dim persistent homology. |
| D | Philosophical / falsification boundary. Not a property to satisfy — a bar any positive claim must clear. | Protein-folding NP-completeness; microtubule decoherence critique. |
The classification matters because it determines the test's shape. A class-A claim becomes a clean #[hegel::test] with an assert!. A class-B-stat claim becomes a Monte Carlo harness that asserts MeanCI::lower_95() > 0. A class-D claim becomes a hypotheses/falsification-targets/*.md card — no executable test, just a boundary statement.
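To make the B-stat shape concrete, here is a minimal sketch of an accumulator behind an assertion like `MeanCI::lower_95() > 0`. Only the `lower_95()` name comes from the suite; the internals below — a Welford accumulator with a normal-approximation interval — are my own illustration, not the repo's code.

```rust
// Illustrative Welford accumulator; the internals are an assumption,
// only the lower_95() name appears in the actual suite.
pub struct MeanCI {
    n: usize,
    mean: f64,
    m2: f64, // sum of squared deviations (Welford's online update)
}

impl MeanCI {
    pub fn new() -> Self {
        Self { n: 0, mean: 0.0, m2: 0.0 }
    }

    pub fn push(&mut self, x: f64) {
        self.n += 1;
        let d = x - self.mean;
        self.mean += d / self.n as f64;
        self.m2 += d * (x - self.mean);
    }

    /// Lower edge of an approximate 95% confidence interval on the mean
    /// (normal approximation, fine for the sample sizes used here).
    pub fn lower_95(&self) -> f64 {
        let var = self.m2 / (self.n as f64 - 1.0);
        self.mean - 1.96 * (var / self.n as f64).sqrt()
    }
}
```

A class-B-stat test then runs an inner Monte Carlo loop, pushes each sampled improvement into the accumulator, and asserts `ci.lower_95() > 0.0`.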
We started with twenty claims. By project end, all the class-C claims we initially flagged turned out to be reducible to B or B-stat with small self-contained implementations — ~60 lines of vertex enumeration for LP, Kruskal for 0-dim persistence, a xorshift PRNG for Monte Carlo. No external deps were pulled in.
The hypothesis-card convention
Every claim gets a matched pair of files:
- `hypotheses/<id>.md` — a hypothesis card with frontmatter (source line range, class, status, test path) and a short body explaining the claim and the property
- `tests/<id>.rs` — one `#[hegel::test]` function encoding the property
The index (hypotheses/index.md) is the single table of record, with one row per claim. When a test's status changes — pending → encoded → passing → falsified — both the card header and the index row update together.
Here's a real card, for the Hopfield descent theorem:
```markdown
---
id: hopfield-descent
source: research.md L74, L83-L91
class: A
status: passing
test: tests/hopfield_descent.rs::descent_under_async_update
---

**Claim.** In a classical symmetric Hopfield network with T_ij = T_ji,
zero diagonal, thresholds θ_i, and binary activities V_i ∈ {-1, +1},
asynchronous single-neuron updates with V_i' = sign(h_i) where
h_i = Σ_j T_ij V_j - θ_i monotonically decrease the energy.

**Proof sketch from spec.** ΔE = -(V_i' - V_i) · h_i. Choosing
V_i' = sign(h_i) makes the product ≥ 0, hence ΔE ≤ 0.

**Property.** For any symmetric T with zero diagonal, any θ, any
V ∈ {-1, +1}^n, and any index i, a single asynchronous update
satisfies E(V') ≤ E(V) within floating-point tolerance.

**Why this test.** The spec proves the statement; a failure means
either the encoding diverges from the spec or the proof has a
transcription error. Either way, the shrunk counterexample is the
valuable output.
```
The pattern scales. For 28 claims, we have 28 cards and 28 test files. Every file is short, every claim is traceable to research.md by line number, every pass or fail has a home.
The experiment, chronologically
Starting small: class A
We began with the two claims research.md proves in its own body — Hopfield descent and Oja's rule convergence. Class A: directly provable, just instantiate random inputs and check. Both passed, 100 property cases each, no counterexamples. Toolchain wired end-to-end, convention proved out. Suite runtime: two seconds.
Class B expansion
Next came simulation-and-tolerance claims: echo-state property, STDP with homeostatic scaling, attractor basin completion, reservoir-readout training. Eleven tests total, suite at ten seconds.
One pattern emerged immediately: construct your inputs to satisfy preconditions at draw time. Don't reject invalid inputs via tc.assume() — that silently drops coverage. If a property says "for any symmetric matrix with spectral radius < 1...", then draw values that produce such matrices by construction, not by generate-and-reject. Hegel's recommended style; we lived it.
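As a sketch of what "by construction" means here: suppose a property needs a matrix with spectral radius below 1. Rather than drawing entries and rejecting bad matrices, scale by the maximum absolute row sum — an upper bound on the spectral radius, by Gershgorin's theorem. The helper below is my illustration of the pattern, not the repo's code.

```rust
// Scale a raw matrix so its spectral radius is provably below `target`.
// The max absolute row sum bounds the spectral radius (Gershgorin), so
// scaling to that bound satisfies the precondition with no rejection.
fn bounded_spectral_radius(raw: &[Vec<f64>], target: f64) -> Vec<Vec<f64>> {
    let max_row_sum = raw
        .iter()
        .map(|row| row.iter().map(|x| x.abs()).sum::<f64>())
        .fold(0.0_f64, f64::max);
    let scale = if max_row_sum > 0.0 { target / max_row_sum } else { 1.0 };
    raw.iter()
        .map(|row| row.iter().map(|x| x * scale).collect())
        .collect()
}
```

Inside a Hegel test, `raw` would be synthesized from a drawn seed; every generated input then satisfies the precondition, so the shrinker never wastes work on rejected cases.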
The first real falsification
hippo-replay-consolidation. The claim: replay of stored patterns improves recall under interference.
First draft: universal pointwise — for every (stored, noise, cue) tuple, Σ-Hamming-with-replay ≤ Σ-Hamming-without-replay.
First run: passed, 100 cases, no counterexamples.
Second run: failed. Hegel had drawn different inputs. It shrunk to a specific all-−1 ferromagnetic noise pattern where replay hurt recall — Σ Hamming of 4 with replay, vs 2 without.
Σ hamming (with replay) = 4 > 2 = Σ hamming (no replay)
— preserved in the test file with an #[ignore] attribute, so future readers see the artifact and can re-run it to verify the falsification. Deletion would have erased the lesson.
The counterexample wasn't a bug in the test. It was a signal about the claim's scope. Research.md L182 says "improved sample efficiency through offline updates" — a statistical claim about the distribution of behaviors, not a universal one. We had over-reached by encoding it as pointwise.
We preserved the falsified test as #[ignore] with the counterexample recorded, then wrote a class-B-stat version: draw distribution parameters via Hegel, inner Monte Carlo sampling via a seeded xorshift, assert 95% CI lower bound on mean improvement > 0. It passed. We closed the loop and recorded the lesson in memory.
This moment was the first clear demonstration of what the filter actually does. It doesn't just confirm the spec's theorems — it detects when we've mis-encoded them, and forces us to sharpen the encoding.
Scaling up
The next concern was operational scale.
Tests at N=8, K=2 are scaffolding, not validation. Real memory systems have N in the thousands or millions. We pushed to N=256 where possible, N=128 for composition tests, N=64 for B-stat tests with meaningful inner-sample budgets.
Two mechanics made this feasible.
First, cargo test --release — the Rust optimizer gave 5–10× compute headroom. A composition test that took 8 seconds in debug mode took 0.8 seconds in release.
Second, seed-driven generation. Hegel communicates draws to its Python server via CBOR over stdio. The serialization cost is superlinear in draw size; above roughly a thousand floats per test case (N² for an N×N matrix, so N around 32), throughput collapses. A test that runs 100 cases at N=16 in 0.4 seconds can take 12 seconds at N=48 — because 2,000+ floats per case through CBOR is slow, not because the actual computation is slow.
The fix was to have Hegel draw just a u64 seed plus a few scalar hyperparameters, and have the test body synthesize the large random structures (weight matrices, spike trains, noise sequences) from the seed using a local xorshift PRNG. A fourteen- to forty-six-fold speedup in practice, with no loss of property-based coverage — Hegel still explores the parameter distribution, it just does so via the seed rather than per-entry.
```rust
pub struct Rng { pub state: u64 }

impl Rng {
    pub fn new(seed: u64) -> Self {
        // xorshift64 must not start at zero, which is a fixed point
        let s = if seed == 0 { 0xDEAD_BEEF_CAFE_BABE } else { seed };
        Self { state: s }
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.state = x;
        x
    }

    pub fn next_f64(&mut self) -> f64 {
        // top 53 bits → uniform in [0, 1)
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }

    pub fn next_bipolar(&mut self) -> i8 {
        if self.next_u64() & 1 == 1 { 1 } else { -1 }
    }
}
```
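For completeness, here is what a seed-driven builder might look like on top of that PRNG. The `N` value and the uniform init are illustrative assumptions, not the repo's exact builder, and the `Rng` is repeated in minimal form so the sketch compiles on its own.

```rust
// Minimal xorshift64 PRNG, repeated so this sketch stands alone.
pub struct Rng { pub state: u64 }

impl Rng {
    pub fn new(seed: u64) -> Self {
        // zero is a fixed point of xorshift, so remap it
        Self { state: if seed == 0 { 0xDEAD_BEEF_CAFE_BABE } else { seed } }
    }
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        self.state = x;
        x
    }
    pub fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

const N: usize = 64; // illustrative size, not the repo's constant

// Hegel draws only the seed; the test body derives the bulk structure
// locally: symmetric, zero diagonal, entries uniform in [-1, 1).
fn build_symmetric_weights(rng: &mut Rng) -> Vec<Vec<f64>> {
    let mut t = vec![vec![0.0; N]; N];
    for i in 0..N {
        for j in (i + 1)..N {
            let w = 2.0 * rng.next_f64() - 1.0;
            t[i][j] = w; // mirror across the diagonal so T_ij = T_ji,
            t[j][i] = w; // as the Hopfield descent property requires
        }
    }
    t
}
```

The same seed always reproduces the same matrix, so a failing case remains replayable even though Hegel never saw the individual entries.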
With these two mechanics, a suite that would have taken five minutes at scale ran in sixty-three seconds. Twenty-seven tests, N up to 256.
Six things that came up in composition
Some of the most interesting moments in the experiment weren't corroborations. They were places where the first honest encoding of a claim failed, and fixing the failure meant introducing a small structural ingredient the synthesis hadn't mentioned.
Here are the six that surprised me most, annotated by how they actually arose. Three of them were filter-extracted in the strong sense — Hegel produced a shrunk counterexample and the minimum change that made the test pass turned out to be a concrete architectural ingredient. Three of them were engineer-noticed — the filter failed on my first encoding, but what I had actually missed was a definitional / textbook prerequisite, not a novel requirement. Both kinds of outcome were useful; they're not the same kind of finding, and the prose below labels each one.
- **Sparse-index codes need pairwise Hamming distance ≥ 3.** (filter-extracted) Priority 1 in the spec's roadmap: a hippocampal-indexed attractor memory — hippo (sparse index) plus Hopfield (attractor refinement). At α = 2.0 — twice the pattern dimension, fourteen times Hopfield's classical 0.14 capacity — recall started at 64%. Widening the signature space to K=12 bits got me to 89%. Only constructing signatures with minimum pairwise Hamming distance ≥ 3 — so any 1-bit cue flip unambiguously routes to one stored pattern — pushed recall to 100%. The paper mentions "sparse addressing" but never specifies distance properties. The coding-theoretic condition came out of iterating against shrunk counterexamples.
- **Reservoir signatures must be hybridized with content bits.** (filter-extracted) Priority 6: reservoir-based streaming. The intuition: drive a reservoir with random inputs, take the sign bits of its state as a temporal signature. First attempt: only 2 unique signatures across 32 events, recall at 8%. The reservoir under random ±1 scalar drive collapses to a low-dimensional attractor; its sign bits mostly track the last input. Spectral-radius rescaling helped (8% → 50%). The fix that took the test to 93% was making the signature half reservoir-derived, half random event content. Research.md's flowchart shows events → reservoir → sparse index → attractor, without mentioning that the index must mix reservoir state with event content for enough bit diversity.
- **STDP-based salience gating requires canonical pre-before-post timing.** (filter-extracted) Priority 3: neuromodulated STDP write gate. The first draft triggered salient events with pre and post firing simultaneously. Result: mean Δ = −0.027 — modulated learning concentrated less weight on the target synapse than unmodulated. STDP with simultaneous spikes gives zero net plasticity (traces increment after plasticity in canonical ordering; LTP and LTD both fire against zero traces). The fix that made the test pass was a two-step salient protocol — pre fires at t, post fires at t+1. Whether the spec implicitly required this or my first encoding under-constrained it is a judgment, not a test output — but the encoding that produced the correct sign is the one with canonical LTP timing.
- **Scheduling claims need a forgetting mechanism.** (engineer-noticed, definitional) Priority 4: replay-driven consolidation scheduler. Pure additive Hebbian is order-independent (just a sum of outer products), so any "scheduling" claim on top of it is vacuous. The test failed because its claim had no semantic content under additive Hebbian; adding a `hebbian_add_decaying` operator was the fix I chose. The filter surfaced that the claim-as-stated couldn't be tested; that the fix is a forgetting operator is a matter of linear algebra, not a discovery. Still worth flagging, because the spec proposes a scheduler without naming a forgetting operator as prerequisite.
- **Combinatorial-structure claims have general-position preconditions.** (engineer-noticed, textbook) The MST-based invariance test passed at N=6 for multiple sessions. Scaled to N=16, it failed on one seed — Hegel shrunk to a point cloud with two coincident points. At coincident points, the MST is not unique; the edge set differs under different tie-break orders. The spec's "robustness to monotone distortions" implicitly assumes distinct pairwise distances — the standard general-position assumption in the TDA literature. Any working TDA implementation already handles this (jitter or a tie-break rule). The filter's contribution here was reliably finding the omitted precondition; it didn't discover a novel requirement.
- **Info-geometric "invariance" is not the same as "better conditioning."** (engineer-noticed, textbook) Priority 8: natural-gradient meta-controller. The first draft tested whether natural gradient converges faster than raw gradient on Bernoulli MLE near the boundary. It failed: mean Δ = +0.0028, raw slightly faster. The "better conditioning" phrase is a conditional claim — it depends on problem geometry. The universal claim of information geometry is reparameterization invariance: natural gradient in p-space and in logit(p)-space give the same trajectory in p-space. Raw gradient doesn't. This is in the standard texts (Amari; Martens); the filter's role was forcing me to notice I'd conflated the two.
The integrated run
Toward the end of the experiment I wired every primitive into one streaming system and ran it as a single unit, just to see what happened. Until then each primitive had been validated in isolation and paired into a handful of small compositions; the question the integrated run asks is whether the pieces still cohere when you turn them all on at once.
- A random reservoir provides temporal context between events
- Hybrid sparse signatures — half from reservoir state, half from random event content — drive the hippo index
- Hippo sparse index routes cues to the nearest stored event by signature
- Hopfield with decaying Hebbian weights provides distributed attractor memory that forgets unless refreshed
- Scheduled replay during periodic sleep windows re-adds recent events to the substrate
- Retrieval composes Hopfield attractor refinement with signature-based fallback
Forty-eight events streamed over time. α = 0.75 — well over Hopfield's 0.14 capacity. Decay at 2% per write. Sleep window every four events, replaying the last three. Cue each stored event with one random bit flip, measure recall@1.
The acceptance criterion: mean_per_trial − 2·SE > 0.80. Observed mean across runs ≈ 0.91; observed lower_95 floor ≈ 0.85. Reproduce with cargo test --release --test integrated_second_brain.
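That criterion reads directly as a ten-line check. The per-trial recall vector in the usage below is illustrative, not the repo's observed data.

```rust
/// True if the mean recall minus two standard errors clears `floor` —
/// the mean_per_trial − 2·SE > 0.80 criterion from the integrated run.
fn passes_floor(recalls: &[f64], floor: f64) -> bool {
    let n = recalls.len() as f64;
    let mean = recalls.iter().sum::<f64>() / n;
    // sample variance, then standard error of the mean
    let var = recalls.iter().map(|r| (r - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let se = (var / n).sqrt();
    mean - 2.0 * se > floor
}
```

With, say, eight trials clustered around 0.91 recall, the two-SE lower bound sits near 0.90 and the 0.80 floor passes with margin.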
For this one artifact at this scale, the architectural ecology described in the flowchart at L129–L143 held together when I wired all the pieces at once. Every primitive I added seemed to contribute something measurable, and nothing obviously dead-ended. That's what the experiment produced. I'm not making a larger claim about what happens at different scales or on different artifacts.
The numbers
The filter covered all ten priorities from research.md's validation roadmap.
| # | Priority | Composition test | Status |
|---|---|---|---|
| 1 | Hippocampal-indexed attractor | second-brain-stream | 100% recall @ α=2.0 (14× Hopfield capacity) |
| 2 | Benna-Fusi multi-timescale core | benna-fusi-capacity(+scaling) | passing |
| 3 | Neuromodulated STDP write gate | neuromod-stdp-gated | salience concentration CI > 0 |
| 4 | Replay-driven consolidation | replay-consolidation-scheduler | scheduled replay > always-online |
| 5 | CRN/GRN control plane | crn-mode-switch | 95%+ mode-switch reliability |
| 6 | Reservoir temporal encoder | second-brain-stream-temporal | 93% recall streaming |
| 7 | Topological indexing | tda-cluster-persistence | perfect MST-cut cluster recovery |
| 8 | Info-geometric meta-controller | info-geometric-controller | reparameterization invariance |
| 9 | FBA budget allocator | fba-budget-allocator | LP feasibility + optimality + monotonicity |
| 10 | Microtubule (falsification target) | — (class D card) | bar stated, not expected to be cleared |
| — | Integrated ecology | integrated-second-brain | 91% recall end-to-end |
Five things I noticed
Small patterns that came up more than once during the experiment. I don't know how far they generalize; they were at least useful to me, and they seem like the kind of thing that might hold up in adjacent cases. Offered as observations, not rules.
1. Construct preconditions at draw time; don't marshal bulk data through the framework
Two related things. First: if a property has a precondition, encode it in the draws — dependent min_value/max_value, permutations with .unique(true), bounded ranges derived from earlier draws. Do not use tc.assume() to reject invalid inputs; that silently drops coverage and slows the shrinker. Second: if your PBT framework serializes draws across a process boundary (Hegel and Hypothesis do, via CBOR to their Python server), the per-case marshalling cost is superlinear in draw size. Have Hegel draw only (seed, hyperparams) and derive bulk structure (weight matrices, long spike trains) from a local PRNG. 10–50× speedups in practice, with the trade-off that the framework can no longer shrink individual entries of the big structures — usually acceptable for the kinds of claims papers make.
2. Spec prose verbs hide statistical quantifiers
"Improves," "faster," "more accurate," "better" usually mean on average over some distribution. Encoding them as pointwise universal invariants over-reaches. Use class B-stat with inner Monte Carlo. When the spec says "X improves Y," your first question should be "over what distribution of inputs?" If the prose doesn't say, it almost certainly means "iid samples from a natural distribution" — and your test should look like that.
3. Claim failures are spec failures, not test failures
When Hegel shrinks to a counterexample, your first instinct should not be "what's wrong with my test." It should be "what's wrong with my claim." Usually the claim was missing a precondition that the filter just revealed, or was a statistical claim encoded as a universal. Fix the claim, not the test. The test was doing its job. And: keep the falsified artifact — mark it #[ignore] with the shrunk counterexample in the test body, rather than deleting it. The re-encoded version sits beside the original, and the lesson is preserved.
4. Scale reveals claims
Toy-scale tests at N=8 can pass for claims that break at N=32. N=32 can pass for claims that break at N=256. Scale is a specification-tightening tool. Don't stop at the first N where the test passes — push until you find where the claim actually lives.
5. Composition is where architecture hides
Individual primitive tests corroborate primitives. Composition tests validate architectures. The most interesting bugs live in composition — places where the paper's prose chains primitives together implicitly, and the filter reveals that the connection requires an ingredient the prose didn't name. In this experiment, every one of the six structural ingredients came from a composition test, not a primitive test. I don't know why — it's a distribution worth noting.
Try it yourself
The project lives at hegel-as-truth-filter. Clone and run:
```shell
cargo test --release
```
You should see 27 passing tests and 1 ignored (the preserved falsification) in about 63 seconds. Every hypothesis card in hypotheses/ traces to a line range in research.md — the full AI-generated synthesis is in the repo, so you can audit the artifact yourself rather than take the write-up's characterization on trust. The Rust toolchain is pinned in rust-toolchain.toml for reproducibility.
To apply this method to your own research paper:
- Identify claims by line number. Not every sentence — look for theorems, proof sketches, stated results, or proposed systems.
- Classify each claim as A, B, B-stat, C, or D.
- For each, write a one-page hypothesis card with the claim, its source, and its operational property.
- Write the test. Start small. Let Hegel shrink on failures.
- When a test fails, ask what architectural precondition the claim is missing.
- Scale up. Compose. Integrate.
- When all priorities are covered, you'll have two artifacts: a working substrate of verified primitives, and a list of engineering requirements the paper didn't name.
The process takes days to weeks per artifact, depending on density. It's slower than just implementing. What came out of it for me — this once — was a working substrate of verified primitives plus a short list of small structural ingredients the paper hadn't named. Whether it's worth the cost for any particular case depends on what the case is.
Claim failures are spec failures, not test failures.
Property-based testing is not a new idea. Pointing it at a specification rather than an implementation is not a new idea either — formal methods people have been doing variations of this for decades. The only thing this write-up tries to do is share the experience of trying it on an AI-generated artifact, end to end, at a scale small enough to fit on a laptop, with the results visible.
Hegel was the tool I used. Hypothesis, QuickCheck, proptest, Jepsen, TigerBeetle's DST, Antithesis — all different instantiations of the same basic idea, that a claim should survive a framework's attempt to falsify it. Any of them would have supported the same kind of experiment. I happened to pick Hegel because it's the Rust binding to the Hypothesis server, and I wanted to write the tests in Rust.
If a specification is coherent, the filter corroborates it. If it contains silent assumptions, the filter surfaces them. If it's inconsistent, the filter sometimes finds the contradiction in a minimal form. For this one artifact, it earned its keep. If you try something similar, I'd be curious what you find.