
MUNINN

The first benchmark for memory validation pipelines — contradiction detection, importance ranking, and honest refusal on retrieved memories.

Headline numbers (RAVEN v0.1): 0.847 validation F1 · 0.921 contradiction recall · 17.3% refusal rate

What It Measures

Muninn is the first public benchmark that measures memory validation — what happens after retrieval. Given a set of retrieved memories (from any retrieval system, including MemPalace), a validation pipeline must detect contradictions, rank by importance, apply temporal decay, and refuse to surface low-confidence results.
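
A minimal sketch of what such a pipeline looks like, in Python. The Memory fields, the 30-day half-life, the 0.5 confidence threshold, and the stubbed contradiction check are illustrative assumptions, not part of the benchmark contract:

```python
import math
import time
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    importance: float  # assumed 0..1 importance from the memory store
    timestamp: float   # unix seconds when the memory was written
    confidence: float  # assumed 0..1 reliability estimate

HALF_LIFE_S = 30 * 24 * 3600  # assumed 30-day half-life for temporal decay

def decayed_score(m: Memory, now: float) -> float:
    """Importance weighted by exponential temporal decay."""
    age = max(0.0, now - m.timestamp)
    return m.importance * math.exp(-math.log(2) * age / HALF_LIFE_S)

def contradicts(a: Memory, b: Memory) -> bool:
    """Placeholder contradiction check; a real pipeline would use an NLI model."""
    return False

def validate(memories: list[Memory], now: float | None = None):
    """Split retrieved memories into (surfaced, refused).

    Refuses low-confidence memories and anything that contradicts an
    already-accepted memory, then ranks survivors by decayed importance.
    """
    if now is None:
        now = time.time()
    surfaced: list[Memory] = []
    refused: list[Memory] = []
    for m in memories:
        if m.confidence < 0.5 or any(contradicts(m, s) for s in surfaced):
            refused.append(m)
        else:
            surfaced.append(m)
    surfaced.sort(key=lambda m: decayed_score(m, now), reverse=True)
    return surfaced, refused
```

The four steps map directly onto the benchmark's scoring: contradiction detection feeds contradiction recall, the refusal branch feeds refusal rate, and the surfaced/refused split as a whole feeds validation F1.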

Existing memory benchmarks (LongMemEval and others) measure retrieval recall: did you find the right chunk? Muninn measures the next layer: once you have the chunks, can you tell which are reliable, which contradict each other, and which the agent should never see? RAVEN is the open-source validation system this benchmark scores. The benchmark is the receipt; the system is the product.

Muninn is complementary to LongMemEval, not a replacement: retrieval and validation are different problems.

Results
System                                | Validation F1 | Contradiction Recall | Refusal Rate
RAVEN v0.1 (COSMIC)                   | 0.847         | 0.921                | 17.3% (reported per-class)
Pass-through baseline (no validation) | 0.412         | 0.000                | 0.0%
Simple dedup baseline                 | 0.503         | 0.114                | 0.0%
LLM-judge baseline                    | 0.681         | 0.742                | variable

Baselines are real implementations. Pass-through returns all retrieved memories unfiltered. LLM-judge uses a GPT-4-class model as judge; it is non-deterministic and included for reference only.
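
For concreteness, hedged sketches of the two deterministic baselines as described above, reusing the Memory record from the earlier sketch; the actual implementations live in the benchmark repo:

```python
def pass_through(memories: list[Memory]) -> list[Memory]:
    """No validation: surface every retrieved memory, refuse nothing."""
    return list(memories)

def simple_dedup(memories: list[Memory]) -> list[Memory]:
    """Drop exact-duplicate texts (first occurrence wins); never refuses."""
    seen: set[str] = set()
    kept: list[Memory] = []
    for m in memories:
        key = m.text.strip().lower()
        if key not in seen:
            seen.add(key)
            kept.append(m)
    return kept
```

Neither baseline ever refuses, which is why both score 0.0% on refusal rate.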

Reproducibility
Corpus source: curated memory sets, public domain
Corpus seal:   SHA-256 in CHECKPOINT_RESULTS.md
Repo:          github.com/jourdanlabs/benchmarks/muninn
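
Verifying the corpus seal is a few lines with hashlib. The corpus filename below is a placeholder; the recorded hash comes from CHECKPOINT_RESULTS.md:

```python
import hashlib

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare against the seal recorded in CHECKPOINT_RESULTS.md, e.g.:
# assert sha256_of("corpus.jsonl") == RECORDED_SEAL  # filename is a placeholder
```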
Limitations

Synthetic contradiction injection. Some contradictions are synthetically injected into the corpus. Real-world contradictions may differ in distribution.

Retrieval assumed perfect. Muninn evaluates validation given retrieved results. It does not evaluate retrieval itself — that's LongMemEval's domain.

LLM-judge non-determinism. The LLM-judge baseline uses non-deterministic inference. Reported score is median of 5 runs.
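
For clarity, the aggregation is a plain median over the five runs. The per-run scores below are hypothetical, chosen only so the median matches the reported 0.681:

```python
from statistics import median

# Hypothetical per-run F1 scores; only the median (0.681) is reported.
runs = [0.668, 0.675, 0.681, 0.690, 0.702]
assert median(runs) == 0.681
```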