MUNINN
The first benchmark for memory validation pipelines: contradiction detection, importance ranking, temporal decay, and honest refusal on retrieved memories.

Muninn is the first public benchmark that measures memory validation: what happens after retrieval. Given a set of retrieved memories (from any retrieval system, including MemPalace), a validation pipeline must detect contradictions, rank by importance, apply temporal decay, and refuse to surface low-confidence results.
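To make that contract concrete, here is a minimal sketch of such a pipeline. Every name in it (`Memory`, `validate`, the half-life, the contested-memory penalty) is an illustrative assumption, not Muninn's actual API; the point is the four obligations, not the specific heuristics.

```python
from dataclasses import dataclass
import time

@dataclass
class Memory:
    text: str
    timestamp: float        # unix seconds when the memory was written
    retrieval_score: float  # relevance score from the upstream retriever

@dataclass
class ValidatedMemory:
    memory: Memory
    confidence: float       # pipeline's belief that the memory is reliable
    conflicts: list[int]    # indices of memories this one contradicts

def contradicts(a: Memory, b: Memory) -> bool:
    """Stub. A real pipeline substitutes an NLI model or rule-based checker."""
    return False

def temporal_decay(mem: Memory, half_life_days: float = 30.0) -> float:
    """Exponential decay: a memory's weight halves every half_life_days."""
    age_days = (time.time() - mem.timestamp) / 86400.0
    return 0.5 ** (age_days / half_life_days)

def validate(memories: list[Memory], min_confidence: float = 0.5) -> list[ValidatedMemory]:
    """The contract Muninn scores: detect contradictions, apply temporal
    decay, rank by confidence, and refuse to surface low-confidence results."""
    out = []
    for i, mem in enumerate(memories):
        clashes = [j for j, other in enumerate(memories)
                   if j != i and contradicts(mem, other)]
        confidence = mem.retrieval_score * temporal_decay(mem)
        if clashes:
            confidence *= 0.5  # contested memories are penalized, not silently kept
        if confidence >= min_confidence:  # honest refusal: drop everything below threshold
            out.append(ValidatedMemory(mem, confidence, clashes))
    return sorted(out, key=lambda v: v.confidence, reverse=True)
```

The specific heuristics here (multiplicative decay, a flat penalty for contested memories) are placeholders; what Muninn scores is whether the four obligations are met at all.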
Existing memory benchmarks such as LongMemEval measure retrieval recall: did you find the right chunk? Muninn measures the next layer: once you have the chunks, can you tell which are reliable, which contradict each other, and which the agent should never see? RAVEN is the open-source validation system evaluated against this benchmark. The benchmark is the receipt; the system is the product.
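For intuition, the core failure Muninn probes looks like this invented pair of retrieved memories (the schema is illustrative, not Muninn's corpus format):

```python
# Invented example: two retrieved memories asserting incompatible facts.
retrieved = [
    {"id": "mem_0412", "text": "User moved to Berlin in 2021.", "timestamp": 1672531200},
    {"id": "mem_0897", "text": "User has lived in Lisbon since 2019.", "timestamp": 1688169600},
]
# A validation pipeline should flag the pair as mutually contradictory
# instead of surfacing both as reliable.
```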
Muninn is complementary to LongMemEval, not a replacement: retrieval and validation are different problems.
Baselines are real implementations. Pass-through returns all retrieved memories unfiltered. LLM-judge scores memories with a GPT-4-class model; it is non-deterministic and included for reference only.
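For reference, the pass-through baseline amounts to the following, expressed against the illustrative types from the sketch above (again an assumption about the harness, not the benchmark's real interface):

```python
# Hypothetical pass-through baseline: every retrieved memory is surfaced
# unfiltered, with full confidence and no contradiction checks.
def passthrough_baseline(memories: list[Memory]) -> list[ValidatedMemory]:
    return [ValidatedMemory(memory=m, confidence=1.0, conflicts=[]) for m in memories]
```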
Synthetic contradiction injection. Some contradictions are injected into the corpus rather than occurring naturally; real-world contradictions may differ in distribution.
Retrieval assumed perfect. Muninn evaluates validation given retrieved results; it does not evaluate retrieval itself, which is LongMemEval's domain.
LLM-judge non-determinism. The LLM-judge baseline uses non-deterministic inference; the reported score is the median of 5 runs.
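Reproducing that protocol is straightforward; a sketch, where `run_llm_judge` is a hypothetical stand-in for one full benchmark pass returning a single score:

```python
from statistics import median
from typing import Callable

# Median-of-5 protocol for the non-deterministic LLM-judge baseline.
def score_llm_judge(run_llm_judge: Callable[[], float], n_runs: int = 5) -> float:
    return median(run_llm_judge() for _ in range(n_runs))
```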