ORACLE
Deterministic factual claim verification with honest refusal.
What it is
ORACLE is a 6-stage deterministic claim verification pipeline (CLAIM DECOMPOSITION → NOVA → ECLIPSE → PULSAR → LUNA → AURORA). It verifies factual claims against a curated knowledge base of approximately 100 facts, spanning history, science, geography, technology, and general knowledge.
The critical design choice: AURORA refuses to emit a verdict when aggregate confidence falls below 0.70. This means ORACLE refuses on 67.5% of the 200-claim test corpus — and counts those refusals correctly when the claim is genuinely unsupportable from the KB. Honest refusal is not a failure mode. It is a first-class output.
Both baselines — CONFIDENT_ALWAYS and NAIVE_KEYWORD — have zero refusal rate. They answer everything and are wrong far more often. CONFIDENT_ALWAYS emits VERIFIED for every claim and achieves 31% accuracy (the fraction of gold-VERIFIED claims). NAIVE_KEYWORD uses keyword matching and achieves 25%.
Results
| System | Accuracy | Macro F1 | Refusal Rate |
|---|---|---|---|
| ORACLE v0.1 | 0.5100 | 0.3097 | 0.6750 |
| CONFIDENT_ALWAYS (baseline) | 0.3100 | 0.1578 | 0.0000 |
| NAIVE_KEYWORD (baseline) | 0.2500 | 0.1333 | 0.0000 |
Per-verdict breakdown (ORACLE v0.1)
| Verdict | Precision | Recall | F1 | TP | FP | FN |
|---|---|---|---|---|---|---|
| VERIFIED | 0.7593 | 0.6613 | 0.7069 | 41 | 13 | 21 |
| REFUTED | 1.0000 | 0.1250 | 0.2222 | 11 | 0 | 77 |
| UNSUPPORTED | 0.0000 | 0.0000 | 0.0000 | 0 | 0 | 0 |
Per-domain accuracy
| Domain | Total | Correct | Accuracy |
|---|---|---|---|
| science | 53 | 32 | 0.6038 |
| history | 48 | 28 | 0.5833 |
| general | 50 | 29 | 0.5800 |
| technology | 33 | 11 | 0.3333 |
| geography | 16 | 2 | 0.1250 |
Reproducibility
cd5de198497a5cf09e372aa99745cac940c774b4d212da70902d382a71911ad2200 (62 VERIFIED · 88 REFUTED · 50 UNSUPPORTED)2026-04-23T07:19:10Zgithub.com/jourdanlabs/benchmarks/oracleLimitations
Small KB (v0.1). The knowledge base contains approximately 100 curated facts. REFUTED recall is only 12.5% — the pipeline correctly identifies refuted claims when it fires, but refuses on most REFUTED claims due to low confidence. KB expansion is the primary lever for recall improvement.
Geography domain accuracy: 12.5%. The KB has thin geographic coverage. Geography claims are almost universally refused or misclassified.
v0.1 baseline only. This is a v0.1 seal. The pipeline architecture is proven; the current accuracy reflects KB size, not pipeline quality. v0.2 targets KB expansion and REFUTED recall improvement.
Next version
ORACLE v0.2 targets KB expansion (1,000+ facts across all five domains), improved REFUTED recall, and calibration verification. The reliability diagram showing confidence-vs-accuracy correlation is included in the repo.