ORACLE · Factual Verification

ORACLE

Deterministic factual claim verification with honest refusal.

Accuracy: 51.0%

vs CONFIDENT_ALWAYS: +20pp

Refusal rate: 67.5%

What it is

ORACLE is a 6-stage deterministic claim verification pipeline (CLAIM DECOMPOSITION → NOVA → ECLIPSE → PULSAR → LUNA → AURORA). It verifies factual claims against a curated knowledge base of approximately 100 facts, spanning history, science, geography, technology, and general knowledge.

The critical design choice: AURORA refuses to emit a verdict when aggregate confidence falls below 0.70. As a result, ORACLE refuses on 67.5% of the 200-claim test corpus (135 claims), and a refusal is scored as correct when the claim is genuinely unsupportable from the KB. Honest refusal is not a failure mode. It is a first-class output.
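The refusal rule can be sketched in a few lines. This is a minimal illustration, not the AURORA implementation: the function names are hypothetical, the candidate verdict and confidence are assumed to arrive from the earlier stages, and the scoring rule (a refusal is correct only when the gold label is UNSUPPORTED) is inferred from the description above.

```python
from typing import Optional

# From the text: no verdict is emitted when aggregate confidence falls below 0.70.
REFUSAL_THRESHOLD = 0.70


def gate_verdict(candidate: str, confidence: float) -> Optional[str]:
    """Return the candidate verdict, or None (a refusal) when confidence is too low."""
    if confidence < REFUSAL_THRESHOLD:
        return None
    return candidate


def is_correct(output: Optional[str], gold: str) -> bool:
    """Assumed scoring rule: a refusal counts as correct only for gold-UNSUPPORTED claims."""
    if output is None:
        return gold == "UNSUPPORTED"
    return output == gold
```

Under this reading, a confidence of exactly 0.70 still emits a verdict, and a refusal on a gold-REFUTED claim is scored as an error.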

Both baselines, CONFIDENT_ALWAYS and NAIVE_KEYWORD, have a zero refusal rate. They answer every claim and are wrong far more often. CONFIDENT_ALWAYS emits VERIFIED for every claim and achieves 31% accuracy, which is exactly the fraction of gold-VERIFIED claims in the corpus (62 of 200). NAIVE_KEYWORD uses keyword matching and achieves 25%.
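The CONFIDENT_ALWAYS figure follows directly from the gold label distribution. The sketch below uses the corpus counts reported under Reproducibility; the function name is illustrative and not taken from the repo.

```python
# Gold label counts for the 200-claim corpus, as reported under Reproducibility.
gold_counts = {"VERIFIED": 62, "REFUTED": 88, "UNSUPPORTED": 50}


def constant_baseline_accuracy(counts, predicted_label):
    """Accuracy of a system that emits the same label for every claim."""
    total = sum(counts.values())
    return counts.get(predicted_label, 0) / total


confident_always_acc = constant_baseline_accuracy(gold_counts, "VERIFIED")

# Gap between ORACLE's reported accuracy (0.51) and the baseline, in percentage points.
gap_pp = round((0.51 - confident_always_acc) * 100)
```

Here `confident_always_acc` is 62 / 200 = 0.31 and `gap_pp` is 20, matching the headline "+20pp".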

Results

System	Accuracy	Macro F1	Refusal Rate
ORACLE v0.1	0.5100	0.3097	0.6750
CONFIDENT_ALWAYS (baseline)	0.3100	0.1578	0.0000
NAIVE_KEYWORD (baseline)	0.2500	0.1333	0.0000

Per-verdict breakdown (ORACLE v0.1)

Verdict	Precision	Recall	F1	TP	FP	FN
VERIFIED	0.7593	0.6613	0.7069	41	13	21
REFUTED	1.0000	0.1250	0.2222	11	0	77
UNSUPPORTED	0.0000	0.0000	0.0000	0	0	0
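The precision, recall, and F1 columns can be recomputed from the TP/FP/FN counts, and Macro F1 is their unweighted mean over the three verdicts. The sketch below assumes that a class with no predictions and no counted instances scores 0.0, which is what the UNSUPPORTED row shows.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard per-class metrics; an undefined ratio is reported as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# (TP, FP, FN) per verdict, copied from the table above.
counts = {
    "VERIFIED": (41, 13, 21),
    "REFUTED": (11, 0, 77),
    "UNSUPPORTED": (0, 0, 0),
}

per_class = {label: precision_recall_f1(*c) for label, c in counts.items()}

# Macro F1: unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(per_class)
```

Rounded to four places, this reproduces the table values (0.7593, 0.6613, 0.7069 for VERIFIED; 1.0000, 0.1250, 0.2222 for REFUTED) and the reported Macro F1 of 0.3097.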

Per-domain accuracy

Domain	Total	Correct	Accuracy
science	53	32	0.6038
history	48	28	0.5833
general	50	29	0.5800
technology	33	11	0.3333
geography	16	2	0.1250
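As a consistency check, the per-domain rows should add up to the corpus size and to the overall accuracy. A small sketch using the figures from the table above:

```python
# (total, correct) per domain, copied from the table above.
domains = {
    "science": (53, 32),
    "history": (48, 28),
    "general": (50, 29),
    "technology": (33, 11),
    "geography": (16, 2),
}

domain_accuracy = {name: correct / total for name, (total, correct) in domains.items()}

total_claims = sum(total for total, _ in domains.values())
total_correct = sum(correct for _, correct in domains.values())
overall_accuracy = total_correct / total_claims
```

The totals come to 200 claims and 102 correct, so `overall_accuracy` is 0.51, in agreement with the headline figure.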

Reproducibility

Corpus SHA-256: cd5de198497a5cf09e372aa99745cac940c774b4d212da70902d382a71911ad2
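A local copy of the corpus can be checked against this digest with the standard library. This is a minimal sketch: the corpus file name is not stated on this page, so the path in the usage comment is a placeholder.

```python
import hashlib

# Digest published above for the 200-claim corpus.
EXPECTED_SHA256 = "cd5de198497a5cf09e372aa99745cac940c774b4d212da70902d382a71911ad2"


def sha256_of_file(path, chunk_size=65536):
    """Hash a file in chunks so large corpora need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def corpus_matches(path, expected=EXPECTED_SHA256):
    """True when the file's SHA-256 equals the published digest."""
    return sha256_of_file(path) == expected.lower()


# Usage (hypothetical file name; substitute the actual corpus path from the repo):
# print(corpus_matches("corpus.jsonl"))
```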

Total claims: 200 (62 VERIFIED · 88 REFUTED · 50 UNSUPPORTED)

Generated: 2026-04-23T07:19:10Z

Repo: github.com/jourdanlabs/benchmarks/oracle

Limitations

Small KB (v0.1). The knowledge base contains approximately 100 curated facts. REFUTED recall is only 12.5%: every REFUTED verdict the pipeline emits is correct (precision 1.00), but it emits one for only 11 of the 88 gold-REFUTED claims and refuses on most of the rest due to low confidence. KB expansion is the primary lever for recall improvement.
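The published counts allow a rough reconciliation of where the refusals fall. The sketch below is an inference from the tables, not a figure reported by the benchmark: it assumes that overall accuracy counts VERIFIED true positives, REFUTED true positives, and correct refusals, and nothing else.

```python
TOTAL_CLAIMS = 200
GOLD_UNSUPPORTED = 50

# Verdicts actually emitted: TP + FP for each verdict in the per-verdict table.
emitted_verified = 41 + 13
emitted_refuted = 11 + 0
emitted = emitted_verified + emitted_refuted

refused = TOTAL_CLAIMS - emitted
refusal_rate = refused / TOTAL_CLAIMS

# Correct answers overall (sum of the per-domain "Correct" column).
correct_total = 32 + 28 + 29 + 11 + 2

# Assumption: whatever is correct beyond the true positives is a correct refusal.
correct_refusals = correct_total - (41 + 11)

# VERIFIED false negatives were all refused, since REFUTED has no false positives.
refused_verified = 21
# If every gold-UNSUPPORTED claim was refused, the remainder are gold-REFUTED.
refused_refuted = refused - refused_verified - GOLD_UNSUPPORTED
```

This gives 65 emitted verdicts and 135 refusals (67.5%), with 50 correct refusals, which equals the number of gold-UNSUPPORTED claims. Under that reading, 64 of the 88 gold-REFUTED claims were refused and the remaining 13 misses are the VERIFIED false positives.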

Geography domain accuracy: 12.5%. The KB has thin geographic coverage. Geography claims are almost universally refused or misclassified.

v0.1 baseline only. This is a v0.1 seal. The pipeline architecture is proven; the current accuracy reflects KB size, not pipeline quality. v0.2 targets KB expansion and REFUTED recall improvement.

Next version

ORACLE v0.2 targets KB expansion (1,000+ facts across all five domains), improved REFUTED recall, and calibration verification. The reliability diagram showing confidence-vs-accuracy correlation is included in the repo.
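For readers unfamiliar with reliability diagrams, the sketch below shows one common way to build the underlying bins and summarize them as expected calibration error. It is a generic illustration with made-up inputs, not the procedure used in the repo; the function names and the choice of ten equal-width bins are assumptions.

```python
def reliability_bins(confidences, outcomes, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns one (bin_index, count, mean_confidence, accuracy) row per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, outcomes):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    rows = []
    for index, members in enumerate(bins):
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        rows.append((index, len(members), mean_conf, accuracy))
    return rows


def expected_calibration_error(rows):
    """Count-weighted mean gap between confidence and accuracy across bins."""
    total = sum(count for _, count, _, _ in rows)
    return sum(count * abs(conf - acc) for _, count, conf, acc in rows) / total


# Illustrative data only (not ORACLE outputs).
rows = reliability_bins([0.75, 0.75, 0.95, 0.95], [True, False, True, True])
ece = expected_calibration_error(rows)
```

A well-calibrated system has per-bin accuracy close to per-bin mean confidence, so the error summary stays near zero.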
