ORACLE · Factual Verification

ORACLE

Deterministic factual claim verification with honest refusal.

Accuracy: 51.0%
vs CONFIDENT_ALWAYS: +20pp
Refusal rate: 67.5%

What it is

ORACLE is a 6-stage deterministic claim verification pipeline (CLAIM DECOMPOSITION → NOVA → ECLIPSE → PULSAR → LUNA → AURORA). It verifies factual claims against a curated knowledge base of approximately 100 facts, spanning history, science, geography, technology, and general knowledge.

The critical design choice: AURORA refuses to emit a verdict when aggregate confidence falls below 0.70. As a result, ORACLE refuses on 67.5% of the 200-claim test corpus, and a refusal is scored as correct when the claim is genuinely unsupportable from the KB. Honest refusal is not a failure mode. It is a first-class output.
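The refusal rule can be sketched as a small gate. This is a minimal illustration, not the AURORA implementation: the names `Decision`, `gate`, and `score` are hypothetical, and the scoring rule (a refusal is correct only when the gold label is UNSUPPORTED) is an assumption based on the description above.

```python
from dataclasses import dataclass
from typing import Optional

# From the text: no verdict is emitted when confidence falls below 0.70.
REFUSAL_THRESHOLD = 0.70


@dataclass(frozen=True)
class Decision:
    verdict: Optional[str]  # "VERIFIED", "REFUTED", or None on refusal
    confidence: float
    refused: bool


def gate(candidate_verdict: str, confidence: float,
         threshold: float = REFUSAL_THRESHOLD) -> Decision:
    """Emit the candidate verdict only if confidence reaches the threshold."""
    if confidence < threshold:
        return Decision(verdict=None, confidence=confidence, refused=True)
    return Decision(verdict=candidate_verdict, confidence=confidence, refused=False)


def score(decision: Decision, gold: str) -> bool:
    """Assumed scoring: a refusal is correct only for gold-UNSUPPORTED claims."""
    if decision.refused:
        return gold == "UNSUPPORTED"
    return decision.verdict == gold
```

Under this sketch a confidence of exactly 0.70 still yields a verdict, since the text says refusal happens when confidence falls below the threshold.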

Both baselines, CONFIDENT_ALWAYS and NAIVE_KEYWORD, have a zero refusal rate. They answer everything and are wrong far more often. CONFIDENT_ALWAYS emits VERIFIED for every claim and achieves 31% accuracy (the fraction of gold-VERIFIED claims, 62 of 200). NAIVE_KEYWORD uses keyword matching and achieves 25%.
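The CONFIDENT_ALWAYS figure follows directly from the gold label distribution. The sketch below reproduces it and the +20pp headline gap. The function names are hypothetical, and NAIVE_KEYWORD is omitted because its matching rule is not described here.

```python
# Gold label counts from the corpus summary: 62 VERIFIED, 88 REFUTED, 50 UNSUPPORTED.
GOLD_COUNTS = {"VERIFIED": 62, "REFUTED": 88, "UNSUPPORTED": 50}


def confident_always(_claim: str) -> str:
    """Baseline that answers VERIFIED regardless of the claim."""
    return "VERIFIED"


def constant_baseline_accuracy(gold_counts: dict, predicted_label: str) -> float:
    """A constant predictor is right exactly on the claims carrying its label."""
    total = sum(gold_counts.values())
    return gold_counts.get(predicted_label, 0) / total


baseline_acc = constant_baseline_accuracy(GOLD_COUNTS, confident_always("any claim"))
oracle_acc = 0.51  # reported ORACLE v0.1 accuracy
gap_pp = round((oracle_acc - baseline_acc) * 100)  # percentage points

print(f"CONFIDENT_ALWAYS accuracy: {baseline_acc:.2f}")
print(f"ORACLE gap: +{gap_pp}pp")
```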


Results

| System | Accuracy | Macro F1 | Refusal Rate |
| --- | --- | --- | --- |
| ORACLE v0.1 | 0.5100 | 0.3097 | 0.6750 |
| CONFIDENT_ALWAYS (baseline) | 0.3100 | 0.1578 | 0.0000 |
| NAIVE_KEYWORD (baseline) | 0.2500 | 0.1333 | 0.0000 |

Per-verdict breakdown (ORACLE v0.1)

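| Verdict | Precision | Recall | F1 | TP | FP | FN |
| --- | --- | --- | --- | --- | --- | --- |
| VERIFIED | 0.7593 | 0.6613 | 0.7069 | 41 | 13 | 21 |
| REFUTED | 1.0000 | 0.1250 | 0.2222 | 11 | 0 | 77 |
| UNSUPPORTED | 0.0000 | 0.0000 | 0.0000 | 0 | 0 | 0 |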
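The per-verdict figures and the macro F1 can be recomputed from the TP/FP/FN counts. The sketch below uses the standard definitions and treats an undefined ratio (zero denominator) as 0.0, which matches the UNSUPPORTED row. The final check assumes every emitted verdict is counted once as a TP or FP of its class, so the remaining claims are refusals.

```python
def prf(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F1 from raw counts; undefined ratios become 0.0."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


# (TP, FP, FN) per verdict, copied from the breakdown table.
COUNTS = {
    "VERIFIED": (41, 13, 21),
    "REFUTED": (11, 0, 77),
    "UNSUPPORTED": (0, 0, 0),
}

per_verdict = {verdict: prf(*c) for verdict, c in COUNTS.items()}
macro_f1 = sum(f1 for _, _, f1 in per_verdict.values()) / len(per_verdict)

# Assumption: emitted verdicts = sum of TP + FP; everything else was refused.
total_claims = 200
answered = sum(tp + fp for tp, fp, _ in COUNTS.values())
refusal_rate = (total_claims - answered) / total_claims

for verdict, (p, r, f1) in per_verdict.items():
    print(f"{verdict:12s} P={p:.4f} R={r:.4f} F1={f1:.4f}")
print(f"macro F1 = {macro_f1:.4f}, refusal rate = {refusal_rate:.4f}")
```

Under that assumption, 65 claims received a verdict and 135 were refused, which agrees with the reported 67.5% refusal rate.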

Per-domain accuracy

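| Domain | Total | Correct | Accuracy |
| --- | --- | --- | --- |
| science | 53 | 32 | 0.6038 |
| history | 48 | 28 | 0.5833 |
| general | 50 | 29 | 0.5800 |
| technology | 33 | 11 | 0.3333 |
| geography | 16 | 2 | 0.1250 |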
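As a consistency check, the per-domain rows should add up to the headline accuracy. A short sketch using only the numbers in the table above:

```python
# (total, correct) per domain, copied from the per-domain table.
DOMAINS = {
    "science": (53, 32),
    "history": (48, 28),
    "general": (50, 29),
    "technology": (33, 11),
    "geography": (16, 2),
}

domain_accuracy = {name: correct / total for name, (total, correct) in DOMAINS.items()}

total_claims = sum(total for total, _ in DOMAINS.values())
total_correct = sum(correct for _, correct in DOMAINS.values())
overall_accuracy = total_correct / total_claims

for name, acc in domain_accuracy.items():
    print(f"{name:10s} {acc:.4f}")
print(f"overall: {total_correct}/{total_claims} = {overall_accuracy:.4f}")
```

The rows sum to 102 correct out of 200 claims, matching the reported 51.0% accuracy.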

Reproducibility

Corpus SHA-256: cd5de198497a5cf09e372aa99745cac940c774b4d212da70902d382a71911ad2
Total claims: 200 (62 VERIFIED · 88 REFUTED · 50 UNSUPPORTED)
Generated: 2026-04-23T07:19:10Z
Repo: github.com/jourdanlabs/benchmarks/oracle
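One way to confirm that a local copy of the corpus matches the published digest is to hash the file and compare. This is a sketch using Python's standard `hashlib`. The corpus file name and location are not given here, so the path is left to the caller, and the helper names are hypothetical.

```python
import hashlib

# Digest published in the reproducibility section above.
EXPECTED_SHA256 = "cd5de198497a5cf09e372aa99745cac940c774b4d212da70902d382a71911ad2"


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large corpora need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def corpus_matches(path: str, expected: str = EXPECTED_SHA256) -> bool:
    """True when the file's SHA-256 equals the published digest."""
    return sha256_of_file(path) == expected.lower()


# Usage (the path is whatever file the repo designates as the corpus):
#   corpus_matches("path/to/corpus-file")
```

The digest covers raw bytes, so line-ending conversion or re-serialization of the corpus would change the hash even if the claims are unchanged.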

Limitations

Small KB (v0.1). The knowledge base contains approximately 100 curated facts. REFUTED recall is only 12.5% (11 of 88). The pipeline is correct whenever it emits REFUTED (precision 1.0), but it refuses on most REFUTED claims due to low confidence. KB expansion is the primary lever for recall improvement.

Geography domain accuracy: 12.5%. The KB has thin geographic coverage. Geography claims are almost universally refused or misclassified.

v0.1 baseline only. This is a v0.1 seal. The pipeline architecture is proven; the current accuracy reflects KB size, not pipeline quality. v0.2 targets KB expansion and REFUTED recall improvement.

Next version

ORACLE v0.2 targets KB expansion (1,000+ facts across all five domains), improved REFUTED recall, and calibration verification. The reliability diagram showing confidence-vs-accuracy correlation is included in the repo.
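For readers unfamiliar with reliability diagrams, the sketch below shows one common way to build the underlying data: bucket predictions by confidence, then compare mean confidence with observed accuracy in each bucket. It is a generic illustration, not the repo's method. The function names, the equal-width binning, and the sample inputs are assumptions.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Return (bin_low, bin_high, count, mean_confidence, accuracy) per non-empty bin."""
    buckets = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        buckets[idx].append((conf, ok))
    rows = []
    for i, items in enumerate(buckets):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((i / n_bins, (i + 1) / n_bins, len(items), mean_conf, accuracy))
    return rows


def expected_calibration_error(rows):
    """Count-weighted mean gap between confidence and accuracy across bins."""
    total = sum(count for _, _, count, _, _ in rows)
    return sum(count / total * abs(conf - acc) for _, _, count, conf, acc in rows)


# Hypothetical inputs, for illustration only (not ORACLE outputs).
sample_conf = [0.95, 0.95, 0.55, 0.55]
sample_correct = [True, True, True, False]
rows = reliability_bins(sample_conf, sample_correct)
ece = expected_calibration_error(rows)
```

A well-calibrated system has per-bin accuracy close to per-bin mean confidence, so the diagram hugs the diagonal and the weighted gap is near zero.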
