ALCHEMIST · BENCHMARK

Deterministic finance beats fluent guessing.

Same EDGAR facts. Same task. Same output requirements. ALCHEMIST wins where production finance actually cares: reproducibility, auditability, refusal discipline, and cell-level method consistency. CIPHER is the DCF model inside the suite.

Benchmark boundary

This page is built for live demos against GPT, Claude, Gemini, Grok, or any other model. Paste their answer into the challenge box and score it against the same ALCHEMIST discipline checks.

COMPARISON FRAME

ALCHEMIST keeps the math stable. LLMs have to prove they can.

| System | Source trail | Formula method | Refusal discipline | Audit chain | Deterministic |
|---|---|---|---|---|---|
| ALCHEMIST model suite | SEC-backed | sealed | explicit | LUNA chain | yes |
| Frontier LLM answer | must prove | must disclose | must show | usually absent | no |
| Spreadsheet template | manual | visible | manual | version drift | partly |
| Legacy data terminal | licensed | black box | limited | export-dependent | opaque |

What the live test proves

ALCHEMIST's edge is not a prettier answer. It is the same input, the same output, and the same source trail, every time. CIPHER is the DCF engine; COMPS, LBO, Credit, SOTP, and the rest are adjacent ALCHEMIST models that inherit the same source, refusal, and audit discipline.

The legacy DCF-vs-LLM harness lives in `score_llm.py`; this UI turns the same idea into a demo-ready challenge for every ALCHEMIST model. Existing CIPHER engine disclosures record a frontier comparison dated May 2, 2026, in which the CIPHER composite was 21/22 (95.5%) and the frontier derived-ratio scores were GPT 17.1%, Claude 22.9%, Grok 14.3%, and Gemini 22.9%. When demoing live, paste fresh model outputs below.
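The "same input, same output" claim can be checked mechanically. The following is a minimal sketch, not the ALCHEMIST or `score_llm.py` implementation: `fingerprint`, `is_reproducible`, and `toy_model` are hypothetical names introduced here, and the toy model merely stands in for any deterministic engine.

```python
import hashlib
import json


def fingerprint(inputs: dict, outputs: dict) -> str:
    """Hash a canonical JSON encoding of one run (inputs plus outputs)."""
    payload = json.dumps(
        {"inputs": inputs, "outputs": outputs},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),     # no whitespace variation
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_reproducible(model, inputs: dict, runs: int = 3) -> bool:
    """True if every run of `model` on `inputs` yields the same fingerprint."""
    prints = {fingerprint(inputs, model(inputs)) for _ in range(runs)}
    return len(prints) == 1


def toy_model(inputs: dict) -> dict:
    # Hypothetical stand-in for a deterministic model: one fixed formula.
    ratio = inputs["enterprise_value"] / inputs["revenue"]
    return {"ev_to_revenue": round(ratio, 4)}
```

Calling `is_reproducible(toy_model, {"enterprise_value": 1000.0, "revenue": 250.0})` returns `True`; a model whose output drifts between runs would produce more than one fingerprint and fail the check.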

LIVE LLM CHALLENGE

Paste any LLM answer. Score it like a model-risk reviewer.

COMPS prompt

Build a banker-grade comparable-company analysis for NVDA. Select 5 public peers; show EV/Revenue, EV/EBITDA, EV/EBIT, EV/FCF, P/E, P/S, P/B, revenue growth, EBITDA margin, ROE, ROIC, and Net Debt/EBITDA; state every formula; refuse unsafe denominators; flag fiscal-year mismatches; cite data provenance; and end with reproducibility notes.

ALCHEMIST expected behavior

Resolves a peer set, computes only defensible multiples, refuses non-meaningful denominators, flags fiscal-year mismatches, and exposes source/method provenance.
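Two of these behaviors, refusing non-meaningful denominators and flagging fiscal-year mismatches, can be sketched in a few lines. This is an illustration under stated assumptions, not ALCHEMIST's code: `Multiple`, `safe_multiple`, and `fiscal_year_mismatches` are hypothetical names, the "denominator must be positive" rule is one common convention assumed here, and fiscal-year ends are assumed to be given as month-day strings.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Multiple:
    name: str
    value: Optional[float]          # None when the multiple is refused
    formula: str                    # stated method, e.g. "EV / EBITDA"
    refusal: Optional[str] = None   # reason the value was withheld


def safe_multiple(name: str, numerator: Optional[float],
                  denominator: Optional[float], formula: str) -> Multiple:
    """Compute numerator / denominator, or refuse with an explicit reason."""
    if numerator is None or denominator is None:
        return Multiple(name, None, formula, "input missing")
    if denominator <= 0:
        # Assumed convention: zero or negative denominators are not meaningful.
        return Multiple(name, None, formula, "denominator not positive")
    return Multiple(name, numerator / denominator, formula)


def fiscal_year_mismatches(fiscal_year_ends: Dict[str, str],
                           target: str) -> List[str]:
    """Return peers whose fiscal-year end differs from the target's."""
    reference = fiscal_year_ends[target]
    return sorted(t for t, fye in fiscal_year_ends.items() if fye != reference)
```

For example, `safe_multiple("EV/EBITDA", 100.0, -5.0, "EV / EBITDA")` returns a `Multiple` with `value=None` and a refusal reason instead of a negative ratio, and `fiscal_year_mismatches({"AAA": "12-31", "BBB": "01-31"}, "AAA")` returns `["BBB"]` (placeholder tickers).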

Status: 0% · waiting
Responses loaded: 0%
Best response: 0%
Audit pressure: 0%
Refusal pressure: 0%

Paste the four answers

Each slot persists in this browser for the selected challenge, so you can paste all four and switch back without losing the bakeoff.

| Slot | Status | Structure | Method | Refusal | Audit | Safety |
|---|---|---|---|---|---|---|
| ALCHEMIST · COMPS | 0% · waiting | 0% | 0% | 0% | 0% | 0% |
| GPT-5.5 | 0% · waiting | 0% | 0% | 0% | 0% | 0% |
| Claude | 0% · waiting | 0% | 0% | 0% | 0% | 0% |
| Gemini | 0% · waiting | 0% | 0% | 0% | 0% | 0% |
| Grok | 0% · waiting | 0% | 0% | 0% | 0% | 0% |
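The page does not state how the five dimensions are computed. As one hypothetical illustration only, a pasted answer could be scored with a checklist of text patterns per dimension. Every pattern in `CHECKS` below is an assumption made up for this sketch, as is the name `score_answer`; none of it is taken from `score_llm.py` or the live scorer.

```python
import re
from typing import Dict, List

# Hypothetical checklist: illustrative patterns, not the page's real rubric.
CHECKS: Dict[str, List[str]] = {
    "Structure": [r"\bpeer", r"EV/EBITDA"],
    "Method": [r"formula", r"="],
    "Refusal": [r"\b(n/?m|not meaningful|refus)"],
    "Audit": [r"\b(10-K|10-Q|EDGAR|source)"],
    "Safety": [r"not investment advice"],
}


def score_answer(text: str) -> Dict[str, float]:
    """Return, per dimension, the percentage of checklist patterns found."""
    scores: Dict[str, float] = {}
    for dimension, patterns in CHECKS.items():
        hits = sum(
            1 for p in patterns if re.search(p, text, flags=re.IGNORECASE)
        )
        scores[dimension] = round(100.0 * hits / len(patterns), 1)
    return scores
```

An empty slot scores 0.0 on every dimension, matching the waiting state shown above. A pattern checklist only detects whether a discipline is mentioned, not whether the numbers are right, so it would complement rather than replace a numeric check.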