Reproducibility

How to re-run a benchmark

Every JourdanLabs benchmark is designed to be verifiable by a stranger. Verifying the corpora and running the scoring harnesses requires no special access. Engine access requires an API key.

Section 1 / How to re-run a benchmark

Step-by-step

01
Clone the benchmarks repository
git clone https://github.com/jourdanlabs/benchmarks
cd benchmarks
02
Navigate to the benchmark directory
cd citadel   # or signal, sentinel, oracle, lens, compass
03
Install dependencies
pip install -r requirements.txt
04
Verify the corpus SHA
python scoring/verify_corpus.py --corpus corpus/corpus_v1.jsonl --sha corpus/corpus_v1.sha256

This step will abort if the corpus has been modified. If it passes, you have the unmodified sealed corpus.

05
Run the scoring harness against the baselines
python scoring/score.py --predictions baselines/keyword_baseline.jsonl --corpus corpus/corpus_v1.jsonl

This reproduces the baseline scores published in CHECKPOINT_RESULTS.md without the COSMIC API.
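For readers preparing their own predictions, the following is a minimal sketch of what a JSONL scoring pass can look like. It is not the logic of scoring/score.py: the field names `id`, `label`, and `prediction` are hypothetical, and plain accuracy is used only as a stand-in metric. Check the repository for the actual prediction format and metrics before relying on it.

```python
import json

_MISSING = object()  # sentinel so a missing prediction never equals a gold label


def load_jsonl(path):
    """Load one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def accuracy(predictions, corpus, id_key="id", gold_key="label", pred_key="prediction"):
    """Fraction of corpus items whose prediction matches the gold label.

    Field names are hypothetical. Items with no prediction count as wrong.
    """
    if not corpus:
        return 0.0
    predicted = {row[id_key]: row[pred_key] for row in predictions}
    correct = sum(
        1 for item in corpus
        if predicted.get(item[id_key], _MISSING) == item[gold_key]
    )
    return correct / len(corpus)
```

Under these assumptions, `accuracy(load_jsonl("baselines/keyword_baseline.jsonl"), load_jsonl("corpus/corpus_v1.jsonl"))` would give a single score for the baseline file.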

06
Run with COSMIC predictions (requires API key)
COSMIC_API_KEY=your_key python pipeline/run.py --corpus corpus/corpus_v1.jsonl
python scoring/score.py --predictions pipeline/outputs/predictions.jsonl --corpus corpus/corpus_v1.jsonl

This reproduces the full COSMIC results and requires an evaluation API key.


Section 2 / Corpus integrity

The SHA pinning model

Every corpus is SHA-256 sealed before any pipeline contact. The seal hash is stored in corpus/corpus_v1.sha256 and cross-referenced in CHECKPOINT_RESULTS.md. The LUNA audit log for each pipeline run also records the corpus SHA at run time.

To verify that you have the unmodified corpus, run the verify script (Step 04 above) or compute the hash manually:

# macOS
shasum -a 256 corpus/corpus_v1.jsonl

# Linux
sha256sum corpus/corpus_v1.jsonl

Compare the output to the hash in corpus/corpus_v1.sha256. They must match.
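On platforms where neither command is available, the same check can be sketched in Python with the standard library. This is an illustration, not the contents of scoring/verify_corpus.py. The function names are made up for the example, and it assumes the .sha256 file holds the hex digest as its first whitespace-separated token.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_expected_sha(sha_path):
    """Read the expected digest (assumed to be the first token in the file)."""
    return Path(sha_path).read_text().split()[0].lower()


def corpus_is_sealed(corpus_path, sha_path):
    """True if the corpus digest matches the stored seal hash."""
    return sha256_of_file(corpus_path) == read_expected_sha(sha_path)
```

For example, `corpus_is_sealed("corpus/corpus_v1.jsonl", "corpus/corpus_v1.sha256")` returns `True` only when the two hashes match.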

Benchmark    Corpus SHA-256 (E.2 / v0.1)
SIGNAL       See CHECKPOINT_RESULTS.md
CITADEL      a6a98dbb30794fb98413129c3a9855af2214f840b1a1fe74e5175485dab99d81
SENTINEL     See CHECKPOINT_RESULTS.md
ORACLE       cd5de198497a5cf09e372aa99745cac940c774b4d212da70902d382a71911ad2
LENS         See CHECKPOINT_RESULTS.md
COMPASS      See CHECKPOINT_RESULTS.md
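Comparing against a hash published outside your checkout also guards against a copy in which both the corpus and its .sha256 file were changed together. A small sketch of that check, using only the two hashes listed on this page (the dictionary and function names are illustrative, not part of the repository):

```python
# Hashes copied from the table on this page; the other benchmarks are
# listed in CHECKPOINT_RESULTS.md and are deliberately not filled in here.
PUBLISHED_SHA256 = {
    "CITADEL": "a6a98dbb30794fb98413129c3a9855af2214f840b1a1fe74e5175485dab99d81",
    "ORACLE": "cd5de198497a5cf09e372aa99745cac940c774b4d212da70902d382a71911ad2",
}


def matches_published(benchmark, digest):
    """Compare a locally computed digest with the published one.

    Returns True or False, or None when no hash is listed here for
    that benchmark.
    """
    expected = PUBLISHED_SHA256.get(benchmark.upper())
    if expected is None:
        return None
    return digest.strip().lower() == expected
```

Pass in the output of `shasum -a 256` or `sha256sum` (the hex digest only) as `digest`.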

Section 3 / Evaluation API access

COSMIC engine access

COSMIC engine implementations are proprietary. To reproduce JourdanLabs' published results against the live engines, request an evaluation API key.

With the public repository alone, you can:

  • Verify the corpus SHA
  • Run baselines against the corpus
  • Inspect the full methodology
  • Re-run scoring with your own predictions in the expected format
  • Review the LUNA audit log structure

To reproduce JourdanLabs' COSMIC results specifically, you need the evaluation API. This split, with open corpora and scoring but proprietary engines, is how SuperGLUE, HELM, and other serious benchmark programs operate.

Request an evaluation API key →