How to re-run a benchmark
Every JourdanLabs benchmark is designed to be verifiable by a stranger. No special access is required to verify the corpora or run the scoring harnesses; only engine access requires an API key.
Step-by-step
git clone https://github.com/jourdanlabs/benchmarks
cd benchmarks
cd citadel # or signal, sentinel, oracle, lens, compass
pip install -r requirements.txt
python scoring/verify_corpus.py --corpus corpus/corpus_v1.jsonl --sha corpus/corpus_v1.sha256
This step will abort if the corpus has been modified. If it passes, you have the unmodified sealed corpus.
python scoring/score.py --predictions baselines/keyword_baseline.jsonl --corpus corpus/corpus_v1.jsonl
This reproduces the baseline scores published in CHECKPOINT_RESULTS.md without the COSMIC API.
COSMIC_API_KEY=your_key python pipeline/run.py --corpus corpus/corpus_v1.jsonl
python scoring/score.py --predictions pipeline/outputs/predictions.jsonl --corpus corpus/corpus_v1.jsonl
This reproduces the full COSMIC results and requires an evaluation API key.
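If you prefer to drive both commands from one script, the following is a minimal Python sketch. It only wraps the two commands shown above; the function names `build_commands` and `run_cosmic` are illustrative and are not part of the repository. One deviation is labeled in the code: the key is passed to both processes, whereas the shell form above sets it only for the pipeline run.

```python
import os
import subprocess
import sys

# Paths taken from the commands above; run from the benchmark directory.
CORPUS = "corpus/corpus_v1.jsonl"
PREDICTIONS = "pipeline/outputs/predictions.jsonl"


def build_commands(corpus=CORPUS, predictions=PREDICTIONS):
    """Return the pipeline and scoring commands as argument lists."""
    return [
        [sys.executable, "pipeline/run.py", "--corpus", corpus],
        [sys.executable, "scoring/score.py",
         "--predictions", predictions, "--corpus", corpus],
    ]


def run_cosmic(api_key, corpus=CORPUS, predictions=PREDICTIONS):
    """Run the pipeline, then scoring, stopping at the first failure."""
    # Assumption: exposing COSMIC_API_KEY to the scoring step as well is harmless.
    env = dict(os.environ, COSMIC_API_KEY=api_key)
    for command in build_commands(corpus, predictions):
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(command, env=env, check=True)
```

Because `check=True` raises on a non-zero exit, scoring never runs against predictions from a failed pipeline run.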
The SHA pinning model
Every corpus is SHA-256 sealed before any pipeline contact. The seal hash is stored in corpus/corpus_v1.sha256 and cross-referenced in CHECKPOINT_RESULTS.md. The LUNA audit log for each pipeline run also records the corpus SHA at run time.
To verify that you have the unmodified corpus, run the verify script (the verify_corpus.py step above) or compute the hash manually:
# macOS
shasum -a 256 corpus/corpus_v1.jsonl
# Linux
sha256sum corpus/corpus_v1.jsonl
Compare the output to the hash in corpus/corpus_v1.sha256. They must match.
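The same comparison can be scripted on any platform. The following is a minimal Python sketch, not the repository's verify_corpus.py, whose internals are not shown here. It assumes the .sha256 file begins with the hex digest as its first whitespace-separated token, as in standard `sha256sum` output; the helper names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def corpus_matches_seal(corpus_path, seal_path):
    """Compare the corpus digest with the digest stored in the seal file.

    Assumption: the seal file's first token is the hex digest.
    """
    tokens = Path(seal_path).read_text().split()
    if not tokens:
        return False  # an empty seal file cannot match anything
    return sha256_of_file(corpus_path) == tokens[0].lower()
```

Run from the benchmark directory, `corpus_matches_seal("corpus/corpus_v1.jsonl", "corpus/corpus_v1.sha256")` returns `True` only when the two hashes are identical.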
COSMIC engine access
COSMIC engine implementations are proprietary. To reproduce JourdanLabs' published results against the live engines, request an evaluation API key.
With the public repository alone, you can:
- Verify the corpus SHA
- Run baselines against the corpus
- Inspect the full methodology
- Re-run scoring with your own predictions in the expected format
- Review the LUNA audit log structure
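Before scoring your own predictions, it can help to confirm that the file is at least well-formed JSON Lines. The sketch below checks only that; the fields score.py expects are not described in this section, so none are validated. It assumes each record is a JSON object on its own line, and the function name is illustrative.

```python
import json


def find_invalid_jsonl_lines(path):
    """Return (line_number, reason) pairs for lines that are not JSON objects."""
    problems = []
    with open(path, encoding="utf-8") as handle:
        for number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # blank lines are skipped, not reported
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((number, str(exc)))
                continue
            # Assumption: each prediction record is a JSON object.
            if not isinstance(record, dict):
                problems.append((number, "not a JSON object"))
    return problems
```

An empty list means every non-blank line parsed as an object; it does not mean the file matches the format the scoring harness expects.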
To reproduce JourdanLabs' COSMIC results specifically, you need the evaluation API. This split — open corpora and scoring, proprietary engines — is how SuperGLUE, HELM, and other serious benchmark programs operate.