CRUCIBLE / REPRODUCIBILITY

Every result reproducible.

Every JourdanLabs benchmark ships with everything needed to reproduce the published results:

✓ Sealed corpus (SHA-verifiable)
✓ Honest baselines (real implementations, not straw men)
✓ Deterministic pipeline (same input → same output, every time)
✓ Step-by-step reproduction instructions
✓ GitHub repo with scoring harness and baseline code
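The "same input → same output" property in the list above can be checked mechanically. The following is a minimal sketch, not the JourdanLabs harness: `fingerprint` and `is_deterministic` are hypothetical names, and the pipeline is assumed to be any Python callable that maps a list of records to a list of JSON-serializable dicts.

```python
import hashlib
import json
from typing import Callable, Iterable, List


def fingerprint(records: Iterable[dict]) -> str:
    """Hash records in order, using a canonical JSON form for each one.

    Sorting keys means two dicts with the same content but different
    key order produce the same fingerprint.
    """
    digest = hashlib.sha256()
    for record in records:
        line = json.dumps(record, sort_keys=True, separators=(",", ":"))
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def is_deterministic(
    pipeline: Callable[[List[dict]], List[dict]],  # hypothetical pipeline shape
    inputs: List[dict],
    runs: int = 3,
) -> bool:
    """Run the pipeline several times and require identical output each time."""
    # Each run gets its own (shallow) copy of the input list.
    seen = {fingerprint(pipeline(list(inputs))) for _ in range(runs)}
    return len(seen) == 1
```

A check like this only catches nondeterminism that shows up within the chosen number of runs on one machine; it does not rule out differences across platforms or library versions.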

Engine implementations are proprietary; scoring harnesses and baseline code are public. This split (open corpora and scoring, proprietary engines) is how established benchmark programs such as SuperGLUE and HELM operate.

Example Reproduction Commands

# Clone the benchmarks repository
git clone https://github.com/jourdanlabs/benchmarks
# Navigate to a benchmark
cd benchmarks/citadel
# Verify the corpus SHA
python scoring/verify_corpus.py --corpus corpus/corpus_v1.jsonl
# Run baselines against the corpus (no API key required)
python scoring/score.py --predictions baselines/keyword_baseline.jsonl
# Run with COSMIC predictions (requires evaluation API key)
COSMIC_API_KEY=your_key python pipeline/run.py

Full instructions are in each benchmark's README. The commands above show the pattern; they are not a working script.
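For readers wondering what the corpus verification step amounts to, here is a minimal sketch of a SHA-256 check on a corpus file. It is not the contents of `scoring/verify_corpus.py`: the function names are hypothetical, and the expected digest is assumed to be published alongside the corpus (for example in the benchmark's README).

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_corpus(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with a published digest (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Hashing the raw bytes means any change to the file, including line endings or trailing whitespace, produces a different digest, which is the point of a sealed corpus.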

GitHub Repository

All benchmark corpora, scoring harnesses, and baseline implementations are published in the public benchmarks repository.

https://github.com/jourdanlabs/benchmarks