CRUCIBLE / REPRODUCIBILITY
Every result reproducible.
Every JourdanLabs benchmark ships with everything needed to reproduce the published results:
- ✓ Sealed corpus (SHA-verifiable)
- ✓ Honest baselines (real implementations, not straw men)
- ✓ Deterministic pipeline (same input → same output, every time)
- ✓ Step-by-step reproduction instructions
- ✓ GitHub repo with scoring harness and baseline code
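Corpus sealing rests on a standard file digest. A minimal sketch of the check `verify_corpus.py` performs, assuming SHA-256 and a digest published alongside the corpus (the function name and digest comparison here are illustrative, not the harness's actual code):

```python
import hashlib

def sha256_of(path: str) -> str:
    """Compute the SHA-256 digest of a file, streaming in chunks
    so large corpora do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical usage: compare against the digest published with the corpus.
# published = "ab12..."  # taken from the benchmark's README
# assert sha256_of("corpus/corpus_v1.jsonl") == published
```

Any single-byte change to the corpus produces a different digest, so a matching hash confirms you are scoring against exactly the published data.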
Engine implementations are proprietary. Scoring harnesses and baseline code are public. This split — open corpora and scoring, proprietary engines — is how serious benchmark programs (SuperGLUE, HELM) operate.
Example Reproduction Commands
# Clone the benchmarks repository
git clone https://github.com/jourdanlabs/benchmarks
# Navigate to a benchmark
cd benchmarks/citadel
# Verify the corpus SHA
python scoring/verify_corpus.py --corpus corpus/corpus_v1.jsonl
# Run baselines against the corpus (no API key required)
python scoring/score.py --predictions baselines/keyword_baseline.jsonl
# Run with COSMIC predictions (requires evaluation API key)
COSMIC_API_KEY=your_key python pipeline/run.py
Full instructions are in each benchmark's README. The commands above illustrate the pattern; they are not a working script.
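Determinism is easy to spot-check yourself: score the same predictions twice and confirm the results are byte-identical. A self-contained sketch with a stand-in scorer (`run_scoring` is hypothetical; the real entry point is `scoring/score.py`):

```python
import json

def run_scoring(predictions):
    """Stand-in deterministic scorer: exact-match accuracy.
    No randomness, no timestamps, no unordered iteration."""
    correct = sum(1 for p in predictions if p["pred"] == p["gold"])
    return {"accuracy": correct / len(predictions)}

preds = [{"pred": "a", "gold": "a"}, {"pred": "b", "gold": "c"}]
first = json.dumps(run_scoring(preds), sort_keys=True)
second = json.dumps(run_scoring(preds), sort_keys=True)
assert first == second  # same input → same output
```

Serializing with `sort_keys=True` makes the comparison insensitive to dict ordering, which is the kind of detail a genuinely reproducible pipeline has to pin down.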
GitHub Repository
All benchmark corpora, scoring harnesses, and baseline implementations are published in the public benchmarks repository.
github.com/jourdanlabs/benchmarks