COMPASS
Reading-level calibration.

COMPASS tests reading-level calibration on research papers: the system must assign each text a complexity tier within one tier of its ground-truth label. Research papers are the hardest category because they combine technical vocabulary, discipline-specific knowledge, and high inferential demand.
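As a minimal sketch of the within-one-tier criterion, the scoring could look like the following. It assumes tiers are encoded as ordinal integers; the function name `tier_accuracy`, its `tolerance` parameter, and the sample labels are hypothetical and are not taken from the COMPASS repo or its results.

```python
from typing import Sequence


def tier_accuracy(predicted: Sequence[int], truth: Sequence[int], tolerance: int = 1) -> float:
    """Fraction of predictions whose tier is within `tolerance` of the label.

    Assumes tiers are ordinal integers (an assumption for illustration).
    tolerance=1 gives within-1-tier accuracy; tolerance=0 gives exact match.
    """
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if len(truth) == 0:
        raise ValueError("at least one labelled document is required")
    hits = sum(1 for p, t in zip(predicted, truth) if abs(p - t) <= tolerance)
    return hits / len(truth)


# Made-up labels for illustration only, not COMPASS data.
truth = [3, 4, 5, 5]
predicted = [3, 5, 4, 5]

within_one = tier_accuracy(predicted, truth, tolerance=1)
exact = tier_accuracy(predicted, truth, tolerance=0)
```

With these made-up labels every prediction is within one tier, yet only half are exact matches, which is why the two figures are worth reporting separately.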
The 15/15 within-1-tier result means every one of the 15 research papers in the test set was assigned a complexity tier within one level of its ground-truth classification. By contrast, surface metrics such as Flesch-Kincaid and Gunning Fog routinely misclassify research papers.
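For context on why surface metrics struggle here, the sketch below writes out the standard Flesch-Kincaid grade level and Gunning Fog formulas. Both depend only on sentence length and word length or syllable counts, so they have no view of discipline-specific knowledge or inferential demand. The function names and the sample counts are illustrative assumptions; how words, sentences, and syllables are counted is left to the caller, and nothing here reflects how COMPASS itself was compared against these metrics.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts (standard published formula)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words: int, sentences: int, complex_words: int) -> float:
    """Gunning Fog index; complex_words are words of three or more syllables."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


# Illustrative counts only: 100 words, 5 sentences, 150 syllables, 10 complex words.
fk = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
fog = gunning_fog(words=100, sentences=5, complex_words=10)
```

Two passages with the same counts receive the same score from either formula, regardless of how much background knowledge each one assumes.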
Within-1-tier, not exact. The metric counts predictions that land within one tier of the label, not only exact matches. Exact-match accuracy is lower and is documented in the repo.
English-only. The corpus and pipeline are English-language only. Multilingual calibration is out of scope.
Domain coverage. The tier system was designed for the document types in the benchmark corpus. Novel document types may produce degraded calibration.