CITADEL
Corporate subsidiary hierarchy reconstruction from SEC Exhibit 21 filings.
What it is
CITADEL reconstructs corporate subsidiary hierarchies from publicly filed SEC Exhibit 21 documents. Registrants filing an annual report on Form 10-K must attach Exhibit 21, which lists their subsidiaries; under Item 601(b)(21) of Regulation S-K, subsidiaries that are not significant in aggregate may be omitted. These filings are public via EDGAR but unstructured: the data lives in HTML tables, PDFs, and free-text disclosures of varying quality.
COSMIC's CITADEL pipeline ingests Exhibit 21 filings, normalizes entity names using a deterministic rule set (no ML inference), and reconstructs parent→subsidiary DAGs for a 400-entity corpus drawn from the S&P 500 and Fortune 500. The ground truth was assembled from the same EDGAR filings using an independent reference implementation, sealed with SHA-256 before any pipeline contact.
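As a rough illustration of what deterministic, rule-based name normalization can look like, the sketch below lowercases names, strips punctuation, and drops trailing legal-form suffixes before emitting parent→subsidiary edges. The suffix list and the function names `normalize_name` and `build_edges` are hypothetical; the actual CITADEL rule set is not described in this document.

```python
import re

# Hypothetical suffix list, for illustration only; not the CITADEL rule set.
_LEGAL_SUFFIXES = {
    "incorporated", "inc", "corporation", "corp", "company", "co",
    "llc", "ltd", "limited", "lp", "plc", "gmbh",
}


def normalize_name(raw: str) -> str:
    """Map an entity name to a canonical key using fixed rules only."""
    name = raw.lower().replace("&", " and ").replace(".", "")
    name = re.sub(r"[^a-z0-9 ]+", " ", name)  # remaining punctuation -> space
    tokens = name.split()                      # also collapses whitespace
    # Drop trailing legal-form tokens, but never empty the name entirely.
    while len(tokens) > 1 and tokens[-1] in _LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def build_edges(parent: str, subsidiaries: list[str]) -> set[tuple[str, str]]:
    """Return deduplicated (parent, subsidiary) edges over normalized names."""
    parent_key = normalize_name(parent)
    edges = set()
    for raw in subsidiaries:
        child_key = normalize_name(raw)
        if child_key and child_key != parent_key:  # skip blanks and self-loops
            edges.add((parent_key, child_key))
    return edges
```

Because every rule is a fixed string operation, the same filing always yields the same edge set, which is the property that makes per-fix re-scoring meaningful.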
This task matters for financial compliance, competitive intelligence, and regulatory reporting. Who owns what — and can you prove it from public filings, without LLM guessing? CITADEL answers that question deterministically.
Results
| Metric | Value | 95% CI (BCa, B=2000) |
|---|---|---|
| Micro F1 | 0.6161 | [0.5282 – 0.6740] |
| Micro Precision | 0.6523 | — |
| Micro Recall | 0.5836 | — |
| Macro F1 | 0.5936 | [0.5551 – 0.6251] |
| Entities scored | 342 | of 400 in corpus |
| Entities with ≥1 TP | 290 | — |
| TP / FP / FN | 23,999 / 12,791 / 17,120 | — |
Confidence intervals are computed via BCa (bias-corrected and accelerated) bootstrap with B=2,000 resamples. Ground truth SHA: 4911f158...cf54c (verified).
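The reported Micro F1 follows directly from the pooled TP / FP / FN counts, and a BCa interval can be built on top of the same statistic. The sketch below is a minimal standard-library version. It assumes resampling happens at the entity level (this document does not say how resamples are drawn), and the function names and the synthetic demo counts are illustrative, not CITADEL code or data.

```python
import random
from statistics import NormalDist


def micro_f1(counts):
    """Micro F1 from per-entity (tp, fp, fn) tuples: pool counts, then score."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bca_interval(counts, stat=micro_f1, b=2000, alpha=0.05, seed=0):
    """BCa bootstrap interval, resampling entities with replacement (assumed)."""
    rng = random.Random(seed)
    nd = NormalDist()
    n = len(counts)
    theta = stat(counts)
    reps = sorted(
        stat([counts[rng.randrange(n)] for _ in range(n)]) for _ in range(b)
    )
    # Bias correction z0: share of replicates below the point estimate,
    # clamped away from 0 and 1 so inv_cdf stays finite.
    share = sum(r < theta for r in reps) / b
    z0 = nd.inv_cdf(min(max(share, 1 / (b + 1)), b / (b + 1)))
    # Acceleration a: skewness of the leave-one-out (jackknife) estimates.
    jack = [stat(counts[:i] + counts[i + 1:]) for i in range(n)]
    mean_j = sum(jack) / n
    num = sum((mean_j - j) ** 3 for j in jack)
    den = 6.0 * sum((mean_j - j) ** 2 for j in jack) ** 1.5
    a = num / den if den else 0.0

    def adjusted(q):
        z = z0 + nd.inv_cdf(q)
        return nd.cdf(z0 + z / (1.0 - a * z))

    def pick(q):
        return reps[min(b - 1, max(0, int(q * b)))]

    return theta, pick(adjusted(alpha / 2)), pick(adjusted(1 - alpha / 2))


# Pooled counts from the Results table give 0.6161 when rounded to 4 places.
table_f1 = micro_f1([(23999, 12791, 17120)])

# Synthetic per-entity counts, for illustration only (not CITADEL data).
_rng = random.Random(1)
demo = [(_rng.randint(0, 50), _rng.randint(0, 30), _rng.randint(0, 30))
        for _ in range(40)]
point, low, high = bca_interval(demo, b=500)
```

Where SciPy is available, `scipy.stats.bootstrap` with `method='BCa'` is a library alternative to the hand-rolled interval.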
Methodology arc
CITADEL has a documented checkpoint arc showing per-fix attribution:
| Checkpoint | Micro F1 | Delta | Fix |
|---|---|---|---|
| D (baseline) | 0.6025 | — | Initial reconstruction, 5 sessions |
| E (regression) | 0.5774 | −0.0251 | Code change introduced normalization regression |
| E.1 | 0.5826 | +0.0052 | Class A: _INLINE_JUR regex fix + _SKIP plural + SEC disclaimer |
| E.2 (current seal) | 0.6161 | +0.0335 | Class B: canonical 3-fallback Exhibit 21 document finder |
Each fix is scoped, attributed, and re-scored in isolation. The regression at Checkpoint E is documented openly: Micro F1 declined before recovering, and the arc shows why.
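The Delta column can be reproduced from the Micro F1 column alone. A small arithmetic check, using only the values in the checkpoint table (the helper name `deltas` is illustrative):

```python
# Micro F1 per checkpoint, copied from the checkpoint table.
checkpoints = [("D", 0.6025), ("E", 0.5774), ("E.1", 0.5826), ("E.2", 0.6161)]


def deltas(points):
    """Return (name, change versus the previous checkpoint), rounded to 4 places."""
    return [
        (name, round(f1 - prev_f1, 4))
        for (_, prev_f1), (name, f1) in zip(points, points[1:])
    ]


per_fix = deltas(checkpoints)
# Net change from the baseline (D) to the current seal (E.2).
net_gain = round(checkpoints[-1][1] - checkpoints[0][1], 4)
```

The net change from D to E.2 is +0.0136, so the two fixes more than recover the 0.0251 lost at Checkpoint E.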
Reproducibility
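Checking a sealed file against its published SHA-256 digest can be sketched with the standard library alone. The file path in the usage comment is a hypothetical placeholder; the benchmark's actual file layout is not described in this document.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_seal(path, expected_hex):
    """True if the file's digest equals the published (sealed) digest."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())


# Usage (hypothetical path; substitute the published ground-truth digest):
# matches_seal("ground_truth.json", "<published ground-truth SHA-256>")
```

Any change to the ground truth after sealing, even a single byte, produces a different digest, so a match shows the scored file is the one that was sealed.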
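| Item | Value |
|---|---|
| SHA-256 digest | a6a98dbb30794fb98413129c3a9855af2214f840b1a1fe74e5175485dab99d81 |
| Ground truth SHA-256 | 4911f15899f4a9b6fa342de27470c828887569320b9c7f9da231d516e86cf54c |
| Corpus size | 400 |
| Repository | github.com/jourdanlabs/benchmarks/citadel |
Limitations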
42 systematic zero-TP entities. Root causes include: PDF-embedded Exhibit 21 documents (Class C1, structural), abbreviated filings under SEC Rule 601(b)(21)(ii), and GLEIF-only fallback coverage.
HCA Healthcare (2,578 GT relationships, 0 TP). EX-21 document found but contains zero parseable subsidiaries in the expected HTML format. Likely cause: PDF embed or non-standard layout. Class C1 structural issue.
Coverage ceiling at F1 ~0.62. Without new data sources, Class C structural issues cap the achievable Micro F1 at roughly this level.
Ground truth assembled from same source. Ground truth was assembled from EDGAR filings using an independent implementation, but shares the same upstream data source as the pipeline. Off-EDGAR data was not incorporated.
Next version
Checkpoint F targets Class C1 structural issues: multi-part exhibit ingestion, prior-year filing fallback, and REIT-specific subsidiary structure detection. The 42 systematic zero-TP entities are the known gap list.