FLAGSHIP ENVELOPE
Dimension tested · source enrichment
Arena question
Does source enrichment change grounded-QA quality when raw NIST controls, summary-enriched text, and clause/structure maps are run under the same generation and evaluation envelope?
Lanes
Primary metrics
evidence_extraction, answerability, grounding
Secondary metrics: faithfulness, answer_relevancy, selected_count, evidence substring validity
Observed per-lane means
| metric | clause_structure_map | raw_text | summary_enriched | spread |
|---|---|---|---|---|
| evidence_extraction | 0.918 | 0.905 | 0.915 | 0.0136 |
| answerability | 0.950 | 0.933 | 0.945 | 0.0175 |
| grounding | 0.955 | 0.943 | 0.955 | 0.0126 |
The best lane on every metric is clause_structure_map (tied with summary_enriched on grounding at the displayed precision). Spread = max − min across lanes.
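As a rough check, the best lane and the spread for each metric can be recomputed from the per-lane means in the table. The sketch below is illustrative only: the names `LANE_MEANS` and `summarize` are hypothetical, and because the table shows means rounded to three decimals, the recomputed spreads agree with the published four-decimal values (for example 0.0136) only to about three decimals.

```python
# Per-lane means copied from the table above (already rounded to three decimals).
LANE_MEANS = {
    "evidence_extraction": {
        "clause_structure_map": 0.918, "raw_text": 0.905, "summary_enriched": 0.915,
    },
    "answerability": {
        "clause_structure_map": 0.950, "raw_text": 0.933, "summary_enriched": 0.945,
    },
    "grounding": {
        "clause_structure_map": 0.955, "raw_text": 0.943, "summary_enriched": 0.955,
    },
}


def summarize(lane_means):
    """Return the best lane(s) and the max-minus-min spread for each metric."""
    summary = {}
    for metric, by_lane in lane_means.items():
        best = max(by_lane.values())
        worst = min(by_lane.values())
        summary[metric] = {
            # Ties are kept, so a metric can have more than one best lane.
            "best_lanes": sorted(lane for lane, v in by_lane.items() if v == best),
            # Rounded to the precision of the inputs, not of the published spread.
            "spread": round(best - worst, 3),
        }
    return summary


SUMMARY = summarize(LANE_MEANS)
for metric, row in SUMMARY.items():
    print(metric, row["best_lanes"], row["spread"])
```

Keeping ties explicit matters here: at the displayed precision, grounding has two best lanes rather than one.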
Raw text, summary enrichment, and clause/structure-map enrichment all reached the 50-candidate floor, so the run shows that the enrichment dimension is executable within the fixed grounded-QA envelope.
The important finding is operational before it is comparative: derived source forms can be used without losing evidence binding. Summary and clause-map candidates cite literal text from their own derived source units, not from the raw NIST text, and the evaluator was re-bound to the matching unit for each lane. That means QA Arena can compare enrichment strategies without mixing source contexts.
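The binding rule described above can be sketched as a substring check against the unit that belongs to the candidate's own lane. This is a minimal illustration, not the evaluator's actual implementation: `Candidate`, `evidence_is_valid`, the `(lane, unit_id)` key, and the unit texts are all hypothetical, and the sample strings are placeholders rather than real NIST control text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    lane: str      # e.g. "raw_text", "summary_enriched", "clause_structure_map"
    unit_id: str   # hypothetical identifier of the source unit
    evidence: str  # literal quote the candidate cites


def evidence_is_valid(candidate, units):
    """True only if the quote is a literal substring of the unit for the same lane."""
    source = units.get((candidate.lane, candidate.unit_id))
    if source is None:
        return False  # no unit bound for this lane, so nothing to validate against
    return bool(candidate.evidence) and candidate.evidence in source


# Placeholder unit texts (assumptions for illustration only).
UNITS = {
    ("raw_text", "ctrl-1"): "The organization reviews account privileges at a defined frequency.",
    ("summary_enriched", "ctrl-1"): "Summary: account privileges are reviewed on a schedule.",
}

summary_candidate = Candidate("summary_enriched", "ctrl-1", "reviewed on a schedule")
cross_bound = Candidate("raw_text", "ctrl-1", "reviewed on a schedule")

print(evidence_is_valid(summary_candidate, UNITS))  # quote checked against the summary unit
print(evidence_is_valid(cross_bound, UNITS))        # same quote checked against the raw unit
```

Keying the lookup on the lane as well as the unit is what keeps source contexts from mixing: a quote taken from a summary fails validation if it is checked against the raw text.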
This run should not be presented as broad proof that enrichment improves quality: the largest spread across lanes on any primary metric is 0.0175. It supports a narrower claim: all three enrichment forms are viable arms, and the website can now show arm-level results for raw, summary, and clause-map runs under the same source family. The next useful test is whether enrichment produces a meaningful quality or answerability gain on harder or messier documents.
Earlier raw-text and summary-enriched pilot runs were incomplete; the canonical real run (2026-06-13) replaces them as the only flagship-grade row.
Experiment spec
QA envelope
Scores
| metric | mean | range | n |
|---|---|---|---|
| grounding | 0.938 | 0.840–1.000 | 72 |
| faithfulness | 0.939 | 0.900–0.980 | 72 |
| evidence_extraction | 0.900 | 0.780–0.950 | 72 |
| answerability | 0.946 | 0.830–0.990 | 72 |
| answer_relevancy | 0.937 | 0.850–0.980 | 72 |