Flagship Dim Source Enrichment

Dimension tested · source enrichment

§ 01 Arena measurement

Arena question

Does source enrichment change grounded-QA quality when raw NIST controls, summary-enriched text, and clause/structure maps are run under the same generation and evaluation envelope?

Lanes

lane	candidates	selected
clause_structure_map	56	50
raw_text	56	50
summary_enriched	56	50

Primary metrics

evidence_extraction, answerability, grounding

Secondary metrics: faithfulness, answer_relevancy, selected_count, evidence substring validity

Observed per-lane means

metric	clause_structure_map	raw_text	summary_enriched	spread
evidence_extraction	0.918	0.905	0.915	0.0136
answerability	0.950	0.933	0.945	0.0175
grounding	0.955	0.943	0.955	0.0126

Best lane per metric, at the reported precision: clause_structure_map on all three metrics, tied with summary_enriched on grounding. Spread = max − min across lanes.
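The spread column can be approximately reproduced from the rounded means in the table. The sketch below is illustrative only: the dictionary layout and the function name `lane_spread` are assumptions, not part of QA Arena. Because it starts from three-decimal means, it yields values such as 0.013 rather than the reported 0.0136, which was presumably computed from unrounded means.

```python
# Per-lane means copied from the table above (rounded to three decimals).
LANE_MEANS = {
    "evidence_extraction": {
        "clause_structure_map": 0.918, "raw_text": 0.905, "summary_enriched": 0.915,
    },
    "answerability": {
        "clause_structure_map": 0.950, "raw_text": 0.933, "summary_enriched": 0.945,
    },
    "grounding": {
        "clause_structure_map": 0.955, "raw_text": 0.943, "summary_enriched": 0.955,
    },
}


def lane_spread(means):
    """Return (best_lanes, spread), where spread = max - min across lanes."""
    top = max(means.values())
    # Keep every lane that reaches the top value, so ties stay visible.
    best = sorted(lane for lane, value in means.items() if value == top)
    return best, round(top - min(means.values()), 4)


for metric, means in LANE_MEANS.items():
    best, spread = lane_spread(means)
    print(f"{metric}: best={', '.join(best)} spread={spread}")
```

Returning every lane that reaches the maximum, rather than a single winner, keeps the grounding tie between clause_structure_map and summary_enriched explicit.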

§ 02 Interpretation

Raw text, summary enrichment, and clause/structure-map enrichment all reached the 50-candidate floor (50 selected from 56 candidates per lane), so the run shows that the enrichment dimension is executable in the fixed grounded-QA envelope.
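The floor condition can be expressed as a simple per-lane check. This is a minimal sketch: `CANDIDATE_FLOOR`, `LANES`, and `lanes_below_floor` are hypothetical names, and only the counts come from the lanes reported above. How the 50 selected items are chosen from the 56 candidates is not described in this report, so selection itself is not modeled.

```python
CANDIDATE_FLOOR = 50  # floor stated in the report

# (candidates, selected) per lane, as reported above.
LANES = {
    "clause_structure_map": (56, 50),
    "raw_text": (56, 50),
    "summary_enriched": (56, 50),
}


def lanes_below_floor(lanes, floor=CANDIDATE_FLOOR):
    """Return the names of lanes whose selected count is under the floor."""
    return [
        name
        for name, (_candidates, selected) in lanes.items()
        if selected < floor
    ]


# An empty list means every lane met the floor.
print(lanes_below_floor(LANES))
```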

The important finding is operational before it is comparative: derived source forms can be used without losing evidence binding. Summary and clause-map candidates cite literal text from their derived source units, not from the raw NIST text, and the evaluator was rebound to the matching unit. That means QA Arena can compare enrichment strategies without mixing source contexts.
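The binding rule described above amounts to checking each candidate's evidence against the source unit of its own lane, never against another lane's text. A minimal sketch follows; the `Candidate` fields, the `source_units` mapping, and the sample strings are all invented for illustration and do not reflect QA Arena's actual schema or the wording of any NIST control.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    lane: str      # e.g. "raw_text" or "summary_enriched"
    unit_id: str   # hypothetical identifier of the source unit
    evidence: str  # literal span cited by the QA pair


def evidence_is_bound(candidate, source_units):
    """True if the evidence is a literal substring of the lane's own unit."""
    unit_text = source_units.get((candidate.lane, candidate.unit_id))
    if unit_text is None or not candidate.evidence:
        return False
    return candidate.evidence in unit_text


# Made-up units: the same control in raw form and in summary form.
source_units = {
    ("raw_text", "u1"): "The organization reviews account privileges annually.",
    ("summary_enriched", "u1"): "Summary: privileges are reviewed once a year.",
}

ok = Candidate("summary_enriched", "u1", "reviewed once a year")
mixed = Candidate("summary_enriched", "u1", "reviews account privileges")

print(evidence_is_bound(ok, source_units))     # evidence found in the summary unit
print(evidence_is_bound(mixed, source_units))  # evidence exists only in the raw unit
```

Keying the lookup on the pair (lane, unit) is what prevents a summary-lane candidate from being validated against raw text, which is the context mixing the evaluator rebinding is meant to avoid.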

This run should not be presented as a broad proof that enrichment improves quality. It supports a narrower claim: all three enrichment forms are viable arms, and the website can now show arm-level results for raw, summary, and clause-map runs under the same source family. The next useful test is whether enrichment creates a meaningful quality or answerability gain on harder or messier documents.

Status: complete

§ 03 Limits & next step

Earlier raw-text and summary-enriched pilot runs were incomplete; the canonical real-run (2026-06-13) replaces them as the only flagship-grade row.

§ 04 Planning

Experiment spec

Tools: agent_skill

Budget: Not specified

Judge: taxonomy_agent (agent_skill)

Source: NIST SP 800-53

§ 05 Envelope & scores

QA envelope

Use case: flagship:grounded-document-qa:source-enrichment-substrate

Artifact contract: answer.qa_pair.grounded.v1

Metric set: answer.grounded.v1

Operator: claude

Models: agent_internal

Scores

metric	mean	range	n
grounding	0.938	0.840–1.000	72
faithfulness	0.939	0.900–0.980	72
evidence_extraction	0.900	0.780–0.950	72
answerability	0.946	0.830–0.990	72
answer_relevancy	0.937	0.850–0.980	72


© 2026 Quaerens AI Labs / Mario Lanzillotta