Quaerens AI Labs Vol. I · Experiment record · MMXXVI

FLAGSHIP ENVELOPE

Flagship Dim Source Enrichment

Dimension tested · source enrichment

§ 01 Arena measurement

Arena question

Does source enrichment change grounded-QA quality when raw NIST controls, summary-enriched text, and clause/structure maps are run under the same generation and evaluation envelope?

Lanes

lane                   candidates   selected
clause_structure_map   56           50
raw_text               56           50
summary_enriched       56           50

Primary metrics

evidence_extraction, answerability, grounding

Secondary metrics: faithfulness, answer_relevancy, selected_count, evidence substring validity

Observed per-lane means

metric                clause_structure_map   raw_text   summary_enriched   spread
evidence_extraction   0.918                  0.905      0.915              0.0136
answerability         0.950                  0.933      0.945              0.0175
grounding             0.955                  0.943      0.955              0.0126

Best lane per metric shown in oxblood. Spread = max − min across lanes.
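
As a minimal sketch of the spread definition above, the following Python recomputes max − min and the best lane(s) from the published per-lane means. The function names are illustrative, not part of QA Arena. Because the published means are rounded to three decimals, the recomputed spreads come out slightly smaller than the reported ones (for example 0.013 against 0.0136); the reported values were presumably computed from unrounded means, which is an assumption here.

```python
# Per-lane means copied from the table above (already rounded to 3 decimals).
LANE_MEANS = {
    "evidence_extraction": {
        "clause_structure_map": 0.918, "raw_text": 0.905, "summary_enriched": 0.915,
    },
    "answerability": {
        "clause_structure_map": 0.950, "raw_text": 0.933, "summary_enriched": 0.945,
    },
    "grounding": {
        "clause_structure_map": 0.955, "raw_text": 0.943, "summary_enriched": 0.955,
    },
}


def spread(lane_means: dict[str, float]) -> float:
    """Spread = max - min across lanes for one metric."""
    values = list(lane_means.values())
    return max(values) - min(values)


def best_lanes(lane_means: dict[str, float]) -> list[str]:
    """Return every lane that attains the maximum mean (ties are kept)."""
    top = max(lane_means.values())
    return sorted(lane for lane, mean in lane_means.items() if mean == top)


if __name__ == "__main__":
    for metric, means in LANE_MEANS.items():
        print(metric, round(spread(means), 3), best_lanes(means))
```

On the rounded means, grounding is a tie between clause_structure_map and summary_enriched, so a per-metric "best lane" is not always unique.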

§ 02 Interpretation

All three lanes (raw text, summary enrichment, and clause/structure-map enrichment) reached the 50-candidate floor: each produced 56 candidates, of which 50 were selected. The run therefore shows that the enrichment dimension is executable in the fixed grounded-QA envelope.
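
The floor check can be sketched in a few lines of Python. The counts are taken from the lanes listing; the names `LANES`, `SELECTION_FLOOR`, and `lanes_below_floor` are hypothetical, and the sketch assumes the floor applies to the selected count rather than the raw candidate count.

```python
# Candidate and selected counts per lane, as reported in the lanes listing.
LANES = {
    "clause_structure_map": {"candidates": 56, "selected": 50},
    "raw_text": {"candidates": 56, "selected": 50},
    "summary_enriched": {"candidates": 56, "selected": 50},
}

# Assumption: the "50-candidate floor" is a minimum on the selected count.
SELECTION_FLOOR = 50


def lanes_below_floor(lanes: dict[str, dict[str, int]], floor: int) -> list[str]:
    """Return the lanes whose selected count falls short of the floor."""
    return [name for name, counts in lanes.items() if counts["selected"] < floor]


if __name__ == "__main__":
    missing = lanes_below_floor(LANES, SELECTION_FLOOR)
    print("all lanes reached the floor" if not missing else f"below floor: {missing}")
```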

The important finding is operational before it is comparative: derived source forms can be used without losing evidence binding. Summary and clause-map candidates cite literal text from their derived source units, not from the raw NIST text, and the evaluator was rebound to the matching unit. That means QA Arena can compare enrichment strategies without mixing source contexts.
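
One way to picture this binding is a literal-substring check against the source unit of the candidate's own lane. The sketch below is an illustration under stated assumptions, not the evaluator's actual code: the `Candidate` type, the `units_by_lane` layout, the unit id, and the placeholder unit texts are all hypothetical, and the texts are invented stand-ins rather than NIST content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    lane: str       # e.g. "raw_text", "summary_enriched", "clause_structure_map"
    unit_id: str    # identifier of the source unit the candidate was generated from
    evidence: str   # literal span the candidate cites


def evidence_is_bound(candidate: Candidate,
                      units_by_lane: dict[str, dict[str, str]]) -> bool:
    """True only if the cited evidence is a literal substring of the
    matching unit in the candidate's own lane (no cross-lane lookup)."""
    unit_text = units_by_lane.get(candidate.lane, {}).get(candidate.unit_id)
    if unit_text is None or not candidate.evidence:
        return False
    return candidate.evidence in unit_text


# Placeholder unit texts (invented for illustration, not real source content).
UNITS = {
    "raw_text": {"unit-1": "Placeholder raw control text about account review."},
    "summary_enriched": {"unit-1": "Placeholder summary: accounts are reviewed."},
}

in_lane = Candidate("summary_enriched", "unit-1", "accounts are reviewed")
cross_lane = Candidate("summary_enriched", "unit-1", "raw control text")

if __name__ == "__main__":
    print(evidence_is_bound(in_lane, UNITS))     # evidence found in its own lane's unit
    print(evidence_is_bound(cross_lane, UNITS))  # evidence exists only in the raw unit
```

Looking up the unit by lane first is the point of the design: a span that appears only in the raw text does not validate a summary-lane candidate, which keeps source contexts from mixing.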

This run should not be presented as a broad proof that enrichment improves quality. It supports a narrower claim: all three enrichment forms are viable arms, and the website can now show arm-level results for raw, summary, and clause-map runs under the same source family. The next useful test is whether enrichment creates a meaningful quality or answerability gain on harder or messier documents.

Status: complete

§ 03 Limits & next step

Earlier raw-text and summary-enriched pilot runs were incomplete; the canonical real run (2026-06-13) replaces them as the only flagship-grade row.

§ 04 Planning

Experiment spec

Tools: agent_skill
Budget: Not specified
Judge: taxonomy_agent (agent_skill)
Source: NIST SP 800-53

§ 05 Envelope & scores

QA envelope

Use case: flagship:grounded-document-qa:source-enrichment-substrate
Artifact contract: answer.qa_pair.grounded.v1
Metric set: answer.grounded.v1
Operator: claude
Models: agent_internal

Scores

metric                mean    range         n
grounding             0.938   0.840–1.000   72
faithfulness          0.939   0.900–0.980   72
evidence_extraction   0.900   0.780–0.950   72
answerability         0.946   0.830–0.990   72
answer_relevancy      0.937   0.850–0.980   72

© 2026 Quaerens AI Labs / Mario Lanzillotta