FLAGSHIP ENVELOPE
Dimension tested · source shape
Arena question
Can the fixed grounded-document QA envelope maintain quality when the source shape changes across policy regulation, technical security controls, and code/API documentation?
Lanes
code_api_software_doc, policy_regulatory, technical_security_control
Primary metrics
grounding, faithfulness, evidence_extraction
Secondary metrics: answerability, answer_relevancy, selected_count, duplicate rejection
Observed per-lane means
| metric | code_api_software_doc | policy_regulatory | technical_security_control | spread |
|---|---|---|---|---|
| grounding | 0.971 | 0.966 | 0.967 | 0.0053 |
| faithfulness | 0.961 | 0.958 | 0.960 | 0.0031 |
| evidence_extraction | 0.918 | 0.919 | 0.920 | 0.0021 |
The best lane per metric is the highest value in its row (highlighted in oxblood in the rendered report). Spread = max − min across lanes.
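As a quick check, the best lane and the spread can be recomputed from the tabulated lane means. The sketch below copies the three-decimal values from the table; the names `lane_means` and `summarize` are illustrative and not part of the report. Because the inputs are rounded, the recomputed spreads (0.005, 0.003, 0.002) agree with the reported ones (0.0053, 0.0031, 0.0021) only to three decimals. The reported figures were presumably derived from unrounded means, which is an assumption here.

```python
# Per-lane means copied from the table (already rounded to three decimals).
lane_means = {
    "grounding": {
        "code_api_software_doc": 0.971,
        "policy_regulatory": 0.966,
        "technical_security_control": 0.967,
    },
    "faithfulness": {
        "code_api_software_doc": 0.961,
        "policy_regulatory": 0.958,
        "technical_security_control": 0.960,
    },
    "evidence_extraction": {
        "code_api_software_doc": 0.918,
        "policy_regulatory": 0.919,
        "technical_security_control": 0.920,
    },
}


def summarize(means):
    """Return (best_lane, spread) for one metric's per-lane means."""
    best_lane = max(means, key=means.get)  # lane with the highest mean
    spread = max(means.values()) - min(means.values())  # max minus min
    return best_lane, spread


summary = {metric: summarize(means) for metric, means in lane_means.items()}

for metric, (best_lane, spread) in summary.items():
    print(f"{metric}: best={best_lane}, spread={spread:.3f}")
```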
The same grounded-QA pipeline produced usable selected data across policy, technical-control, and code-documentation sources, with 50 kept candidates per arm and no collapse in evaluation discrimination.
This dimension is best read as robustness evidence. Changing the source shape did not break the agentic source-unit workflow: each arm produced literal-evidence candidates, per-unit evaluation prompts, and a clean contract audit. The arm scores were close (the largest spread on a primary metric was 0.0053, for grounding), so the result does not support a strong claim that one source shape is intrinsically better for grounded QA. It shows that this fixed envelope can carry the same QA task across three very different document styles.
The practical takeaway is that source shape is a real dimension to track, but the current run argues more for cross-source reliability than for winner selection. A stronger claim would need more source families, independent evaluation, or statistical analysis of the small score differences.
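One form the statistical analysis mentioned above could take is a two-sided permutation test on the difference in mean score between two lanes. The sketch below is illustrative only: it assumes per-candidate scores are available for each lane, which the report does not publish, and the function name `permutation_p_value` and the example scores are hypothetical placeholders rather than data from the run.

```python
import random
from statistics import mean


def permutation_p_value(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided permutation test on the absolute difference of means.

    Returns the estimated probability of seeing a difference at least as
    large as the observed one if lane labels were assigned at random.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    observed = abs(mean(scores_a) - mean(scores_b))
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)

    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # reassign lane labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float noise
            hits += 1

    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-candidate scores, NOT taken from the run.
example_lane_a = [0.95, 1.00, 0.90, 1.00, 0.98, 0.97]
example_lane_b = [0.96, 0.99, 0.92, 1.00, 0.95, 0.97]
print(permutation_p_value(example_lane_a, example_lane_b, n_resamples=2_000))
```

A large p-value would be consistent with the reading above that the lane differences are too small to pick a winner. With only 50 kept candidates per arm, such a test would have limited power to detect differences of a few thousandths.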
Experiment spec
QA envelope
Scores
| metric | mean | range | n |
|---|---|---|---|
| grounding | 0.945 | 0.918–0.970 | 54 |
| faithfulness | 0.921 | 0.894–0.949 | 54 |
| evidence_extraction | 0.877 | 0.850–0.905 | 54 |
| answerability | 0.915 | 0.888–0.943 | 54 |
| answer_relevancy | 0.901 | 0.874–0.929 | 54 |