FLAGSHIP ENVELOPE
Dimension tested · budget & scale
Arena question
Does increasing candidate budget improve grounded-QA output quality, or does it mainly increase yield and selection headroom?
Lanes
budget_50, budget_100, budget_200
Primary metrics
evidence_extraction, grounding, answerability
Secondary metrics: faithfulness, answer_relevancy, selected_count, duplicate rejection
Observed per-lane means
| metric | budget_50 | budget_100 | budget_200 | spread |
|---|---|---|---|---|
| evidence_extraction | 0.901 | 0.908 | 0.914 | 0.0124 |
| grounding | 0.942 | 0.948 | 0.954 | 0.0124 |
| answerability | 0.932 | 0.937 | 0.944 | 0.0119 |
Best lane per metric is shown in oxblood; here that is budget_200 for all three primary metrics. Spread = max − min across lanes.
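The best-lane and spread columns can be rechecked from the table with a few lines of Python. This is a minimal sketch, not the report's own tooling: the `lane_means` dictionary and the `summarize` helper are illustrative names, and the inputs are the three-decimal means printed above, so the recomputed spreads are only as precise as that rounding.

```python
# Per-lane means copied from the table above (already rounded to 3 decimals).
lane_means = {
    "evidence_extraction": {"budget_50": 0.901, "budget_100": 0.908, "budget_200": 0.914},
    "grounding":           {"budget_50": 0.942, "budget_100": 0.948, "budget_200": 0.954},
    "answerability":       {"budget_50": 0.932, "budget_100": 0.937, "budget_200": 0.944},
}


def summarize(means_by_lane):
    """Return (best lane, spread) where spread = max - min across lanes."""
    best_lane = max(means_by_lane, key=means_by_lane.get)
    spread = max(means_by_lane.values()) - min(means_by_lane.values())
    return best_lane, round(spread, 4)


summary = {metric: summarize(lanes) for metric, lanes in lane_means.items()}

for metric, (best_lane, spread) in summary.items():
    print(f"{metric}: best={best_lane}, spread={spread}")
```

From the rounded means, the spreads come out at about 0.013, 0.012, and 0.012, which agrees with the reported 0.0124, 0.0124, and 0.0119 to within rounding. The four-decimal values in the table were presumably computed from unrounded means, which are not available here.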
Raising the candidate budget from 50 to 100 to 200 increased the available selected set without causing an obvious quality collapse.
This dimension tested volume, not a new source or model strategy. The run held the source package and grounded-QA envelope fixed while changing how many candidates were produced and kept. All three arms completed at their planned sizes, and evaluation remained discriminating across the larger set.
The result is useful for planning scale: a larger budget gives more selection headroom and more examples to curate. It does not yet prove that more budget automatically produces better QA pairs. The public claim should stay focused on yield and operational capacity, with quality-lift treated as a follow-up question.
Score distributions are close across arms. The audit closes this as a volume/yield result, not a strong quality-separation result: the larger budget produced more kept rows without an obvious quality lift or collapse.
Experiment spec
QA envelope
Scores
| metric | mean | range | n |
|---|---|---|---|
| grounding | 0.951 | 0.885–0.995 | 350 |
| faithfulness | 0.945 | 0.873–0.994 | 350 |
| evidence_extraction | 0.910 | 0.850–0.953 | 350 |
| answerability | 0.940 | 0.886–0.985 | 350 |
| answer_relevancy | 0.935 | 0.863–0.984 | 350 |
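As a consistency check, the n of 350 in this table matches the sum of the three lane budgets (50 + 100 + 200), and the pooled means for the primary metrics can be approximated by size-weighting the per-lane means reported earlier. The sketch below assumes that each lane contributed exactly its budget in scored rows and that this table pools all three lanes; neither is stated outright in the report, and `lane_sizes`, `per_lane_means`, and `pooled_mean` are illustrative names.

```python
# Assumption: each lane kept exactly its planned budget of rows.
lane_sizes = {"budget_50": 50, "budget_100": 100, "budget_200": 200}

# Per-lane means for the primary metrics, as reported (3-decimal rounding).
per_lane_means = {
    "evidence_extraction": {"budget_50": 0.901, "budget_100": 0.908, "budget_200": 0.914},
    "grounding":           {"budget_50": 0.942, "budget_100": 0.948, "budget_200": 0.954},
    "answerability":       {"budget_50": 0.932, "budget_100": 0.937, "budget_200": 0.944},
}


def pooled_mean(means, sizes):
    """Size-weighted mean across lanes: sum(n_i * mean_i) / sum(n_i)."""
    total = sum(sizes.values())
    return sum(means[lane] * sizes[lane] for lane in sizes) / total


total_rows = sum(lane_sizes.values())
pooled = {m: round(pooled_mean(v, lane_sizes), 3) for m, v in per_lane_means.items()}

print("total rows:", total_rows)
for metric, value in pooled.items():
    print(f"{metric}: pooled mean ~ {value:.3f}")
```

Under those assumptions the weighted means round to 0.910, 0.951, and 0.940, matching the evidence_extraction, grounding, and answerability rows above. The secondary metrics cannot be checked this way because their per-lane means are not reported.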