Flagship Dim Budget

Dimension tested · budget & scale

§ 01 Arena measurement

Arena question

Does increasing candidate budget improve grounded-QA output quality, or does it mainly increase yield and selection headroom?

Lanes

lane	candidates	selected
budget_100	100	100
budget_200	200	200
budget_50	50	50

Primary metrics

evidence_extraction, grounding, answerability

Secondary metrics: faithfulness, answer_relevancy, selected_count, duplicate rejection

Observed per-lane means

metric	budget_100	budget_200	budget_50	spread
evidence_extraction	0.908	0.914	0.901	0.0124
grounding	0.948	0.954	0.942	0.0124
answerability	0.937	0.944	0.932	0.0119

The best lane for every metric is budget_200 (shown in oxblood in the original rendering). Spread = max − min across lanes.
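As a quick illustration of the spread column, the sketch below recomputes max − min and the best lane from the rounded per-lane means in the table. The dictionary layout and the function name are illustrative assumptions, not part of the arena tooling. Because the inputs are rounded to three decimals, the results agree with the published spreads only to about ±0.001; the published values were presumably computed from unrounded means.

```python
# Per-lane means copied from the table above (rounded to three decimals).
LANE_MEANS = {
    "evidence_extraction": {"budget_50": 0.901, "budget_100": 0.908, "budget_200": 0.914},
    "grounding": {"budget_50": 0.942, "budget_100": 0.948, "budget_200": 0.954},
    "answerability": {"budget_50": 0.932, "budget_100": 0.937, "budget_200": 0.944},
}


def spread_and_best(lane_means):
    """Return (spread, best_lane) for one metric, where spread = max - min."""
    best_lane = max(lane_means, key=lane_means.get)
    spread = max(lane_means.values()) - min(lane_means.values())
    return round(spread, 4), best_lane


# Hypothetical summary structure: metric -> (spread, best lane).
SUMMARY = {metric: spread_and_best(lanes) for metric, lanes in LANE_MEANS.items()}

if __name__ == "__main__":
    for metric, (spread, best) in SUMMARY.items():
        print(f"{metric}: spread={spread:.3f}, best lane={best}")
```

Every metric resolves to the same best lane, budget_200, and the recomputed spreads all sit between 0.012 and 0.013.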

§ 02 Interpretation

Raising the candidate budget from 50 to 100 to 200 grew the selected set in step (50, 100, and 200 rows) without an obvious quality collapse: per-lane means on the three primary metrics differ by only about 0.012.

This dimension tested volume, not a new source or model strategy. The run held the source package and grounded-QA envelope fixed while changing how many candidates were produced and kept. All three arms completed at their planned sizes, and evaluation remained discriminating across the larger set.

The result is useful for planning scale: a larger budget gives more selection headroom and more examples to curate. It does not yet prove that more budget automatically produces better QA pairs. The public claim should stay focused on yield and operational capacity, with quality-lift treated as a follow-up question.

Status: complete

§ 03 Limits & next step

Score distributions are close across arms. The audit closes this as a volume/yield result, not a strong quality-separation result: the larger budget produced more kept rows without an obvious quality lift or collapse.
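One rough way to read "close across arms" is to compare each cross-lane spread with the width of the pooled score range reported for the same metric. The sketch below does that with the published numbers. It is a descriptive ratio only, not a significance test, and it assumes the pooled ranges in the Scores table cover all three lanes; the names used are hypothetical.

```python
# Cross-lane spreads from the per-lane means table.
SPREAD = {"evidence_extraction": 0.0124, "grounding": 0.0124, "answerability": 0.0119}

# Pooled (min, max) score ranges from the Scores table.
# Assumption: these ranges pool all three lanes.
POOLED_RANGE = {
    "evidence_extraction": (0.850, 0.953),
    "grounding": (0.885, 0.995),
    "answerability": (0.886, 0.985),
}


def spread_share(metric):
    """Cross-lane spread as a fraction of the pooled score range width."""
    low, high = POOLED_RANGE[metric]
    return SPREAD[metric] / (high - low)


SHARES = {metric: spread_share(metric) for metric in SPREAD}

if __name__ == "__main__":
    for metric, share in SHARES.items():
        print(f"{metric}: spread is {share:.1%} of the pooled range")
```

With the published figures, each lane-to-lane spread comes out at roughly 11 to 12 percent of the pooled range, so the variation between individual rows is much larger than the variation between lanes.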

§ 04 Planning

Experiment spec

Tools: agent_skill
Budget: Not specified
Judge: taxonomy_agent (agent_skill)
Source: NIST SP 800-53

§ 05 Envelope & scores

QA envelope

Use case: flagship:grounded-document-qa:candidate_budget
Artifact contract: answer.qa_pair.grounded.v1
Metric set: answer.grounded.v1
Operator: nico
Models: agent_internal

Scores

metric	mean	range	n
grounding	0.951	0.885–0.995	350
faithfulness	0.945	0.873–0.994	350
evidence_extraction	0.910	0.850–0.953	350
answerability	0.940	0.886–0.985	350
answer_relevancy	0.935	0.863–0.984	350
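The n of 350 matches the sum of the three lane sizes (50 + 100 + 200), which suggests, though the page does not state it, that this table pools all lanes. Under that assumption, the pooled mean for each primary metric should equal the size-weighted average of the per-lane means. The sketch below runs that consistency check using the rounded published values; the function and variable names are illustrative.

```python
# Selected rows per lane, from the Lanes section.
LANE_SIZES = {"budget_50": 50, "budget_100": 100, "budget_200": 200}

# Per-lane means from the per-lane table (rounded to three decimals).
LANE_MEANS = {
    "evidence_extraction": {"budget_50": 0.901, "budget_100": 0.908, "budget_200": 0.914},
    "grounding": {"budget_50": 0.942, "budget_100": 0.948, "budget_200": 0.954},
    "answerability": {"budget_50": 0.932, "budget_100": 0.937, "budget_200": 0.944},
}

# Pooled means as reported in the Scores table.
REPORTED_POOLED = {"evidence_extraction": 0.910, "grounding": 0.951, "answerability": 0.940}


def pooled_mean(lane_means, lane_sizes):
    """Size-weighted average of per-lane means."""
    total = sum(lane_sizes.values())
    weighted = sum(lane_means[lane] * n for lane, n in lane_sizes.items())
    return weighted / total


POOLED = {metric: pooled_mean(lanes, LANE_SIZES) for metric, lanes in LANE_MEANS.items()}

if __name__ == "__main__":
    for metric, value in POOLED.items():
        print(f"{metric}: recomputed {value:.3f}, reported {REPORTED_POOLED[metric]:.3f}")
```

For all three primary metrics the recomputed value rounds to the reported pooled mean, which is consistent with the Scores table aggregating the 50-, 100-, and 200-row lanes. Because budget_200 contributes more than half of the rows, it dominates the pooled figures.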

Quaerens

Evidence over claims · scoped over global

© 2026 Quaerens AI Labs / Mario Lanzillotta