FLAGSHIP ENVELOPE
Dimension tested · evaluation suite
Arena question
Does changing the evaluation suite alter the ranking or interpretation of the same fixed grounded-QA candidate set?
Lanes
Primary metrics
top_candidate_changed, top_provider_changed, suite_mean_score
Secondary metrics: faithfulness, grounding, evidence_extraction, answerability
Observed per-lane means
| metric | answer.grounded.v1 | quality.v1 | spread |
|---|---|---|---|
| top_candidate_changed | — | — | — |
| top_provider_changed | — | — | — |
| suite_mean_score | 0.939 | 0.923 | 0.0156 |
Best lane per metric shown in oxblood. Spread = max − min across lanes.
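The spread definition can be checked with a minimal sketch. The function name `lane_spread` is hypothetical and not part of the report tooling, and the inputs are the rounded means from the table. The rounded values give a spread of about 0.016, while the table reports 0.0156; that figure was presumably computed from unrounded lane means.

```python
def lane_spread(lane_means: dict[str, float]) -> float:
    """Spread = max - min across lanes for a single metric."""
    values = list(lane_means.values())
    return max(values) - min(values)


# Rounded suite_mean_score values copied from the table above.
lane_means = {"answer.grounded.v1": 0.939, "quality.v1": 0.923}

spread = lane_spread(lane_means)
# The best lane for a higher-is-better metric is the one with the largest mean.
best_lane = max(lane_means, key=lane_means.get)

print(f"best lane: {best_lane}, spread: {spread:.4f}")
```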
Rank-change facts
Top candidate unchanged across suites · top provider unchanged across suites.
Evaluating the same 150 candidates with `quality.v1` and `answer.grounded.v1` produced an observed null result: neither the top candidate nor the top provider changed.
This is still a useful result. The experiment shows that evaluator choice can be isolated as its own dimension, because both suites scored the same candidate set rather than triggering a new generation run. In this specific grounded-document QA envelope, the generic quality suite and the grounded-answer suite agreed on the top-level ordering.
The result should not be framed as proof that evaluator choice never matters. It says that for this candidate set, changing the suite did not change the headline ranking. The next useful case is a harder candidate set where generic fluency and source-grounding are more likely to disagree.
Eval-only lane: no new generation occurs. The lane reuses source-shape candidates from a real run as input. The result is an observed null: the routed suite did not move the top-ranked candidate or provider for this fixed candidate set.
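The eval-only comparison can be sketched as follows. This is an illustration under stated assumptions, not the arena's actual implementation: the candidate ids, provider names, and scores are invented, and ranking providers by the mean score of their candidates is an assumed rule that the report does not specify. Only the suite names and the three primary metric names come from the report.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical fixed candidate set: (candidate_id, provider). Nothing is regenerated.
candidates = [("c1", "provider_a"), ("c2", "provider_a"), ("c3", "provider_b")]

# Hypothetical scores each suite assigned to the SAME candidates.
scores = {
    "quality.v1": {"c1": 0.95, "c2": 0.90, "c3": 0.88},
    "answer.grounded.v1": {"c1": 0.97, "c2": 0.89, "c3": 0.91},
}


def top_candidate(suite_scores: dict[str, float]) -> str:
    """Candidate with the highest score under one suite."""
    return max(suite_scores, key=suite_scores.get)


def top_provider(suite_scores: dict[str, float]) -> str:
    """Provider with the highest mean candidate score (assumed ranking rule)."""
    by_provider = defaultdict(list)
    for candidate_id, provider in candidates:
        by_provider[provider].append(suite_scores[candidate_id])
    return max(by_provider, key=lambda p: mean(by_provider[p]))


def compare_suites(all_scores: dict[str, dict[str, float]]) -> dict:
    """Compute the three primary metrics across suites."""
    top_candidates = {top_candidate(s) for s in all_scores.values()}
    top_providers = {top_provider(s) for s in all_scores.values()}
    return {
        # A "changed" flag is True when the suites disagree on the winner.
        "top_candidate_changed": len(top_candidates) > 1,
        "top_provider_changed": len(top_providers) > 1,
        "suite_mean_score": {
            suite: mean(s.values()) for suite, s in all_scores.items()
        },
    }


result = compare_suites(scores)
print(result["top_candidate_changed"], result["top_provider_changed"])
```

With these invented scores both flags come out `False`, mirroring the observed null. A harder candidate set would correspond to score tables in which the two suites pick different winners.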
Experiment spec
QA envelope