Flagship Dim Eval Suite

Dimension tested · evaluation suite

§ 01 Arena measurement

Arena question

Does changing the evaluation suite alter the ranking or interpretation of the same fixed grounded-QA candidate set?

Lanes

answer.grounded.v1

candidates 150

selected 150

quality.v1

candidates 150

selected 150

Primary metrics

top_candidate_changed · top_provider_changed · suite_mean_score

Secondary metrics: faithfulness, grounding, evidence_extraction, answerability

Observed per-lane means

metric	answer.grounded.v1	quality.v1	spread
top_candidate_changed	—	—	—
top_provider_changed	—	—	—
suite_mean_score	0.939	0.923	0.0156

Best lane per metric shown in oxblood. Spread = max − min across lanes.
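The spread definition above can be sketched in a few lines of Python. This is a minimal illustration, not the arena's own code: the function names `lane_spread` and `best_lane` are hypothetical, and "best" is assumed to mean the highest mean. The inputs are the rounded means from the table, so the recomputed spread (about 0.016) differs slightly from the reported 0.0156, which was presumably derived from unrounded means.

```python
def lane_spread(lane_means):
    """Spread = max - min across lanes for one metric."""
    values = list(lane_means.values())
    return max(values) - min(values)


def best_lane(lane_means):
    """Lane with the highest mean (assumption: higher is better)."""
    return max(lane_means, key=lane_means.get)


# Rounded per-lane means copied from the table above.
suite_mean_score = {
    "answer.grounded.v1": 0.939,
    "quality.v1": 0.923,
}

spread = lane_spread(suite_mean_score)
best = best_lane(suite_mean_score)
print(f"best lane: {best}, spread from rounded means: {spread:.3f}")
```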

Rank-change facts

Top candidate unchanged across suites · top provider unchanged across suites.
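As a rough sketch of how the two rank-change flags could be derived, the Python below compares the top candidate and top provider under two suites. Everything here is an assumption for illustration: the candidate ids, provider names, and scores are made up, the helper names are hypothetical, and "top provider" is taken to mean the provider with the highest mean score over its candidates, which the report does not confirm.

```python
from collections import defaultdict


def top_candidate(scores):
    """Highest-scoring candidate id; ties go to the smallest id."""
    return max(sorted(scores), key=scores.get)


def top_provider(scores, provider_of):
    """Provider with the highest mean candidate score (assumed definition)."""
    by_provider = defaultdict(list)
    for candidate_id, score in scores.items():
        by_provider[provider_of[candidate_id]].append(score)
    means = {p: sum(v) / len(v) for p, v in by_provider.items()}
    return max(sorted(means), key=means.get)


def rank_change_facts(scores_a, scores_b, provider_of):
    """Compare headline rankings of one fixed candidate set under two suites."""
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both suites must score the same candidate set")
    return {
        "top_candidate_changed": top_candidate(scores_a) != top_candidate(scores_b),
        "top_provider_changed": (
            top_provider(scores_a, provider_of) != top_provider(scores_b, provider_of)
        ),
    }


# Hypothetical toy data, not taken from the run.
provider_of = {"c1": "p1", "c2": "p1", "c3": "p2"}
grounded_scores = {"c1": 0.95, "c2": 0.90, "c3": 0.92}
quality_scores = {"c1": 0.94, "c2": 0.91, "c3": 0.90}

facts = rank_change_facts(grounded_scores, quality_scores, provider_of)
print(facts)
```

With this toy data both flags come out false, which mirrors the observed null result; a candidate set where the suites disagree on the leader would flip `top_candidate_changed` to true.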

§ 02 Interpretation

Evaluating the same 150 candidates with `quality.v1` and `answer.grounded.v1` produced an observed null result: the top candidate and top provider ordering did not change.

This is still a useful result. The experiment shows that evaluator choice can be isolated as its own dimension: both suites judged the same candidate set, so no new generation run was triggered and any difference in ranking would be attributable to the suite alone. In this specific grounded-document QA envelope, the generic quality suite and the grounded-answer suite agreed on the top-level ordering.
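The isolation idea can be illustrated with a small Python sketch in which a frozen candidate set is scored by several suites and nothing is regenerated. The `Candidate` type, the `evaluate` helper, the two toy scoring functions, and the sample texts are all hypothetical stand-ins; they are not the actual `quality.v1` or `answer.grounded.v1` suites.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Candidate:
    """One already-generated answer plus the evidence it was given."""
    candidate_id: str
    answer: str
    evidence: str


Suite = Callable[[Candidate], float]


def toy_quality(candidate: Candidate) -> float:
    """Stand-in for a generic suite: rewards longer answers, capped at 1.0."""
    return min(len(candidate.answer.split()) / 10, 1.0)


def toy_grounded(candidate: Candidate) -> float:
    """Stand-in for a grounding suite: share of answer tokens found in evidence."""
    answer_tokens = candidate.answer.lower().split()
    if not answer_tokens:
        return 0.0
    evidence_tokens = set(candidate.evidence.lower().split())
    return sum(t in evidence_tokens for t in answer_tokens) / len(answer_tokens)


def evaluate(
    candidates: Tuple[Candidate, ...], suites: Dict[str, Suite]
) -> Dict[str, Dict[str, float]]:
    """Score the same fixed candidates under every suite; no generation step."""
    return {
        name: {c.candidate_id: suite(c) for c in candidates}
        for name, suite in suites.items()
    }


# Hypothetical frozen candidate set.
evidence = "the access control policy is reviewed annually"
candidates = (
    Candidate("c1", "access control policy is reviewed annually", evidence),
    Candidate("c2", "the policy is optional", evidence),
)

results = evaluate(candidates, {"toy_quality": toy_quality, "toy_grounded": toy_grounded})
print(results)
```

Because the candidate tuple is immutable and shared, the only thing that varies between the two result sets is the scoring function, which is the property the experiment relies on.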

The result should not be framed as proof that evaluator choice never matters. It says that for this candidate set, changing the suite did not change the headline ranking. The next useful case is a harder candidate set where generic fluency and source-grounding are more likely to disagree.

complete

§ 03 Limits & next step

Eval-only lane — no new generation occurs. It reuses source-shape real-run candidates as input. The result is an observed null: the routed suite did not move the top-ranked candidate or provider for this fixed candidate set.

§ 04 Planning

Experiment spec

Tools

agent_skill

Budget

Unspecified candidates per tool · 50 final

Judge

taxonomy_agent (agent_skill)

Source: NIST SP 800-53

§ 05 Envelope & scores

QA envelope

Use case

flagship:grounded-document-qa:eval_suite

Artifact contract

answer.qa_pair.grounded.v1

Metric set

True

Operator

nico

Models

agent_internal

Quaerens

Evidence over claims · scoped over global


© 2026 Quaerens AI Labs / Mario Lanzillotta