Quaerens AI Labs Vol. I · Experiment record · MMXXVI

FLAGSHIP ENVELOPE

Flagship Dim Eval Suite

Dimension tested · evaluation suite

§ 01 Arena measurement

Arena question

Does changing the evaluation suite alter the ranking or interpretation of the same fixed grounded-QA candidate set?

Lanes

lane                  candidates  selected
answer.grounded.v1    150         150
quality.v1            150         150

Primary metrics

top_candidate_changed · top_provider_changed · suite_mean_score

Secondary metrics: faithfulness, grounding, evidence_extraction, answerability

Observed per-lane means

metric                 answer.grounded.v1  quality.v1  spread
top_candidate_changed
top_provider_changed
suite_mean_score       0.939               0.923       0.0156

Best lane per metric shown in oxblood. Spread = max − min across lanes.
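The spread definition can be checked with a minimal sketch. It uses the rounded means displayed in the table and assumes that a higher mean marks the better lane; the function name `spread` is illustrative and not part of the experiment tooling. The rounded inputs give 0.016 rather than the reported 0.0156, which suggests (as an assumption) that the reported spread was computed from unrounded means.

```python
def spread(lane_means: dict[str, float]) -> float:
    """Return max - min of one metric's per-lane means."""
    values = list(lane_means.values())
    return max(values) - min(values)


# Rounded per-lane means as displayed in the table.
suite_mean_score = {"answer.grounded.v1": 0.939, "quality.v1": 0.923}

# Assumption: higher suite_mean_score is better, so the best lane is the max.
best_lane = max(suite_mean_score, key=suite_mean_score.get)

print(best_lane, round(spread(suite_mean_score), 4))
```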

Rank-change facts

Top candidate unchanged across suites · top provider unchanged across suites.
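One way the two rank-change flags could be derived from per-lane scores is sketched below. The record does not define how the top provider is chosen, so this sketch assumes it is the provider with the highest mean score; the `Scored` type, the helper names, and the toy rows are all hypothetical and are not the experiment's data.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Scored:
    candidate_id: str
    provider: str
    score: float


def top_candidate(rows: list[Scored]) -> str:
    # Highest score wins; ties are broken by candidate_id for determinism.
    return min(rows, key=lambda r: (-r.score, r.candidate_id)).candidate_id


def top_provider(rows: list[Scored]) -> str:
    # Assumption: the top provider is the one with the highest mean score.
    by_provider: dict[str, list[float]] = defaultdict(list)
    for r in rows:
        by_provider[r.provider].append(r.score)
    return min(by_provider, key=lambda p: (-mean(by_provider[p]), p))


def rank_change_facts(lane_a: list[Scored], lane_b: list[Scored]) -> dict[str, bool]:
    return {
        "top_candidate_changed": top_candidate(lane_a) != top_candidate(lane_b),
        "top_provider_changed": top_provider(lane_a) != top_provider(lane_b),
    }


# Made-up rows: the same two candidates scored under two suites.
grounded = [Scored("c1", "p1", 0.95), Scored("c2", "p2", 0.90)]
quality = [Scored("c1", "p1", 0.93), Scored("c2", "p2", 0.91)]

facts = rank_change_facts(grounded, quality)
print(facts)
```

With these toy rows both flags are `False`, which mirrors the shape of the null result reported here without reproducing its data.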

§ 02 Interpretation

Evaluating the same 150 candidates with `quality.v1` and `answer.grounded.v1` produced an observed null result: the top candidate and top provider ordering did not change.

This is still a useful result. The experiment shows that evaluator choice can be isolated as its own dimension: both suites judged the same candidate set, so no new generation run was triggered and any difference in scores comes from the suite alone. In this specific grounded-document QA envelope, the generic quality suite and the grounded-answer suite agreed on the top-level ordering.

The result should not be framed as proof that evaluator choice never matters. It says that for this candidate set, changing the suite did not change the headline ranking. The next useful case is a harder candidate set where generic fluency and source-grounding are more likely to disagree.

Status: complete
§ 03 Limits & next step

Eval-only lane: no new generation occurs. The lane reuses source-shape real-run candidates as input. The result is an observed null: the routed suite did not move the top-ranked candidate or provider for this fixed candidate set.
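The eval-only pattern can be sketched as a small harness that freezes the candidate set once and hands the same items to every suite, then checks that each lane scored exactly the same candidate ids. Everything here is an assumption for illustration: the candidate shape, the `run_eval_only` helper, and the two toy scorers stand in for the real suites, whose scoring logic is not described in this record.

```python
from typing import Callable

# Hypothetical candidate shape: {"id": ..., "answer": ..., "evidence": ...}
Candidate = dict[str, str]
Scorer = Callable[[Candidate], float]


def run_eval_only(
    candidates: list[Candidate], suites: dict[str, Scorer]
) -> dict[str, dict[str, float]]:
    """Score one fixed candidate set under every suite; nothing is generated."""
    frozen = tuple(candidates)  # the same items are passed to each suite
    results = {
        name: {c["id"]: score(c) for c in frozen} for name, score in suites.items()
    }
    # Guard: every lane must have scored the identical set of candidate ids.
    id_sets = {frozenset(scores) for scores in results.values()}
    if len(id_sets) > 1:
        raise ValueError("lanes scored different candidate sets")
    return results


def toy_grounded(c: Candidate) -> float:
    # Toy stand-in: 1.0 when the answer text appears in the evidence.
    return 1.0 if c["answer"].lower() in c["evidence"].lower() else 0.0


def toy_quality(c: Candidate) -> float:
    # Toy stand-in: longer answers score higher, capped at 1.0.
    return min(len(c["answer"].split()) / 2, 1.0)


candidates = [
    {"id": "c1", "answer": "annual review", "evidence": "The policy requires annual review."},
    {"id": "c2", "answer": "monthly", "evidence": "The policy requires annual review."},
]
results = run_eval_only(candidates, {"toy_grounded": toy_grounded, "toy_quality": toy_quality})
print(results)
```

Because the candidates are fixed before any suite runs, a difference between the two result maps can only come from the scorer, which is the isolation property this lane relies on.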

§ 04 Planning

Experiment spec

Tools: agent_skill
Budget: Unspecified candidates per tool · 50 final
Judge: taxonomy_agent (agent_skill)
Source: NIST SP 800-53
§ 05 Envelope & scores

QA envelope

Use case: flagship:grounded-document-qa:eval_suite
Artifact contract: answer.qa_pair.grounded.v1
Metric set: True
Operator: nico
Models: agent_internal
© 2026 Quaerens AI Labs / Mario Lanzillotta