Flagship Dim Frontier Vs Open Source

Dimension tested · frontier vs open source

§ 01Arena measurement

Arena question

Does a frontier cloud model and an open-weight model produce different grounded-QA quality under the same source-unit, budget, and evaluation envelope?

Lanes

frontier_cloud

candidates 54

selected 50

open_source_open_weight

candidates 54

selected 50

Primary metrics

groundingevidence_extractionanswerability

Secondary metrics: faithfulness, answer_relevancy, selected_count, retry count, failure artifact count

Observed per-lane means

metric	frontier_cloud	open_source_open_weight	spread
grounding	0.992	0.994	0.0016
evidence_extraction	0.958	0.964	0.0060
answerability	0.962	0.964	0.0016

Best lane per metric shown in oxblood. Spread = max − min across lanes.

Operational signals

2 failure / repair artifact(s) recorded for this run.

§ 02Interpretation

The OpenAI frontier arm and the open-weight Llama arm both reached 50 kept candidates after retrying only the missing Llama AC-6 unit.

This dimension tested model-family posture under the same grounded-QA source-unit workflow. The first pass left the open-weight arm short because of OpenRouter rate limits, not because the source strategy failed. The resume/retry pass preserved successful candidates and filled only the missing unit, so the final comparison is based on normal QA Arena artifacts rather than a fabricated backfill.

The result is suitable as a completed frontier-vs-open-weight run for this envelope. It should not be read as a global model leaderboard. The public explanation should keep the operational detail: frontier and open-weight both reached the floor, while the open-weight transport path required targeted retry handling.

complete

§ 03Limits & next step

The original Claude/Anthropic frontier-arm plan was superseded by OpenAI on 2026-06-13. The first real-run (-001) was partially_observed; the resume/retry pass retried the failed Llama AC-6 unit (9 new candidates) to complete the dimension.

§ 04Planning

Experiment spec

Tools

agent_skill

Budget

Not specified

Judge

taxonomy_agent (agent_skill)

Source: NIST SP 800-53

§ 05Envelope & scores

QA envelope

Use case

flagship:grounded-document-qa:frontier_vs_open_source

Artifact contract

answer.qa_pair.grounded.v1

Metric set

answer.grounded.v1

Operator

nico

Models

agent_internal, anthropic/claude-sonnet-4.5,meta-llama/llama-3.1-70b-instruct

Scores

metric	mean	range	n
grounding	0.945	0.904–0.992	50
faithfulness	0.938	0.898–0.987	50
evidence_extraction	0.904	0.861–0.952	50
answerability	0.936	0.894–0.982	50
answer_relevancy	0.928	0.886–0.973	50

Quaerens

Evidence over claims · scoped over global

Labs QA Arena Inquirer

More Knowledge Graph Methodology

© 2026 Quaerens AI Labs / Mario Lanzillotta