Quaerens AI Labs Vol. I · Experiment record · MMXXVI

FLAGSHIP ENVELOPE

Flagship Dim Frontier Vs Open Source

Dimension tested · frontier vs open source

§ 01 Arena measurement

Arena question

Do a frontier cloud model and an open-weight model produce different grounded-QA quality under the same source-unit, budget, and evaluation envelope?

Lanes

frontier_cloud
candidates  54
selected  50
open_source_open_weight
candidates  54
selected  50

Primary metrics

grounding, evidence_extraction, answerability

Secondary metrics: faithfulness, answer_relevancy, selected_count, retry count, failure artifact count

Observed per-lane means

metric               frontier_cloud  open_source_open_weight  spread
grounding            0.992           0.994                    0.0016
evidence_extraction  0.958           0.964                    0.0060
answerability        0.962           0.964                    0.0016

Best lane per metric: open_source_open_weight on all three primary metrics. Spread = max − min across lanes.
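The spread definition above can be sketched in a few lines of Python. This is a minimal illustration, not QA Arena code: the dictionary layout and the function name `lane_spread` are assumptions, and the inputs are the per-lane means as displayed (rounded to three decimals). Because of that rounding, the computed spread can differ slightly from the published four-decimal values (for example 0.0020 here versus 0.0016 for grounding), which were presumably derived from unrounded means.

```python
# Per-lane means as displayed in this record (rounded to three decimals).
LANE_MEANS = {
    "grounding": {"frontier_cloud": 0.992, "open_source_open_weight": 0.994},
    "evidence_extraction": {"frontier_cloud": 0.958, "open_source_open_weight": 0.964},
    "answerability": {"frontier_cloud": 0.962, "open_source_open_weight": 0.964},
}


def lane_spread(means):
    """Return (best_lane, spread), where spread = max - min across lanes."""
    best_lane = max(means, key=means.get)
    spread = round(max(means.values()) - min(means.values()), 4)
    return best_lane, spread


for metric, means in LANE_MEANS.items():
    best, spread = lane_spread(means)
    print(f"{metric}: best={best} spread={spread:.4f}")
```

With only two lanes, the spread is simply the absolute difference between the two means; the max/min form generalizes to envelopes with more lanes.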

Operational signals

2 failure/repair artifacts recorded for this run.

§ 02 Interpretation

The OpenAI frontier arm and the open-weight Llama arm both reached 50 kept candidates after retrying only the missing Llama AC-6 unit.

This dimension tested model-family posture under the same grounded-QA source-unit workflow. The first pass left the open-weight arm short because of OpenRouter rate limits, not because the source strategy failed. The resume/retry pass preserved successful candidates and filled only the missing unit, so the final comparison is based on normal QA Arena artifacts rather than a fabricated backfill.
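The resume/retry behaviour described above (keep what succeeded, regenerate only what is missing) can be sketched as follows. This is a hypothetical illustration under stated assumptions: `resume_run`, `generate`, and the candidate strings are invented for the example, and the unit list is illustrative apart from AC-6, which the record names. It is not the QA Arena implementation.

```python
from typing import Callable, Dict, List, Tuple


def resume_run(
    units: List[str],
    first_pass: Dict[str, List[str]],
    generate: Callable[[str], List[str]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Preserve successful units from the first pass and retry only the rest.

    `first_pass` maps a source unit to the candidates it produced; a unit
    that is absent or has an empty list counts as missing (assumption).
    """
    final: Dict[str, List[str]] = {}
    retried: List[str] = []
    for unit in units:
        kept = first_pass.get(unit)
        if kept:
            final[unit] = list(kept)        # preserved unchanged, never regenerated
        else:
            final[unit] = generate(unit)    # targeted retry for the missing unit
            retried.append(unit)
    return final, retried


# Illustrative usage: one unit succeeded, AC-6 failed in the first pass.
units = ["AC-2", "AC-6"]
first_pass = {"AC-2": ["candidate-a", "candidate-b"], "AC-6": []}
final, retried = resume_run(units, first_pass, lambda u: [f"{u}-retry-candidate"])
print(retried)
```

Keeping the retry scoped to the missing unit means the candidates from the first pass are the same artifacts that enter the final comparison, which is the property the record relies on.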

The result is suitable as a completed frontier-vs-open-weight run for this envelope. It should not be read as a global model leaderboard. The public explanation should keep the operational detail: frontier and open-weight both reached the floor, while the open-weight transport path required targeted retry handling.

Status: complete

§ 03 Limits & next step

The original Claude/Anthropic frontier-arm plan was superseded by OpenAI on 2026-06-13. The first real run (-001) was recorded as partially_observed; the resume/retry pass then retried the failed Llama AC-6 unit (9 new candidates) to complete the dimension.

§ 04 Planning

Experiment spec

Tools: agent_skill
Budget: Not specified
Judge: taxonomy_agent (agent_skill)
Source: NIST SP 800-53

§ 05 Envelope & scores

QA envelope

Use case: flagship:grounded-document-qa:frontier_vs_open_source
Artifact contract: answer.qa_pair.grounded.v1
Metric set: answer.grounded.v1
Operator: nico
Models: agent_internal, anthropic/claude-sonnet-4.5, meta-llama/llama-3.1-70b-instruct

Scores

metric               mean   range        n
grounding            0.945  0.904–0.992  50
faithfulness         0.938  0.898–0.987  50
evidence_extraction  0.904  0.861–0.952  50
answerability        0.936  0.894–0.982  50
answer_relevancy     0.928  0.886–0.973  50
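A quick consistency check on the score summary can be sketched in Python: each mean should lie inside its reported range, and each n should be positive. The tuple layout and the helper name `row_is_consistent` are assumptions made for this illustration; the values are copied from the scores above.

```python
# (mean, range_low, range_high, n) per metric, as reported in this record.
SCORES = {
    "grounding": (0.945, 0.904, 0.992, 50),
    "faithfulness": (0.938, 0.898, 0.987, 50),
    "evidence_extraction": (0.904, 0.861, 0.952, 50),
    "answerability": (0.936, 0.894, 0.982, 50),
    "answer_relevancy": (0.928, 0.886, 0.973, 50),
}


def row_is_consistent(mean, low, high, n):
    """A mean must fall within its own range, and n must be positive."""
    return low <= mean <= high and n > 0


inconsistent = [m for m, row in SCORES.items() if not row_is_consistent(*row)]
print("inconsistent rows:", inconsistent)
```

A check like this only confirms internal coherence of the summary; it says nothing about how the underlying per-item scores were produced.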
© 2026 Quaerens AI Labs / Mario Lanzillotta