FLAGSHIP ENVELOPE
Dimension tested · frontier vs open source
Arena question
Does a frontier cloud model and an open-weight model produce different grounded-QA quality under the same source-unit, budget, and evaluation envelope?
Lanes
Primary metrics
groundingevidence_extractionanswerabilitySecondary metrics: faithfulness, answer_relevancy, selected_count, retry count, failure artifact count
Observed per-lane means
| metric | frontier_cloud | open_source_open_weight | spread |
|---|---|---|---|
| grounding | 0.992 | 0.994 | 0.0016 |
| evidence_extraction | 0.958 | 0.964 | 0.0060 |
| answerability | 0.962 | 0.964 | 0.0016 |
Best lane per metric shown in oxblood. Spread = max − min across lanes.
Operational signals
2 failure / repair artifact(s) recorded for this run.
The OpenAI frontier arm and the open-weight Llama arm both reached 50 kept candidates after retrying only the missing Llama AC-6 unit.
This dimension tested model-family posture under the same grounded-QA source-unit workflow. The first pass left the open-weight arm short because of OpenRouter rate limits, not because the source strategy failed. The resume/retry pass preserved successful candidates and filled only the missing unit, so the final comparison is based on normal QA Arena artifacts rather than a fabricated backfill.
The result is suitable as a completed frontier-vs-open-weight run for this envelope. It should not be read as a global model leaderboard. The public explanation should keep the operational detail: frontier and open-weight both reached the floor, while the open-weight transport path required targeted retry handling.
The original Claude/Anthropic frontier-arm plan was superseded by OpenAI on 2026-06-13. The first real-run (-001) was partially_observed; the resume/retry pass retried the failed Llama AC-6 unit (9 new candidates) to complete the dimension.
Experiment spec
QA envelope
Scores
| metric | mean | range | n |
|---|---|---|---|
| grounding | 0.945 | 0.904–0.992 | 50 |
| faithfulness | 0.938 | 0.898–0.987 | 50 |
| evidence_extraction | 0.904 | 0.861–0.952 | 50 |
| answerability | 0.936 | 0.894–0.982 | 50 |
| answer_relevancy | 0.928 | 0.886–0.973 | 50 |