FLAGSHIP ENVELOPE
Dimension tested · local vs cloud
Arena question
Can a local model and a cloud model produce comparable grounded-QA candidates under the same source-unit envelope, and what operational reliability penalty appears?
Lanes
Primary metrics
grounding, evidence_extraction, answerability
Secondary metrics: faithfulness, answer_relevancy, selected_count, repair count, failure artifact count
Observed per-lane means
| metric | cloud_openrouter_gemini | local_ollama_qwen | spread |
|---|---|---|---|
| grounding | 0.993 | 0.994 | 0.0009 |
| evidence_extraction | 0.963 | 0.964 | 0.0009 |
| answerability | 0.963 | 0.964 | 0.0009 |
The best lane per metric (highlighted in oxblood on the site) is local_ollama_qwen for all three primary metrics. Spread = max − min across lanes.
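The spread and best-lane columns can be recomputed from the table with a few lines of Python. This is a minimal sketch, not the report's own code: the dictionary layout and the `spread_and_best` helper are assumptions made for illustration. Because the displayed means are rounded to three decimals, recomputing from them gives roughly 0.001 rather than the reported 0.0009, which presumably comes from unrounded means.

```python
# Per-lane means as displayed in the table (already rounded to three decimals).
lane_means = {
    "grounding": {"cloud_openrouter_gemini": 0.993, "local_ollama_qwen": 0.994},
    "evidence_extraction": {"cloud_openrouter_gemini": 0.963, "local_ollama_qwen": 0.964},
    "answerability": {"cloud_openrouter_gemini": 0.963, "local_ollama_qwen": 0.964},
}


def spread_and_best(means_by_lane):
    """Return (spread, best_lane) for one metric, where spread = max - min."""
    best_lane = max(means_by_lane, key=means_by_lane.get)
    spread = max(means_by_lane.values()) - min(means_by_lane.values())
    return spread, best_lane


# Hypothetical summary structure: metric name -> (spread, best lane).
summary = {metric: spread_and_best(lanes) for metric, lanes in lane_means.items()}

for metric, (spread, best) in summary.items():
    print(f"{metric}: spread={spread:.4f}, best={best}")
```

With only two lanes the spread is just the absolute difference between them; the max/min form is kept so the same helper works if more lanes are added.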
Operational signals
9 failure/repair artifacts were recorded for this run.
Both the local Qwen arm and the cloud Gemini arm reached 50 kept candidates, but the local arm needed resume/retry repair from raw failed outputs.
This dimension is mainly about execution posture. The cloud arm completed more smoothly; the local arm produced comparable selected evidence only after QA Arena reused successful work and repaired failed Qwen outputs, without padding or regenerating unrelated data. That history matters: local execution is attractive for cost and privacy, but this run shows it needs stricter operational handling.
The result supports a completed local-vs-cloud comparison inside this envelope, not a universal model-quality claim. The website should show the comparison with the repair note visible: local can work, but reliability and retry discipline are part of the result.
The local Qwen arm hit transport timeouts on the first attempt (-001, partially_observed). The resume/retry pass re-bound 36 candidates from raw failure artifacts to reach the 50/arm floor; no new local model calls were made.
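The resume/retry behaviour described above can be sketched roughly as follows. This is an illustrative sketch, not QA Arena's implementation: `recover_candidate`, `resume_arm`, the JSON format of the raw outputs, and the question/answer validity check are all hypothetical. The demo numbers (14 previously kept, 40 raw outputs with 4 unusable) are chosen only so the arithmetic lands on 36 recovered and a floor of 50; the source does not state how many candidates the first attempt kept.

```python
import json

FLOOR = 50  # kept candidates required per arm, per the run description


def recover_candidate(raw_output):
    """Parse one raw failed output into a candidate dict, or return None if unusable."""
    try:
        candidate = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return None
    # Hypothetical validity check: a usable candidate has a question and an answer.
    if isinstance(candidate, dict) and candidate.get("question") and candidate.get("answer"):
        return candidate
    return None


def resume_arm(kept, raw_failures, floor=FLOOR):
    """Reuse already-kept candidates, then recover from raw failures up to the floor.

    No model calls are made and nothing is padded: if too few raw outputs are
    recoverable, the arm simply ends below the floor.
    """
    result = list(kept)
    recovered = 0
    for raw in raw_failures:
        if len(result) >= floor:
            break
        candidate = recover_candidate(raw)
        if candidate is not None:
            result.append(candidate)
            recovered += 1
    return result, recovered


# Illustrative data only: 14 candidates kept earlier, 40 raw outputs, every 10th truncated.
kept_first_attempt = [{"question": f"q{i}", "answer": f"a{i}"} for i in range(14)]
raw_failed_outputs = [
    '{"question": "q' if i % 10 == 0
    else json.dumps({"question": f"rq{i}", "answer": f"ra{i}"})
    for i in range(40)
]

final_candidates, recovered_count = resume_arm(kept_first_attempt, raw_failed_outputs)
print(len(final_candidates), recovered_count)
```

Returning the recovered count alongside the candidates keeps the repair visible in the result, which matches the report's point that retry discipline is part of what was measured.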
Experiment spec
QA envelope
Scores
| metric | mean | range | n |
|---|---|---|---|
| grounding | 0.951 | 0.909–0.991 | 52 |
| faithfulness | 0.944 | 0.904–0.978 | 52 |
| evidence_extraction | 0.909 | 0.870–0.949 | 52 |
| answerability | 0.941 | 0.901–0.981 | 52 |
| answer_relevancy | 0.935 | 0.888–0.973 | 52 |