Flagship Dim Local Vs Cloud

Dimension tested · local vs cloud

§ 01 Arena measurement

Arena question

Can a local model and a cloud model produce comparable grounded-QA candidates under the same source-unit envelope, and what operational reliability penalty appears?

Lanes

cloud_openrouter_gemini

candidates 54

selected 50

local_ollama_qwen

candidates 54

selected 50

Primary metrics

grounding, evidence_extraction, answerability

Secondary metrics: faithfulness, answer_relevancy, selected_count, repair count, failure artifact count

Observed per-lane means

metric	cloud_openrouter_gemini	local_ollama_qwen	spread
grounding	0.993	0.994	0.0009
evidence_extraction	0.963	0.964	0.0009
answerability	0.963	0.964	0.0009

Best lane per metric is shown in oxblood on the original page; in the table above that is local_ollama_qwen for all three metrics. Spread = max − min across lanes.
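The spread and best-lane columns follow directly from the definition above. Here is a minimal Python sketch that recomputes them from the rounded means shown in the table; the names `LANE_MEANS`, `spread`, and `best_lane` are illustrative and not part of QA Arena. Because the displayed means are rounded to three decimals, recomputing from them gives 0.0010 rather than the reported 0.0009, which presumably was computed from unrounded values.

```python
# Per-lane means as displayed in the table (rounded to three decimals).
LANE_MEANS = {
    "grounding": {
        "cloud_openrouter_gemini": 0.993,
        "local_ollama_qwen": 0.994,
    },
    "evidence_extraction": {
        "cloud_openrouter_gemini": 0.963,
        "local_ollama_qwen": 0.964,
    },
    "answerability": {
        "cloud_openrouter_gemini": 0.963,
        "local_ollama_qwen": 0.964,
    },
}


def spread(lane_scores):
    """Spread = max - min across lanes for one metric."""
    values = lane_scores.values()
    return max(values) - min(values)


def best_lane(lane_scores):
    """Name of the lane with the highest mean for one metric."""
    return max(lane_scores, key=lane_scores.get)


if __name__ == "__main__":
    for metric, lanes in LANE_MEANS.items():
        print(f"{metric}: best={best_lane(lanes)} spread={spread(lanes):.4f}")
```

A spread this small, on the order of 0.001, is well inside what rounding alone can produce, so the per-lane means are better read as a tie than as a ranking.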

Operational signals

9 failure/repair artifacts were recorded for this run.

§ 02 Interpretation

Both the local Qwen arm and the cloud Gemini arm reached 50 kept candidates, but the local arm needed resume/retry repair from raw failed outputs.

This dimension is mainly about execution posture. The cloud arm completed more smoothly; the local arm eventually produced comparable selected evidence only after QA Arena reused successful work and repaired failed Qwen outputs without padding or regenerating unrelated data. That history matters because local execution is attractive for cost and privacy, but the run shows it needs stricter operational handling.

The result supports a completed local-vs-cloud comparison inside this envelope, not a universal model-quality claim. The website should show the comparison with the repair note visible: local can work, but reliability and retry discipline are part of the result.

Status: complete

§ 03 Limits & next step

The local Qwen arm hit transport timeouts on the first attempt (-001, partially_observed). The resume/retry pass re-bound 36 candidates from raw failure artifacts to reach the floor of 50 per arm; no new local model calls were made.
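The resume/retry idea described above can be sketched in Python as follows. This is a hypothetical illustration, not QA Arena's actual implementation: the artifact shape (a dict with a `raw_output` string), the JSON candidate format with `question` and `answer` keys, and the function names `recover_candidates` and `resume_lane` are all assumptions. The sketch reuses already-kept candidates, parses raw failed outputs for recoverable ones, and never calls the model or pads the result.

```python
import json


def recover_candidates(raw_output):
    """Parse one raw failed output; return well-formed candidates, or []."""
    try:
        payload = json.loads(raw_output)
    except (json.JSONDecodeError, TypeError):
        return []  # unusable artifact, e.g. a timeout message with no body
    if not isinstance(payload, list):
        return []
    return [
        c for c in payload
        if isinstance(c, dict) and c.get("question") and c.get("answer")
    ]


def resume_lane(kept, failure_artifacts, floor):
    """Top up `kept` from failure artifacts until `floor` is reached.

    No model calls are made and nothing is padded: if the artifacts do not
    contain enough recoverable candidates, the result stays below the floor.
    """
    result = list(kept)  # reuse successful work unchanged
    seen = {c["question"] for c in result}
    for artifact in failure_artifacts:
        for cand in recover_candidates(artifact.get("raw_output", "")):
            if len(result) >= floor:
                return result
            if cand["question"] not in seen:  # skip duplicates
                seen.add(cand["question"])
                result.append(cand)
    return result
```

Returning a short list instead of padding keeps a shortfall visible, which matches the report's point that the repair history is part of the result.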

§ 04 Planning

Experiment spec

Tools

agent_skill

Budget

Not specified

Judge

taxonomy_agent (agent_skill)

Source: NIST SP 800-53

§ 05 Envelope & scores

QA envelope

Use case

flagship:grounded-document-qa:local_vs_cloud

Artifact contract

answer.qa_pair.grounded.v1

Metric set

answer.grounded.v1

Operator

nico

Models

agent_internal, qwen3.5:35b, google/gemini-2.5-flash

Scores

metric	mean	range	n
grounding	0.951	0.909–0.991	52
faithfulness	0.944	0.904–0.978	52
evidence_extraction	0.909	0.870–0.949	52
answerability	0.941	0.901–0.981	52
answer_relevancy	0.935	0.888–0.973	52

Quaerens

Evidence over claims · scoped over global

© 2026 Quaerens AI Labs / Mario Lanzillotta