Quaerens AI Labs Vol. I · Experiment record · MMXXVI

FLAGSHIP ENVELOPE

Flagship Dim Local Vs Cloud

Dimension tested · local vs cloud

§ 01 Arena measurement

Arena question

Can a local model and a cloud model produce comparable grounded-QA candidates under the same source-unit envelope, and what operational reliability penalty appears?

Lanes

lane                     candidates  selected
cloud_openrouter_gemini  54          50
local_ollama_qwen        54          50

Primary metrics

grounding · evidence_extraction · answerability

Secondary metrics: faithfulness, answer_relevancy, selected_count, repair count, failure artifact count

Observed per-lane means

metric               cloud_openrouter_gemini  local_ollama_qwen  spread
grounding            0.993                    0.994              0.0009
evidence_extraction  0.963                    0.964              0.0009
answerability        0.963                    0.964              0.0009

Best lane per metric shown in oxblood. Spread = max − min across lanes.
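
The best-lane and spread columns can be reproduced with a short sketch. The values below are the rounded per-lane means from the table, so the computed spread comes out coarser than the reported 0.0009, which was presumably derived from unrounded means. The names `lane_means` and `best_lane_and_spread` are illustrative and not part of QA Arena.

```python
# Illustrative sketch: names are hypothetical; values are the rounded
# per-lane means shown in the table above.
lane_means = {
    "grounding": {"cloud_openrouter_gemini": 0.993, "local_ollama_qwen": 0.994},
    "evidence_extraction": {"cloud_openrouter_gemini": 0.963, "local_ollama_qwen": 0.964},
    "answerability": {"cloud_openrouter_gemini": 0.963, "local_ollama_qwen": 0.964},
}


def best_lane_and_spread(per_lane):
    """Return (best lane name, max - min) for one metric's per-lane means."""
    best = max(per_lane, key=per_lane.get)
    spread = max(per_lane.values()) - min(per_lane.values())
    return best, spread


for metric, per_lane in lane_means.items():
    best, spread = best_lane_and_spread(per_lane)
    print(f"{metric}: best={best} spread={spread:.4f}")
```

With only two lanes the spread is simply the absolute difference between them; the max/min form generalizes to runs with more lanes.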

Operational signals

9 failure / repair artifact(s) recorded for this run.

§ 02 Interpretation

Both the local Qwen arm and the cloud Gemini arm reached 50 kept candidates, but the local arm needed resume/retry repair from raw failed outputs.

This dimension is mainly about execution posture. The cloud arm completed more smoothly; the local arm eventually produced comparable selected evidence only after QA Arena reused successful work and repaired failed Qwen outputs without padding or regenerating unrelated data. That history matters because local execution is attractive for cost and privacy, but the run shows it needs stricter operational handling.

The result supports a completed local-vs-cloud comparison inside this envelope, not a universal model-quality claim. The website should show the comparison with the repair note visible: local can work, but reliability and retry discipline are part of the result.

complete
§ 03 Limits & next step

The local Qwen arm hit transport timeouts on the first attempt (-001, partially_observed). The resume/retry pass rebound 36 candidates from raw failure artifacts to reach the floor of 50 candidates per arm; no new local model calls were made.
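
A minimal sketch of what a resume pass of this kind might look like, assuming each raw failure artifact is a JSON-lines string whose well-formed lines carry `question`, `answer`, and `evidence` fields. The artifact format, the field names, and the `rebind_candidates` helper are assumptions for illustration, not QA Arena's actual implementation. The point it shows is that candidates are recovered only from output that already exists: no model is called, and nothing is padded if the floor cannot be reached.

```python
import json

# Assumed candidate schema (hypothetical field names).
REQUIRED_FIELDS = ("question", "answer", "evidence")


def rebind_candidates(kept, raw_artifacts, floor):
    """Recover well-formed candidates from raw failed outputs.

    kept          -- candidates that already succeeded (reused unchanged)
    raw_artifacts -- raw output strings saved from failed attempts
    floor         -- target number of candidates for the arm
    """
    recovered = list(kept)
    seen = {c["question"] for c in recovered}
    for raw in raw_artifacts:
        for line in raw.splitlines():
            if len(recovered) >= floor:
                return recovered
            try:
                cand = json.loads(line)
            except json.JSONDecodeError:
                continue  # truncated or malformed line, e.g. cut off by a timeout
            if not isinstance(cand, dict):
                continue
            # Every required field must be a non-empty string.
            if not all(isinstance(cand.get(f), str) and cand.get(f) for f in REQUIRED_FIELDS):
                continue
            if cand["question"] in seen:
                continue  # do not count the same candidate twice
            seen.add(cand["question"])
            recovered.append(cand)
    # May still be below the floor: the caller decides what to do, no padding here.
    return recovered
```

Returning a short list instead of padding keeps the shortfall visible, which matches the record's emphasis on reporting repair history alongside the scores.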

§ 04 Planning

Experiment spec

Tools: agent_skill
Budget: Not specified
Judge: taxonomy_agent (agent_skill)
Source: NIST SP 800-53
§ 05 Envelope & scores

QA envelope

Use case: flagship:grounded-document-qa:local_vs_cloud
Artifact contract: answer.qa_pair.grounded.v1
Metric set: answer.grounded.v1
Operator: nico
Models: agent_internal, qwen3.5:35b, google/gemini-2.5-flash

Scores

metric               mean   range        n
grounding            0.951  0.909–0.991  52
faithfulness         0.944  0.904–0.978  52
evidence_extraction  0.909  0.870–0.949  52
answerability        0.941  0.901–0.981  52
answer_relevancy     0.935  0.888–0.973  52
Evidence over claims · scoped over global
© 2026 Quaerens AI Labs / Mario Lanzillotta