Quaerens AI Labs Vol. I · Inquiry as Method · MMXXVI

Flagship · QA envelope

Grounded Document QA.

One use case, many possible QA strategies. We fix the goal — generate trustworthy QAs from source documents — and explore the dimensions that may change the result.

Audited observed · 2026-06-13 · Evidence stage: Pilot

Notebook diagram — Grounded Document QA envelope: NIST/policy source, six dimension levers, grounded QAs + evidence refs + scale decision

Fig. 1 — the worked envelope

§ 01 The user use case

A team has source documents and wants grounded question-answer pairs. The QAs must be answerable from the source, faithful to the source, and useful for evaluation, training, compliance, or RAG testing.
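The "answerable from the source" requirement can be illustrated with a minimal sketch. The field names below (`section`, `quote`, `refs`) and the verbatim-quote rule are assumptions made for illustration, not the lab's actual schema or evaluator. The check only confirms that each cited quote occurs in the source; it does not judge whether the answer itself is faithful.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceRef:
    """Hypothetical evidence reference: a verbatim quote plus where it came from."""
    section: str
    quote: str


@dataclass
class QAPairWithRefs:
    """Hypothetical shape for a grounded QA pair with evidence references."""
    question: str
    answer: str
    refs: list[EvidenceRef] = field(default_factory=list)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wrapping does not break matching."""
    return " ".join(text.lower().split())


def is_grounded(pair: QAPairWithRefs, source_text: str) -> bool:
    """Pass only if the pair has refs and every quoted span occurs in the source."""
    haystack = normalize(source_text)
    return bool(pair.refs) and all(normalize(r.quote) in haystack for r in pair.refs)


# Toy source and candidates, invented for this example.
source = "Access logs must be retained for\n90 days. Logs are reviewed weekly."
good = QAPairWithRefs(
    question="How long must access logs be retained?",
    answer="90 days.",
    refs=[EvidenceRef(section="3.1", quote="Access logs must be retained for 90 days")],
)
bad = QAPairWithRefs(
    question="Who reviews the logs?",
    answer="The security team.",
    refs=[EvidenceRef(section="3.2", quote="reviewed by the security team")],
)

print(is_grounded(good, source), is_grounded(bad, source))
```

A pair with no references at all is rejected, which keeps unreferenced answers out of the candidate pool by construction.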

§ 02 The QA envelope

The QA envelope is the controlled experimental frame around the use case. It fixes a baseline for the use case, source, output format, evaluator, and budget, then varies one dimension at a time while everything else stays at that baseline.

Use case: Grounded Document QA
Source family: Regulatory / policy document
Artifact: qa_pair_with_refs
Floor per arm: 50 candidates
Dimensions varied: 6 (one at a time)
Audit: observed · 2026-06-13
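One way to picture the envelope is as a frozen baseline record from which each experimental arm differs in at most one field. The sketch below takes the first four values from the envelope above; the `enrichment`, `runtime`, and `evaluator` fields and their baseline values are assumptions added for illustration, as are the function names.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Envelope:
    use_case: str = "Grounded Document QA"
    source_family: str = "Regulatory / policy document"
    artifact: str = "qa_pair_with_refs"
    floor_per_arm: int = 50
    # Assumed baseline settings, not taken from the lab's configuration.
    enrichment: str = "raw"
    runtime: str = "cloud"
    evaluator: str = "routed"


def one_at_a_time(baseline: Envelope, field_name: str, levels: list) -> list[Envelope]:
    """Return one arm per level, changing only field_name from the baseline."""
    return [replace(baseline, **{field_name: level}) for level in levels]


def changed_fields(baseline: Envelope, arm: Envelope) -> list[str]:
    """List the fields where an arm departs from the baseline."""
    return [
        f.name
        for f in fields(baseline)
        if getattr(baseline, f.name) != getattr(arm, f.name)
    ]


BASELINE = Envelope()
arms = one_at_a_time(BASELINE, "enrichment", ["raw", "summary", "clause_map"])

for arm in arms:
    print(arm.enrichment, changed_fields(BASELINE, arm))
```

Freezing the dataclass means an arm cannot drift from the baseline by accident; any difference has to be introduced explicitly through `replace`.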

§ 03 Six dimensions explored

Each dimension changes one thing about the envelope, while every other choice stays fixed.

I.

Source shape

Does the same grounded-QA pipeline hold across policy, technical, and code source documents?

All three source shapes produced comparable grounded candidates; no shape collapsed.

observed

II.

Source enrichment

Does pre-processing the source (raw vs summary vs clause map) change grounded-QA quality?

All three enrichments produced mean scores within 0.94–0.97; partial confirmation only.

observed

III.

Local vs cloud

Does the same pipeline work on a local Ollama model and a cloud API model?

Local Qwen and cloud Gemini both reached the 50-pair floor; local needed a resume retry.

observed

IV.

Frontier vs open source

Does a frontier model meaningfully outperform an open-weight model under the same envelope?

One model per arm; no separation observed at the 50-pair floor.

observed

V.

Evaluation suite

Does the routed metric suite discriminate quality differently from a generic LLM-judge suite?

Routed and generic suites agreed on top candidates for this 150-row set — null result.

null result
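Agreement between two evaluation suites on their top candidates can be quantified as top-k overlap. The sketch below is one simple way to do that, not necessarily the comparison used in the experiment; the candidate ids, scores, and function names are invented for illustration.

```python
def top_k(scores: dict[str, float], k: int) -> set[str]:
    """Return the ids of the k highest-scoring candidates (ties broken by id)."""
    ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return set(ranked[:k])


def top_k_overlap(a: dict[str, float], b: dict[str, float], k: int) -> float:
    """Fraction of the top-k set shared by both suites, between 0.0 and 1.0."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(top_k(a, k) & top_k(b, k)) / k


# Toy scores, not the pilot data.
routed = {"c1": 0.97, "c2": 0.91, "c3": 0.88, "c4": 0.60}
generic = {"c1": 0.95, "c2": 0.93, "c3": 0.70, "c4": 0.72}

print(top_k_overlap(routed, generic, 2))
print(top_k_overlap(routed, generic, 3))
```

An overlap near 1.0 at small k is what a null result looks like under this measure: both suites pick the same winners, so the choice of suite does not change which candidates are kept.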

VI.

Budget & scale

Does candidate volume (50 / 100 / 200) change which strategies win, or just how many you keep?

Score means held in a narrow 0.91–0.95 band across all three arms — volume yield, not quality separation.

observed
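The distinction between volume yield and quality separation can be made concrete by comparing per-arm score means against a minimum gap. The sketch below uses toy scores and an assumed threshold of 0.05; neither comes from the pilot, and the function names are hypothetical.

```python
from statistics import mean


def arm_means(scores_by_arm: dict[str, list[float]]) -> dict[str, float]:
    """Mean score for each budget arm."""
    return {arm: mean(scores) for arm, scores in scores_by_arm.items()}


def band_width(means: dict[str, float]) -> float:
    """Distance between the best and worst arm means."""
    return max(means.values()) - min(means.values())


def separates(means: dict[str, float], min_gap: float) -> bool:
    """True only if the arms differ by at least min_gap (an assumed threshold)."""
    return band_width(means) >= min_gap


# Toy scores keyed by candidate budget, invented for this example.
scores_by_arm = {
    "50": [0.90, 0.92],
    "100": [0.95, 0.93],
    "200": [0.94, 0.96],
}

means = arm_means(scores_by_arm)
print(means)
print(separates(means, min_gap=0.05))
```

With repeated runs, the threshold would more defensibly come from run-to-run variance than from a fixed constant.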

§ 04 What we know

The current evidence suggests QA-generation quality depends on more than the generator. Source preparation, evaluator choice, model/runtime, and candidate budget can all affect what can be trusted. The strongest current value of this envelope is not declaring a universal winner, but showing which dimensions are worth testing before scale-up.

§ 05 What we do not know yet

The current evidence is still pilot-grade. It does not yet prove a universal best tool, best model, or best enrichment strategy. More source families, repeated runs, and clearer arm-level comparisons are needed before making robust claims.

§ 06 What is worth scaling next

A second source family (for example, a codebase or a scientific paper) to retest source enrichment and source shape outside the policy domain.
A second model per arm in the frontier-vs-open-source dimension.
A noisier candidate set for the evaluation-suite null result.
A larger budget arm (500 candidates) to test the volume-without-quality-collapse finding at scale.

Quaerens

Evidence over claims · scoped over global

© 2026 Quaerens AI Labs