Quaerens AI Labs Vol. I · Inquiry as Method · MMXXVI
Flagship · QA envelope

Grounded Document QA.

One use case, many possible QA strategies. We fix the goal — generate trustworthy QAs from source documents — and explore the dimensions that may change the result.

Audit: observed · 2026-06-13 · Evidence stage: Pilot

Fig. 1: the worked envelope. Notebook diagram of the Grounded Document QA envelope: NIST/policy source, six dimension levers, grounded QAs + evidence refs + scale decision.

§ 01 The user use case

A team has source documents and wants grounded question-answer pairs. The QAs must be answerable from the source, faithful to the source, and useful for evaluation, training, compliance, or RAG testing.
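
As a minimal sketch of what "answerable from the source" could mean in practice, the block below models one grounded QA record and a weak structural check on its evidence. The field names, the character-offset reference scheme, and the helper functions are assumptions made for illustration; the page does not specify the layout of the artifact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceRef:
    """Hypothetical pointer to the source span that supports an answer."""
    doc_id: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive


@dataclass(frozen=True)
class QAPairWithRefs:
    """Hypothetical shape for one grounded QA record (assumed fields)."""
    question: str
    answer: str
    refs: tuple[EvidenceRef, ...] = ()


def cited_text(pair, sources):
    """Return the source text that each evidence ref points at."""
    return [sources[r.doc_id][r.start:r.end] for r in pair.refs]


def is_minimally_grounded(pair, sources):
    """Weak structural check: at least one ref, and every ref resolves
    to a non-empty span inside a known document. It does not judge
    whether the span actually supports the answer."""
    if not pair.refs:
        return False
    for r in pair.refs:
        text = sources.get(r.doc_id)
        if text is None or not (0 <= r.start < r.end <= len(text)):
            return False
    return True


# Illustrative data only, not taken from the pilot.
sources = {"policy-1": "Access reviews must be performed quarterly."}
pair = QAPairWithRefs(
    question="How often must access reviews be performed?",
    answer="Quarterly.",
    refs=(EvidenceRef("policy-1", 0, len(sources["policy-1"])),),
)
unreferenced = QAPairWithRefs(question="Who approves?", answer="Unknown.")
```

A check like this only filters out records with missing or dangling references; faithfulness to the source still needs an evaluator.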

§ 02 The QA envelope

The QA envelope is the controlled experimental frame around the use case. It fixes the use case, source, output format, evaluator, and budget, then varies one dimension at a time.

Use case: Grounded Document QA
Source family: Regulatory / policy document
Artifact: qa_pair_with_refs
Floor per arm: 50 candidates
Dimensions varied: 6 (one at a time)
Audit: observed · 2026-06-13

§ 03 Six dimensions explored

Each dimension changes one thing about the envelope, while every other choice stays fixed.
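
One way to read "one thing at a time" is as a one-factor-at-a-time design: start from a baseline configuration and produce one arm per alternative level, leaving every other setting untouched. The sketch below assumes dictionary keys and a choice of baseline that are not stated on this page; only the level names are drawn from the six dimension questions.

```python
# Assumed baseline: which level counts as the baseline is a guess.
BASELINE = {
    "source_shape": "policy",
    "enrichment": "raw",
    "runtime": "cloud",
    "model_tier": "frontier",
    "metric_suite": "routed",
    "candidate_volume": 50,
}

# Alternative levels per dimension, named after the six questions.
ALTERNATIVES = {
    "source_shape": ["technical", "code"],
    "enrichment": ["summary", "clause_map"],
    "runtime": ["local"],
    "model_tier": ["open_weight"],
    "metric_suite": ["generic"],
    "candidate_volume": [100, 200],
}


def one_at_a_time_arms(baseline, alternatives):
    """Return the baseline arm plus one arm per alternative level,
    each differing from the baseline in exactly one dimension."""
    arms = [dict(baseline)]
    for dimension, levels in alternatives.items():
        for level in levels:
            arm = dict(baseline)
            arm[dimension] = level
            arms.append(arm)
    return arms


def changed_dimensions(arm, baseline):
    """List the dimensions where an arm departs from the baseline."""
    return [k for k in baseline if arm[k] != baseline[k]]


arms = one_at_a_time_arms(BASELINE, ALTERNATIVES)
```

The design keeps the number of arms linear in the number of levels, at the cost of not observing interactions between dimensions.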

I. Does the same grounded-QA pipeline hold across policy, technical, and code source documents?

All three source shapes produced comparable grounded candidates; no shape collapsed.
Status: observed

II. Does pre-processing the source (raw vs summary vs clause map) change grounded-QA quality?

All three enrichments produced means within 0.94–0.97: partial confirmation only.
Status: observed

III. Does the same pipeline work on a local Ollama model and a cloud API model?

Local Qwen and cloud Gemini both reached the 50-pair floor; local needed a resume retry.
Status: observed

IV. Does a frontier model meaningfully outperform an open-weight model under the same envelope?

One model per arm; no separation observed at the 50-pair floor.
Status: observed

V. Does the routed metric suite discriminate quality differently from a generic LLM-judge suite?

Routed and generic suites agreed on top candidates for this 150-row set.
Status: null result

VI. Does candidate volume (50 / 100 / 200) change which strategies win, or just how many you keep?

Score means held in a narrow 0.91–0.95 band across all three arms: volume yield, not quality separation.
Status: observed

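
Several of the findings above are stated as arm-level means that sit in a narrow band, together with a 50-candidate floor per arm. A minimal sketch of that bookkeeping follows; the function names and the input layout (a mapping from arm label to a list of per-candidate scores) are assumptions, and the sample scores are synthetic placeholders rather than pilot data.

```python
FLOOR_PER_ARM = 50  # floor stated in the envelope


def summarize_arms(scores_by_arm, floor=FLOOR_PER_ARM):
    """Per arm: candidate count, mean score, and whether the floor is met."""
    summary = {}
    for arm, scores in scores_by_arm.items():
        summary[arm] = {
            "n": len(scores),
            "mean": sum(scores) / len(scores),
            "meets_floor": len(scores) >= floor,
        }
    return summary


def mean_spread(summary):
    """Width of the band that the arm means fall into (max minus min)."""
    means = [row["mean"] for row in summary.values()]
    return max(means) - min(means)


# Synthetic scores for illustration only.
example = {
    "volume_50": [0.90] * 50,
    "volume_100": [0.95] * 100,
    "short_arm": [0.92] * 40,  # deliberately below the floor
}
example_summary = summarize_arms(example)
```

A small spread on its own does not show that the arms are equivalent; it only says the means did not separate on this sample.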
§ 04 What we know

The current evidence suggests that QA-generation quality depends on more than the generator: source preparation, evaluator choice, model/runtime, and candidate budget can each affect what can be trusted. At pilot scale, though, most arms showed little separation, so the strongest current value of this envelope is not declaring a universal winner but showing which dimensions are worth testing before scale-up.

§ 05 What we do not know yet

The current evidence is still pilot-grade. It does not yet prove a universal best tool, best model, or best enrichment strategy. More source families, repeated runs, and clearer arm-level comparisons are needed before making robust claims.
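
To make "clearer arm-level comparisons" concrete, one common option is a percentile bootstrap interval on the difference between two arm means. This is a sketch of a standard technique, not a description of how the pilot was analyzed; the function name, the resample count, and the seed are assumptions.

```python
import random


def bootstrap_mean_diff(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap for mean(a) - mean(b).

    Returns the observed difference and an approximate
    (1 - alpha) interval built from resampled differences.
    """
    rng = random.Random(seed)  # fixed seed so reruns are reproducible
    diffs = []
    for _ in range(n_resamples):
        resample_a = rng.choices(a, k=len(a))  # sample with replacement
        resample_b = rng.choices(b, k=len(b))
        diffs.append(sum(resample_a) / len(a) - sum(resample_b) / len(b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    observed = sum(a) / len(a) - sum(b) / len(b)
    return observed, (lower, upper)
```

If the interval for two arms comfortably contains zero, a 50-candidate sample gives no basis for calling one arm better; repeated runs would be needed to say more.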

§ 06 What is worth scaling next
  • A second source family (codebase, scientific paper) to retest source enrichment and source shape outside the policy domain.
  • A second model per arm in the frontier-vs-open-weight dimension.
  • A noisier candidate set for the evaluation-suite null result.
  • A larger budget arm (500 candidates) to test the volume-without-quality-collapse finding at scale.
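
For the evaluation-suite item in the list above, one simple way to quantify whether two suites "agree on top candidates" is the share of candidates that both suites place in their top k. The sketch below assumes each suite yields a mapping from candidate id to score; the function names and the tie-breaking rule are illustrative choices, not the pilot's method.

```python
def top_k_ids(scores, k):
    """Ids of the k highest-scoring candidates (ties broken by id)."""
    ranked = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return set(ranked[:k])


def top_k_overlap(scores_a, scores_b, k):
    """Fraction of the top-k set shared by two scoring suites (0.0 to 1.0)."""
    if k <= 0 or k > min(len(scores_a), len(scores_b)):
        raise ValueError("k must be between 1 and the number of candidates")
    shared = top_k_ids(scores_a, k) & top_k_ids(scores_b, k)
    return len(shared) / k


# Made-up scores for illustration only.
routed = {"q1": 0.90, "q2": 0.80, "q3": 0.10}
generic = {"q1": 0.70, "q2": 0.95, "q3": 0.20}
```

On a noisier candidate set, an overlap well below 1.0 at small k would indicate that the two suites discriminate differently.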
© 2026 Quaerens AI Labs