Quaerens AI Labs Vol. I · Experiment record · MMXXVI

FLAGSHIP ENVELOPE

Flagship Dim Source Shape

Dimension tested · source shape

§ 01 Arena measurement

Arena question

Can the fixed grounded-document QA envelope maintain quality when the source shape changes across policy regulation, technical security controls, and code/API documentation?

Lanes

lane                        candidates  selected
code_api_software_doc       54          50
policy_regulatory           54          50
technical_security_control  54          50

Primary metrics

grounding, faithfulness, evidence_extraction

Secondary metrics: answerability, answer_relevancy, selected_count, duplicate rejection

Observed per-lane means

metric               code_api_software_doc  policy_regulatory  technical_security_control  spread
grounding            0.971*                 0.966              0.967                       0.0053
faithfulness         0.961*                 0.958              0.960                       0.0031
evidence_extraction  0.918                  0.919              0.920*                      0.0021

* Best lane per metric. Spread = max − min across lanes.
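
The spread definition can be checked with a few lines of Python. This is a minimal sketch, not the arena's own tooling: the dictionary layout and the helper names `spread` and `best_lane` are assumptions made for illustration, and the inputs are the three-decimal lane means from the table. Because those means are already rounded, the recomputed spreads match the reported four-decimal values only to three decimals, which suggests the reported spreads were taken from unrounded means.

```python
# Per-lane means copied from the table (rounded to three decimals).
LANE_MEANS = {
    "grounding": {
        "code_api_software_doc": 0.971,
        "policy_regulatory": 0.966,
        "technical_security_control": 0.967,
    },
    "faithfulness": {
        "code_api_software_doc": 0.961,
        "policy_regulatory": 0.958,
        "technical_security_control": 0.960,
    },
    "evidence_extraction": {
        "code_api_software_doc": 0.918,
        "policy_regulatory": 0.919,
        "technical_security_control": 0.920,
    },
}


def spread(per_lane: dict[str, float]) -> float:
    """Spread = max - min across lanes (hypothetical helper)."""
    values = per_lane.values()
    return max(values) - min(values)


def best_lane(per_lane: dict[str, float]) -> str:
    """Lane with the highest mean for one metric (hypothetical helper)."""
    return max(per_lane, key=per_lane.get)


# Map each metric to (best lane, spread rounded to four decimals).
summary = {
    metric: (best_lane(per_lane), round(spread(per_lane), 4))
    for metric, per_lane in LANE_MEANS.items()
}

for metric, (lane, value) in summary.items():
    print(f"{metric}: best={lane} spread={value}")
```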

§ 02 Interpretation

The same grounded-QA pipeline produced usable selected data across policy, technical-control, and code-documentation sources, with 50 of 54 candidates kept in each arm (lane) and no collapse in evaluation discrimination.

This dimension is best read as robustness evidence. Changing the source shape did not break the agentic source-unit workflow: each arm produced literal-evidence candidates, per-unit evaluation prompts, and a clean contract audit. The arm scores were close, so the result does not support a strong claim that one source shape is intrinsically better for grounded QA. It shows that this fixed envelope can carry the same QA task across three very different document styles.

The practical takeaway is that source shape is a real dimension to track, but the current run argues more for cross-source reliability than for winner selection. A stronger claim would need more source families, independent evaluation, or statistical analysis of the small score differences.
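
One way to approach the statistical analysis mentioned above is a two-sided permutation test on the difference in mean score between two lanes. The sketch below is an illustration under assumptions, not part of the recorded experiment: it presumes per-candidate scores are available for each lane (this record publishes only means), and the function name `permutation_p_value` and its parameters are hypothetical.

```python
import random
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def permutation_p_value(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided permutation test on the difference of means.

    Returns the share of random relabelings whose absolute mean
    difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # seeded so repeated runs agree
    observed = abs(_mean(a) - _mean(b))
    pooled = list(a) + list(b)
    split = len(a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:split]) - _mean(pooled[split:]))
        # Small tolerance guards against float noise on exact ties.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)
```

Calling `permutation_p_value(lane_a_scores, lane_b_scores)` with the 50 selected per-candidate scores of two lanes would give a p-value for their mean difference. With three lanes and three primary metrics there are nine pairwise comparisons, so a multiple-comparison correction would also be needed before reading any single result as significant.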

Status: complete
§ 03 Planning

Experiment spec

Tools: agent_skill
Budget: Not specified
Judge: taxonomy_agent (agent_skill)
§ 04 Envelope & scores

QA envelope

Use case: c_axis:C2_collect
Artifact contract: answer.qa_pair.grounded.v1
Metric set: answer.grounded.v1
Operator: codex
Models: agent_internal
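
The idea of a fixed envelope with a varying source shape can be pictured as an immutable configuration object reused across lanes. The following is a hedged sketch only: the `QAEnvelope` class, its field names, and `plan_runs` are hypothetical and do not describe the actual Quaerens tooling. The field values are the envelope entries listed above and the lane names come from § 01.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the envelope cannot change between lanes
class QAEnvelope:
    use_case: str
    artifact_contract: str
    metric_set: str
    operator: str
    models: str


# Values copied from the envelope listing; the structure is an assumption.
ENVELOPE = QAEnvelope(
    use_case="c_axis:C2_collect",
    artifact_contract="answer.qa_pair.grounded.v1",
    metric_set="answer.grounded.v1",
    operator="codex",
    models="agent_internal",
)

LANES = (
    "code_api_software_doc",
    "policy_regulatory",
    "technical_security_control",
)


def plan_runs(envelope: QAEnvelope, lanes: tuple[str, ...]) -> list[dict]:
    """Pair the same envelope with each lane; only the lane varies."""
    return [{"lane": lane, "envelope": envelope} for lane in lanes]


runs = plan_runs(ENVELOPE, LANES)
for run in runs:
    print(run["lane"], run["envelope"].metric_set)
```

Freezing the dataclass makes the "only source shape varies" design explicit: any attempt to alter the envelope for one lane raises an error instead of silently changing the comparison.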

Scores

metric               mean   range        n
grounding            0.945  0.918–0.970  54
faithfulness         0.921  0.894–0.949  54
evidence_extraction  0.877  0.850–0.905  54
answerability        0.915  0.888–0.943  54
answer_relevancy     0.901  0.874–0.929  54
© 2026 Quaerens AI Labs / Mario Lanzillotta