Flagship Dim Source Shape

Dimension tested · source shape

§ 01 Arena measurement

Arena question

Can the fixed grounded-document QA envelope maintain quality when the source shape changes across policy regulation, technical security controls, and code/API documentation?

Lanes

lane	candidates	selected
code_api_software_doc	54	50
policy_regulatory	54	50
technical_security_control	54	50

Primary metrics

grounding, faithfulness, evidence_extraction

Secondary metrics: answerability, answer_relevancy, selected_count, duplicate rejection

Observed per-lane means

metric	code_api_software_doc	policy_regulatory	technical_security_control	spread
grounding	0.971	0.966	0.967	0.0053
faithfulness	0.961	0.958	0.960	0.0031
evidence_extraction	0.918	0.919	0.920	0.0021

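Best lane per metric (shown in oxblood): code_api_software_doc for grounding and faithfulness, technical_security_control for evidence_extraction. Spread = max − min across lanes.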
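The spread definition can be checked against the table with a few lines of Python. This is a minimal sketch that uses only the rounded per-lane means shown above; the names `lane_means`, `spread`, and `best_lane` are illustrative and not part of the arena tooling. Because the inputs are rounded to three decimals, the computed spreads will be slightly coarser than the four-decimal values in the table, which were presumably derived from unrounded means (an assumption).

```python
# Per-lane means copied from the table above (rounded to three decimals).
lane_means = {
    "grounding": {
        "code_api_software_doc": 0.971,
        "policy_regulatory": 0.966,
        "technical_security_control": 0.967,
    },
    "faithfulness": {
        "code_api_software_doc": 0.961,
        "policy_regulatory": 0.958,
        "technical_security_control": 0.960,
    },
    "evidence_extraction": {
        "code_api_software_doc": 0.918,
        "policy_regulatory": 0.919,
        "technical_security_control": 0.920,
    },
}


def spread(by_lane):
    """Spread as defined in the caption: max minus min across lanes."""
    values = list(by_lane.values())
    return max(values) - min(values)


def best_lane(by_lane):
    """Name of the lane with the highest mean for one metric."""
    return max(by_lane, key=by_lane.get)


for metric, by_lane in lane_means.items():
    print(f"{metric}: best={best_lane(by_lane)} spread={spread(by_lane):.3f}")
```

Every spread is well under 0.01, which is the numerical basis for reading the lanes as close rather than ranked.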

§ 02 Interpretation

The same grounded-QA pipeline produced usable selected data across policy, technical-control, and code-documentation sources, with 50 of 54 candidates kept per arm and no collapse in evaluation discrimination.

This dimension is best read as robustness evidence. Changing the source shape did not break the agentic source-unit workflow: each arm produced literal-evidence candidates, per-unit evaluation prompts, and a clean contract audit. The arm scores were close, so the result does not support a strong claim that one source shape is intrinsically better for grounded QA. It shows that this fixed envelope can carry the same QA task across three very different document styles.

The practical takeaway is that source shape is a real dimension to track, but the current run argues more for cross-source reliability than for winner selection. A stronger claim would need more source families, independent evaluation, or statistical analysis of the small score differences.
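One way to approach the statistical analysis mentioned above is a bootstrap confidence interval on the difference between two lane means. The sketch below is illustrative only: the report publishes lane means, not per-item scores, so `lane_a` and `lane_b` are synthetic placeholders drawn from an assumed distribution, and `bootstrap_mean_diff_ci` is a hypothetical helper, not part of the arena pipeline. Its output says nothing about the actual run.

```python
import random
from statistics import mean


def bootstrap_mean_diff_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each lane with replacement, keeping its original size.
        resample_a = rng.choices(scores_a, k=len(scores_a))
        resample_b = rng.choices(scores_b, k=len(scores_b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def synthetic_scores(center, n, rng):
    """SYNTHETIC placeholder scores clipped to [0, 1]; not real arena data."""
    return [min(1.0, max(0.0, rng.gauss(center, 0.03))) for _ in range(n)]


rng = random.Random(42)
lane_a = synthetic_scores(0.971, 50, rng)  # assumed shape, 50 selected items
lane_b = synthetic_scores(0.966, 50, rng)

lower, upper = bootstrap_mean_diff_ci(lane_a, lane_b)
print(f"95% CI for mean difference: [{lower:.4f}, {upper:.4f}]")
```

If an interval computed on the real per-item scores contains zero, the lane difference is not distinguishable from resampling noise at that confidence level, which would match the cautious reading given here.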

complete

§ 03 Planning

Experiment spec

Tools

agent_skill

Budget

Not specified

Judge

taxonomy_agent (agent_skill)

§ 04 Envelope & scores

QA envelope

Use case

c_axis:C2_collect

Artifact contract

answer.qa_pair.grounded.v1

Metric set

answer.grounded.v1

Operator

codex

Models

agent_internal

Scores

metric	mean	range	n
grounding	0.945	0.918–0.970	54
faithfulness	0.921	0.894–0.949	54
evidence_extraction	0.877	0.850–0.905	54
answerability	0.915	0.888–0.943	54
answer_relevancy	0.901	0.874–0.929	54

Quaerens

Evidence over claims · scoped over global


© 2026 Quaerens AI Labs / Mario Lanzillotta