Quaerens AI Labs Vol. I · Inquiry as Method · MMXXVI


Methodology

How a claim earns the word evidence.

The envelope, the dimensions we vary, the Strategy Execution Test, how results are interpreted, and the maturity scale that bounds every claim.

§ 01

What is a QA envelope?

A QA envelope is the controlled experimental frame around a use case. It fixes the use case, the source family, the artifact contract, and the candidate floor, then varies one dimension at a time. Claims are valid only inside the envelope tested.

An envelope communicates:

“Under these conditions, this strategy worked better than the alternatives we tested.”
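
One way to picture the envelope is as an immutable record whose fields are the four fixed elements named above. This is a minimal sketch, not the lab's implementation: the class name, the field types, and the `scoped_claim` helper are all assumptions made for illustration, and treating the candidate floor as an integer count is a guess.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    """Fixed frame for a set of runs. Field types are assumptions."""
    use_case: str
    source_family: str
    artifact_contract: tuple[str, ...]  # hypothetical: names of required artifacts
    candidate_floor: int                # hypothetical: minimum candidates per arm


def scoped_claim(envelope: Envelope, winner: str) -> str:
    """Phrase a result so it cannot be read outside the envelope tested."""
    return (
        f"In the envelope ({envelope.use_case}, {envelope.source_family}), "
        f"{winner} worked better than the alternatives we tested."
    )


# Illustrative values only; none of these come from a real run.
env = Envelope(
    use_case="compliance QA",
    source_family="long policy docs",
    artifact_contract=("qa_pairs", "citations"),
    candidate_floor=20,
)
print(scoped_claim(env, "strategy Y"))
```

Freezing the record means the frame cannot drift mid-experiment: a different use case or source family has to be a new envelope, with its own claims.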

§ 02

Why global leaderboards mislead

A strategy that wins for compliance QA on long policy docs may lose for medical QA on short clinical notes. A strategy that wins under one evaluator may tie under another. Global “best strategy” claims hide the dimensions that actually changed the result.

Fig. 1: same tools, different envelopes, opposite winners. (Notebook diagram: a global leaderboard hides context; two QA envelopes flip which strategy works.)

§ 03

Dimensions that can change a result

Source enrichment — raw vs summary vs clause map

Model & runtime — local vs hosted vs frontier

Provider or tool — which tool as a component

Research strategy — a paper turned into an arm

Evaluation suite — generic vs grounded judge

Budget & scale — more candidates, more slices
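
Varying one dimension at a time can be sketched as generating arms that differ from a baseline in exactly one key. The dictionary keys mirror the six dimensions listed above; the baseline values, the function names, and the idea of representing an arm as a plain dictionary are assumptions for illustration.

```python
# Hypothetical baseline configuration; the values are placeholders.
BASELINE = {
    "source_enrichment": "raw",
    "model_runtime": "local",
    "provider_tool": "tool_a",
    "research_strategy": "baseline",
    "evaluation_suite": "generic_judge",
    "budget_scale": "small",
}


def vary_one(baseline: dict, dimension: str, values: list) -> list:
    """Return arms that differ from the baseline only in `dimension`."""
    if dimension not in baseline:
        raise KeyError(f"unknown dimension: {dimension}")
    return [
        {**baseline, dimension: value}
        for value in values
        if value != baseline[dimension]  # the baseline itself is not a new arm
    ]


def changed_dimensions(a: dict, b: dict) -> list:
    """List the dimensions on which two arms disagree."""
    return sorted(key for key in a if a[key] != b[key])


arms = vary_one(BASELINE, "source_enrichment", ["raw", "summary", "clause_map"])
for arm in arms:
    # Each arm should differ from the baseline in one dimension only.
    print(arm["source_enrichment"], changed_dimensions(BASELINE, arm))
```

If `changed_dimensions` ever returns more than one name for a pair of arms, a difference in results between them cannot be attributed to a single dimension.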

§ 04

Strategy Execution Tests

Each run carries a pre-registered statement of what it must demonstrate and what would count as failure. After the run, an audit closes it as one of four states.

Observed

Produced the artifacts the contract required.

Partial

Some arms met the floor; others fell short.

Not observed

Did not produce the required artifacts.

Blocked

Hit a boundary — recorded as evidence about it.
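
The four closing states can be sketched as an enum plus a small audit function. This is an illustration under stated assumptions: each arm is reduced to a single boolean meaning "produced the required artifacts and met the floor", Blocked is given precedence over the other states, and the names `Outcome` and `close_run` are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    OBSERVED = "Observed"
    PARTIAL = "Partial"
    NOT_OBSERVED = "Not observed"
    BLOCKED = "Blocked"


def close_run(arms_met_floor: list[bool], hit_boundary: bool = False) -> Outcome:
    """Close a run as one of the four states.

    Assumption: a boundary hit takes precedence, since the run then
    yields evidence about the boundary rather than about the strategy.
    """
    if hit_boundary:
        return Outcome.BLOCKED
    if arms_met_floor and all(arms_met_floor):
        return Outcome.OBSERVED
    if any(arms_met_floor):
        return Outcome.PARTIAL
    # No arm succeeded, or there were no arms at all.
    return Outcome.NOT_OBSERVED


print(close_run([True, True]).value)
print(close_run([True, False]).value)
print(close_run([False, False]).value)
print(close_run([True, True], hit_boundary=True).value)
```

Because the statement of success and failure is written before the run, the audit only has to map recorded facts onto a state; it does not get to redefine what counted as success.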

§ 05

Evidence maturity scale

Five levels. Maturity is not output quality — a pilot can produce high-quality outputs and still be weak evidence because the sample is small. Everything on this site is currently Pilot.

Illustrative: single small run, narrative only.

Pilot: real components, small N. Hypothesis-generating.

Credible: repeated runs or slices. Scoped claim defensible.

Validated: adequate sample, independent judge, matched baseline.

Robust: holds across envelopes. Target, not yet reached.
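
The scale can be sketched as an ordered enum with a function that reads a level off facts about how the evidence was gathered. Treating the levels as a cumulative ladder, where each level requires everything below it, is an assumption, as are the function and parameter names. Note that the function takes no measure of output quality, which matches the distinction drawn above.

```python
from enum import IntEnum


class Maturity(IntEnum):
    ILLUSTRATIVE = 1
    PILOT = 2
    CREDIBLE = 3
    VALIDATED = 4
    ROBUST = 5


def maturity(
    *,
    real_components: bool = False,
    repeated_runs_or_slices: bool = False,
    adequate_sample: bool = False,
    independent_judge: bool = False,
    matched_baseline: bool = False,
    holds_across_envelopes: bool = False,
) -> Maturity:
    """Climb the ladder until a requirement is missing (assumed cumulative)."""
    if not real_components:
        return Maturity.ILLUSTRATIVE
    if not repeated_runs_or_slices:
        return Maturity.PILOT
    if not (adequate_sample and independent_judge and matched_baseline):
        return Maturity.CREDIBLE
    if not holds_across_envelopes:
        return Maturity.VALIDATED
    return Maturity.ROBUST


# A run with real components but a single small sample stays at Pilot,
# however good its outputs look.
level = maturity(real_components=True)
print(level.name, level < Maturity.CREDIBLE)
```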

§ 06

What we will and won’t claim

Allowed

“In this envelope, strategy Y performed better on the tested slice.”

“This pilot suggests Y is worth scaling, but the sample is small.”

“Changing the evaluator changed which arm looked best.”

Not allowed

“Tool X is best.” · “Strategy Y is best.”

Any global ranking not scoped to an envelope.

Any statistical-significance claim — no p-values here.
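
The two mechanical rules above (every claim scoped to an envelope, no statistical-significance language) could be checked with a small lint. This is a rough sketch only: the `Claim` record, the `envelope_id` field, and the regular expression are assumptions, and a pattern match cannot catch an unscoped ranking that happens to carry an envelope label.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Heuristic pattern for significance language; deliberately simple.
SIGNIFICANCE = re.compile(
    r"\bp\s*[<=>]|p-value|statistical(ly)?[\s-]+significan",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class Claim:
    text: str
    envelope_id: Optional[str]  # hypothetical identifier; None means unscoped


def violations(claim: Claim) -> list[str]:
    """Return the rules a claim breaks; an empty list means none were found."""
    problems = []
    if claim.envelope_id is None:
        problems.append("not scoped to an envelope")
    if SIGNIFICANCE.search(claim.text):
        problems.append("statistical-significance language")
    return problems


examples = [
    Claim("In this envelope, strategy Y performed better on the tested slice.", "env-1"),
    Claim("Tool X is best.", None),
    Claim("Strategy Y beat the baseline, p < 0.05.", "env-1"),
]
for claim in examples:
    print(claim.text, violations(claim))
```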

Quaerens

Evidence over claims · scoped over global

© 2026 Quaerens AI Labs