Main Results
This page summarizes the current paper evaluation state:
- evaluation_v2: core bundle under output/paper/evaluation_v2/
- evaluation_v3: strengthening bundle under output/paper/evaluation_v3/
Evaluation Design
The current paper-facing evaluation is organized around one primary claim population and several supporting studies:
- E1: MaxCut ranking, the primary heterogeneous claim population
- E2: GHZ structural ranking calibration
- E3: Bernstein-Vazirani decision calibration
- E4: Grover distribution fragility case
- E5: multi-policy cost/agreement study on an expanded 495-configuration grid
- S1: backend-conditioned transpile-only structural portability
- S2: boundary-stress pack
- QEC: repetition-code portability illustration
The core experiments (E1-E4, S2, QEC) are evaluated over small exact scopes:
- compilation_only_exact
- sampling_only_exact
- combined_light_exact
E5 uses an expanded sampling_policy_eval space to make cost tradeoffs non-trivial.
The strengthening bundle adds:
- W1: second-family extensions (VQE/H2 pilot and Max-2-SAT/QAOA)
- W3: stronger metric-centric baselines for RQ1
- W4: admissibility-study checklist plus human-rating summary scaffold
- W5: near-boundary policy pack
Headline Findings
- E1 yields 25 unstable, 2 stable, 0 inconclusive claim/scope/delta variants.
- E2 yields 4 stable, 8 inconclusive.
- E3 yields 4 stable.
- E4 yields 4 unstable.
- S2 yields 16 unstable.
- QEC yields 3 stable, 1 unstable.
- E5 shows perfect agreement (1.0) for all evaluated policies against full_factorial.
RQ1: Why Claim-Centric Validation Is Needed
E1 is the strongest necessity signal. Fragility is not a rare edge case in the main MaxCut population: almost every comparative claim variant becomes unstable under admissible perturbations.
By scope, the E1 verdict distribution is:
- compilation_only_exact: 2 stable / 7 unstable
- sampling_only_exact: 0 stable / 9 unstable
- combined_light_exact: 0 stable / 9 unstable
By delta, the distribution is:
- delta = 0.00: 1 stable / 8 unstable
- delta = 0.01: 1 stable / 8 unstable
- delta = 0.05: 0 stable / 9 unstable
The strongest baseline mismatch in the current rerun comes from conventional metric reporting rather than from the stored single-run baseline. Using a fixed 5-run metric summary, 9/9 apparently consistent advantages are still classified as unstable by ClaimStab-QC, yielding a metric-based false-reassurance rate of 1.0.
The representative mismatch case is:
- claim: QAOA_p2 > QAOA_p1
- scope: compilation_only_exact, delta = 0.05
- metric view: mean_diff = 0.1190, 95% CI = [0.1059, 0.1321]
- claim-centric view: stability_hat = 0.7897, 95% CI = [0.7598, 0.8169], verdict = unstable
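To make the contrast concrete, the sketch below computes both views for a single claim variant. This is a minimal illustration, not the ClaimStab-QC implementation: the helper names, the normal-approximation metric CI, and the Wilson interval for stability_hat are all assumptions.

```python
# Minimal sketch of the two views for one claim variant.
# Helper names and interval choices are illustrative assumptions.
import math
from statistics import mean, stdev

def metric_view(run_diffs, z=1.96):
    """Fixed-run metric summary: mean difference with a
    normal-approximation CI (a t-interval would be more appropriate
    for only 5 runs, but the shape of the argument is the same)."""
    m = mean(run_diffs)
    half = z * stdev(run_diffs) / math.sqrt(len(run_diffs))
    return m, (m - half, m + half)

def claim_centric_view(holds, tau=0.95, z=1.96):
    """holds: one bool per admissible perturbed configuration,
    True if the claimed ranking (with its delta margin) survives.
    Returns stability_hat, a Wilson score interval, and a verdict."""
    n, k = len(holds), sum(holds)
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    lo, hi = center - half, center + half
    if lo >= tau:
        verdict = "stable"
    elif hi < tau:
        verdict = "unstable"
    else:
        verdict = "inconclusive"
    return p, (lo, hi), verdict
```

A positive metric CI (mean_diff well above zero) and an unstable verdict are not contradictory: the first summarizes repeated runs of one configuration, while the second counts how often the ranking survives across the declared perturbation scope.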
This is the main empirical reason the project argues for claim-centric validation rather than metric-centric reporting alone.
RQ2: Semantic Discrimination Across Claim Families
ClaimStab-QC does not collapse to a trivial all-unstable pattern.
Aggregated over the completed evaluation_v2 runs:
- ranking: 9 stable / 42 unstable / 8 inconclusive
- decision: 4 stable / 0 unstable / 0 inconclusive
- distribution: 0 stable / 4 unstable / 0 inconclusive
Interpretation:
- E3 is the clearest stable control: all four BV decision variants are stable.
- E4 is the clearest fragile control: all four Grover distribution variants are unstable.
- E2 is mixed rather than uniformly stable: it becomes inconclusive at higher deltas because the confidence interval overlaps the tau = 0.95 threshold.
- S2 does not show abstention in the current rerun; it collapses into direct fragility (16 unstable).
- QEC is supporting portability evidence only, not a main source of generalization claims.
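The differing behavior across families follows from each family having its own per-configuration "claim holds" predicate. The predicates below are illustrative assumptions about the shape of those checks (names, signatures, and the TVD tolerance are hypothetical), sketched only to show why the three families need not collapse to one verdict pattern.

```python
# Illustrative per-family predicates; names and thresholds are hypothetical.

def ranking_holds(score_a: float, score_b: float, delta: float) -> bool:
    """Ranking claim: A beats B by at least the declared margin."""
    return score_a > score_b + delta

def decision_holds(observed: str, expected: str) -> bool:
    """Decision claim (e.g. Bernstein-Vazirani): the measured
    bitstring still identifies the hidden string."""
    return observed == expected

def distribution_holds(p_emp: dict, p_ideal: dict, eps: float) -> bool:
    """Distribution claim (e.g. Grover output): the empirical distribution
    stays within total-variation distance eps of the ideal one."""
    keys = set(p_emp) | set(p_ideal)
    tvd = 0.5 * sum(abs(p_emp.get(k, 0.0) - p_ideal.get(k, 0.0)) for k in keys)
    return tvd <= eps
```

Under this framing, a binary decision predicate has more slack than a full-distribution tolerance, which is consistent with BV staying stable while Grover's distribution claims break.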
RQ3: Diagnostic Value
The strongest diagnostic evidence comes from the ranking experiments (E1 and S2).
Dominant perturbation drivers follow the declared scope:
- in compilation_only_exact, the main drivers are layout_method and seed_transpiler
- in sampling_only_exact, the main drivers are seed_simulator and shots
- in combined_light_exact, seed_simulator remains dominant, with compilation factors still visible
Driver explanations are reasonably consistent across neighboring unstable variants:
- E1 top-driver consistency: 0.8333
- S2 top-driver consistency: 0.9444
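As a rough illustration of how such a consistency score can be computed, the sketch below takes per-variant driver lists, picks each variant's dominant factor, and measures agreement between neighboring unstable variants. The input shape and the neighbor pairing are assumptions; the actual derivation lives in the derived tables.

```python
# Illustrative top-driver consistency; input shape and pairing are assumed.
from collections import Counter

def top_driver(driver_hits: list[str]) -> str:
    """Most frequent perturbation factor among a variant's failing configs."""
    return Counter(driver_hits).most_common(1)[0][0]

def top_driver_consistency(neighboring_unstable: list[list[str]]) -> float:
    """neighboring_unstable: driver lists ordered so adjacent entries are
    neighboring claim variants (e.g. same claim pair, adjacent deltas).
    Returns the fraction of adjacent pairs sharing the same top driver."""
    tops = [top_driver(d) for d in neighboring_unstable]
    pairs = list(zip(tops, tops[1:]))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else float("nan")
```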
The current artifact does not materialize exact MOS objects, so the derived package reports a conservative explanation-compression proxy instead of exact MOS size:
- E1 median proxy constraint count: 1
- S2 median proxy constraint count: 1
These are useful as compact explanatory witnesses, but they should not be overstated as exact minimal sufficient sets.
RQ4: Cost-efficiency and Practicality
E5 is now a substantive result rather than a placeholder.
Across the 9 available MaxCut ranking claim-pair/delta variants on the expanded 495-configuration grid:
- full_factorial: 495 selected configurations
- random_k_32: 33
- random_k_64: 65
- adaptive_ci: 57
- adaptive_ci_tuned: 17
All five strategies agree with the full_factorial reference on every evaluated variant (agreement = 1.0).
The strongest practical result is therefore:
adaptive_ci_tuned preserves all reference decisions at a fraction of the cost (17 of 495 configurations)
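A minimal sketch of the adaptive-CI idea is shown below: evaluate perturbed configurations one at a time and stop as soon as the running stability interval clears the threshold on either side. The function names, the Wilson interval, and the stopping rule are assumptions about the policy family, not the tool's implementation.

```python
# Minimal adaptive-CI sketch; names and stopping rule are assumptions.
import math
import random

def wilson(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion k/n."""
    p = k / n
    d = 1 + z**2 / n
    c = (p + z**2 / (2 * n)) / d
    h = (z / d) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return c - h, c + h

def adaptive_ci(configs, claim_holds, tau=0.95, min_n=10):
    """Sample configurations until the stability CI lies entirely above
    or entirely below tau; return the verdict and the cost paid."""
    configs = list(configs)          # avoid mutating the caller's grid
    random.shuffle(configs)
    outcomes = []
    for cfg in configs:
        outcomes.append(bool(claim_holds(cfg)))
        n = len(outcomes)
        if n >= min_n:
            lo, hi = wilson(sum(outcomes), n)
            if lo >= tau:
                return "stable", n
            if hi < tau:
                return "unstable", n
    return "inconclusive", len(outcomes)
```

Far from the tau boundary the interval separates quickly, which is why a tuned policy can stop after a few dozen configurations; the W5 results below show the same policy family paying a much larger budget when claims sit near the boundary.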
S1 should be described carefully. The current completed output is a backend-conditioned transpile-only structural portability study, not a full noisy-device claim-centric rerun. Within that controlled scope it is fully stable:
90/90 stable rows across five fake IBM backends and two structural metrics
Strengthening Additions (evaluation_v3)
- W1 VQE/H2 pilot: 15 stable / 2 unstable / 1 inconclusive
- W1 Max-2-SAT: 13 stable / 4 unstable / 1 inconclusive
- W3 matched-scope metric baseline: 9/9 metric-supportive E1 variants remain false reassurance
- W3 sensitivity: the metric false-reassurance rate stays at 1.0 from 10 through 495 sampled configurations on the expanded grid
- W5 near-boundary: adaptive policies remain correct but consume much more budget (adaptive_ci: 57 -> 257; adaptive_ci_tuned: 17 -> 65)
- W4: the repository now includes an 18-item admissibility checklist with author-side reference labels and explicit Q1/Q2/Q3 trigger rules for borderline cases such as noise scaling and 10x shot budgets, plus a human-rating summary pipeline
The default repository state intentionally does not report a submission-facing kappa value. Inter-rater agreement should only be reported after collecting real external ratings; otherwise W4 should be described as a checklist and analysis scaffold, not as completed human-subject evidence.
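For completeness, the agreement statistic that would eventually be reported is standard Cohen's kappa over paired admissibility labels. The sketch below assumes two raters and a shared label set; it is the textbook formula, not output from the W4 pipeline.

```python
# Cohen's kappa for two raters' admissibility labels (textbook formula).
from collections import Counter

def cohens_kappa(rater1: list[str], rater2: list[str]) -> float:
    assert len(rater1) == len(rater2) and rater1
    n = len(rater1)
    p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n       # observed agreement
    c1, c2 = Counter(rater1), Counter(rater2)
    p_exp = sum(c1[l] * c2[l] for l in set(c1) | set(c2)) / n**2  # chance agreement
    return 1.0 if p_exp == 1 else (p_obs - p_exp) / (1 - p_exp)
```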
Scope Caveat
Conditional robustness is not the same as correctness.
All stable, unstable, and inconclusive outcomes reported here are relative to:
- the formalized claim specification
- the declared perturbation scope
- the configured decision threshold and confidence rule
This is especially important for the MaxCut ranking results: a stable verdict means the claim relation is robust under the declared scope, not that the natural-language conclusion has been universally proven true.
Artifact Entry Points
- core summary root: output/paper/evaluation_v2/README.md
- strengthening summary root: output/paper/evaluation_v3/README.md
- core raw runs: output/paper/evaluation_v2/runs/
- strengthening runs: output/paper/evaluation_v3/runs/
- core derived tables: output/paper/evaluation_v2/derived_paper_evaluation/
- strengthening derived tables: output/paper/evaluation_v3/derived_paper_evaluation/
- strengthening figure pack: output/paper/evaluation_v3/pack/figures/