
Main Results

This page summarizes the current state of the paper-facing evaluation:

  • evaluation_v2 core bundle under output/paper/evaluation_v2/
  • evaluation_v3 strengthening bundle under output/paper/evaluation_v3/

Evaluation Design

The current paper-facing evaluation is organized around one main battleground and several supporting studies:

  • E1: MaxCut ranking, the primary heterogeneous claim population
  • E2: GHZ structural ranking calibration
  • E3: Bernstein-Vazirani decision calibration
  • E4: Grover distribution fragility case
  • E5: multi-policy cost/agreement study on an expanded 495-configuration grid
  • S1: backend-conditioned transpile-only structural portability
  • S2: boundary-stress pack
  • QEC: repetition-code portability illustration

The core experiments (E1-E4, S2, QEC) are evaluated over small exact scopes:

  • compilation_only_exact
  • sampling_only_exact
  • combined_light_exact

E5 uses an expanded sampling_policy_eval space to make cost tradeoffs non-trivial.
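
As a concrete illustration of what such a scope looks like, the sketch below declares two exact scopes as plain dictionaries of admissible factor values and enumerates their full-factorial configurations. The scope names match the list above, but the factor value lists, the SCOPES dictionary, and the enumerate_configs helper are illustrative assumptions rather than the artifact's actual configuration format.

    from itertools import product

    # Hypothetical scope declarations: each scope lists the admissible values of
    # the perturbation factors it may vary; everything else stays fixed.
    SCOPES = {
        "compilation_only_exact": {
            "layout_method": ["trivial", "dense", "sabre"],
            "seed_transpiler": [11, 17, 23],
        },
        "sampling_only_exact": {
            "seed_simulator": [101, 103, 107],
            "shots": [1024, 2048, 4096],
        },
    }
    # combined_light_exact would merge both factor groups; E5's expanded
    # 495-configuration grid simply uses longer value lists.

    def enumerate_configs(scope_name):
        """Yield every admissible configuration in the declared scope (full factorial)."""
        factors = SCOPES[scope_name]
        keys = list(factors)
        for values in product(*(factors[key] for key in keys)):
            yield dict(zip(keys, values))

    print(sum(1 for _ in enumerate_configs("compilation_only_exact")))  # 9 configurations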

The strengthening bundle adds:

  • W1: second-family extensions (VQE/H2 pilot and Max-2-SAT/QAOA)
  • W3: stronger metric-centric baselines for RQ1
  • W4: admissibility-study checklist plus human-rating summary scaffold
  • W5: near-boundary policy pack

Headline Findings

  • E1 yields 2 stable, 25 unstable, and 0 inconclusive claim-pair/scope/delta variants.
  • E2 yields 4 stable, 8 inconclusive.
  • E3 yields 4 stable.
  • E4 yields 4 unstable.
  • S2 yields 16 unstable.
  • QEC yields 3 stable, 1 unstable.
  • E5 shows perfect agreement (1.0) for all evaluated policies against full_factorial.

RQ1: Why Claim-Centric Validation Is Needed

E1 is the strongest necessity signal. Fragility is not a rare edge case in the main MaxCut population: almost every comparative claim variant becomes unstable under admissible perturbations.

By scope, the E1 verdict distribution is:

  • compilation_only_exact: 2 stable / 7 unstable
  • sampling_only_exact: 0 stable / 9 unstable
  • combined_light_exact: 0 stable / 9 unstable

By delta, the distribution is:

  • delta = 0.00: 1 stable / 8 unstable
  • delta = 0.01: 1 stable / 8 unstable
  • delta = 0.05: 0 stable / 9 unstable

The strongest baseline mismatch in the current rerun comes from conventional metric reporting rather than from the stored single-run baseline. Under a fixed 5-run metric summary, 9/9 apparently consistent advantages are nonetheless classified as unstable by ClaimStab-QC, yielding a metric-based false-reassurance rate of 1.0.

The representative mismatch case is:

  • claim: QAOA_p2 > QAOA_p1
  • scope: compilation_only_exact
  • delta = 0.05
  • metric view: mean_diff = 0.1190, 95% CI = [0.1059, 0.1321]
  • claim-centric view: stability_hat = 0.7897, 95% CI = [0.7598, 0.8169], verdict = unstable

This is the main empirical reason the project argues for claim-centric validation rather than metric-centric reporting alone.
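
The contrast between the two views can be made precise with a small decision-rule sketch. The function below is an approximation under stated assumptions: the claim predicate is evaluated once per admissible configuration (True when the claimed advantage exceeds delta there), stability_hat is the fraction of configurations in which the claim survives, a normal-approximation 95% interval is used, and the verdict compares that interval against tau = 0.95 as described under RQ2. The function name and the interval choice are illustrative; the artifact's exact estimator may differ.

    import math

    TAU = 0.95  # stability threshold used in the verdict rule

    def stability_verdict(claim_holds, tau=TAU, z=1.96):
        """Classify a claim variant from per-configuration outcomes.

        claim_holds: booleans, one per admissible perturbed configuration,
        True when the claimed relation (e.g. score(QAOA_p2) - score(QAOA_p1) > delta)
        survives in that configuration.
        """
        n = len(claim_holds)
        p_hat = sum(claim_holds) / n
        half = z * math.sqrt(p_hat * (1 - p_hat) / n)  # normal-approximation CI
        lo, hi = max(0.0, p_hat - half), min(1.0, p_hat + half)
        if lo >= tau:
            return p_hat, (lo, hi), "stable"
        if hi < tau:
            return p_hat, (lo, hi), "unstable"
        return p_hat, (lo, hi), "inconclusive"  # CI straddles tau

On the representative case above, stability_hat is about 0.79 with an upper interval bound near 0.82; since even the upper bound sits well below tau = 0.95, the variant is classified unstable under this rule, even though the 5-run metric interval comfortably excludes zero.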

RQ2: Semantic Discrimination Across Claim Families

ClaimStab-QC does not collapse to a trivial all-unstable pattern.

Aggregated over the completed evaluation_v2 runs:

  • ranking: 9 stable / 42 unstable / 8 inconclusive
  • decision: 4 stable / 0 unstable / 0 inconclusive
  • distribution: 0 stable / 4 unstable / 0 inconclusive

Interpretation:

  • E3 is the clearest stable control: all four BV decision variants are stable.
  • E4 is the clearest fragile control: all four Grover distribution variants are unstable.
  • E2 is mixed rather than uniformly stable: it becomes inconclusive at higher deltas because the confidence interval overlaps the tau = 0.95 threshold.
  • S2 does not show abstention in the current rerun; it collapses into direct fragility (16 unstable).
  • QEC is supporting portability evidence only, not a main source of generalization claims.

RQ3: Diagnostic Value

The strongest diagnostic evidence comes from the ranking experiments (E1 and S2).

Dominant perturbation drivers follow the declared scope:

  • in compilation_only_exact, the main drivers are layout_method and seed_transpiler
  • in sampling_only_exact, the main drivers are seed_simulator and shots
  • in combined_light_exact, seed_simulator remains dominant, with compilation factors still visible

Driver explanations are reasonably consistent across neighboring unstable variants:

  • E1 top-driver consistency: 0.8333
  • S2 top-driver consistency: 0.9444
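
One way to read the consistency numbers above is as the share of related unstable variants that point at the same dominant perturbation factor. The sketch below computes per-factor flip rates as a crude driver score and then a top-driver agreement ratio across variants; both definitions are illustrative reconstructions, not the artifact's actual attribution or consistency formulas.

    from collections import Counter

    def top_driver(records):
        """records: list of dicts {factor_name: value, ..., 'claim_holds': bool}.

        Returns the factor whose value changes are most associated with claim flips,
        using a one-factor-at-a-time flip-rate score (illustrative only).
        """
        factors = [key for key in records[0] if key != "claim_holds"]
        scores = {}
        for f in factors:
            flips = pairs = 0
            for a in records:
                for b in records:
                    others_equal = all(a[k] == b[k] for k in factors if k != f)
                    if others_equal and a[f] != b[f]:
                        flips += a["claim_holds"] != b["claim_holds"]
                        pairs += 1
            scores[f] = flips / pairs if pairs else 0.0
        return max(scores, key=scores.get)

    def top_driver_consistency(top_drivers_per_variant):
        """Share of unstable variants whose top driver matches the modal top driver."""
        counts = Counter(top_drivers_per_variant)
        return counts.most_common(1)[0][1] / len(top_drivers_per_variant)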

The current artifact does not materialize exact MOS (minimal sufficient set) objects, so the derived package reports a conservative explanation-compression proxy instead of exact MOS size:

  • E1 median proxy constraint count: 1
  • S2 median proxy constraint count: 1

These are useful as compact explanatory witnesses, but they should not be overstated as exact minimal sufficient sets.

RQ4: Cost-efficiency and Practicality

E5 is now a substantive result rather than a placeholder.

Across the 9 available MaxCut ranking claim-pair/delta variants on the expanded 495-configuration grid:

  • full_factorial: 495 selected configurations
  • random_k_32: 33
  • random_k_64: 65
  • adaptive_ci: 57
  • adaptive_ci_tuned: 17

All five strategies agree with the full_factorial reference on every evaluated variant (agreement = 1.0).

The strongest practical result is therefore:

  • adaptive_ci_tuned preserves all reference decisions at roughly 3% of the full-factorial cost (17 of 495 configurations), as sketched below
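
The cost asymmetry is easiest to see as an early-stopping rule: an adaptive policy keeps sampling admissible configurations only until the stability interval is decisive. The sketch below is an illustrative approximation of that idea, not the actual adaptive_ci_tuned implementation; the batch size, interval choice, and function names are assumptions.

    import math

    def adaptive_stability(draw_outcome, budget, tau=0.95, z=1.96, batch=8):
        """draw_outcome() -> True when the claim survives in one randomly drawn
        admissible configuration. Stops as soon as the CI clears or falls below tau."""
        outcomes = []
        while len(outcomes) < budget:
            outcomes.extend(
                draw_outcome() for _ in range(min(batch, budget - len(outcomes)))
            )
            n = len(outcomes)
            p = sum(outcomes) / n
            half = z * math.sqrt(p * (1 - p) / n)
            if p - half >= tau:
                return "stable", n
            if p + half < tau:
                return "unstable", n
        return "inconclusive", len(outcomes)

Clearly fragile or clearly stable variants resolve after a few dozen draws, which is consistent with adaptive_ci_tuned stopping at 17 configurations here; near-boundary variants, where stability_hat sits close to tau, force the interval to shrink much further before a decision, which is exactly the budget inflation visible in the W5 results below.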

S1 should be described carefully. The current completed output is a backend-conditioned transpile-only structural portability study, not a full noisy-device claim-centric rerun. Within that controlled scope it is fully stable:

  • 90/90 stable rows across five fake IBM backends and two structural metrics
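
For readers who want to reproduce the flavor of S1, the sketch below shows what a transpile-only structural check can look like in Qiskit: the same circuit is compiled against several fake IBM backends under different transpiler seeds, and only structural metrics are recorded, with no noisy execution. It assumes qiskit and qiskit-ibm-runtime are installed; the specific fake backends, the seed list, and the two metrics shown (depth and two-qubit operation count) are illustrative choices, not necessarily the artifact's metric set.

    from qiskit import QuantumCircuit, transpile
    from qiskit_ibm_runtime.fake_provider import FakeLimaV2, FakeManilaV2

    def structural_rows(circuit, backends, seeds=(11, 17, 23)):
        """Compile the circuit against each fake backend and seed, recording only
        structural metrics (no noisy simulation is performed)."""
        rows = []
        for backend in backends:
            for seed in seeds:
                compiled = transpile(circuit, backend=backend, seed_transpiler=seed)
                rows.append({
                    "backend": backend.name,
                    "seed_transpiler": seed,
                    "depth": compiled.depth(),
                    "two_qubit_ops": sum(
                        1 for inst in compiled.data if inst.operation.num_qubits == 2
                    ),
                })
        return rows

    ghz = QuantumCircuit(3)
    ghz.h(0); ghz.cx(0, 1); ghz.cx(1, 2)
    for row in structural_rows(ghz, [FakeManilaV2(), FakeLimaV2()]):
        print(row)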

Strengthening Additions (evaluation_v3)

  • W1 VQE/H2 pilot: 15 stable / 2 unstable / 1 inconclusive
  • W1 Max-2-SAT: 13 stable / 4 unstable / 1 inconclusive
  • W3 matched-scope metric baseline: 9/9 metric-supportive E1 variants remain cases of false reassurance
  • W3 sensitivity: the metric false-reassurance rate stays at 1.0 from 10 through 495 sampled configurations on the expanded grid
  • W5 near-boundary: adaptive policies remain correct but consume much more budget (adaptive_ci: 57 -> 257; adaptive_ci_tuned: 17 -> 65)
  • W4: the repository now includes an 18-item admissibility checklist with author-side reference labels and explicit Q1/Q2/Q3 trigger rules for borderline cases such as noise scaling and 10x shot budgets, plus a human-rating summary pipeline

The default repository state intentionally does not report a submission-facing kappa value. Inter-rater agreement should only be reported after collecting real external ratings; otherwise W4 should be described as a checklist and analysis scaffold, not as completed human-subject evidence.
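
If external ratings are eventually collected, the human-rating summary can report agreement with a standard statistic such as Cohen's kappa. The snippet below is only a sketch of that computation, assuming scikit-learn is available; the two rating vectors are placeholder arrays standing in for the author-side reference labels and a hypothetical external rater, not collected data.

    from sklearn.metrics import cohen_kappa_score

    # Placeholder labels for the 18 checklist items (1 = admissible, 0 = not):
    # these are NOT collected ratings, just a shape example.
    author_reference = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1]
    external_rater   = [1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1]

    print(f"Cohen's kappa = {cohen_kappa_score(author_reference, external_rater):.2f}")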

Scope Caveat

Conditional robustness is not the same as correctness.

All stable, unstable, and inconclusive outcomes reported here are relative to:

  • the formalized claim specification
  • the declared perturbation scope
  • the configured decision threshold and confidence rule

This is especially important for the MaxCut ranking results: a stable verdict means the claim relation is robust under the declared scope, not that the natural-language conclusion has been universally proven true.

Artifact Entry Points

  • core summary root: output/paper/evaluation_v2/README.md
  • strengthening summary root: output/paper/evaluation_v3/README.md
  • core raw runs: output/paper/evaluation_v2/runs/
  • strengthening runs: output/paper/evaluation_v3/runs/
  • core derived tables: output/paper/evaluation_v2/derived_paper_evaluation/
  • strengthening derived tables: output/paper/evaluation_v3/derived_paper_evaluation/
  • strengthening figure pack: output/paper/evaluation_v3/pack/figures/