Evaluation Readiness (evaluation_v2)
Updated: 2026-03-19
Active Bundle Status
The active paper-facing bundle is:
output/paper/evaluation_v2/
This bundle supersedes older output roots such as:
- output/presentations/large/
- output/paper/artifact/
- output/paper/pack/
- output/paper/multidevice/
Completed Runs
The current bundle contains completed outputs for:
- E1: main MaxCut ranking battleground
- E2: GHZ structural calibration
- E3: BV decision calibration
- E4: Grover distribution fragility case
- E5: policy comparison on the expanded 495-configuration grid
- S1: backend-conditioned transpile-only structural portability
- S2: boundary stress
- QEC: repetition-code portability illustration
Current Counts
- E1: 25 unstable, 2 stable, 0 inconclusive
- E2: 4 stable, 8 inconclusive
- E3: 4 stable
- E4: 4 unstable
- E5: all evaluated policies match full_factorial (agreement = 1.0)
- S1: 90/90 stable
- S2: 16 unstable
- QEC: 3 stable, 1 unstable
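As a quick sanity check when quoting these numbers in prose, the counts can be tallied into per-experiment totals and unstable fractions. The sketch below is illustrative only: the dictionary is transcribed by hand from the list above, the names `COUNTS` and `summarize` are hypothetical, and it assumes that any verdict category not listed for an experiment is zero. E5 is left out because it is reported as policy agreement rather than as per-claim verdicts.

```python
# Verdict counts transcribed from the "Current Counts" list.
# Assumption: a verdict category that is not listed for an experiment is 0.
COUNTS = {
    "E1": {"unstable": 25, "stable": 2, "inconclusive": 0},
    "E2": {"stable": 4, "inconclusive": 8},
    "E3": {"stable": 4},
    "E4": {"unstable": 4},
    "S1": {"stable": 90},
    "S2": {"unstable": 16},
    "QEC": {"stable": 3, "unstable": 1},
}


def summarize(counts):
    """Return the total and the unstable fraction for each experiment."""
    rows = {}
    for name, verdicts in counts.items():
        total = sum(verdicts.values())
        rows[name] = {
            "total": total,
            "unstable_fraction": verdicts.get("unstable", 0) / total,
        }
    return rows


summary = summarize(COUNTS)
for name, row in summary.items():
    print(f"{name}: n={row['total']}, unstable={row['unstable_fraction']:.2f}")
```

Under these assumptions E1 covers 27 claims, of which 25 are unstable, which is the figure behind the fragility-prevalence statement.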
Readiness Assessment
The current bundle is strong enough to support the main ICSE-style narrative, provided the writing reflects the actual rerun rather than the earlier plan.
What is strong now:
- E1 clearly supports the main fragility-prevalence claim.
- The metric-based baseline mismatch is strong (9/9 false reassurance under the fixed 5-run metric summary).
- Claim-family discrimination is visible across ranking, decision, and distribution claims.
- E5 now supports a real cost/agreement tradeoff claim.
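This note does not spell out how the E5 agreement score is computed. One plausible reading, shown here purely as an assumption, is the fraction of claims for which a cheaper policy returns the same verdict as full_factorial. The function name, the claim identifiers, and the example verdicts below are all made up for illustration and are not taken from the bundle.

```python
def agreement(policy_verdicts, reference_verdicts):
    """Fraction of reference claims where the policy returns the same verdict.

    Assumed definition; the bundle's actual agreement metric may differ.
    """
    if not reference_verdicts:
        raise ValueError("reference_verdicts must not be empty")
    matches = sum(
        1
        for claim_id, verdict in reference_verdicts.items()
        if policy_verdicts.get(claim_id) == verdict
    )
    return matches / len(reference_verdicts)


# Made-up verdicts for illustration only.
reference = {"c1": "stable", "c2": "unstable", "c3": "inconclusive"}
same = dict(reference)
differs = {"c1": "stable", "c2": "stable", "c3": "inconclusive"}

print(agreement(same, reference))
print(agreement(differs, reference))
```

Under this reading, an agreement of 1.0 means a policy never disagrees with full_factorial on any evaluated claim, so the tradeoff claim rests entirely on the cost side.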
What must still be written carefully:
- E2 is mixed (stable + inconclusive), not a pure stable control.
- S2 became direct fragility rather than abstention.
- S1 is a controlled transpile-only structural portability result, not a full noisy-device portability study.
- Exact MOS objects are not materialized; the current diagnostics page uses a conservative proxy.
Main Figure Mapping
The active publication-facing figure set is:
- output/paper/evaluation_v2/pack/figures/main/fig1_stability_profile.*
- output/paper/evaluation_v2/pack/figures/main/fig2_robustness_cells_by_delta.*
- output/paper/evaluation_v2/pack/figures/main/fig3_claim_distribution.*
- output/paper/evaluation_v2/pack/figures/main/fig4_e1_prevalence_by_scope.*
- output/paper/evaluation_v2/pack/figures/main/fig5_claim_metric_mismatch.*
- output/paper/evaluation_v2/pack/figures/main/fig6_claim_family_verdicts.*
- output/paper/evaluation_v2/pack/figures/main/fig_rq4_ci_width_vs_cost.*
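Before citing figures in the paper, it can help to confirm that each expected stem has at least one rendered file on disk. The following is a minimal sketch, assuming it is run from the repository root; the helper name `missing_figures` is hypothetical, and the stems are copied from the list above. Because the listing uses a `.*` suffix, the check accepts any file extension.

```python
from pathlib import Path

# Directory and stems taken from the figure list above.
FIGURE_DIR = Path("output/paper/evaluation_v2/pack/figures/main")
EXPECTED_STEMS = [
    "fig1_stability_profile",
    "fig2_robustness_cells_by_delta",
    "fig3_claim_distribution",
    "fig4_e1_prevalence_by_scope",
    "fig5_claim_metric_mismatch",
    "fig6_claim_family_verdicts",
    "fig_rq4_ci_width_vs_cost",
]


def missing_figures(figure_dir, stems):
    """Return the stems that have no `<stem>.*` file in figure_dir."""
    figure_dir = Path(figure_dir)
    if not figure_dir.is_dir():
        # A missing directory means every expected figure is missing.
        return list(stems)
    return [stem for stem in stems if not any(figure_dir.glob(f"{stem}.*"))]


missing = missing_figures(FIGURE_DIR, EXPECTED_STEMS)
print("missing:", missing if missing else "none")
```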
Supporting figures are staged under:
output/paper/evaluation_v2/pack/figures/appendix/
Derived Tables and Narrative Support
Paper-facing derived outputs live under:
output/paper/evaluation_v2/derived_paper_evaluation/
Most important subdirectories:
- RQ1_necessity/
- RQ2_semantics/
- RQ3_diagnostics/
- RQ4_practicality/
The latest prose-ready summary is:
output/paper/evaluation_v2/derived_paper_evaluation/results_draft.md
Bottom Line
The repository is no longer in a “frozen old matrix” state. It now has an active, coherent evaluation_v2 bundle with updated figures, derived tables, and a consistent output structure.
The main remaining work is no longer experiment execution; it is accurate public narration and deployment of the updated docs/pages.