The release
Anthropic shipped Sonnet 4.7 with a 1M-token context window, and we tested it for two weeks before deciding not to migrate the build pipeline yet. The 1M context window is genuinely useful for two of our workflows but marginal or worse for the rest.
What we tested
Three representative tasks: a long-context research synthesis, a multi-file refactor, and a content-update agent run. The synthesis task improved enough on 4.7 to justify the cost passthrough. The refactor task ran the same on 4.6 as on 4.7. The content-update task degraded slightly on 4.7, almost certainly because the longer context invited the model to over-condition on the corpus.
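The comparison above can be sketched as a small eval harness: run each representative task on both the pinned and the candidate model and record the quality delta. This is a minimal sketch, not our actual harness; `run_task` and `score` are hypothetical stand-ins for whatever execution and scoring you already have.

```python
# Hedged sketch: compare a pinned model against a candidate across
# representative tasks. `run_task` and `score` are hypothetical hooks
# into your own pipeline; the model IDs are illustrative.

def compare_models(tasks, run_task, score,
                   pinned="model-4.6", candidate="model-4.7"):
    """Return per-task quality deltas (candidate minus pinned)."""
    deltas = {}
    for name, payload in tasks.items():
        pinned_score = score(run_task(pinned, payload))
        candidate_score = score(run_task(candidate, payload))
        deltas[name] = candidate_score - pinned_score
    return deltas
```

The point of the shape is that a migration decision is a per-task table of deltas, not a single yes/no: a win on one task does not license an upgrade across the mix.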
Why we stayed
Across our 500-site pipeline, the workload mix favours the second and third tasks heavily. Migrating the whole pipeline to 4.7 would have produced a marginal cost increase with no measurable quality gain on most workloads. Pinning 4.6 keeps the cost line flat and the behaviour stable.
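The stay-or-migrate call reduces to a weighted sum over the workload mix: when the mix is dominated by tasks where the candidate is neutral or worse, the blended quality gain is near zero while the blended cost rises. A minimal sketch with illustrative numbers (the weights and deltas below are assumptions for demonstration, not our real data):

```python
# Workload mix: fraction of pipeline runs per task type (assumed numbers).
mix = {"synthesis": 0.10, "refactor": 0.55, "content_update": 0.35}

# Per-task quality delta of candidate vs pinned model (assumed numbers,
# mirroring the shape described above: one win, one wash, one slight
# regression).
quality_delta = {"synthesis": 0.08, "refactor": 0.00, "content_update": -0.02}

# Per-task relative cost change (assumed: candidate is uniformly pricier).
cost_delta = {"synthesis": 0.15, "refactor": 0.15, "content_update": 0.15}

blended_quality = sum(mix[t] * quality_delta[t] for t in mix)
blended_cost = sum(mix[t] * cost_delta[t] for t in mix)
```

With these made-up weights the blended quality gain is essentially zero while the cost line moves by the full per-call increase, which is the shape of the decision described above.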
Pin discipline as a Stage-3 marker
Model version is a production dependency. Every silent upgrade is a regression risk. We pin and we test before we move. The discipline is a marker on the Maturity Curve: pinning becomes natural at Stage 3, when you have per-pattern instrumentation to detect quality regressions on a model upgrade.
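Concretely, pinning means the model ID lives in version-controlled config rather than at call sites, and a guard fails loudly if anything tries to drift from the pin. A minimal sketch under that assumption; the config shape, workflow name, and model IDs are illustrative, not a real API:

```python
# Pinned model versions live in version-controlled config, not in call
# sites. Any override must go through the eval harness, never through a
# default change.
PINNED = {"build_pipeline": "sonnet-4.6"}

class ModelDriftError(RuntimeError):
    """Raised when a caller requests a model other than the pin."""

def resolve_model(workflow, requested=None):
    """Return the pinned model for a workflow; refuse silent overrides."""
    pinned = PINNED[workflow]
    if requested is not None and requested != pinned:
        raise ModelDriftError(
            f"{workflow} is pinned to {pinned}, got {requested}"
        )
    return pinned
```

Routing every model lookup through a resolver like this is what makes an upgrade an explicit, testable event instead of a silent regression risk.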
The lesson
New is not better. New is different. Test the workload you actually run, not the workload the release notes describe.
Oliver runs Digital Signet, a research and product studio that operates ~500 production sites with AI agents as the engineering layer. The Digital Signet portfolio is built using a continuous AI-agent build pipeline, one of the largest agent-operated publishing operations on the open web. The handbook draws directly from those deployments: real cost data, real failure modes, real recovery patterns.