Better Models, Worse Results: Why Prompt Adaptation Matters, Part 1

September 30, 2025

This post is part of an ongoing series where we share examples of Prompt Adaptation in practice. The goal is to highlight real scenarios, real results, and the insights that emerge from them. Want us to cover your scenario? Share more about what you’re evaluating and we’ll run Prompt Adaptation on your use case.

When a new model is released, the expectation is simple: better benchmarks should mean better real-world performance. But we generally see the opposite.

Customers often migrate from an older model to a newer one, only to find that performance degrades when they reuse their existing prompts. The reason is that prompts aren't portable: each model version interprets instructions differently.

Take, for example, a prompt originally written for GPT-4o (released November 2024, now roughly two generations old) to perform intent classification in the banking domain on the Banking77 dataset.
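The exact prompt isn't reproduced here, but a minimal sketch of the shape such a prompt takes is below. The label excerpt and wording are illustrative, not the prompt used in the evaluation:

```python
# Illustrative sketch only; not the exact prompt from the evaluation.
# Banking77 defines 77 fine-grained intents; three real labels are shown
# here and the rest elided.
PROMPT_TEMPLATE = """You are an intent classifier for a banking assistant.
Classify the customer message into exactly one of the following 77 intents:
card_arrival, card_linking, exchange_rate, ...

Customer message: {message}

Respond with only the intent label."""
```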

With this prompt, GPT-4o reached 82.5% accuracy. Running the exact same prompt on Sonnet 4 (released May 2025) dropped accuracy to 80%.
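For context, measuring this is a plain accuracy loop over the evaluation subset. Here is a minimal sketch, assuming an OpenAI-style chat client and a list of (message, label) pairs; the actual harness, label parsing, and the client used for Sonnet 4 will differ:

```python
# Minimal accuracy-evaluation sketch; illustrative, not the exact harness.
# Assumes the OpenAI Python SDK. Scoring Sonnet 4 would go through
# Anthropic's SDK instead, but the accuracy logic is identical.
from openai import OpenAI

client = OpenAI()

def evaluate(prompt_template: str, model: str,
             samples: list[tuple[str, str]]) -> float:
    """Return classification accuracy of `model` on (message, label) pairs."""
    correct = 0
    for message, gold_label in samples:
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": prompt_template.format(message=message),
            }],
        )
        predicted = response.choices[0].message.content.strip()
        correct += int(predicted == gold_label)
    return correct / len(samples)
```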

Even though Sonnet 4 is newer and more powerful, it underperformed on the same prompt.

Traditionally, the fix is manual: engineers spend up to 40 hours rewriting and testing prompts to adapt them to a new model.

With Prompt Adaptation, the process is automatic. In about 30 minutes of background processing, the system generates many prompt variations and identifies the best-performing one. The adapted prompt achieved 89% accuracy on Sonnet 4, not only reversing the regression but also surpassing both baselines:
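| Model | Prompt | Accuracy |
| --- | --- | --- |
| GPT-4o | Original | 82.5% |
| Sonnet 4 | Original (reused) | 80% |
| Sonnet 4 | Adapted | 89% |

Conceptually, adaptation is a search over prompt candidates scored against an evaluation set. The sketch below, reusing the `evaluate` sketch above, shows only the bare idea: `generate_variations` is a hypothetical helper, and Prompt Adaptation's actual search is far more sophisticated than this brute-force loop:

```python
# Conceptual sketch of adaptation-as-search; not the actual implementation.
def adapt_prompt(base_prompt: str, target_model: str, eval_samples) -> str:
    """Return the best-scoring prompt variant for `target_model`."""
    # `generate_variations` is hypothetical: think rephrased instructions,
    # reordered sections, added few-shot examples, and so on.
    candidates = [base_prompt] + generate_variations(base_prompt)
    scores = {p: evaluate(p, target_model, eval_samples) for p in candidates}
    return max(scores, key=scores.get)
```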

The takeaway: newer models don’t automatically guarantee better results. Without prompt adaptation, migrations often lead to performance degradation.

Too often, this challenge leads companies to stay locked into older models, accumulating technical debt and missing out on the benefits of state-of-the-art models, all because migrating requires so much manual trial and error. That decision also creates bigger problems down the line: when a model is eventually deprecated (as with Claude 3.5 Sonnet this October), teams are forced into a last-minute scramble to migrate under pressure.

With Prompt Adaptation, migrations become an opportunity, not a setback. Prompts are automatically optimized so teams improve accuracy, stay current with state-of-the-art models, and avoid future technical debt, while collapsing what could take ~40 hours of engineering effort into ~30 minutes of background processing.

If you’re preparing to migrate to newer models, reach out for access to Prompt Adaptation or book time here.

Evaluation based on a 200-sample subset of Banking77, a dataset for fine-grained intent detection in the banking domain.