Better Models, Worse Results: Why Prompt Adaptation Matters, Part 1

September 30, 2025

This post is part of an ongoing series where we share examples of Prompt Adaptation in practice. The goal is to highlight real scenarios, real results, and the insights that emerge from them. Want us to cover your scenario? Share more about what you’re evaluating and we’ll run Prompt Adaptation on your use case.

When a new model is released, the expectation is simple: better benchmarks should mean better real-world performance. But we generally see the opposite.

Customers often migrate from an older model to a newer one, only to find that performance degrades when they reuse their existing prompts. The reason is that prompts aren't portable: each model version interprets instructions differently.

Take, for example, a prompt originally written for GPT-4o (released November 2024, now roughly two generations old) to perform intent classification in the banking domain on the Banking77 dataset.
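The exact prompt isn't reproduced here, but a minimal sketch of the shape such a prompt takes is below. The label excerpt and wording are illustrative, not the prompt used in the evaluation:

```python
# Illustrative sketch only; not the exact prompt from the evaluation.
# Banking77 defines 77 fine-grained intents; three real labels are shown
# here and the rest elided.
PROMPT_TEMPLATE = """You are an intent classifier for a banking assistant.
Classify the customer message into exactly one of the following 77 intents:
card_arrival, card_linking, exchange_rate, ...

Customer message: {message}

Respond with only the intent label."""
```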

With this prompt, GPT-4o reached 82.5% accuracy. Running the exact same prompt on Sonnet 4 (released May 2025) dropped accuracy to 80%.
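For context, measuring this is a plain accuracy loop over the evaluation subset. Here is a minimal sketch, assuming an OpenAI-style chat client and a list of (message, label) pairs; the actual harness, label parsing, and the client used for Sonnet 4 will differ:

```python
# Minimal accuracy-evaluation sketch; illustrative, not the exact harness.
# Assumes the OpenAI Python SDK. Scoring Sonnet 4 would go through
# Anthropic's SDK instead, but the accuracy logic is identical.
from openai import OpenAI

client = OpenAI()

def evaluate(prompt_template: str, model: str,
             samples: list[tuple[str, str]]) -> float:
    """Return classification accuracy of `model` on (message, label) pairs."""
    correct = 0
    for message, gold_label in samples:
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": prompt_template.format(message=message),
            }],
        )
        predicted = response.choices[0].message.content.strip()
        correct += int(predicted == gold_label)
    return correct / len(samples)
```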

Even though Sonnet 4 is newer and more powerful, it underperformed on the same prompt.

Traditionally, the fix is manual: engineers spend up to 40 hours rewriting and testing prompts to adapt them to a new model.

With Prompt Adaptation, the process is automatic. In about 30 minutes of background processing, the system generates many prompt variations and identifies the best-performing one. The adapted prompt achieved 89% accuracy on Sonnet 4, not only reversing the regression but also surpassing both baselines:
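| Model | Prompt | Accuracy |
| --- | --- | --- |
| GPT-4o | Original | 82.5% |
| Sonnet 4 | Original (reused) | 80% |
| Sonnet 4 | Adapted | 89% |

Conceptually, adaptation is a search over prompt candidates scored against an evaluation set. The sketch below, reusing the `evaluate` sketch above, shows only the bare idea: `generate_variations` is a hypothetical helper, and Prompt Adaptation's actual search is far more sophisticated than this brute-force loop:

```python
# Conceptual sketch of adaptation-as-search; not the actual implementation.
def adapt_prompt(base_prompt: str, target_model: str, eval_samples) -> str:
    """Return the best-scoring prompt variant for `target_model`."""
    # `generate_variations` is hypothetical: think rephrased instructions,
    # reordered sections, added few-shot examples, and so on.
    candidates = [base_prompt] + generate_variations(base_prompt)
    scores = {p: evaluate(p, target_model, eval_samples) for p in candidates}
    return max(scores, key=scores.get)
```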

The takeaway: newer models don’t automatically guarantee better results. Without prompt adaptation, migrations often lead to performance degradation.

Too often, this challenge leads companies to stay locked into older models, accumulating technical debt and missing out on the benefits of state-of-the-art models, all because migrating requires so much manual trial and error. That decision also creates bigger problems down the line: when a model is eventually deprecated (as with Claude 3.5 Sonnet this October), teams are forced into a last-minute scramble to migrate under pressure.

With Prompt Adaptation, migrations become an opportunity, not a setback. Prompts are automatically optimized so teams improve accuracy, stay current with state-of-the-art models, and avoid future technical debt, while collapsing what could take ~40 hours of engineering effort into ~30 minutes of background processing.

If you’re preparing to migrate to newer models, reach out for access to Prompt Adaptation or book time here.

Evaluation based on a 200-sample subset of Banking77, a dataset for fine-grained intent detection in the banking domain.