Weaker Models, Stronger Results: How Prompt Adaptation Levels the Field, Part 2

October 7, 2025

This post is part of an ongoing series where we share examples of Prompt Adaptation in practice. The goal is to highlight real scenarios, real results, and the insights that emerge from them. Want us to cover your scenario? Share more about what you’re evaluating and we’ll run Prompt Adaptation on your use case.

When comparing a premium model to a lighter or faster one, the tradeoff is usually clear: the premium model delivers higher accuracy but is slower and more expensive, while the lighter one is faster and cheaper but sacrifices some performance.

Prompt Adaptation levels the field. Optimized prompts enable smaller models to reach or even exceed the performance of larger ones at lower cost and latency.

We tested this on clinc150, a public dataset for evaluating how well models classify user intents in conversational assistants, including their ability to handle out-of-scope queries.
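To make the setup concrete, here is a minimal sketch of an evaluation loop of this kind. It assumes the Hugging Face datasets copy of clinc150 (published as clinc_oos, with text/intent fields) and a hypothetical classify_intent() helper that sends the prompt plus an utterance to the model and returns an intent label; it is an illustration of the general procedure, not the exact harness we ran.

```python
# Minimal sketch of an intent-classification eval on clinc150.
# Assumes the Hugging Face "clinc_oos" dataset and a user-supplied
# classify_intent(prompt, utterance) helper (e.g. a Gemini API call).
import random
from datasets import load_dataset

def evaluate(prompt: str, classify_intent, n_samples: int = 200, seed: int = 0) -> float:
    """Accuracy of classify_intent(prompt, utterance) on a random test subset."""
    ds = load_dataset("clinc_oos", "plus", split="test")
    label_names = ds.features["intent"].names   # 150 in-scope intents plus "oos"
    random.seed(seed)
    indices = random.sample(range(len(ds)), n_samples)
    correct = 0
    for i in indices:
        utterance = ds[i]["text"]
        gold = label_names[ds[i]["intent"]]
        pred = classify_intent(prompt, utterance)  # hypothetical model call
        correct += int(pred.strip().lower() == gold)
    return correct / n_samples
```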

A prompt originally written for Gemini 2.5 Pro (a stronger model) scored 93% accuracy on the evaluation. When we ran the same prompt on Gemini 2.5 Flash (a weaker, faster, cheaper model), accuracy dropped to 86.75%.

This regression is expected; smaller models are often dismissed outright on the assumption that they won't perform as well out of the box. But with Prompt Adaptation, we optimized the prompt for Gemini 2.5 Flash in ~30 minutes of background processing.

Results

  • Gemini 2.5 Pro (original prompt): 93%
  • Gemini 2.5 Flash (original prompt): 86.75%
  • Gemini 2.5 Flash (adapted prompt): 97.5%

The adapted prompt on Gemini 2.5 Flash didn't just close the gap: it outperformed the stronger Gemini 2.5 Pro baseline by 4.5 percentage points.

Why it matters: Manually rewriting and testing prompts for smaller models is time-consuming, and there's no guarantee the effort will pay off. Prompt Adaptation automates that process, surfacing optimized prompts that make smaller models competitive with premium ones, often matching or even outperforming them.

Optimizing prompts for weaker models is especially important in applications built from multiple chained prompts, where cost and latency compound at every step. Prompt Adaptation reduces both while maintaining or even improving accuracy.
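As a rough illustration of that compounding, per-call savings multiply across a chain. The figures below are hypothetical placeholders, not measured values.

```python
# Back-of-the-envelope: per-call savings multiply across a chained pipeline.
# All figures are hypothetical placeholders, not measured values.
steps = 4                                  # prompts chained per user request
latency_pro_s, latency_flash_s = 2.0, 0.8  # seconds per call
cost_pro, cost_flash = 0.010, 0.002        # dollars per call

print(f"latency per request: {steps * latency_pro_s:.1f}s -> {steps * latency_flash_s:.1f}s")
print(f"cost per request:    ${steps * cost_pro:.3f} -> ${steps * cost_flash:.3f}")
```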

If you’re evaluating models and want to explore this tradeoff, reach out for access or book time here.

Evaluation based on a 200-sample subset of clinc150, a conversational intent classification dataset spanning 150 intents across 10 domains plus out-of-scope queries.