The Open-Source Shift: Removing Friction with Prompt Adaptation, Part 3
December 3, 2025
This post is part of an ongoing series where we share examples of Prompt Adaptation in practice. The goal is to highlight real scenarios, real results, and the insights that emerge from them. Want us to cover your scenario? Share more about what you’re evaluating and we’ll run Prompt Adaptation on your use case.
Why Adoption Still Lags
Open-source models like Kimi K2 Thinking are improving at an extraordinary pace and, in many benchmarks, are now competitive with premium closed models like Claude 4.5 Sonnet, GPT-5.1, and Gemini 3 Pro.
Across the teams we work with, from fast-moving startups to global enterprises, we see a consistent pattern:
Even when an open-source model performs extremely well, adoption usually lags behind availability.
This is not about model quality or provider readiness.
It reflects how engineering teams typically operate:
rewriting prompts is time-consuming
validating a new model requires real workflow testing
production systems have a high bar for change
bandwidth is limited, especially when new models release every week
As a result, companies often stay on their existing proprietary models longer than they want to. The issue is not whether open-source is ready. The issue is the migration overhead.
Prompt Adaptation solves that problem.
Prompt Portability Is the Real Bottleneck
As shown in Part 1 and Part 2, the biggest challenge in model migration is that prompts rarely transfer cleanly, regardless of whether the migration is closed to open, open to open, or closed to closed.
A prompt that performs well on Claude 4.5 Sonnet will often underperform when used directly with a model like Kimi K2 Thinking. In practice, we see:
significant drops in accuracy
different interpretations of constraints
changes in formatting expectations
differences in reasoning structure and verbosity
Manually resolving these issues often takes between 8 and 40 hours per prompt. For teams with multiple workflows, the cost compounds quickly and becomes the primary reason migrations get delayed.
Prompt Adaptation removes this friction entirely.
A Real Migration: Claude 4.5 Sonnet to Kimi K2 Thinking
For this example, we evaluated a 200-sample subset of HotpotQA, a question answering dataset that requires multi-hop reasoning across multiple pieces of evidence. It is designed to test whether models can combine information from different sources to reach a correct answer, which makes it a strong benchmark for real-world reasoning tasks.
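For readers who want to set up something similar, here is a minimal sketch of drawing a fixed 200-example subset of HotpotQA with the Hugging Face datasets library. The split, seed, and sampling strategy are illustrative assumptions, not necessarily the exact subset used in this evaluation.

```python
# Minimal sketch: sampling a fixed 200-example subset of HotpotQA.
# The split, seed, and sampling strategy are assumptions for illustration;
# they may differ from the subset used in this post.
from datasets import load_dataset

hotpot = load_dataset("hotpot_qa", "distractor", split="validation")
subset = hotpot.shuffle(seed=42).select(range(200))

# Each example carries a question, a gold answer, and the supporting context
# paragraphs the model must combine to answer correctly.
print(subset[0]["question"])
print(subset[0]["answer"])
```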
The prompt used in this evaluation was originally written and tuned for Claude 4.5 Sonnet.
Original prompt
We tested three conditions (a minimal harness sketch follows the list):
Claude 4.5 Sonnet with the original prompt
Kimi K2 Thinking with the same prompt
Kimi K2 Thinking with a prompt optimized by Prompt Adaptation
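Below is a rough sketch of what such a three-condition comparison can look like, reusing the `subset` drawn in the earlier snippet. The model identifiers, prompt files, exact-match scoring, and the assumption of a single OpenAI-compatible gateway serving both models are all illustrative; this is not our internal harness.

```python
# Illustrative three-condition comparison. Model names, prompt files, and the
# exact-match scorer are assumptions; the real evaluation may differ.
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible gateway routing to both models

CONDITIONS = [
    ("claude-4.5-sonnet", "original_prompt.txt"),  # baseline
    ("kimi-k2-thinking",  "original_prompt.txt"),  # same prompt, new model
    ("kimi-k2-thinking",  "adapted_prompt.txt"),   # Prompt Adaptation output
]

def build_context(example: dict) -> str:
    """Flatten HotpotQA's supporting paragraphs into plain text."""
    titles = example["context"]["title"]
    sentences = example["context"]["sentences"]
    return "\n\n".join(f"{t}: {' '.join(s)}" for t, s in zip(titles, sentences))

def accuracy(model: str, prompt_path: str, samples) -> float:
    """Exact-match accuracy of `model` using the given system prompt."""
    system_prompt = open(prompt_path).read()
    correct = 0
    for ex in samples:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user",
                 "content": f"{build_context(ex)}\n\nQuestion: {ex['question']}"},
            ],
        )
        prediction = response.choices[0].message.content.strip()
        correct += int(prediction.lower() == ex["answer"].lower())
    return correct / len(samples)

for model, prompt_path in CONDITIONS:
    print(model, prompt_path, accuracy(model, prompt_path, subset))
```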
Results
Claude 4.5 Sonnet (original prompt): 82%
Kimi K2 Thinking (original prompt): 49.5%
Kimi K2 Thinking (adapted prompt): 87%
The raw drop is substantial: 32.5 points with no changes to the prompt. Even though Kimi K2 Thinking is a highly capable model, prompts tuned for Claude often do not transfer cleanly, especially on tasks that require structured or multi-step reasoning.
Prompt Adaptation closed that gap and exceeded the original Claude baseline, turning a difficult migration into a measurable improvement.
Adapted prompt
This pattern appears across many model pairs:
models interpret instructions differently, so prompts rarely transfer one-to-one
even newer or stronger models often underperform when used with prompts tuned for other models
Prompt Adaptation consistently reverses regressions and often outperforms both baselines
Why This Matters for Open-Source Adoption
The open-source ecosystem is accelerating quickly:
models like Kimi K2 Thinking, DeepSeek R1, and GLM-4.6 are becoming extremely strong
inference providers are offering competitive pricing and rapid improvements
enterprises want flexibility and need to avoid single-provider dependence for risk, compliance, and reliability reasons
Yet the practical step of migrating prompts remains one of the biggest blockers to adoption.
Prompt Adaptation reduces the migration overhead from days to minutes. It allows teams to:
test new open-source models immediately
quantify quality differences objectively
avoid manually rewriting dozens of prompts
adopt the right model for each workflow
keep up with a rapidly evolving model landscape
This is what makes open-source models not only competitive on paper but viable for production systems at scale.
If you’re evaluating models and want to explore this tradeoff, reach out for access or book time here.
Evaluation based on a 200-sample subset of HotpotQA, a multi-hop question answering dataset designed to test a model’s ability to retrieve and integrate supporting facts across multiple documents.