The open-source shift: Removing friction with prompt adaptation, part 3

Prompt optimization improves Kimi K2 Thinking from 49.5% to 87%, outperforming the 82% Claude 4.5 baseline

Why Adoption Still Lags

Open-source models like Kimi K2 Thinking are improving at an extraordinary pace and, in many benchmarks, are now competitive with premium closed models like Claude 4.5 Sonnet, GPT-5.1, and Gemini 3 Pro.

Across the teams we work with, from fast-moving startups to global enterprises, we see a consistent pattern: Even when an open-source model performs extremely well, adoption usually lags behind availability.

This is not about model quality or provider readiness.

It reflects how engineering teams typically operate:

  • rewriting prompts is time-consuming
  • validating a new model requires real workflow testing
  • production systems have a high bar for change
  • bandwidth is limited, especially when new models release every week

As a result, companies often stay on their existing proprietary models longer than they want to. The issue is not whether open-source is ready. The issue is the migration overhead.

Prompt Adaptation solves that problem.

Prompt Portability Is the Real Bottleneck

As shown in Part 1 and Part 2, the biggest challenge in model migration is that prompts rarely transfer cleanly, regardless of whether the migration is closed to open, open to open, or closed to closed.

A prompt that performs well on Claude 4.5 Sonnet will often underperform when used directly with a model like Kimi K2 Thinking. In practice, we see:

  • accuracy drops significantly
  • different interpretations of constraints
  • changes in formatting expectations
  • differences in reasoning structure and verbosity

Manually resolving these issues often takes between 8 and 40 hours per prompt. For teams with multiple workflows, the cost compounds quickly and becomes the primary reason migrations get delayed.

Prompt Adaptation removes this friction entirely.

A Real Migration: Claude 4.5 Sonnet to Kimi K2 Thinking

For this example, we evaluated a 200-sample subset of HotpotQA, a question answering dataset that requires multi-hop reasoning across multiple pieces of evidence. It is designed to test whether models can combine information from different sources to reach a correct answer, which makes it a strong benchmark for real-world reasoning tasks.

The prompt used in this evaluation was originally written and tuned for Claude 4.5 Sonnet. Original prompt:

System prompt
You are an AI assistant specializing in Question Answering. Your task is to answer questions based solely on a provided text.
User message template
Role and Objective
You are an expert Question Answering AI. Your objective is to answer questions based only on the information provided in the {context}.
Instructions
- Answer the user's question using only the information provided in the {context}.
- Your answer must be direct and concise. Do not include introductory phrases, conversational filler, or information that does not directly answer the question.
- Pay extremely close attention to dates, timeframes, and any temporal constraints mentioned in the question. Ensure the information you extract from the context precisely matches these requirements.
- If a question asks for a person's name, provide the full name mentioned in the context.
- If the answer cannot be found in the context after careful analysis, you must follow the output format for unanswerable questions.
Reasoning Steps
Follow these steps to generate your response:
1.Deconstruct the Question: Break down the question to understand exactly what is being asked.
2.Scan and Locate: Search the {context} for keywords, entities, and relationships relevant to the question.
3.Synthesize Evidence: Combine the relevant pieces of information found in the context to formulate a preliminary answer.
4.Verify and Refine: Ensure every part of your answer is explicitly supported by the text. If the text does not contain the answer, acknowledge this.
5.Format Output:Structure your internal reasoning and final answer according to the specified output format.
Now, follow all instructions and generate your response. Analyze the {context} to answer the {question}.

We tested three conditions:

  1. Claude 4.5 Sonnet with the original prompt
  2. Kimi K2 Thinking with the same prompt
  3. Kimi K2 Thinking with a prompt optimized by Prompt Adaptation

Results

  • Claude 4.5 Sonnet (original prompt): 82%
  • Kimi K2 Thinking (original prompt): 49.5%
  • Kimi K2 Thinking (adapted prompt): 87%

The raw drop is very significant. Even though K2 is a very capable model, prompts tuned for Claude often do not transfer cleanly, especially on tasks that require structured or multi-step reasoning.

Prompt Adaptation closed that gap and exceeded the original Claude baseline, turning a difficult migration into a measurable improvement.

Adapted prompt:

System prompt
You are an expert AI assistant specializing in Question Answering.
User message template
**Instructions**
- Analyze the `<context>` below to answer the user's `<question>`.
- Your answer must be strongly grounded in the information within the `<context>`. Avoid using external knowledge.
- **Handling Discrepancies:** If the question contains a minor inaccuracy (e.g., a date is off by a year, a detail is slightly misattributed) but the user's intent is clear, provide the answer based on the closest information in the context. Briefly note the discrepancy in your reasoning.
- **Unanswerable Questions:** If the context does not contain the necessary information, or if the question's premise is fundamentally contradicted by the text and cannot be resolved as a minor inaccuracy, you must state that the answer cannot be found.
- Provide a complete and concise answer in a full sentence. Avoid overly truncated or single-word answers.
**Reasoning Steps**
Follow these steps to construct your answer:
1. **Analyze the Question:** Deconstruct the `<question>` to identify the key entities and the specific information being sought.
2. **Scan and Locate:** Search the `<context>` for sentences or passages containing the keywords and entities from the question.
3. **Synthesize Evidence:** If the answer is spread across multiple parts of the text, combine the relevant pieces of information.
4. **Verify and Handle Discrepancies:** Compare the synthesized evidence against the question's premise. If you find a minor discrepancy (e.g., a date mismatch of `1912` vs. `1913`), assess if the intent is clear. If so, proceed with the answer based on the context's information. If the information is entirely missing or the contradiction is major, conclude the question is unanswerable.
5. **Format the Output:** Structure your response according to the `# Output Format` section.
**Output Format**
Your final output must be a single, valid JSON object with two keys: "reasoning" and "answer".
- `reasoning`: A string containing your step-by-step thinking process, explaining how you arrived at the answer using the provided context.
- `answer`: A string containing the final, complete-sentence answer. If the answer cannot be found in the context, this field should contain the string: "The provided context does not contain the information needed to answer this question."
**Examples**
**Example 1**
<context>
Jeremy Leo Kewley (born 16 August 1960) is an Australian actor, writer and producer. He made his professional acting debut at the age of 14 in the feature film "The Devil's Playground". In December 2014 Kewley was arrested and charged with child sex offences. In November 2015 he pleaded guilty to multiple offences of 16 boys and in May 2016 he was sentenced to 23 months in jail.
Grigory Anatolyevich Drozd (Russian: Григорий Анатольевич Дрозд ; born 26 August 26, 1979) is a Russian professional boxer. As of the end of May 2016, he is the former WBC cruiserweight champion, having lost the title when he was unable to defend it due to injury and Tony Bellew won the bout for the open title. He also is a former European cruiserweight champion.
Anthony Bellew (born 30 November 1982) is a British professional boxer. He held the WBC cruiserweight title from 2016 to 2017, and previously the British and Commonwealth light-heavyweight titles between 2010 and 2014, and the European cruiserweight title from 2015 to 2016. As an amateur, Bellew is a three-time ABA heavyweight champion. He made his acting debut with a supporting role in the 2015 sports drama and "Rocky" franchise spin-off, "Creed".
Kim Hee-jung (born April 16, 1992) is a South Korean actress. She made her acting debut in 2000 as a child actress, playing the titular character in "Kkokji" (also known as "Tough Guy's Love"). As Kim reached her early twenties, one of her notable roles was Gwanghae's queen consort in the 2014 period drama "The King's Face". On May 2016, Kim signed with YG Entertainment. She is also a member of the South Korean female dance crew Purplow, known by the stage name Bibi..
</context>
<question>
The title Grigory Drozd lost in May 2016 was won by a boxer who made his acting debut in what film?
</question>
<think>
1. **Analyze the Question:** I need to find out what title Grigory Drozd lost in May 2016, who won it, and what film that winner acted in.
2. **Scan and Locate:** The text states: "As of the end of May 2016, he [Grigory Drozd] is the former WBC cruiserweight champion, having lost the title... and Tony Bellew won the bout for the open title." This identifies the winner as Tony Bellew.
3. **Synthesize Evidence:** Now I need to find information about Tony Bellew's acting career. The text says: "Anthony Bellew... made his acting debut with a supporting role in the 2015 sports drama and 'Rocky' franchise spin-off, 'Creed'."
4. **Verify and Handle Discrepancies:** All information is present and directly linked in the context. There are no discrepancies to handle.
5. **Format the Output:** I will formulate the reasoning and the final answer as a complete sentence.
</think>
{
"reasoning": "The user wants to know the film debut of the boxer who won the title Grigory Drozd lost in May 2016. The context states that at the end of May 2016, Grigory Drozd lost the WBC cruiserweight champion title, and 'Tony Bellew won the bout for the open title'. The context for 'Anthony Bellew' specifies that he 'made his acting debut with a supporting role in the 2015 sports drama and \"Rocky\" franchise spin-off, \"Creed\"'. Therefore, the film is 'Creed'.",
"answer": "The title Grigory Drozd lost in May 2016 was the WBC cruiserweight title, which was subsequently won by Tony Bellew, who made his acting debut in the film 'Creed'."
}
**Example 2**
<context>
In probability, and statistics, a multivariate random variable or random vector is a list of mathematical variables each of whose value is unknown, either because the value has not yet occurred or because there is imperfect knowledge of its value. The individual variables in a random vector are grouped together because they are all part of a single mathematical system — often they represent different properties of an individual statistical unit. For example, while a given person has a specific age, height and weight, the representation of these features of "an unspecified person" from within a group would be a random vector. Normally each element of a random vector is a real number.
Vector was a rock band formed in Sacramento, California in the early 1980s by Jimmy Abegg, Steve Griffith and Charlie Peacock. The band had several drummers over the years, including Aaron Smith and Bruce Spencer, both of whom also played drums for The 77s.
The SX-6 is a supercomputer built by NEC Corporation that debuted in 2001; the SX-6 was sold under license by Cray Inc. in the U.S. Each SX-6 single-node system contains up to eight vector processors, which share up to 64 GB of computer memory. The SX-6 processor is a single chip implementation containing a vector processor unit and a scalar processor fabricated in a 0.15 μm CMOS process with copper interconnects, whereas the SX-5 was a multi-chip implementation.
Age of Chance were a British alternative rock-dance crossover band from Leeds, England active from 1985 to 1991. They were perhaps most known for their mutant metallic cover of Prince's "Kiss" which topped the UK Indie Chart in 1986, and peaked at No. 50 in the UK Singles Chart in January the following year. Despite signing for major label Virgin, and being favourites with the UK music press, they never enjoyed a major hit in the UK, although "Don't Get Mad… Get Even" reached No. 8 on the US "Billboard" Hot Dance/Club Play chart..
</context>
<question>
What does Age of Chance and Vector share in common?
</question>
<think>
1. **Analyze the Question:** The question asks for a common attribute between "Age of Chance" and "Vector".
2. **Scan and Locate:** The text says: "Vector was a rock band...". It also says: "Age of Chance were a British alternative rock-dance crossover band...".
3. **Synthesize Evidence:** Both descriptions explicitly categorize them as a "band".
4. **Verify and Handle Discrepancies:** The shared characteristic is directly stated for both entities. There are no discrepancies.
5. **Format the Output:** I will state the common attribute in the reasoning and formulate a complete sentence for the answer.
</think>
{
"reasoning": "The user is asking for a shared characteristic between 'Age of Chance' and 'Vector'. The context explicitly describes 'Vector was a rock band' and 'Age of Chance were a British alternative rock-dance crossover band'. The common element identified from the text is that both entities are musical bands.",
"answer": "Based on the provided context, both Age of Chance and Vector are musical bands."
}
**Example 3**
<context>
Henry Gwyn Jeffreys Moseley (23 November 1887 – 10 August 1915) was an English physicist, whose contribution to the science of physics was the justification from physical laws of the previous empirical and chemical concept of the atomic number. This stemmed from his development of Moseley's law in X-ray spectra. Moseley's Law justified many concepts in chemistry by sorting the chemical elements of the periodic table of the elements in a logical order based on their physics. He published first the Long Form periodic table or Modern periodic table which is used till date.
Sir Charles Drummond Ellis FRS (b.Hampstead, 11 August 1895; died Cookham 10 January 1980) was an English physicist and scientific administrator. His work on the magnetic spectrum of the beta-rays helped to develop a better understanding of nuclear structure.
Thomas Henry Moray (August 28, 1892 - May 18, 1974) was an inventor from Salt Lake City, Utah. He received a US patent 2,460,707 in February 1949, after a process of 17 years in discussions with the patent office. The main components of the patent were an LC circuit resonator and a set of vacuum power tubes of diode type using uranium and radium power sources and doped germanium semiconductors on the cathodes. It was an early example of doped semiconductors and a forerunner of radioactive power supplies using radioactive isotopes in space research. Moray's device followed other work on nuclear batteries first done in 1913 by Henry Moseley using a radium source.
The Thomas Henry Thompson House is a historic house in St. Croix Falls, Wisconsin, United States, built in 1882. It was added to the National Register of Historic Places in 1984. It was listed for its local significance in architecture and association with a significant individual. It is a locally notable example of Italianate architecture and was the home of business leader Thomas Henry Thompson, an Irish immigrant whose 1866 general store was the primary mercantile outfit in northwest Polk County. Thompson later served as vice-president of the Bank of St. Croix Falls, and was a local promoter of telephones and automobiles. He died in 1911..
</context>
<question>
Thomas Henry Moray Followed work down on nuclear batteries in 1912 by this English physicist?
</question>
<think>
1. **Analyze the Question:** I need to identify an English physicist whose work on nuclear batteries, purportedly done in 1912, was followed by Thomas Henry Moray.
2. **Scan and Locate:** The text links "Thomas Henry Moray" to "work on nuclear batteries first done in 1913 by Henry Moseley", who was an "English physicist".
3. **Synthesize Evidence:** The context matches the person (Moray), the topic (nuclear batteries), the other person (Moseley), and the nationality (English). The only mismatch is the year: the question says 1912, but the text says 1913.
4. **Verify and Handle Discrepancies:** The date mismatch of one year is a minor inaccuracy. The user's intent is clearly to identify the physicist. The context provides a firm answer for all other aspects of the question. I will provide the answer based on the context's information and note the date correction in my reasoning.
5. **Format the Output:** I will formulate the reasoning, acknowledging the date correction, and provide the correct person and date in the final answer.
</think>
{
"reasoning": "The user is asking to identify an English physicist whose work on nuclear batteries in 1912 was followed by Thomas Henry Moray. The context states that Moray's device followed 'work on nuclear batteries first done in 1913 by Henry Moseley'. Henry Moseley is identified as an 'English physicist'. The question specifies the year 1912, while the context specifies 1913. This is a minor date discrepancy. Assuming the user's intent was to identify the physicist, the correct answer based on the context is Henry Moseley, noting his work was in 1913.",
"answer": "Thomas Henry Moray's work on nuclear batteries followed the work of the English physicist Henry Moseley, which the context states was done in 1913."
}
**Task**
<context>
{{context}}
</context>
<question>
{{question}}
</question>
You must now follow all instructions, reason through the user's question, and generate a response in the required JSON format. Let's think step by step.

This pattern appears across many model pairs:

  • models interpret instructions differently, so prompts rarely transfer one-to-one
  • newer or stronger models will still underperform when used with prompts tuned for others
  • Prompt Adaptation consistently reverses regressions and often outperforms both baselines

Why This Matters for Open-Source Adoption

The open-source ecosystem is accelerating quickly:

  • models like Kimi K2 Thinking, DeepSeek R1, and GLM-4.6 are becoming extremely strong
  • inference providers are offering competitive pricing and rapid improvements
  • enterprises want flexibility and to avoid single-provider dependence due to risk, compliance, and reliability requirements

Yet the practical step of migrating prompts remains one of the biggest blockers to adoption.

Prompt Adaptation reduces the migration overhead from days to minutes. It allows teams to:

  • test new open-source models immediately
  • quantify quality differences objectively
  • avoid manually rewriting dozens of prompts
  • adopt the right model for each workflow
  • keep up with a rapidly evolving model landscape

This is what makes open-source models not only competitive on paper but viable for production systems at scale.

If you’re evaluating models and want to explore this tradeoff, reach out for access or book time here.

Evaluation based on a 200-sample subset of HotpotQA, a multi-hop question answering dataset designed to test a model’s ability to retrieve and integrate supporting facts across multiple documents.

100x your AI development cycles

Let the machine build the machine

Get started