September 24, 2024

By Devansh Jain, Tze-Yang Tung, and Tomás Hernando Kofman

*Two roads diverged in a yellow wood,*
*And sorry I could not travel both…*

Today we are open-sourcing RoRF (Routing on Random Forests), a pairwise LLM router that we developed earlier this year as a precursor to the more powerful router currently available through our API. In addition to our inference and training code, we are also releasing a collection of 6 open-source pre-trained LLM routers for the following model pairs:

- GPT-4o <> Claude 3.5 Sonnet
- Llama 3.1 405B <> Mistral Large 2407
- Claude 3.5 Sonnet <> Llama 3.1 405B
- GPT-4o <> Llama 3.1 405B
- GPT-4o <> GPT-4o-mini
- Llama 3.1 405B <> Llama 3.1 70B

Unlike the full-fledged router that we released last month, RoRF can only route between two models and cannot learn in real time from feedback. Even so, RoRF is both *more powerful* and *less expensive* than using an individual model. It also outperforms other open-source router architectures such as those found in RouteLLM.

RoRF is a Random Forest Classifier that takes evaluation data from LLMs and learns a mapping between prompt embeddings and the most suitable model for each prompt. The model is trained to predict the probabilities of four possible outcomes:

- model \(A\) is correct but model \(B\) is not (label `0`)
- model \(A\) and model \(B\) are both incorrect (label `1`)
- model \(A\) and model \(B\) are both correct (label `2`)
- model \(A\) is incorrect but model \(B\) is correct (label `3`)
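As a minimal sketch of this labeling scheme, the function below maps a pair of correctness flags to one of the four labels. The name `outcome_label` and the sample evaluation records are hypothetical and are not taken from the RoRF codebase.

```python
def outcome_label(a_correct: bool, b_correct: bool) -> int:
    """Map the correctness of models A and B on one prompt to a label in {0, 1, 2, 3}."""
    if a_correct and not b_correct:
        return 0  # A correct, B incorrect
    if not a_correct and not b_correct:
        return 1  # both incorrect
    if a_correct and b_correct:
        return 2  # both correct
    return 3      # A incorrect, B correct


# Hypothetical evaluation records: (A was correct, B was correct)
evals = [(True, False), (False, False), (True, True), (False, True)]
labels = [outcome_label(a, b) for a, b in evals]
```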

We let \(l_i \in \{0, 1, 2, 3\}\) be the label assigned to prompt \(x_i \in \mathbb{R}^N\). The router \(f: \mathbb{R}^N \to \mathbb{R}^4\) predicts the probability of each label \(P(l_i = j \mid x_i) = f_j(x_i)\), where \(f_j\) refers to the \(j\)th dimension of the model output. In our routers, we use `jinaai/jina-embeddings-v3` to embed the prompts (\(N = 1024\)) and we parameterize \(f\) as a Random Forest Classifier, where the probability of each label is computed empirically as the proportion of estimators that predict that label.
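To make the "proportion of estimators" idea concrete, here is a small sketch that turns the per-tree votes for a single prompt into the four label probabilities. The helper `vote_proportions` and the vote values are illustrative assumptions; this is not the RoRF implementation, and library implementations of random forests may aggregate per-tree outputs differently.

```python
from collections import Counter
from typing import List, Sequence

NUM_LABELS = 4  # labels 0..3 as defined above


def vote_proportions(tree_votes: Sequence[int]) -> List[float]:
    """Return, for each label, the share of trees that voted for it."""
    if not tree_votes:
        raise ValueError("need at least one tree vote")
    counts = Counter(tree_votes)
    total = len(tree_votes)
    return [counts.get(label, 0) / total for label in range(NUM_LABELS)]


# Hypothetical votes from a forest of 10 trees for one prompt
votes = [0, 0, 2, 3, 2, 2, 1, 2, 0, 2]
probs = vote_proportions(votes)  # one probability per label, summing to 1
```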

Based on the predicted probabilities, we then compute the probability of selecting model \(A\) as the sum of the probabilities \(P(A | x_i) = P(l_i = 0 | x_i) + P(l_i = 1 | x_i)\). Likewise, for the probability of selecting model \(B\), \(P(B | x_i) = P(l_i = 2 | x_i) + P(l_i = 3 | x_i)\). This means we choose model \(B\) if both models are predicted to be correct and model \(A\) if both models are predicted to be incorrect.
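The aggregation above can be sketched in a few lines. The function name `selection_probabilities` and the example probabilities are hypothetical, chosen only to illustrate the two sums.

```python
from typing import Sequence, Tuple


def selection_probabilities(label_probs: Sequence[float]) -> Tuple[float, float]:
    """Collapse the four label probabilities into P(A | x) and P(B | x)."""
    if len(label_probs) != 4:
        raise ValueError("expected probabilities for labels 0, 1, 2, 3")
    p_a = label_probs[0] + label_probs[1]  # only A correct, or both incorrect
    p_b = label_probs[2] + label_probs[3]  # both correct, or only B correct
    return p_a, p_b


# Hypothetical router output for one prompt
p_a, p_b = selection_probabilities([0.3, 0.1, 0.5, 0.1])
```

Because the four label probabilities sum to one, so do `p_a` and `p_b`, which means a single number, \(P(A | x_i)\), is enough to drive the routing decision.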

Finally, we apply a threshold \(\tau \in [0, 1]\) to determine the model \(M\) to route to:

$$M = \begin{cases}A, & \text{if } P(A \mid x_i) \geq \tau \\ B, & \text{otherwise.}\end{cases}$$

In essence, this threshold rule allows us to tune the cost-performance tradeoff.
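A threshold rule of this kind might look like the sketch below. It assumes that model \(A\) is selected when \(P(A | x_i)\) is at least \(\tau\), that is, when the router is sufficiently confident in \(A\); that direction, the function name `route`, and the sample values are assumptions of this sketch rather than details of the RoRF code.

```python
def route(p_a: float, tau: float) -> str:
    """Return "A" if P(A | x) reaches the threshold tau, otherwise "B"."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return "A" if p_a >= tau else "B"


# Hypothetical P(A | x) for one prompt, routed under two different thresholds
choice_strict = route(0.4, tau=0.5)   # threshold not reached, so B
choice_lenient = route(0.4, tau=0.3)  # threshold reached, so A
```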

In Fig. 2, we compare RoRF to RouteLLM when routing between Claude 3.5 Sonnet and Llama 3.1 405B Instruct. When routing between these two strong models, the highest accuracy is achieved when a mixture of the two models is used, clearly showing that routing queries improves accuracy. It also shows that even for models that are similarly strong on average, the sets of prompts each model answers correctly differ significantly. Moreover, our router achieves its highest performance with a 12.5% cost reduction relative to using Claude 3.5 Sonnet alone.

When routing between a strong and a weak model, such as Llama 3.1 405B Instruct and Llama 3.1 70B Instruct (Fig. 3), the router's performance changes smoothly with the ratio of strong (Llama 3.1 405B) to weak (Llama 3.1 70B) model calls. The threshold therefore acts as a lever to tune the tradeoff between accuracy and cost for individual use cases.
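To illustrate how the threshold acts as a lever, the sketch below sweeps \(\tau\) over a handful of values and reports the fraction of prompts sent to model \(A\). It assumes \(A\) is chosen when \(P(A | x_i)\) is at least \(\tau\); the helper name and the probabilities are made up for illustration and do not come from our evaluation data.

```python
from typing import Dict, Sequence


def fraction_routed_to_a(p_a_values: Sequence[float], tau: float) -> float:
    """Share of prompts whose P(A | x) is at least tau, i.e. routed to model A."""
    if not p_a_values:
        raise ValueError("need at least one prompt")
    routed = sum(1 for p in p_a_values if p >= tau)
    return routed / len(p_a_values)


# Hypothetical P(A | x) values for five prompts
p_a_values = [0.1, 0.35, 0.5, 0.8, 0.95]

# Raising tau sends fewer prompts to A and more to B
sweep: Dict[float, float] = {
    tau: fraction_routed_to_a(p_a_values, tau)
    for tau in (0.0, 0.25, 0.5, 0.75, 1.0)
}
```

In this toy example, the share of calls to \(A\) falls monotonically as \(\tau\) increases, which is the knob one would turn to trade accuracy against cost.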

Lastly, we evaluated our routers on MMLU and BigBenchHard, using for each router the percentage of calls to model \(A\) that gave the best performance. When we route between two strong models, we achieve higher scores than either individual model while also reducing cost. In strong versus weak model pairs, we outperform RouteLLM at a lower cost.

RoRF demonstrates strong performance across model pairs, conceptually illustrating how we're able to achieve even better performance by routing between a greater number of LLMs.

You can try out our open-source routers in the open-source RoRF repo, or by downloading them directly from our Hugging Face repo. We have also open-sourced the training method so that you can bring your own data and train your own routers on any model pair you want.

We believe that routing can not only drive stronger performance and computational efficiency in AI, but also improve the safety of AI systems. We’re excited to share this architecture with the community. Please reach out if you have any questions or if you want to help us build a multi-model future.