The future is multi-model.

notdiamond-0001 is our first model router available on Hugging Face. notdiamond-0001 takes any input and determines whether to send it to GPT-3.5 or GPT-4, optimizing for the highest accuracy while drastically reducing your costs and latency.

We’ve spent the last month working to make sure notdiamond-0001 meets the following five criteria:

  • Accurate: Trained on hundreds of thousands of data points from robust, cross-domain evaluation benchmarks, notdiamond-0001 outperforms GPT-4 by a factor of 1.51x when used as a router.
  • Transparent: When you call the notdiamond-0001 endpoint from your application, it returns a label for either GPT-3.5 or GPT-4. You determine which version of each model you want to use and make the calls client-side with your own keys.
  • Fast: notdiamond-0001 determines which model to call in under 10ms.¹ That’s less than the time it takes GPT-4 to stream a single token. And by routing queries to GPT-3.5 when appropriate, you can expect drastic net speedups in your model calls.
  • Secure: We don’t store any of your input data or use it for training.
  • Free: notdiamond-0001 is entirely free to use. We want to help everyone, everywhere maximize their LLM performance. Making notdiamond-0001 free is the best way we believe we can support AI developers around the world.

To get started with notdiamond-0001, read our documentation and download it on Hugging Face.

Frequently asked questions

How does notdiamond-0001 determine which model to call?

Our router has been trained on 250,000 data points from robust, cross-domain evaluation benchmarks to optimize for measurable quality metrics like accuracy, ROUGE, BLEU, etc. These benchmarks cover everything from code generation to text summarization, medicine, and law.

Unlike deterministic routers, we don't route based on simple categories or domains. Instead, routing decisions are far more fine-grained. Here are some examples of prompts that get routed to either GPT-3.5 or GPT-4:

Example prompt pairs, showing which model each is routed to:

  • Sends to GPT-3.5: What is prostate cancer?
    Sends to GPT-4: Can you help me understand whether this prostate pathology report could indicate the presence of cancer? CLINICAL DATA: A-H: ELEVATED PROSTATE, PROSTATE [A] LEFT BASE… [continues]

  • Sends to GPT-3.5: What are common causes of a 401 error?
    Sends to GPT-4: This function is throwing a 401 error when it's being called. Do you see anything that could be contributing to that? def get_user_data(request, response, db: Session = Depends(get_db))… [continues]

  • Sends to GPT-3.5: What does this paragraph mean?: “As your perspective of the world increases not only is the pain it inflicts on you less but also its meaning… [continues]
    Sends to GPT-4: Please help me complete this paragraph with a pared down description in the style of Karl Ove Knausgård: The broken cup lay on the table.

On average, what percentage of queries do you send to each model?

We found no skew toward either model across our classification dataset: GPT-3.5 is capable of handling roughly 50% of queries, and we route the other 50% to GPT-4. This ratio may vary for your application depending on the distribution of inputs.
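A 50/50 split translates directly into a blended cost estimate. The per-1K-token prices below are illustrative placeholders (not quoted rates); the arithmetic is the point:

```python
# Back-of-the-envelope blended cost under a 50/50 routing split.
# The prices used below are hypothetical placeholders, not real quotes.
def blended_cost(p_gpt35: float, cost_gpt35: float, cost_gpt4: float) -> float:
    """Expected cost per 1K tokens when fraction p_gpt35 routes to GPT-3.5."""
    return p_gpt35 * cost_gpt35 + (1 - p_gpt35) * cost_gpt4

# Example with placeholder prices of $0.002 (GPT-3.5) and $0.03 (GPT-4):
baseline = blended_cost(0.0, 0.002, 0.03)  # everything sent to GPT-4
routed = blended_cost(0.5, 0.002, 0.03)    # 50/50 routing split
savings = 1 - routed / baseline            # fraction saved vs. GPT-4-only
```

With these placeholder numbers, the 50/50 split cuts the GPT-4-only bill by roughly 47%; your actual savings depend on your prompt mix and the prices you pay.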

Do you keep track of how many of my queries get sent to which model?

We don't keep track of your model routing destinations. Since all calls go out client-side, we encourage you to log calls locally.
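Since the routing decision happens in your own code, logging it is a few lines. One possible sketch (purely illustrative):

```python
# A simple local log of routing destinations, since the router itself
# doesn't track them. Illustrative only.
from collections import Counter

route_counts: Counter = Counter()

def log_route(label: str) -> None:
    """Record which model a query was routed to."""
    route_counts[label] += 1

# e.g. call log_route(label) after each routing decision:
for label in ["gpt-3.5", "gpt-4", "gpt-3.5"]:
    log_route(label)
```

You could just as easily append each decision to a file or your existing metrics pipeline; the point is that the data stays on your side.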

Will this slow down my application by adding an extra step in the model call?

Our router is extremely fast and will determine which model to call in under 10ms. By routing appropriate queries to GPT-3.5 rather than GPT-4, you'll also see significant net speedups compared to sending everything to GPT-4.

Will you be adding other models, such as Gemini, Mistral, Claude, Llama, or Cohere?

We are working to onboard more models as quickly as possible. If you have a model you'd like to request, please email us!

Why we’re building Not Diamond

We believe in a multi-model future. The world won't have one single, giant model that everyone sends everything to—instead, there will be many foundation models, millions of fine-tuned variants of those models, and countless custom inference engines running on top of them. We believe this is not only a better future for AI, but a safer one as well. We started Not Diamond to enable this multi-model future, starting with safe and robust infrastructure for routing between models.

Why routing? Over the past months, we’ve talked to hundreds of developers and companies building on top of LLMs, from early-stage startups to Fortune 500 companies. For nearly everyone, model routing is a big, hairy, audacious problem. It sucks. Teams are using heuristics to route deterministically with if/else statements and regular expressions, trying to train their own classifiers to route inputs, A/B testing model selections, or hand-writing prompts in an attempt to use slow and bulky agents as routers. When teams do manage to get a router working, it frequently breaks whenever the underlying models update. Meanwhile, those who have managed to build functional routers have seen huge gains in their product quality and margins. We decided there had to be a better way.
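The heuristic approach described above typically looks something like this deliberately brittle toy (not anyone's production code):

```python
import re

def heuristic_router(prompt: str) -> str:
    """A toy deterministic router of the if/else-plus-regex kind
    described above -- brittle, and illustrative only."""
    # Guess: long prompts or prompts containing code need the stronger model.
    if len(prompt) > 500 or re.search(r"\bdef |\bclass |\btraceback\b", prompt, re.I):
        return "gpt-4"
    return "gpt-3.5"
```

Rules like these encode one snapshot of model behavior, which is why they break when the underlying models update.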

If you’re using GPT-4, notdiamond-0001 will lead to an immediate and drastic reduction in your inference costs and latency without any degradation in quality. Or, if you’re using GPT-3.5, you can enjoy a much higher response quality without significantly increasing your bill. And in either case, our router will protect you from expensive outages—we monitor OpenAI's models 24/7 so that you never need to experience a gap in service.

But notdiamond-0001 is just the first step in a much more comprehensive product roadmap. Over the coming months, we’ll be releasing a lot more, including the ability to dynamically route to Claude, Llama, Mistral, and many more public models, as well as your own fine-tuned models and custom workflows, agents, RAG applications, and chains.

As a team, we've built venture-scale companies, developed products for billions of users, and published cutting-edge research in top AI journals. We’re excited to be backed by some of the world's best builders, including the founders of companies like Hugging Face, Replicated, Giphy, Indeed, and many more.

We’re actively hiring, so drop us a line if you want to help us build a multi-model future.

1. This value refers to notdiamond-0001's inference speed.