I’m often asked what the best model is.
It’s a REALLY hard question to answer because i) it depends on the goal and ii) my recommendation today will be different to my recommendation in a week or two.
Everything just moves so fast.
In a field moving this quickly, it's futile to rely on static recommendations. Rather than telling you "Model X is best for task Y" (which might be outdated before you even finish reading), I realised we need something much more valuable – a framework for evaluating models yourself, along with tools that stay current as the landscape evolves.
In this Part I’ll show you the tools you need to make this decision yourself.
Let's get started:
If there's one constant in the AI world, it's change. Here's a sobering fact: over the last 12 months, we've seen more than 50 significant model releases and updates from major providers. And that’s not including the hundreds (nay, thousands) of smaller models out there.
I keep up with this stuff for a living. And it’s overwhelming.
So for anyone out there also building a business and, you know, having a life…I can’t imagine how overwhelming it is!
What was state-of-the-art last quarter might be middling today, and what works best for summarisation might be different from what excels at code generation. It’s a LOT.
This rapid pace of development means that any article, guide, or newsletter (yes, even this one!) that makes specific claims about which model is "best" comes with an expiration date. And it’s shorter than a pint of milk’s.
I've learned this lesson the hard way. Last year, I wrote a detailed comparison of the top models for content creation, spent hours testing and benchmarking... and it was rendered largely obsolete just two weeks later when two major providers released updated models.
Does this mean we just give up? Nah.
Let's focus on equipping you with tools and frameworks that remain valuable regardless of which specific models are leading the pack this month.
I’m going to delegate the hard work here and just tell you the resources I personally use! This is a great starting point.
This site offers detailed comparisons across multiple parameters, with summaries of which models currently excel at specific tasks.
The platform allows you to filter models based on specific criteria like reasoning ability, factual accuracy, or creative writing – helping you quickly narrow down which models might be appropriate for your specific needs.
Tracking AI gives you a quick and dirty "what’s the smartest model" overview.
It regularly subjects AI models to IQ-style tests. Whilst this isn’t a foolproof benchmark (spoiler: there aren’t any), it’s a good quick look at the smarter models.
Getting a bit more hands-on now. This tool lets you run side-by-side tests on two models. Rather than relying on generic benchmarks, you can test models on the exact tasks you care about, seeing how different models respond to the same input.
This site combines a comprehensive leaderboard with a side-by-side comparison tool, giving you both the big picture and the ability to run detailed comparisons.
As a user you can actually run prompts and vote on which model does the best job, which in turn feeds into LM Arena’s rankings. Unlike the other sites here, you actively contribute to the results.
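If you’d rather run this kind of side-by-side test in your own code, here’s a minimal sketch. It assumes the `openai` Python package pointed at an OpenAI-compatible endpoint, and the prompt and model IDs are placeholders – swap in the exact prompts and models from your real workflow.

```python
# A DIY side-by-side test: send the same prompt to two models and compare.
# Assumes the `openai` Python package pointed at an OpenAI-compatible endpoint
# (API key in the OPENAI_API_KEY environment variable). The prompt and model
# IDs are placeholders - swap in the ones from your real workflow.
from openai import OpenAI

client = OpenAI()

PROMPT = "Summarise this customer email in two sentences: <paste email here>"
MODELS = ["model-a", "model-b"]  # placeholder model IDs

for model in MODELS:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(f"=== {model} ===")
    print(response.choices[0].message.content)
    print()
```

The important bit isn’t the code – it’s that you’re comparing models on your tasks, not on someone else’s benchmark.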
These are all useful tools to home in on potential models for your prompting task. Think of them as a way to build a shortlist. From here we can narrow down the options until we find the best model. First, though, a major warning!
One of the most important concepts to understand when selecting models is the "capability cliff" – the point at which a model becomes good enough for your specific needs. Paying for capabilities beyond this cliff often yields diminishing returns.
For example, let’s say you want to build prompts and workflows to answer customer service emails. You’d test the same prompt across several models and compare how reliably each one produces a usable reply. Say Model D comes out on top.
Seeing this you might naturally think “OK, easy, Model D. We’re done here.”
But quality isn’t the only factor in play here.
Imagine the price difference between Model B and Model D is 10x, yet the actual improvement in usable outputs is just 9 percentage points.
Huh. OK that changes things.
For the example workflow, Model B represents the capability cliff – the point where the model is "good enough" for our needs. At this point (85% reliability at a very low cost), we don’t get much marginal improvement by spending an awful lot more.
We can instead focus on refining from that low-cost result.
To identify the capability cliff for your own use case, here’s the rough outline: i) define what "good enough" looks like for your task, ii) run the same prompt across a shortlist of models, iii) score each model’s outputs against that bar, iv) note what each model costs, and v) pick the cheapest model that clears the bar.
The ideal model is not always the most advanced one – it's the one that satisfies your requirements at the lowest cost.
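To make the "cheapest model that clears the bar" idea concrete, here’s a tiny sketch. Every number in it is a made-up placeholder (Model B at 85% reliability and Model D at roughly 10x the price, matching the example above) – plug in your own test results.

```python
# Illustrative only: pick the cheapest model that clears your quality bar.
# Every number below is a placeholder - replace with your own test results.
results = {
    "Model A": {"reliability": 0.71, "cost": 0.5},
    "Model B": {"reliability": 0.85, "cost": 2.0},
    "Model C": {"reliability": 0.88, "cost": 8.0},
    "Model D": {"reliability": 0.94, "cost": 20.0},  # ~10x Model B's cost
}

QUALITY_BAR = 0.85  # what "good enough" means for this workflow

good_enough = {
    name: stats for name, stats in results.items()
    if stats["reliability"] >= QUALITY_BAR
}

# The capability cliff: the cheapest model that still clears the bar.
cliff = min(good_enough, key=lambda name: good_enough[name]["cost"])
print(f"Capability cliff: {cliff} "
      f"({good_enough[cliff]['reliability']:.0%} reliable at cost {good_enough[cliff]['cost']})")
```

Run against these illustrative numbers it picks Model B: the cheapest option that still clears the quality bar.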
For simple content generation, older or smaller models might be perfectly adequate. For complex reasoning tasks, you might indeed need the most advanced options. The right option depends entirely on your goal. So don’t run straight off the cliff, Wile E. Coyote style!
The other BIG decision factor here is open vs. closed source.
Closed, proprietary models (like GPT-4, Claude, and Gemini) and open-source models (like Llama, Mistral, and various community models) have different pros and cons.
Generally the frontier models – the most advanced – are proprietary. That makes a lot of sense because these companies have a tonne of cash to throw into improving their models!
But for some uses we may want to use an open-source model.
If you're accessing models through their official APIs or web interfaces, the closed vs. open distinction matters less from a technical perspective. You're essentially consuming the model as a service, regardless of whether the underlying technology is open or closed.
What does matter is cost, ease of integration, reliability, and how your data is handled.
For most businesses starting with AI integration, proprietary APIs often provide the path of least resistance, with well-documented interfaces and reliable performance. It’s much easier to spin up a tool using an existing API. The tradeoff is typically higher cost and less control over how your data is handled.
If you're deploying models yourself (or working with a team that is), the closed vs. open distinction becomes absolutely crucial: open models can run on your own infrastructure, while closed models generally can't.
Self-hosting comes with significant considerations: the hardware to run the model, the expertise to deploy and maintain it, and the ongoing work of monitoring and updating it.
For businesses considering this path, previous Playbooks cover fine-tuning and deployment in more detail. But definitely hire an expert! The key insight is that this approach trades higher initial complexity for greater control and potentially lower long-term costs at scale. It’s a big business decision.
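If you do go down this road, the code itself can be deceptively simple – the hard part is everything around it (hardware, scaling, monitoring, updates). Here’s a minimal sketch, assuming the Hugging Face `transformers` library, a machine with enough GPU memory, and an open-weights model you have access to; the model ID is just an example.

```python
# Minimal self-hosting sketch using the Hugging Face transformers library.
# Assumes a machine with enough GPU memory and an open-weights model you're
# licensed to use - the model ID below is just an example.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
)

output = generator(
    "Draft a polite reply to a customer asking about a delayed order.",
    max_new_tokens=200,
)
print(output[0]["generated_text"])
```

It’s a toy example, but it shows where the control comes from: the model runs on your hardware, on your terms.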
OK! Let’s pull all this together into an actionable decision tree.
If privacy/data sovereignty is paramount: lean towards open-source models you can self-host, so your data never leaves your own infrastructure.
If you need the absolute best performance regardless of cost: start with the frontier proprietary models and use the comparison tools above to check which currently leads for your task.
If you're cost-sensitive but need good performance: find your capability cliff – test cheaper models first and only trade up if they genuinely fall short.
If speed is critical: favour smaller, lighter models and measure latency directly with side-by-side tests on your own prompts.
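If it helps to see those branches written down as actual logic, here’s a deliberately simplified sketch – the suggestions just mirror the branches above, and the priority labels are my own shorthand.

```python
# The decision tree above as a (deliberately simplified) function.
# It returns a starting point, not a verdict - validate with current data.
def suggest_starting_point(priority: str) -> str:
    suggestions = {
        "privacy": "Self-host an open-source model so data stays on your infrastructure.",
        "performance": "Start with the frontier proprietary models and compare on your task.",
        "cost": "Find your capability cliff: test cheaper models first, upgrade only if needed.",
        "speed": "Favour smaller, lighter models and measure latency on your real prompts.",
    }
    return suggestions.get(priority, "Run a side-by-side test on your own prompts.")

print(suggest_starting_point("cost"))
```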
This decision tree provides a starting point, but remember that the landscape changes rapidly! Use the comparison tools we discussed to validate these decisions with current data. The only constant is change!
In this Part we’ve looked at how to navigate the constantly evolving landscape of AI models. Rather than giving you soon-to-be-outdated recommendations, we've focused on equipping you with tools and frameworks to evaluate models yourself based on your specific needs. Hopefully that’s more useful!
Tomorrow, we'll wrap up our week on prompting mastery with advanced reasoning. We'll dive deep into powerful techniques like Chain of Thought, Tree of Thought and reasoning models.