I’m often asked what the best model is.
It’s a REALLY hard question to answer because i) it depends on the goal and ii) my recommendation today will be different to my recommendation in a week or two.
Everything just moves so fast.
In a field moving this quickly, it's futile to rely on static recommendations. Rather than telling you "Model X is best for task Y" (which might be outdated before you even finish reading), I realised we need something much more valuable – a framework for evaluating models yourself, along with tools that stay current as the landscape evolves.
In this Part I’ll show you the tools you need to make this decision yourself.
Let's get started:
If there's one constant in the AI world, it's change. Here's a sobering fact: over the last 12 months, we've seen more than 50 significant model releases and updates from major providers. And that’s not including the hundreds (nay, thousands) of smaller models out there.
I keep up with this stuff for a living. And it’s overwhelming.
So for anyone out there also building a business and, you know, having a life…I can’t imagine how overwhelming it is!
What was state-of-the-art last quarter might be middling today, and what works best for summarisation might be different from what excels at code generation. It’s a LOT.
This rapid pace of development means that any article, guide, or newsletter (yes, even this one!) that makes specific claims about which model is "best" comes with an expiration date. And it’s shorter than a pint of milk’s.
I've learned this lesson the hard way. Last year, I wrote a detailed comparison of the top models for content creation, spent hours testing and benchmarking... and it was rendered largely obsolete just two weeks later when two major providers released updated models.
Does this mean we just give up? Nah.
Let's focus on equipping you with tools and frameworks that remain valuable regardless of which specific models are leading the pack this month.
I’m going to delegate the hard work here and just tell you the resources I personally use! This is a great starting point.
This site offers detailed comparisons across multiple parameters, with summaries of which models currently excel at specific tasks.
The platform allows you to filter models based on specific criteria like reasoning ability, factual accuracy, or creative writing – helping you quickly narrow down which models might be appropriate for your specific needs.
Tracking AI gives you a quick and dirty "what’s the smartest model" overview.
It regularly subjects AI models to IQ-style tests. Whilst this isn’t a foolproof benchmark (spoiler: there aren’t any), it’s a good quick look at the smarter models.
Getting a bit more hands-on now. This tool lets you run side-by-side tests on two models. Rather than relying on generic benchmarks, you can test models on the exact tasks you care about, seeing how different models respond to the same input.
This site combines a comprehensive leaderboard with a side-by-side comparison tool, giving you both the big picture and the ability to run detailed comparisons.
As a user you can actually run prompts and vote on which model does the best job, which in turn feeds into LM Arena’s rankings. Unlike the other sites here, you actively contribute to the results.
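If you’d rather run this kind of side-by-side test in your own code, here’s a minimal sketch. It assumes the `openai` Python package pointed at an OpenAI-compatible endpoint, and the prompt and model IDs are placeholders – swap in the exact prompts and models from your real workflow.

```python
# A DIY side-by-side test: send the same prompt to two models and compare.
# Assumes the `openai` Python package pointed at an OpenAI-compatible endpoint
# (API key in the OPENAI_API_KEY environment variable). The prompt and model
# IDs are placeholders - swap in the ones from your real workflow.
from openai import OpenAI

client = OpenAI()

PROMPT = "Summarise this customer email in two sentences: <paste email here>"
MODELS = ["model-a", "model-b"]  # placeholder model IDs

for model in MODELS:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(f"=== {model} ===")
    print(response.choices[0].message.content)
    print()
```

The important bit isn’t the code – it’s that you’re comparing models on your tasks, not on someone else’s benchmark.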
These are all useful tools to home in on potential models for your prompting task. Think of them as a way to build a shortlist. From here we can narrow down the options until we find the best model. First, though, a major warning!
One of the most important concepts to understand when selecting models is the "capability cliff" – the point at which a model becomes good enough for your specific needs. Paying for capabilities beyond this cliff often yields diminishing returns.
For example, let’s say you want to build prompts and workflows to answer customer service emails. You’d test the same prompt across several models and compare how reliably each one produces a usable reply. Say Model D comes out on top.
Seeing this you might naturally think “OK, easy, Model D. We’re done here.”
But quality isn’t the only factor in play here.
Imagine the price difference between Model B and Model D is 10x, yet the actual improvement in usable outputs is just 9 percentage points.
Huh. OK that changes things.
For the example workflow, Model B represents the capability cliff – the point where the model is "good enough" for our needs. At this point (85% reliability at a very low cost), we don’t get much marginal improvement by spending an awful lot more.
We can instead focus on refining from that low-cost result.
To identify the capability cliff for your own use case, here’s the rough outline: i) define what "good enough" looks like for your task, ii) run the same prompt across a shortlist of models, iii) score each model’s outputs against that bar, iv) note what each model costs, and v) pick the cheapest model that clears the bar.
The ideal model is not always the most advanced one – it's the one that satisfies your requirements at the lowest cost.
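To make the "cheapest model that clears the bar" idea concrete, here’s a tiny sketch. Every number in it is a made-up placeholder (Model B at 85% reliability and Model D at roughly 10x the price, matching the example above) – plug in your own test results.

```python
# Illustrative only: pick the cheapest model that clears your quality bar.
# Every number below is a placeholder - replace with your own test results.
results = {
    "Model A": {"reliability": 0.71, "cost": 0.5},
    "Model B": {"reliability": 0.85, "cost": 2.0},
    "Model C": {"reliability": 0.88, "cost": 8.0},
    "Model D": {"reliability": 0.94, "cost": 20.0},  # ~10x Model B's cost
}

QUALITY_BAR = 0.85  # what "good enough" means for this workflow

good_enough = {
    name: stats for name, stats in results.items()
    if stats["reliability"] >= QUALITY_BAR
}

# The capability cliff: the cheapest model that still clears the bar.
cliff = min(good_enough, key=lambda name: good_enough[name]["cost"])
print(f"Capability cliff: {cliff} "
      f"({good_enough[cliff]['reliability']:.0%} reliable at cost {good_enough[cliff]['cost']})")
```

Run against these illustrative numbers it picks Model B: the cheapest option that still clears the quality bar.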
For simple content generation, older or smaller models might be perfectly adequate. For complex reasoning tasks, you might indeed need the most advanced options. The right option depends entirely on your goal. So don’t run straight off the cliff, Wile E. Coyote style!
The other BIG decision factor here is open vs. closed source.
Closed, proprietary models (like GPT-4, Claude, and Gemini) and open-source models (like Llama, Mistral, and various community models) have different pros and cons.
Generally the frontier models – the most advanced – are proprietary. That makes a lot of sense because these companies have a tonne of cash to throw into improving their models!
But for some uses we may want to use an open-source model.
If you're accessing models through their official APIs or web interfaces, the closed vs. open distinction matters less from a technical perspective. You're essentially consuming the model as a service, regardless of whether the underlying technology is open or closed.
What does matter is cost, ease of integration, reliability, and how your data is handled.
For most businesses starting with AI integration, proprietary APIs often provide the path of least resistance, with well-documented interfaces and reliable performance. It’s much easier to spin up a tool using an existing API. The tradeoff is typically higher cost and less control over how your data is handled.
If you're deploying models yourself (or working with a team that is), the closed vs. open distinction becomes absolutely crucial: open models can run on your own infrastructure, while closed models generally can't.
Self-hosting comes with significant considerations: the hardware to run the model, the expertise to deploy and maintain it, and the ongoing work of monitoring and updating it.
For businesses considering this path, previous Playbooks cover fine-tuning and deployment in more detail. But definitely hire an expert! The key insight is that this approach trades higher initial complexity for greater control and potentially lower long-term costs at scale. It’s a big business decision.
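If you do go down this road, the code itself can be deceptively simple – the hard part is everything around it (hardware, scaling, monitoring, updates). Here’s a minimal sketch, assuming the Hugging Face `transformers` library, a machine with enough GPU memory, and an open-weights model you have access to; the model ID is just an example.

```python
# Minimal self-hosting sketch using the Hugging Face transformers library.
# Assumes a machine with enough GPU memory and an open-weights model you're
# licensed to use - the model ID below is just an example.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",
    device_map="auto",
)

output = generator(
    "Draft a polite reply to a customer asking about a delayed order.",
    max_new_tokens=200,
)
print(output[0]["generated_text"])
```

It’s a toy example, but it shows where the control comes from: the model runs on your hardware, on your terms.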
OK! Let’s pull all this together into an actionable decision tree.
If privacy/data sovereignty is paramount: lean towards open-source models you can self-host, so your data never leaves your own infrastructure.
If you need the absolute best performance regardless of cost: start with the frontier proprietary models and use the comparison tools above to check which currently leads for your task.
If you're cost-sensitive but need good performance: find your capability cliff – test cheaper models first and only trade up if they genuinely fall short.
If speed is critical: favour smaller, lighter models and measure latency directly with side-by-side tests on your own prompts.
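If it helps to see those branches written down as actual logic, here’s a deliberately simplified sketch – the suggestions just mirror the branches above, and the priority labels are my own shorthand.

```python
# The decision tree above as a (deliberately simplified) function.
# It returns a starting point, not a verdict - validate with current data.
def suggest_starting_point(priority: str) -> str:
    suggestions = {
        "privacy": "Self-host an open-source model so data stays on your infrastructure.",
        "performance": "Start with the frontier proprietary models and compare on your task.",
        "cost": "Find your capability cliff: test cheaper models first, upgrade only if needed.",
        "speed": "Favour smaller, lighter models and measure latency on your real prompts.",
    }
    return suggestions.get(priority, "Run a side-by-side test on your own prompts.")

print(suggest_starting_point("cost"))
```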
This decision tree provides a starting point, but remember that the landscape changes rapidly! Use the comparison tools we discussed to validate these decisions with current data. The only constant is change!
In this Part we’ve looked at how to navigate the constantly evolving landscape of AI models. Rather than giving you soon-to-be-outdated recommendations, we've focused on equipping you with tools and frameworks to evaluate models yourself based on your specific needs. Hopefully that’s more useful!
Tomorrow, we'll wrap up our week on prompting mastery with advanced reasoning. We'll dive deep into powerful techniques like Chain of Thought, Tree of Thought and reasoning models.