Part 3 of 5

Part 3 - The three levels of testing

Last month, I thought I'd created the perfect prompt for generating social media content. The outputs looked amazing. The structure was spot on. Everything seemed great...until I gave it to my team.

Within hours, they found about a dozen ways to break it. Posts that went off-brand. Responses that missed the mark completely. One particularly creative team member even got it to swear somehow (which was impressive consider Claude is pretty prurient, but not exactly what we were going for).

This is exactly why we test. Because what works perfectly in your hands might fall apart in someone else's.

We need to bulletproof our Blueprint prompt.

Let’s get started:

✍️

Summary

Bulletproof Prompts

Any idiot can make a prompt work once

Three levels of testing

How to know when to pivot vs persist

Real examples of prompt refinement in action

The Three Levels of Testing

Testing isn't just about making sure your prompt "works."

Any idiot can do that. Once!

Instead this is about about making sure it works consistently, reliably, and in the hands of different users. Here are three methods that I use in order.

First method: Consistency Testing

First, we need to make sure the prompt produces reliable results when used repeatedly.

Remember that LLMs (Large Language Models) like ChatGPT are not deterministic.

Putting in the same input will not lead to the same output each time. This isn’t a deterministic mathematical equation. LLMs are instead probability-based which means that each time you’ll get a slightly different output.

The key is to make sure that this probabilistic variability is acceptable!

Here's how:

Run the exact same input 5 times, in new chats so that there is no “memory”

Compare the outputs side by side

Look for inconsistencies in:
- Tone
- Structure
- Level of detail
- Adherence to guidelines

If you're getting wildly different results with the same input, your prompt needs more work. You can either edit or scrap the prompt. But…how do you know when to keep refining a prompt versus starting fresh?

Here's my rule of thumb.

Persist if:

The core outputs are mostly right but need tweaking

Issues are consistent and identifiable

Team feedback points to specific improvements

Edge cases are the main problem

Pivot if:

Basic outputs are consistently off-target

Issues are random and unpredictable

Team feedback suggests fundamental problems

The prompt frequently goes off-rails

You’ll get a feel for this as you go along honestly. Especially the more you work with a model like ChatGPT or Claude. You’ll know when it’s just got the wrong end of the stick and so it’s time to rewrite!

Second method: The Break Test

Now it's time to try and break your prompt.

Yes, really. It’s going to come close to breaking point out in the wild anyway. So let’s stress test it now, try to break it and use the breakpoints to further refine.

Here's what to test:

Empty inputs

One-word inputs

Extremely long inputs

Off-topic inputs

Edge cases specific to your business

For example, if you're creating a customer service prompt, try:

Angry or abusive customer scenarios

Technical queries

Policy exceptions

Non-native English inquiries

Multiple issues in one query

Basically try to capture all the ways humans will be interacting with the prompt. Whether it’s internal (you and your team) or external (customers and users). Obviously with the external testing you need to be much more rigorous!

Document every way the prompt breaks. This is gold for refinement. Embrace the destruction!

Third method: A/B Testing

Now we get scientific. Our prompt is basically good to go and we want to finalise it. Take your prompt and create a variant with one major change. For example:

You are a customer service expert who provides friendly, solution-focused responses...

and its Variant :

You are a customer service expert who prioritises clear, step-by-step solutions while maintaining a friendly tone...

Make sure there is ONE and only one difference in play. This allows us to tell what is actually causing differences in output.

Test both versions with the same inputs. Compare:

Output quality

Consistency

Handling of edge cases

User feedback

Use the variant you prefer (the winner). Then try new A/B variant testing to test again. Again choosing the winner. Repeat as many times as you can (/want to!) to really polish the prompt.

The Pivot vs Persist Decision

Let’s run an example to make all this clearer.

Real Example: Refining a Social Media Prompt

Let's look at how we refined our social media blueprint through testing:

Version 1:

You are a social media expert who creates engaging business content...
[Rest of original Blueprint prompt]

Test Results:

Outputs were too generic

Brand voice inconsistent

No clear hook pattern

Meh. Generally our first pass will always be underwhelming. That’s fine. It’s why we refine!

Based on those results we’ll add in some instruction and steps to make the content more “specific” and interesting. Imagine you are a teacher and a student has come to you with a writing sample. What would you tell the student to help them improve? Tell the same thing to the AI!

Version 2 (added constraints in Narrowing):

You are a social media expert who creates engaging business content. You always:
- Start with a surprising statistic or challenging question
- Use industry-specific examples
- Include one actionable takeaway
- End with a discussion-provoking question
[Rest of prompt]

Test Results:

Better hooks

More specific content

Still lacking brand voice

Getting there but it felt a bit like an AI was writing it! So..let’s add some tone of voice and brand guidelines to help it match our style. We might try something like this.

Version 3 (added voice guidelines):

[Previous prompt plus:]
Brand Voice Guidelines:
- Authoritative but approachable
- Use "we" and "our" to build community
- Share insights from experience
- Avoid jargon unless necessary

Final Results:

Consistent brand voice

Engaging hooks

Specific, valuable content

Good engagement prompts

Understand the basic flow? Now it’s over to you. Use your Blueprint prompt from the last Part and work on i) breaking it and ii) refining based on what broke!

Take your prompt from Part 2

Run it through all three levels of testing

Document every issue you find

Create one refined version

Test again

and…if necessary, again!

Remember, the goal isn't perfection - it's reliability. You want a prompt that works consistently across different users and scenarios.

PS: If you give AI Workshops to businesses, build what you’ve learned today into your presentation. It's such a common issue facing teams, I’ve included a framework to tackle this inside the AI Workshop Kit.

← Part 2 - RISEN™ foundations Part 4 - Grassroots adoption →

AI with Kyle