Last month, I thought I'd created the perfect prompt for generating social media content. The outputs looked amazing. The structure was spot on. Everything seemed great...until I gave it to my team.
Within hours, they found about a dozen ways to break it. Posts that went off-brand. Responses that missed the mark completely. One particularly creative team member even got it to swear somehow (which was impressive consider Claude is pretty prurient, but not exactly what we were going for).
This is exactly why we test. Because what works perfectly in your hands might fall apart in someone else's.
We need to bulletproof our Blueprint prompt.
Let’s get started:
Bulletproof Prompts
Testing isn't just about making sure your prompt "works."
Any idiot can do that. Once!
Instead this is about about making sure it works consistently, reliably, and in the hands of different users. Here are three methods that I use in order.
First, we need to make sure the prompt produces reliable results when used repeatedly.
Remember that LLMs (Large Language Models) like ChatGPT are not deterministic.
Putting in the same input will not lead to the same output each time. This isn’t a deterministic mathematical equation. LLMs are instead probability-based which means that each time you’ll get a slightly different output.
The key is to make sure that this probabilistic variability is acceptable!
Here's how:
If you're getting wildly different results with the same input, your prompt needs more work. You can either edit or scrap the prompt. But…how do you know when to keep refining a prompt versus starting fresh?
Here's my rule of thumb.
Persist if:
Pivot if:
You’ll get a feel for this as you go along honestly. Especially the more you work with a model like ChatGPT or Claude. You’ll know when it’s just got the wrong end of the stick and so it’s time to rewrite!
Now it's time to try and break your prompt.
Yes, really. It’s going to come close to breaking point out in the wild anyway. So let’s stress test it now, try to break it and use the breakpoints to further refine.
Here's what to test:
For example, if you're creating a customer service prompt, try:
Basically try to capture all the ways humans will be interacting with the prompt. Whether it’s internal (you and your team) or external (customers and users). Obviously with the external testing you need to be much more rigorous!
Document every way the prompt breaks. This is gold for refinement. Embrace the destruction!
Now we get scientific. Our prompt is basically good to go and we want to finalise it. Take your prompt and create a variant with one major change. For example:
You are a customer service expert who provides friendly, solution-focused responses...
and its Variant :
You are a customer service expert who prioritises clear, step-by-step solutions while maintaining a friendly tone...
Make sure there is ONE and only one difference in play. This allows us to tell what is actually causing differences in output.
Test both versions with the same inputs. Compare:
Use the variant you prefer (the winner). Then try new A/B variant testing to test again. Again choosing the winner. Repeat as many times as you can (/want to!) to really polish the prompt.
Let’s run an example to make all this clearer.
Let's look at how we refined our social media blueprint through testing:
Version 1:
You are a social media expert who creates engaging business content...
[Rest of original Blueprint prompt]
Test Results:
Meh. Generally our first pass will always be underwhelming. That’s fine. It’s why we refine!
Based on those results we’ll add in some instruction and steps to make the content more “specific” and interesting. Imagine you are a teacher and a student has come to you with a writing sample. What would you tell the student to help them improve? Tell the same thing to the AI!
Version 2 (added constraints in Narrowing):
You are a social media expert who creates engaging business content. You always:
- Start with a surprising statistic or challenging question
- Use industry-specific examples
- Include one actionable takeaway
- End with a discussion-provoking question
[Rest of prompt]
Test Results:
Getting there but it felt a bit like an AI was writing it! So..let’s add some tone of voice and brand guidelines to help it match our style. We might try something like this.
Version 3 (added voice guidelines):
[Previous prompt plus:]
Brand Voice Guidelines:
- Authoritative but approachable
- Use "we" and "our" to build community
- Share insights from experience
- Avoid jargon unless necessary
Final Results:
Understand the basic flow? Now it’s over to you. Use your Blueprint prompt from the last Part and work on i) breaking it and ii) refining based on what broke!
Remember, the goal isn't perfection - it's reliability. You want a prompt that works consistently across different users and scenarios.
PS: If you give AI Workshops to businesses, build what you’ve learned today into your presentation. It's such a common issue facing teams, I’ve included a framework to tackle this inside the AI Workshop Kit.