Let’s talk about source data.
It’ll be fun - I promise! OK… fine, it’s a bit dry. But it’s super important! Sorry!
And when it goes wrong it’s pretty funny.
I worked with someone who was horrified by what their brand new AI tool was producing. They'd spent weeks building a system to write social media posts in their company's voice, but something was... off.
Every single post ended with "Thank you for reaching out! How would you rate my response today?" or some variant.
Bizarre, right? Totally inappropriate for social media.
After digging into their training data, we discovered the culprit. Instead of feeding the AI with their best marketing content, they'd used thousands of customer service chat logs.
The AI was faithfully replicating the tone and structure of those support conversations - complete with that signature service agent sign-off.
It was doing its job perfectly well! But it had been given the wrong information to work from. Not its fault, really!
This is a perfect example of the cardinal rule of AI training: Garbage In, Garbage Out (GIGO). Your AI assistant can only be as good as the content you feed it.
Let's get started:
In the world of AI, there's a simple but powerful principle: what you put in determines what you get out.
Feed your tool outdated, off-brand, or irrelevant content, and you'll get outdated, off-brand, or irrelevant outputs. It's that simple.
Most of us get this. But that doesn’t necessarily mean we know how to act on it. This Part of the Playbook gives you a specific action plan and a checklist for converting good intentions into genuinely solid data.
This is especially important for brand voice. When you're building an AI system to replicate a tone or a voice, the examples you provide are everything.
The AI has zero inherent understanding of what makes your voice unique - it can only analyse and replicate patterns from the content you provide. Our source data is everything.
Before you start gathering content, you need to answer a crucial question: What exactly will you use this AI voice model for?
Different use cases require different types of content:
This alignment is critical.
Yes, you can make a general-purpose assistant that handles every type of output.
But guess what? It’s not going to be as good as focused, individual tools.
If you must combine everything into one tool, you’ll need to label your inputs explicitly (i.e. make it clear what is a transcript, what is a blog article, and what is from an interview) and then adjust your output prompts to draw on specific sources.
It’s doable, but it adds complexity. For now I’d recommend creating focused, single-purpose tools - one for social media, one for newsletters, one for email responses, one for customer service, and so on.
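If you do go the combined route, the labelling idea could look something like this. It's a minimal sketch, not a prescribed format: the `SourceItem` class, the `[SOURCE TYPE: ...]` header, and the sample texts are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class SourceItem:
    label: str  # e.g. "transcript", "blog article", "interview"
    text: str


def build_context(items, use_labels):
    """Join only the sources whose label was requested, each under an explicit header."""
    wanted = [item for item in items if item.label in use_labels]
    blocks = [f"[SOURCE TYPE: {item.label}]\n{item.text.strip()}" for item in wanted]
    return "\n\n".join(blocks)


# Hypothetical sample content, one item per source type.
items = [
    SourceItem("transcript", "So, uh, today we're talking pricing."),
    SourceItem("blog article", "Pricing is a promise you make to customers."),
    SourceItem("interview", "I started the company in my kitchen."),
]

# The output prompt names the source type it wants the model to imitate.
prompt = (
    "Write a LinkedIn post. Match the voice of the blog article sources only.\n\n"
    + build_context(items, {"blog article"})
)
print(prompt)
```

The point is simply that every input carries its label, and the prompt says which labels to lean on. How you word the header is up to you.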
The next consideration is whether you’re capturing a personal or a company tonality. This question comes after the usage question: define the usage first, then the type of tonality.
The source collection process differs significantly depending on whether you're capturing your personal voice or a company's brand voice.
For your personal voice:
For a company voice:
OK those are the two main factors in play - purpose and tonality.
To help you create a tailored collection checklist, I've created this prompt that you can use with ChatGPT or Claude:
You are an AI voice training expert helping me collect content for an AI brand voice project. Based on my specific needs, create a detailed content collection checklist.
Ask me questions to determine the following:
- Purpose: What will the AI voice be used for? E.g., "Writing social media posts" or "Creating blog articles"
- Voice type: Personal or company voice
Then generate a list of potential content sources that'll be used as examples to capture brand voice.
For each content type in your checklist, please include:
1. Description of what to look for
2. Why this content type is valuable
3. Minimum recommended quantity
4. Specific elements to pay attention to
5. Red flags or content to avoid
This prompt will generate a customised collection checklist tailored to your specific needs.
The obvious next question is how to extract each type of data. Honestly, it depends a lot on what the content is! Let me quickly run through the main options.
Your company website is often the most polished representation of your brand voice. Here's how to capture it effectively:
Blog articles often contain the richest examples of your brand voice in action. They're typically longer-form content that addresses topics in depth.
Manually copying and pasting generally isn’t viable here. A scrape works, but there are some additional methods too.
Collection methods:
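To illustrate the scrape option, here's a rough sketch that pulls the readable text out of a blog page's HTML using only Python's standard library. It assumes reasonably well-formed HTML where the article text sits in paragraph, heading, and list tags. The tag lists and the sample page are my own assumptions, and a real site will probably need tweaking.

```python
from html.parser import HTMLParser


class ArticleText(HTMLParser):
    """Collect text from paragraph, heading, and list tags; skip scripts, styles, nav, footer."""

    KEEP = {"p", "h1", "h2", "h3", "li"}
    SKIP = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._keep_depth = 0
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.KEEP:
            self._keep_depth += 1
            self.chunks.append("")  # start a new block of text

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.KEEP and self._keep_depth:
            self._keep_depth -= 1

    def handle_data(self, data):
        # Only keep text that is inside a wanted tag and outside a skipped one.
        if self._keep_depth and not self._skip_depth:
            self.chunks[-1] += data


def extract_text(html):
    parser = ArticleText()
    parser.feed(html)
    parser.close()
    # Collapse whitespace inside each block and drop empty blocks.
    return "\n\n".join(" ".join(c.split()) for c in parser.chunks if c.strip())


# Hypothetical page, standing in for a downloaded blog post.
sample = """
<html><head><style>p {color: red}</style></head>
<body><nav><li>Home</li></nav>
<h1>Why pricing matters</h1>
<p>Pricing is a <b>promise</b> you make.</p>
<script>track()</script></body></html>
"""
text = extract_text(sample)
print(text)
```

In practice you'd download each post first (for example with `urllib.request.urlopen`) and pass the HTML to `extract_text`. Check the site's terms before scraping anything that isn't yours.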
Spoken content can provide excellent examples of natural voice patterns, especially for conversational tones. It’s super helpful for personal tone of voice, especially because a transcribed podcast gives you thousands of words. Here are your options for transcription:
Or use OpenAI’s Whisper model via the API.
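If you go the API route, a minimal sketch might look like the following. It assumes the official `openai` Python package is installed, that your key is in the `OPENAI_API_KEY` environment variable, and that `whisper-1` is the model name available on your account. The `transcript_path` naming convention is just something I made up for the example.

```python
from pathlib import Path


def transcript_path(audio_path):
    """Hypothetical convention: save the transcript next to the audio as a .txt file."""
    return Path(audio_path).with_suffix(".txt")


def transcribe(audio_path, model="whisper-1"):
    """Send one audio file to OpenAI's transcription endpoint and return the text."""
    # Imported here so the rest of the script loads without the package installed.
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment
    with open(audio_path, "rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text


def transcribe_to_file(audio_path):
    """Transcribe one file and write the text alongside it."""
    out = transcript_path(audio_path)
    out.write_text(transcribe(audio_path), encoding="utf-8")
    return out
```

Calling `transcribe_to_file("ep01.mp3")` would write `ep01.txt` in the same folder. The API has an upload size limit, so long episodes may need splitting into smaller files first.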
When processing transcripts, clean up filler words and false starts unless these are part of the voice you want to capture. AI can do this for you - no need to do it manually.
Social posts often showcase your most conversational, engaging voice. Collection approaches vary entirely by platform:
Twitter/X:
LinkedIn:
Instagram:
Facebook:
All of these work with text. What about video posts?
You can use Apify or similar tools to mass scrape posts.
This is how I personally do it: Apify to scrape videos and their subtitles, hand the transcript of each video post to ChatGPT to clean up, then send it into an Airtable. It’s very cost effective, and you can basically strip-mine a company’s posts (or your own!) into a table in minutes.
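The middle of that pipeline, turning scraped subtitles into a row you can store, can be sketched like this. I'm assuming the subtitles arrive as SRT text, which may not match what your scraper returns, and the row's field names are invented for the example. The Apify, ChatGPT, and Airtable calls themselves are left out.

```python
import re

# Matches SRT timing lines such as "00:00:02,500 --> 00:00:05,000".
TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}[,.]\d{3} --> ")


def srt_to_text(srt):
    """Drop cue numbers and timestamps, keep the spoken lines as one paragraph."""
    lines = []
    for raw in srt.splitlines():
        line = raw.strip()
        # Note: a spoken line made only of digits would be dropped too.
        if not line or line.isdigit() or TIMESTAMP.match(line):
            continue
        lines.append(line)
    return " ".join(lines)


def to_row(post_url, srt):
    """Shape one post into a flat record; the field names are hypothetical."""
    text = srt_to_text(srt)
    return {"url": post_url, "transcript": text, "word_count": len(text.split())}


# Hypothetical subtitle file for one video post.
sample = """1
00:00:00,000 --> 00:00:02,500
Here's the thing about pricing.

2
00:00:02,500 --> 00:00:05,000
Most people get it backwards.
"""
row = to_row("https://example.com/post/1", sample)
print(row)
```

Each `row` is then ready for the clean-up pass and for whatever table you're storing things in.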
With all this content collected, you need a system to organise it effectively:
This organised approach will make the next step - voice extraction - much more effective.
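One lightweight way to keep things organised is a single index file with one row per piece of content. Here's a sketch under my own assumptions: the column names, the ID scheme, and the folder layout are all hypothetical, so adapt them to whatever structure you actually use.

```python
import csv
import io

# Hypothetical columns for the content index.
FIELDS = ["id", "source_type", "title", "date", "word_count", "file"]


def build_index(pieces):
    """Return CSV text with one row per collected piece of content."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for number, piece in enumerate(pieces, start=1):
        writer.writerow({
            "id": f"{number:03d}",
            "source_type": piece["source_type"],
            "title": piece["title"],
            "date": piece["date"],
            "word_count": len(piece["text"].split()),
            # Assumed layout: one folder per source type, one text file per piece.
            "file": f"{piece['source_type']}/{number:03d}.txt",
        })
    return buffer.getvalue()


# Made-up sample content.
pieces = [
    {"source_type": "blog", "title": "Why pricing matters", "date": "2024-03-01",
     "text": "Pricing is a promise you make."},
    {"source_type": "podcast", "title": "Episode 12", "date": "2024-04-15",
     "text": "So today we're talking about pricing."},
]
index_csv = build_index(pieces)
print(index_csv)
```

Having the source type and word count in one place makes it easy to see at a glance which content types you're short on.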
If you are just doing this for your own voice tool (rather than a client) you can probably get away with just dumping everything into a Google Drive. We don’t need the same level of precision and sorting because it’s all our voice. We can play more fast and loose.
How much content is enough? Here are my recommendations:
These are rules of thumb. Got more? Fantastic. As long as the quality is solid, more is generally better!
In Part 3, we'll take all this organised content and extract the DNA of your brand voice. We'll create powerful prompts that capture the essence of your voice and allow any AI to replicate it consistently.