The New A/B Testing Playbook (to increase conversions in the AI era)
Most teams are doing A/B testing wrong.
They’re running experiments, but have no actual process. The tests happen, the results come in, but there’s no learning system underneath it - no rigor, no documentation, no clear understanding of why something worked or didn’t.
So while they might see small wins, those wins don’t scale.
They can’t apply what they’ve learned to other parts of the business. And when a test fails, they don’t learn enough to improve their hit rate for the next time.
When I was leading growth at Wistia and Postscript, I followed a more disciplined approach.
It wasn’t glamorous work, but it turned testing into a strategic asset - something that created compounding value over time. It also built trust in me as the team lead.
Now in 2025, that same rigor has unlocked something new:
The ability to feed years of structured experiment data into AI, and use it to predict what will win next.
This piece breaks down exactly how to do it - and how to use AI to make every A/B test smarter than the last.
Why most A/B tests fail
Most teams test ideas, but they don’t test insights.
There’s a big difference.
When I look back at hundreds of experiments I’ve seen over the years, most had good intentions, but not much structure.
Teams changed buttons, copy, headlines, images, or designs - hoping something would move the needle. Sometimes it did.
But more often, they got stuck at the same conversion rates, or drove small incremental improvements (3-5% here and there) instead of big step changes. That’s because they weren’t diagnosing the problem before running the test.
They didn’t understand why users weren’t converting in the first place.
If you don’t know what you’re trying to learn, you’re not testing - you’re guessing. And guessing doesn’t scale.
Map your growth model and find the leverage
Most teams skip this part. They jump straight into brainstorming the next easy test - the one someone’s been wanting to try, or the area that feels safest to experiment in.
But the real work begins one step higher.
You have to zoom out, look at how the business actually works, and identify where improving a single number could have the biggest overall impact. That’s your leverage point.
Here’s how I’ve done it.
I start by sketching out the full growth model - acquisition, activation, retention, monetization.
Then, I layer in quantitative data to see where the growth rates - or conversion rates - might be underperforming. The numbers show you the size of the opportunity.
Once you’ve spotted those performance gaps, zoom in and use qualitative data to understand why they exist.
Watch session replays. Read churn survey responses. Talk to sales, support, and customers. The goal is to uncover the friction that’s holding users back from converting.
Quantitative data tells you where the leverage is.
Qualitative data tells you what’s blocking it.
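To make the quantitative pass concrete, here’s a minimal sketch of the kind of funnel math I mean - the stage names and counts are made-up placeholders, and the lowest stage-to-stage rate is just one simple way to flag candidate leverage points, not a verdict on its own:

```python
# Minimal sketch: find the biggest drop-off in a simple growth funnel.
# Stage names and counts are hypothetical - swap in your own analytics data.
funnel = [
    ("Visited site", 100_000),
    ("Signed up", 8_000),
    ("Activated", 3_200),
    ("Paid", 480),
]

worst_step, worst_rate = None, 1.0
for (stage, users), (next_stage, next_users) in zip(funnel, funnel[1:]):
    rate = next_users / users
    print(f"{stage} -> {next_stage}: {rate:.1%}")
    if rate < worst_rate:
        worst_step, worst_rate = f"{stage} -> {next_stage}", rate

print(f"\nBiggest drop-off (candidate leverage point): {worst_step} at {worst_rate:.1%}")
```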
Over time, I’ve learned this step separates teams that get lucky from teams that drive real growth.
Because when you start with leverage instead of ease, you stop running random tests - and start building a testing program that compounds.
Brainstorm differently: 10% vs 10× thinking
Most teams run brainstorming sessions like casual group chats.
Someone throws out an idea, others nod, and before you know it, you’re testing the most exciting opinion in the room.
We flipped that on its head. We made it structured. Every brainstorm had five minutes of context, fifteen minutes of quiet ideation, and ten minutes of sharing, combining, and building on ideas as a group.
Then we added a single prompt: Which of these ideas might improve conversion by 10%, and which could improve it by 10x?
That tension changed everything. It pushed the team beyond surface-level tweaks into deeper, more creative problem-solving.
This is where creativity meets structure.
Prioritize using a rubric (ICE or RICE)
Choosing which test to run next shouldn’t depend on who talks loudest.
It should depend on the evidence.
Score every idea using the ICE framework: Impact, Confidence, and Ease. (Some teams use RICE, adding Reach - but the principle is the same.)
This removes ego from the process.
Instead of debating which idea “felt” most promising, you’ll have a clear, shared system to rank them. High-impact ideas that are reasonably easy to execute rise to the top. Harder, riskier bets go into the queue for later.
A good testing roadmap balances quick wins that build momentum with bold swings that create breakthroughs.
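To take even more of the subjectivity out, you can compute the scores in a spreadsheet or a few lines of code. Here’s a minimal ICE sketch in Python - the ideas and 1-10 scores are made-up placeholders, and I multiply the three scores (some teams average them instead):

```python
# Minimal ICE scoring sketch: rank ideas by Impact * Confidence * Ease.
# Ideas and 1-10 scores are hypothetical placeholders.
ideas = [
    {"name": "Rewrite pricing page headline", "impact": 6, "confidence": 7, "ease": 9},
    {"name": "Add interactive product demo", "impact": 9, "confidence": 5, "ease": 3},
    {"name": "Shorten signup form", "impact": 7, "confidence": 8, "ease": 8},
]

for idea in ideas:
    idea["ice"] = idea["impact"] * idea["confidence"] * idea["ease"]

# Highest score first: your next test candidates.
for idea in sorted(ideas, key=lambda i: i["ice"], reverse=True):
    print(f'{idea["ice"]:>4}  {idea["name"]}')
```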
Design each experiment (with rigor)
This is where most teams lose the plot. They skip the brief.
They think writing an experiment brief is something only big companies (or big teams) do.
Or they assume it’s redundant, since they already “know” the details. Sometimes, no one’s ever asked them to document it, so it feels like extra work.
But this single step is where so much of the leverage lives.
By skipping it, you’re not just saving time - you’re cutting out most of the learnings you could have captured.
That’s why every test should have a one-page doc before launch.
It doesn’t need to be fancy, or overly complex. Include a clear hypothesis, one primary metric for success, screenshots of control and variation, sample size, duration, what you’re hoping to learn, and your next steps (based on win or lose). The most important piece is your kill criteria: decide upfront when you’ll stop or revert, so the test doesn’t drift into ambiguity.
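For the sample size and duration lines in that brief, a back-of-the-envelope calculation is enough. Here’s a sketch using the standard two-proportion sample size formula - the baseline rate, minimum lift, and traffic numbers are placeholders, not recommendations:

```python
# Back-of-the-envelope sample size for a two-variant conversion test.
# Baseline rate, expected lift, and daily traffic are hypothetical placeholders.
from statistics import NormalDist

baseline = 0.04                 # current conversion rate (4%)
lift = 0.20                     # smallest relative lift you care about (20%)
alpha, power = 0.05, 0.80       # significance level and statistical power
daily_visitors_per_arm = 1_500

p1, p2 = baseline, baseline * (1 + lift)
z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
z_beta = NormalDist().inv_cdf(power)

# Standard two-proportion formula: visitors needed in each arm.
n_per_arm = ((z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2) ** 2
days = n_per_arm / daily_visitors_per_arm

print(f"~{n_per_arm:,.0f} visitors per arm, roughly {days:.0f} days at current traffic")
```

If the duration comes out at months instead of weeks, that’s a signal to test a bigger change or a higher-traffic page - which feeds right back into the 10% vs 10x discussion above.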
And later in this playbook, you’ll see why this kind of documentation becomes even more powerful - because it’s the foundation that makes AI-driven testing possible.
If you want my entire plug-and-play system for this, it’s included inside my Growth Operating System program.
It includes the templates, workflows, and experiment trackers I’ve used to help dozens of growth teams build structure, speed, and confidence in their process.
Log wins and lessons to turn experiments into insights
The total value of testing isn’t just the immediate results.
It’s the compounding learnings over time.
Most teams ship a test, check the numbers, and move on. The problem is, they’re missing the patterns.
Without a record of past experiments, it’s impossible to see which ideas work repeatedly, which ones keep failing, and where new opportunities might exist.
Sometimes you’ll notice something simple - like a test that performed well in one channel that’s worth trying in another. Other times, you’ll realize you’ve been testing the same idea in circles without moving on to something better.
That’s why you need a running log.
It can live anywhere: a spreadsheet, Notion, Airtable, whatever you use. For each test, record the ID, what you changed, the result, and what you learned.
Then, every month, review your tracker to spot trends. If your system is working, you’ll start to see your hit rate improve - and that’s how you know your process is compounding.
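A plain CSV is enough to start. Here’s a minimal sketch of that monthly review - the column names are my own example schema, not a required format:

```python
# Minimal experiment-tracker sketch: read a log and compute hit rate by month.
# Assumed columns (hypothetical): id, launched, change, result, learning
# where result is "win", "loss", or "flat".
import csv
from collections import defaultdict

by_month = defaultdict(lambda: {"wins": 0, "total": 0})

with open("tracker.csv", newline="") as f:
    for row in csv.DictReader(f):
        month = row["launched"][:7]  # e.g. "2025-03" from an ISO date
        by_month[month]["total"] += 1
        by_month[month]["wins"] += row["result"].strip().lower() == "win"

for month in sorted(by_month):
    wins, total = by_month[month]["wins"], by_month[month]["total"]
    print(f"{month}: {wins}/{total} wins ({wins / total:.0%} hit rate)")
```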
And here’s where this becomes even more powerful.
Over time, this tracker turns into training data for AI. You can feed it into your LLM to help predict which future experiments are most likely to win, based on patterns from the past.
That’s why this step isn’t optional - it’s the foundation of the new AI-powered testing playbook.
The teams that document their learnings today will move faster tomorrow.
Use AI to shorten the learning loop
Some people are using AI to replace experimentation.
That’s a huge mistake.
They’ll open ChatGPT and ask, “What should I test to improve my conversions?” Then they get a list of generic ideas pulled from across the internet - ideas that come from businesses with totally different models, markets, audiences, and goals.
So even if AI gives you an answer, it’s really just guessing. Its hit rate won’t be any better than yours.
The real power of AI isn’t to replace your experimentation.
It’s to make it faster.
When you train your AI on your historical data - your tests, your audience, your business model - you give it context.
You’re not asking it what works “in general.” You’re asking it to help you spot the patterns that exist inside your own system.
The patterns that already drive your wins. That’s how you turn AI into a multiplier instead of a shortcut.
Here’s how I do it…
When I plan a new batch of tests, I pull up our wins/losses tracker and feed it into ChatGPT. I use a prompt like:
“Here are our last 80 experiments: what we hoped to accomplish, what we wanted to learn, and what happened. Given this history, which of these five new ideas is most likely to win, and why?”
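If you’d rather do this programmatically than paste into the chat window, here’s a minimal sketch using the OpenAI Python client - the model name, file names, and prompt wording are assumptions, not a prescribed setup:

```python
# Minimal sketch: feed the experiment tracker to an LLM and ask it to rank new ideas.
# Model name, file paths, and prompt wording are assumptions - adapt to your own stack.
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

tracker = Path("tracker.csv").read_text()        # your logged experiments
new_ideas = Path("new_ideas.txt").read_text()    # the new ideas you're weighing

prompt = (
    "Here are our past experiments (CSV): what we hoped to accomplish, "
    "what we wanted to learn, and what happened.\n\n"
    f"{tracker}\n\n"
    "Given this history, which of these new ideas is most likely to win, and why?\n\n"
    f"{new_ideas}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder - use whichever model you have access to
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```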
Growth teams are usually right about one out of every four tests - roughly a 25% hit rate when running experiments.
In my experience, AI predictions are right about 65% of the time - and they surface those insights in seconds, not hours or days.
That means fewer wasted cycles, faster iteration, and a team focused on what really moves the needle.
AI is only as smart as what you feed it.
That’s why rigor matters. Every experiment brief, every documented metric, every logged learning - it’s all data that AI can use to accelerate your next breakthrough.
Most teams run experiments, but they don’t really have a testing system
They launch tests, watch metrics, and move on.
But without structure - without clear hypotheses, documentation, and reflection - they’re just collecting outcomes, not insights.
And that means they can’t leverage what they’ve learned to grow faster the next time.
Structure is what turns experimentation into an advantage.
It’s what makes your wins repeatable, your learnings transferable, and your process scalable.
And now, it’s also what unlocks the next playbook.
Because in this new era, the teams that build disciplined systems today will be the ones who can train AI on their real context tomorrow.
That’s where the compounding starts - where every test makes the next one smarter, faster, and more likely to win.
P.S. - if you want to build this kind of rigor into your own growth process, my Growth Operating System will help you do it.
It’s a plug-and-play framework for running faster, smarter experiments - with all the structure you need to turn testing into a real advantage.