Joel Wilson
Quality Engineering Leader, Ramsey Solutions | Writing about testing and AI on Medium
About
Joel Wilson is a Quality Engineering leader at Ramsey Solutions, where he leads a team of seven Software Engineers in Test embedded across multiple product squads. For the past 18 months, he's been deep in the world of LLM evaluation, building frameworks and defining best practices for AI products, including benchmarks, LLM-as-judge tooling, red teaming, and drift monitoring. A CS grad with over a decade in software quality, his obsession is keeping QE fundamentals at the center of modern AI testing and making these concepts accessible to engineers who didn't sign up to become AI researchers.
AI Testing Isn't One Thing (And Treating It Like It Is Will Bite You)
Description
Your team shipped an AI feature. Congrats. Now someone asks: how do we test this?
You write a test. The output changes. You run it again. Different output. You consider a career in farming.
Here's the thing nobody tells you upfront: testing AI-powered software isn't one discipline, it's two. And the moment you try to apply one strategy to both, you're in trouble.
This talk breaks down the Two-Track Testing Model that every QA engineer building on AI needs to understand. There's the deterministic side: your traditional test pyramid covering infrastructure, routing, logic, and guardrails. And there's the AI evaluation side, where outputs are non-deterministic, pass/fail doesn't exist, and you need a completely different mental model to even know what "quality" means.
We'll walk through how these two tracks diverge, when they converge, and what it takes to get quality signals from both. You'll leave with a practical framework: the Three Pillars of AI Evaluation (human eval, deterministic checks, and LLM-as-judge), a benchmark-first approach to designing your eval strategy, and a clear picture of how your product's maturity stage changes what you should be testing and how.
The fundamentals of our craft haven't changed. The pesticide paradox still applies. Risk-based thinking still applies. You still can't test everything. But the tools, vocabulary, and decision-making are genuinely new, and it's worth getting oriented before you're neck-deep in a chatbot that nobody can evaluate with any confidence.
This is the talk I wish existed when I started.