For god's sake, "vibe checks" is not an eval strategy
A 4-Level Framework for Moving from Random Checks to Systematic AI Evaluation
I hate to be the last person to join the choir, but here goes nothing: Evals, evals, evals!
You’re building a production-grade AI-powered or AI-native product. AI sits at the heart of your product (or at least powers an important feature).
You know that AI is a fickle bitch. Non-deterministic; inherently unpredictable.
So how are you systematically checking that the output your product spits out is good?
Smart product/AI leaders have been shouting for months about how AI product managers need to get good at building evals (Lenny Rachitsky, Aakash Gupta, Teresa Torres, Hamel Husain & Shreya Shankar, to name a few).
So I’m shocked to see how many AI product builders have not thought this through, leaving it at a random vibe check every now and then.
Not cool, guys.
Below is an overview of AI evaluation methods by maturity level. I apologise for the lack of “Else-special sauce”, and I’m certainly not the eval expert, but I sense that there are too many AI builders building blind out there.
Choose the right one (or a combination) based on your product goals and maturity level.
Level 0: Ad-Hoc Testing (“Vibe Checking”)
What it is: Unstructured, exploratory testing. You type in a few prompts, see what happens, and decide whether it’s right or wrong.
Pros: Fast, free, and excellent for catching obvious bugs and glaring tonal issues. You learn about your product’s common failure modes.
Cons: Not systematic, highly subject to the developer’s bias, and carries zero statistical significance. It doesn’t prove the product works, only that it’s not completely broken.
Level 1: Manual Review & Triage
This is the first systematic step. You collect and review real user interactions.
What you collect: Full Traces. Your logging system should capture the full engineering context of every interaction. This “trace” is a detailed log that includes the user’s input, the final output, the prompt template used, any data retrieved from a database (RAG), any function calls, latency, and token counts.
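A trace like this can be sketched as a plain record. The field names below are my illustrative assumptions, not a standard schema; most observability tools define their own:

```python
from dataclasses import dataclass, field

# Illustrative trace record -- field names are assumptions, not a standard schema.
@dataclass
class Trace:
    user_input: str
    model_output: str
    prompt_template: str
    retrieved_context: list[str] = field(default_factory=list)  # RAG chunks
    function_calls: list[str] = field(default_factory=list)     # tool invocations
    latency_ms: float = 0.0
    prompt_tokens: int = 0
    completion_tokens: int = 0

# One logged interaction (made-up values for illustration).
trace = Trace(
    user_input="How do I reset my router?",
    model_output="Hold the reset button for 10 seconds...",
    prompt_template="support_v2",
    latency_ms=1840.0,
    prompt_tokens=312,
    completion_tokens=85,
)
```

The point is that every field you'll later want to filter or grade on gets captured at log time, not reconstructed afterwards.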
What you review: Chat Transcripts. A reviewer (a product manager, a domain expert, or you) looks at a clean, simple list of just the User Input and Model Output. This is the “transcript.” They can use an annotation tool (I vibe-coded my own) to classify the transcript (good, bad), select any failure modes, set a priority level, write comments, etc.: anything you will need later to understand whether an issue is common and important.
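A minimal version of that annotation record, plus a tally that surfaces the most common failure modes, might look like this (the labels and field names are illustrative, not a prescribed taxonomy):

```python
from collections import Counter

# Illustrative reviewer annotations for three transcripts.
annotations = [
    {"label": "bad", "failure_modes": ["hallucination"], "priority": "high",
     "comment": "invented a feature we don't have"},
    {"label": "good", "failure_modes": [], "priority": "low", "comment": ""},
    {"label": "bad", "failure_modes": ["hallucination", "wrong tone"],
     "priority": "medium", "comment": ""},
]

# Tally failure modes across reviews to see which issues are common.
mode_counts = Counter(m for a in annotations for m in a["failure_modes"])
print(mode_counts.most_common())  # hallucination shows up twice, wrong tone once
```

Even a spreadsheet works here; the win is that "is this issue common?" becomes a query instead of a hunch.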
Level 2: Offline Automated Evals (Golden Sets)
You create a static, high-quality test set of prompts (a “Golden Set”) where you know what a perfect answer should look like.
What it is: A test set of 20-200 “must-work” prompts. Before deploying a new model version, you run it against this entire set and compare the new answers to your “ground truth” answers.
Pros: Automatically catches regressions (i.e., making sure a new feature didn’t break an old one).
Cons: It only tests what you know to test. It can’t tell you how the model will handle new, unexpected user questions.
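The Level 2 loop is simple enough to sketch end to end. Exact-match grading and the stubbed model below are assumptions for illustration; real golden sets often need fuzzier comparisons:

```python
# Illustrative golden set: prompt -> expected ("ground truth") answer.
golden_set = {
    "What is your refund window?": "30 days",
    "Do you ship internationally?": "Yes, to 40+ countries",
}

def fake_model(prompt: str) -> str:
    # Stand-in for a real model call (assumption for this sketch).
    answers = {
        "What is your refund window?": "30 days",
        "Do you ship internationally?": "No",  # a regression!
    }
    return answers[prompt]

def run_golden_set(model, golden):
    # Exact match is the simplest grading rule; swap in your own comparison.
    failures = [p for p, expected in golden.items() if model(p) != expected]
    return {"passed": len(golden) - len(failures), "failed": failures}

report = run_golden_set(fake_model, golden_set)
print(report)  # the shipping prompt is flagged as a regression
```

Run this before every deploy and a new prompt or model version can't silently break an answer you've already verified.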
Level 3: Online Automated Evals (Live Monitoring)
This is the “highest level,” where you automatically check and score live user interactions as they happen.
What it is: A system that automatically assigns a score or a pass/fail grade to your live traces.
Common Types:
Rule-Based: Hard-coded checks. Is the latency > 3 seconds? (FAIL). Did it output valid JSON? (PASS). Does it contain “As an AI model...”? (FAIL).
Metric-Based: Using other models to generate a score. This includes checking for toxicity, semantic similarity (is the answer close to a known-good answer?), or relevance (did the RAG context get used?).
LLM-as-a-Judge: Using a powerful model (like GPT-4) to “grade” your model’s output. You ask it, “On a scale of 1-5, how helpful was this response?” This is great for measuring subjective qualities like tone and helpfulness.
(P.S. For my tiny production-grade MVP Vera TechAssist (www.veratechassist.com), I went with Levels 0 and 1, plus a sprinkle of Level 2. Follow-up article coming out soon!)
Work with me? I work with SaaS startups and scale-ups as an advisor or as an interim product lead. Connect with me on LinkedIn if you’d like to chat.




