There's a quiet problem building inside every team that has started using model-powered tools. They pick one because it scored well on some leaderboard, because a colleague recommended it at a conference, or simply because it was already in the software stack. Then they assume the job is done.
It isn't.
The models you use to draft copy, analyze contracts, or summarize sales calls are not interchangeable. They have different strengths, different blind spots, and different underlying personalities when it comes to judgment calls. A model that writes tight marketing briefs may be far too conservative when asked to evaluate a business risk. One that codes well may write like a manual when asked for something human.
The analogy that maps most accurately here isn't software procurement. It's hiring. And most companies are currently making a hiring decision without running a single interview.
Why Most Scores Tell You Almost Nothing
The standard way to compare models involves putting them through a battery of tests called benchmarks. These are scored automatically, compiled into leaderboards, and cited in product announcements. On the surface, they look authoritative. In practice, they have serious gaps.
Many benchmark question banks are publicly available, and models sometimes incorporate them during training, inflating scores on tests they've essentially already seen. Even when this doesn't happen, the questions are often strange proxies for ability. One popular benchmark includes questions about the cranial capacity of Homo erectus and trivia about a 1979 Cheap Trick live album. Getting those right tells you something about recall, but very little about how a model will perform on your quarterly strategy review.
What benchmarks capture well: math, science, structured reasoning, and coding tasks, the categories where answers are binary and measurable.
What they largely miss: business judgment, writing quality, nuanced advice, empathy, and tone awareness.
The exceptions worth following: ARC-AGI and METR Long Tasks track more durable skills and show real upward trends.
And the gap that matters most: no public benchmark reliably measures which model best fits your specific use case or industry context.
The Vibes Approach, and Why It Has Real Value
People who work with multiple models daily often develop informal tests of their own. Not structured experiments with controls, but quick gut-checks. A researcher might ask every new model to draw a pelican on a bicycle. A developer might throw it a request to build the control panel of a starship. These are not scientific, but they are revealing.
Give multiple models the same creative prompt, say, a paragraph about a person who has been told they only have ten thousand words left in their lifetime and is down to forty-seven, holding a newborn, and the differences become immediately legible. One model keeps a careful, devastating word count. Another ignores it entirely. One writes with restraint. Another reaches for metaphor at the cost of the scene itself. These differences in texture and judgment are real, even if they're hard to quantify.
The vibes-based approach works well for individuals forming a personal opinion. The problem is that it doesn't scale. The same model gives different answers to the same prompt on different runs, so any head-to-head comparison only holds if you run it repeatedly and are honest about what you're measuring.
Think of it as a first interview. Useful for forming impressions, but not a replacement for a structured evaluation.
What a Real-World Test Actually Looks Like
The most rigorous example of practical benchmarking comes from research that examined models against real professional tasks. Rather than asking whether a model could answer trivia, researchers gathered domain experts with over a decade of experience in fields including finance, law, retail, and software development. Those experts created complex, realistic scenarios that would take a human professional between four and seven hours to complete.
Models and human professionals then worked through the same tasks side by side. A separate panel of experts reviewed the results blind, without knowing which responses came from a person and which from a model. The evaluation alone took over an hour per question.
The results were specific and instructive in a way that leaderboards never are. Certain models outperformed human professionals on software development and personal financial advising. Pharmacists, industrial engineers, and real estate agents consistently outperformed the best models. And even within the top tier, performance varied task by task. One model proved more accurate as a sales manager; another came out ahead as a financial advisor.
When presented with the same dubious pitch (a drone guacamole delivery startup), models rated the idea's viability on a 1-to-10 scale, ten trials each. Within each model, the ratings were notably consistent; across models, the spread was striking.
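If you want to run that kind of check yourself, the mechanics fit in a few lines. Here's a minimal sketch in Python; the rate() function, the model names, and the random stub are all placeholders standing in for whatever API you actually use:

```python
import random
import statistics

# Hypothetical stand-in for one model call: should return a single
# 1-to-10 viability rating. Replace with your provider's SDK.
def rate(model: str, pitch: str) -> int:
    return random.randint(1, 10)  # placeholder so the sketch runs

PITCH = "A startup that delivers guacamole by drone. Rate viability 1-10."
MODELS = ["model-a", "model-b", "model-c"]  # placeholder names
TRIALS = 10

for model in MODELS:
    scores = [rate(model, PITCH) for _ in range(TRIALS)]
    # Tight stdev within one model means consistent judgment.
    # A wide gap between model means is the disagreement worth noticing.
    print(f"{model}: mean={statistics.mean(scores):.1f}, "
          f"stdev={statistics.stdev(scores):.2f}")
```

The two numbers to watch are the ones in the comments: the spread within each model, and the distance between their means.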
How to Actually Interview Your Model
For individuals, casual testing is usually sufficient. Build a personal library of prompts that reveal the things you care about. Test a new model against your go-to tasks before committing to it. Notice how it handles ambiguity, how it approaches risk, and whether its writing voice fits the context you're working in.
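That library doesn't need to be elaborate. Here's a minimal sketch of the idea, with a placeholder generate() function standing in for a real model call and prompts that are examples rather than prescriptions:

```python
# Hypothetical stand-in for a model call. Replace with your SDK.
def generate(model: str, prompt: str) -> str:
    return f"[{model}'s answer to: {prompt[:40]}...]"  # placeholder

# Each entry pairs a go-to prompt with a note on what it reveals.
PROMPT_LIBRARY = [
    ("Draft a 120-word launch email for a product of your choosing.",
     "writing voice and restraint"),
    ("Our biggest client is 60 days late on payment. What are our options?",
     "risk calibration and judgment"),
    ("Summarize this deliberately vague brief: 'make the dashboard better'.",
     "how it handles ambiguity"),
]

def interview(model: str) -> None:
    # Run every prompt in the library against a new candidate model.
    for prompt, reveals in PROMPT_LIBRARY:
        print(f"--- probing: {reveals} ---")
        print(generate(model, prompt))
        print()

interview("new-model-candidate")  # placeholder name
```

Swap in prompts from your own work; the only requirement is that you already know what a good answer looks like.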
For organizations deploying a model across teams or customer-facing products, the bar is higher. You need to treat it like a professional evaluation: structured tasks, repeated runs, and blind scoring by people who know the work.
The Bigger Picture
There's a broader trend worth noticing underneath all of this. As models get better, the top tier pulls further ahead of the middle tier on structured tasks, and within that top tier, math, code, and structured reasoning are increasingly settled. The remaining differentiation lives in harder-to-measure territory: judgment, voice, risk calibration, and the texture of advice on genuinely uncertain questions.
That's exactly the territory where a leaderboard score tells you nothing, and where your own testing tells you everything.
"You wouldn't hire a VP based solely on their SAT scores. You shouldn't pick the tool that will advise thousands of decisions based on whether it knows the cranial capacity of Homo erectus."Adapted from Ethan Mollick's research on real-world model performance
The models worth using aren't just the ones that score highest. They're the ones that perform best on your work, in your context, with your judgment criteria. You won't find that answer on a leaderboard.
If you're ready to start, pick one task your team does repeatedly, something consequential enough to matter and routine enough to run twenty times. Test your top two or three model candidates against it. Have a human expert score the outputs. Write down what you find. That's your first real data point on which model actually belongs in your corner.
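If it helps to see that workflow end to end, here's a minimal sketch of the harness. The generate() function, task text, and model names are placeholders for your actual API and work; the point is the blinding, done by writing the scoring sheet and the unblinding key as separate files:

```python
import csv
import random

# Hypothetical stand-in for a model call. Replace with your SDK.
def generate(model: str, task: str) -> str:
    return f"[{model} draft]"  # placeholder so the sketch runs

TASK = "Summarize this sales call transcript and list next steps: ..."
MODELS = ["candidate-a", "candidate-b"]  # your two or three finalists
RUNS = 20  # routine enough to run twenty times

# Collect outputs, then shuffle so the expert scores them blind.
samples = [(m, generate(m, TASK)) for m in MODELS for _ in range(RUNS)]
random.shuffle(samples)

# to_score.csv: what the expert sees (no model names).
# key.csv: maps each row id back to its model for unblinding later.
with open("to_score.csv", "w", newline="") as sheet, \
     open("key.csv", "w", newline="") as key:
    sheet_w, key_w = csv.writer(sheet), csv.writer(key)
    sheet_w.writerow(["id", "output", "score_1_to_10"])
    key_w.writerow(["id", "model"])
    for i, (model, output) in enumerate(samples):
        sheet_w.writerow([i, output, ""])
        key_w.writerow([i, model])
```

Once the expert fills in the score column, join the two files on id and average the scores by model; that average is the number you write down.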