There's a quiet problem building inside every team that has started using model-powered tools. They pick one because it scored well on some leaderboard, because a colleague recommended it at a conference, or simply because it was already in the software stack. Then they assume the job is done.
It isn't.
The models you use to draft copy, analyze contracts, or summarize sales calls are not interchangeable. They have different strengths, different blind spots, and different underlying personalities when it comes to judgment calls. A model that writes tight marketing briefs may be far too conservative when asked to evaluate a business risk. One that codes well may write like a manual when asked for something human.
The analogy that maps most accurately here isn't software procurement. It's hiring. And most companies are currently making a hiring decision without running a single interview.
Why Most Scores Tell You Almost Nothing
The standard way to compare models involves putting them through a battery of tests called benchmarks. These are scored automatically, compiled into leaderboards, and cited in product announcements. On the surface, they look authoritative. In practice, they have serious gaps.
Many benchmark question banks are publicly available, and models sometimes incorporate them during training, inflating scores on tests they've essentially already seen. Even when this doesn't happen, the questions are often strange proxies for ability. One popular benchmark includes questions about the cranial capacity of Homo erectus and trivia about a 1979 Cheap Trick live album. Getting those right tells you something about recall, but very little about how a model will perform on your quarterly strategy review.
What benchmarks capture well: math, science, structured reasoning, and coding tasks, the categories where answers are binary and measurable.
What they largely miss: business judgment, writing quality, nuanced advice, empathy, and tone awareness.
The exceptions worth following: ARC-AGI and METR Long Tasks track more durable skills and show real upward trends.
And the gap that matters most: no public benchmark reliably measures which model best fits your specific use case or industry context.
The Vibes Approach, and Why It Has Real Value
People who work with multiple models daily often develop informal tests of their own. Not structured experiments with controls, but quick gut-checks. A researcher might ask every new model to draw a pelican on a bicycle. A developer might throw it a request to build the control panel of a starship. These are not scientific, but they are revealing.
Give multiple models the same creative prompt, say, a paragraph about a person who has been told they only have ten thousand words left in their lifetime and is down to forty-seven, holding a newborn, and the differences become immediately legible. One model keeps a careful, devastating word count. Another ignores it entirely. One writes with restraint. Another reaches for metaphor at the cost of the scene itself. These differences in texture and judgment are real, even if they're hard to quantify.
The vibes-based approach works well for individuals forming a personal opinion. The problem is that it doesn't scale. The same model gives different answers to the same prompt on different runs, so any head-to-head comparison only holds if you run it repeatedly and are honest about what you're measuring.
Think of it as a first interview. Useful for forming impressions, but not a replacement for a structured evaluation.
What a Real-World Test Actually Looks Like
The most rigorous example of practical benchmarking comes from research that examined models against real professional tasks. Rather than asking whether a model could answer trivia, researchers gathered domain experts with over a decade of experience in fields including finance, law, retail, and software development. Those experts created complex, realistic scenarios that would take a human professional between four and seven hours to complete.
Models and human professionals then worked through the same tasks side by side. A separate panel of experts reviewed the results blind, without knowing which responses came from a person and which from a model. The evaluation alone took over an hour per question.
The results were specific and instructive in a way that leaderboards never are. Certain models outperformed human professionals on software development and personal financial advising. Pharmacists, industrial engineers, and real estate agents consistently outperformed the best models. And even within the top tier, performance varied task by task. One model proved more accurate as a sales manager; another came out ahead as a financial advisor.
When presented with the same dubious pitch (a drone guacamole delivery startup), models rated the idea's viability on a 1-to-10 scale, ten trials each. Within each model, the ratings were notably consistent; across models, the spread was striking.
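If you want to run that kind of check yourself, the mechanics fit in a few lines. Here's a minimal sketch in Python; the rate() function, the model names, and the random stub are all placeholders standing in for whatever API you actually use:

```python
import random
import statistics

# Hypothetical stand-in for one model call: should return a single
# 1-to-10 viability rating. Replace with your provider's SDK.
def rate(model: str, pitch: str) -> int:
    return random.randint(1, 10)  # placeholder so the sketch runs

PITCH = "A startup that delivers guacamole by drone. Rate viability 1-10."
MODELS = ["model-a", "model-b", "model-c"]  # placeholder names
TRIALS = 10

for model in MODELS:
    scores = [rate(model, PITCH) for _ in range(TRIALS)]
    # Tight stdev within one model means consistent judgment.
    # A wide gap between model means is the disagreement worth noticing.
    print(f"{model}: mean={statistics.mean(scores):.1f}, "
          f"stdev={statistics.stdev(scores):.2f}")
```

The two numbers to watch are the ones in the comments: the spread within each model, and the distance between their means.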
How to Actually Interview Your Model
For individuals, casual testing is usually sufficient. Build a personal library of prompts that reveal the things you care about. Test a new model against your go-to tasks before committing to it. Notice how it handles ambiguity, how it approaches risk, and whether its writing voice fits the context you're working in.
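That library doesn't need to be elaborate. Here's a minimal sketch of the idea, with a placeholder generate() function standing in for a real model call and prompts that are examples rather than prescriptions:

```python
# Hypothetical stand-in for a model call. Replace with your SDK.
def generate(model: str, prompt: str) -> str:
    return f"[{model}'s answer to: {prompt[:40]}...]"  # placeholder

# Each entry pairs a go-to prompt with a note on what it reveals.
PROMPT_LIBRARY = [
    ("Draft a 120-word launch email for a product of your choosing.",
     "writing voice and restraint"),
    ("Our biggest client is 60 days late on payment. What are our options?",
     "risk calibration and judgment"),
    ("Summarize this deliberately vague brief: 'make the dashboard better'.",
     "how it handles ambiguity"),
]

def interview(model: str) -> None:
    # Run every prompt in the library against a new candidate model.
    for prompt, reveals in PROMPT_LIBRARY:
        print(f"--- probing: {reveals} ---")
        print(generate(model, prompt))
        print()

interview("new-model-candidate")  # placeholder name
```

Swap in prompts from your own work; the only requirement is that you already know what a good answer looks like.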
For organizations deploying a model across teams or customer-facing products, the bar is higher. You need to treat it like a professional evaluation: structured tasks, repeated runs, and blind scoring by people who know the work.
The Bigger Picture
There's a broader trend worth noticing underneath all of this. As models get better, the top tier pulls further ahead of the middle tier on structured tasks, and within that top tier, math, code, and structured reasoning are increasingly settled. The remaining differentiation lives in harder-to-measure territory: judgment, voice, risk calibration, and the texture of advice on genuinely uncertain questions.
That's exactly the territory where a leaderboard score tells you nothing, and where your own testing tells you everything.
"You wouldn't hire a VP based solely on their SAT scores. You shouldn't pick the tool that will advise thousands of decisions based on whether it knows the cranial capacity of Homo erectus."Adapted from Ethan Mollick's research on real-world model performance
The models worth using aren't just the ones that score highest. They're the ones that perform best on your work, in your context, with your judgment criteria. You won't find that answer on a leaderboard.
If you're ready to start, pick one task your team does repeatedly, something consequential enough to matter and routine enough to run twenty times. Test your top two or three model candidates against it. Have a human expert score the outputs. Write down what you find. That's your first real data point on which model actually belongs in your corner.
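If it helps to see that workflow end to end, here's a minimal sketch of the harness. The generate() function, task text, and model names are placeholders for your actual API and work; the point is the blinding, done by writing the scoring sheet and the unblinding key as separate files:

```python
import csv
import random

# Hypothetical stand-in for a model call. Replace with your SDK.
def generate(model: str, task: str) -> str:
    return f"[{model} draft]"  # placeholder so the sketch runs

TASK = "Summarize this sales call transcript and list next steps: ..."
MODELS = ["candidate-a", "candidate-b"]  # your two or three finalists
RUNS = 20  # routine enough to run twenty times

# Collect outputs, then shuffle so the expert scores them blind.
samples = [(m, generate(m, TASK)) for m in MODELS for _ in range(RUNS)]
random.shuffle(samples)

# to_score.csv: what the expert sees (no model names).
# key.csv: maps each row id back to its model for unblinding later.
with open("to_score.csv", "w", newline="") as sheet, \
     open("key.csv", "w", newline="") as key:
    sheet_w, key_w = csv.writer(sheet), csv.writer(key)
    sheet_w.writerow(["id", "output", "score_1_to_10"])
    key_w.writerow(["id", "model"])
    for i, (model, output) in enumerate(samples):
        sheet_w.writerow([i, output, ""])
        key_w.writerow([i, model])
```

Once the expert fills in the score column, join the two files on id and average the scores by model; that average is the number you write down.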