Agents can replicate academic research, go head to head with human experts in blind evaluations, and generate 17 PowerPoints nobody asked for. The gap between useful and wasteful is entirely up to us.
Something shifted quietly in how much autonomous work machines can do, and most people haven't fully registered it yet.
This isn't about benchmark scores or lab results. It's about a test built by practitioners: domain experts with an average of 14 years of experience in finance, law, retail, and other industries, who designed realistic tasks that would take a skilled human four to seven hours to complete. Both humans and the latest models attempted those tasks. A blind panel of experts then graded the results without knowing which was which.

Humans won, but only barely, and the margins varied significantly by field. More importantly, the gap is closing fast, and the main reason AI fell short wasn't errors or hallucinations. It was formatting and instruction-following, two areas that are improving rapidly with each new generation of models.
We are not at replacement territory yet, but we are clearly past "useful assistant." Something more substantial is underway.

The Expert Evaluation: Key Findings
Experts averaging 14 years of experience designed tasks that take a skilled human 4 to 7 hours to complete. AI and human experts both attempted the same tasks. A blind panel graded the results.
Humans won, but narrowly. Margins varied by industry. The main reason AI lost was not accuracy or hallucinations, but poor formatting and instruction-following. Both are rapidly improving.
Newer models scored significantly higher than older ones. If the current trend holds, the next generation of models is expected to beat human experts on average in this evaluation.
Tasks and jobs are not the same thing. AI completing individual tasks shifts what professionals do; it does not eliminate roles, at least not while AI abilities remain uneven across complex work.
The Research Replication Experiment That Changes Everything
To understand what "real, economically valuable work" actually looks like in practice, here is a concrete example worth sitting with.
Academia has been struggling with a replication crisis for years. Important published findings turn out to be impossible for other researchers to reproduce. Checking a paper's findings requires deeply reading it, analyzing the underlying data, and painstakingly tracing every calculation. It is slow, expensive, and requires genuine expertise. As a result, only a fraction of published research ever gets checked.
The experiment: Claude Sonnet 4.5 was given the full text of a sophisticated economics paper along with its complete replication dataset. The prompt was straightforward: replicate the findings from the uploaded data. No hand-holding, no step-by-step instructions, just the files and the task.

What happened next was notable. Without further direction, Claude read the paper, sorted through the archive, converted the statistical code from Stata to Python, and methodically worked through all the reported findings before confirming a successful reproduction. The results were spot-checked by a human and independently verified by another model. Everything held up.
The same process was repeated on several other papers with comparably strong results.
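For a sense of how little scaffolding that setup requires, here is a minimal sketch using the Anthropic Python SDK. The model identifier, file path, and the choice to paste the paper text straight into the prompt are illustrative assumptions; the actual experiment also gave the model the replication archive and a code-execution environment, which a snippet this short does not reproduce.

```python
import anthropic

# Minimal sketch: one instruction plus the paper text, nothing else.
# Placeholder model ID and file path; the real run also had the dataset
# and a code-execution environment available as tools.
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("paper.txt", encoding="utf-8") as f:
    paper_text = f.read()

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model identifier
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": "Replicate the findings from the uploaded data.\n\n--- PAPER ---\n" + paper_text,
    }],
)

print(response.content[0].text)
```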
The time saved matters, but it is not the main point. The main point is that a crisis affecting entire academic fields, one that required expensive expert labor to address and therefore could never be done at scale, now has a plausible solution. That is not a productivity gain. It is a structural shift in what is possible.
Why Agents Can Actually Do This Now
A year ago, an AI working autonomously through a multi-step research task, navigating files, converting code across languages, and checking its own outputs, would have been unreliable at best.
The reason it works now comes down to two changes happening simultaneously. First, newer models make significantly fewer errors. Error-rate reductions that sound small in percentage terms translate into dramatically longer task chains that can complete successfully, because per-step failures compound: a model that fails once every 20 steps can handle far fewer complex tasks than one that fails once every 100.
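To make the compounding concrete, here is a back-of-the-envelope sketch. It assumes each step fails independently with a fixed probability, which is a simplification; real failures are correlated and agents can sometimes recover mid-task.

```python
# How per-step reliability compounds over a task chain, assuming each step
# fails independently with a fixed probability (a simplification).
def chain_success(per_step_failure: float, steps: int) -> float:
    return (1 - per_step_failure) ** steps

for steps in (10, 50, 100, 200):
    fails_every_20 = chain_success(1 / 20, steps)
    fails_every_100 = chain_success(1 / 100, steps)
    print(f"{steps:>3} steps: fails-every-20 model {fails_every_20:6.1%}, "
          f"fails-every-100 model {fails_every_100:6.1%}")
```

At a hundred steps, the fails-every-20 model completes the chain well under 1 percent of the time, while the fails-every-100 model still completes it roughly a third of the time. Small per-step gains buy disproportionately long reliable chains.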
Second, the latest reasoning models are self-correcting. When they hit an obstacle or produce a wrong result, they identify the problem and adjust rather than stopping or silently continuing with bad data. That self-correction loop is what makes longer, more complex tasks survivable without constant human intervention.
The practical result is that agents can now use tools, essentially anything a computer can run, without substantial hand-holding at each step. METR's tracking of the length of task an AI can complete on its own, measured consistently from GPT-3 through the current generation, shows exponential growth over five years. The gains are not slowing down.
A corporate memo was given to an agent with a single instruction: turn it into a PowerPoint. Then another version, from a different angle. Then another. The process was repeated until 17 different versions existed. All technically completed. None of them needed.
The same technology that can reproduce academic research in minutes can generate an infinite pile of outputs nobody asked for. The difference between those two outcomes has nothing to do with the tool. It is entirely about the judgment of the person directing it.
The Workflow That Actually Works
The OpenAI research on expert-AI collaboration offers a framework that is more practical than most of what circulates around this topic.
The suggested workflow runs in three stages. Start by delegating the task to the agent as a first pass. Review what comes back. If it is good enough, you are done. If not, try once or twice with corrections or better instructions. If it still falls short, do the work yourself. That is the full decision tree.
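As a sketch, the whole decision tree fits in a few lines. The run_agent and good_enough functions here are hypothetical placeholders for whatever agent call and review step you actually use; the point is the shape of the loop, not a specific API.

```python
# Delegate, review, retry at most a couple of times, then take the work back.
# `run_agent` calls your agent; `good_enough` is your review step and returns
# (verdict, feedback). Both are hypothetical placeholders.
def delegate_with_review(task, run_agent, good_enough, max_retries=2):
    feedback = ""
    for _ in range(1 + max_retries):
        prompt = task if not feedback else f"{task}\n\nCorrections: {feedback}"
        output = run_agent(prompt)
        verdict, feedback = good_enough(output)
        if verdict:
            return output  # the first pass, or a corrected retry, was enough
    return None  # still short after the retries: do the work yourself
```

If the loop runs out of retries, the fallback is human work, not a fourth attempt. That is the part that keeps control with the person delegating.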
The research estimates that experts following this workflow complete work 40 percent faster and 60 percent cheaper, while retaining meaningful control over the outputs. The speed and cost numbers are useful. The control part matters more, because the alternative is delegation without oversight, which produces technically completed work that serves no real purpose.

The honest version of this: agents are productive when the person directing them has a clear sense of what they actually want and why. When that clarity is missing, agents are efficient at generating volume, not value.
Vague instructions produce vague outputs at high volume. Know specifically what a successful result looks like before you hand it over. That clarity is the most important input.
Agents complete tasks; they do not evaluate whether the completed task was worth doing. That judgment stays with you. Build a review step into every workflow, not as a formality but as a real check.
Delegate. Review. If the output is not good enough, correct and try again once or twice. If it still fails, do it yourself. This keeps you in control without abandoning the efficiency gains.
Agents are very good at scaling output. The discipline is in deciding what output is actually worth producing. More is not a goal. The right thing, done well, is the goal.
The Real Dividing Line
The technology available right now can do something as precise and valuable as reproducing peer-reviewed academic research. It can also, with equal technical competence, generate 17 versions of a document that serves no purpose for anyone.
Both outcomes use the same tools. Both are the result of someone directing an agent toward a task. The only variable is whether the person directing it had a clear reason for doing so.
That is the actual challenge of this moment: not learning which tools to use, but developing the judgment to decide what is worth doing in the first place. Organizations that treat agents as a way to cut headcount and fill calendars with output will get exactly that: more output. Those that use them to tackle work that was previously too slow, too expensive, or structurally impossible will end up with something far more useful.
The agents are here. What they produce is still up to us.


