Can AI Agents Outthink Classical Methods at Hyperparameter Tuning? A New Paper Has Answers

A research paper published on arXiv on 9 June 2026 asks whether a large language model (LLM) agent can get better results than traditional optimization algorithms at tuning the hyperparameters of machine learning models. The short answer is no, at least not when both are working within the same set of constraints.
What the Research Tested
The paper, authored by Fabio Ferreira and collaborators, studied autoresearch, a system that connects an LLM directly to the process of training a machine learning model. The usual approach is to define a fixed set of knobs (like learning rate or batch size) and let an algorithm try different values. The autoresearch agent can instead read and edit the training code itself. In principle it could make bigger changes: add a new scheduling method, switch to a different optimizer, or restructure how the model learns.
But here's the key point: the researchers set up a fair comparison. They told the LLM agent to stay within the exact same boundaries that the classical methods use. This creates a clean apples-to-apples test.
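To make the idea of "the same boundaries" concrete, here is a minimal sketch of what a fixed search space and a bounds check could look like. The knob names and ranges are hypothetical and chosen only for illustration; the article does not list the paper's actual search space.

```python
import math
import random

# Hypothetical search space: names and ranges are illustrative,
# not taken from the paper.
SEARCH_SPACE = {
    "learning_rate": {"low": 1e-5, "high": 1e-1, "log": True},
    "weight_decay": {"low": 0.0, "high": 0.1, "log": False},
    "dropout": {"low": 0.0, "high": 0.5, "log": False},
}


def in_bounds(config, space=SEARCH_SPACE):
    """True if config sets exactly the declared knobs, each inside its range."""
    if set(config) != set(space):
        return False  # a missing or extra knob is outside the agreed space
    return all(
        space[name]["low"] <= config[name] <= space[name]["high"]
        for name in space
    )


def sample_config(rng, space=SEARCH_SPACE):
    """Draw one in-bounds configuration (log-uniform where flagged)."""
    config = {}
    for name, spec in space.items():
        low, high = spec["low"], spec["high"]
        if spec["log"]:
            value = math.exp(rng.uniform(math.log(low), math.log(high)))
        else:
            value = rng.uniform(low, high)
        # Clamp to guard against floating-point rounding at the edges.
        config[name] = min(max(value, low), high)
    return config
```

Under this setup, any proposal can be passed through `in_bounds` before it is trained, whether it comes from a classical sampler or from an agent, so neither side gets to search where the other cannot.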
What They Found
When both approaches play by the same rules, the classical methods win. Specifically, two well-established algorithms — CMA-ES (think of it as a sophisticated trial-and-error method that learns from each attempt) and TPE (a Bayesian approach used in popular tools like Optuna) — consistently outperformed the LLM agent at finding good hyperparameter settings.
To understand why: CMA-ES and TPE have been refined over many years. Each keeps a statistical picture of the search and updates it after every trial. TPE models which regions of the search space have tended to produce good results and which have produced poor ones. CMA-ES maintains a Gaussian sampling distribution and reshapes it toward the settings that performed best. Both then use that picture to guide the search toward better values, which makes them very good at squeezing useful information out of each experiment they run. The LLM agent, by contrast, doesn't have that kind of built-in statistical machinery. It relies on patterns it learned during training and on in-context reasoning, which can be clever but lacks the explicit, data-driven update rule that the older methods apply after every trial.
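The "learn from each attempt" loop can be sketched with a stripped-down Gaussian evolution strategy. This is not CMA-ES itself: it keeps only a mean and a single step size that shrinks on a fixed schedule, whereas CMA-ES also adapts a full covariance matrix and its step size from the search history. The objective `toy_loss` is a made-up stand-in for a validation loss, and every name here is hypothetical.

```python
import random


def toy_loss(x):
    """Toy objective standing in for validation loss; minimum at (0.3, 0.7)."""
    return (x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2


def clip(value, low=0.0, high=1.0):
    """Keep a coordinate inside the fixed search bounds."""
    return max(low, min(high, value))


def simple_es(objective, dim=2, generations=40, popsize=12, seed=0):
    rng = random.Random(seed)
    mean = [0.5] * dim   # centre of the sampling distribution
    sigma = 0.3          # step size (standard deviation per coordinate)
    best_x, best_f = list(mean), objective(mean)
    for _ in range(generations):
        # Sample candidates around the current mean, clipped to the bounds.
        population = [
            [clip(rng.gauss(m, sigma)) for m in mean] for _ in range(popsize)
        ]
        population.sort(key=objective)
        elite = population[: popsize // 4]
        # Move the mean to the average of the best candidates.
        mean = [sum(x[i] for x in elite) / len(elite) for i in range(dim)]
        # Shrink the step size on a fixed schedule (CMA-ES adapts it instead).
        sigma *= 0.9
        top_f = objective(population[0])
        if top_f < best_f:
            best_x, best_f = population[0], top_f
    return best_x, best_f


best_x, best_f = simple_es(toy_loss)
```

Each generation spends its trials where earlier trials did well, which is the property the paragraph above attributes to the classical methods. In practice one would use a maintained implementation rather than a loop like this.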
Why It Matters That the Agent Can Edit Code
The fact that this LLM can actually modify code is worth emphasizing. Most experiments with AI agents for this kind of work just ask them "what hyperparameters should I use?" and the agent responds with numbers. But autoresearch goes further: the agent can rewrite part of the training script itself. That's a bigger capability.
The paper shows that within a fixed search space, this extra power doesn't actually help. It seems to introduce noise rather than solve problems better. The classical methods have an advantage precisely because they are built to exploit the structure of a well-defined search space. When both methods know the exact boundaries ahead of time, that design pays off.
However, there's something worth considering: the paper deliberately chose this fixed-boundary scenario. A natural follow-up question is what happens when the boundaries are fuzzy, irregular, or described in plain English rather than a rigid configuration file. That's a scenario where classical methods tend to struggle. If the search space is messy or poorly defined, an LLM's ability to reason about language and code could be genuinely useful. The paper identifies this as an open question.
Placing This in Context
This pattern is familiar to anyone who has watched AI progress over several decades. When a powerful new tool emerges — and LLMs are certainly that — there's an initial wave of optimism followed by careful empirical testing that pins down exactly what the new approach does better and what the old approaches still beat it at. We saw this before: deep learning revolutionized image recognition but took longer to beat classical methods on structured, tabular data. Neural architecture search promised to automate the entire process of designing networks, but it turned out to work best as a complement to human expertise, not a replacement.
The results here look like part of the same calibration process. And that's actually useful. When you know that an LLM agent underperforms at a specific task, the productive question becomes: where would it shine? Reasoning about how model training works, designing new experiments from scratch, or translating a vague problem statement into a working configuration: those are areas where an agent's language understanding could genuinely add value.
What This Means for You
If you're building or maintaining machine learning systems in production, the practical lesson is straightforward: stick with classical methods like TPE or CMA-ES for hyperparameter tuning. They'll extract more value from each experiment you run, which matters when every training run costs money or time. Your results will be more reproducible, and the methods are well-understood.
The more interesting avenue for future work is hybrid systems: an LLM handling the bigger-picture decisions about experiment design while a classical method handles the detailed tuning. Since the autoresearch code is openly available, researchers can test this combination to see if the best of both approaches works better than either alone.
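One way to picture that division of labour is the sketch below. Everything in it is an assumption made for illustration: `propose_search_space` is a stub that returns a hard-coded space instead of calling a model, `toy_loss` replaces a real training run, and plain random search stands in for TPE or CMA-ES. It is not how autoresearch works and does not reflect any result in the paper.

```python
import random


def propose_search_space(task_description):
    """Hypothetical stand-in for the LLM step; no model is actually queried."""
    return {"learning_rate": (1e-4, 1e-1), "dropout": (0.0, 0.5)}


def toy_loss(config):
    """Toy objective replacing a real training run."""
    return (config["learning_rate"] - 0.01) ** 2 + (config["dropout"] - 0.2) ** 2


def tune(space, objective, trials=200, seed=0):
    """Random search inside the space; a placeholder for TPE or CMA-ES."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(trials):
        config = {name: rng.uniform(low, high) for name, (low, high) in space.items()}
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


# The agent decides what to search over; the optimizer decides where to look.
space = propose_search_space("tune a small classifier")
best_config, best_loss = tune(space, toy_loss)
```

The point of the split is that the two stages can be swapped independently: a different agent can propose the space, and a stronger optimizer can replace `tune`, without either needing to know how the other works.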
The paper doesn't settle the question of where LLM agents belong in the machine learning toolkit. Instead, it sharpens the question. And in a field full of hype and unclear boundaries, clarity about what something can and cannot do is genuinely valuable.


