LLMs vs. Classical HPO: New Research Finds CMA-ES and TPE Still Hold the Edge

A paper published on arXiv on 9 June 2026 asks a pointed question for anyone running ML pipelines at scale: can a large language model agent outperform classical hyperparameter optimisation (HPO) algorithms when given direct access to training code? The short answer, according to the research, is not yet, at least not within a fixed search space.
What the Paper Studies
The work, authored by Fabio Ferreira and collaborators, is framed around autoresearch, a repository that wires an LLM agent into the ML training loop in an unusually direct way: rather than sampling from a predefined configuration space through a surrogate model or acquisition function, the agent reads and edits training code itself. The practical implication is that the agent can, in principle, make structural changes — not just tune learning_rate or batch_size within declared bounds, but modify the code that defines how those parameters are used.
That architectural choice makes the comparison meaningful. If you constrain the LLM to the same fixed hyperparameter search space that classical methods operate in, you get a clean apples-to-apples benchmark. That is precisely the experimental condition the authors establish.
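To make "fixed search space" concrete, here is a minimal sketch of what such a constraint can look like in code. It is not taken from the paper or from the autoresearch repository: the `FloatRange` class, the `SEARCH_SPACE` dictionary, and the parameter names and bounds are all hypothetical. The point is only that a proposal, whether it comes from a sampler or from an agent, is accepted if and only if it sets the declared parameters within the declared bounds.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatRange:
    """A continuous hyperparameter with declared bounds (hypothetical helper)."""
    low: float
    high: float
    log: bool = False  # sample uniformly in log-space when True

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high

    def sample(self, rng: random.Random) -> float:
        if self.log:
            value = math.exp(rng.uniform(math.log(self.low), math.log(self.high)))
        else:
            value = rng.uniform(self.low, self.high)
        # Clamp to guard against floating-point rounding at the edges.
        return min(max(value, self.low), self.high)


# Hypothetical search space; names and bounds are illustrative only.
SEARCH_SPACE = {
    "learning_rate": FloatRange(1e-5, 1e-1, log=True),
    "weight_decay": FloatRange(1e-6, 1e-2, log=True),
    "dropout": FloatRange(0.0, 0.5),
}


def is_in_space(config: dict) -> bool:
    """Accept a proposal only if it sets exactly the declared parameters, within bounds."""
    if set(config) != set(SEARCH_SPACE):
        return False
    return all(SEARCH_SPACE[name].contains(value) for name, value in config.items())


def sample_config(rng: random.Random) -> dict:
    """Draw one configuration the way a sampler-based method would."""
    return {name: spec.sample(rng) for name, spec in SEARCH_SPACE.items()}
```

Under this kind of check, a proposal that introduces a new key (say, a scheduler setting) or pushes `learning_rate` outside its range is rejected, which is what keeps a comparison between an agent and a sampler on equal terms.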
The Core Finding
Within a fixed search space, classical HPO methods, specifically CMA-ES (Covariance Matrix Adaptation Evolution Strategy) and TPE (Tree-structured Parzen Estimator), consistently outperform the LLM-based agent. The result will not surprise anyone who has spent time with the HPO literature, but explicit empirical confirmation against a code-editing LLM agent is new ground.
CMA-ES and TPE are well-characterised. CMA-ES is a gradient-free evolution strategy that samples candidates from a multivariate normal distribution and iteratively adapts that distribution's mean and covariance matrix. The covariance matrix captures pairwise dependencies between parameters, which is why the method is often described as second-order and why it tends to be sample-efficient in continuous domains. TPE, the workhorse behind tools like Optuna and Hyperopt, models p(x|y) with two density estimators, one fitted to the best-performing trials and one to the rest, and uses their ratio to guide sampling. It is likewise sample-efficient and well suited to a fixed configuration schema. Both methods have a long track record, with theoretical grounding and extensive empirical validation across tabular, vision, and NLP tasks.
The LLM agent, by contrast, operates under a fundamentally different set of constraints. It has no explicit probabilistic model of the search space and maintains no surrogate. Its "reasoning" about which hyperparameters to adjust next comes from pattern matching over what it sees in context, shaped by its training data. That process may encode useful heuristics, but it lacks the formal convergence properties of evolutionary or Bayesian strategies.
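The density-ratio idea behind TPE can be sketched in a few lines. What follows is a deliberately simplified, one-dimensional illustration, not the implementation used by Optuna or Hyperopt and not code from the paper: it uses fixed-bandwidth Gaussian kernels, and the function names (`kde`, `tpe_suggest`, `minimise`) and defaults (`gamma`, `bandwidth`, `n_candidates`) are assumptions chosen for readability.

```python
import math
import random


def kde(points, bandwidth):
    """Return a fixed-bandwidth Gaussian kernel density estimate over `points`."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

    return density


def tpe_suggest(history, low, high, rng, gamma=0.25, n_candidates=24, bandwidth=0.1):
    """Suggest the next x from (x, loss) pairs; needs at least two observations."""
    ranked = sorted(history, key=lambda pair: pair[1])
    n_good = max(1, math.ceil(gamma * len(ranked)))
    good = [x for x, _ in ranked[:n_good]]   # best-performing trials
    bad = [x for x, _ in ranked[n_good:]]    # everything else
    good_density = kde(good, bandwidth)
    bad_density = kde(bad, bandwidth)

    # Sample candidates from the "good" density: pick a good point, add kernel noise.
    candidates = []
    for _ in range(n_candidates):
        x = rng.gauss(rng.choice(good), bandwidth)
        candidates.append(min(max(x, low), high))

    # Keep the candidate with the highest good/bad density ratio.
    return max(candidates, key=lambda x: good_density(x) / (bad_density(x) + 1e-12))


def minimise(objective, low, high, n_trials=40, n_startup=8, seed=0):
    """Run random warm-up trials, then let the density ratio guide sampling."""
    rng = random.Random(seed)
    history = []
    for trial in range(n_trials):
        if trial < n_startup:
            x = rng.uniform(low, high)
        else:
            x = tpe_suggest(history, low, high, rng)
        history.append((x, objective(x)))
    return min(history, key=lambda pair: pair[1])
```

Calling `minimise(lambda x: (x - 0.3) ** 2, 0.0, 1.0)` returns the best `(x, loss)` pair found. Every suggestion is driven by an explicit model of where good and bad trials have landed so far.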
Why the Code-Editing Framing Matters
The autoresearch setup is worth dwelling on because it is not a standard ask-the-LLM-for-a-config workflow. Giving the agent write access to training code is a meaningful capability escalation. It means the agent could, under different experimental conditions, step outside the fixed search space entirely — adding a learning rate scheduler that was not in the original config, switching optimisers mid-run, or restructuring the training loop. The paper's contribution is partly in identifying where that additional latitude does and does not translate into better outcomes.
Within the fixed-space condition, the extra degree of freedom appears to add noise rather than signal. The classical methods, operating with structural certainty about the search space geometry, make better use of each evaluation. This is consistent with what the HPO community has observed when comparing structured Bayesian methods against less constrained search strategies: the ability to encode assumptions about the search space is usually an asset, not a limitation.
Worth flagging here: the fixed-space constraint is a deliberate experimental choice, not the only interesting one. A natural follow-on question — which the paper itself may raise in its analysis — is whether LLM agents gain ground when the search space is underspecified, irregular, or partially described in natural language rather than a config schema. That is the regime where classical methods start to struggle and where language-grounded reasoning could, in principle, add value.
Placing This in a Longer Arc
There is a pattern here that anyone who has covered multiple waves of AI tooling will recognise. When a new capability class arrives — and LLMs are unambiguously a major one — there is an early phase of maximum-scope claims, followed by a more granular empirical reckoning that maps exactly where the new approach wins and where established methods retain their advantage. We saw it with deep learning displacing gradient boosting on some tasks while gradient boosting held firm on structured tabular data for years; we saw it again with neural architecture search, which promised to automate the entire design loop but ended up complementing rather than replacing domain knowledge. The autoresearch findings look like part of that same calibration cycle for LLM agents applied to classical optimisation problems.
That calibration is not a setback — it is how the field figures out where to focus. If LLM agents underperform CMA-ES on fixed-space HPO, the productive question becomes: what is the task decomposition where the agent's capabilities are genuinely additive? Code-level reasoning about training dynamics, experiment design in open-ended research settings, and translating informal problem descriptions into runnable configurations are all plausible candidates.
What This Means for Practitioners
For ML engineers running production HPO pipelines, the immediate takeaway is conservative: classical methods remain the default for structured, fixed-space tuning. Optuna's TPE or a CMA-ES backend via pycma or nevergrad will, on current evidence, extract more signal per trial than an LLM-driven agent operating in the same space. Compute budgets, trial latency, and reproducibility all argue in the same direction.
The more interesting near-term question is whether hybrid architectures — where an LLM handles the outer loop of experiment design and a classical method handles the inner loop of hyperparameter search — produce gains that neither approach achieves alone. The autoresearch codebase, by making the LLM's code-editing interface explicit, provides a reasonably clean substrate for testing that hypothesis.
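One way to picture such a hybrid is as two nested loops. The sketch below is purely illustrative and makes several assumptions: `propose_spaces` is a hard-coded stand-in for where an LLM agent would propose experiment designs (no real agent call is modelled), `inner_search` uses plain random search as a placeholder for TPE or CMA-ES, and `toy_objective` is a made-up loss surface. None of it comes from the paper or the autoresearch codebase.

```python
import math
import random
from typing import Dict, List, Tuple

Bounds = Dict[str, Tuple[float, float]]
Config = Dict[str, float]


def inner_search(objective, bounds: Bounds, n_trials: int, rng: random.Random):
    """Inner loop: random search as a stand-in for a classical HPO method."""
    best_cfg, best_loss = None, math.inf
    for _ in range(n_trials):
        cfg = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        loss = objective(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss


def propose_spaces() -> List[Bounds]:
    """Outer loop stand-in: an agent would generate these designs; here they are fixed."""
    return [
        {"lr": (0.0, 1.0)},                           # tune learning rate only
        {"lr": (0.0, 1.0), "momentum": (0.0, 0.99)},  # also expose momentum
    ]


def toy_objective(cfg: Config) -> float:
    """Hypothetical loss surface; parameters not exposed fall back to defaults."""
    lr = cfg["lr"]
    momentum = cfg.get("momentum", 0.0)
    return (lr - 0.1) ** 2 + (momentum - 0.9) ** 2


def hybrid_search(n_trials_per_space: int = 200, seed: int = 0):
    """Run the inner search once per proposed space and keep the best result."""
    rng = random.Random(seed)
    results = []
    for bounds in propose_spaces():
        cfg, loss = inner_search(toy_objective, bounds, n_trials_per_space, rng)
        results.append((loss, cfg))
    return min(results, key=lambda item: item[0])
```

In this toy setup the second design wins because it exposes a parameter the first one leaves at its default. The division of labour is the point: the outer loop decides what is tunable, and the inner loop spends the trial budget within that space.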
The paper (arXiv:2603.24647) does not close the question of LLM utility in the ML automation stack; it sharpens it. And in a field where precision about capability boundaries is in short supply, that is a useful contribution in its own right.


