
One of the most important aspects of machine learning is hyperparameter optimisation, as finding the right hyperparameters for a machine learning task can make or break a model’s performance. “Internally, we regularly use Google Vizier as the default platform for hyperparameter optimisation,” say Yutian Chen, staff research scientist at DeepMind, and Xingyou (Richard) Song, research scientist on the Google Research Brain Team.
Throughout its deployment over the last five years, Google Vizier has been used more than 10 million times across a vast range of applications, from machine learning applications in vision, reinforcement learning and language to scientific applications such as protein discovery and hardware acceleration. Because Google Vizier keeps track of usage patterns in its database, this data, usually consisting of optimisation trajectories termed studies, contains valuable prior information on realistic hyperparameter tuning objectives and is thus highly attractive for developing better algorithms.
While there have been many previous methods for meta-learning over such data, these methods share one major drawback: their meta-learning procedures depend heavily on numerical constraints such as the number of hyperparameters and their value ranges, and thus require all tasks to use the exact same hyperparameter search space (i.e. tuning specifications). Additional textual information in the study, such as its description and parameter names, is also rarely used, yet can hold meaningful information about the type of task being optimised. This drawback is exacerbated for larger datasets, which often contain significant amounts of such meaningful information.
Today in “Towards Learning Universal Hyperparameter Optimizers with Transformers”, we are excited to introduce the OptFormer, one of the first Transformer-based frameworks for hyperparameter tuning, learned from large-scale optimisation data using flexible text-based representations. While numerous works have previously demonstrated the Transformer’s strong abilities across various domains, few have touched on its optimisation-based capabilities, especially over text space.
Our core findings demonstrate for the first time some intriguing algorithmic abilities of Transformers: 1) a single Transformer network is capable of imitating highly complex behaviours from multiple algorithms over long horizons; 2) the network is further capable of predicting objective values very accurately, in many cases surpassing Gaussian Processes, which are commonly used in algorithms such as Bayesian Optimisation.
Approach: representing studies as tokens
Rather than only using numerical data, as is common with previous methods, our novel approach instead utilises concepts from natural language and represents all of the study data as a sequence of tokens, including textual information from the initial metadata. In the animation below, this includes “CIFAR10”, “learning rate”, “optimiser type”, and “Accuracy”, which informs the OptFormer that this is an image classification task.
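To make the idea concrete, the snippet below sketches how a study might be flattened into one text sequence. This is a hypothetical illustration only: the actual OptFormer serialisation (separators, quantisation of numerical values, token vocabulary) is defined in the paper and differs from this sketch.

```python
# Hypothetical sketch of serialising a hyperparameter-tuning study as text.
# Function name, separators and formatting are illustrative, not the real
# OptFormer encoding; the point is that textual metadata and the numerical
# optimisation trajectory live in a single token sequence.
def serialise_study(metadata: dict, trials: list) -> str:
    # Metadata: task name, parameter names, metric name, algorithm to imitate, etc.
    header = " ".join(f"{key}:{value}" for key, value in metadata.items())
    # Each trial: the suggested hyperparameter values followed by the observed objective.
    rows = []
    for trial in trials:
        params = " ".join(f"{name}={value}" for name, value in trial["params"].items())
        rows.append(f"{params} | {metadata['metric']}={trial['objective']}")
    return " || ".join([header] + rows)

print(serialise_study(
    metadata={"task": "CIFAR10", "params": "learning rate, optimiser type",
              "metric": "Accuracy"},
    trials=[{"params": {"learning rate": 0.01, "optimiser type": "SGD"},
             "objective": 0.83}],
))
```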
The OptFormer then generates new hyperparameters to try on the task, predicts the task accuracy, and finally receives the true accuracy, which is used to generate the next round’s hyperparameters. Using the T5X codebase, the OptFormer is trained in a typical encoder-decoder fashion with standard generative pretraining over a wide range of hyperparameter optimisation objectives, including real-world data collected by Google Vizier as well as public hyperparameter (HPO-B) and blackbox optimisation benchmarks (BBOB).
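Schematically, one tuning round alternates between decoding a suggestion, decoding an objective prediction, and appending the true observation. The sketch below uses hypothetical callables standing in for the model’s decoding steps and the user’s training job; the real interface lives in the T5X-based codebase.

```python
def run_tuning_loop(initial_prompt, num_rounds,
                    decode_hyperparameters, decode_objective, evaluate):
    """Minimal sketch of the loop described above (hypothetical interfaces).

    decode_hyperparameters(history) -> str : next suggestion, decoded as tokens
    decode_objective(history, suggestion) -> float : model's predicted metric
    evaluate(suggestion) -> float : true metric from running the actual task
    """
    history = initial_prompt  # serialised metadata, e.g. task and parameter names
    for _ in range(num_rounds):
        suggestion = decode_hyperparameters(history)
        predicted = decode_objective(history, suggestion)  # the model's guess
        observed = evaluate(suggestion)                     # e.g. CIFAR-10 accuracy
        # Append the completed trial so the next round conditions on it.
        history += f" || {suggestion} -> {observed:.4f} (predicted {predicted:.4f})"
    return history
```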
Imitating policies
As the OptFormer is trained on optimisation trajectories from various algorithms, it can now accurately imitate multiple such algorithms simultaneously. By providing a text-based prompt in the metadata specifying the designated algorithm (e.g. “Regularised Evolution”), the OptFormer will imitate that algorithm’s behaviour.
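Concretely, switching the imitated algorithm only requires editing the metadata text; no retraining or code change is needed. Algorithm names other than “Regularised Evolution” below are placeholders for whatever policies appear in the training data.

```python
# Illustrative only: the imitated algorithm is selected purely through the
# textual metadata that prefixes the serialised study.
base_metadata = {"task": "CIFAR10", "params": "learning rate, optimiser type",
                 "metric": "Accuracy"}
prompts = {name: {**base_metadata, "algorithm": name}
           for name in ["Regularised Evolution", "Random Search"]}
# Decoding suggestions conditioned on prompts["Regularised Evolution"] makes the
# model behave like regularised evolution; swapping the name swaps the behaviour.
```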

Predicting objective values
In addition, the OptFormer may now predict the objective value being optimised (e.g. accuracy) and provide uncertainty estimates. We compared the OptFormer’s prediction with a standard Gaussian Process and found that the OptFormer was able to make significantly more accurate predictions. This can be seen below qualitatively, where the OptFormer’s calibration curve closely follows the ideal diagonal line in a goodness-of-fit test, and quantitatively through standard aggregate metrics such as log predictive density.
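For readers unfamiliar with these diagnostics, the sketch below shows generic implementations of a calibration curve and log predictive density. It assumes Gaussian predictive distributions (as a GP produces) purely for simplicity; OptFormer itself outputs a discretised distribution over objective tokens, and this is not the paper’s evaluation code.

```python
import numpy as np
from scipy import stats

def log_predictive_density(y_true, pred_mean, pred_std):
    # Average log-likelihood of the observed objectives under the predictive
    # distributions (inputs are 1-D NumPy arrays); higher is better.
    return np.mean(stats.norm.logpdf(y_true, loc=pred_mean, scale=pred_std))

def calibration_curve(y_true, pred_mean, pred_std,
                      levels=np.linspace(0.05, 0.95, 19)):
    # For each nominal quantile level q, the fraction of observations falling
    # below the predicted q-quantile; a well-calibrated model tracks the diagonal.
    quantiles = stats.norm.ppf(levels[:, None], loc=pred_mean, scale=pred_std)
    empirical = (np.asarray(y_true)[None, :] <= quantiles).mean(axis=1)
    return levels, empirical
```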

Combining both: model-based optimisation

We may now use the OptFormer’s function prediction capability to better guide our imitated policy, similar to techniques found in Bayesian Optimisation. Using Thompson Sampling, we may rank our imitated policy’s suggestions and only select the best according to the function predictor. This produces an augmented policy capable of outperforming our industry-grade Bayesian Optimisation algorithm in Google Vizier when optimising classic synthetic benchmark objectives and tuning the learning rate hyperparameters of a standard CIFAR-10 training pipeline.
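A rough sketch of that selection step, under the assumption that we can draw candidate suggestions from the imitated policy and sample plausible objective values from the model’s predictive distribution (both callables below are hypothetical stand-ins, not the paper’s actual interface):

```python
def thompson_select(history, num_candidates, sample_suggestion, sample_objective):
    """Pick the candidate whose sampled objective value is highest
    (Thompson sampling; maximisation assumed)."""
    # Draw several candidate hyperparameter settings from the imitated policy.
    candidates = [sample_suggestion(history) for _ in range(num_candidates)]
    # Sample one plausible objective value per candidate from the model's
    # predictive distribution, then keep the best-scoring candidate.
    scores = [sample_objective(history, candidate) for candidate in candidates]
    return candidates[scores.index(max(scores))]
```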
Conclusion
Throughout this work, we discovered some useful and previously unknown optimisation capabilities of the Transformer. In the future, we hope to pave the way for a universal hyperparameter and blackbox optimisation interface to use both numerical and textual data to facilitate optimisation over complex search spaces, and integrate the OptFormer with the rest of the Transformer ecosystem (e.g. language, vision, code) by leveraging Google’s vast collection of offline AutoML data.
The authors are Yutian Chen, staff research scientist, DeepMind, and Xingyou (Richard) Song, research scientist, Google Research, Brain Team.