In modern machine learning, optimization is almost synonymous with gradients. Adam, Adafactor, Lion, and SGD are all variants of the same idea: moving parameters downhill along a cheaply computed gradient.
However, many important problems exist where the gradient is either absent, meaningless, or too expensive to approximate. For these black-box optimization scenarios, CMA-ES (Covariance Matrix Adaptation Evolution Strategy) is a remarkably effective tool.
CMA-ES is a zeroth-order optimizer. It operates without knowing the local slope of the landscape, relying instead on the value of the objective at sampled points. This provides a principled method for navigating toward optimal solutions using only function evaluations.
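To make the ask-evaluate-tell structure concrete, here is a simplified sketch of an evolution strategy in the CMA-ES family. It is an assumption-laden toy, not the real algorithm: full CMA-ES adapts the covariance matrix and step size via evolution paths, whereas this sketch only recenters the mean on the best samples and decays the step size geometrically. The `sphere` objective and all constants are illustrative choices.

```python
import numpy as np

def sphere(x):
    # Black-box objective: only its value is observable, no gradient is used.
    return float(np.sum(x ** 2))

rng = np.random.default_rng(0)
dim, pop_size, n_parents = 5, 16, 8
mean = rng.normal(size=dim)   # center of the search distribution
sigma = 1.0                   # global step size
cov = np.eye(dim)             # covariance; full CMA-ES adapts this matrix

for generation in range(60):
    # Ask: sample candidate solutions from the current Gaussian.
    candidates = rng.multivariate_normal(mean, sigma ** 2 * cov, size=pop_size)
    # Evaluate: rank candidates purely by objective value (zeroth-order).
    fitness = np.array([sphere(c) for c in candidates])
    elite = candidates[np.argsort(fitness)[:n_parents]]
    # Tell: move the mean toward the best samples; full CMA-ES also
    # updates cov and sigma from the same ranked samples.
    mean = elite.mean(axis=0)
    sigma *= 0.95  # crude decay standing in for CMA-ES's path-based rule

print(sphere(mean))
```

Even this stripped-down loop finds the minimum of the sphere function, which illustrates the core point: ranking a population of samples is enough signal to make progress without a single gradient.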