Symbolic Regression

Discovering Equations with Genetic Programming

ML for Science and Engineering — Lecture 12

The Limits of Fixed Libraries

In SINDy, we chose a fixed library of candidate terms:

$\dot{x} = \theta_1 x + \theta_2 x^2 + \theta_3 \sin(x) + \cdots$

This works when we have a good guess about what terms might appear.

But what if the true equation is:

$y = \frac{\sin(x^2)}{1 + e^{-x}}$

No finite polynomial library will contain this term. We need to search over functional forms themselves.

What is Symbolic Regression?

Given data $(x, y)$, find a mathematical expression $f$ such that $y \approx f(x)$.
| Method | Fixed | Searched over |
|---|---|---|
| Linear regression | Form: $y = ax + b$ | Coefficients $a, b$ |
| SINDy | Library: $[1, x, x^2, \sin x]$ | Coefficients $\theta$ |
| Symbolic regression | Nothing | All expressions |
The search space is enormous: all possible compositions of $+, -, \times, \div, \sin, \cos, \exp, \log, \dots$ applied to variables and constants.

A Brief History

  • 1992, John Koza: Genetic Programming. Evolving computer programs (trees) via natural selection.
  • 2009, Schmidt & Lipson, Science: "Distilling free-form natural laws from experimental data." Rediscovered Newton's laws from pendulum data.
  • 2023, Miles Cranmer: PySR. High-performance symbolic regression combining evolutionary search with modern optimizations.

Expressions as Trees

Every mathematical expression has a natural tree structure:

$x + \sin(y)$

Internal nodes = operators ($+, \sin$)
Leaves = variables ($x, y$) or constants
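This representation is easy to mirror in code. A minimal sketch, assuming trees are encoded as nested tuples (an illustrative choice, not any particular library's representation):

```python
import math

# A tree is ("var", name), ("const", value), or (op_name, child, ...).
OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "sin": math.sin,
}

def evaluate(tree, env):
    """Recursively evaluate an expression tree under variable bindings env."""
    kind = tree[0]
    if kind == "var":
        return env[tree[1]]
    if kind == "const":
        return tree[1]
    # Internal node: evaluate the children, then apply the operator
    args = (evaluate(child, env) for child in tree[1:])
    return OPS[kind](*args)

# x + sin(y): '+' and 'sin' are internal nodes; x and y are leaves
expr = ("add", ("var", "x"), ("sin", ("var", "y")))
print(evaluate(expr, {"x": 1.0, "y": 0.0}))  # 1.0, since sin(0) = 0
```

Recursive evaluation like this is what makes trees the natural data structure for genetic programming: any subtree is itself a valid expression.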

Other expressions decompose the same way: $\alpha x - \beta xy$, $\frac{x}{1 + e^{-x}}$, and $\sin(x^2) + 0.5$ are each a tree of operators applied to variables and constants.

The Search Space

With $k$ operators and $v$ variables, the number of possible trees of depth $d$ grows super-exponentially:

| Depth | Approximate # trees | Example |
|---|---|---|
| 1 | $\sim v$ (just variables) | $x$ |
| 2 | $\sim k \cdot v^2$ | $x + y$ |
| 3 | $\sim k^3 \cdot v^4$ | $\sin(x) + y \cdot z$ |
| 5 | $> 10^{10}$ | Complex nested expressions |
Exhaustive search is impossible. We need a heuristic search strategy. Genetic algorithms are a natural fit: they evolve a population of trees.
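The super-exponential growth is easy to verify with a short recurrence. A sketch (an upper bound that ignores algebraic equivalences), with $v$ leaf choices, $u$ unary operators, and $k$ binary operators:

```python
def num_trees(depth, v=2, u=2, k=4):
    """Count syntactically distinct trees of depth <= depth:
    each tree is a leaf, a unary op over one subtree,
    or a binary op over two subtrees."""
    if depth == 1:
        return v
    smaller = num_trees(depth - 1, v, u, k)
    return v + u * smaller + k * smaller ** 2

for d in range(1, 6):
    print(d, num_trees(d))  # depth 5 already exceeds 10^14
```

Even with only two variables and six operators, depth 5 blows past the $10^{10}$ figure in the table.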

The Genetic Algorithm

Inspired by biological evolution: a population of candidate solutions evolves over generations.

1. Initialize
Random population of expression trees
2. Evaluate
Fitness = how well each tree fits the data
3. Evolve
Select, crossover, mutate. Repeat.
$\text{fitness}(f) = \underbrace{\text{MSE}(f, \text{data})}_{\text{accuracy}} + \alpha \cdot \underbrace{|f|}_{\text{complexity}}$
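This fitness function can be written down directly. A sketch in which a candidate is simply a callable plus a node count, and `alpha` is an arbitrary illustrative value:

```python
import math

def fitness(predict, size, xs, ys, alpha=0.01):
    """MSE plus a parsimony penalty proportional to expression size.
    Lower is better. `predict` is the candidate expression as a callable."""
    mse = sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return mse + alpha * size

xs = [0.1 * i for i in range(50)]
ys = [x * math.sin(x) for x in xs]

# The true expression (size 5) beats a linear fit (size 3)
# even though it pays a larger complexity penalty
print(fitness(lambda x: x * math.sin(x), size=5, xs=xs, ys=ys))
print(fitness(lambda x: 0.9 * x, size=3, xs=xs, ys=ys))
```

The penalty weight `alpha` sets where the algorithm lands on the accuracy-complexity tradeoff discussed later.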

Selection

Tournament selection: pick $k$ individuals at random, keep the best one.

Other strategies: roulette wheel (probability proportional to fitness), rank selection, elitism (always keep the best).
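Tournament selection takes only a few lines. A sketch assuming a minimized fitness, as in the accuracy-plus-complexity objective:

```python
import random

def tournament_select(population, fitness, k=5):
    """Sample k individuals uniformly at random and return the fittest.
    Lower fitness is better (e.g. MSE plus a complexity penalty)."""
    contenders = random.sample(population, k)
    return min(contenders, key=fitness)

# Toy example: individuals are numbers, fitness is distance to 3
pop = list(range(10))
winner = tournament_select(pop, fitness=lambda x: abs(x - 3), k=4)
print(winner)  # the fittest of 4 random contenders
```

Larger tournaments (bigger $k$) mean stronger selection pressure; $k = 1$ degenerates to uniform random selection.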

Crossover: Swapping Subtrees

Pick a random subtree in each parent, swap them to create offspring:

Crossover combines structural features from both parents. A good subtree from one parent can be transplanted into another context.
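Subtree crossover can be sketched over nested-tuple trees such as `("add", ("var", "x"), ("sin", ("var", "y")))` (an illustrative encoding): enumerate subtree positions, pick one in each parent, and swap.

```python
import random

def subtrees(tree, path=()):
    """Yield (path, subtree) pairs; a path is a sequence of child indices."""
    yield path, tree
    if tree[0] not in ("var", "const"):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], new),) + tree[i + 1:]

def crossover(p1, p2, rng=random):
    """Swap one random subtree of each parent to produce two offspring."""
    path1, sub1 = rng.choice(list(subtrees(p1)))
    path2, sub2 = rng.choice(list(subtrees(p2)))
    return replace(p1, path1, sub2), replace(p2, path2, sub1)

p1 = ("add", ("var", "x"), ("sin", ("var", "y")))
p2 = ("mul", ("const", 2.0), ("var", "x"))
c1, c2 = crossover(p1, p2)
print(c1)
print(c2)
```

Because tuples are immutable, `replace` rebuilds only the spine above the swap point, so parents are never modified in place.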

Mutation: Random Changes

With some probability, randomly alter part of a tree:

  • Point mutation: change one node ($+ \to \times$, $x \to y$, or $0.5 \to 0.3$)
  • Subtree mutation: replace a subtree with a new random one
  • Hoist mutation: replace the tree with one of its own subtrees (simplification)
  • Constant optimization: fine-tune numeric constants via gradient descent

The Full Loop


n_elites = 5
population = [random_tree() for _ in range(pop_size)]

for generation in range(max_generations):
    # Evaluate fitness: accuracy plus a parsimony penalty
    for tree in population:
        tree.fitness = mse(tree, data) + alpha * tree.size

    # Selection + reproduction
    offspring = []
    while len(offspring) < pop_size - n_elites:
        p1 = tournament_select(population, k=5)
        p2 = tournament_select(population, k=5)
        c1, c2 = crossover(p1, p2)
        c1 = maybe_mutate(c1, rate=0.1)
        c2 = maybe_mutate(c2, rate=0.1)
        offspring.extend([c1, c2])

    # Elitism: carry the best n_elites forward unchanged,
    # keeping the population size fixed at pop_size
    population = elites(population, n=n_elites) + offspring[:pop_size - n_elites]

Watch Evolution in Action

Target: discover $f(x)$ from noisy data. Watch the population evolve.


The Accuracy-Complexity Tradeoff

More complex expressions fit better, but are they really better?

| Expression | MSE | Size |
|---|---|---|
| $0.3$ | 2.40 | 1 |
| $0.9x$ | 0.85 | 3 |
| $x \sin(x)$ | 0.001 | 5 |
| $x \sin(x) + 0.001x^3 - \cdots$ | 0.0009 | 15 |
Pareto front: the set of solutions where you cannot improve accuracy without increasing complexity (or vice versa).

The "knee" of the Pareto front often reveals the true equation: a sharp drop in error for minimal complexity increase.
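Extracting the Pareto front is a single pass once candidates are sorted by size. A sketch using the values from the table above, plus one dominated candidate to show the filtering:

```python
def pareto_front(candidates):
    """Keep only non-dominated candidates.
    Each candidate is a (size, mse, expr) tuple; lower is better for both."""
    front = []
    for size, mse, expr in sorted(candidates):
        # Sorted by size, so keep only strict improvements in error
        if not front or mse < front[-1][1]:
            front.append((size, mse, expr))
    return front

cands = [
    (1, 2.40, "0.3"),
    (3, 0.85, "0.9x"),
    (5, 0.001, "x sin(x)"),
    (7, 0.02, "x sin(x) + 0.1"),   # dominated: bigger AND worse than size 5
    (15, 0.0009, "x sin(x) + 0.001x^3 - ..."),
]
for size, mse, expr in pareto_front(cands):
    print(size, mse, expr)
```

The knee is visible in the numbers: going from size 3 to size 5 cuts the MSE by roughly 850x, while size 5 to size 15 barely helps, so $x \sin(x)$ is the natural pick.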

PySR: Modern Symbolic Regression


from pysr import PySRRegressor
import numpy as np

X = np.random.randn(100, 2)      # two input variables
y = X[:, 0] * np.sin(X[:, 1])    # true relationship

model = PySRRegressor(
    niterations=100,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["sin", "cos", "exp"],
    populations=20,               # parallel populations
    maxsize=25,                   # max expression complexity
)

model.fit(X, y)
print(model)                      # Pareto front of equations
        

PySR outputs a Pareto front of equations ranked by complexity versus loss, and uses a Julia backend (SymbolicRegression.jl) for speed.

Challenges and Solutions

Challenges

  • Bloat: trees grow large without improving fitness
  • Premature convergence: population loses diversity
  • Constants: hard to evolve precise numeric values
  • Runtime: expensive for large datasets

Solutions

  • Parsimony pressure, tree depth limits
  • Multiple populations, island model
  • Gradient-based constant optimization
  • Parallelization, batching, Julia/C++

Symbolic Regression in Context

| Method | Search space | Interpretable? | Extrapolates? |
|---|---|---|---|
| Neural network | Weights of fixed architecture | No | Poorly |
| SINDy | Coefficients of fixed library | Yes | If correct |
| Symbolic regression | All expressions | Yes | If correct |
| Physics-informed NN | NN + known PDE constraint | Partially | Somewhat |
Symbolic regression is the most flexible equation discovery method, but also the most computationally expensive. Use SINDy when you have a good library guess. Use SR when you do not.

Summary

1. Represent
Expressions as trees.
Operators + variables + constants.
2. Evolve
Genetic algorithm: select, crossover, mutate over generations.
3. Select
Pareto front: accuracy vs. complexity. Find the knee.
Data $\longrightarrow$ evolve expression trees $\longrightarrow$ discover $f(x)$

Schmidt & Lipson, Science (2009)  |  PySR (Cranmer, 2023)