Symbolic Regression

Discovering Equations with Genetic Programming

ML for Science and Engineering — Lecture 12

The Limits of Fixed Libraries

In SINDy, we chose a fixed library of candidate terms:

$\dot{x} = \theta_1 x + \theta_2 x^2 + \theta_3 \sin(x) + \cdots$

This works when we have a good guess about what terms might appear.

But what if the true equation is:

$y = \frac{\sin(x^2)}{1 + e^{-x}}$

No finite polynomial library will contain this term. We need to search over functional forms themselves.

What is Symbolic Regression?

Given data $(x, y)$, find a mathematical expression $f$ such that $y \approx f(x)$.
| Method | Fixed | Searched over |
|---|---|---|
| Linear regression | Form: $y = ax + b$ | Coefficients $a, b$ |
| SINDy | Library: $[1, x, x^2, \sin x]$ | Coefficients $\theta$ |
| Symbolic regression | Nothing | All expressions |
The search space is enormous: all possible compositions of $+, -, \times, \div, \sin, \cos, \exp, \log, \dots$ applied to variables and constants.

A Brief History

  • 1992, John Koza: Genetic Programming. Evolving computer programs (trees) via natural selection.
  • 2009, Schmidt & Lipson, Science: "Distilling free-form natural laws from experimental data." Rediscovered Newton's laws from pendulum data.
  • 2023, Miles Cranmer: PySR. High-performance symbolic regression combining evolutionary search with modern optimizations.

Expressions as Trees

Every mathematical expression has a natural tree structure:

$x + \sin(y)$

Internal nodes = operators ($+, \sin$)
Leaves = variables ($x, y$) or constants
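This representation is easy to mirror in code. A minimal sketch, assuming trees are encoded as nested tuples (an illustrative choice, not any particular library's representation):

```python
import math

# A tree is ("var", name), ("const", value), or (op_name, child, ...).
OPS = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
    "sin": math.sin,
}

def evaluate(tree, env):
    """Recursively evaluate an expression tree under variable bindings env."""
    kind = tree[0]
    if kind == "var":
        return env[tree[1]]
    if kind == "const":
        return tree[1]
    # Internal node: evaluate the children, then apply the operator
    args = (evaluate(child, env) for child in tree[1:])
    return OPS[kind](*args)

# x + sin(y): '+' and 'sin' are internal nodes; x and y are leaves
expr = ("add", ("var", "x"), ("sin", ("var", "y")))
print(evaluate(expr, {"x": 1.0, "y": 0.0}))  # 1.0, since sin(0) = 0
```

Recursive evaluation like this is what makes trees the natural data structure for genetic programming: any subtree is itself a valid expression.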

Other expressions decompose the same way: $\alpha x - \beta xy$, $\frac{x}{1 + e^{-x}}$, and $\sin(x^2) + 0.5$ are each a tree of operators applied to variables and constants.

The Search Space

With $k$ operators and $v$ variables, the number of possible trees of depth $d$ grows super-exponentially:

| Depth | Approximate # trees | Example |
|---|---|---|
| 1 | $\sim v$ (just variables) | $x$ |
| 2 | $\sim k \cdot v^2$ | $x + y$ |
| 3 | $\sim k^3 \cdot v^4$ | $\sin(x) + y \cdot z$ |
| 5 | $> 10^{10}$ | Complex nested expressions |
Exhaustive search is impossible. We need a heuristic search strategy. Genetic algorithms are a natural fit: they evolve a population of trees.
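The super-exponential growth is easy to verify with a short recurrence. A sketch (an upper bound that ignores algebraic equivalences), with $v$ leaf choices, $u$ unary operators, and $k$ binary operators:

```python
def num_trees(depth, v=2, u=2, k=4):
    """Count syntactically distinct trees of depth <= depth:
    each tree is a leaf, a unary op over one subtree,
    or a binary op over two subtrees."""
    if depth == 1:
        return v
    smaller = num_trees(depth - 1, v, u, k)
    return v + u * smaller + k * smaller ** 2

for d in range(1, 6):
    print(d, num_trees(d))  # depth 5 already exceeds 10^14
```

Even with only two variables and six operators, depth 5 blows past the $10^{10}$ figure in the table.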

The Genetic Algorithm

Inspired by biological evolution: a population of candidate solutions evolves over generations.

1. Initialize
Random population of expression trees
2. Evaluate
Fitness = how well each tree fits the data
3. Evolve
Select, crossover, mutate. Repeat.
$\text{fitness}(f) = \underbrace{\text{MSE}(f, \text{data})}_{\text{accuracy}} + \alpha \cdot \underbrace{|f|}_{\text{complexity}}$
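This fitness function can be written down directly. A sketch in which a candidate is simply a callable plus a node count, and `alpha` is an arbitrary illustrative value:

```python
import math

def fitness(predict, size, xs, ys, alpha=0.01):
    """MSE plus a parsimony penalty proportional to expression size.
    Lower is better. `predict` is the candidate expression as a callable."""
    mse = sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return mse + alpha * size

xs = [0.1 * i for i in range(50)]
ys = [x * math.sin(x) for x in xs]

# The true expression (size 5) beats a linear fit (size 3)
# even though it pays a larger complexity penalty
print(fitness(lambda x: x * math.sin(x), size=5, xs=xs, ys=ys))
print(fitness(lambda x: 0.9 * x, size=3, xs=xs, ys=ys))
```

The penalty weight `alpha` sets where the algorithm lands on the accuracy-complexity tradeoff discussed later.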

Selection

Tournament selection: pick $k$ individuals at random, keep the best one.

Other strategies: roulette wheel (probability proportional to fitness), rank selection, elitism (always keep the best).
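Tournament selection takes only a few lines. A sketch assuming a minimized fitness, as in the accuracy-plus-complexity objective:

```python
import random

def tournament_select(population, fitness, k=5):
    """Sample k individuals uniformly at random and return the fittest.
    Lower fitness is better (e.g. MSE plus a complexity penalty)."""
    contenders = random.sample(population, k)
    return min(contenders, key=fitness)

# Toy example: individuals are numbers, fitness is distance to 3
pop = list(range(10))
winner = tournament_select(pop, fitness=lambda x: abs(x - 3), k=4)
print(winner)  # the fittest of 4 random contenders
```

Larger tournaments (bigger $k$) mean stronger selection pressure; $k = 1$ degenerates to uniform random selection.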

Crossover: Swapping Subtrees

Pick a random subtree in each parent, swap them to create offspring:

Crossover combines structural features from both parents. A good subtree from one parent can be transplanted into another context.
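Subtree crossover can be sketched over nested-tuple trees such as `("add", ("var", "x"), ("sin", ("var", "y")))` (an illustrative encoding): enumerate subtree positions, pick one in each parent, and swap.

```python
import random

def subtrees(tree, path=()):
    """Yield (path, subtree) pairs; a path is a sequence of child indices."""
    yield path, tree
    if tree[0] not in ("var", "const"):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace(tree, path, new):
    """Return a copy of tree with the subtree at path replaced by new."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], new),) + tree[i + 1:]

def crossover(p1, p2, rng=random):
    """Swap one random subtree of each parent to produce two offspring."""
    path1, sub1 = rng.choice(list(subtrees(p1)))
    path2, sub2 = rng.choice(list(subtrees(p2)))
    return replace(p1, path1, sub2), replace(p2, path2, sub1)

p1 = ("add", ("var", "x"), ("sin", ("var", "y")))
p2 = ("mul", ("const", 2.0), ("var", "x"))
c1, c2 = crossover(p1, p2)
print(c1)
print(c2)
```

Because tuples are immutable, `replace` rebuilds only the spine above the swap point, so parents are never modified in place.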

Mutation: Random Changes

With some probability, randomly alter part of a tree:

  • Point mutation: change one node ($+ \to \times$, $x \to y$, or $0.5 \to 0.3$)
  • Subtree mutation: replace a subtree with a new random one
  • Hoist mutation: replace the tree with one of its own subtrees (simplification)
  • Constant optimization: fine-tune numeric constants via gradient descent

The Full Loop


n_elites = 5
population = [random_tree() for _ in range(pop_size)]

for generation in range(max_generations):
    # Evaluate fitness: accuracy plus a parsimony penalty
    for tree in population:
        tree.fitness = mse(tree, data) + alpha * tree.size

    # Selection + reproduction
    offspring = []
    while len(offspring) < pop_size - n_elites:
        p1 = tournament_select(population, k=5)
        p2 = tournament_select(population, k=5)
        c1, c2 = crossover(p1, p2)
        c1 = maybe_mutate(c1, rate=0.1)
        c2 = maybe_mutate(c2, rate=0.1)
        offspring.extend([c1, c2])

    # Elitism: carry the best n_elites forward unchanged,
    # keeping the population size fixed at pop_size
    population = elites(population, n=n_elites) + offspring[:pop_size - n_elites]

Watch Evolution in Action

Target: discover $f(x)$ from noisy data. Watch the population evolve.


The Accuracy-Complexity Tradeoff

More complex expressions fit better, but are they really better?

| Expression | MSE | Size |
|---|---|---|
| $0.3$ | 2.40 | 1 |
| $0.9x$ | 0.85 | 3 |
| $x \sin(x)$ | 0.001 | 5 |
| $x \sin(x) + 0.001x^3 - \cdots$ | 0.0009 | 15 |
Pareto front: the set of solutions where you cannot improve accuracy without increasing complexity (or vice versa).

The "knee" of the Pareto front often reveals the true equation: a sharp drop in error for minimal complexity increase.
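Extracting the Pareto front is a single pass once candidates are sorted by size. A sketch using the values from the table above, plus one dominated candidate to show the filtering:

```python
def pareto_front(candidates):
    """Keep only non-dominated candidates.
    Each candidate is a (size, mse, expr) tuple; lower is better for both."""
    front = []
    for size, mse, expr in sorted(candidates):
        # Sorted by size, so keep only strict improvements in error
        if not front or mse < front[-1][1]:
            front.append((size, mse, expr))
    return front

cands = [
    (1, 2.40, "0.3"),
    (3, 0.85, "0.9x"),
    (5, 0.001, "x sin(x)"),
    (7, 0.02, "x sin(x) + 0.1"),   # dominated: bigger AND worse than size 5
    (15, 0.0009, "x sin(x) + 0.001x^3 - ..."),
]
for size, mse, expr in pareto_front(cands):
    print(size, mse, expr)
```

The knee is visible in the numbers: going from size 3 to size 5 cuts the MSE by roughly 850x, while size 5 to size 15 barely helps, so $x \sin(x)$ is the natural pick.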

PySR: Modern Symbolic Regression


from pysr import PySRRegressor
import numpy as np

X = np.random.randn(100, 2)      # two input variables
y = X[:, 0] * np.sin(X[:, 1])    # true relationship

model = PySRRegressor(
    niterations=100,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["sin", "cos", "exp"],
    populations=20,               # parallel populations
    maxsize=25,                   # max expression complexity
)

model.fit(X, y)
print(model)                      # Pareto front of equations
        

PySR outputs a Pareto front of equations ranked by complexity versus loss, and uses a Julia backend (SymbolicRegression.jl) for speed.

Challenges and Solutions

Challenges

  • Bloat: trees grow large without improving fitness
  • Premature convergence: population loses diversity
  • Constants: hard to evolve precise numeric values
  • Runtime: expensive for large datasets

Solutions

  • Parsimony pressure, tree depth limits
  • Multiple populations, island model
  • Gradient-based constant optimization
  • Parallelization, batching, Julia/C++

Symbolic Regression in Context

| Method | Search space | Interpretable? | Extrapolates? |
|---|---|---|---|
| Neural network | Weights of fixed architecture | No | Poorly |
| SINDy | Coefficients of fixed library | Yes | If correct |
| Symbolic regression | All expressions | Yes | If correct |
| Physics-informed NN | NN + known PDE constraint | Partially | Somewhat |
Symbolic regression is the most flexible equation discovery method, but also the most computationally expensive. Use SINDy when you have a good library guess. Use SR when you do not.

Summary

1. Represent
Expressions as trees.
Operators + variables + constants.
2. Evolve
Genetic algorithm: select, crossover, mutate over generations.
3. Select
Pareto front: accuracy vs. complexity. Find the knee.
Data $\longrightarrow$ evolve expression trees $\longrightarrow$ discover $f(x)$

Schmidt & Lipson, Science (2009)  |  PySR (Cranmer, 2023)