Discovering Equations with Genetic Programming
ML for Science and Engineering — Lecture 12
In SINDy, we chose a fixed library of candidate terms, e.g. $\{1,\ x,\ x^2,\ x^3,\ \sin x,\ \cos x,\ \ldots\}$, and fit sparse coefficients over that library.
This works when we have a good guess about what terms might appear.
But what if the true equation is something like $y = x \sin(x)$?
Unless we happened to guess the right terms, no finite polynomial library will contain this expression. We need to search over functional forms themselves.
This is symbolic regression: given data $(x, y)$, find a mathematical expression $f$ such that $y \approx f(x)$.
Every mathematical expression has a natural tree structure:
Internal nodes = operators ($+, \sin$)
Leaves = variables ($x, y$) or constants
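This tree view can be made concrete with a minimal sketch. The `Node` class below is illustrative, not from any particular library: internal nodes hold an operator, leaves hold a variable name or a constant.

```python
import math

class Node:
    """Expression-tree node: operator at internal nodes, variable/constant at leaves."""
    def __init__(self, value, children=()):
        self.value = value              # operator symbol, variable name, or number
        self.children = list(children)

    def evaluate(self, env):
        if self.value == "+":
            return sum(c.evaluate(env) for c in self.children)
        if self.value == "*":
            a, b = self.children
            return a.evaluate(env) * b.evaluate(env)
        if self.value == "sin":
            return math.sin(self.children[0].evaluate(env))
        if isinstance(self.value, str):     # variable leaf: look up its value
            return env[self.value]
        return self.value                   # constant leaf

# x * sin(x) evaluated at x = 2.0
tree = Node("*", [Node("x"), Node("sin", [Node("x")])])
print(tree.evaluate({"x": 2.0}))  # 2 * sin(2) ≈ 1.819
```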
With $k$ operators and $v$ variables, the number of possible trees of depth $d$ grows super-exponentially:
| Depth | Approximate # trees | Example |
|---|---|---|
| 1 | $\sim v$ (just variables) | $x$ |
| 2 | $\sim k \cdot v^2$ | $x + y$ |
| 3 | $\sim k^3 \cdot v^4$ | $\sin(x) + y \cdot z$ |
| 5 | $> 10^{10}$ | Complex nested expressions |
Inspired by biological evolution: a population of candidate solutions evolves over generations.
Tournament selection: pick $k$ individuals at random, keep the best one.
Other strategies: roulette wheel (probability proportional to fitness), rank selection, elitism (always keep the best).
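Tournament selection takes only a few lines. This sketch assumes each individual carries a `fitness` attribute where lower is better (matching the MSE-plus-size penalty used in the main loop below):

```python
import random

def tournament_select(population, k=5):
    # Sample k individuals uniformly at random and return the fittest.
    # Larger k -> stronger selection pressure; k = 1 is random drift.
    contenders = random.sample(population, k)
    return min(contenders, key=lambda ind: ind.fitness)
```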
Pick a random subtree in each parent, swap them to create offspring:
With some probability, randomly alter part of a tree:
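Crossover and mutation can be sketched as follows, assuming trees are nested lists like `["+", "x", ["sin", "x"]]`; the helper names are illustrative, not from any particular GP library:

```python
import copy
import random

def all_subtree_paths(tree, path=()):
    # Enumerate index paths to every subtree, including the root.
    yield path
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from all_subtree_paths(child, path + (i,))

def get_subtree(tree, path):
    for i in path:
        tree = tree[i]
    return tree

def set_subtree(tree, path, new):
    if not path:
        return new          # replacing the root replaces the whole tree
    get_subtree(tree, path[:-1])[path[-1]] = new
    return tree

def crossover(p1, p2):
    # Pick a random subtree in each parent and swap them.
    c1, c2 = copy.deepcopy(p1), copy.deepcopy(p2)
    path1 = random.choice(list(all_subtree_paths(c1)))
    path2 = random.choice(list(all_subtree_paths(c2)))
    sub1 = copy.deepcopy(get_subtree(c1, path1))
    sub2 = copy.deepcopy(get_subtree(c2, path2))
    return set_subtree(c1, path1, sub2), set_subtree(c2, path2, sub1)

def maybe_mutate(tree, rate=0.1, variables=("x", "y")):
    # With probability `rate`, replace a random subtree with a variable leaf.
    if random.random() < rate:
        path = random.choice(list(all_subtree_paths(tree)))
        tree = set_subtree(tree, path, random.choice(variables))
    return tree
```

Because crossover only exchanges material, the total node count across the two offspring equals that of the two parents, which makes it easy to test.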
```python
n_elites = 5
population = [random_tree() for _ in range(pop_size)]
for generation in range(max_generations):
    # Evaluate fitness: data error plus a parsimony penalty on tree size
    for tree in population:
        tree.fitness = mse(tree, data) + alpha * tree.size

    # Selection + reproduction
    offspring = []
    while len(offspring) < pop_size - n_elites:
        p1 = tournament_select(population, k=5)
        p2 = tournament_select(population, k=5)
        c1, c2 = crossover(p1, p2)
        c1 = maybe_mutate(c1, rate=0.1)
        c2 = maybe_mutate(c2, rate=0.1)
        offspring.extend([c1, c2])

    # Elitism: carry the best individuals over unchanged, truncating
    # offspring so the population stays at exactly pop_size
    population = elites(population, n=n_elites) + offspring[:pop_size - n_elites]
```
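The loop above leaves two helpers unspecified; a minimal sketch, assuming each tree exposes an `evaluate(env)` method and a `fitness` attribute (a hypothetical interface, not a fixed API):

```python
def mse(tree, data):
    # Mean squared error of the tree's predictions over (x, y) samples.
    return sum((tree.evaluate({"x": x}) - y) ** 2 for x, y in data) / len(data)

def elites(population, n=5):
    # The n fittest individuals; lower fitness (error + size penalty) is better.
    return sorted(population, key=lambda t: t.fitness)[:n]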
Demo: the target is to discover $f(x)$ from noisy data; watch the population evolve over generations.
More complex expressions fit better, but are they really better?
| Expression | MSE | Size |
|---|---|---|
| $0.3$ | 2.40 | 1 |
| $0.9x$ | 0.85 | 3 |
| $x \sin(x)$ | 0.001 | 5 |
| $x \sin(x) + 0.001x^3 - \cdots$ | 0.0009 | 15 |
The "knee" of the Pareto front often reveals the true equation: a sharp drop in error for minimal complexity increase.
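Extracting the Pareto front from a pool of candidates is straightforward: sort by size, and keep each model only if it beats every simpler model's error. A sketch using the table's numbers (the truncated expression is abbreviated as a string):

```python
def pareto_front(models):
    # models: list of (expression, mse, size). Keep a model only if no
    # smaller-or-equal-size model achieves lower error.
    front = []
    best_mse = float("inf")
    for expr, mse, size in sorted(models, key=lambda m: (m[2], m[1])):
        if mse < best_mse:          # strictly better than all simpler models
            front.append((expr, mse, size))
            best_mse = mse
    return front

models = [
    ("0.3", 2.40, 1),
    ("0.9x", 0.85, 3),
    ("x*sin(x)", 0.001, 5),
    ("x*sin(x) + 0.001x^3 - ...", 0.0009, 15),
]
print(pareto_front(models))
```

All four table rows survive here since each improves the error; a dominated candidate (say, one with size 3 but worse error than $0.9x$) would be filtered out. The knee is then the front member with the largest error drop per unit of added complexity.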
```python
from pysr import PySRRegressor
import numpy as np

X = np.random.randn(100, 2)      # two input variables
y = X[:, 0] * np.sin(X[:, 1])    # true relationship

model = PySRRegressor(
    niterations=100,
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["sin", "cos", "exp"],
    populations=20,   # parallel populations
    maxsize=25,       # max expression complexity
)
model.fit(X, y)
print(model)  # Pareto front of equations
```
PySR outputs a Pareto front of equations ranked by complexity vs. loss, and uses a Julia backend for speed.
| Method | Search space | Interpretable? | Extrapolates? |
|---|---|---|---|
| Neural network | Weights of fixed architecture | No | Poorly |
| SINDy | Coefficients of fixed library | Yes | If correct |
| Symbolic regression | All expressions | Yes | If correct |
| Physics-informed NN | NN + known PDE constraint | Partially | Somewhat |