Learning Differential Equations
from Data

ML for Science and Engineering — Lecture 8

Where We Are

So far we have modeled time series by mapping previous steps to next steps:

$x_{i+1} = f(x_i, x_{i-1}, \dots)$

Autoregressive models, ARIMA, function fitting on lagged variables.

Today: a fundamentally different approach. Model the derivative:

$\dot{x} = f(x)$

Discover the differential equation that generates the data.

Why Model Derivatives?

This is what Newton did. This is what everyone in the sciences has done for quite some time.
A differential equation $\dot{x} = f(x)$ says that the same rule $f$ governs the system at every moment in time. It generalizes in time.
The dual relationship: In traditional physics, $f$ is known and $x$ is unknown (forward problem). Here, $x$ is known and $f$ is unknown (inverse problem).
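Both directions can be made concrete in code. A minimal sketch of the forward problem, using explicit Euler integration on a hypothetical decay rule (the rule and step sizes are illustrative, not from the lecture):

```python
import numpy as np

def euler_integrate(f, x0, dt, n_steps):
    """Forward problem: given the rule f, generate the trajectory x(t)."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] + dt * f(xs[-1]))  # explicit Euler step
    return np.array(xs)

# Hypothetical rule: exponential decay, x' = -0.5 x
traj = euler_integrate(lambda x: -0.5 * x, x0=1.0, dt=0.01, n_steps=1000)
```

The inverse problem of this lecture runs the other way: given `traj`, recover the rule (here, the coefficient -0.5).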

A Brief History

Discovering equations from data is not a new idea; scientists have long fitted model equations to observations by hand.

The key idea here: cast equation discovery as sparse linear regression over a library of candidate terms.

Case Study: Predator-Prey Dynamics

Hudson's Bay Company data (1900-1920): hare and lynx pelts traded.

[Figure: pelt counts over time. Hare (prey, $x$); lynx (predator, $y$).]
Clear oscillations, two coupled variables. Can we discover the governing equations $\dot{x} = f(x,y)$, $\dot{y} = g(x,y)$ from the data alone?

Proposing a Hypothesis Class

We have two variables $x$ (prey) and $y$ (predator). Assume the dynamics can be written as a linear combination of candidate terms:

$\dot{x} = \theta_0 + \theta_1 x + \theta_2 y + \theta_3 x^2 + \theta_4 y^2 + \theta_5 xy + \theta_6 \sin(x) + \cdots$

We don't know which terms matter. We include many candidates and let the algorithm select the active ones.

Sneak peek: the true equations turn out to be $\dot{x} = \alpha x - \beta xy$ and $\dot{y} = \delta xy - \gamma y$. Only a few terms survive out of many candidates!
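Before touching the real pelt data, one can generate a synthetic trajectory from these equations. A sketch with illustrative parameter values (assumed, not fitted to the Hudson's Bay data) and simple Euler stepping:

```python
import numpy as np

# Illustrative Lotka-Volterra parameters (assumed, not fitted to the pelt data)
alpha, beta, delta, gamma = 1.0, 0.1, 0.075, 1.5

def lotka_volterra(state):
    x, y = state
    return np.array([alpha * x - beta * x * y,    # prey:     x' = ax - bxy
                     delta * x * y - gamma * y])  # predator: y' = dxy - gy

# Generate a synthetic trajectory with small Euler steps
dt, n = 0.001, 20000
X = np.empty((n, 2))
X[0] = [10.0, 5.0]
for i in range(n - 1):
    X[i + 1] = X[i] + dt * lotka_volterra(X[i])
```

The resulting `X` plays the role of the measured pelt counts: oscillating, coupled, and ready for the discovery pipeline below.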

The Feature Library

The library encodes our hypothesis class: which terms could appear in the equation?

| Library choice | What it can discover |
|---|---|
| $[1, x, x^2, x^3]$ | Polynomial ODEs |
| $[x, \sin(x), \cos(x)]$ | Trigonometric dynamics (e.g. pendulum) |
| $[x, y, xy, x^2, y^2]$ | Coupled 2D systems (e.g. Lotka-Volterra) |
| $[u, u_x, u_{xx}, u \cdot u_x]$ | PDEs (diffusion, Burgers, etc.) |
More features = more expressive, but harder to identify the right terms. This is where sparsity comes in.
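One way to make a library concrete is as a list of named candidate functions, evaluated column by column. A minimal sketch matching the coupled-2D row of the table (the names and helper are mine):

```python
import numpy as np

# A small polynomial library for a 2-D system.
# Each entry pairs a printable name with a callable on the state columns.
library = [
    ("1",   lambda x, y: np.ones_like(x)),
    ("x",   lambda x, y: x),
    ("y",   lambda x, y: y),
    ("xy",  lambda x, y: x * y),
    ("x^2", lambda x, y: x**2),
    ("y^2", lambda x, y: y**2),
]

def build_Phi(x, y):
    """Evaluate every candidate term at every sample: shape (n, p)."""
    return np.column_stack([f(x, y) for _, f in library])

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, 1.0, 1.5])
Phi = build_Phi(x, y)   # shape (3, 6)
```

Keeping the names alongside the callables makes it easy to print the discovered equation at the end.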

Approximating $\dot{x}$ from Data

We need $\dot{x}$, but all we have are discrete samples $x(t_1), x(t_2), \dots, x(t_n)$.

Forward difference:

$\dot{x}(t_i) \approx \frac{x(t_{i+1}) - x(t_i)}{\Delta t}$

Central difference:

$\dot{x}(t_i) \approx \frac{x(t_{i+1}) - x(t_{i-1})}{2\Delta t}$
[Figure: true derivative vs. forward and central difference estimates, $\Delta t = 0.05$.]
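The accuracy gap between forward and central differences is easy to check numerically on a signal with a known derivative; the step size below matches the lecture's $\Delta t = 0.05$:

```python
import numpy as np

# Compare forward vs central differences on a signal with known derivative.
dt = 0.05
t = np.arange(0, 2 * np.pi, dt)
x = np.sin(t)                    # true derivative is cos(t)

fwd = (x[1:] - x[:-1]) / dt                 # O(dt)   at points t[:-1]
cen = (x[2:] - x[:-2]) / (2 * dt)           # O(dt^2) at points t[1:-1]

err_fwd = np.max(np.abs(fwd - np.cos(t[:-1])))
err_cen = np.max(np.abs(cen - np.cos(t[1:-1])))
# err_cen is much smaller than err_fwd at this step size
```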

Differentiation Methods

| Method | Accuracy | Noise sensitivity | Notes |
|---|---|---|---|
| Forward difference | $O(\Delta t)$ | High | Simplest |
| Central difference | $O(\Delta t^2)$ | High | Better accuracy |
| Savitzky-Golay | Depends on window | Low | Smoothing + differentiation |
| Fourier / spectral | Spectral | Medium | Global; periodic signals |
| Total variation | Regularized | Low | For very noisy data |
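The Savitzky-Golay row can be tried directly, assuming SciPy is available; the window length and polynomial order below are illustrative tuning knobs, not recommended values:

```python
import numpy as np
from scipy.signal import savgol_filter

dt = 0.05
t = np.arange(0, 2 * np.pi, dt)
rng = np.random.default_rng(0)
x = np.sin(t) + 0.01 * rng.standard_normal(t.size)   # noisy samples of sin(t)

# Savitzky-Golay: fit a cubic in a 15-sample sliding window, then
# differentiate the fit instead of the raw data.
dxdt_sg = savgol_filter(x, window_length=15, polyorder=3, deriv=1, delta=dt)
dxdt_cd = np.gradient(x, dt)             # plain central differences

interior = slice(10, -10)                # compare away from the boundaries
err_sg = np.max(np.abs(dxdt_sg - np.cos(t))[interior])
err_cd = np.max(np.abs(dxdt_cd - np.cos(t))[interior])
```

Central differences amplify the measurement noise by a factor of roughly $1/\Delta t$, while the local polynomial fit averages it away.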

From Data Points to Equations

Focus on $x$ with a quadratic hypothesis: $\dot{x}(t_i) = \theta_0 + \theta_1 \, x(t_i) + \theta_2 \, x(t_i)^2$

Each data point gives one equation in the same unknowns $\theta_0, \theta_1, \theta_2$.


The System in Matrix Form

Stacking all $n$ equations from the data points:

$\underbrace{\begin{bmatrix} \dot{x}(t_1) \\ \dot{x}(t_2) \\ \dot{x}(t_3) \\ \vdots \\ \dot{x}(t_n) \end{bmatrix}}_{\dot{\mathbf{X}}} = \underbrace{\begin{bmatrix} 1 & x(t_1) & x(t_1)^2 \\ 1 & x(t_2) & x(t_2)^2 \\ 1 & x(t_3) & x(t_3)^2 \\ \vdots & \vdots & \vdots \\ 1 & x(t_n) & x(t_n)^2 \end{bmatrix}}_{\boldsymbol{\Phi}} \underbrace{\begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \end{bmatrix}}_{\boldsymbol{\theta}}$
$n$ equations, $3$ unknowns. When $n \gg 3$: overdetermined. This is just linear regression: $\dot{\mathbf{X}} = \boldsymbol{\Phi}\,\boldsymbol{\theta}$.

The General Feature Matrix $\boldsymbol{\Phi}\big(\mathbf{x}(t)\big)$

For an arbitrary library $\phi_0, \phi_1, \dots, \phi_m$:

$\underbrace{\begin{bmatrix} \dot{x}(t_1) \\ \dot{x}(t_2) \\ \vdots \\ \dot{x}(t_n) \end{bmatrix}}_{\dot{\mathbf{X}}} = \underbrace{\begin{bmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_m(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_m(x_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(x_n) & \phi_1(x_n) & \cdots & \phi_m(x_n) \end{bmatrix}}_{\boldsymbol{\Phi}\big(\mathbf{x}(t)\big)} \underbrace{\begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_m \end{bmatrix}}_{\boldsymbol{\theta}}$

Compact form: $\dot{\mathbf{X}} = \boldsymbol{\Phi}\big(\mathbf{x}(t)\big) \, \boldsymbol{\theta}$. Each $\phi_j$ could be $x$, $x^2$, $\sin(x)$, $xy$, etc.

The Least Squares Solution

Minimize the squared error between predicted and observed derivatives:

$\boldsymbol{\theta}^* = \arg\min_{\boldsymbol{\theta}} \left\| \boldsymbol{\Phi} \, \boldsymbol{\theta} - \dot{\mathbf{X}} \right\|_2^2$

Closed-form solution via the normal equations:

$\boldsymbol{\theta}^* = \left(\boldsymbol{\Phi}^\top \boldsymbol{\Phi}\right)^{-1} \boldsymbol{\Phi}^\top \dot{\mathbf{X}}$
This gives the best fit using all features. But many coefficients will be small and spurious. We need sparsification.
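In code, the normal equations work, but `np.linalg.lstsq` (QR/SVD based) is the numerically safer route, since forming $\boldsymbol{\Phi}^\top \boldsymbol{\Phi}$ squares the condition number. A toy check on a quadratic library (data and coefficients made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.5, 2.0, size=200)
Phi = np.column_stack([np.ones_like(x), x, x**2])
theta_true = np.array([0.0, 2.0, -0.5])               # hypothetical true rule
dxdt = Phi @ theta_true + 0.01 * rng.standard_normal(200)

theta_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ dxdt)  # normal equations
theta_ls = np.linalg.lstsq(Phi, dxdt, rcond=None)[0]   # preferred in practice
```

Both recover the true coefficients here; on ill-conditioned libraries (many correlated features) the normal equations degrade first.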

Generalizing to Vector Systems

When $\mathbf{x} = (x_1, x_2, \dots, x_d)$ is a vector, each component gets its own equation but shares the same library:

$\dot{x}_k = \boldsymbol{\theta}_k^\top \boldsymbol{\phi}(\mathbf{x}), \quad k = 1, \dots, d$

Compact matrix form:

$\underbrace{\dot{\mathbf{X}}}_{\mathbb{R}^{n \times d}} = \underbrace{\boldsymbol{\Phi}(\mathbf{X})}_{\mathbb{R}^{n \times p}} \; \underbrace{\boldsymbol{\Theta}}_{\mathbb{R}^{p \times d}}$

Where $n$ = data points, $d$ = state dimension, $p$ = library size. Sparsify each column of $\boldsymbol{\Theta}$ independently.

Approaches to Sparsity

LASSO ($\ell_1$ regularization)

$\min_\theta \|\Phi\theta - \dot{X}\|_2^2 + \lambda \|\theta\|_1$

The $\ell_1$ penalty pushes small coefficients to exactly zero. We saw this with regularization.
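A quick LASSO sketch, assuming scikit-learn is available (the library, data, and `alpha` value are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso   # assuming scikit-learn is available

# The l1 penalty drives inactive library terms toward zero.
rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=300)
Phi = np.column_stack([np.ones_like(x), x, x**2, x**3])
dxdt = 1.5 * x - 0.8 * x**3 + 0.01 * rng.standard_normal(300)  # active: x, x^3

model = Lasso(alpha=0.01, fit_intercept=False).fit(Phi, dxdt)
theta = model.coef_
# Expect theta close to [0, 1.5, 0, -0.8], with slight l1 shrinkage
# on the two active terms
```

One caveat: the $\ell_1$ penalty biases the surviving coefficients toward zero, which is why SINDy-style pipelines often refit the active terms by plain least squares afterwards.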

Sequential Thresholding (STLS)

Repeat: solve, then set $|\theta_j| < \lambda$ to zero

The approach used in the original SINDy paper. Simple, effective, interpretable.

SINDy in Python


import numpy as np

# Inputs (assumed given): X is an (n, 2) array of state measurements,
# dt the sampling interval, threshold the sparsity knob (lambda)
dXdt = (X[2:] - X[:-2]) / (2 * dt)        # derivatives (central difference)

# Feature library: [1, x, y, x^2, y^2, xy], evaluated at the interior samples
x, y = X[1:-1, 0], X[1:-1, 1]
Phi = np.column_stack([np.ones_like(x), x, y, x**2, y**2, x*y])

Theta = np.linalg.lstsq(Phi, dXdt, rcond=None)[0]   # initial least-squares fit

# Sequential thresholding: zero out small coefficients, refit the survivors
for _ in range(10):
    small = np.abs(Theta) < threshold
    Theta[small] = 0
    for j in range(2):                     # one column per state variable
        active = ~small[:, j]
        if active.any():
            Theta[active, j] = np.linalg.lstsq(
                Phi[:, active], dXdt[:, j], rcond=None)[0]

SINDy: Step by Step

1. Build the feature library: Phi = [1, x, y, x^2, y^2, xy] (n data points × 6 features).
2. Solve least squares: theta = lstsq(Phi, X_dot) (best fit using ALL features).
3. Threshold small coefficients: theta[|theta| < λ] = 0 (enforce sparsity).
4. Re-solve using only the active terms: theta[active] = lstsq(Phi[:, active], X_dot); if the active set changed, go back to step 2.

Finding the Right Solution

SINDy converges, but is the answer correct? Several factors make this harder than it looks:

1. Noise corrupts derivatives
Small errors in $x(t)$ become large errors in $\dot{x}$. Noisy derivatives bias the least-squares solution and can activate spurious terms.
2. $\lambda$ is a hyperparameter
Too large $\rightarrow$ everything is zeroed out (trivial $\dot{x}=0$).
Too small $\rightarrow$ spurious terms survive (overfitting).
How do we measure "good"?
Training error (MSE on data used to fit) always decreases with more terms. We need a metric that captures generalization.
Train / test split
Hold out the last portion of the time series. Fit on training data, evaluate on held-out test data. The best $\lambda$ minimizes test error.
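The tuning loop can be sketched as follows; `stls` is my own minimal single-variable implementation of sequential thresholding, and the data and $\lambda$ grid are illustrative:

```python
import numpy as np

def stls(Phi, dxdt, lam, n_iter=10):
    """Sequentially thresholded least squares for one target variable."""
    theta = np.linalg.lstsq(Phi, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        theta[np.abs(theta) < lam] = 0.0
        active = np.abs(theta) > 0
        if active.any():
            theta[active] = np.linalg.lstsq(Phi[:, active], dxdt,
                                            rcond=None)[0]
    return theta

# Toy sweep: fit on the head of the series, score each lambda on the tail.
rng = np.random.default_rng(3)
x = rng.uniform(0.5, 2.0, size=400)
Phi = np.column_stack([np.ones_like(x), x, x**2, x**3])
dxdt = 2.0 * x - 0.5 * x**2 + 0.05 * rng.standard_normal(400)  # active: x, x^2

split = 300                               # chronological train / test split
test_mse = {}
for lam in (0.01, 0.1, 1.0):
    theta = stls(Phi[:split], dxdt[:split], lam)
    test_mse[lam] = np.mean((Phi[split:] @ theta - dxdt[split:]) ** 2)
```

Here an overly large $\lambda$ prunes a genuinely active term and the held-out error jumps, which is exactly the signal used to pick $\lambda$.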

Hyperparameter Tuning: $\lambda$ vs MSE

[Plot: train MSE and test MSE as functions of $\lambda$; each dot is one SINDy run at that $\lambda$.]

Part 2: From ODEs to PDEs

ML for Science and Engineering — Lecture 9

From ODEs to PDEs

What if our variable depends on space and time? $u(x, t)$

$\frac{\partial u}{\partial t} = f\!\left(u, \; \frac{\partial u}{\partial x}, \; \frac{\partial^2 u}{\partial x^2}, \; u \frac{\partial u}{\partial x}, \; \dots \right)$

Same idea: the feature library now includes spatial derivatives:

$\boldsymbol{\phi} = [1, \; u, \; u_x, \; u_{xx}, \; u^2, \; u \cdot u_x, \; u_{xxx}, \; \dots]$
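Computing such a library from gridded data is straightforward with `np.gradient`. A sketch on a hypothetical traveling-pulse field (the grids and the field itself are illustrative):

```python
import numpy as np

# Build PDE library terms on a space-time grid u[i, j] = u(x_i, t_j),
# assuming uniform grid spacings dx and dt.
dx, dt = 0.1, 0.01
x = np.arange(0, 10, dx)
t = np.arange(0, 1, dt)
X, T = np.meshgrid(x, t, indexing="ij")
u = np.exp(-((X - 5 - T) ** 2))            # illustrative traveling pulse

u_t = np.gradient(u, dt, axis=1)           # time derivative
u_x = np.gradient(u, dx, axis=0)           # first spatial derivative
u_xx = np.gradient(u_x, dx, axis=0)        # second spatial derivative

# Flatten every candidate term into one column per feature
features = np.column_stack([a.ravel() for a in (u, u_x, u_xx, u * u_x)])
```

For this pulse $u(x, t) = f(x - t)$, so $u_t \approx -u_x$ holds on the grid, a handy sanity check for the derivative code.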

Building the PDE Feature Matrix

At each grid point $(x_i, t_j)$, compute all candidate terms:

$u_t^{(i,j)} = \theta_1 \cdot u^{(i,j)} + \theta_2 \cdot u_x^{(i,j)} + \theta_3 \cdot u_{xx}^{(i,j)} + \theta_4 \cdot u^{(i,j)} u_x^{(i,j)} + \cdots$

Stack all interior grid points into one system. After sparsification:

Discover: $u_t = \alpha \, u_{xx}$ (the diffusion equation). Only $u_{xx}$ survives.

From Grid to System of Equations

At each interior grid point $(x_i, t_j)$, approximate the derivatives with finite differences:

$u_t^{(i,j)} \approx \frac{u^{(i,j+1)} - u^{(i,j)}}{\Delta t}, \quad u_x^{(i,j)} \approx \frac{u^{(i+1,j)} - u^{(i-1,j)}}{2\Delta x}, \quad u_{xx}^{(i,j)} \approx \frac{u^{(i+1,j)} - 2u^{(i,j)} + u^{(i-1,j)}}{\Delta x^2}$

Stacking into a Linear System

Each interior grid point gives one equation. Stack them all:

$\underbrace{\begin{bmatrix} u_t^{(2,1)} \\ u_t^{(3,1)} \\ u_t^{(2,2)} \\ \vdots \\ u_t^{(i,j)} \\ \vdots \end{bmatrix}}_{\dot{\mathbf{U}}} = \underbrace{\begin{bmatrix} u_x^{(2,1)} & u_{xx}^{(2,1)} \\ u_x^{(3,1)} & u_{xx}^{(3,1)} \\ u_x^{(2,2)} & u_{xx}^{(2,2)} \\ \vdots & \vdots \\ u_x^{(i,j)} & u_{xx}^{(i,j)} \\ \vdots & \vdots \end{bmatrix}}_{\boldsymbol{\Phi}} \underbrace{\begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}}_{\boldsymbol{\theta}}$
Same structure as the ODE case: $\dot{\mathbf{U}} = \boldsymbol{\Phi}\,\boldsymbol{\theta}$. Solve with least squares, then sparsify. The surviving terms reveal the PDE.
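As a sanity check, one can simulate the diffusion equation and recover $\alpha$ with exactly this stacked least-squares system. A sketch (grid sizes, $\alpha$, and the initial condition are illustrative):

```python
import numpy as np

# Simulate u_t = alpha * u_xx, then recover alpha from the data alone.
alpha_true = 0.5
dx, dt = 0.1, 0.005                       # dt < dx^2 / (2 alpha): stable
x = np.arange(0, 10 + dx, dx)
u = np.exp(-((x - 5) ** 2))               # initial condition: Gaussian bump

snapshots = [u.copy()]
for _ in range(200):                      # explicit finite-difference solver
    lap = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx**2
    u = u + dt * alpha_true * lap
    snapshots.append(u.copy())
U = np.array(snapshots).T                 # U[i, j] = u(x_i, t_j)

# Derivative features, keeping interior spatial rows only
u_t = np.gradient(U, dt, axis=1)[1:-1, :].ravel()
u_x = np.gradient(U, dx, axis=0)[1:-1, :].ravel()
u_xx = np.gradient(np.gradient(U, dx, axis=0), dx, axis=0)[1:-1, :].ravel()

Phi = np.column_stack([u_x, u_xx])        # two-term library, as on the slide
theta = np.linalg.lstsq(Phi, u_t, rcond=None)[0]
# theta[1] should land close to alpha_true, theta[0] close to 0
```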

PDEs You Can Discover from Data

With a richer library $[u, u_x, u_{xx}, u \cdot u_x, u_{xxx}, \dots]$, the same approach discovers many classical PDEs:

| Equation | Surviving features | Physics |
|---|---|---|
| Diffusion: $u_t = \alpha u_{xx}$ | $u_{xx}$ | Heat conduction |
| Advection: $u_t = -c \, u_x$ | $u_x$ | Wave transport |
| Burgers: $u_t = \nu u_{xx} - u \, u_x$ | $u_{xx}$, $u \cdot u_x$ | Shock formation |
| KdV: $u_t = -u \, u_x + \delta \, u_{xxx}$ | $u \cdot u_x$, $u_{xxx}$ | Solitons |
Despite the name, the method is not limited to nonlinear dynamics: it is really sparse identification of differential equations from data.

Rudy, Brunton, Proctor & Kutz, Science Advances (2017) — PDE-FIND: the extension of SINDy to partial differential equations.

Practical Considerations

Challenges

  • Noisy derivatives degrade results
  • Threshold $\lambda$ is a hyperparameter
  • Library must contain the right terms
  • Doesn't scale to very high-dimensional systems

Strengths

  • Interpretable: output is an equation
  • Simple: linear regression + thresholding
  • Computationally cheap
  • Works with moderate data

SINDy in Context

| Approach | Hypothesis class | Interpretable? | Data needed |
|---|---|---|---|
| Neural network | Any continuous function | No | Lots |
| SINDy | Sparse combination of known terms | Yes | Moderate |
| Symbolic regression | Arbitrary expressions | Yes | Moderate |
| Physics-informed NN | NN constrained by known PDE | Partially | Less |
SINDy's appeal: very easy to understand. It's linear regression with sparsification. That's all it is.

Summary

1. Derivative: compute $\dot{x}$ from data; handle noise carefully.
2. Library: build $\boldsymbol{\Phi}(\mathbf{x})$ from candidate terms; encode your hypothesis.
3. Sparsify: solve and threshold iteratively; discover the equation.

$\dot{\mathbf{X}} = \boldsymbol{\Phi}(\mathbf{X}) \, \boldsymbol{\theta} \quad \longrightarrow \quad$ discover $\boldsymbol{\theta}$ (sparse)

Brunton et al., PNAS (2016)  |  Rudy et al., Science Advances (2017)

Looking Ahead

  • Next lecture: Symbolic regression — what if the library itself is unknown?
  • Project idea: Find real-world time series data, apply SINDy, discover governing equations
  • Homework: apply SINDy to the predator-prey data and discover the governing equations; extend to a PDE (diffusion) using finite differences.