From Residual Minimization to Deep Learning for PDEs
ML for Science and Engineering
Given a PDE on domain \(\Omega\):
\[\mathcal{N}[u](x, t) = 0, \quad (x, t) \in \Omega\]
with boundary/initial conditions:
\[\mathcal{B}[u](x, t) = 0, \quad (x, t) \in \partial\Omega\]
Find \(u(x, t)\) that satisfies both the PDE and the conditions.
Concrete example: the heat equation
\[u_t = \alpha\, u_{xx}, \qquad x \in [0, 1]\]
IC: \(u(x, 0) = \sin(\pi x)\)
BC: \(u(0, t) = u(1, t) = 0\)
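For this IC/BC pair the heat equation \(u_t = \alpha\, u_{xx}\) has a closed-form solution by separation of variables, which makes it a convenient benchmark:
\[u(x, t) = e^{-\alpha \pi^2 t} \sin(\pi x)\]
Each mode \(\sin(k\pi x)\) decays at rate \(\alpha k^2 \pi^2\); this IC excites only \(k = 1\).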
Propose a solution as a linear combination of known functions:
\[u(x) \approx \sum_{i=1}^{N} \theta_i\, \phi_i(x)\]
Fourier modes: \(\phi_i(x) = \sin(i\pi x / L)\)
Polynomials: Legendre, Chebyshev
FEM shape functions: piecewise linear on a mesh
Substitute the hypothesis into the PDE. For the heat equation:
\[r(x, t; \theta) = \frac{\partial u_\theta}{\partial t} - \alpha\, \frac{\partial^2 u_\theta}{\partial x^2}\]
Choose a set of collocation points \(\{(x_k, t_k)\}\) in the domain.
Evaluate the residual at those points and minimize:
\[\mathcal{L}_{\text{res}}(\theta) = \frac{1}{N} \sum_{k=1}^{N} \big| r(x_k, t_k; \theta) \big|^2\]
Given observations \((x_j, t_j, u_j^{\text{data}})\), find coefficients by minimizing:
\[\mathcal{L}_{\text{data}}(\theta) = \frac{1}{M} \sum_{j=1}^{M} \big| u_\theta(x_j, t_j) - u_j^{\text{data}} \big|^2\]
PDE: \(u'' + \pi^2 u = 0\) on \([0,1]\), exact solution: \(\sin(\pi x)\). Build \(u(x) = \sum \theta_i \sin(i\pi x)\) by adjusting \(\theta_i\). Only \(\theta_1 = 1\) gives zero residual (red bars) at collocation points.
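This least-squares picture can be reproduced in a few lines. The sketch below assumes NumPy; since the \(i = 1\) mode has zero residual for any amplitude, the sketch adds one normalization condition \(u(1/2) = 1\) (an assumption of this sketch, standing in for boundary information) to pin the scale:

```python
import numpy as np

# PDE: u'' + pi^2 u = 0; hypothesis u(x) = sum_i theta_i sin(i*pi*x).
# The residual is linear in theta: r(x) = sum_i theta_i (pi^2 - (i*pi)^2) sin(i*pi*x).
n_modes = 3
x_col = np.linspace(0.1, 0.9, 7)                       # collocation points

# One row per collocation point; columns are the residual coefficients of theta_i.
A = np.array([[(np.pi**2 - (i * np.pi) ** 2) * np.sin(i * np.pi * xk)
               for i in range(1, n_modes + 1)] for xk in x_col])
b = np.zeros(len(x_col))

# Extra row for the normalization condition u(0.5) = 1.
A = np.vstack([A, [np.sin(i * np.pi * 0.5) for i in range(1, n_modes + 1)]])
b = np.append(b, 1.0)

theta, *_ = np.linalg.lstsq(A, b, rcond=None)
# theta ~ [1, 0, 0]: only the first Fourier mode survives.
```

Because the hypothesis is linear in \(\theta\), the minimization is an ordinary linear least-squares solve; with a neural network it becomes a nonlinear optimization.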
The basis expansion is linear in \(\theta_i\) and limited to the chosen \(\phi_i\).
What if we had a universal hypothesis that could learn any shape?
Replace the basis expansion with a neural network:
\[u_\theta(x, t) = \mathrm{NN}(x, t;\, \theta)\]
where \(\theta = \{W_1, b_1, W_2, b_2, \ldots\}\) are the network weights and biases.
Lagaris et al., Artificial Neural Networks for Solving ODEs and PDEs, IEEE TNN (1998) | Raissi, Perdikaris & Karniadakis, Physics-Informed Neural Networks, JCP (2019)
For the heat equation, the residual requires derivatives of \(u_\theta\):
\[\frac{\partial u_\theta}{\partial t} \quad \text{and} \quad \frac{\partial^2 u_\theta}{\partial x^2}\]
If \(u_\theta\) is a neural network, how do we compute these derivatives?
AD computes exact derivatives through the computational graph via the chain rule:
Higher-order derivatives too: \(\frac{\partial^2 u_\theta}{\partial x^2}\) by differentiating again through the graph.
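A quick sanity check of nested `torch.autograd.grad` calls, assuming PyTorch: the second AD derivative of \(\sin(x)\) should match \(-\sin(x)\).

```python
import torch

x = torch.linspace(0.0, 3.0, 50, requires_grad=True)
u = torch.sin(x)
# create_graph=True keeps the graph of u_x so it can be differentiated again.
(u_x,) = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)
(u_xx,) = torch.autograd.grad(u_x, x, torch.ones_like(u_x))
print(torch.allclose(u_xx, -torch.sin(x).detach()))  # True
```

Unlike finite differences, there is no step-size error: AD applies the analytic chain rule node by node.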
\[\mathcal{L}(\theta) = \lambda_d\, \mathcal{L}_{\text{data}} + \lambda_r\, \mathcal{L}_{\text{res}} + \lambda_b\, \mathcal{L}_{\text{bc}}\]
The weights \(\lambda_d, \lambda_r, \lambda_b\) balance the three objectives. Getting this balance right is one of the main challenges.
| | Classical (FEM / FD) | PINN |
|---|---|---|
| Domain | Mesh or grid required | Mesh-free (collocation points) |
| Basis | Pre-selected (polynomials, FEM) | Learned by network |
| Derivatives | Finite differences / weak form | Automatic differentiation |
| Data | Hard to incorporate | Natural (add data loss term) |
| High dimensions | Curse of dimensionality | Better scaling (in principle) |
| Accuracy | Can reach \(10^{-10}\) and below | Typically \(10^{-3}\) to \(10^{-5}\) |
| Cost | Fast for well-posed problems | Expensive training |
```python
import torch
import torch.nn as nn

class PINN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 20), nn.Tanh(),
            nn.Linear(20, 20), nn.Tanh(),
            nn.Linear(20, 1))

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=1))

def pde_residual(model, x, t, alpha):
    """Heat equation residual: u_t - alpha * u_xx."""
    x.requires_grad_(True); t.requires_grad_(True)
    u = model(x, t)
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t - alpha * u_xx
```
```python
from torch.optim import Adam

model = PINN(); optimizer = Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()
for epoch in range(N):
    x_r, t_r = random_collocation_points(N_coll)   # resampled each epoch
    optimizer.zero_grad()
    res = pde_residual(model, x_r, t_r, alpha)
    loss = mse(model(x_data, t_data), u_data) \
         + lambda_r * mse(res, torch.zeros_like(res))
    loss.backward(); optimizer.step()
```
`create_graph=True` preserves the computation graph so we can backpropagate through the AD derivatives. Collocation points can be resampled every epoch.
When the PDE and all parameters are known, and we have no interior data, the loss keeps only the residual and boundary/initial terms:
\[\mathcal{L}(\theta) = \lambda_r\, \mathcal{L}_{\text{res}} + \lambda_b\, \mathcal{L}_{\text{bc}}\]
Example: solve the ODE
\[u' + u = 0, \qquad u(0) = 1\]
Exact solution: \(u(x) = e^{-x}\)
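A minimal forward-problem sketch for this ODE (reading it as \(u' + u = 0,\ u(0) = 1\), consistent with the exact solution \(e^{-x}\)), assuming PyTorch. It uses a Lagaris-style trial function \(u_\theta(x) = 1 + x\, N_\theta(x)\), so the initial condition holds exactly and only the residual needs to be minimized:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 20), nn.Tanh(), nn.Linear(20, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

def u_trial(x):
    return 1.0 + x * net(x)        # u(0) = 1 is built in: no boundary loss needed

for step in range(2000):
    x = torch.rand(64, 1, requires_grad=True)          # collocation points in [0, 1]
    u = u_trial(x)
    (u_x,) = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)
    loss = ((u_x + u) ** 2).mean()                     # residual of u' + u = 0
    opt.zero_grad(); loss.backward(); opt.step()

print(u_trial(torch.tensor([[0.5]])).item())   # close to exp(-0.5) ~ 0.6065
```

Hard-constraining the IC this way is an alternative to the penalty term \(\lambda_b \mathcal{L}_{\text{bc}}\) and removes one loss-balancing knob.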
Damped oscillator: \(u'' + 0.6\,u' + 9\,u = 0\), \(u(0)=1,\; u'(0)=0\). Toggle loss components to see their effect on extrapolation.
We observe data for \(u(x)\) and know the PDE form, but some parameters are unknown.
Example: \(u' + \alpha u = 0\), we observe \(u\) at several points, \(\alpha\) is unknown.
ODE: \(u' + \alpha u = 0,\; u(0)=1\). We observe \(u\) at a few points. The network learns \(\alpha\) simultaneously.
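A sketch of this inverse setup, assuming PyTorch and synthetic observations generated from the exact solution \(e^{-\alpha x}\) with a hypothetical true value \(\alpha = 2\). The unknown \(\alpha\) is simply one more trainable parameter handed to the optimizer:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
alpha_true = 2.0
x_data = torch.linspace(0, 1, 11).reshape(-1, 1)
u_data = torch.exp(-alpha_true * x_data)     # synthetic observations of u = e^{-alpha x}

net = nn.Sequential(nn.Linear(1, 20), nn.Tanh(), nn.Linear(20, 1))
alpha = nn.Parameter(torch.tensor(0.5))      # unknown PDE parameter, trained jointly
opt = torch.optim.Adam(list(net.parameters()) + [alpha], lr=1e-2)

for step in range(3000):
    x = x_data.clone().requires_grad_(True)
    u = net(x)
    (u_x,) = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)
    # Data loss anchors u to the observations; the residual then identifies alpha.
    loss = ((net(x_data) - u_data) ** 2).mean() + ((u_x + alpha * u) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(alpha.item())   # close to 2.0, the value used to generate the data
```

The same pattern scales to PDEs: any unknown coefficient that enters the residual differentiably can be learned alongside \(\theta\).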
Training challenges, architectural variants, and practical guidance
Ref: Wang, Yu & Perdikaris, "When and why PINNs fail to train", JCP (2022)
Target: \(f(x) = \sin(x) + 0.5\sin(5x) + 0.3\sin(10x)\). Watch how the network captures low frequencies first.
| Strategy | How it works | Reference |
|---|---|---|
| Fixed weights | \(\mathcal{L} = \lambda_d \mathcal{L}_{\text{data}} + \lambda_r \mathcal{L}_{\text{res}}\) (manual tuning) | Raissi et al. (2019) |
| Learning rate annealing | Adjust \(\lambda\) based on gradient magnitudes of each loss term | Wang et al. (2021) |
| NTK-based weighting | Use Neural Tangent Kernel eigenvalues to balance convergence rates | Wang et al. (2022) |
| Self-adaptive weights | Make \(\lambda\) a trainable parameter optimized alongside \(\theta\) | McClenny & Braga-Neto (2023) |
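As an illustration of the adaptive strategies, here is a rough sketch of gradient-magnitude balancing in the spirit of Wang et al. (2021); the specific ratio and averaging constant are simplifications for exposition, not the paper's exact algorithm:

```python
import torch

def annealed_weight(loss_res, loss_data, params, lam, beta=0.9):
    """Rescale the data weight so both loss terms produce comparable gradients."""
    g_res = torch.autograd.grad(loss_res, params, retain_graph=True)
    g_data = torch.autograd.grad(loss_data, params, retain_graph=True)
    lam_hat = max(g.abs().max() for g in g_res) / \
              (torch.cat([g.abs().flatten() for g in g_data]).mean() + 1e-12)
    return beta * lam + (1 - beta) * lam_hat   # moving average keeps updates stable

# Toy usage with a hypothetical linear model standing in for the PINN.
net = torch.nn.Linear(2, 1)
x = torch.randn(16, 2)
loss_res = (net(x) ** 2).mean()
loss_data = ((net(x) - 1.0) ** 2).mean()
lam = annealed_weight(loss_res, loss_data, list(net.parameters()), lam=1.0)
```

In training, the returned `lam` would multiply \(\mathcal{L}_{\text{data}}\) in the next optimizer step and be refreshed every few hundred iterations.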
Map inputs through random Fourier features before the network:
\[\gamma(x) = \big[\cos(2\pi B x),\; \sin(2\pi B x)\big], \qquad B_{ij} \sim \mathcal{N}(0, \sigma^2)\]
Feed \(\gamma(x)\) into the network instead of raw \(x\). The random matrix \(B\) controls the frequency scale.
Ref: Tancik et al., "Fourier Features Let Networks Learn High Frequency Functions", NeurIPS (2020)
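A compact sketch of the mapping, assuming PyTorch; `FourierFeatures` is a hypothetical module name, and \(\sigma\) sets the frequency scale via \(B \sim \mathcal{N}(0, \sigma^2)\):

```python
import torch
import torch.nn as nn

class FourierFeatures(nn.Module):
    """gamma(x) = [cos(2 pi B x), sin(2 pi B x)] with fixed random B ~ N(0, sigma^2)."""
    def __init__(self, in_dim, n_features, sigma=1.0):
        super().__init__()
        # register_buffer: B is fixed (not trained) but moves with the model/device.
        self.register_buffer("B", torch.randn(in_dim, n_features) * sigma)

    def forward(self, x):
        proj = 2 * torch.pi * x @ self.B
        return torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)

# Prepend to the PINN: the first Linear now sees 2 * n_features inputs.
ff = FourierFeatures(in_dim=2, n_features=64, sigma=5.0)
net = nn.Sequential(ff, nn.Linear(128, 20), nn.Tanh(), nn.Linear(20, 1))
out = net(torch.rand(10, 2))   # shape (10, 1)
```

Larger \(\sigma\) exposes higher frequencies to the network, counteracting spectral bias, at the cost of noisier optimization.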
PINNs learn one solution for one PDE instance. What if we want to learn the solution operator?