Physics-Informed Neural Networks

From Residual Minimization to Deep Learning for PDEs

ML for Science and Engineering

The Problem We Want to Solve

Given a PDE on domain \(\Omega\):

\(\mathcal{N}[u](x, t) = 0, \;\; (x,t) \in \Omega\)

with boundary/initial conditions:

\(\mathcal{B}[u](x, t) = g(x, t), \;\; (x,t) \in \partial\Omega\)

Find \(u(x, t)\) that satisfies both the PDE and the conditions.

Concrete example: the heat equation

\(u_t = \alpha \, u_{xx}\)

IC: \(u(x, 0) = \sin(\pi x)\)
BC: \(u(0, t) = u(1, t) = 0\)

Exact: \(u(x,t) = e^{-\alpha \pi^2 t} \sin(\pi x)\)
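The exact solution can be checked symbolically. A quick sketch with sympy (symbol names are my own choice):

```python
import sympy as sp

x, t, alpha = sp.symbols('x t alpha', positive=True)
u = sp.exp(-alpha * sp.pi**2 * t) * sp.sin(sp.pi * x)

# PDE residual u_t - alpha * u_xx should vanish identically
residual = sp.diff(u, t) - alpha * sp.diff(u, x, 2)
print(sp.simplify(residual))            # 0

# Initial and boundary conditions
print(sp.simplify(u.subs(t, 0)))        # sin(pi*x)
print(u.subs(x, 0), u.subs(x, 1))       # 0 0
```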

Where We Are in the Course

Modeling
ODEs/PDEs, dynamical systems, numerical methods
Lectures 4–7
Data-Driven Discovery
SINDy, PDE-FIND, symbolic regression
Lectures 8–12
Deep Learning
NNs, CNNs, RNNs, PyTorch
Lectures 16–19
Today: Combine neural networks with physical laws to solve and discover PDEs

The Classical Approach: Basis Functions

Propose a solution as a linear combination of known functions:

\(u(x, t) = \sum_{i=1}^{N} \theta_i \, \phi_i(x, t)\)

Fourier modes: \(\phi_i(x) = \sin(i\pi x / L)\)

Polynomials: Legendre, Chebyshev

FEM shape functions: piecewise linear on a mesh

The idea: choose a hypothesis class, then determine the coefficients \(\theta_i\) from the data and the PDE.

Defining the Residual

Substitute the hypothesis into the PDE. For the heat equation:

\(\mathcal{R}(x, t;\, \theta_i) = \sum_{i} \theta_i \frac{\partial \phi_i}{\partial t} - \alpha \sum_{i} \theta_i \frac{\partial^2 \phi_i}{\partial x^2}\)
If \(u\) is the exact solution, \(\mathcal{R} = 0\) everywhere. For an approximate solution, we want \(\mathcal{R}\) as small as possible.
This is the central idea: measure how well a candidate solution satisfies the PDE, then minimize that error.

Collocation Methods

Choose a set of collocation points \(\{(x_k, t_k)\}\) in the domain.

Evaluate the residual at those points and minimize:

\(\min_{\theta_i} \sum_{k=1}^{N_r} \mathcal{R}(x_k, t_k;\, \theta_i)^2\)
This is the method of weighted residuals / collocation method. Well-established in applied math.
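To make this concrete, here is a minimal NumPy sketch of collocation least squares for the simpler problem \(u' + u = 0\), \(u(0) = 1\) (exact solution \(e^{-x}\)), using a monomial basis. The basis, point count, and boundary-condition weight are illustrative choices, not prescribed above:

```python
import numpy as np

# ODE: u'(x) + u(x) = 0 on [0, 1], u(0) = 1; exact solution exp(-x).
# Hypothesis: u(x) = sum_i theta_i * x**i (monomial basis, degree < 5).
deg = 5
xs = np.linspace(0, 1, 20)                             # collocation points

phi = np.stack([xs**i for i in range(deg)], axis=1)    # phi_i(x_k)
dphi = np.stack([np.zeros_like(xs)] +
                [i * xs**(i - 1) for i in range(1, deg)], axis=1)  # phi_i'(x_k)

# Residual rows: R(x_k) = sum_i theta_i [phi_i'(x_k) + phi_i(x_k)] = 0
A = dphi + phi
b = np.zeros(len(xs))

# Append a weighted boundary-condition row enforcing u(0) = 1
w = 10.0
A = np.vstack([A, w * np.eye(1, deg)])   # phi_i(0) = [1, 0, 0, 0, 0]
b = np.append(b, w * 1.0)

theta, *_ = np.linalg.lstsq(A, b, rcond=None)
err = np.max(np.abs(phi @ theta - np.exp(-xs)))
print(f"max error vs exp(-x): {err:.2e}")
```

Because the hypothesis is linear in \(\theta_i\), the minimization reduces to one linear least-squares solve; this is exactly what the neural-network version gives up in exchange for expressiveness.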

Adding Data: The Combined Loss

Given observations \((x_j, t_j, u_j^{\text{data}})\), find coefficients by minimizing:

\(\mathcal{L}(\theta_i) = \underbrace{\sum_{j=1}^{N_d} \left[ u_j^{\text{data}} - \sum_i \theta_i \phi_i(x_j, t_j) \right]^2}_{\text{Data-fitting term}} + \underbrace{\sum_{k=1}^{N_r} \mathcal{R}(x_k, t_k;\, \theta_i)^2}_{\text{Physics residual term}}\)
Combine data-fitting with physical constraints in a single objective. The data pulls toward observations; the residual ensures physical consistency.

Interactive: Basis Function Mixer

PDE: \(u'' + \pi^2 u = 0\) on \([0,1]\), exact solution: \(\sin(\pi x)\). Build \(u(x) = \sum \theta_i \sin(i\pi x)\) by adjusting \(\theta_i\). Only \(\theta_1 = 1\) gives zero residual (red bars) at collocation points.

From Linear to Universal Approximation

The basis expansion is linear in \(\theta_i\) and limited to the chosen \(\phi_i\).

\(u(x) = \sum_{i=1}^N \theta_i \, \phi_i(x)\)

What if we had a universal hypothesis that could learn any shape?

\(\sum_i \theta_i \, \phi_i(x) \;\longrightarrow\; u_\theta(x)\)
Neural networks are universal approximators. We don't need to choose basis functions — the network learns them.

The PINN: Neural Network as Solution

Replace the basis expansion with a neural network:

\(u(x, t) \approx u_\theta(x, t)\)

where \(\theta = \{W_1, b_1, W_2, b_2, \ldots\}\) are the network weights and biases.

The network directly outputs the solution value at any point in the domain. No mesh needed.
[Network diagram: inputs \((x, t)\) → hidden layers → output \(u_\theta\); typically tanh activations]

Lagaris et al., Artificial Neural Networks for Solving ODEs and PDEs, IEEE TNN (1998)  |  Raissi, Perdikaris & Karniadakis, Physics-Informed Neural Networks, JCP (2019)

Computing PDE Derivatives

For the heat equation, the residual requires derivatives of \(u_\theta\):

\(\mathcal{R} = \frac{\partial u_\theta}{\partial t} - \alpha \frac{\partial^2 u_\theta}{\partial x^2}\)

If \(u_\theta\) is a neural network, how do we compute these derivatives?

Finite differences?
Approximation errors, grid dependence, costly for higher-order derivatives
Something better...
Neural networks are compositions of differentiable functions. We can use the chain rule.

Automatic Differentiation

AD computes exact derivatives through the computational graph via the chain rule:

\((x, t) \xrightarrow{W_1, b_1} h_1 \xrightarrow{W_2, b_2} \cdots \xrightarrow{} u_\theta\)
\(\frac{\partial u_\theta}{\partial x} = \frac{\partial u_\theta}{\partial h_L} \cdot \frac{\partial h_L}{\partial h_{L-1}} \cdots \frac{\partial h_1}{\partial x}\)

Higher-order derivatives too: \(\frac{\partial^2 u_\theta}{\partial x^2}\) by differentiating again through the graph.

Unlike finite differences, AD gives exact derivatives (up to floating point). No mesh or grid needed.
This is efficient: PyTorch and TensorFlow compute each required PDE derivative with an extra backward pass through the graph, at a cost comparable to a few forward evaluations of the network.
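A minimal PyTorch sketch of the idea: take the first and second derivatives of \(\tanh\) with `torch.autograd.grad` and compare against the analytic formulas. (Summing the output before calling `grad` is valid here because each output element depends only on its own input element.)

```python
import torch

x = torch.linspace(-2.0, 2.0, 50, requires_grad=True)
u = torch.tanh(x)

# First derivative via AD; create_graph=True lets us differentiate again
(u_x,) = torch.autograd.grad(u.sum(), x, create_graph=True)

# Second derivative: differentiate the derivative through the same graph
(u_xx,) = torch.autograd.grad(u_x.sum(), x)

# Analytic: d/dx tanh = 1 - tanh^2,  d2/dx2 tanh = -2 tanh (1 - tanh^2)
t = torch.tanh(x).detach()
print(torch.allclose(u_x.detach(), 1 - t**2, atol=1e-6))   # True
print(torch.allclose(u_xx, -2 * t * (1 - t**2), atol=1e-6))  # True
```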

The PINN Loss Function

\(\mathcal{L}(\theta) = \lambda_d \, \mathcal{L}_{\text{data}} + \lambda_r \, \mathcal{L}_{\text{residual}} + \lambda_b \, \mathcal{L}_{\text{boundary}}\)
Data: \(\frac{1}{N_d}\sum_j (u_\theta - u_j^{\text{data}})^2\)
Residual: \(\frac{1}{N_r}\sum_k [\mathcal{N}[u_\theta]]^2\)
Boundary/IC: \(\frac{1}{N_b}\sum_m (u_\theta - g_m)^2\)

The weights \(\lambda_d, \lambda_r, \lambda_b\) balance the three objectives. Getting this balance right is one of the main challenges.

PINN vs Classical Methods

| | Classical (FEM / FD) | PINN |
|---|---|---|
| Domain | Mesh or grid required | Mesh-free (collocation points) |
| Basis | Pre-selected (polynomials, FEM shape functions) | Learned by network |
| Derivatives | Finite differences / weak form | Automatic differentiation |
| Data | Hard to incorporate | Natural (add data loss term) |
| High dimensions | Curse of dimensionality | Better scaling (in principle) |
| Accuracy | Can reach \(10^{-10}\) and below | Typically \(10^{-3}\) to \(10^{-5}\) |
| Cost | Fast for well-posed problems | Expensive training |

Typical PINN Architecture

  • Input: spatial + temporal coordinates \((x, t)\) or \((x_1, \ldots, x_d, t)\)
  • Hidden layers: 3–6 fully connected layers, 20–50 neurons each
  • Activation: \(\tanh\) (smooth, infinitely differentiable)
  • Output: solution \(u(x,t)\) (scalar, or vector for systems)
ReLU is NOT suitable for PINNs: its second derivative is zero almost everywhere (and undefined at the origin). The residual of any second-order PDE would therefore carry no gradient signal through ReLU activations.
Common choices: \(\tanh\), \(\sin\) (SIREN networks), or Swish. All are smooth with nonzero higher-order derivatives.

PyTorch Implementation

import torch
import torch.nn as nn

class PINN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, 20), nn.Tanh(),
            nn.Linear(20, 20), nn.Tanh(),
            nn.Linear(20, 1))

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=1))

def pde_residual(model, x, t, alpha):
    """Heat equation: u_t - alpha * u_xx = 0"""
    x.requires_grad_(True); t.requires_grad_(True)
    u = model(x, t)
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t - alpha * u_xx

model = PINN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()
for epoch in range(n_epochs):
    # resample collocation points in the unit domain each epoch
    x_r, t_r = torch.rand(N_coll, 1), torch.rand(N_coll, 1)
    res = pde_residual(model, x_r, t_r, alpha)
    loss = mse(model(x_data, t_data), u_data) + lambda_r * res.pow(2).mean()
    optimizer.zero_grad()   # clear accumulated gradients before backward
    loss.backward()
    optimizer.step()
create_graph=True preserves the computation graph so we can backpropagate through the AD derivatives. Collocation points can be resampled every epoch.

Forward Problem: Solving a Known PDE

When the PDE and all parameters are known, and we have no interior data:

\(\mathcal{L}(\theta) = \mathcal{L}_{\text{residual}} + \mathcal{L}_{\text{boundary}} + \mathcal{L}_{\text{IC}}\)

Example: solve the ODE

\(u'(x) + u(x) = 0, \quad u(0) = 1\)

Exact solution: \(u(x) = e^{-x}\)

The network learns the solution purely from the equation and boundary/initial conditions. No data from the solution itself is needed.
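A minimal end-to-end sketch of this forward solve, with illustrative hyperparameters (network size, learning rate, step count are my choices): a small tanh network is trained on the residual of \(u' + u = 0\) plus the initial condition, with no solution data at all.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 20), nn.Tanh(),
                    nn.Linear(20, 20), nn.Tanh(),
                    nn.Linear(20, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
x0 = torch.zeros(1, 1)                         # initial-condition point

for step in range(2000):
    x = torch.rand(64, 1, requires_grad=True)  # collocation points in [0, 1]
    u = net(x)
    (u_x,) = torch.autograd.grad(u.sum(), x, create_graph=True)
    # residual loss for u' + u = 0, plus initial-condition penalty u(0) = 1
    loss = torch.mean((u_x + u)**2) + (net(x0) - 1.0).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

xt = torch.linspace(0, 1, 50).unsqueeze(1)
err = (net(xt) - torch.exp(-xt)).abs().max().item()
print(f"max |u_theta - exp(-x)| on [0,1]: {err:.1e}")
```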

PINN Training: Loss Ablation

Damped oscillator: \(u'' + 0.6\,u' + 9\,u = 0\), \(u(0)=1,\; u'(0)=0\). Toggle loss components to see their effect on extrapolation.

Inverse Problem: Discovering Parameters

We observe data for \(u(x)\) and know the PDE form, but some parameters are unknown.

Example: \(u' + \alpha u = 0\), we observe \(u\) at several points, \(\alpha\) is unknown.

\(\mathcal{L}(\theta, \alpha) = \mathcal{L}_{\text{data}}(\theta) + \mathcal{L}_{\text{res}}(\theta, \alpha)\)
Key idea: treat \(\alpha\) as a trainable parameter alongside \(\theta\). Compute \(\frac{\partial \mathcal{L}}{\partial \alpha}\) via AD, just like network gradients.
Both \(\theta\) (network weights) and \(\alpha\) (physical parameter) are optimized simultaneously via gradient descent.
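A sketch of the joint optimization on synthetic data (true \(\alpha = 2\), hyperparameters illustrative): register \(\alpha\) as an `nn.Parameter` and hand it to the same optimizer as the network weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
alpha_true = 2.0
x_data = torch.linspace(0, 1, 10).unsqueeze(1)
u_data = torch.exp(-alpha_true * x_data)        # synthetic observations

net = nn.Sequential(nn.Linear(1, 20), nn.Tanh(),
                    nn.Linear(20, 20), nn.Tanh(),
                    nn.Linear(20, 1))
alpha = nn.Parameter(torch.tensor(0.5))         # unknown physical parameter
opt = torch.optim.Adam(list(net.parameters()) + [alpha], lr=1e-2)

for step in range(3000):
    x = torch.rand(64, 1, requires_grad=True)   # collocation points
    u = net(x)
    (u_x,) = torch.autograd.grad(u.sum(), x, create_graph=True)
    # data loss pulls u_theta toward observations;
    # residual loss u' + alpha*u = 0 couples alpha to the fit
    loss = torch.mean((net(x_data) - u_data)**2) \
         + torch.mean((u_x + alpha * u)**2)
    opt.zero_grad(); loss.backward(); opt.step()

print(f"recovered alpha: {alpha.item():.3f}")   # close to 2.0
```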

Inverse Problem Demo

ODE: \(u' + \alpha u = 0,\; u(0)=1\). We observe \(u\) at a few points. The network learns \(\alpha\) simultaneously.

Inverse Problems: Applications

  • Material properties: discover diffusivity \(\alpha\) from temperature data
  • Fluid dynamics: infer viscosity from velocity measurements
  • Epidemiology: estimate infection rates from case data
  • Structural health: locate damage from vibration data
Key advantage: PINNs handle noisy, sparse, and heterogeneous data naturally. The physics residual acts as a regularizer, preventing overfitting to noise.

Beyond the Basics

Training challenges, architectural variants, and practical guidance

Why PINNs Can Fail

1. Spectral bias
Neural networks learn low-frequency components first and struggle with high-frequency features. Solutions with sharp gradients or multi-scale behavior are hard.
2. Loss balancing
\(\mathcal{L}_{\text{data}}\) and \(\mathcal{L}_{\text{residual}}\) can differ by orders of magnitude. One term dominates training, the other is ignored.
3. Optimization landscape
The combined loss can have sharp, narrow minima. Standard optimizers (Adam, L-BFGS) may get stuck or converge slowly.

Ref: Wang, Yu & Perdikaris, "When and why PINNs fail to train", JCP (2022)

Spectral Bias in Action

Target: \(f(x) = \sin(x) + 0.5\sin(5x) + 0.3\sin(10x)\). Watch how the network captures low frequencies first.

Loss Balancing Strategies

| Strategy | How it works | Reference |
|---|---|---|
| Fixed weights | \(\mathcal{L} = \lambda_d \mathcal{L}_{\text{data}} + \lambda_r \mathcal{L}_{\text{res}}\) (manual tuning) | Raissi et al. (2019) |
| Learning rate annealing | Adjust \(\lambda\) based on gradient magnitudes of each loss term | Wang et al. (2021) |
| NTK-based weighting | Use Neural Tangent Kernel eigenvalues to balance convergence rates | Wang et al. (2022) |
| Self-adaptive weights | Make \(\lambda\) a trainable parameter optimized alongside \(\theta\) | McClenny & Braga-Neto (2023) |
No single strategy works universally. This remains an active research area. Start with fixed weights and tune manually.

Fourier Feature Networks

Map inputs through random Fourier features before the network:

\(\gamma(x) = \begin{bmatrix} \cos(2\pi B x) \\ \sin(2\pi B x) \end{bmatrix}, \quad B \sim \mathcal{N}(0, \sigma^2)\)

Feed \(\gamma(x)\) into the network instead of raw \(x\). The random matrix \(B\) controls the frequency scale.

This one modification can dramatically improve PINN performance on problems with multi-scale solutions, by overcoming spectral bias.

Ref: Tancik et al., "Fourier Features Let Networks Learn High Frequency Functions", NeurIPS (2020)
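The mapping \(\gamma\) takes only a few lines of PyTorch. A sketch, with the feature count and \(\sigma\) as illustrative choices:

```python
import torch

def fourier_features(x, B):
    """gamma(x) = [cos(2*pi*B x), sin(2*pi*B x)] for input x of shape (n, d)."""
    proj = 2 * torch.pi * (x @ B.T)
    return torch.cat([torch.cos(proj), torch.sin(proj)], dim=1)

torch.manual_seed(0)
sigma = 5.0                          # frequency scale of the mapping
B = sigma * torch.randn(64, 1)       # 64 random frequencies, entries ~ N(0, sigma^2)

x = torch.linspace(0, 1, 100).unsqueeze(1)
gamma = fourier_features(x, B)
print(gamma.shape)                   # torch.Size([100, 128])
# The network then consumes 128 features instead of the raw coordinate, e.g.
# net = nn.Sequential(nn.Linear(128, 20), nn.Tanh(), ..., nn.Linear(20, 1))
```

Larger \(\sigma\) biases the network toward higher-frequency solutions; it is a tuning knob, not a free lunch.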

Other PINN Variants

  • hp-VPINNs: Variational PINNs using test functions (weak form instead of strong form). Domain decomposition with local test functions for better accuracy.
  • cPINNs: Conservative PINNs that enforce conservation laws at subdomain interfaces.
  • XPINNs: Extended PINNs with domain decomposition for large or complex domains.
  • Causal PINNs: Enforce temporal causality during training: solve earlier times before later times, respecting the arrow of time.
The PINN framework is modular: each variant addresses a specific limitation while keeping the core residual-minimization idea intact.

When to Use PINNs (and When Not To)

PINNs shine when:
  • Sparse or noisy data + known physics
  • Inverse problems (parameter discovery)
  • High-dimensional PDEs
  • Complex geometries (mesh-free)
  • Multi-physics coupling
  • Transfer learning across conditions
Classical methods are better when:
  • Well-posed forward problem, known PDE
  • Very high accuracy needed (\(10^{-8}\))
  • Real-time inference (training is slow)
  • Low-dimensional, simple geometry
  • Well-established solver exists


Summary

1. Classical roots
Basis expansion + residual minimization (collocation methods). A long tradition in applied math.
2. PINNs
Replace basis with a neural network. Use AD for derivatives. Combine data + physics in one loss.
3. Challenges
Spectral bias, loss balancing, optimization difficulties. Active research on variants and fixes.
\(\mathcal{L}(\theta) = \mathcal{L}_{\text{data}} + \mathcal{L}_{\text{residual}} + \mathcal{L}_{\text{boundary}}\)

Conclusion

PINNs learn one solution for one PDE instance. What if we want to learn the solution operator?

PINN:   \(f \mapsto u_\theta\)   (one input → one solution)

Neural Operator:   \(\mathcal{G}_\theta: f \mapsto u\)   (any input function → solution)
Next lectures: Neural operators (DeepONets, Fourier Neural Operators) learn a family of solutions at once. Train once, evaluate for any new input instantly.