Introduction to Deep Learning

From Linear Regression to Neural Networks


ML for Science and Engineering — Lecture 11
Joseph Bakarji

Why Does Deep Learning Work?

The success of deep learning reveals two underappreciated principles about intelligence:

  • Self-organization: intelligence emerges from simple elements that learn to coordinate
  • Data-driven adaptation: enabling a system to learn from its environment is more powerful than encoding rules
Core Insight
Deep networks are collections of simple elements that collaborate to process inputs and predict outputs. Their power comes not from any single element, but from their collective adaptation to data.

Rules vs. the Ability to Learn

Rule-Based Approach

  • Encode expert knowledge as explicit rules
  • Define what to do in each situation
  • Brittle when the environment changes
  • Cannot adapt to unseen contexts

Learning-Based Approach

  • Give the system the ability to learn
  • Let it adapt to its environment autonomously
  • Flexible in complex, unpredictable settings
  • More powerful when you cannot predict every case
Do you teach your children rigid rules for every situation, or do you give them the ability to adapt to any context they might encounter?

The Intellectual Milieu of the 1940s

An extraordinary convergence of ideas set the stage for artificial intelligence:

  • Russell & Whitehead argued in Principia Mathematica (1910-1913) that mathematics can be derived from formal logic. If logic is universal, and machines can perform logic...
  • Von Neumann designed the stored-program computer (EDVAC, 1945), using McCulloch-Pitts neurons as the conceptual model.
  • Norbert Wiener founded cybernetics (1948): feedback, control, and communication in animals and machines.
The reasoning:

Logic can express any computation.
Neurons perform logical operations.
Therefore, networks of neurons can compute anything.

This launched AI as a field.

Walter Pitts and the Dream of Logical Intelligence

Walter Pitts was a self-taught, homeless teenager from Detroit who, at age 12, read Principia Mathematica and wrote to Bertrand Russell pointing out errors. Russell invited Pitts to study with him.

Pitts met Warren McCulloch, a neurophysiologist asking: how does the brain compute? They combined propositional logic with neural anatomy to create the first formal neuron model.

Their 1943 paper showed that networks of binary threshold units could implement any logical function, founding both neural networks and digital computing theory.

The Macy Conferences (1946-1953)
McCulloch chaired interdisciplinary meetings with von Neumann, Wiener, Pitts, Bateson, and Margaret Mead. Here cybernetics crystallized, and the ideas that would become AI, information theory, and cognitive science were debated.

Macy Conferences · The Story of Walter Pitts (Nautilus)

The McCulloch-Pitts Neuron (1943)

A binary threshold unit:

\( y = \begin{cases} 1 & \text{if } \sum_i w_i x_i \geq \theta \\ 0 & \text{otherwise} \end{cases} \)

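The unit above can be sketched in a few lines. The weights and thresholds below are illustrative choices (not from the lecture) that realize basic logic gates, which the 1943 paper showed is always possible:

```python
def mp_neuron(x, w, theta):
    """McCulloch-Pitts unit: fire (1) iff the weighted input sum reaches theta."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= theta else 0

# Logic gates as particular weight/threshold choices
AND = lambda a, b: mp_neuron([a, b], [1, 1], theta=2)
OR  = lambda a, b: mp_neuron([a, b], [1, 1], theta=1)
NOT = lambda a: mp_neuron([a], [-1], theta=0)
```

Composing such gates yields any Boolean function, which is the sense in which networks of these units "compute anything."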

From Theory to Biology: The Squid Giant Axon

While McCulloch-Pitts modeled neurons as logical abstractions, experimentalists studied the real thing. The squid giant axon (1000x thicker than human axons) was large enough to insert electrodes inside.

1939 Cole & Curtis develop voltage clamp: hold membrane potential fixed, measure current.
1952 Hodgkin & Huxley fit ODEs to voltage clamp data with gating variables \(m, h, n\) governed by first-order kinetics.
1963 Nobel Prize for Hodgkin, Huxley, and Eccles.
A triumph of scientific modeling: Hodgkin and Huxley had no computers. They solved the ODEs by hand with a desk calculator, predicting the action potential shape before it could be measured at full resolution.

Historical Perspective (PMC)

The Hodgkin-Huxley Model

\( C_m \frac{dV}{dt} = I_{\text{ext}} - g_{Na} m^3 h (V - E_{Na}) - g_K n^4 (V - E_K) - g_L (V - E_L) \)

The gating variables \(m, h, n \in [0,1]\) model ion channel opening probabilities. Na\(^+\) channels need 3 activation gates (\(m^3\)) and 1 inactivation gate (\(h\)); K\(^+\) channels need 4 gates (\(n^4\)). Each satisfies:

\( \frac{dx}{dt} = \alpha_x(V)(1-x) - \beta_x(V)x \)

where \(\alpha_x, \beta_x\) are voltage-dependent rate functions that H&H fit to experimental data as combinations of exponentials and rationals, e.g.:

\( \alpha_m(V) = \frac{0.1(V+40)}{1 - e^{-(V+40)/10}} \)

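The model above can be integrated with forward Euler. The sketch below uses the standard Hodgkin-Huxley parameter values and rate functions; the time step, duration, and injected current are illustrative choices, not values from the lecture:

```python
import numpy as np

# Standard HH parameters (units: mV, ms, mS/cm^2, uF/cm^2)
C_m, g_Na, g_K, g_L = 1.0, 120.0, 36.0, 0.3
E_Na, E_K, E_L = 50.0, -77.0, -54.387

# Voltage-dependent rate functions fit by H&H (modern sign convention)
a_m = lambda V: 0.1 * (V + 40) / (1 - np.exp(-(V + 40) / 10))
b_m = lambda V: 4.0 * np.exp(-(V + 65) / 18)
a_h = lambda V: 0.07 * np.exp(-(V + 65) / 20)
b_h = lambda V: 1 / (1 + np.exp(-(V + 35) / 10))
a_n = lambda V: 0.01 * (V + 55) / (1 - np.exp(-(V + 55) / 10))
b_n = lambda V: 0.125 * np.exp(-(V + 65) / 80)

def simulate(I_ext=10.0, T=50.0, dt=0.01):
    V = -65.0
    # Start gating variables at their resting steady states x_inf = a / (a + b)
    m = a_m(V) / (a_m(V) + b_m(V))
    h = a_h(V) / (a_h(V) + b_h(V))
    n = a_n(V) / (a_n(V) + b_n(V))
    Vs = []
    for _ in range(int(T / dt)):
        I_ion = (g_Na * m**3 * h * (V - E_Na)
                 + g_K * n**4 * (V - E_K)
                 + g_L * (V - E_L))
        V += dt * (I_ext - I_ion) / C_m              # membrane equation
        m += dt * (a_m(V) * (1 - m) - b_m(V) * m)    # first-order gating kinetics
        h += dt * (a_h(V) * (1 - h) - b_h(V) * h)
        n += dt * (a_n(V) * (1 - n) - b_n(V) * n)
        Vs.append(V)
    return np.array(Vs)

V = simulate()
```

With this level of injected current the membrane fires repeated action potentials, with the voltage swinging from rest near -65 mV to a peak above 0 mV.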

Two Views of the Neuron

              McCulloch-Pitts (1943)                   Hodgkin-Huxley (1952)
Nature        Computational / Logical                  Dynamical / Physical
Output        Binary: 0 or 1                           Continuous voltage \(V(t)\)
Time          Instantaneous                            Evolves via coupled ODEs
Parameters    Weights \(w_i\), threshold \(\theta\)    Conductances, Nernst potentials, gating kinetics
Discovery     Theoretical construction from logic      Fit from voltage clamp experiments
Legacy        Neural networks, deep learning, AI       Computational neuroscience, biophysics
Modern deep learning descends from the McCulloch-Pitts tradition: simplified, computational units whose power comes from collective self-organization, not biological fidelity. The architecture matters less than the ability to adapt.

From Linear Models
to Deep Networks

Building the mathematical foundations

The Linear Hypothesis

The simplest supervised learning model assumes a linear relationship between inputs and outputs:

\( \hat{y} = f_\mathbf{w}(\mathbf{x}) = \mathbf{w} \cdot \boldsymbol{\phi}(\mathbf{x}) = \sum_{j=1}^p w_j \phi_j(\mathbf{x}) \)

The feature vector \(\boldsymbol{\phi}(\mathbf{x})\) encodes our assumptions about the data:

\(\boldsymbol{\phi}(\mathbf{x}) = [1, x]\)   (linear)

\(\boldsymbol{\phi}(\mathbf{x}) = [1, x, x^2, x^3]\)   (polynomial)

\(\boldsymbol{\phi}(\mathbf{x}) = [1, x, \sin(3x)]\)   (custom)

\(\boldsymbol{\phi}(\mathbf{x}) = \;?\;?\;?\)   (what if we could learn this?)

Recall: We split data into training, validation, and test sets to select features and evaluate generalization. The question remains: how do we choose \(\boldsymbol{\phi}\)?

Neural networks answer this question by learning the features directly from data.
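Before features are learned, the hand-crafted case is an ordinary least-squares problem. A minimal sketch with the polynomial feature map above, on a hypothetical toy dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
y = 0.5 * x**3 - x + 0.1 * rng.standard_normal(50)   # hypothetical noisy samples

# Hand-crafted feature map: phi(x) = [1, x, x^2, x^3]
Phi = np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

# Linear least squares: w_hat = argmin_w ||Phi w - y||^2
w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
y_hat = Phi @ w_hat
```

The model is linear in the weights even though phi is nonlinear in x; everything that follows replaces the fixed Phi with a learned one.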

Linear Predictor: Network View

For multiple inputs and outputs:

\( \hat{\mathbf{y}} = W\mathbf{x} + \mathbf{b} \)

In index notation:

\( \hat{y}_i = \sum_{j=1}^n W_{ij} x_j + b_i \)

where \(W \in \mathbb{R}^{m \times n}\), \(\mathbf{b} \in \mathbb{R}^m\).

Each output is a weighted sum of all inputs. The network diagram makes this structure visible:

Activation Functions

To introduce nonlinearity, we pass the linear output through an activation function \(f\):

\( \hat{\mathbf{y}} = f(W\mathbf{x} + \mathbf{b}) \)
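The lecture leaves f generic here; ReLU and sigmoid appear later in the backpropagation example and the universal-approximation demo. A minimal sketch of these common choices:

```python
import numpy as np

def relu(z):
    """max(0, z), elementwise."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Squashes z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squashes z into (-1, 1)."""
    return np.tanh(z)
```

Without such a nonlinearity, stacking linear layers collapses to a single linear map, since \(W_2(W_1\mathbf{x}) = (W_2 W_1)\mathbf{x}\).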

The Neural Network

A single hidden layer:

\( \hat{\mathbf{y}} = f\bigl(W_2 \;f(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2\bigr) \)
  • \(\mathbf{v} = f(W_1 \mathbf{x} + \mathbf{b}_1)\) — hidden activations
  • \(\hat{\mathbf{y}} = f(W_2 \mathbf{v} + \mathbf{b}_2)\) — output
  • The hidden layer learns a feature representation \(\mathbf{v} = \phi(\mathbf{x})\)
Instead of hand-crafting \(\phi(\mathbf{x})\), the network learns it from data.

Going Deep

Stack multiple hidden layers to build hierarchical representations:

\( \hat{\mathbf{y}} = f\bigl(W_L \;f(\cdots f(W_2 \;f(W_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2)\cdots) + \mathbf{b}_L\bigr) \)

\(\mathbf{v}^{(1)} = f(W_1\mathbf{x} + \mathbf{b}_1)\)

\(\mathbf{v}^{(2)} = f(W_2\mathbf{v}^{(1)} + \mathbf{b}_2)\)

\(\mathbf{v}^{(3)} = f(W_3\mathbf{v}^{(2)} + \mathbf{b}_3)\)

\(\hat{\mathbf{y}} = f(W_4\mathbf{v}^{(3)} + \mathbf{b}_4)\)
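The layer-by-layer composition above is just a loop. A minimal forward-pass sketch with hypothetical layer widths and random weights (initialization scale chosen arbitrarily for illustration):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases, f=relu):
    """Apply v <- f(W v + b) for each layer in turn; return the final output."""
    v = x
    for W, b in zip(weights, biases):
        v = f(W @ v + b)
    return v

rng = np.random.default_rng(0)
sizes = [2, 5, 4, 3, 1]   # hypothetical widths: x -> v1 -> v2 -> v3 -> y_hat
weights = [0.5 * rng.standard_normal((m, n)) for n, m in zip(sizes, sizes[1:])]
biases  = [np.zeros(m) for m in sizes[1:]]

y_hat = forward(np.array([1.0, -2.0]), weights, biases)
```

Each intermediate v is one of the hidden representations \(\mathbf{v}^{(l)}\); depth is just the length of the list.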

Why Depth? Hierarchical Feature Learning

Deep networks learn features at increasing levels of abstraction, from simple edges to complex concepts.

Layer 1

Edges, gradients,
simple patterns

Layer 2-3

Textures, parts,
local combinations

Deeper Layers

Objects, faces,
semantic concepts

Reference: Lee et al., "Unsupervised learning of hierarchical representations with convolutional deep belief networks," Communications of the ACM (2011).

Training Neural Networks

Loss functions, gradients, and backpropagation

Loss Functions

Regression (MSE)

\( \mathcal{L}_{\text{MSE}} = \frac{1}{n}\sum_{i=1}^n \|\hat{\mathbf{y}}^{(i)} - \mathbf{y}^{(i)}\|^2 \)

Squared distance between predictions and targets. Differentiable everywhere, standard for continuous outputs.

Classification (Binary Cross-Entropy)

\( \mathcal{L}_{\text{BCE}} = -\frac{1}{n}\sum_{i=1}^n \bigl[y^{(i)}\log\hat{y}^{(i)} + (1-y^{(i)})\log(1-\hat{y}^{(i)})\bigr] \)

Measures how well predicted probabilities match binary labels. Derived from maximum likelihood of Bernoulli distribution.

Objective: Find weights \(\hat{W} = \arg\min_{W} \mathcal{L}(W)\). No closed-form solution for neural networks — we use gradient descent.
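Both losses are one-liners. A minimal numpy sketch (the `eps` clipping in BCE is a standard numerical guard, not part of the formula):

```python
import numpy as np

def mse(y_hat, y):
    """Mean squared error between predictions and targets."""
    return np.mean((y_hat - y) ** 2)

def bce(y_hat, y, eps=1e-12):
    """Binary cross-entropy; clip probabilities to avoid log(0)."""
    p = np.clip(y_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

For a perfectly confident correct prediction BCE is near zero; for an uninformative prediction of 0.5 it equals log 2 per example.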

Gradient Descent

Update rule for each parameter matrix:

\( W \leftarrow W - \alpha \frac{\partial \mathcal{L}}{\partial W} \)
  • \(\alpha\) is the learning rate
  • The gradient \(\nabla_W \mathcal{L}\) points toward steepest increase
  • We step in the opposite direction
Non-convexity: Neural network loss surfaces have many local minima and saddle points.

Loss landscape \(\mathcal{L}(w_1, w_2)\) — dark = low loss (minimum)

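The update rule is easiest to see on a loss we can minimize by hand. A sketch on a hypothetical quadratic bowl (the specific loss, learning rate, and step count are illustrative):

```python
import numpy as np

def loss(w):
    w1, w2 = w
    return (w1 - 3) ** 2 + 10 * (w2 + 1) ** 2   # minimum at (3, -1)

def grad(w):
    w1, w2 = w
    return np.array([2 * (w1 - 3), 20 * (w2 + 1)])

w = np.array([0.0, 0.0])
alpha = 0.05                  # learning rate
for _ in range(500):
    w -= alpha * grad(w)      # step against the gradient
```

On this convex bowl the iterates converge to the global minimum; on a neural network's non-convex surface, the same rule only guarantees descent to a local minimum or saddle region.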

Backpropagation

The chain rule applied layer by layer, reusing intermediate gradients:

\( \frac{\partial \mathcal{L}}{\partial W^{[k]}} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial \mathbf{z}^{[L]}} \cdot \frac{\partial \mathbf{z}^{[L]}}{\partial \mathbf{v}^{[L-1]}} \cdots \frac{\partial \mathbf{z}^{[k]}}{\partial W^{[k]}} \)

Forward Pass

\(\mathbf{z}^{[l]} = W^{[l]}\mathbf{v}^{[l-1]} + \mathbf{b}^{[l]}\)

\(\mathbf{v}^{[l]} = f(\mathbf{z}^{[l]})\)

Compute and store each layer's activations.

Backward Pass

Start: \(\delta^{[L]} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{[L]}}\)

Propagate: \(\delta^{[l]} = (W^{[l+1]})^\top \delta^{[l+1]} \odot f'(\mathbf{z}^{[l]})\)

Gradients: \(\frac{\partial \mathcal{L}}{\partial W^{[l]}} = \delta^{[l]}(\mathbf{v}^{[l-1]})^\top\)

Backpropagation computes all gradients in one backward sweep, with cost proportional to the forward pass.
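The forward and backward passes above translate directly into code. A sketch assuming ReLU activations at every layer (including the output, as in the single-hidden-layer formula earlier) and a squared-error loss:

```python
import numpy as np

def relu(z):  return np.maximum(0.0, z)
def drelu(z): return (z > 0).astype(float)

def forward(x, Ws, bs):
    """Forward pass: store every z^[l] and v^[l] for reuse in the backward pass."""
    zs, vs = [], [x]
    for W, b in zip(Ws, bs):
        z = W @ vs[-1] + b
        zs.append(z)
        vs.append(relu(z))
    return zs, vs

def backward(y, zs, vs, Ws):
    """Backward pass: propagate delta^[l] and collect dL/dW^[l], dL/db^[l]."""
    L = len(Ws)
    delta = 2 * (vs[-1] - y) * drelu(zs[-1])          # delta^[L] for squared error
    gWs, gbs = [None] * L, [None] * L
    for l in reversed(range(L)):
        gWs[l] = np.outer(delta, vs[l])               # dL/dW^[l] = delta^[l] (v^[l-1])^T
        gbs[l] = delta
        if l > 0:
            delta = (Ws[l].T @ delta) * drelu(zs[l - 1])
    return gWs, gbs

# Tiny 2 -> 3 -> 1 network with random weights, for illustration
rng = np.random.default_rng(1)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
bs = [rng.standard_normal(3), rng.standard_normal(1)]
zs, vs = forward(np.array([0.5, -1.2]), Ws, bs)
gWs, gbs = backward(np.array([0.7]), zs, vs, Ws)
```

A standard sanity check is to compare these gradients against finite differences of the loss; they should agree to several decimal places.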

Backpropagation: Concrete Example

A 1-hidden-layer network: \(x \xrightarrow{w_1} v = f(w_1 x + b_1) \xrightarrow{w_2} \hat{y} = w_2 v + b_2\). Let \(f = \text{ReLU}\), \(x = 2\), \(y = 1\).

Forward Pass

\(z_1 = w_1 x + b_1 = 0.5 \cdot 2 + 0.1 = 1.1\)

\(v = \text{ReLU}(1.1) = 1.1\)

\(\hat{y} = w_2 v + b_2 = 0.8 \cdot 1.1 + 0 = 0.88\)

\(\mathcal{L} = (\hat{y} - y)^2 = (0.88 - 1)^2 = 0.0144\)

Backward Pass

\(\frac{\partial \mathcal{L}}{\partial \hat{y}} = 2(\hat{y} - y) = -0.24\)

\(\frac{\partial \mathcal{L}}{\partial w_2} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot v = -0.24 \cdot 1.1 = -0.264\)

\(\frac{\partial \mathcal{L}}{\partial v} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot w_2 = -0.24 \cdot 0.8 = -0.192\)

\(\frac{\partial \mathcal{L}}{\partial w_1} = \frac{\partial \mathcal{L}}{\partial v} \cdot f'(z_1) \cdot x = -0.192 \cdot 1 \cdot 2 = -0.384\)

SGD update (α = 0.1): \(w_1 \leftarrow 0.5 - 0.1(-0.384) = 0.538\), \(w_2 \leftarrow 0.8 - 0.1(-0.264) = 0.826\). Each step moves weights to reduce loss.
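The worked numbers above can be reproduced in a few lines of plain Python:

```python
f_prime = lambda z: 1.0 if z > 0 else 0.0   # ReLU derivative

w1, b1, w2, b2 = 0.5, 0.1, 0.8, 0.0
x, y = 2.0, 1.0

# Forward pass
z1 = w1 * x + b1          # 1.1
v = max(0.0, z1)          # ReLU(1.1) = 1.1
y_hat = w2 * v + b2       # 0.88
loss = (y_hat - y) ** 2   # 0.0144

# Backward pass (chain rule)
dL_dyhat = 2 * (y_hat - y)           # -0.24
dL_dw2 = dL_dyhat * v                # -0.264
dL_dv = dL_dyhat * w2                # -0.192
dL_dw1 = dL_dv * f_prime(z1) * x     # -0.384

# SGD step, alpha = 0.1
w1 -= 0.1 * dL_dw1                   # 0.5384
w2 -= 0.1 * dL_dw2                   # 0.8264
```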

Stochastic Gradient Descent and Variants

Method      Update Rule
SGD         \(W \leftarrow W - \alpha \nabla_W \mathcal{L}_{\text{batch}}\)
Momentum    \(v \leftarrow \beta v + \nabla \mathcal{L}\), \(W \leftarrow W - \alpha v\)
Adam        Adaptive learning rates per parameter
Mini-batch SGD: Instead of computing the gradient over the entire dataset, use a random subset (batch) at each step. Faster, and the noise helps escape local minima.
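The momentum row of the table can be sketched on the same kind of hypothetical quadratic used for plain gradient descent (loss, rates, and step count are illustrative choices):

```python
import numpy as np

def grad(w):
    # Gradient of the hypothetical quadratic L(w) = (w1 - 3)^2 + 10 (w2 + 1)^2
    return np.array([2 * (w[0] - 3), 20 * (w[1] + 1)])

w = np.array([0.0, 0.0])
v = np.zeros(2)
alpha, beta = 0.01, 0.9
for _ in range(1000):
    v = beta * v + grad(w)   # v <- beta v + grad L: running average of gradients
    w = w - alpha * v        # W <- W - alpha v: step along the smoothed direction
```

The velocity term damps oscillation across steep directions while accumulating speed along shallow ones, which is why momentum often converges faster than plain SGD.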

Training in Practice: PyTorch


import torch
import torch.nn as nn
import torch.optim as optim

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1, 20)
        self.fc2 = nn.Linear(20, 15)
        self.fc3 = nn.Linear(15, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

net = Net()
criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.01)

x_train = torch.linspace(-1, 1, 100).unsqueeze(1)   # toy data, for illustration
y_train = torch.sin(3 * x_train)

for epoch in range(5000):
    optimizer.zero_grad()    # Reset gradients
    output = net(x_train)    # Forward pass
    loss = criterion(output, y_train)
    loss.backward()          # Backpropagation
    optimizer.step()         # Update weights

Universal Approximation in Action

A single hidden layer with enough neurons can approximate any continuous function (Cybenko, 1989; Hornik et al., 1989). The network trains via the SGD loop above. Increase neurons to improve the fit:

Architecture: 1 → N → 1 (sigmoid activations)

Training: 800 epochs of SGD on 30 sample points, LR = 0.005

Overfitting and Regularization

When a model memorizes training data but fails on new data:

Regularization Strategies

  • Weight decay (L2): \(\mathcal{L}_{\text{reg}} = \mathcal{L} + \lambda\|W\|^2\)
  • Dropout: randomly zero out neurons during training
  • Early stopping: stop when validation loss starts increasing
  • Data augmentation: increase effective dataset size

# Weight decay in PyTorch
optimizer = optim.Adam(net.parameters(),
                       lr=0.01, weight_decay=1e-3)
            

Beyond the Basics

Architectures that encode structure

The Neural Network Zoo

Different architectures encode different inductive biases about the structure of data:

Architecture             Bias / Assumption                           Application
Fully Connected (MLP)    No structure assumed                        Tabular data, function approximation
Convolutional (CNN)      Spatial locality, translation invariance    Images, spatial data
Recurrent (RNN)          Sequential dependence                       Time series, language
Autoencoder              Low-dimensional latent structure            Compression, denoising
Transformer              Attention over all positions                Language, vision, multimodal

Auto-Encoders

Learn a compressed representation by training the network to reconstruct its own input. The bottleneck forces the network to discover the most important features.

\( \mathcal{L} = \|\mathbf{x} - \hat{\mathbf{x}}\|^2, \quad \hat{\mathbf{x}} = g_\theta(\underbrace{f_\phi(\mathbf{x})}_{\mathbf{z}}) \)
  • Encoder: \(\mathbf{z} = f_\phi(\mathbf{x}) \in \mathbb{R}^d\) — compressed representation
  • Decoder: \(\hat{\mathbf{x}} = g_\theta(\mathbf{z}) \in \mathbb{R}^n\) — reconstruction
  • If \(f, g\) are linear, the AE learns the SVD: same subspace as PCA
Example: MNIST denoising. Train on corrupted digits, reconstruct clean ones. The 2D latent space \(\mathbf{z} \in \mathbb{R}^2\) clusters digits by identity — the network discovers digit features without labels.
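The linear case in the last bullet can be checked directly: the optimal linear encoder/decoder pair is given in closed form by the SVD, and the reconstruction error equals the energy in the discarded singular values. A sketch on hypothetical low-rank data (shapes and latent dimension are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 8))   # 100 samples in R^8, rank 5
Xc = X - X.mean(axis=0)                                           # center, as PCA assumes

d = 3                                   # latent (bottleneck) dimension
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Optimal linear autoencoder: project onto the top-d principal directions
encode = lambda x: x @ Vt[:d].T         # z = f(x) in R^d
decode = lambda z: z @ Vt[:d]           # x_hat = g(z) back in R^8
X_hat = decode(encode(Xc))

recon_err = np.sum((Xc - X_hat) ** 2)   # equals sum of discarded singular values squared
```

A linear AE trained by gradient descent converges to the same subspace; the nonlinear AEs used in practice generalize this to curved low-dimensional structure.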

Convolutional Neural Networks

Instead of connecting every input to every hidden unit, CNNs use local filters (kernels) that slide across the input. A 2D convolution:

\( (\mathbf{K} * \mathbf{X})_{ij} = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} K_{mn} \cdot X_{i+m,\, j+n} \)
  • Convolution: slide a \(k \times k\) kernel across the image. Same kernel everywhere → weight sharing
  • Pooling (max or average): reduce spatial dimensions, e.g. \(2 \times 2\) max-pool halves width and height
  • Stack: Conv → ReLU → Pool → Conv → ReLU → Pool → Flatten → FC
Parameter savings: A 5×5 kernel on a 28×28 image has 25 weights, vs. 784×hidden for a fully connected layer. Translation invariance is built in.
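The sum in the formula above translates into two nested loops. A minimal "valid" (no padding) sketch, with a simple hypothetical edge-detecting kernel as the usage example:

```python
import numpy as np

def conv2d(X, K):
    """Slide a k x k kernel K over image X, as in the formula above."""
    k = K.shape[0]
    H, W = X.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(K * X[i:i + k, j:j + k])   # same kernel at every position
    return out

# Hypothetical example: a vertical-edge kernel on a step image
X = np.zeros((4, 4)); X[:, 2:] = 1.0
K = np.array([[1.0, -1.0], [1.0, -1.0]])
out = conv2d(X, K)
```

The response is nonzero only where the kernel straddles the step, illustrating how one shared kernel detects the same feature everywhere in the image. (As is standard in deep learning, the formula is strictly a cross-correlation; the kernel is not flipped.)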

Recurrent Neural Networks

Process sequences by maintaining a hidden state that carries information across time steps:

\( \mathbf{h}_{t+1} = f(W_h \mathbf{h}_t + W_x \mathbf{x}_t + \mathbf{b}_h) \)
\( \hat{\mathbf{y}}_t = W_y \mathbf{h}_t + \mathbf{b}_y \)
  • \(\mathbf{x}_t\) — input at time \(t\); \(\mathbf{h}_t\) — hidden state (memory)
  • \(W_h, W_x, W_y\) — shared across all time steps
  • \(\hat{\mathbf{y}}_t\) — prediction at time \(t\) (e.g. next value \(\mathbf{x}_{t+1}\))
ODE connection: Euler's method \(\mathbf{y}_{n+1} = \mathbf{y}_n + h\,f(\mathbf{y}_n)\) is a recurrence with fixed "weights." RNNs learn the recurrence from data.
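Unrolling the recurrence above is a single loop over the sequence. A sketch with tanh as the hidden nonlinearity and arbitrary random weights (the dimensions, input signal, and initialization are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_h, n_x, n_y = 4, 1, 1

# Weights shared across all time steps
W_h = 0.5 * rng.standard_normal((n_h, n_h))
W_x = rng.standard_normal((n_h, n_x))
W_y = rng.standard_normal((n_y, n_h))
b_h, b_y = np.zeros(n_h), np.zeros(n_y)

def run(xs):
    """Unroll h_{t+1} = tanh(W_h h_t + W_x x_t + b_h), emitting y_t at each step."""
    h = np.zeros(n_h)                  # initial hidden state (memory)
    ys = []
    for x_t in xs:
        h = np.tanh(W_h @ h + W_x @ np.atleast_1d(x_t) + b_h)
        ys.append(W_y @ h + b_y)
    return np.array(ys)

ys = run(np.sin(np.linspace(0, 2 * np.pi, 20)))
```

Training fits W_h, W_x, W_y by backpropagating through this unrolled loop (backpropagation through time), so that y_t predicts, e.g., the next value of the sequence.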

Summary

  • Deep learning succeeds through self-organization: simple elements adapting collectively to data
  • Neural networks are compositions of linear maps and nonlinearities: \(\hat{y} = f(W_L\cdots f(W_1\mathbf{x}))\)
  • Backpropagation computes gradients efficiently via the chain rule
  • Architecture encodes inductive bias: CNNs for spatial data, RNNs for sequences
  • Regularization prevents overfitting: weight decay, dropout, early stopping
  • The hidden layer learns features; depth enables hierarchical abstraction
The power of deep learning is not in any specific architecture, but in enabling systems to learn from data autonomously.

References & Resources