The Backpropagation Algorithm Explained

Backpropagation is the core algorithm of deep learning: it is what makes training deep neural networks feasible. This article works through the mathematics behind backpropagation in detail.

Neural Network Basics

A simple feed-forward neural network can be written as:

  • Input layer: $x = (x_1, x_2, ..., x_n)$
  • Hidden layer: $h = \sigma(W^{(1)} x + b^{(1)})$
  • Output layer: $\hat{y} = \sigma(W^{(2)} h + b^{(2)})$

where $\sigma$ is an activation function (the sigmoid in the implementation below).
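As a quick sanity check on the shapes involved, here is a minimal NumPy sketch of one forward pass through such a two-layer network; the sizes (3 inputs, 4 hidden units, 1 output) and the column-vector convention are chosen here purely for illustration.

```python
import numpy as np

n, hidden, out = 3, 4, 1            # arbitrary sizes, for illustration only
x = np.random.randn(n)              # input vector, shape (n,)

W1 = np.random.randn(hidden, n)     # hidden-layer weights, shape (hidden, n)
b1 = np.zeros(hidden)
W2 = np.random.randn(out, hidden)   # output-layer weights, shape (out, hidden)
b2 = np.zeros(out)

sigmoid = lambda z: 1 / (1 + np.exp(-z))

h = sigmoid(W1 @ x + b1)            # hidden activation, shape (hidden,)
y_hat = sigmoid(W2 @ h + b2)        # network output, shape (out,)
print(h.shape, y_hat.shape)         # (4,) (1,)
```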

Forward Propagation

The forward pass computes:

$$z^{(1)} = W^{(1)} x + b^{(1)}$$

$$a^{(1)} = \sigma(z^{(1)})$$

$$z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}$$

$$\hat{y} = \sigma(z^{(2)})$$

Loss Functions

For classification problems, the cross-entropy loss is commonly used:

$$L = -\sum_c y_c \log(\hat{y}_c)$$

For regression problems, the mean squared error is used:

$$L = \frac{1}{2}(y - \hat{y})^2$$
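Both losses are only a few lines of NumPy; a minimal sketch (the function names here are illustrative):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # L = -sum_c y_c * log(y_hat_c); eps guards against log(0)
    return -np.sum(y_true * np.log(y_pred + eps))

def mse(y_true, y_pred):
    # Mean squared error with the 1/2 factor used in the text
    return 0.5 * np.mean((y_true - y_pred) ** 2)

y_true = np.array([0.0, 1.0, 0.0])    # one-hot target
y_pred = np.array([0.1, 0.8, 0.1])    # predicted probabilities
print(cross_entropy(y_true, y_pred))  # ≈ 0.223
print(mse(y_true, y_pred))            # ≈ 0.01
```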

Backpropagation

Backpropagation uses the chain rule to compute the gradients. Suppose we want to compute $\frac{\partial L}{\partial W^{(1)}}$.

Output-Layer Gradient

$$\delta^{(2)} = \frac{\partial L}{\partial z^{(2)}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}}$$

For the MSE loss, $\frac{\partial L}{\partial \hat{y}} = \hat{y} - y$, so:

$$\delta^{(2)} = (\hat{y} - y) \odot \sigma'(z^{(2)})$$

where $\odot$ denotes the element-wise product. For the sigmoid, $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$, which is why the implementation below computes the derivative directly from the activation value.

Hidden-Layer Gradient

Propagating the error one layer back through $z^{(2)} = W^{(2)} a^{(1)} + b^{(2)}$ gives:

$$\delta^{(1)} = \left((W^{(2)})^T \delta^{(2)}\right) \odot \sigma'(z^{(1)})$$

Weight Gradients

$$\frac{\partial L}{\partial W^{(2)}} = \delta^{(2)} (a^{(1)})^T$$

$$\frac{\partial L}{\partial W^{(1)}} = \delta^{(1)} x^T$$
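For completeness (the implementation below also updates the biases), the bias gradients are just the error terms themselves, since $\partial z / \partial b = 1$:

$$\frac{\partial L}{\partial b^{(2)}} = \delta^{(2)}, \qquad \frac{\partial L}{\partial b^{(1)}} = \delta^{(1)}$$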

Code Implementation

A complete NumPy implementation of the two-layer network above:

```python
import numpy as np

class NeuralNetwork:

    def __init__(self, input_size, hidden_size, output_size):
        # Small random weights, zero biases
        self.W1 = np.random.randn(input_size, hidden_size) * 0.01
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.01
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_derivative(self, a):
        # Takes the activation a = sigmoid(z), since sigma'(z) = a * (1 - a)
        return a * (1 - a)

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y, learning_rate=0.1):
        m = X.shape[0]

        # Output-layer error
        delta2 = (self.a2 - y) * self.sigmoid_derivative(self.a2)

        # Hidden-layer error
        delta1 = (delta2 @ self.W2.T) * self.sigmoid_derivative(self.a1)

        # Update the weights (gradients averaged over the batch)
        self.W2 -= learning_rate * (self.a1.T @ delta2) / m
        self.b2 -= learning_rate * np.sum(delta2, axis=0, keepdims=True) / m
        self.W1 -= learning_rate * (X.T @ delta1) / m
        self.b1 -= learning_rate * np.sum(delta1, axis=0, keepdims=True) / m

    def train(self, X, y, epochs=10000, learning_rate=0.1):
        for i in range(epochs):
            output = self.forward(X)
            self.backward(X, y, learning_rate)
            if i % 1000 == 0:
                loss = np.mean((output - y) ** 2)
                print(f"Epoch {i}, Loss: {loss:.4f}")
```

Example: the XOR problem

```python
# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

nn = NeuralNetwork(2, 4, 1)
nn.train(X, y)
print("Predictions:", nn.forward(X).round(2))
```

Gradient Descent Variants

In practice, commonly used optimizers include:

  • SGD: standard stochastic gradient descent
  • Momentum: adds a momentum term to accelerate convergence
  • Adam: adaptive learning rates, based on running estimates of the first and second moments of the gradient $g_t$ (a minimal update sketch follows this list):

    $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

    $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
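A minimal sketch of one Adam update step, assuming the standard bias-corrected form with typical default hyperparameters (the function and variable names here are illustrative):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # theta: parameters, g: gradient, m/v: moment estimates, t: 1-based step count
    m = beta1 * m + (1 - beta1) * g        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g ** 2   # second moment (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)           # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```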

Summary

The core ideas of backpropagation:

  • Use the chain rule to compute gradients efficiently
  • Propagate the error backward, from the output layer toward the input
  • Update the parameters by gradient descent