AI: A paragraph-by-paragraph translation and close reading of the landmark backpropagation paper "Learning representations by back-propagating errors"

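This article reads through, paragraph by paragraph, one of the founding papers of deep learning: "Learning representations by back-propagating errors".

Working from the paper's PDF, it translates the core passages, explains the technical content, and adds background, aiming to be clear, accurate, and thorough.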

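Title and authors

Learning representations by back-propagating errors
(Chinese title: 通过反向传播误差来学习表征)

David E. Rumelhart*, Geoffrey E. Hinton† & Ronald J. Williams*
*Institute for Cognitive Science, University of California, San Diego
†Department of Computer Science, Carnegie-Mellon University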

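Commentary:

The title names the paper's two core ideas directly: "learning representations" and "back-propagating errors".
"Representations" are the internal features or concepts learned by the hidden units; they are what lets multi-layer networks go beyond the perceptron.
The author line-up is a "dream team": Rumelhart was a leading figure in cognitive science, and Hinton is often called the godfather of deep learning.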

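Abstract

Original: We describe a new learning procedure, back-propagation, for networks of neurone-like units. The procedure repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector. As a result of the weight adjustments, internal ‘hidden’ units which are not part of the input or output come to represent important features of the task domain, and the regularities in the task are captured by the interactions of these units. The ability to create useful new features distinguishes back-propagation from earlier, simpler methods such as the perceptron-convergence procedure.

Translation: We describe a new learning procedure, back-propagation, for networks of neuron-like units. The procedure repeatedly adjusts the connection weights in the network so as to minimize a measure of the difference between the network's actual output vector and the desired output vector. Because of these weight adjustments, the internal "hidden" units, which are neither input nor output, gradually come to represent important features of the task domain, and the regularities of the task are captured by the interactions among these units. This ability to create useful new features is what sets back-propagation apart from earlier, simpler methods such as the perceptron-convergence procedure.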

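Commentary:

Core objective: minimize the gap between the actual output and the desired output (that is, a loss function).
Key result: hidden units automatically learn internal representations that are useful for the task. This was a revolutionary idea: in the perceptron era features were designed by hand, whereas here the network itself learns what matters.
Historical contrast: the abstract states plainly what separates the method from the single-layer perceptron, namely the ability to learn representations.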

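Introduction

Original: There have been many attempts to design self-organizing neural networks. The aim is to find a powerful synaptic modification rule that will allow an arbitrarily connected neural network to develop an internal structure that is appropriate for a particular task domain. The task is specified by giving the desired state vector of the output units for each state vector of the input units.

Translation: There have been many attempts to design self-organizing neural networks. The goal is to find a powerful rule for modifying synapses that lets an arbitrarily connected neural network develop an internal structure suited to a particular task domain. The task is defined by giving, for every state vector of the input units, the desired state vector of the output units.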

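Commentary:

This sets up the problem: how can a general network ("arbitrarily connected") adapt itself to a task through a learning rule ("synaptic modification rule")?
This is the supervised learning framework: the task is defined by input-output pairs $(x, y)$.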

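Original: If the input units are directly connected to the output units it is relatively easy to find learning rules that iteratively adjust the relative strengths of the connections so as to progressively reduce the difference between the actual and desired output vectors. Learning becomes more interesting but more difficult when we introduce hidden units whose actual or desired states are not specified by the task.

Translation: If the input units connect directly to the output units, it is relatively easy to find learning rules that iteratively adjust the relative connection strengths and so gradually reduce the difference between the actual and desired output vectors. Once hidden units are introduced, however, learning becomes more interesting but also harder, because the task specifies neither the actual nor the desired states of those hidden units.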

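Commentary:

This is the core difficulty, widely known as the credit assignment problem.
For the output layer we know what it should produce (there is a supervision signal), so computing the error and updating the weights is straightforward.
For the hidden layers we do not know what their activations "should" be. The central contribution of back-propagation is a solution to exactly this problem.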

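Original: (In perceptrons, there are “feature analysers” between the input and output that are not true hidden units because their input connections are fixed by hand, so their states are completely determined by the input vector: they do not learn representations.)

Translation: (In perceptrons there are "feature analysers" between input and output, but they are not true hidden units: their input connections are fixed by hand, so their states are fully determined by the input vector, and they do not learn representations.)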

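Commentary:

The contrast with the perceptron is stressed again. A perceptron's "hidden layer" is really a hand-coded feature extractor with no ability to learn.
In the networks proposed here, the weights feeding the hidden layers are learnable, so the network can genuinely discover and build new representations.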

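Original: The learning procedure must decide under what circumstances the hidden units should be active in order to help achieve the desired input-output behaviour. This amounts to deciding what these units should represent. We demonstrate that a general purpose and relatively simple procedure is powerful enough to construct appropriate internal representations.

Translation: The learning procedure must decide under what circumstances the hidden units should be active so that they help produce the desired input-output behaviour. This amounts to deciding what these units should represent. We show that a general-purpose and relatively simple procedure is enough to construct suitable internal representations.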

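Commentary:

This previews the whole paper. The "general purpose and relatively simple procedure" the authors go on to demonstrate is the back-propagation algorithm.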

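Network architecture and forward pass

Original: The simplest form of the learning procedure is for layered networks which have a layer of input units at the bottom; any number of intermediate layers; and a layer of output units at the top. Connections within a layer or from higher to lower layers are forbidden, but connections can skip intermediate layers.

Translation: The simplest form of the learning procedure applies to layered networks: a layer of input units at the bottom, a layer of output units at the top, and any number of intermediate layers in between. Connections within a layer, or from a higher layer to a lower one, are forbidden, but connections that skip intermediate layers are allowed.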

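Commentary:

This defines the standard structure of a feedforward neural network, with the extra freedom that connections may skip layers.
In this simplest form, recurrent connections are explicitly excluded, which restricts the problem to static pattern recognition: each input vector is mapped to an output in a single bottom-to-top pass.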

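Original: An input vector is presented to the network by setting the states of the input units. Then the states of the units in each layer are determined by applying equations (1) and (2) to the connections coming from lower layers.

Translation: An input vector is presented to the network by setting the states of the input units. The states of the units in each layer are then determined by applying equations (1) and (2) to the connections coming from lower layers.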

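Equation (1): $x_j = \sum_i y_i w_{ji}$
Equation (2): $y_j = \frac{1}{1 + e^{-x_j}}$

Translation:
Equation (1): the total input $x_j$ to unit $j$ is the weighted sum of the outputs $y_i$ arriving over all of its incoming connections.
Equation (2): the output $y_j$ of unit $j$ is the sigmoid (logistic) function of its total input.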

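Commentary:

Equation (1) is a linear transformation (the core of a matrix multiplication).
Equation (2) is a nonlinear activation function. The sigmoid maps any real number into the interval (0, 1), loosely analogous to the firing rate of a biological neuron.
The nonlinearity is essential. Without it, a composition of linear layers is itself a single linear map, so a multi-layer network would be equivalent to a single-layer one.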
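The forward pass described by equations (1) and (2) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the layer sizes, the weight values, and the names `layer_forward`, `w_hidden`, and `w_output` are made up for the example, and bias terms are omitted for brevity.

```python
import math


def sigmoid(x):
    """Equation (2): logistic function of the total input x_j."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(y_below, weights):
    """Apply equations (1) and (2) to one layer.

    weights[j][i] plays the role of w_ji: the weight on the connection
    from unit i in the layer below to unit j in this layer.
    """
    # Equation (1): x_j = sum_i y_i * w_ji
    totals = [sum(w_ji * y_i for w_ji, y_i in zip(row, y_below)) for row in weights]
    # Equation (2): y_j = sigmoid(x_j)
    return [sigmoid(x_j) for x_j in totals]


# Hypothetical 2-input, 2-hidden, 1-output network with arbitrary weights.
w_hidden = [[1.0, -1.0], [0.5, 0.5]]
w_output = [[2.0, -2.0]]

inputs = [1.0, 0.0]
hidden = layer_forward(inputs, w_hidden)   # states of the hidden units
output = layer_forward(hidden, w_output)   # state of the output unit
```

Each layer only needs the states of the layer below it, which is why the states can be computed bottom to top in a single sweep.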

Loss function and learning objective

Original: The aim is to find a set of weights that ensure that for each input vector the output vector produced by the network is the same as (or sufficiently close to) the desired output vector. … The total error, E, is defined as
Equation (3): $E = \frac{1}{2} \sum_c \sum_j (y_{cj} - d_{cj})^2$
where c is an index over cases (input-output pairs), j is an index over output units…

Translation: The aim is to find a set of weights such that, for every input vector, the output vector produced by the network is the same as (or close enough to) the desired output vector. … The total error $E$ is defined as
$E = \frac{1}{2} \sum_c \sum_j (y_{cj} - d_{cj})^2$
where $c$ indexes the cases (input-output pairs) and $j$ indexes the output units…

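Commentary:

The loss is the classic squared-error criterion. Strictly, equation (3) is a sum of squared errors over all cases and output units rather than a mean (MSE); the two differ only by a constant factor.
The factor $\frac{1}{2}$ is there to cancel the 2 that the square produces on differentiation, which keeps the later formulas tidy.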
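As a small illustration of equation (3), the total error can be computed by summing squared differences over every case and every output unit. The function name `total_error` and the sample numbers below are invented for the example.

```python
def total_error(outputs, targets):
    """Equation (3): E = 1/2 * sum_c sum_j (y_cj - d_cj)^2.

    outputs[c][j] is the actual output y_cj for case c and output unit j;
    targets[c][j] is the desired output d_cj.
    """
    return 0.5 * sum(
        (y - d) ** 2
        for y_case, d_case in zip(outputs, targets)
        for y, d in zip(y_case, d_case)
    )


# Two hypothetical cases, two output units each.
outputs = [[0.8, 0.1], [0.3, 0.9]]
targets = [[1.0, 0.0], [0.0, 1.0]]

E = total_error(outputs, targets)  # 0.5 * (0.04 + 0.01 + 0.09 + 0.01), about 0.075
```

Because of the factor 1/2, the derivative of E with respect to a single output is just the difference between actual and desired output, with no stray factor of 2.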

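The heart of backprop: computing the gradient

Original: To minimize E by gradient descent it is necessary to compute the partial derivative of E with respect to each weight in the network. … For a given case, the partial derivatives of the error with respect to each weight are computed in two passes. We have already described the forward pass … The backward pass which propagates derivatives from the top layer back to the bottom one is more complicated.

Translation: To minimize $E$ by gradient descent, we need the partial derivative of $E$ with respect to every weight in the network. … For a given case, the partial derivatives of the error with respect to each weight are computed in two passes. The forward pass has already been described … The backward pass, which propagates derivatives from the top layer back down to the bottom one, is more complicated.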

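Commentary:

This formally introduces the two-phase computation: a forward pass (computing the outputs) and a backward pass (computing the gradients).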

Step 1: the error term of the output layer (δ)

Equation (4): $\partial E / \partial y_j = y_j - d_j$
Equation (5): $\partial E / \partial x_j = (\partial E / \partial y_j) \cdot y_j (1 - y_j)$

Translation:
Equation (4): the partial derivative of the error with respect to the output $y_j$ of output unit $j$ is simply the prediction error.
Equation (5): the chain rule then gives the partial derivative of the error with respect to the total input $x_j$ of output unit $j$. The factor $y_j(1-y_j)$ is exactly the derivative of the sigmoid function.

Commentary:

$\delta_j^{(\text{output})} = \partial E / \partial x_j$ is commonly called the error term or local gradient. It measures how much "responsibility" the total (weighted) input of unit $j$ bears for the overall error.
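The factor $y_j(1-y_j)$ in equation (5) can be checked directly from equation (2). A short derivation, using only the symbols already defined:

```latex
\frac{dy_j}{dx_j}
  = \frac{d}{dx_j}\left(1 + e^{-x_j}\right)^{-1}
  = \frac{e^{-x_j}}{\left(1 + e^{-x_j}\right)^{2}}
  = \frac{1}{1 + e^{-x_j}} \cdot \frac{e^{-x_j}}{1 + e^{-x_j}}
  = y_j \,(1 - y_j)
```

Combining this with equation (4) gives, for a single case and an output unit $j$, $\partial E / \partial x_j = (y_j - d_j)\, y_j (1 - y_j)$.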

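Step 2: the weight gradient

Equation (6): $\partial E / \partial w_{ji} = (\partial E / \partial x_j) \cdot y_i$

Translation: the partial derivative of the error with respect to the connection weight $w_{ji}$ equals the error term of unit $j$ multiplied by the output of unit $i$ in the layer below.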

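Commentary:

This is the most intuitive of the gradient formulas: the gradient for a weight = the "responsibility" of the receiving unit × the "activity level" of the sending unit in the layer below. The weight update itself then moves against this gradient.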
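Equations (4) to (6) can be checked numerically for a single output unit. The sketch below compares the analytic gradient with a central finite difference; the weights, inputs, and target are arbitrary values chosen for the example, and the helper names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def error(w, y_in, d):
    """Error for one case and one output unit j, with w[i] standing for w_ji."""
    y_j = sigmoid(sum(w_i * y_i for w_i, y_i in zip(w, y_in)))
    return 0.5 * (y_j - d) ** 2


def weight_gradient(w, y_in, d):
    """Equations (4)-(6) for a single output unit."""
    y_j = sigmoid(sum(w_i * y_i for w_i, y_i in zip(w, y_in)))
    dE_dy = y_j - d                       # equation (4)
    dE_dx = dE_dy * y_j * (1.0 - y_j)     # equation (5)
    return [dE_dx * y_i for y_i in y_in]  # equation (6)


w = [0.3, -0.2, 0.5]
y_in = [1.0, 0.5, 0.0]
d = 1.0

analytic = weight_gradient(w, y_in, d)

# Central finite differences as an independent check of the formulas.
eps = 1e-6
numeric = []
for i in range(len(w)):
    w_plus = list(w)
    w_minus = list(w)
    w_plus[i] += eps
    w_minus[i] -= eps
    numeric.append((error(w_plus, y_in, d) - error(w_minus, y_in, d)) / (2 * eps))
```

Note that the gradient for the third weight is exactly zero because the sending unit's output is zero: a weight on a silent connection had no influence on the error for this case.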

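Step 3: propagating the error back to the hidden layers

Equation (7): $\partial E / \partial y_i = \sum_j (\partial E / \partial x_j) \, w_{ji}$

Translation: the partial derivative of the error with respect to the output $y_i$ of hidden unit $i$ is the sum, over all units $j$ in higher layers that receive its output, of the error term $\partial E / \partial x_j$ multiplied by the corresponding weight $w_{ji}$.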

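Commentary:

This is the soul of back-propagation.
Hidden unit $i$ has no "correct answer" of its own, but it can judge its performance from the feedback of the "downstream" units $j$.
If unit $i$ connects through a large weight $w_{ji}$ to a unit $j$ with a large positive error term $\delta_j$, then unit $i$ carries a large share of the "responsibility" and its $\partial E / \partial y_i$ is large as well. The sum weights each downstream error term by the connection strength, not by how active unit $i$ currently is.
Next, the same step used at the output layer (multiplying by the derivative of the activation function, as in equation (5)) gives $\partial E / \partial x_i$. The process repeats layer by layer until the derivatives for the lowest layer of weights have been computed.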
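Putting equations (4) to (7) together gives the full backward pass. The sketch below runs it on a hypothetical network with 3 inputs, 2 hidden units, and 2 outputs, then checks the hidden-layer gradients against finite differences. All sizes, weights, and helper names are assumptions made for the illustration, and biases are left out.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(y0, w1, w2):
    """Forward pass, equations (1) and (2), through two layers of weights."""
    y1 = [sigmoid(sum(w * y for w, y in zip(row, y0))) for row in w1]
    y2 = [sigmoid(sum(w * y for w, y in zip(row, y1))) for row in w2]
    return y1, y2


def error(y0, d, w1, w2):
    """Error for a single case: the inner sum of equation (3)."""
    _, y2 = forward(y0, w1, w2)
    return 0.5 * sum((y - t) ** 2 for y, t in zip(y2, d))


def backward(y0, d, w1, w2):
    """Backward pass: returns dE/dw for both layers of weights."""
    y1, y2 = forward(y0, w1, w2)
    # Output layer: equations (4) and (5).
    dx2 = [(y - t) * y * (1.0 - y) for y, t in zip(y2, d)]
    # Equation (6) for the hidden-to-output weights.
    g2 = [[dx_j * y_i for y_i in y1] for dx_j in dx2]
    # Equation (7): dE/dy_i = sum_j dE/dx_j * w_ji for each hidden unit i.
    dy1 = [sum(dx2[j] * w2[j][i] for j in range(len(w2))) for i in range(len(y1))]
    # Equation (5) again, now at the hidden layer.
    dx1 = [dy * y * (1.0 - y) for dy, y in zip(dy1, y1)]
    # Equation (6) for the input-to-hidden weights.
    g1 = [[dx_j * y_i for y_i in y0] for dx_j in dx1]
    return g1, g2


w1 = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.6]]  # 3 inputs -> 2 hidden units
w2 = [[0.5, -0.3], [-0.8, 0.9]]            # 2 hidden units -> 2 outputs
y0 = [1.0, 0.0, 1.0]
d = [1.0, 0.0]

g1, g2 = backward(y0, d, w1, w2)


def numeric_grad_w1(j, i, eps=1e-6):
    """Central finite difference of the error with respect to w1[j][i]."""
    plus = [list(row) for row in w1]
    minus = [list(row) for row in w1]
    plus[j][i] += eps
    minus[j][i] -= eps
    return (error(y0, d, plus, w2) - error(y0, d, minus, w2)) / (2 * eps)


max_diff = max(
    abs(g1[j][i] - numeric_grad_w1(j, i)) for j in range(2) for i in range(3)
)
```

The backward pass reuses the activations stored during the forward pass, so computing every derivative costs about as much as the forward pass itself, whereas the finite-difference check needs two extra forward passes per weight.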

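Experiments and figures

The paper demonstrates the power of back-propagation with two elegant experiments:

Figure 1: detecting mirror symmetry
- Task: decide whether a 6-bit binary input vector is symmetric about its centre.
- Result: the network learned the task with only 2 hidden units. For each hidden unit, weights placed symmetrically about the middle of the input are equal in magnitude and opposite in sign.
- Significance: the network can learn a very abstract, global property of the input rather than simple local features.
Figures 2-4: family relationship inference
- Task: given an incomplete triple (such as Colin has-aunt ?), predict the missing person.
- Result: the hidden layer spontaneously formed a distributed code for concepts such as "nationality", "generation", and "branch of the family".
- Significance: this is a classic example of distributed representation, showing how a network captures the latent structure of its data.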
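To see why two hidden units are enough for mirror symmetry, one can build such a network by hand. The sketch below assumes six input bits and uses hand-picked weights (not the learned values from the paper): each hidden unit has weights that are equal in magnitude and opposite in sign about the centre, in the ratio 1:2:4, so its total input is zero only for symmetric patterns. The constants `S`, `B_HIDDEN`, `W_OUT`, and `B_OUT` are illustrative choices.

```python
import math
from itertools import product


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


S = 10.0                                       # overall weight scale (arbitrary)
PATTERN = [1.0, 2.0, 4.0, -4.0, -2.0, -1.0]    # antisymmetric about the centre
W_HIDDEN = [[S * p for p in PATTERN], [-S * p for p in PATTERN]]
B_HIDDEN = -S / 2    # negative bias: both hidden units stay off for symmetric input
W_OUT = [-10.0, -10.0]
B_OUT = 5.0          # positive bias: output is on unless a hidden unit fires


def detects_symmetry(bits):
    """Return True when the hand-built network reports a symmetric input."""
    hidden = [
        sigmoid(sum(w * b for w, b in zip(row, bits)) + B_HIDDEN)
        for row in W_HIDDEN
    ]
    out = sigmoid(sum(w * h for w, h in zip(W_OUT, hidden)) + B_OUT)
    return out > 0.5


all_inputs = list(product([0, 1], repeat=6))   # all 64 six-bit vectors
mistakes = [
    bits for bits in all_inputs
    if detects_symmetry(bits) != (bits == bits[::-1])
]
```

With weights in the ratio 1:2:4, every asymmetric pattern produces a nonzero weighted sum, so exactly one of the two mirror-image hidden units switches on and suppresses the output. Symmetric patterns leave both hidden units off and the output on.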

Conclusion and outlook

Original: The most obvious drawback of the learning procedure is that the error-surface may contain local minima so that gradient descent is not guaranteed to find a global minimum. However, experience with many tasks shows that the network very rarely gets stuck in poor local minima…

Translation: The most obvious drawback of the learning procedure is that the error surface may contain local minima, so gradient descent is not guaranteed to find the global minimum. Experience with many tasks, however, shows that the network very rarely gets stuck in poor local minima…

Original: The learning procedure, in its current form, is not a plausible model of learning in brains. However, applying the procedure to various tasks shows that interesting internal representations can be constructed by gradient descent in weight-space, and this suggests that it is worth looking for more biologically plausible ways of doing gradient descent in neural networks.