Automatic differentiation

From Wikipedia, the free encyclopedia

In mathematics and computer algebra, automatic differentiation (AD), sometimes also called algorithmic differentiation, is a method to numerically evaluate the derivative of a function specified by a computer program. Two classical ways of doing differentiation are:

  • symbolic differentiation, and
  • numerical differentiation using finite differences.

Figure 1: How automatic differentiation relates to symbolic differentiation

The drawbacks of symbolic differentiation are low speed and the difficulty of converting a computer program into a single mathematical expression; moreover, many functions grow quite complex as higher derivatives are calculated. Two important drawbacks of finite differences are round-off errors in the discretization process and cancellation. Both classical methods have problems with calculating higher derivatives, where the complexity and errors increase. Automatic differentiation solves all of these problems.

AD exploits the fact that any computer program that implements a vector function y = F(x) can (in general) be decomposed into a sequence of elementary assignments, any one of which may be trivially differentiated by a simple table lookup. These elemental partial derivatives, evaluated at a particular argument, are combined in accordance with the chain rule of calculus to form derivative information for F (such as gradients, tangents, the Jacobian matrix, etc.). This process yields exact (to numerical accuracy) derivatives. Because the symbolic transformation occurs only at the most basic level, AD avoids the computational problems inherent in complex symbolic computation.

Derivation of augmented arithmetic for Forward-Mode AD using dual numbers

Forward-mode automatic differentiation is accomplished by augmenting the algebra of real numbers to obtain a new arithmetic. An additional component is added to every number to represent the derivative of a function at that number, and all arithmetic operators are extended to the augmented algebra. The augmented algebra is the algebra of dual numbers.

Replace every number x with the number x + x'\varepsilon, where x' is a real number, but \varepsilon is nothing but a symbol with the property \varepsilon^2 = 0. Using only this, we get for the regular arithmetic

(x + x'\varepsilon) + (y + y'\varepsilon) = x + y + (x' + y')\varepsilon
(x + x'\varepsilon) \cdot (y + y'\varepsilon) = xy + xy'\varepsilon + yx'\varepsilon + x'y'\varepsilon^2 = xy + (x y' + yx')\varepsilon

and likewise for subtraction and division.

Now, we may calculate polynomials in this augmented arithmetic. If P(x) = p_0 + p_1 x + p_2x^2 + \cdots + p_n x^n, then

P(x + x'\varepsilon) = p_0 + p_1(x + x'\varepsilon) + \cdots + p_n (x + x'\varepsilon)^n
= p_0 + p_1 x + \cdots + p_n x^n + \left(p_1 + 2p_2 x + \cdots + n p_n x^{n-1}\right) x'\varepsilon
= P(x) + P'(x)\,x'\varepsilon

The coefficient x' can be chosen arbitrarily for now; it will later be called the seed.

The new arithmetic consists of ordered pairs, written \langle x, x' \rangle, with ordinary arithmetic on the first component and first-order differentiation arithmetic on the second component, derived as above. Extending the above results on polynomials to analytic functions gives the basic arithmetic and some standard functions for the new arithmetic:

\langle u,u'\rangle +\langle v,v'\rangle = \langle u+v, u'+v' \rangle
\langle u,u'\rangle -\langle v,v'\rangle = \langle u-v, u'-v' \rangle
\langle u,u'\rangle *\langle v,v'\rangle = \langle u v, u'v+uv' \rangle
\langle u,u'\rangle /\langle v,v'\rangle = \left\langle \frac{u}{v}, \frac{u'v-uv'}{v^2} \right\rangle \quad ( v\ne 0)
\sin\langle u,u'\rangle = \langle \sin(u) , u' \cos(u) \rangle
\cos\langle u,u'\rangle = \langle \cos(u) , -u' \sin(u) \rangle
\exp\langle u,u'\rangle = \langle \exp u , u' \exp u \rangle
\log\langle u,u'\rangle = \langle \log(u) , u'/u \rangle \quad (u>0)
\langle u,u'\rangle^k = \langle u^k , k u^{k-1} u' \rangle \quad (u \ne 0)
\left| \langle u,u'\rangle \right| = \langle \left| u \right| , u' \operatorname{sign}(u) \rangle \quad (u \ne 0)

and in general for the primitive function g,

g(\langle u,u' \rangle , \langle v,v' \rangle ) = \langle g(u,v) , g^{(1)}(u,v) u' + g^{(2)}(u,v) v' \rangle

where g^{(1)} and g^{(2)} are the derivatives of g with respect to its first and second arguments, respectively.

Also, when a binary basic arithmetic operation is applied to mixed arguments, say the ordered pair \langle u,u' \rangle and the real number c, the real number is first lifted to \langle c,0 \rangle. The derivative of a function f : \mathbb{R}\rightarrow\mathbb{R} at the point x_0 is now found by calculating f(\langle x_0, 1 \rangle) using the above arithmetic, which gives \langle f(x_0) , f'(x_0) \rangle as the result.
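As a concrete illustration, the following is a minimal sketch of this augmented arithmetic in Python, implemented as a small dual-number class. The class name, the chosen operations, and the evaluation point are assumptions made for this example rather than part of any standard AD tool.

import math

class Dual:
    """Ordered pair <value, derivative> implementing the arithmetic listed above."""
    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)  # lift constant c to <c, 0>
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)  # product rule

    __rmul__ = __mul__

def sin(u):
    """sin<u, u'> = <sin(u), u' cos(u)>"""
    return Dual(math.sin(u.value), u.deriv * math.cos(u.value))

# Derivative of f(x1, x2) = x1*x2 + sin(x1) with respect to x1 at (2, 3):
def f(x1, x2):
    return x1 * x2 + sin(x1)

result = f(Dual(2.0, 1.0), Dual(3.0, 0.0))   # seeds: x1' = 1, x2' = 0
print(result.value, result.deriv)            # f(2, 3) and x2 + cos(x1)

Evaluating f(⟨2, 1⟩, ⟨3, 0⟩) returns both the function value f(2, 3) and the partial derivative with respect to x1, matching the seeded forward sweep described below.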

Vector arguments and functions

Multivariate functions can be handled with the same efficiency and mechanisms as univariate functions by adopting a directional derivative operator, which finds the directional derivative y' in \mathbb{R}^m of f:\mathbb{R}^n\rightarrow\mathbb{R}^m at x in \mathbb{R}^n in the direction x' in \mathbb{R}^n by calculating (\langle y_1,y'_1\rangle, \ldots, \langle y_m,y'_m\rangle) = f(\langle x_1,x'_1\rangle, \ldots, \langle x_n,x'_n\rangle) using the same arithmetic as above.

Higher order differentials

The above arithmetic can be generalized, in the natural way, to second-order and higher derivatives. However, the arithmetic rules quickly grow very complicated: the complexity is quadratic in the highest derivative degree. Instead, truncated Taylor series arithmetic is used. This is possible because the Taylor summands in a Taylor series of a function are products of known coefficients and derivatives of the function. Computation of Hessians using AD has proven useful in some optimization contexts.
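To make the idea concrete, here is a minimal sketch of degree-2 truncated Taylor series arithmetic in Python; the class name, the supported operations, and the test function x^3 are assumptions chosen for illustration.

class Taylor2:
    """Truncated Taylor series a0 + a1*t + a2*t^2 around the evaluation point."""
    def __init__(self, a0, a1=0.0, a2=0.0):
        self.c = (a0, a1, a2)

    def __add__(self, other):
        other = other if isinstance(other, Taylor2) else Taylor2(other)
        return Taylor2(*(a + b for a, b in zip(self.c, other.c)))

    def __mul__(self, other):
        other = other if isinstance(other, Taylor2) else Taylor2(other)
        a, b = self.c, other.c
        return Taylor2(a[0] * b[0],
                       a[0] * b[1] + a[1] * b[0],
                       a[0] * b[2] + a[1] * b[1] + a[2] * b[0])  # convolution truncated at t^2

# Propagate x = x0 + 1*t through f(x) = x*x*x and read off f, f', f''/2:
x = Taylor2(2.0, 1.0)
y = x * x * x
value, first, half_second = y.c
print(value, first, 2.0 * half_second)   # 8.0, 12.0, 12.0 (x^3, 3x^2, 6x at x0 = 2)

The coefficients of the result are the Taylor coefficients of f at x0, so the k-th derivative is recovered by multiplying the coefficient of t^k by k!.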

The chain rule, forward and reverse accumulation

Fundamental to AD is the decomposition of differentials provided by the chain rule. For the simple composition f(x) = g(h(x)) the chain rule gives

\frac{df}{dx} = \frac{dg}{dh} \frac{dh}{dx}

Usually, two distinct modes of AD are presented: forward accumulation (or forward mode) and reverse accumulation (or reverse mode; the use of "backward" is discouraged). Forward accumulation traverses the chain rule from right to left, that is, it first computes dh/dx and then dg/dh; reverse accumulation traverses it from left to right.

Figure 2: Example of forward accumulation with computational graph

Forward accumulation

Forward accumulation automatic differentiation is the easiest to understand and to implement. The function f(x1,x2) = x1x2 + sin(x1) is interpreted by a computer (or the human programmer) as the sequence of elementary operations on the work variables wi below, and an AD tool for forward accumulation adds the corresponding operations on the second component of the augmented arithmetic.

Original code statements        Added statements for derivatives
w1 = x1                         w'1 = 1 (seed)
w2 = x2                         w'2 = 0 (seed)
w3 = w1 · w2                    w'3 = w'1 · w2 + w1 · w'2 = 1 · x2 + x1 · 0 = x2
w4 = sin(w1)                    w'4 = cos(w1) · w'1 = cos(x1) · 1
w5 = w3 + w4                    w'5 = w'3 + w'4 = x2 + cos(x1)

The derivative computation for f(x1,x2) = x1x2 + sin(x1) must be seeded in order to distinguish between the derivative with respect to x1 and the derivative with respect to x2. The table above seeds the computation with w'1 = 1 and w'2 = 0, and the result is x2 + cos(x1), which is the derivative with respect to x1. Note that although the table displays the symbolic derivatives, in the computer it is always the evaluated (numeric) values that are stored. Figure 2 represents the above statements in a computational graph.

In order to compute the gradient of this example function, that is \partial f/\partial x_1 and \partial f / \partial x_2, two sweeps over the computational graph are needed, first with the seeds w'1 = 1 and w'2 = 0, then with w'1 = 0 and w'2 = 1.

The computational complexity of one sweep of forward accumulation is proportional to the complexity of the original code.

Forward accumulation is superior to reverse accumulation for functions f:\mathbb{R} \rightarrow \mathbb{R}^m with m \gg 1 as only one sweep is necessary, compared to m sweeps for reverse accumulation.

Figure 3: Example of reverse accumulation with computational graph

Reverse accumulation

Reverse accumulation traverses the chain rule from left to right, or in the case of the computational graph in Figure 3, from top to bottom. The example function is real-valued, and thus there is only one seed for the derivative computation, and only one sweep of the computational graph is needed in order to calculate the (two-component) gradient. This is only half the work when compared to forward accumulation, but reverse accumulation requires the storage of some of the work variables wi, which may represent a significant memory issue.

The data flow graph of a computation can be manipulated to calculate the gradient of its original calculation. This is done by adding an adjoint node for each primal node, connected by adjoint edges which parallel the primal edges but flow in the opposite direction. The nodes in the adjoint graph represent multiplication by the derivatives of the functions calculated by the nodes in the primal. For instance, addition in the primal causes fanout in the adjoint; fanout in the primal causes addition in the adjoint; a unary function y = f(x) in the primal causes x' = f'(x)y' in the adjoint; etc.
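The following is a minimal sketch of such a reverse sweep, written out by hand in Python for the example function f(x1,x2) = x1x2 + sin(x1); the variable names and the evaluation point are illustrative assumptions, not the output of an AD tool.

import math

def f_and_gradient(x1, x2):
    # forward sweep: record the work variables w_i
    w1 = x1
    w2 = x2
    w3 = w1 * w2
    w4 = math.sin(w1)
    w5 = w3 + w4                       # f(x1, x2)

    # reverse sweep: propagate adjoints wbar_i = df/dw_i from the output back
    w5_bar = 1.0                       # seed: derivative of f with respect to itself
    w3_bar = w5_bar                    # w5 = w3 + w4
    w4_bar = w5_bar
    w1_bar = w4_bar * math.cos(w1)     # w4 = sin(w1)
    w1_bar += w3_bar * w2              # w3 = w1 * w2 (fan-out on w1 accumulates)
    w2_bar = w3_bar * w1

    return w5, (w1_bar, w2_bar)        # value and full gradient in one sweep

print(f_and_gradient(2.0, 3.0))        # (6 + sin 2, (3 + cos 2, 2))

Note how the fan-out of w1 in the primal (it feeds both w3 and w4) becomes an accumulation of adjoint contributions in the reverse sweep, as described above.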

Reverse accumulation is superior to forward accumulation for functions f:\mathbb{R}^n \rightarrow \mathbb{R} with n \gg 1.

Backpropagation of errors in multilayer perceptrons, a technique used in machine learning, is a special case of reverse mode AD.

Jacobian computation

The Jacobian J of f:\mathbb{R}^n \rightarrow \mathbb{R}^m is an m \times n matrix. The Jacobian can be computed using n sweeps of forward accumulation, of which each sweep can yield a column vector of the Jacobian, or with m sweeps of reverse accumulation, of which each sweep can yield a row vector of the Jacobian.
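As an illustration, the sketch below assembles the Jacobian of a small vector function column by column using forward sweeps; the function f(x1,x2) = (x1·x2, sin(x1) + x2), the hand-written dual propagation, and all names are assumptions made for this example.

import math

def f_jvp(x, xdot):
    """One forward sweep: value of f and the directional derivative J(x) @ xdot."""
    (x1, x2), (d1, d2) = x, xdot
    y1, y1dot = x1 * x2, d1 * x2 + x1 * d2                   # y1 = x1 * x2
    y2, y2dot = math.sin(x1) + x2, math.cos(x1) * d1 + d2    # y2 = sin(x1) + x2
    return (y1, y2), (y1dot, y2dot)

x = (2.0, 3.0)
columns = []
for j in range(2):
    seed = tuple(1.0 if i == j else 0.0 for i in range(2))   # seed direction e_j
    _, col = f_jvp(x, seed)
    columns.append(col)

# Transpose the collected columns into rows to print the 2x2 Jacobian.
jacobian = [[columns[j][i] for j in range(2)] for i in range(2)]
print(jacobian)   # [[x2, x1], [cos(x1), 1]] evaluated at (2, 3)

Each forward sweep with seed e_j yields the j-th column of the Jacobian; a reverse sweep would instead yield one row per sweep.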

Beyond forward and reverse accumulation

Forward and reverse accumulation are just two (extreme) ways of traversing the chain rule. In order to compute a full Jacobian of f:\mathbb{R}^n \rightarrow \mathbb{R}^m, there is a path somewhere in between which is optimal; however, locating that path is probably an NP-hard problem.

Implementation

Forward-mode AD is implemented by a nonstandard interpretation of the program in which real numbers are replaced by dual numbers, constants are lifted to dual numbers with a zero epsilon coefficient, and the numeric primitives are lifted to operate on dual numbers. This nonstandard interpretation is generally implemented using one of two strategies: source code transformation or operator overloading.

Source code transformation

Figure 4: Example of how source code transformation could work

The source code for a function is replaced by an automatically generated source code that includes statements for calculating the derivatives interleaved with the original instructions.
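As a minimal sketch of what such generated code might look like, the following shows a hand-written Python version of the transformation for f(x1,x2) = x1x2 + sin(x1); the generated function's name, argument order, and statement layout are assumptions, not the output of any particular AD tool.

import math

# Original source:
def f(x1, x2):
    w3 = x1 * x2
    w4 = math.sin(x1)
    return w3 + w4

# Automatically generated source: every statement is paired with a statement
# for its derivative, and the seeds dx1, dx2 become extra arguments.
def f_d(x1, dx1, x2, dx2):
    w3 = x1 * x2
    dw3 = dx1 * x2 + x1 * dx2
    w4 = math.sin(x1)
    dw4 = math.cos(x1) * dx1
    return w3 + w4, dw3 + dw4

print(f_d(2.0, 1.0, 3.0, 0.0))   # f(2, 3) and df/dx1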

Source code transformation can be implemented for all programming languages, and it also makes it easier for the compiler to perform compile-time optimizations. However, the implementation of the AD tool itself is more difficult.

Examples:

Operator overloading

Figure 5: Example of how operator overloading could work

Operator overloading is a possibility for source code written in a language that supports it. The objects representing real numbers and the elementary mathematical operations must be overloaded to cater for the augmented arithmetic depicted above. This requires no change in the original source code for the function to be differentiated.

Operator overloading for forward accumulation is easy to implement, and is also possible for reverse accumulation. However, current compilers lag behind in optimizing the code when compared to source code transformation.

Example:

References

  • Andreas Griewank (2000), Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, SIAM, ISBN 0-89871-451-6
