Automatic differentiation

In mathematics and computer algebra, automatic differentiation (AD), also called algorithmic differentiation, is a method to numerically evaluate the derivative of a function specified by a computer program. Two classical alternatives for computing such derivatives are:

  • symbolically differentiate the function as an expression, and evaluate the result at the point of interest; or
  • use finite differences.

The drawbacks of symbolic differentiation are low speed and the difficulty of converting a computer program into a single mathematical expression; moreover, the resulting expressions often grow quite complex as higher derivatives are calculated. Two important drawbacks of finite differences are round-off error in the discretization process and cancellation. Both classical methods have problems with calculating higher derivatives, where complexity and error increase. Automatic differentiation solves all of these problems.

There are two modes of AD, called forward accumulation and reverse accumulation. Forward accumulation is simpler to implement and is efficient for functions with few inputs, while reverse accumulation allows the efficient calculation of gradients of functions with many inputs.

Forward accumulation AD

Forward accumulation introduces a new arithmetic whose objects are ordered pairs. Ordinary arithmetic is applied to the first element of each pair, and a first-order differentiation arithmetic, obtained from the chain rule, is applied to the second element. The basic arithmetic operations and some standard functions are defined by

\langle u,u'\rangle +\langle v,v'\rangle = \langle u+v, u'+v' \rangle
\langle u,u'\rangle -\langle v,v'\rangle = \langle u-v, u'-v' \rangle
\langle u,u'\rangle *\langle v,v'\rangle = \langle u v, u'v+uv' \rangle
\langle u,u'\rangle /\langle v,v'\rangle = \left\langle \frac{u}{v}, \frac{u'v-uv'}{v^2} \right\rangle \quad ( v\ne 0)
\sin\langle u,u'\rangle = \langle \sin(u) , u' \cos(u) \rangle
\cos\langle u,u'\rangle = \langle \cos(u) , -u' \sin(u) \rangle
\exp\langle u,u'\rangle = \langle \exp u , u' \exp u \rangle
\log\langle u,u'\rangle = \langle \log(u) , u'/u \rangle \quad (u>0)
\langle u,u'\rangle^k = \langle u^k , k u^{k-1} u' \rangle \quad (u \ne 0)
\left| \langle u,u'\rangle \right| = \langle \left| u \right| , u' \operatorname{sign}(u) \rangle \quad (u \ne 0)

and in general for the primitive function g,

g(\langle u,u' \rangle , \langle v,v' \rangle ) = \langle g(u,v) , g^{(1)}(u,v) u' + g^{(2)}(u,v) v' \rangle

where g^{(1)} and g^{(2)} are the derivatives of g with respect to its first and second arguments, respectively. Also, when a binary basic arithmetic operation is applied to mixed arguments, say the ordered pair \langle u,u' \rangle and the real number c, the real number is first lifted to \langle c,0 \rangle.

The derivative of a function f:\Re\rightarrow\Re at the point x_0 is now found by calculating f(\langle x_0, 1 \rangle) using the above arithmetic, which gives \langle f(x_0) , f'(x_0) \rangle as the result.
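
For illustration, here is a minimal Python sketch of this ordered-pair (dual number) arithmetic; the class name Dual, the lift helper and the test function are illustrative choices, not part of any standard library.

import math

class Dual:
    """Ordered pair <u, u'> holding a value and its derivative."""
    def __init__(self, value, deriv=0.0):
        self.value = value  # u
        self.deriv = deriv  # u'

    @staticmethod
    def lift(c):
        # A plain real number c is lifted to the pair <c, 0>.
        return c if isinstance(c, Dual) else Dual(c, 0.0)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)
    __rmul__ = __mul__

    def __truediv__(self, other):
        other = Dual.lift(other)
        return Dual(self.value / other.value,
                    (self.deriv * other.value - self.value * other.deriv)
                    / other.value ** 2)

def sin(u):
    u = Dual.lift(u)
    return Dual(math.sin(u.value), u.deriv * math.cos(u.value))

def exp(u):
    u = Dual.lift(u)
    return Dual(math.exp(u.value), u.deriv * math.exp(u.value))

# Derivative of f(x) = x * sin(x) + exp(x) at x_0 = 2:
# evaluate f(<x_0, 1>) and read off the second component.
def f(x):
    return x * sin(x) + exp(x)

y = f(Dual(2.0, 1.0))
print(y.value, y.deriv)  # prints f(2) and f'(2)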

Multivariate functions can be handled with the same efficiency and mechanisms as univariate functions by adopting a directional derivative operator, which finds the directional derivative y':\Re^m of f:\Re^n\rightarrow\Re^m at x:\Re^n in the direction x':\Re^n by calculating (\langle y_1,y'_1\rangle, \ldots, \langle y_m,y'_m\rangle) = f(\langle x_1,x'_1\rangle, \ldots, \langle x_n,x'_n\rangle) using the same arithmetic as above.
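
Continuing the illustrative Dual sketch above, a directional derivative is obtained by seeding the second component of every input with the corresponding component of the direction vector x':

# Directional derivative of f(x1, x2) = x1 * x2 + sin(x1)
# at x = (2, 3) in the direction x' = (1, 0.5),
# reusing the Dual class and sin from the sketch above.
def g(x1, x2):
    return x1 * x2 + sin(x1)

y = g(Dual(2.0, 1.0), Dual(3.0, 0.5))
print(y.value, y.deriv)  # f(2, 3) and the directional derivative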

Higher order forward accumulation AD

The above arithmetic can be generalized in the natural way to second-order and higher derivatives, but the arithmetic rules quickly become very complicated. Instead, truncated Taylor series arithmetic is used. This is possible because each summand of a function's Taylor series is the product of a known coefficient and a derivative of the function.
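
As an illustration of the idea (not of any particular implementation), here is a Python sketch of a second-order truncated arithmetic; for simplicity it stores the derivatives u, u', u'' directly rather than the scaled Taylor coefficients u^{(k)}/k!.

import math

class Taylor2:
    """Truncated triple <u, u', u''> for second-order forward AD."""
    def __init__(self, u, du=0.0, ddu=0.0):
        self.u, self.du, self.ddu = u, du, ddu

    def __add__(self, other):
        return Taylor2(self.u + other.u, self.du + other.du, self.ddu + other.ddu)

    def __mul__(self, other):
        # Product rule applied twice: (uv)'' = u''v + 2 u'v' + u v''
        return Taylor2(self.u * other.u,
                       self.du * other.u + self.u * other.du,
                       self.ddu * other.u + 2 * self.du * other.du
                       + self.u * other.ddu)

def sin2(t):
    # (sin u)'  = u' cos u
    # (sin u)'' = u'' cos u - (u')^2 sin u
    return Taylor2(math.sin(t.u),
                   t.du * math.cos(t.u),
                   t.ddu * math.cos(t.u) - t.du ** 2 * math.sin(t.u))

# f(x) = x * sin(x); seed the input with <x_0, 1, 0>.
x = Taylor2(1.0, 1.0, 0.0)
y = x * sin2(x)
print(y.u, y.du, y.ddu)  # f(1), f'(1), f''(1)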

Reverse accumulation AD

The data flow graph of a computation can be manipulated to calculate the gradient of its original calculation. This is done by adding an adjoint node for each primal node, connected by adjoint edges which parallel the primal edges but flow in the opposite direction. The nodes in the adjoint graph represent multiplication by the derivatives of the functions calculated by the nodes in the primal. For instance, addition in the primal causes fanout in the adjoint; fanout in the primal causes addition in the adjoint; a unary function y = f(x) in the primal causes x' = f'(x)y' in the adjoint; etc.
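
A minimal Python sketch may make this concrete; the class Var, the parent/adjoint bookkeeping and the backward routine are illustrative names, not a reference implementation. Each operation records its inputs together with the local partial derivatives, and a reverse sweep over the recorded graph accumulates the adjoints.

import math

class Var:
    """A node in the data flow graph: value, parents, and adjoint."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples (parent_node, local_partial_derivative)
        self.adjoint = 0.0

    def __add__(self, other):
        # Addition in the primal becomes fan-out in the adjoint graph.
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(v):
    # Unary y = sin(x) in the primal contributes cos(x) * y_adjoint to x_adjoint.
    return Var(math.sin(v.value), ((v, math.cos(v.value)),))

def backward(output):
    # Order the graph topologically, then sweep it in reverse so every
    # node's adjoint is complete before it is propagated to its parents.
    order, seen = [], set()
    def visit(node):
        if node not in seen:
            seen.add(node)
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.adjoint = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.adjoint += local * node.adjoint

# Gradient of f(x1, x2) = x1 * x2 + sin(x1) at (2, 3) in one reverse sweep.
x1, x2 = Var(2.0), Var(3.0)
backward(x1 * x2 + sin(x1))
print(x1.adjoint, x2.adjoint)  # df/dx1 = x2 + cos(x1), df/dx2 = x1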

Backpropagation of errors in multilayer perceptrons, a technique used in machine learning, is a special case of reverse mode AD.

Software

Numerous libraries and tools implement automatic differentiation for a variety of programming languages.
