Kochanski multiplication

Kochanski multiplication^[1] is an algorithm that allows modular arithmetic (multiplication or operations based on it, such as exponentiation) to be performed efficiently when the modulus is large (typically several hundred bits). This has particular application in number theory and in cryptography: for example, in the RSA cryptosystem and Diffie-Hellman key exchange.

The most common way of implementing large-integer multiplication in hardware is to express the multiplier in binary and enumerate its bits, one bit at a time, starting with the most significant bit, perform the following operations on an accumulator:

Double the contents of the accumulator (if the accumulator stores numbers in binary, as is usually the case, this is a simple "shift left" that requires no actual computation).
If the current bit of the multiplier is 1, add the multiplicand into the accumulator; if it is 0, do nothing.

For an n-bit multiplier, this will take n clock cycles (where each cycle does either a shift or a shift-and-add).

To convert this into an algorithm for modular multiplication, with a modulus r, it is necessary to subtract r conditionally at each stage:

Double the contents of the accumulator.
If the result is greater than or equal to r, subtract r. (Equivalently, subtract r from the accumulator and store the result back into the accumulator if and only if it is non-negative).
If the current bit of the multiplier is 1, add the multiplicand into the accumulator; if it is 0, do nothing.
If the result of the addition is greater than or equal to r, subtract r. If no addition took place, do nothing.

This algorithm works. However, it is critically dependent on the speed of addition.

Addition of long integers suffers from the problem that carries have to be propagated from right to left and the final result is not known until this process has been completed. Carry propagation can be speeded up with carry look-ahead logic, but this still makes addition very much slower than it needs to be (for 512-bit addition, addition with carry look-ahead is 32 times slower than addition without carries at all).

Non-modular multiplication can make use of carry-save adders, which save time by storing the carries from each digit position and using them later: for example, by computing 111111111111+000000000010 as 111111111121 instead waiting for the carry to propagate through the whole number to yield the true binary value 1000000000001. That final propagation still has to be done to yield a binary result but this only needs to be done once at the very end of the multiplication.

Unfortunately the modular multiplication method outlined above needs to know the magnitude of the accumulated value at every step, in order to decide whether to subtract r: for example, if it needs to know whether the value in the accumulator is greater than 1000000000000, the carry-save representation 111111111121 is useless and needs to be converted to its true binary value for the comparison to be made.

It therefore seems that one can have either the speed of carry-save or modular multiplication, but not both.

Outline of the algorithm

The principle of the Kochanski algorithm is one of making guesses as to whether or not r should be subtracted, based on the most significant few bits of the carry-save value in the accumulator. Such a guess will be wrong some of the time, since there is no way of knowing whether latent carries in the less significant digits (which have not been examined) might not invalidate the result of the comparison. Thus:

A subtraction may not have been made when one was required. In that case the result in the accumulator is greater than r (although the algorithm doesn't know it yet), and so after the next shift left, 2r will need to be subtracted from the accumulator.

A subtraction may have been made when one was not required. In that case the result in the accumulator is less than 0 (although the algorithm doesn't know it yet), and so after the next shift left, r or even 2r will need to be added back to the accumulator to make it positive again.

What is happening is essentially a race between the errors that result from wrong guesses, which double with every shift left, and the corrections made by adding or subtracting multiples of r based on a guess of what the errors may be.

It turns out^[2] that examining the most significant 4 bits of the accumulator is sufficient to keep the errors within bounds and that the only values that need to be added to the accumulator are -2r, -r, 0, +r, and +2r, all of which can be generated instantaneously by simple shifts and negations.

At the end of a complete modular multiplication, the true binary result of the operation has to be evaluated and it is possible that an additional addition or subtraction of r will be needed as a result of the carries that are then discovered; but the cost of that extra step is small when amortized over the hundreds of shift-and-add steps that dominate the overall cost of the multiplication.

Alternatives

Brickell^[3] has published a similar algorithm that requires greater complexity in the electronics for each digit of the accumulator.

Montgomery multiplication is an alternative algorithm that processed the multiplier "backwards" (least significant digit first) and uses the least significant digit of the accumulator to control whether or not the modulus should be added/subtracted. This avoids the need for carries to propagate. However, the algorithm is impractical for single modular multiplications, since two or three additional Montgomery steps have to be performed to convert the operands into a special form before processing and to convert the result back into conventional binary at the end.

References

↑ M.J. Kochanski, "Developing an RSA Chip", in Advances in Cryptology: Proceedings of CRYPTO 85, Springer-Verlag, Berlin (1985), 3-540-16463-4.
↑ M.J. Kochanski, A New Method of Serial Modular Multiplication describes the algorithm in full detail.
↑ Ernest F. Brickell, "A Fast Modular Multiplication Algorithm with Applications to Two Key Cryptography", in Advances in Cryptology: Proceedings of CRYPTO '82, Plenum, New York (1983), pp. 51-60.

Creating the FAP4 Chip Informal explanation and motivation of the algorithm, with details of an actual hardware implementation.