General Matrix Multiply

General Matrix Multiply (GEMM) is a subroutine in the Basic Linear Algebra Subprograms (BLAS) that multiplies two matrices. It is provided in four variants, one for each data type:

SGEMM for single precision,
DGEMM for double precision,
CGEMM for complex single precision, and
ZGEMM for complex double precision.

GEMM is often tuned by high-performance computing vendors to run as fast as possible, because it is the building block for many other routines. It is also the most important routine in the LINPACK benchmark. For these reasons, implementations of fast BLAS libraries typically focus first on GEMM performance.

Operation

The xGEMM routine computes the new value of matrix C from the matrix product of matrices A and B and the old value of matrix C:

$C \leftarrow \alpha A B + \beta C$

where $\alpha$ and $\beta$ are scalar coefficients.
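As a rough illustration of the update above, the following pure-Python sketch computes $\alpha A B + \beta C$ with three nested loops. The function name `gemm_reference` and the list-of-rows storage are assumptions made for this example and are not part of BLAS; a real GEMM overwrites C in place and is far more heavily optimized.

```python
def gemm_reference(alpha, A, B, beta, C):
    """Return alpha * (A @ B) + beta * C for matrices stored as lists of rows.

    Hypothetical helper for illustration only; not a BLAS routine.
    """
    m, k, n = len(A), len(B), len(B[0])
    assert all(len(row) == k for row in A), "inner dimensions must match"
    assert len(C) == m and all(len(row) == n for row in C), "C must be m x n"
    result = []
    for i in range(m):
        row = []
        for j in range(n):
            # Dot product of row i of A with column j of B.
            dot = sum(A[i][p] * B[p][j] for p in range(k))
            row.append(alpha * dot + beta * C[i][j])
        result.append(row)
    return result


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
updated = gemm_reference(2.0, A, B, 0.5, C)
print(updated)  # [[38.5, 44.5], [86.5, 100.5]]
```

Here the product AB is [[19, 22], [43, 50]], which is scaled by 2 and then added to half of the old C.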

The Fortran interface for these routines is:

SUBROUTINE xGEMM ( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC )

where TRANSA and TRANSB determine whether the matrices A and B are to be transposed before the multiplication. With any transposition applied, M is the number of rows of A and C, N is the number of columns of B and C, and K is the number of columns of A and rows of B. LDA, LDB and LDC specify the size of the first (leading) dimension of each matrix as laid out in memory, that is, the distance in elements between the starts of consecutive columns or rows, depending on the storage order. In the Fortran interface the storage is column-major, so it is the distance between the starts of consecutive columns.
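To make the roles of the transpose flags and leading dimensions concrete, here is a minimal Python sketch that mirrors the argument list above on flat, column-major lists. The name `gemm_colmajor` is hypothetical, only the flags 'N' and 'T' are handled, and argument checking and conjugate transposition are left out, so it should be read as an illustration rather than a reference implementation.

```python
def gemm_colmajor(transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc):
    """Update c in place: C <- alpha * op(A) * op(B) + beta * C.

    a, b and c are flat lists holding column-major matrices, so element
    (i, j) of a matrix with leading dimension ld sits at index i + j * ld.
    Hypothetical sketch: only 'N' (no transpose) and 'T' (transpose).
    """
    def op_a(i, p):
        # Element (i, p) of op(A); 'T' reads the stored matrix transposed.
        return a[p + i * lda] if transa == "T" else a[i + p * lda]

    def op_b(p, j):
        # Element (p, j) of op(B).
        return b[j + p * ldb] if transb == "T" else b[p + j * ldb]

    for j in range(n):
        for i in range(m):
            dot = sum(op_a(i, p) * op_b(p, j) for p in range(k))
            c[i + j * ldc] = alpha * dot + beta * c[i + j * ldc]


# A is 2 x 3, stored with lda = 3: each column carries one padding
# element (9.0) that the routine never reads.
a = [1.0, 4.0, 9.0,  2.0, 5.0, 9.0,  3.0, 6.0, 9.0]
# B is 3 x 2 with ldb = 3 (no padding).
b = [1.0, 0.0, 1.0,  0.0, 1.0, 1.0]
c = [0.0] * 4
gemm_colmajor("N", "N", 2, 2, 3, 1.0, a, 3, b, 3, 0.0, c, 2)

# The same product, with A supplied as its 3 x 2 transpose and TRANSA = 'T'.
a_t = [1.0, 2.0, 3.0,  4.0, 5.0, 6.0]
c_t = [0.0] * 4
gemm_colmajor("T", "N", 2, 2, 3, 1.0, a_t, 3, b, 3, 0.0, c_t, 2)
print(c, c_t)  # both [4.0, 10.0, 5.0, 11.0], i.e. [[4, 5], [10, 11]] column by column
```

A leading dimension larger than the row count, as with `a` here, is what lets a routine operate on a submatrix of a larger array without copying it.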

Optimization

GEMM is not only an important building block of other numerical software; it is also often a building block for GEMM calls on larger matrices. By decomposing one or both of the input matrices into block matrices, GEMM can be applied repeatedly to the smaller blocks to build up the result for the full matrix. This is one of the motivations for including the $\beta$ parameter: it allows the results of previous blocks to be accumulated in C. The accumulation relies on the special case $\beta = 1$, which many implementations optimize for, thereby eliminating one multiplication for each element of C.
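The block decomposition described above can be sketched in Python as follows. The names `gemm_kernel` and `gemm_blocked` and the `block` parameter are assumptions made for this example, and the sketch assumes K is at least 1; it is meant only to show how $\beta = 1$ lets partial products accumulate, not how any particular library organizes its loops.

```python
def gemm_kernel(alpha, A, B, beta, C, rows, cols, inner):
    """In-place GEMM on one block: rows/cols select a block of C, and
    inner selects the matching panel of A's columns and B's rows."""
    for i in rows:
        for j in cols:
            dot = sum(A[i][p] * B[p][j] for p in inner)
            # beta == 1 is the accumulation case: skip the multiplication.
            old = C[i][j] if beta == 1 else beta * C[i][j]
            C[i][j] = alpha * dot + old


def gemm_blocked(alpha, A, B, beta, C, block=2):
    """C <- alpha * A * B + beta * C, built from GEMM calls on small blocks."""
    m, k, n = len(A), len(B), len(B[0])
    for i0 in range(0, m, block):
        rows = range(i0, min(i0 + block, m))
        for j0 in range(0, n, block):
            cols = range(j0, min(j0 + block, n))
            for p0 in range(0, k, block):
                inner = range(p0, min(p0 + block, k))
                # The first panel applies the caller's beta to the old C;
                # later panels accumulate onto the partial result (beta = 1).
                gemm_kernel(alpha, A, B, beta if p0 == 0 else 1, C, rows, cols, inner)


# Integer test data keeps the arithmetic exact: A is 3 x 5, B is 5 x 4.
A = [[i + p for p in range(5)] for i in range(3)]
B = [[p * j + 1 for j in range(4)] for p in range(5)]
C0 = [[i - j for j in range(4)] for i in range(3)]

C = [row[:] for row in C0]
gemm_blocked(2, A, B, 3, C, block=2)

# Unblocked result for comparison: a single kernel call over the full ranges.
C_full = [row[:] for row in C0]
gemm_kernel(2, A, B, 3, C_full, range(3), range(4), range(5))
print(C == C_full)  # True
```

Each block of C is completed before the next one is started, so the block of C and the panels of A and B that feed it are reused while they are still likely to be in cache.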

This decomposition improves locality of reference, in both space and time, for the data used in the product, which in turn makes better use of the system's cache. For systems with more than one level of cache, the blocking can be applied a second time to the order in which the blocks are used in the computation. Both of these levels of optimization are used in implementations such as ATLAS.