Matrix multiplications in NumPy are reasonably fast without the need for optimization. However, if every second counts, it is possible to significantly improve performance (even without a GPU).

Below is a collection of small tricks that can help with large (~4000x4000) matrix multiplications. I have used them to reduce inference time in a deep neural network from 24 seconds to less than one second. In fact, in one case, my optimized code on a CPU turned out to run faster than TensorFlow using a GPU (1 second vs 7 seconds).

The first step is to measure everything.
On Linux, Python’s `time.time()` provides a high-resolution timer. Timing information will help direct your efforts.

In a neural network, most of the time will be spent with large matrix multiplications.

Extremely complex element-wise operations (such as chains of sigmoids) may have negligible performance impact when compared to a slow matrix multiplication. Until you measure the performance of each step in your algorithm, you don’t know what is affecting performance.

Ensure your arrays have a `dtype` of `numpy.float32`, rather than the default `numpy.float64`. I’ve seen this give a four-fold improvement or more.

(It may be tempting to try further reductions to `numpy.float16`, but this backfires because the CPU and BLAS libraries do not work natively at this precision.)
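For example, casting once at the boundary of your pipeline (sizes illustrative):

```python
import numpy as np

a = np.random.rand(1000, 1000)   # default dtype is float64
b = a.astype(np.float32)         # half the memory; takes the faster sgemm path

print(a.dtype, b.dtype)          # float64 float32
```

Products of `float32` arrays stay `float32`, so one cast up front is usually enough.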

BLAS is a high-performance matrix library. Even though NumPy uses BLAS internally, I've noticed performance can be improved by calling BLAS directly. Perhaps this is simply because direct calls to BLAS force you to arrange your data in a shape that is ready for BLAS.

Replace `numpy.matmul` with `scipy.linalg.blas.sgemm(...)` for `float32` matrix-matrix multiplication and `scipy.linalg.blas.sgemv(...)` for `float32` matrix-vector multiplication.
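A minimal sketch (shapes illustrative; `sgemm(alpha, a, b)` computes `alpha * a @ b`, and `sgemv(alpha, a, x)` computes `alpha * a @ x`):

```python
import numpy as np
from scipy.linalg.blas import sgemm, sgemv

a = np.asfortranarray(np.random.rand(4, 3), dtype=np.float32)
b = np.asfortranarray(np.random.rand(3, 2), dtype=np.float32)
x = np.random.rand(3).astype(np.float32)

c = sgemm(1.0, a, b)   # float32 matrix-matrix product
y = sgemv(1.0, a, x)   # float32 matrix-vector product
```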

Check that you’re using OpenBLAS or Intel MKL.

An easy way to check is to look at your CPU usage (e.g., with top). If your matrix multiplications are using a single core, then you may be using a slow BLAS. You can get over 2x performance improvements by using a multi-core BLAS library such as OpenBLAS or Intel MKL.
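Another quick check is to ask NumPy what it was built against:

```python
import numpy as np

# Prints the BLAS/LAPACK build configuration; look for
# "openblas" or "mkl" among the library names.
np.show_config()
```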

Check whether values in your matrices are stored in column-major (`order = 'F'`) or row-major (`order = 'C'`) order.

You can check the order by examining the `.flags` property of your matrix (pay attention to the `C_CONTIGUOUS` and `F_CONTIGUOUS` flags).

For example:

```
> a = np.array([[1,2],[3,4]])
> a.flags
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
```

You can change the order by copying your matrix (`numpy.copy(..., order='F')`) or by a transpose.

Changing the order of your matrices can improve performance (BLAS typically works better with column major order).

Hint: C and F stand for the orders used in the C and Fortran programming languages. BLAS prefers F because BLAS is written in Fortran.

The data order can have dramatic impact on performance.
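For example, `numpy.asfortranarray` makes a column-major copy (equivalent to `numpy.copy(..., order='F')`):

```python
import numpy as np

a = np.random.rand(3, 3).astype(np.float32)  # NumPy allocates C (row-major) order by default
f = np.asfortranarray(a)                     # column-major copy of the same values

print(a.flags['C_CONTIGUOUS'])  # True
print(f.flags['F_CONTIGUOUS'])  # True
```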

Each iteration in a recurrent neural network (including LSTMs) combines the input with the output state of the previous iteration.

The input processing can be partially factored out.

i.e., rather than `matmul(M, concatenate(x[i], h))`, the matrix `M` can be factored into `M_x` and `M_h`.

This means that `temp = matmul(M_x, x[:])` can be precomputed in a single large batch, and the iterative computation becomes `temp[i] + matmul(M_h, h)`.

This results in (small) benefits.
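A sketch of the factoring described above, using random weights and a `tanh` activation purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_x, d_h = 5, 8, 4                       # sequence length, input dim, state dim
x = rng.random((T, d_x), dtype=np.float32)  # all inputs x[0..T-1]
M = rng.random((d_h, d_x + d_h), dtype=np.float32)

# Split M into an input part M_x and a recurrent part M_h,
# so that matmul(M, concatenate(x[i], h)) == M_x @ x[i] + M_h @ h
M_x, M_h = M[:, :d_x], M[:, d_x:]

# Precompute the input contributions for the whole sequence in one batch
temp = x @ M_x.T                            # shape (T, d_h)

h = np.zeros(d_h, dtype=np.float32)
for i in range(T):
    h = np.tanh(temp[i] + M_h @ h)
```

Each step of the loop now multiplies only a `(d_h, d_h)` matrix instead of a `(d_h, d_x + d_h)` one.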

The inference vectors in a neural network trained with ReLU activations may be sparse.

This can be exploited, for example, by using the sparse vector `v` as a filter for the columns of a matrix: `M[:,v > 0.001]`.

I found that this sometimes resulted in a small improvement in run time.
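A sketch of the column-filtering trick (the threshold and sizes are illustrative):

```python
import numpy as np

M = np.random.rand(4, 6).astype(np.float32)
v = np.array([0.5, 0.0, 0.0, 1.2, 0.0, 0.3], dtype=np.float32)  # mostly zero, e.g. after a ReLU

mask = v > 0.001
y_full = np.matmul(M, v)                    # dense multiplication
y_sparse = np.matmul(M[:, mask], v[mask])   # only the columns that matter
```

The zero entries of `v` contribute nothing to the product, so dropping their columns gives the same result with less work.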

A large matrix can be approximated by computing the Singular Value Decomposition (SVD). Computing an SVD is too slow to be done online. However, if one of your matrices is constant, then ‘precomputation’ can pay off.

Consider the multiplication `y = matmul(A, x)`.

Such a multiplication can be approximated by two lower rank multiplications:

```
U, s, V = numpy.linalg.svd(A) # Very slow, so precompute!
rank = len(s) // 3 # Compression by a factor of 3 (integer division, so rank can be used as an index)
y = matmul(V[:rank,:],x)
y *= s[:rank]
y = matmul(U[:,:rank], y)
```

Hint: precompute and copy the views `V[:rank,:]`, `s[:rank]`, and `U[:,:rank]`, and don't forget to cast the SVD down to `float32`.

Reducing a single 2000x2000 matrix multiplication to a 100x2000 followed by a 2000x100 multiplication (for example) can make a big difference!

I've found that reducing the rank of a matrix by a third or more can have negligible impact on the accuracy of an RNN yet produce dramatic improvements in performance.
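Putting it together, a runnable sketch (the matrix here is constructed with a known low rank so the truncation loses almost nothing; real weight matrices will trade some accuracy):

```python
import numpy as np

rng = np.random.default_rng(0)
# A constant matrix with true rank 50, so a rank-66 cut is near-lossless
A = (rng.random((200, 50)) @ rng.random((50, 200))).astype(np.float32)
x = rng.random(200).astype(np.float32)

# Offline: precompute the truncated, float32, contiguous factors
U, s, V = np.linalg.svd(A)
rank = len(s) // 3
U_r = U[:, :rank].astype(np.float32).copy()
s_r = s[:rank].astype(np.float32).copy()
V_r = V[:rank, :].astype(np.float32).copy()

# Online: two thin multiplications instead of one big one
y_approx = U_r @ (s_r * (V_r @ x))
y_exact = A @ x
```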

Let me know if you have other suggestions/tricks to recommend. Happy optimizing!

Published 21 January 2018 by Benjamin Johnston.