
Speed Up Numpy Nested Loop

I am writing a simulation of a wireless network in Python using NumPy and Cython. Suppose there are no_nodes nodes scattered randomly on the 2D plane, each of which sends out a waveform; every receiver accumulates the waveforms of all other nodes, scaled by a pairwise attenuation factor, which leads to a nested loop over all node pairs of the form waveforms[i] += output[j] * attenuation[i, j]. How can I speed this loop up?

Solution 1:

This is a memory-bandwidth-bound problem. With about 3 GB/s of memory bandwidth, the best you can get out of this is around 2-4 ms for the inner loop. To reach that bound you need to block your inner loop so that it utilizes the CPU caches better (numexpr does this for you):

for i in range(no_nodes):
    for j in range(no_nodes):
        # block size: should be chosen so all operands fit in the
        # (next-to-)last-level cache; the first level is normally too
        # small to be usable due to Python overhead
        s = 15000
        a = attenuation[i, j]
        o = output[j]
        w = waveforms[i]
        for k in range(0, w.size, s):
            u = min(k + s, w.size)
            w[k:u] += o[k:u] * a
        # or: numexpr.evaluate("w + o * a", out=w)

Using float32 data instead of float64 should also halve the memory bandwidth requirements and double the performance.

To get larger speedups you have to redesign your full algorithm to have better data locality.
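
For completeness, here is a minimal, self-contained sketch of the numexpr variant mentioned in the comment above. The array sizes are made up purely for illustration, and the float32 suggestion would just be an extra .astype(np.float32) on the three arrays:

import numpy as np
import numexpr as ne

no_nodes, n_samples = 20, 700000          # made-up sizes, for illustration only
output = np.random.rand(no_nodes, n_samples)
attenuation = np.random.rand(no_nodes, no_nodes)
waveforms = np.zeros((no_nodes, n_samples))

for i in range(no_nodes):
    for j in range(no_nodes):
        a = attenuation[i, j]
        o = output[j]
        w = waveforms[i]
        # numexpr evaluates the expression in cache-sized blocks internally
        ne.evaluate("w + o * a", out=w)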

Solution 2:

I think it is memory access that's crippling your performance, and there is probably little you can do about that. You can speed things up a little by using in-place operations and by preallocating and reusing buffers. As a toy example for your code, without the time alignment:

def no_buffer(output, attenuate):
    waveforms = np.zeros_like(output)
    for i in xrange(len(output)):
        for j in xrange(len(output)):
            # allocates a fresh temporary for output[j, :] * attenuate[i, j]
            waveforms[i, :] += output[j, :] * attenuate[i, j]

    return waveforms

def with_buffer(output, attenuate):
    waveforms = np.zeros_like(output)
    buffer_arr = np.empty_like(output[0])
    for i in xrange(len(output)):
        for j in xrange(len(output)):
            # write the product into the preallocated buffer, then add in place
            np.multiply(output[j, :], attenuate[i, j], out=buffer_arr)
            np.add(waveforms[i, :], buffer_arr, out=waveforms[i, :])

    return waveforms

o = np.random.rand(20, int(1e6))
a = np.random.rand(20, 20)

In [17]: np.allclose(no_buffer(o, a), with_buffer(o, a))
Out[17]: True

In [18]: %timeit no_buffer(o, a)
1 loops, best of 3: 2.3 s per loop

In [19]: %timeit with_buffer(o, a)
1 loops, best of 3: 1.57 s per loop

Which I guess is better than nothing.

Of course, if you can get rid of the time alignment thing, your operation is simply a matrix multiplication, and it is best to let BLAS deal with that. On my system, using MKL:

In [21]: np.allclose(with_buffer(o, a), np.dot(o.T, a.T).T)
Out[21]: True

In [22]: %timeit np.dot(o.T, a.T).T
10 loops, best of 3: 123 ms per loop
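
As a side note, since waveforms[i] is just the sum over j of attenuate[i, j] * output[j], the transposes are not strictly needed: np.dot(o.T, a.T).T is mathematically the same as np.dot(a, o). A quick sketch to check this, reusing the o and a defined above:

# (A . O) equals (O^T . A^T)^T, so the transposes can be dropped
print(np.allclose(np.dot(o.T, a.T).T, np.dot(a, o)))   # True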

Solution 3:

Just for the sake of experimentation, suppose that the output from each transmitter is time aligned, and therefore no clocks are required. I came up with a version that uses broadcasting heavily and thus eliminates the for loops completely. However, it is 3x slower. Here is the code I wrote:

import numpy as np
import time

def calc(no_nodes):
    output = np.random.rand(no_nodes, int(7e5))    # some random data, 7e5 samples here
    attenuate = np.random.rand(no_nodes, no_nodes)  # some random data
    start_time = time.time()
    output_per_node = np.zeros((no_nodes, no_nodes, int(7e5)))
    output_per_node += output[None, :, :]
    data = attenuate[:, :, None] * output_per_node
    waveforms = np.sum(data, axis=1)
    end_time = time.time()
    print end_time - start_time
    return waveforms

The regular nested for-loop implementation would be:

def calc1(no_nodes):
    output = np.random.rand(no_nodes, int(7e5))
    attenuation = np.random.rand(no_nodes, no_nodes)
    waveforms = np.zeros((no_nodes, int(7e5)))
    start_time = time.time()
    for i in range(no_nodes):
        for j in range(no_nodes):
            waveforms[i] += output[j] * attenuation[i, j]
    print time.time() - start_time
    return waveforms

So what is really going on here? NumPy is supposed to be blazingly fast. I don't say it generally isn't, but in this particular example something is going wrong. Even if you convert both versions to Cython, the second one (with the for loops) is much faster than the broadcasting one. What do you believe I am doing wrong here? Note: try it with no_nodes=10.
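
One plausible explanation, sketched below under the same time-aligned assumption: the broadcasting version materializes two (no_nodes, no_nodes, n_samples) temporaries (output_per_node and data), which for no_nodes=10 and 7e5 float64 samples is on the order of 560 MB each, while the nested loop only ever touches a few rows at a time. The same sum over j can also be written as a plain matrix product with no large intermediate:

import numpy as np

no_nodes, n_samples = 10, int(7e5)
output = np.random.rand(no_nodes, n_samples)
attenuate = np.random.rand(no_nodes, no_nodes)

# broadcasting version: builds a (no_nodes, no_nodes, n_samples) temporary
data = attenuate[:, :, None] * output[None, :, :]
waveforms_bcast = data.sum(axis=1)

# same sum over j without the huge intermediate (BLAS matrix product)
waveforms_dot = np.dot(attenuate, output)

print(np.allclose(waveforms_bcast, waveforms_dot))   # True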

For anyone interested, you can find an IPython notebook with the above code, showing the difference in performance, in both ipynb and html form here:

Any feedback would be greatly appreciated.
