Why Are Lil_matrix And Dok_matrix So Slow Compared To Common Dict Of Dicts?

September 30, 2022 Post a Comment

I want to iteratively build sparse matrices, and noticed that there are two suitable options for this according to the SciPy documentation: LiL matrix: class scipy.sparse.lil_matr

Solution 1:

When I change your += to just = for your 2 sparse arrays:

for row, col in zip(rows, cols):
    #freqs[row,col] += 1
    freqs[row,col] = 1

their respective times are cut in half. What's consuming the most time is the indexing. With += it is has to do both a __getitem__ and a __setitem__.

When the docs say that dok and lil are better for iterative construction they mean that it's easier to expand their underlying data structures than for the other formats.

When I try to make a csr matrix with your code, I get a:

/usr/lib/python2.7/dist-packages/scipy/sparse/compressed.py:690: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient. SparseEfficiencyWarning)

and 30x slower speed.

So the speed claims are relative to formats like csr, not relative to pure Python or numpy structures.

You might want to look at the Python code for dok_matrix.__get_item__ and dok_matrix.__set_item__ to see what happens when you do freq[r,c].

A faster way to construct your dok would be:

freqs = dok_matrix((1000,1000))
d = dict()
for row, col in zip(rows, cols):
    d[(row, col)] = 1
freqs.update(d)

taking advantage of the fact that a dok is a subclassed dictionary. Note that dok matrix is not a dictionary of dictionaries. Its keys are tuples like (50,50).

Another fast way of constructing the same sparse array is:

freqs = sparse.coo_matrix((np.ones(1000,int),(rows,cols)))

In other words, since you already have the rows and cols arrays (or ranges), calculate the corresponding data array, and THEN construct the sparse array.

But if you must perform sparse operations on your matrix between incremental growth steps, then dok or lil may be your best choices.

Sparse matricies were developed for linear algebra problems, such as solving a linear equation with a large sparse matrix. I used them years ago in MATLAB to solve finite difference problems. For that work the calculation friendly csr format is the ultimate goal, and the coo format was a convenient initialization format.

Now many of the SO scipy sparse questions arise from scikit-learn and text analysis problems. They are also used in a biological database files. But still the (data),(row,col) definition method works best.

So sparse matrices were never intended for fast incremental creation. The traditional Python structures like dictionaries and lists are much better for that.

Here's a faster dok iteration that takes advantage of its dictionary methods. update seems to work as fast as on a plain dictionary. get is about 3x faster the equivalent indexing (freq[row,col]). Indexing probably uses get, but must have a lot of overhead.

def fast_dok(rows, cols):
    freqs = dok_matrix((1000,1000))
    for row, col in zip(rows,cols):
         i = freqs.get((row,col),0)
         freqs.update({(row,col):i+1})
    return freqs

Skipping the get, and just doing

         freqs.update({(row,col): 1)

is even faster - faster than the defaultdict of defaultdict example, and nearly as fast as simple dictionary initialization ({(r, c):1 for r,c in zip(rows, cols)})

Solution 2:

There are various reasons why your test is not fair. Firstly, you're including the overhead of constructing the sparse matrices as part of your timed loop.

Secondly, and arguably more importantly, you should use data structures as they are designed to be used, with operations on the whole array at once. That is, rather than iterating over the rows and columns and adding 1 each time, simply add 1 to the whole array.

Python Playground

Why Are Lil_matrix And Dok_matrix So Slow Compared To Common Dict Of Dicts?

Solution 1:

Solution 2:

Post a Comment for "Why Are Lil_matrix And Dok_matrix So Slow Compared To Common Dict Of Dicts?"