
Multiply Two Numpy Matrices In Pyspark

Let's say I have these two NumPy arrays:

A = np.arange(1024 ** 2, dtype=np.float64).reshape(1024, 1024)
B = np.arange(1024 ** 2, dtype=np.float64).reshape(1024, 1024)

How can I multiply them as distributed matrices in PySpark?

Solution 1:

Using the as_block_matrix method from this post, you could do the following (but see @kennytm's comment on why this method can be slow for bigger matrices):

import numpy as np
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

A = np.arange(1024 ** 2, dtype=np.float64).reshape(1024, 1024)
B = np.arange(1024 ** 2, dtype=np.float64).reshape(1024, 1024)

def as_block_matrix(rdd, rowsPerBlock=1024, colsPerBlock=1024):
    # Pair each row with its index, wrap the pairs as IndexedRows,
    # then convert the IndexedRowMatrix into a BlockMatrix.
    return IndexedRowMatrix(
        rdd.zipWithIndex().map(lambda xi: IndexedRow(xi[1], xi[0]))
    ).toBlockMatrix(rowsPerBlock, colsPerBlock)

# `sc` is the SparkContext (created automatically in the pyspark shell).
matrixA = as_block_matrix(sc.parallelize(A))
matrixB = as_block_matrix(sc.parallelize(B))
product = matrixA.multiply(matrixB)
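Conceptually, BlockMatrix.multiply splits each matrix into tiles and sums products of matching tiles, which is why block sizes matter for performance. The following is a Spark-free sketch of that blocked scheme in plain NumPy (block_multiply is a hypothetical helper for illustration, not part of the PySpark API), checked against the direct product on small inputs:

```python
import numpy as np

def block_multiply(A, B, block=2):
    # Split A and B into block x block tiles and accumulate tile
    # products, mirroring what a distributed block-matrix multiply
    # does across partitions.
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m), dtype=np.float64)
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                C[i:i+block, j:j+block] += (
                    A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
                )
    return C

A_small = np.arange(16, dtype=np.float64).reshape(4, 4)
B_small = np.arange(16, dtype=np.float64).reshape(4, 4)
assert np.allclose(block_multiply(A_small, B_small), A_small @ B_small)
```

To bring the distributed result back to the driver, BlockMatrix.toLocalMatrix() can be used, though that is only feasible when the product fits in driver memory.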
