
Spark: How To "reducebykey" When The Keys Are Numpy Arrays Which Are Not Hashable?

I have an RDD of (key, value) elements. The keys are NumPy arrays. NumPy arrays are not hashable, and this causes a problem when I try to do a reduceByKey operation. Is there a way to supply the Spark context with my manual hash function?

Solution 1:

The simplest solution is to convert the key to an object that is hashable. For example:

from operator import add

reduced = sc.parallelize(data).map(
    # tuple(x) is an immutable, hashable key; x.sum() is the value to combine
    lambda x: (tuple(x), x.sum())
).reduceByKey(add)

and convert it back later if needed.
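For completeness, a minimal sketch of that conversion back, assuming you want NumPy arrays as keys again after the reduction (variable names `reduced` and `restored` are just illustrative):

import numpy as np

# Rebuild NumPy array keys from the tuple keys once the shuffle is done.
restored = reduced.map(lambda kv: (np.array(kv[0]), kv[1]))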

"Is there a way to supply the Spark context with my manual hash function?"

Not a straightforward one. The whole mechanism depends on the fact that an object implements a __hash__ method, and C extensions cannot be monkey-patched. You could try to use dispatching to override pyspark.rdd.portable_hash, but I doubt it is worth it even when you consider the cost of the conversions.
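If you really want to keep the arrays themselves as keys, one workaround (not part of the original answer, just a sketch) is a thin wrapper object that supplies __hash__ and __eq__ itself, since you cannot add them to numpy.ndarray. The class name HashableArray below is hypothetical:

import numpy as np

class HashableArray:
    """Hashable, immutable wrapper around a NumPy array for use as an RDD key."""

    def __init__(self, array):
        # Store an immutable byte copy plus metadata so the hash stays stable.
        self._data = array.tobytes()
        self._shape = array.shape
        self._dtype = array.dtype

    def __hash__(self):
        return hash(self._data)

    def __eq__(self, other):
        return (isinstance(other, HashableArray)
                and self._dtype == other._dtype
                and self._shape == other._shape
                and self._data == other._data)

    def unwrap(self):
        # Rebuild the original array (copy() makes the result writable).
        return np.frombuffer(self._data, dtype=self._dtype).reshape(self._shape).copy()

You would then map keys into the wrapper before reduceByKey and call unwrap() afterwards, e.g. rdd.map(lambda kv: (HashableArray(kv[0]), kv[1])).reduceByKey(add). Whether this beats the simple tuple conversion depends on the size and dtype of your keys.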
