
PySpark: Transformer That Generates A Random Number Always Generates The Same Number

I am trying to measure the performance impact of having to copy a dataframe from Scala to Python and back in a large pipeline. For that purpose I have created this rather artificial…

Solution 1:

Well, expected is rather relative here, but it is not something that cannot be explained. In particular, the state of the RNG is inherited from the parent process. You can easily prove that by running the following simple snippet in local mode:

import random

def roll_and_get_state(*args):
    # Advance the RNG once, then report its resulting internal state.
    random.random()
    return [random.getstate()]

states = sc.parallelize([], 10).mapPartitions(roll_and_get_state).collect()
len(set(states))
## 1

As you can see, each partition uses its own RNG, but all of them have the same state.

In general, ensuring correct Python RNG behavior in Spark without a serious performance penalty is rather tricky, especially if you need reproducible results.

One possible approach is to instantiate a separate Random instance per partition, with a seed generated from cryptographically safe random data (os.urandom). This is sketched below.
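
A minimal sketch of that approach, assuming an RDD pipeline; the function name and the 16-byte seed width are illustrative, not from the original post:

import os
import random

def add_random_column(iterator):
    # Seed a dedicated Random instance from cryptographically safe
    # bytes, so each partition (and each run) gets an independent
    # stream instead of the state inherited from the parent process.
    rng = random.Random(os.urandom(16))
    for row in iterator:
        yield (row, rng.random())

rdd = sc.parallelize(range(8), 4).mapPartitions(add_random_column)

Note that this sacrifices reproducibility, since every run draws fresh seeds.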

If you need reproducible results, you can generate RNG seeds based on a global state and partition data. Unfortunately, this information is not easily accessible at runtime from Python (ignoring special cases like mapPartitionsWithIndex).
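
For the reproducible variant, here is a sketch using mapPartitionsWithIndex; GLOBAL_SEED and the way it is combined with the partition index are illustrative assumptions:

import random

GLOBAL_SEED = 42  # hypothetical application-level seed

def add_random_column(index, iterator):
    # Derive a deterministic per-partition seed from the global seed
    # and the partition index; results are stable across runs as long
    # as the partitioning itself does not change.
    rng = random.Random(GLOBAL_SEED * (1 << 32) + index)
    for row in iterator:
        yield (row, rng.random())

rdd = sc.parallelize(range(8), 4).mapPartitionsWithIndex(add_random_column)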

Since partition-level operations are not always applicable (for example in the case of UDFs), you can achieve a similar result by using a singleton module or the Borg pattern to initialize the RNG for each executor; see the sketch below.
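
A sketch of the singleton-module idea, assuming the module is shipped to the executors (for example via --py-files); the module and function names are hypothetical:

# rng_singleton.py
import os
import random

_rng = None

def get_rng():
    # Created at most once per Python worker process, so every UDF
    # call running in that worker reuses the same, properly seeded RNG.
    global _rng
    if _rng is None:
        _rng = random.Random(os.urandom(16))
    return _rng

A UDF can then draw from it (df is assumed to be an existing DataFrame):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from rng_singleton import get_rng

random_udf = udf(lambda: get_rng().random(), DoubleType())
df = df.withColumn("rnd", random_udf())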

