Understanding Device Allocation, Parallelism (tf.while_loop) And tf.function In TensorFlow
Solution 1:
First, notice that your tensor_scatter_nd_update is only incrementing a single index, therefore you could only be measuring the overhead of the loop itself.
I modified your code to use a much larger batch size. Running in Colab on a GPU, I needed batch = 10000 to hide the loop latency; anything below that measures (or pays for) the latency overhead.
Also, the question is whether var.assign(tensor_scatter_nd_update(...)) actually prevents the extra copy made by tensor_scatter_nd_update. Playing with the batch size shows that we are indeed not paying for extra copies, so the extra copy does seem to be avoided.
However, it turns out that in this case TensorFlow apparently considers the iterations to be dependent on each other, therefore it makes no difference (at least in my test) if you increase the parallel iterations of the loop. See this issue for further discussion of what TF does: https://github.com/tensorflow/tensorflow/issues/1984
It only runs operations in parallel if they are independent of each other.
BTW, an arbitrary scatter op isn't going to be very efficient on a GPU, but you should still be able to perform several of them in parallel if TF considers them independent (see the sketch after the timing code below).
import tensorflow as tf
from datetime import datetime

size = 1000000
index_count = size
batch = 10000
iterations = 10

with tf.device('/device:GPU:0'):
    var = tf.Variable(tf.ones([size], dtype=tf.dtypes.float32), dtype=tf.dtypes.float32)
    indexes = tf.Variable(tf.range(index_count, dtype=tf.dtypes.int32), dtype=tf.dtypes.int32)
    var2 = tf.Variable(tf.range(index_count, dtype=tf.dtypes.float32), dtype=tf.dtypes.float32)

@tf.function
def foo():
    return tf.while_loop(c, b, [i], parallel_iterations=iterations)  # tweak

@tf.function
def b(i):
    # Scatter one batch of updates into var in place, then advance the loop counter.
    var.assign(tf.tensor_scatter_nd_update(var, tf.reshape(indexes, [-1, 1])[i:i + batch], var2[i:i + batch]))
    return tf.add(i, batch)

with tf.device('/device:GPU:0'):
    i = tf.constant(0)
    c = lambda i: tf.less(i, index_count)

start = datetime.today()
with tf.device('/device:GPU:0'):
    foo()
print(datetime.today() - start)
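To illustrate the independence point above, here is a minimal sketch (the names and sizes are my own, not from the original code) that scatters into two separate variables inside one tf.function. Because the two assigns have no data dependency on each other, the runtime is free to schedule them concurrently; whether they actually overlap depends on the GPU stream scheduling.

import tensorflow as tf

size = 1000000

with tf.device('/device:GPU:0'):
    var_a = tf.Variable(tf.zeros([size], dtype=tf.float32))
    var_b = tf.Variable(tf.zeros([size], dtype=tf.float32))
    updates = tf.range(size, dtype=tf.float32)
    indices = tf.reshape(tf.range(size, dtype=tf.int32), [-1, 1])

@tf.function
def independent_scatters():
    # These two assigns touch different variables and share no inputs,
    # so TF may execute them in parallel.
    var_a.assign(tf.tensor_scatter_nd_update(var_a, indices, updates))
    var_b.assign(tf.tensor_scatter_nd_update(var_b, indices, updates))

independent_scatters()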
Solution 2:
One technique is to use a distribution strategy and scope:
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    inputs = tf.keras.layers.Input(shape=(1,))
    predictions = tf.keras.layers.Dense(1)(inputs)
    model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
    model.compile(loss='mse',
                  optimizer=tf.keras.optimizers.SGD(learning_rate=0.2))
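For completeness, a quick way to exercise the mirrored model is to fit it on some dummy data (the data below is made up for illustration); the strategy replicates the computation across the visible GPUs and aggregates the gradients for you.

import numpy as np

x = np.random.rand(64, 1).astype('float32')
y = 3.0 * x + 2.0  # simple linear target
model.fit(x, y, batch_size=16, epochs=2)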
Another option is to duplicate the operations on each device:
# Replicate your computation on multiple GPUs
c = []
for d in ['/device:GPU:2', '/device:GPU:3']:
    with tf.device(d):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
        c.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
    sum = tf.add_n(c)
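Note that '/device:GPU:2' and '/device:GPU:3' assume a machine with at least four GPUs. A more portable variant (my own adaptation, not part of the original answer) builds the device list from whatever is actually visible and falls back to the CPU:

import tensorflow as tf

gpus = tf.config.list_logical_devices('GPU')
devices = [g.name for g in gpus] or ['/cpu:0']  # fall back to CPU if no GPU is present

results = []
for d in devices:
    with tf.device(d):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
        results.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
    total = tf.add_n(results)
print(total)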
See this guide for more details.