Understanding Device Allocation, Parallelism (tf.while_loop) And tf.function In TensorFlow
Solution 1:
First, notice that your tensor_scatter_nd_update is only incrementing a single index, therefore you could only be measuring the overhead of the loop itself.
I modified your code to use a much larger batch size. Running in Colab on a GPU, I needed batch = 10000 to hide the loop latency; anything below that measures (or pays for) the latency overhead.
Also, the question is whether var.assign(tensor_scatter_nd_update(...)) actually prevents the extra copy made by tensor_scatter_nd_update. Playing with the batch size shows that we are indeed not paying for extra copies, so the extra copy does seem to be avoided.
However, it turns out that in this case TensorFlow apparently considers the iterations to be dependent on each other, therefore it makes no difference (at least in my test) if you increase the parallel iterations of the loop. See this issue for further discussion of what TF does: https://github.com/tensorflow/tensorflow/issues/1984
It only runs operations in parallel if they are independent of each other.
BTW, an arbitrary scatter op isn't going to be very efficient on a GPU, but you should still be able to perform several of them in parallel if TF considers them independent (see the sketch after the timing code below).
import tensorflow as tf
from datetime import datetime

size = 1000000
index_count = size
batch = 10000
iterations = 10

with tf.device('/device:GPU:0'):
    var = tf.Variable(tf.ones([size], dtype=tf.dtypes.float32), dtype=tf.dtypes.float32)
    indexes = tf.Variable(tf.range(index_count, dtype=tf.dtypes.int32), dtype=tf.dtypes.int32)
    var2 = tf.Variable(tf.range(index_count, dtype=tf.dtypes.float32), dtype=tf.dtypes.float32)

@tf.function
def foo():
    return tf.while_loop(c, b, [i], parallel_iterations=iterations)  # tweak

@tf.function
def b(i):
    # Scatter one batch of updates into var in place, then advance the loop counter.
    var.assign(tf.tensor_scatter_nd_update(var, tf.reshape(indexes, [-1, 1])[i:i + batch], var2[i:i + batch]))
    return tf.add(i, batch)

with tf.device('/device:GPU:0'):
    i = tf.constant(0)
    c = lambda i: tf.less(i, index_count)

start = datetime.today()
with tf.device('/device:GPU:0'):
    foo()
print(datetime.today() - start)
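To illustrate the independence point above, here is a minimal sketch (the names and sizes are my own, not from the original code) that scatters into two separate variables inside one tf.function. Because the two assigns have no data dependency on each other, the runtime is free to schedule them concurrently; whether they actually overlap depends on the GPU stream scheduling.

import tensorflow as tf

size = 1000000

with tf.device('/device:GPU:0'):
    var_a = tf.Variable(tf.zeros([size], dtype=tf.float32))
    var_b = tf.Variable(tf.zeros([size], dtype=tf.float32))
    updates = tf.range(size, dtype=tf.float32)
    indices = tf.reshape(tf.range(size, dtype=tf.int32), [-1, 1])

@tf.function
def independent_scatters():
    # These two assigns touch different variables and share no inputs,
    # so TF may execute them in parallel.
    var_a.assign(tf.tensor_scatter_nd_update(var_a, indices, updates))
    var_b.assign(tf.tensor_scatter_nd_update(var_b, indices, updates))

independent_scatters()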
Solution 2:
One technique is to use a distribution strategy and scope:
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    inputs = tf.keras.layers.Input(shape=(1,))
    predictions = tf.keras.layers.Dense(1)(inputs)
    model = tf.keras.models.Model(inputs=inputs, outputs=predictions)
    model.compile(loss='mse',
                  optimizer=tf.keras.optimizers.SGD(learning_rate=0.2))
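For completeness, a quick way to exercise the mirrored model is to fit it on some dummy data (the data below is made up for illustration); the strategy replicates the computation across the visible GPUs and aggregates the gradients for you.

import numpy as np

x = np.random.rand(64, 1).astype('float32')
y = 3.0 * x + 2.0  # simple linear target
model.fit(x, y, batch_size=16, epochs=2)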
Another option is to duplicate the operations on each device:
# Replicate your computation on multiple GPUs
c = []
for d in ['/device:GPU:2', '/device:GPU:3']:
    with tf.device(d):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
        c.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
    sum = tf.add_n(c)
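Note that '/device:GPU:2' and '/device:GPU:3' assume a machine with at least four GPUs. A more portable variant (my own adaptation, not part of the original answer) builds the device list from whatever is actually visible and falls back to the CPU:

import tensorflow as tf

gpus = tf.config.list_logical_devices('GPU')
devices = [g.name for g in gpus] or ['/cpu:0']  # fall back to CPU if no GPU is present

results = []
for d in devices:
    with tf.device(d):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
        results.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
    total = tf.add_n(results)
print(total)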
See this guide for more details.