[ALBERT] Same TPU memory consumption as BERT
Although the layers are shared in ALBERT, I could not run ALBERT with a larger batch size than the one I ran successfully with BERT.
The only thing I can suspect is that the TPU/GPU consumes the same amount of memory regardless of layer sharing.
Is this expected?
Although we have a name_scope for each layer, tf.trainable_variables doesn't return variables with the per-layer name scope appended.
So, researching this, I came across tf.get_default_graph().get_operations(), which contains all the operations, including the ones under each layer's name scope.
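A minimal sketch of how this can be checked, assuming a TF1-style graph with one weight reused across layers (the shared_layer function below is hypothetical, not the actual ALBERT code):

```python
import tensorflow as tf  # TF1.x graph mode

def shared_layer(x):
    # The weight lives in a variable scope and is reused on every call.
    with tf.variable_scope("shared", reuse=tf.AUTO_REUSE):
        w = tf.get_variable("w", shape=[3, 3])
    with tf.name_scope("layer"):  # name_scope only renames the ops, not the variable
        return tf.matmul(x, w)

x = tf.placeholder(tf.float32, [None, 3])
h = x
for _ in range(4):
    h = shared_layer(h)

# Only one trainable variable, with no per-layer scope in its name:
print([v.name for v in tf.trainable_variables()])
# e.g. ['shared/w:0']

# But get_operations() shows a separate MatMul node for every layer:
print([op.name for op in tf.get_default_graph().get_operations()
       if op.type == "MatMul"])
# e.g. ['layer/MatMul', 'layer_1/MatMul', 'layer_2/MatMul', 'layer_3/MatMul']
```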
It seems that when running a graph on a device (TPU/GPU), TensorFlow uploads all the nodes of the graph regardless of variable sharing. For example,
```python
import numpy as np

a = np.array([1, 2, 3])
for _ in range(n):          # n repetitions of the same update
    a += np.array([1, 2, 3])
```
we normally expect that it only allocates memory for a, but on a TPU/GPU the graph holds versions of the tensor a: a_1, a_2, a_3, ..., a_n. Explaining this phenomenon with my tiny bit of knowledge about GPUs: a GPU computes by warps, which execute asynchronously, so some elements of a_1 may not have been computed yet while another warp has already processed up to a_6. This creates a RAW (Read After Write) dependency: while some elements may have been computed up to a_3[1], some postponed computation could leave a_1[0] not yet executed.
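The same effect is visible in a TF graph directly: unrolling the loop above in graph mode creates a separate node and output tensor for every iteration (a small sketch under that assumption, not ALBERT itself):

```python
import tensorflow as tf  # TF1.x graph mode

a = tf.constant([1.0, 2.0, 3.0])
for _ in range(4):
    # Each iteration adds a new Add node (a new "version" of a) to the graph.
    a = a + tf.constant([1.0, 2.0, 3.0])

print([op.name for op in tf.get_default_graph().get_operations()
       if op.type in ("Add", "AddV2")])
# e.g. ['add', 'add_1', 'add_2', 'add_3']
```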
This is why the TPU allocates all variables regardless of variable sharing. Since the computation order is not fixed, it needs to allocate all versions of the variables on the device to avoid RAW hazards.
In conclusion, although we share the weights across layers, we still need the same amount of memory for computation.
Am I right?