[ALBERT] Same TPU memory consumption as BERT
Although the layers are shared in ALBERT, I could not run ALBERT with a larger batch size than the one I ran successfully with BERT.
The only thing I can suspect is that the TPU/GPU consumes the same amount of memory regardless of layer sharing.
Is this expected?
Although we have a name_scope for each layer, tf.trainable_variables doesn't return variables with the per-layer name scope appended.
So, researching this, I came across tf.get_default_graph().get_operations(), which contains all the operations, including the ones under each layer's name scope.
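A minimal sketch of how this can be checked, assuming a TF1-style graph with one weight reused across layers (the shared_layer function below is hypothetical, not the actual ALBERT code):

```python
import tensorflow as tf  # TF1.x graph mode

def shared_layer(x):
    # The weight lives in a variable scope and is reused on every call.
    with tf.variable_scope("shared", reuse=tf.AUTO_REUSE):
        w = tf.get_variable("w", shape=[3, 3])
    with tf.name_scope("layer"):  # name_scope only renames the ops, not the variable
        return tf.matmul(x, w)

x = tf.placeholder(tf.float32, [None, 3])
h = x
for _ in range(4):
    h = shared_layer(h)

# Only one trainable variable, with no per-layer scope in its name:
print([v.name for v in tf.trainable_variables()])
# e.g. ['shared/w:0']

# But get_operations() shows a separate MatMul node for every layer:
print([op.name for op in tf.get_default_graph().get_operations()
       if op.type == "MatMul"])
# e.g. ['layer/MatMul', 'layer_1/MatMul', 'layer_2/MatMul', 'layer_3/MatMul']
```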
It seems that when running a graph on a device (TPU/GPU), TensorFlow uploads all the nodes of the graph regardless of variable sharing. For example,
```python
import numpy as np

a = np.array([1, 2, 3])
for _ in range(n):          # n repetitions of the same update
    a += np.array([1, 2, 3])
```
we normally expect that it only allocates memory for a, but on a TPU/GPU the graph holds versions of the tensor a: a_1, a_2, a_3, ..., a_n. Explaining this phenomenon with my tiny bit of knowledge about GPUs: a GPU computes by warps, which execute asynchronously, so some elements of a_1 may not have been computed yet while another warp has already processed up to a_6. This creates a RAW (Read After Write) dependency: while some elements may have been computed up to a_3[1], some postponed computation could leave a_1[0] not yet executed.
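The same effect is visible in a TF graph directly: unrolling the loop above in graph mode creates a separate node and output tensor for every iteration (a small sketch under that assumption, not ALBERT itself):

```python
import tensorflow as tf  # TF1.x graph mode

a = tf.constant([1.0, 2.0, 3.0])
for _ in range(4):
    # Each iteration adds a new Add node (a new "version" of a) to the graph.
    a = a + tf.constant([1.0, 2.0, 3.0])

print([op.name for op in tf.get_default_graph().get_operations()
       if op.type in ("Add", "AddV2")])
# e.g. ['add', 'add_1', 'add_2', 'add_3']
```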
This is why the TPU allocates all variables regardless of variable sharing. Since the computation order is not fixed, it needs to allocate all versions of the variables on the device to avoid RAW hazards.
In conclusion, although we share the weights across layers, we still need the same amount of memory for computation.
Am I right?