
Bug while parsing tf.VarLenFeature

Open · chrisrn opened this issue on Apr 19, 2018 · 6 comments

I am using variable-length data, so I need to parse it with tf.VarLenFeature. I migrated to TF 1.8 and still have the same problem: the training crashes and the process is killed before the epoch ends. Any ideas? I couldn't find a solution on Stack Overflow. It is probably a memory issue...
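The parsing is roughly along these lines, as a trimmed-down sketch (the feature names and file name here are placeholders, not my exact schema):

import tensorflow as tf

def parse_fn(serialized_example):
    # 'label' is the variable-length feature; the others are fixed-length.
    features = tf.parse_single_example(
        serialized_example,
        features={
            'label': tf.VarLenFeature(tf.int64),
            'image': tf.FixedLenFeature([], tf.string),
            'publisher': tf.FixedLenFeature([], tf.string),
        })
    label = tf.sparse_tensor_to_dense(features['label'])  # densify the sparse output
    image = tf.decode_raw(features['image'], tf.uint8)
    return label, image, features['publisher']

dataset = tf.data.TFRecordDataset(['train.tfrecords']).map(parse_fn)
ds_iterator = dataset.make_one_shot_iterator()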

chrisrn · Apr 19 '18 14:04

Can you clarify what the context is? Did you modify tf_cnn_benchmarks or some other benchmark to use variable length data?

reedwm · Apr 19 '18 17:04

The code I am using is based on the tf_cnn_benchmarks code, which is why I am asking here. I am using tf.data inside minibatch in exactly the same way. I have a memory leak (the TensorFlow training process is killed) and I still cannot understand why. The pipeline is: I parse the data with tf.VarLenFeature, then fetch further rows of the TFRecords in a tf.while_loop, because the batch size depends on the variable-length feature. I also tried tf.FixedLenSequenceFeature, but the script is still killed, so it probably has to do with feeding unknown-shape tensors into tf.while_loop. Have a look:

label, image, publisher = ds_iterator.get_next()
labels_splits = label
images_splits = tf.tile(image, [tf.shape(label)[0]])
images_splits = tf.reshape(images_splits, [tf.shape(label)[0], -1])
publishers_splits = publisher
batch_size_counter = tf.shape(label)[0]

def condition(batch_size_counter, *args):
    return tf.less(batch_size_counter, self.batch_size)

def body(batch_size_counter, labels_splits, publishers_splits, images_splits):
    label, image, publisher = ds_iterator.get_next()
    labels_splits = tf.concat([labels_splits, label], 0)
    publishers_splits = tf.concat([publishers_splits, publisher], 0)
    images_dupl = tf.tile(image, [tf.shape(label)[0]])
    images_dupl = tf.reshape(images_dupl, [tf.shape(label)[0], -1])
    images_splits = tf.concat([images_splits, images_dupl], 0)
    return (batch_size_counter + tf.shape(label)[0],
            labels_splits, publishers_splits, images_splits)

_, labels_splits, publishers_splits, images_splits = tf.while_loop(
    condition,
    body,
    [batch_size_counter, labels_splits, publishers_splits, images_splits],
    shape_invariants=[batch_size_counter.get_shape(),
                      tf.TensorShape([None]),
                      tf.TensorShape([None]),
                      tf.TensorShape([None, None])])

labels_splits = tf.slice(labels_splits, [0], [self.batch_size])
publishers_splits = tf.slice(publishers_splits, [0], [self.batch_size])
images_splits = tf.slice(images_splits, [0, 0], [self.batch_size, -1])
images_splits = tf.reshape(images_splits, [self.batch_size, feature_shape[0], feature_shape[1], feature_shape[2]])

images = tf.split(images_splits, self.num_splits)
labels = tf.split(labels_splits, self.num_splits)
publishers = tf.split(publishers_splits, self.num_splits)

I am using the above code instead of this:

for idx in xrange(self.batch_size):
    label, image, publisher = ds_iterator.get_next()
    split_index = idx % self.num_splits
    labels[split_index].append(label)
    images[split_index].append(image)
    publishers[split_index].append(publisher)

for split_index in xrange(self.num_splits):
    images[split_index] = tf.parallel_stack(images[split_index])
    labels[split_index] = tf.concat(labels[split_index], 0)
    publishers[split_index] = tf.concat(publishers[split_index], 0)
    images[split_index] = tf.cast(images[split_index], self.dtype)

chrisrn · Apr 20 '18 08:04

Unfortunately we do not have time to debug significant modifications to tf_cnn_benchmarks, so I'll mark this as contributions welcome. If you think there is a bug in TensorFlow itself, you can create a small example that reproduces it (without using tf_cnn_benchmarks) and file a bug in the TensorFlow repo.
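For example, something along these lines (an untested sketch, with made-up file and feature names) would be a small standalone starting point; if memory still grows while running it in a loop, that would point at TensorFlow rather than tf_cnn_benchmarks:

import tensorflow as tf

def parse_fn(serialized):
    features = tf.parse_single_example(
        serialized, {'label': tf.VarLenFeature(tf.int64)})
    return tf.sparse_tensor_to_dense(features['label'])

dataset = tf.data.TFRecordDataset(['data.tfrecords']).map(parse_fn).repeat()
iterator = dataset.make_one_shot_iterator()

batch_size = 256
label = iterator.get_next()

def cond(count, labels):
    return tf.less(count, batch_size)

def body(count, labels):
    nxt = iterator.get_next()  # same pattern as above: get_next() inside the loop body
    return count + tf.shape(nxt)[0], tf.concat([labels, nxt], 0)

_, labels = tf.while_loop(
    cond, body, [tf.shape(label)[0], label],
    shape_invariants=[tf.TensorShape([]), tf.TensorShape([None])])

with tf.Session() as sess:
    for step in range(100000):
        sess.run(labels)  # watch process memory while this runs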

reedwm · Apr 20 '18 17:04

@reedwm What's your roadmap for this repo? Could you share it?

anpark · Aug 20 '18 06:08

We do not have an explicit roadmap. tf_cnn_benchmarks is a sandbox allowing us to experiment with different performance strategies. Once we find useful performance strategies, we can later integrate them into TensorFlow to make them easy to use. We have overall performance goals, and tf_cnn_benchmarks serves as a tool helping us reach those goals.

reedwm · Aug 20 '18 18:08

Thanks. PyTorch 1.0 is about to be released with Caffe2 merged in for better performance, and more and more papers are being implemented in PyTorch. I hope TF can keep pushing further on performance, thanks!

anpark · Aug 21 '18 04:08