
Possibility to subsample when loading the binary?

Open ReHoss opened this issue 2 years ago • 3 comments

Hello,

Is it possible to subsample the event file while loading it? What do you recommend if we don't have enough RAM to read the event file, for example in a Jupyter notebook?

I know that TensorBoard uses a subsampling strategy.

Thanks for your consideration.

ReHoss avatar Mar 10 '23 18:03 ReHoss

Hi,

I would like to know more details about your use case. What event types are you loading, and how large is the event file? Does your use case require iterating through all events, or does it only need to process certain filtered events?

tbparse is designed to load all events directly into the system memory, and currently does not support subsampling. However, it may be possible to add a feature for pre-filtering the events in the future, given valid use cases.

If you simply want to iterate through the events, maybe you can try the raw method provided by TensorBoard/TensorFlow, as documented here.
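For instance, here is a minimal sketch (not tbparse code) of streaming events one at a time with TensorFlow's compat summary iterator, so the whole file never has to fit in RAM. The file path and the "loss" tag are hypothetical:

```python
import tensorflow as tf

# Hypothetical path to a single event file.
event_file = "runs/exp1/events.out.tfevents.1234567890.hostname"

# summary_iterator yields Event protos one at a time.
for event in tf.compat.v1.train.summary_iterator(event_file):
    for value in event.summary.value:
        # Keep only old-style scalar summaries for a given tag, for example.
        if value.HasField("simple_value") and value.tag == "loss":
            print(event.step, value.simple_value)
```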

j3soon avatar Mar 14 '23 17:03 j3soon

From: https://github.com/tensorflow/tensorboard/blob/master/README.md

Is my data being downsampled? Am I really seeing all the data?

TensorBoard uses reservoir sampling to downsample your data so that it can be loaded into RAM. You can modify the number of elements it will keep per tag by using the --samples_per_plugin command line argument (ex: --samples_per_plugin=scalars=500,images=20). See this Stack Overflow question for some more information.

And according to the help command:

--samples_per_plugin: An optional comma separated list of plugin_name=num_samples pairs to explicitly specify how many samples to keep per tag for that plugin. For unspecified plugins, TensorBoard randomly downsamples logged summaries to reasonable values to prevent out-of-memory errors for long running jobs. This flag allows fine control over that downsampling. Note that 0 means keep all samples of that type. For instance, "scalars=500,images=0" keeps 500 scalars and all images. Most users should not need to set this flag. (default: '')

For instance, the asker in the Stack Overflow thread trains over 20M steps.

I train for 1e6 steps but run 100 experiments. If I log the training score at full resolution, I end up with an extremely large DataFrame.

It would be nice to have an option to downsample either randomly (with a seed interface, then) or evenly. Ideally, for n training curves, the same time steps would be kept across all of them, as in the sketch below.
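To illustrate, here is a rough sketch of both options on an already-loaded long-format DataFrame; the `step` column name and the default of 500 kept steps are just assumptions:

```python
import numpy as np
import pandas as pd

def downsample_evenly(df: pd.DataFrame, n: int = 500) -> pd.DataFrame:
    """Keep roughly n evenly spaced steps, shared by every run in df."""
    steps = np.sort(df["step"].unique())
    idx = np.linspace(0, len(steps) - 1, num=min(n, len(steps)), dtype=int)
    return df[df["step"].isin(steps[idx])]

def downsample_randomly(df: pd.DataFrame, n: int = 500, seed: int = 0) -> pd.DataFrame:
    """Keep n randomly chosen steps (seeded), again shared across runs."""
    rng = np.random.default_rng(seed)
    steps = df["step"].unique()
    keep = rng.choice(steps, size=min(n, len(steps)), replace=False)
    return df[df["step"].isin(keep)]
```

Because the kept steps are computed once over the whole DataFrame, all training curves end up sampled at the same time steps.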

Thank you for your consideration, Best,

ReHoss avatar Mar 14 '23 19:03 ReHoss

Thanks for providing the detailed information. I think reservoir sampling is a useful feature and won't be too hard to implement. However, I'm not sure if we can manually set the RNG seed...

This feature may be implemented by modifying the code here. I'll see if I can add this feature in my free time.
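For reference, reservoir sampling itself is easy to seed when the sampler owns its RNG; a generic sketch (not tbparse code, just Algorithm R with an explicit seed):

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")

def reservoir_sample(items: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)          # explicit seed -> reproducible sample
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            reservoir.append(item)     # fill the reservoir first
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                reservoir[j] = item    # replace with decreasing probability
    return reservoir
```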

Meanwhile, I suggest loading each experiment individually and downsampling it yourself. You can get deterministic results by stacking the downsampled experiments.
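A rough sketch of that workaround, assuming one sub-directory per experiment under a hypothetical `runs/` folder, using `SummaryReader.scalars` together with a deterministic step filter (the every-1000th-step interval is just an example):

```python
from pathlib import Path
import pandas as pd
from tbparse import SummaryReader

frames = []
for run_dir in sorted(Path("runs").iterdir()):     # one sub-directory per experiment
    df = SummaryReader(str(run_dir)).scalars        # long format: step, tag, value
    df = df[df["step"] % 1000 == 0]                 # keep every 1000th step
    df["run"] = run_dir.name                        # remember which experiment it came from
    frames.append(df)

all_runs = pd.concat(frames, ignore_index=True)     # stack the downsampled experiments
```

Since the kept steps are chosen deterministically, stacking the runs gives the same result on every re-run and the same time steps for every curve.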

j3soon avatar Mar 19 '23 14:03 j3soon