Track puts, gets, buffersize, buffer statistics, put-get metric, and GPU bandwidth with tensorboard logger (!89) · Merge requests · melissa / Melissa

CAULK Robert requested to merge add-put-get-metric into develop Feb 15, 2023

This MR aims to efficiently track important buffer-related metrics with the tensorboard logger.

The logger always logs these metrics. To post-process them in a dataframe, users only need to set:

"convert_log_to_df": true

in their dl_config, which writes the full logger dataframe to the DEBUG/tensorboard folder.

The MR also adds the option for users to track buffer statistics with get_buffer_statistics: true in the dl_config. We activate the feature by default in the heat-pde example.
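As a minimal sketch of the two options described above, the snippet below builds a config fragment and serializes it to JSON. Only the key names convert_log_to_df and get_buffer_statistics and the dl_config section name come from this MR; the surrounding layout is an assumption and a real Melissa config contains many other fields.

```python
import json

# Hypothetical config fragment: only the two keys below come from this MR.
# The rest of a real Melissa config file is omitted here.
config = {
    "dl_config": {
        "convert_log_to_df": True,      # write the logger dataframe to DEBUG/tensorboard
        "get_buffer_statistics": True,  # log buffer_mean/{param} and buffer_std/{param}
    }
}

# Serialize to JSON, where Python's True becomes the lowercase `true`
# shown in the MR description.
config_text = json.dumps(config, indent=4)
print(config_text)
```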

The MR also adds a put_get_metric to the tensorboard logging: each put adds 1 and each get subtracts 1. This value should remain near 0 if training consumes samples at the same rate as the clients generate them.
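The put/get balance can be illustrated with a small counter. This is a sketch of the idea only, not the code added by this MR; the class name PutGetCounter and its methods are hypothetical.

```python
class PutGetCounter:
    """Illustrative running balance of puts and gets (hypothetical helper)."""

    def __init__(self):
        self.value = 0

    def on_put(self):
        # A sample was written to the buffer.
        self.value += 1
        return self.value

    def on_get(self):
        # A sample was read from the buffer.
        self.value -= 1
        return self.value


counter = PutGetCounter()
history = []
for event in ["put", "put", "get", "put", "get", "get"]:
    if event == "put":
        history.append(counter.on_put())
    else:
        history.append(counter.on_get())

print(history)
```

A value that drifts upward indicates that puts outpace gets, and a value that drifts downward indicates the reverse.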

This MR also adds a batches_per_second metric to the tensorboard logging. It is derived from the time taken to train through n_batch_epoch_equivalent batches in the heat-pde server.
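One way to compute such a rate is to time a fixed window of batches and divide the window size by the elapsed time. The sketch below assumes that approach; BatchRateTracker and its injectable clock are hypothetical and are not taken from the heat-pde server.

```python
import time


class BatchRateTracker:
    """Reports batches per second once every `n_batches` batches (hypothetical helper)."""

    def __init__(self, n_batches, clock=time.perf_counter):
        self.n_batches = n_batches  # plays the role of n_batch_epoch_equivalent
        self.clock = clock
        self.count = 0
        self.window_start = clock()

    def step(self):
        """Call once per trained batch; returns a rate when a window closes, else None."""
        self.count += 1
        if self.count < self.n_batches:
            return None
        now = self.clock()
        elapsed = now - self.window_start
        rate = self.n_batches / elapsed if elapsed > 0 else float("inf")
        # Start the next window.
        self.count = 0
        self.window_start = now
        return rate


# Deterministic demonstration with a fake clock: each window of 4 batches takes 2 s.
ticks = iter([0.0, 2.0, 4.0])
tracker = BatchRateTracker(4, clock=lambda: next(ticks))
rates = [tracker.step() for _ in range(8)]
print(rates)
```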

Monitoring documentation is improved to help users understand these metrics, how to access the metrics, and how to plot them. Descriptions of metrics follow:

batches_per_second: Average number of batches trained per second (logged once every n_batch_epoch_equivalent batches).
buffer_size: Size of the buffer at a given time (logged on each sample get).
put_time: Time spent to put each sample into the buffer.
get_time: Time spent to get each sample from the buffer.
put_get_inc: Metric aimed at showing balance of puts and gets (puts add 1, gets subtract 1).
buffer_std/{param}: The standard deviation of {param} in the buffer (only active if get_buffer_statistics is set to true in dl_config). This value also requires customization for custom parameters (see examples/heat-pde/heat-pde-dl/heatpde_dl_server.py for an example).
buffer_mean/{param}: The mean of {param} in the buffer (only active if get_buffer_statistics is set to true in dl_config). This value also requires customization for custom parameters (see examples/heat-pde/heat-pde-dl/heatpde_dl_server.py for an example).
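The per-parameter buffer statistics can be sketched as follows. This is an illustration under stated assumptions, not the implementation in heatpde_dl_server.py: the function buffer_statistics, the parameter names, and the choice of population standard deviation are all hypothetical. Only the buffer_mean/{param} and buffer_std/{param} tag names come from the descriptions above.

```python
import statistics


def buffer_statistics(samples, param_names):
    """Return {tag: value} for the mean and std of each parameter.

    `samples` is a list of parameter tuples currently held in the buffer.
    Population standard deviation is an assumption made for this sketch.
    """
    if not samples:
        return {}
    stats = {}
    for i, name in enumerate(param_names):
        values = [sample[i] for sample in samples]
        stats[f"buffer_mean/{name}"] = statistics.fmean(values)
        stats[f"buffer_std/{name}"] = statistics.pstdev(values)
    return stats


# Two buffered samples with two hypothetical parameters each.
stats = buffer_statistics([(1.0, 10.0), (3.0, 10.0)], ["param_a", "param_b"])
for tag, value in stats.items():
    print(tag, value)
```

Each tag in the returned dictionary could then be passed to the logger as a separate scalar.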

Edited Feb 16, 2023 by CAULK Robert
