Track puts, gets, buffer size, buffer statistics, put-get metric, and GPU bandwidth with the tensorboard logger
This MR aims to efficiently track important buffer-related metrics with the tensorboard logger.
The logger always logs these metrics. Users who want to post-process them in a dataframe only need to set `"convert_log_to_df": true` in their `dl_config`, which writes the full logger dataframe to the `DEBUG/tensorboard` folder.
The MR also adds the option for users to track buffer statistics by setting `get_buffer_statistics: true` in the `dl_config`. We activate the feature by default in the heat-pde example.
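As a rough illustration of the two options described above, the sketch below builds a configuration and serializes it to JSON. It assumes the configuration is a JSON file with a `dl_config` section; the surrounding structure is hypothetical, and only the two flag names come from this MR.

```python
import json

# Hypothetical starting configuration; only the two flags below come from this MR.
config = {"dl_config": {}}

# Export the full tensorboard log as a dataframe in the DEBUG/tensorboard folder.
config["dl_config"]["convert_log_to_df"] = True
# Log buffer_mean/{param} and buffer_std/{param}.
config["dl_config"]["get_buffer_statistics"] = True

config_text = json.dumps(config, indent=4)
print(config_text)
```

A real configuration contains many other keys, which are left out here.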
The MR also adds a `put_get_metric` to the tensorboard logging in which each `put` adds 1 and each `get` subtracts 1. This value should remain near 0 if training consumes samples at roughly the same rate as the clients generate them.
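The behaviour of this metric can be illustrated with a toy simulation. This is only a sketch of the counting rule described above, not the code added by the MR; `PutGetCounter` and `simulate` are hypothetical names.

```python
class PutGetCounter:
    """Toy running balance: +1 per put, -1 per get (hypothetical helper)."""

    def __init__(self):
        self.value = 0

    def on_put(self):
        self.value += 1
        return self.value

    def on_get(self):
        self.value -= 1
        return self.value


def simulate(puts_per_step, gets_per_step, steps):
    """Return the counter value after each step for fixed put/get rates."""
    counter = PutGetCounter()
    history = []
    for _ in range(steps):
        for _ in range(puts_per_step):
            counter.on_put()
        for _ in range(gets_per_step):
            counter.on_get()
        history.append(counter.value)
    return history


balanced = simulate(puts_per_step=2, gets_per_step=2, steps=5)
producer_faster = simulate(puts_per_step=3, gets_per_step=2, steps=5)
print(balanced)         # [0, 0, 0, 0, 0]
print(producer_faster)  # [1, 2, 3, 4, 5]
```

A curve that stays flat near 0 suggests balanced production and consumption, while a steady upward drift suggests that data generation outpaces training.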
This MR also adds a `batches_per_second` metric to the tensorboard logging, computed from the time it takes to train through `n_batch_epoch_equivalent` batches in the heat-pde server.
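One way such a throughput value could be computed is sketched below: count batches, and every time a window of batches completes, divide the count by the elapsed time. `ThroughputMeter` is a hypothetical helper, not the server's actual code, and its `window` argument stands in for `n_batch_epoch_equivalent`.

```python
import time


class ThroughputMeter:
    """Hypothetical helper: average batches per second over fixed windows."""

    def __init__(self, window, clock=time.perf_counter):
        self.window = window      # plays the role of n_batch_epoch_equivalent
        self.clock = clock        # injectable so the sketch can be tested
        self.count = 0
        self.start = clock()

    def step(self):
        """Call once per trained batch; returns a rate when a window closes."""
        self.count += 1
        if self.count < self.window:
            return None
        now = self.clock()
        elapsed = now - self.start
        rate = self.count / elapsed if elapsed > 0 else float("inf")
        self.count = 0
        self.start = now
        return rate


# Fake clock advancing 0.5 s per call, so the result is deterministic.
ticks = iter(i * 0.5 for i in range(100))
meter = ThroughputMeter(window=4, clock=lambda: next(ticks))
rates = [meter.step() for _ in range(8)]
print(rates)
```

In a training loop, `step()` would be called after each batch with the default clock, and any non-`None` return value would be passed to the logger.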
Monitoring documentation is improved to help users understand these metrics, how to access the metrics, and how to plot them. Descriptions of metrics follow:
- `batches_per_second`: Average number of batches trained per second (logged at `n_batch_epoch_equivalent` frequency).
- `buffer_size`: Size of the buffer at a given time (logged on each sample `get`).
- `put_time`: Time spent to `put` each sample into the buffer.
- `get_time`: Time spent to `get` each sample from the buffer.
- `put_get_inc`: Metric aimed at showing the balance of puts and gets (puts add 1, gets subtract 1).
- `buffer_std/{param}`: The standard deviation of `{param}` in the buffer (only active if `get_buffer_statistics` is set to true in `dl_config`). This value also requires customization for custom parameters (see `examples/heat-pde/heat-pde-dl/heatpde_dl_server.py` for an example).
- `buffer_mean/{param}`: The mean of `{param}` in the buffer (only active if `get_buffer_statistics` is set to true in `dl_config`). This value also requires customization for custom parameters (see `examples/heat-pde/heat-pde-dl/heatpde_dl_server.py` for an example).
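To make the `buffer_mean/{param}` and `buffer_std/{param}` tags more concrete, here is a minimal sketch of per-parameter statistics over the buffer content. It is not the implementation in `heatpde_dl_server.py`: the sample layout, the parameter names, and the use of the population standard deviation are all assumptions made for illustration.

```python
import statistics


def buffer_statistics(samples, param_names):
    """Map tensorboard-style tags to per-parameter mean and std.

    samples: list of parameter vectors currently in the buffer (assumed layout).
    """
    stats = {}
    for index, name in enumerate(param_names):
        values = [sample[index] for sample in samples]
        stats[f"buffer_mean/{name}"] = statistics.fmean(values)
        # Population std is an assumption; the real code may differ.
        stats[f"buffer_std/{name}"] = statistics.pstdev(values)
    return stats


# Hypothetical buffer content with two made-up parameters.
samples = [(100.0, 1.0), (200.0, 3.0), (300.0, 5.0)]
stats = buffer_statistics(samples, ["param_a", "param_b"])
for tag, value in stats.items():
    print(tag, round(value, 3))
```

Each key in the returned dictionary could then be logged as a separate scalar, which is why custom parameters need their own names.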