[SC2023] Heat-PDE use case at scale
This MR investigates the heat-pde use-case at scale in preparation for the SC paper.
Note: this branch stems from `add-put-get-metric`, introduced in MR !89.
Motivation
The heat-pde use-case is particularly interesting because it can be tuned to match any computational load. In addition, the physical phenomena at play are simple enough to be learnt with a simple fully connected architecture.
For these reasons, this example will be used to fulfill the following objectives:
- evaluating the performance of Melissa with respect to the newly introduced metrics,
- designing an extremely scalable use-case, i.e. clients parallelized over many CPUs and the server over many GPUs,
- paving the way towards a real world CFD example (work in progress in MR !80).
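For reference, a minimal sketch of the kind of fully connected surrogate mentioned above is given below (assuming PyTorch; the class name, parameter count and layer sizes are illustrative only and are not taken from the Melissa code base):

```python
import torch
import torch.nn as nn

class HeatPDESurrogate(nn.Module):
    """Toy fully connected surrogate mapping a few simulation parameters
    (e.g. initial/boundary temperatures, time step) to the temperature
    field flattened over the whole mesh."""

    def __init__(self, n_params: int = 5, mesh_size: int = 1_000_000,
                 hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_params, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            # One output per mesh cell: this layer is what makes the
            # model memory-hungry for fine meshes.
            nn.Linear(hidden, mesh_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Small mesh here so the snippet runs anywhere; the experiments below use 1000x1000.
model = HeatPDESurrogate(mesh_size=100 * 100)
fields = model(torch.rand(10, 5))  # batch of 10 parameter sets -> shape (10, 10000)
```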
Execution on Jean-Zay
Although it may be hard to get resources on the standard partitions of the cluster, both the CPU and GPU partitions come with multiple quality of service (QoS) levels with specific priority rules. Among these, the development queues are almost immediately accessible and their limits are large enough to experiment at a decent scale:
- `qos_cpu-dev`: execution time up to 2 h on up to 128 nodes (per job/user/group),
- `qos_gpu-dev`: execution time up to 2 h on up to 32 GPUs (per job/user/group).
Each GPU has 16 or 32 GB of memory and comes with 10 CPU cores (48 GB of RAM). On the other hand, one CPU node, made of 40 cores, has a total of 192 GB of RAM.
Note: according to the cluster website, some partitions are composed of nodes whose GPUs have 40 or even 80 GB of memory.
Preliminary tests
In the context of a deep learning study on Jean-Zay, the main limiting factor is the GPU memory, which saturates quickly when dealing with MLPs whose output layer has 1,000,000 elements. For instance, working with 1 GPU (16 GB), batches of size 10 and a 1000x1000 mesh, the server runs out of memory with hidden layers of size 1024 or 512. A simple study (2 clients) ran successfully with hidden layers of size 256.
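To see why a 16 GB card saturates so quickly, here is a rough back-of-the-envelope estimate of the training-time memory taken by the output layer alone (a sketch only: it assumes fp32 training with Adam, i.e. weights + gradients + two moment buffers, and ignores activations, the other layers and framework overhead):

```python
def output_layer_memory_gb(hidden: int, mesh_size: int = 1_000_000,
                           bytes_per_param: int = 4) -> float:
    """Approximate fp32 training footprint (GB) of the output layer:
    weights + gradients + two Adam moment buffers = 4 copies of the parameters."""
    n_params = hidden * mesh_size + mesh_size  # weight matrix + biases
    return 4 * n_params * bytes_per_param / 1024**3

for hidden in (1024, 512, 256):
    print(f"hidden={hidden:4d}: ~{output_layer_memory_gb(hidden):.1f} GB")
# hidden=1024: ~15.3 GB  -> already fills a 16 GB GPU before activations
# hidden= 512: ~ 7.6 GB
# hidden= 256: ~ 3.8 GB  -> leaves room for the batch and the remaining layers
```

These orders of magnitude are consistent with the observations above, although the exact breaking point also depends on the number of hidden layers and on the framework's memory allocator.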
Regarding the clients, each was executed over 20 cores and ran for around 1 minute.
To-do:
- running complementary experiments on multiple CPU & GPU nodes,
- evaluating the performance in light of the new metrics,
- elaborating full-scale experiments to run on the production queues.