SC23 TODO
SC23
Deadlines
- Abstract deadline: March 30th, 2023
- Paper deadline: April 6th, 2023
Use Cases
Lucas will reach out to EDF to ask about existing work combining Code Saturne and machine learning
Goal
SC23 only cares about the large-scale architecture and the benefits of the supercomputer.
Try to build a 3D Code Saturne example with large numbers of clients/servers and a large DDP ML architecture (maybe 30 GPUs?)
Need a metric for starting/stopping the simulations so that the GPUs stay fully loaded
Buffer monitoring (kind of the same thing)
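A possible shape for the start/stop metric and the buffer monitoring, as a minimal sketch only. `MonitoredBuffer`, the watermark values, and the `advice()` strings are all hypothetical names made up for illustration; nothing here is Melissa's actual buffer API. The idea is to count puts and gets and use the buffer occupancy to decide whether more clients should be launched or held back.

```python
import threading
from collections import deque


class MonitoredBuffer:
    """Bounded FIFO buffer that counts put/get calls.

    Hypothetical helper for illustration; not part of Melissa.
    """

    def __init__(self, capacity, low_watermark=0.25, high_watermark=0.9):
        self.capacity = capacity
        self.low = low_watermark    # assumed threshold: below this, GPUs risk starving
        self.high = high_watermark  # assumed threshold: above this, clients outpace training
        self.items = deque()
        self.puts = 0
        self.gets = 0
        self.lock = threading.Lock()

    def put(self, sample):
        """Store a sample; return False if the buffer is full."""
        with self.lock:
            if len(self.items) >= self.capacity:
                return False
            self.items.append(sample)
            self.puts += 1
            return True

    def get(self):
        """Return the oldest sample, or None if the buffer is empty."""
        with self.lock:
            if not self.items:
                return None
            self.gets += 1
            return self.items.popleft()

    def occupancy(self):
        """Fraction of the buffer currently in use, between 0.0 and 1.0."""
        with self.lock:
            return len(self.items) / self.capacity

    def advice(self):
        """Map occupancy to a coarse start/stop decision for the simulations."""
        occ = self.occupancy()
        if occ < self.low:
            return "start_clients"
        if occ > self.high:
            return "pause_clients"
        return "hold"
```

The `puts`, `gets`, and `occupancy()` values are the kind of scalars that could be logged next to GPU utilization, so both views of "are the GPUs fully loaded" end up on the same dashboard.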
TODO
- Start with existing Code Saturne install at scale
- Add the GPU util and buffer put-get metrics into TensorBoard
- Get offline script with validation set running on Jean-Zay
- Test/make online Code Saturne script with validation set
- Add the timestep -1 send in melissa_finalize + detect the message server-side
- Try buffer where we execute N clients in parallel. Next, increase the density of the grid
- Checkpointing training
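For the timestep -1 item, the server-side detection could look roughly like the sketch below. The message layout (`client_id`, `timestep`, `payload`) and the `ClientTracker` class are assumptions for illustration; the real message format sent by melissa_finalize is not specified in these notes.

```python
from dataclasses import dataclass, field

FINAL_TIMESTEP = -1  # sentinel value from the TODO item; message layout is assumed


@dataclass
class ClientTracker:
    """Tracks which clients have sent their final marker (hypothetical helper)."""

    expected_clients: int
    finished: set = field(default_factory=set)
    received: int = 0  # number of data-carrying messages seen so far

    def handle(self, client_id, timestep, payload=None):
        """Return True for a data message, False for the final marker."""
        if timestep == FINAL_TIMESTEP:
            # The client signalled that it is done; no data to store.
            self.finished.add(client_id)
            return False
        self.received += 1
        return True

    def all_done(self):
        """True once every expected client has sent the final marker."""
        return len(self.finished) == self.expected_clients


tracker = ClientTracker(expected_clients=2)
tracker.handle(client_id=0, timestep=0, payload=[1.0, 2.0])  # data message
tracker.handle(client_id=0, timestep=FINAL_TIMESTEP)         # client 0 finished
tracker.handle(client_id=1, timestep=FINAL_TIMESTEP)         # client 1 finished
print(tracker.all_done())
```

Using a set for `finished` means a duplicated final message from the same client does not count twice.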
Edited by Lucas Meyer