Use ADIOS2 I/O framework

SCHOULER Marc requested to merge adios-comm into develop

This MR investigates ADIOS2 as the server/client communication tool with the final objective of replacing the original Melissa-API.

See the branch README for a detailed discussion of what was done and what is left to do.

To do:

  • Make toy example to demonstrate feasibility and better understand adios2 internals
  • Isolate/document adios2 installation
  • Build/test heatpde with adios2
  • Build RoundRobin methodology
  • Build AllToAll methodology
  • Test RoundRobin in Melissa
  • Test AllToAll in Melissa
  • Remove collective communications from get_data() and engine openings
  • Test joblimit, ensure that new engines are opened as they become available
  • Implement deep learning heat-pde version (with RoundRobin)
  • Add adios2 install to gitlab CI
  • Rewrite API to be closer to Adios (melissa_define_var, melissa_begin_step, melissa_put etc)
  • Ensure termination pattern is precisely how we want it. We have now converted both SA and DL to use the same termination pattern based on the existence of files (as opposed to counting timesteps in the old ZMQ version).
  • Integrate into CI and convert tests to conform to the new API methods
  • Test using multiple fields on client/server
  • Refactor methods to ensure generalized machinery is in base_server.py, and specific machinery is in dl/sa server files.
  • Ensure receive() is shared between SA and DL, adding a child_data_handle() to either compute stats or put to the buffer. Outcome: the two servers require fully separate receive() functions, because their finalization signals and the message types expected for the buffer vs compute_stats differ.
  • Ensure fault tolerance/checkpointing is working. The adios2 engine object cannot be pickled, which caused issues; an issue was opened with ADIOS2: https://github.com/ornladios/ADIOS2/issues/3808
  • Ensure client failure will re-open Engine status on server side after relaunch
  • Have launcher clean up the client engine file if it detects a failure? If the reader fails, writer still needs the file. If the writer fails, the file will be recreated on a new submission of the same writer.
  • Check with ADIOS2 that calling engine.Close() waits for readers to extract all remaining timesteps
  • Check statistics remain the same before and after ZMQ->Adios2. Both sobol and nonsobol results are consistent after the transition.
  • Wrap the adios2 API in an easy Melissa API call: C done, Python done, and Fortran done (for 1d vectors only).
  • Convert Lorenz to use the new Python adios2 API. Lorenz is now working with the Adios Python lib directly.
  • Create an optional setup.sh script that can be fetched with wget to install adios and melissa
  • Updating documentation website.
  • Install on Jean-Zay and run Heatpde (/gpfswork/rech/igf/commun/uhd97cp/melissa_adios2/melissa/examples/heat-pde/heat-pde-dl)
  • Test adios2 on Jean-Zay with InfiniBand
  • Ensure ZMQ is completely removed from project?
  • ADIOS2 SST does have support for ZMQ. If RDMA is not possible, we can still leverage ZMQ.
  • Replicate supercomputing
  • Add server side sobol
  • Update code saturne melissa writer
  • Allow typing to be set on the client side to ensure best performance
  • Compile Adios2 in the virtual cluster so that virtual cluster CI stage works https://gitlab.inria.fr/melissa/virtual-cluster-ci/. We now have a fully green CI with Adios, meaning nearly all functionality (except sobol) is replicated between ZMQ and Adios2 versions.
  • Continue from the previous timestep after restarting the server following a failure?
  • The ZMQ approach was to kill all the clients and resubmit them.
  • In adios2, the writers keep waiting on the file for the reader to (re-)open it. However, the launcher kills all jobs and then resubmits the writers.
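
For the RoundRobin and AllToAll items in the list above, the following is a minimal sketch of how client data might be mapped onto server ranks. It rests on two assumptions that this description does not confirm: that RoundRobin sends each whole timestep to a single server rank chosen by rotation, and that AllToAll splits every field into contiguous slices, one per server rank. The function names `round_robin_target` and `all_to_all_slices` are hypothetical and are not part of Melissa or adios2.

```python
def round_robin_target(client_id, step, n_server_ranks):
    """Pick the single server rank that receives this (client, step) pair.

    Assumption: rotation over server ranks, offset by the client id.
    """
    return (client_id + step) % n_server_ranks


def all_to_all_slices(field_size, n_server_ranks):
    """Split indices [0, field_size) into one contiguous slice per server rank.

    Returns a list of (start, stop) pairs; the first ranks absorb the remainder.
    """
    base, extra = divmod(field_size, n_server_ranks)
    slices = []
    start = 0
    for rank in range(n_server_ranks):
        length = base + (1 if rank < extra else 0)
        slices.append((start, start + length))
        start += length
    return slices


if __name__ == "__main__":
    print(round_robin_target(client_id=2, step=3, n_server_ranks=4))
    print(all_to_all_slices(field_size=10, n_server_ranks=3))
```

Under these assumptions, RoundRobin keeps each sample whole on one rank, while AllToAll keeps every rank involved in every timestep.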
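
The checkpointing item above notes that the adios2 engine object cannot be pickled. One common Python workaround, sketched here under assumptions, is to drop the handle in `__getstate__` and reopen it in `__setstate__`. The class `ServerState` and its attributes are hypothetical, and a `threading.Lock` stands in for the engine only because it is also unpicklable; this is not the approach the branch necessarily takes.

```python
import pickle
import threading


class ServerState:
    """Hypothetical server state holding one unpicklable handle."""

    def __init__(self, engine_name):
        self.engine_name = engine_name
        self.received_steps = {}  # client id -> last timestep received
        self._engine = self._open_engine()

    def _open_engine(self):
        # Stand-in for reopening the real engine from self.engine_name.
        # A lock is used here only because it cannot be pickled either.
        return threading.Lock()

    def __getstate__(self):
        # Copy the attributes and drop the handle before pickling.
        state = self.__dict__.copy()
        state["_engine"] = None
        return state

    def __setstate__(self, state):
        # Restore the plain attributes, then reopen the handle.
        self.__dict__.update(state)
        self._engine = self._open_engine()


if __name__ == "__main__":
    state = ServerState("client_0")
    state.received_steps[0] = 41
    restored = pickle.loads(pickle.dumps(state))
    print(restored.engine_name, restored.received_steps)
```

The checkpoint then carries only the bookkeeping needed to resume, and the handle is rebuilt on restart rather than serialized.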

This repo implements a simplified Melissa/Adios2 system motivated by our exchanges with the adios2 developers (see this GitHub discussion).


Edited by PURANDARE Abhishek
