Melissa Extensive Development Guideline & Todo List
This document gathers high-level tasks for short and medium term developments of the Melissa suite. Task maturity varies; some tasks will need to be refined further. This list is expected to be updated regularly.
This list concerns all Melissa flavors (melissa-SA, melissa-DA, melissa-DL) and aims at building better integration between these flavors.
Previous Notes and Exploratory Work
- Bruno Melissa Refactoring <2021-09-06 Mon>: https://gitlab.inria.fr/-/snippets/738
- Launcher refactoring issue: https://gitlab.inria.fr/melissa/melissa/-/issues/62
- Lucas on Deep Melissa: https://gitlab.inria.fr/melissa/deep-melissa/-/issues/5
- Adios testing: https://gitlab.inria.fr/melissa/melissa/-/issues/135
- PDI plugin for Melissa: https://gitlab.inria.fr/melissa/melissa/-/issues/35
- Sebastian's analysis of the FT protocols: https://cloud.univ-grenoble-alpes.fr/index.php/s/decPxinSrjyKLpJ
Development Process
- CI
- OAR and SLURM virtual clusters are working, but not yet deployed in CI for testing
- x86 Intel and ARM (Raspberry Pi) platforms
- It would be desirable (less maintenance) to move from our own hardware to equipment managed by the lab/university/INRIA/others (Gricad?).
- Generally important to maintain and extend the test suite
- Coding Style
- Comments
- Naming and directory structure
- License
- GitLab:
- issues
- branches
- Python style enforcement, variable typing, ...
- Packaging
- CMake:
- For compiling and installing all Melissa code, including the Python parts
- Generates a Melissa configuration file that the user needs to source to set all necessary access paths
- Effective in Melissa-SA. Needs to be maintained.
- Spack:
- Packaging solution to maintain and extend
- melissa-sa has a Spack package integrated into the Spack distribution
- melissa-da has a Spack package that needs to be finished
- We developed a Spack package for Code_Saturne as it was often difficult to install (up to the Code_Saturne team to push it into the Spack distribution)
- Nix
- Packaging solution to experiment with
- The OAR team developed a Nix package for Melissa (https://github.com/oar-team/nur-kapack)
- Documentation
- Documentation is embedded in the code in README.md files using Markdown.
- Documentation is uneven and spread across different branches and repositories. Efforts are needed to keep it coherent and improve it.
- Pros:
- The doc comes with the code, where it is needed
- This is simple and effective. Markdown formatting is limited but sufficient so far.
- Cons:
- We do not have a linear/printable documentation
- GitHub relies on a similar approach
- Doc generators from comments in the code (Doxygen-like)? We made some attempts, but they were not generalized. Needs further discussion.
- Examples: have one running example (heat_pde, possibly adding lorenz96 for DA) that is well documented and used to demonstrate the different features of the suite. Other examples should be kept to a minimum.
Launcher (New Version)
- Used only by melissa-dl so far, but designed to be suitable for the other Melissa flavors.
- Important design changes with respect to the original melissa-sa launcher:
- The list of simulations to run (the "design of experiment") is driven by the server and not the launcher
- This requires changes in the server codes.
- Schedulers supported:
- Local
- Slurm
- OAR
- Experiments have shown the need to support variations from the original design where each client (potentially a group of simulations) is one job submitted to the batch scheduler.
- Need to decouple the resource allocation (nodes) from the client launching, to guarantee resource availability for the application progress.
- We tested different approaches with mixed results
- Granularity issues: when simulations are short, the cost of job handling by the native machine scheduler dominates and makes the execution inefficient (a minimal sketch of the first workaround follows this list). Solutions:
- Let the user explicitly gather several simulations into one.
- Rely on some utility tools (SLURM job arrays, GNU parallel, ...)
- Support a high-throughput workflow scheduler like Radical-Pilot, Balsam, ExaWorks, Flux
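- A minimal sketch of the first workaround (explicitly grouping several short simulations into one scheduler job); the client command and script are illustrative:
```python
import subprocess

def run_group(simu_ids):
    """Run several short simulations back to back inside one scheduler job."""
    for simu_id in simu_ids:
        # Illustrative client command; in Melissa each simulation would also
        # receive its own parameter set and connection information.
        subprocess.run(["./heat_pde_client", "--simu-id", str(simu_id)], check=True)

if __name__ == "__main__":
    # One batch job handles simulations 0..15 instead of 16 separate jobs.
    run_group(range(16))
```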
- TODO:
- Stabilise a two-level scheduling code (allocate resources, then schedule into this allocation) for SLURM and OAR; a minimal SLURM sketch follows this list.
- If necessary revisit the launcher design to make it simple and intuitive for users to change and adapt the scheduling strategy
- Test and support a tool like Radical-Pilot. Early feedback from the Jean Zay support team is that the MongoDB server required by Radical-Pilot is problematic.
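- A minimal sketch of the two-level approach under SLURM, assuming the script already runs inside an allocation (e.g. submitted with sbatch), so that each client becomes a job step launched with srun; the client command is illustrative:
```python
import subprocess
import time

def launch_client(simu_id, ntasks):
    # One SLURM job step per client, scheduled inside the existing allocation;
    # nothing new goes through the batch queue.
    cmd = ["srun", "-N", "1", "-n", str(ntasks),
           "./heat_pde_client", "--simu-id", str(simu_id)]
    return subprocess.Popen(cmd)

def schedule(simu_ids, ntasks=4, max_steps=8):
    running = []
    for simu_id in simu_ids:
        # Wait until a step slot frees up before starting the next client.
        while len([p for p in running if p.poll() is None]) >= max_steps:
            time.sleep(1)
        running = [p for p in running if p.poll() is None]
        running.append(launch_client(simu_id, ntasks))
    for p in running:
        p.wait()
```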
- Remark: ongoing work in the REGALE EU project is leading us to test a stronger binding between Melissa and the batch scheduler (OAR) to further optimize resource usage / energy consumption.
- TODO: the new launcher needs to be ported to the other Melissa flavors
- melissa-DA:
- Should be easy for the particle filter server (sequential Python code)
- The server based on PDAF (Fortran, EnKF filter) may require more work.
- melissa-SA:
- From melissa-dl:
- Develop a Python interface for the iterative statistics library (or reimplement the iterative statistics with NumPy; a sketch follows this list)
- Adapt the server (simpler than the melissa-dl server since computations are local and there is no need for asynchronism)
- The tricky part is pick-freeze for Sobol' indices, but it is a good way to probe the design
- Need to coordinate with EDF (Alejandro), who has been working on iterative statistics (in OpenTURNS in particular). Update (Sept 2022): EDF is working on a pure Python library for iterative statistics, with numerical stability tests. We need to coordinate with them and integrate it into Melissa as a replacement for the original Melissa-sa implementation.
- This approach implicitly means that we give up the current C server and move to a Python-based server, but this seems reasonable. The computation of the statistics has never been a bottleneck; the server parallelization level is instead driven by (1) ZMQ performance and (2) the amount of memory needed.
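- A minimal sketch of one-pass (iterative) mean/variance with NumPy, in the spirit of a Welford-style update, as a possible starting point for such a Python statistics module (class and method names are illustrative, not an existing Melissa interface):
```python
import numpy as np

class IterativeMoments:
    """One-pass mean/variance over field samples arriving one simulation at a time."""

    def __init__(self, vect_size):
        self.count = 0
        self.mean = np.zeros(vect_size)
        self.m2 = np.zeros(vect_size)  # running sum of squared deviations

    def update(self, sample):
        self.count += 1
        delta = sample - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (sample - self.mean)

    @property
    def variance(self):
        return self.m2 / max(self.count - 1, 1)
```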
- melissa-DA:
- TODO: Review the status of the job monitoring developed by Bartek (mode 1: in-line; mode 2: web server), potentially integrate it better in the launcher, and properly document it.
- Turn the launcher into a Python module
- The new launcher is a Unix-style executable (configured through many options).
- TODO: discuss whether it is relevant to turn it into a module so users can include it in a larger Python script (a hypothetical sketch follows). See the Melissa-DA launcher, which follows this logic.
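- A hypothetical sketch of what a module-style launcher interface could look like; the module path, class names and options are assumptions, not the current launcher API:
```python
from melissa.launcher import Launcher, LauncherConfig  # hypothetical module and classes

config = LauncherConfig(
    scheduler="slurm",                    # "local", "slurm" or "oar"
    server_command="python3 server.py",
    client_command="./heat_pde_client",
    nodes_per_client=1,
)

launcher = Launcher(config)
launcher.run()  # blocks until the study finishes
```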
Server(s)
- The server is application specific. We have so far:
- SA: C + MPI, non-communicating processes
- DL: Python + MPI, allreduce communications between processes, PyTorch supported, TensorFlow on the way
- DA
- EnKF: based on PDAF (Fortran + MPI)
- Particle Filter: Sequential Python server.
- TODO: we have to think in terms of Python modules providing services to build a server:
- Client/Server communications
- Related to the Melissa API, but so far the Melissa API is very focused on the client side. See the specific section on this topic for an in-depth analysis.
- Fault tolerance
- The server is in charge of looking after the clients, and is thus an important component of FT.
- Server needs to be checkpointed so it can restart on failure.
- Design of experiment: everything related to defining which simulations to run
- SA/DL: so far there is no control of which simulations to run based on data analysis results, but it may come soon. Keep that in mind in the design.
- As usual Pick-Freeze for Sobol' indices is our top complexity use case for this topic
- Generating all the simulations on the server may cause performance issues in some cases - see the Lorenz example with Melissa-DL (in particular because only one node of the server works on that). It may be beneficial to off-load part of this work to the launcher. That may just require extending the server/launcher protocol so the server can order the launcher to generate and launch X simulations (the launcher then also needs the capability to load the related module). A hypothetical message is sketched below.
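- A hypothetical sketch of such a protocol extension; the command and field names are assumptions, not part of the current server/launcher protocol:
```python
def make_generate_request(count, doe_module, first_simu_id):
    """Build a (hypothetical) server -> launcher request to off-load DoE generation."""
    return {
        "command": "generate_and_launch",
        "count": count,                   # number of simulations to create
        "doe_module": doe_module,         # parameter-sampling module the launcher should load
        "first_simu_id": first_simu_id,   # keep ids consistent with server bookkeeping
    }
```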
- Machine specifics: everything related to the target machine (batch scheduler, accounts, node configurations, GPUs to use, ....)
- Administration/monitoring communication channels
- Includes port setting (ports should be chosen by the system as much as possible to avoid issues when starting several Melissa applications concurrently on the same resources; a sketch follows this list), identification of node names, transmission of these node names to whoever needs them, and transport of some data.
- Data transport probably goes in a specific module (melissa-API)
- Support for administrative communications between clients and server, server and launcher.
- The communication between the server and the new launcher has been reworked and improved compared to the original implementation. Note that this transport is based on TCP/IP, not ZMQ.
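- A minimal sketch of system-chosen ports with pyzmq: the server binds to a random port and then publishes its node name and port (typically via the launcher) to whoever needs to connect:
```python
import socket
import zmq

context = zmq.Context()
data_socket = context.socket(zmq.PULL)
port = data_socket.bind_to_random_port("tcp://*")  # let the system pick a free port
node_name = socket.gethostname()
# The node name and port must then reach the clients (e.g. through the launcher).
print(f"server data endpoint: tcp://{node_name}:{port}")
```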
- Data processing: specialized modules for data processing. Specific to each application (SA, DL, DA); some introduce extra dependencies (PyTorch, TF), others non-Python codes (PDAF for DA).
Melissa-API (Client)
- It needs to be deeply revisited and likely partly rewritten.
- Developed in C/C++ with binding for Fortran and Python.
- Two flavors: one for SA/DL and one for DA (two-way communications).
- Comprises several aspects:
- Initialisation/closing of connections (with handshake with rank 0 of the server)
- Data transport
- SA: NxM (one way)
- Deep: Round Robin
- DA:
- ENKF: NxM 2-ways
- Particle Filter: 1-to-1
- Coupling of simulations in a group to aggregate the data onto the master simulation for Sobol' indices.
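- A minimal sketch of the client-side call pattern described above (connect, send one field per time step, close). The function names mimic the C API; the Python module name, exact signatures and any MPI communicator argument are assumptions and differ between flavors and versions:
```python
import numpy as np
from melissa_api import melissa_init, melissa_send, melissa_finalize  # module name is an assumption

def run_client(nb_time_steps, vect_size):
    field = np.zeros(vect_size)
    melissa_init("temperature", vect_size)    # handshake with rank 0 of the server
    for _ in range(nb_time_steps):
        field = field + 1.0                   # stand-in for one solver step
        melissa_send("temperature", field)    # ship one 1D array of doubles
    melissa_finalize()                        # close the connection
```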
- Weak points:
- No data model. The user is responsible for serializing the data and deserializing it on reception. This is error prone.
- Very basic data structure support so far (only 1D arrays of doubles). This can be a limiting factor for users handling more advanced data structures.
- The stopping criteria need to be better defined. So far melissa-dl sends a message to all server nodes (one with data, all the others with empty messages) so each process can count the number of messages received and stop when the target count is reached. This is not efficient. We need to switch to a termination algorithm based on an 'end of data transmission' message emitted by the client (a minimal sketch follows this list). Be careful about details such as the order of message arrival possibly differing from the order of emission.
- Coupling for Sobol' is implemented in the API code using if switches, which makes the code complex to read and maintain.
- Solution 1: make a specific API for coupling, even if it means adding more calls to the simulation code; pick-freeze being so specific, the user cannot avoid being aware of the coupling anyway (cf. MPI communication issues)
- Solution 2: get rid of coupling and data aggregation at the client level, and move data aggregation to the server side. Pros: supporting Sobol' indices will only need specific code on the server side. Cons: some performance loss as data aggregation is less parallel.
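- A minimal sketch of the termination rule mentioned above (one explicit end-of-transmission message per client); the message layout is illustrative:
```python
class TerminationTracker:
    """Count one explicit end-of-transmission message per client."""

    def __init__(self, expected_clients):
        self.expected_clients = expected_clients
        self.finished = set()

    def on_message(self, msg):
        """Process one incoming message; return True when all clients are done."""
        if msg.get("type") == "end_of_transmission":
            self.finished.add(msg["client_id"])
        # Data messages from other clients may still arrive afterwards and in a
        # different order, so only the per-client end message marks completion.
        return len(self.finished) == self.expected_clients
```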
- The biggest effort is related to the data model and the support of more advanced data structures.
- Here are the different options we have investigated/considered:
- ZMQ today. Simple and robust, but not capable of leveraging the best of high performance networks (InfiniBand, OmniPath)
- MPI connected mode
- Christoph's tests showed some issues with some implementations.
- Would be the best performance-wise (MPI is the standard)
- No data model
- Adios2 NxM data layer:
- https://adios2.readthedocs.io/en/latest/engines/engines.html
- Pros:
- Data Model
- High performance transport layer (to be confirmed; the documentation is not so clear for the connected mode)
- Cons:
- All-in-one application
- No practical experience so far
- Usage is limited, and so are documentation, support, etc.
- Early tests done by Christoph
- https://gitlab.inria.fr/melissa/melissa/-/issues/135
- Connected mode ok
- 2-ways communications ok
- No need for an XML/YAML file (done at the first iteration)
- OK with Python or C++ (based on templates)
- Round-robin yes (decided on server side)
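- A rough sketch of the kind of ADIOS2 SST (connected-mode) writer exercised in these tests; the ADIOS2 Python API changes between versions, so names and signatures are indicative only:
```python
import numpy as np
import adios2

adios = adios2.ADIOS()
io = adios.DeclareIO("melissa_writer")
io.SetEngine("SST")                       # staging / connected-mode engine

data = np.zeros(100, dtype=np.float64)
var = io.DefineVariable("temperature", data, [100], [0], [100], adios2.ConstantDims)

engine = io.Open("melissa_stream", adios2.Mode.Write)
for _ in range(10):
    engine.BeginStep()
    engine.Put(var, data)
    engine.EndStep()
engine.Close()
```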
- PDI:
- Developed by Julien Bigot's team
- This is first of all a data model; it does not come with NxM data redistribution, which would need to be added
- A PDI plugin can be developed in Python and called from a Fortran code easily (with some restrictions for some data types like strings)
- We have an early prototype for the client side: https://gitlab.inria.fr/melissa/melissa/-/issues/35
- TODO
- Move toward a unified code for SA, DL and DA.
- See the Melissa API of melissa-da, which may be a better starting point than the one of melissa-sa.
- Properly define the termination algorithm (and avoid sending empty messages as done in melissa-dl)
- Better split the API library between the different aspects (connection, data communication, coupling, etc.)
- Maintain a ZMQ implementation for its simplicity, even without a data model (a sketch follows this list)
- Coupling: adopt data aggregation on the server side for its simplicity?
- Pursue evaluation of ADIOS
- Evaluate how PDI and Adios can play together or should be considered as competitive solutions (they are not equivalent).
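- A minimal sketch of the plain ZMQ transport mentioned in the TODO above (no data model: the receiver must know dtype and vector size a priori); endpoint and port are illustrative:
```python
import numpy as np
import zmq

def client_send(endpoint, field):
    # Serialize a 1D array of doubles by hand; the server deserializes it.
    sock = zmq.Context.instance().socket(zmq.PUSH)
    sock.connect(endpoint)
    sock.send(np.ascontiguousarray(field, dtype=np.float64).tobytes())

def server_loop(port, vect_size):
    sock = zmq.Context.instance().socket(zmq.PULL)
    sock.bind(f"tcp://*:{port}")
    while True:
        field = np.frombuffer(sock.recv(), dtype=np.float64, count=vect_size)
        print(field.mean())  # placeholder for handing the sample to the statistics module
```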
Runner versus Simulation
- Runners are implemented in melissa-DA.
- A runner supports the virtualization of the simulation execution. The virtualization mechanism is based on state caching.
- Could make sense for Deep Melissa to get more control over the data production: produce more data for the time steps that are difficult to learn, the server being the one responsible for generating the starting states (propagate from t_i to t_j).
- Using runners also makes it possible to decouple the resource allocation from the number of members/simulations without paying the overhead of simulation startup.
- The difficulty is to identify the variables to save (hard to think of anything automatic there, but the job is probably already done in some form if the simulation supports checkpoint/restart).
- Two flavors in melissa-da:
- States saved to the server (EnTK)
- States saved to a distributed cache (and persisted to the file system) implemented across the runners (with FTI)
- Sebastian:
- Users do not see the benefits (and so are not ready to make the effort)
- So we need to find motivation, or a possibly simplified design
Other
- TODO: Need an option to turn fault tolerance ON/OFF
- TODO: Have one file/place where all user-level parameters are gathered (option.py). This has not been maintained in the current version of melissa-dl
- TODO: Client failure detection on the server side (melissa-dl)
- Called at every data reception. Detects failing simulations and restarts them if necessary. The current implementation is too costly.
- Possible more efficient implementation:
- At regular intervals (time based) within the poll loop, and not at each data message reception:
- Check for unresponsive simulations.
- Compute the time for the next check (a sketch follows):
- t_min = min(last_message_time over all connected simulations)
- t_nextcheck = t_min + t_timeout
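- A minimal sketch of this time-based check; the timeout value and data structures are illustrative:
```python
import time

TIMEOUT = 60.0  # seconds without any message before a simulation is suspect (illustrative)

def next_check_delay(last_message_time):
    """last_message_time maps simulation id -> time of its last received message."""
    t_min = min(last_message_time.values())
    t_next_check = t_min + TIMEOUT
    return max(t_next_check - time.time(), 0.0)

def find_unresponsive(last_message_time):
    now = time.time()
    return [sid for sid, t in last_message_time.items() if now - t > TIMEOUT]
```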
- Minimize the need to work with bash scripts; favor Python instead.
- Integration with Jupyter notebooks?
- Jean Zay supports triggering execution from an external notebook after setting up a Jupyter server.
- Experiment tracking (reproducibility)
- Currently basic support, but important to help users cleanly organize their experiments
- Save logs, input and output data, and copies of parameter files in a specific dated directory (a sketch follows)
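- A minimal sketch of such a dated experiment directory; paths and file names are illustrative:
```python
import shutil
from datetime import datetime
from pathlib import Path

def create_experiment_dir(base="melissa_studies"):
    """Create a dated directory for this run and return its path."""
    exp_dir = Path(base) / datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    exp_dir.mkdir(parents=True, exist_ok=True)
    return exp_dir

def archive_inputs(exp_dir, files=("options.py", "server.log")):
    """Copy parameter files and logs next to the study outputs."""
    for name in files:
        if Path(name).exists():
            shutil.copy(name, exp_dir / name)
```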
- Templating of OAR/SLURM scripts (see with Marc)?
- Protocol (for administrative stuff)
- melissa-api: very basic, difficult to extend
- new-launcher/server: in-house, somewhat easier to extend (see with Marc)
- Protobuf is used in DA, but there is a license compatibility issue (MIT license)
- A JSON-based protocol was discussed at some point as the right intermediate solution, but never implemented (a sketch follows).
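- A minimal sketch of what a length-prefixed JSON administrative message could look like; the fields are illustrative, not an existing Melissa protocol:
```python
import json
import struct

def encode(msg):
    """Serialize one administrative message with a 4-byte length prefix."""
    payload = json.dumps(msg).encode()
    return struct.pack("!I", len(payload)) + payload

def decode(buf):
    """Inverse of encode(); assumes the whole frame is already in buf."""
    (length,) = struct.unpack("!I", buf[:4])
    return json.loads(buf[4:4 + length].decode())

# Example launcher -> server message:
frame = encode({"command": "job_update", "simu_id": 12, "status": "running"})
```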
- More general / longer-term question: how could Melissa integrate with Ray and/or Dask?