The goal of this issue is to discuss and list the next big steps to improve Batsim and have a proper 3.0 version.
In progress: postpone for 4.0?
Make supported features possible with the same Batsim/SimGrid version.
- The energy model has trouble with parallel tasks in upstream SimGrid; we should implement a better model (should take one afternoon).
- Dynamic job kills are broken in corner cases, probably due to a SimGrid bug.
- OBFH prevented the use of an energy platform and dynamic submissions at the same time. The current status of this bug needs investigation, as it probably came from SimGrid.
Build upon upstream SimGrid, not our aging fork.
Staying on the fork prevents us from using recent fixes and improvements, and getting help on our old version will become harder and harder. Done. Current master uses a recent SimGrid described there.
Use S4U, not MSG.
MSG is deprecated and adds a C layer between two C++ layers, which makes memory management and debugging more complex. Partly done. Some work is still required in SimGrid to completely remove MSG (notably for killing parallel tasks).
Remove ambiguities in the protocol and the input files.
Such ambiguities are very dangerous for persistence, as resource management policies are free to interpret them however they want and can quickly become incomparable.
More details below on the ambiguities.
Unify CLI and configuration file.
Some important options (those that define the decision space) are split between the two interfaces...
Simplify life for external tools.
One currently needs to parse the Batsim command, logs or output files to get information about a simulation instance.
Allowing such information to be retrieved directly from the Batsim CLI would be much better.
A first problem is that the definition of the decision space is vague. Many parameters change the decision space:
- From CLI:
- From the configuration file: Whether and how dynamic submissions are allowed
A decision process should know what it is allowed to do, so that it can either adapt its behavior to the decision space or report that it does not know how to handle it.
An external program should be able to determine the decision space of an instance, so that it can tell whether two instances are comparable.
Batsim currently stops the simulation when the decision process takes an invalid decision. However, this safeguard is easily lost, as compiling without assertions disables all checks...
Finding an unambiguous and clear way to express the decision space would be interesting.
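As a starting point for discussion, here is a minimal sketch of an explicit decision space, written in Python for brevity (Batsim itself is C++). The flag names `allow_dynamic_submissions` and `allow_job_kills`, the helpers `check_decision` and `comparable`, and the decision type strings are all hypothetical and not part of the current protocol. The check raises an exception instead of relying on `assert`, because Python drops `assert` statements under `-O` in much the same way that C/C++ drops `assert` when `NDEBUG` is defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionSpace:
    # Hypothetical flags; the real list of parameters is still to be defined.
    allow_dynamic_submissions: bool = False
    allow_job_kills: bool = False


class InvalidDecision(Exception):
    """Raised when a decision falls outside the declared decision space."""


def check_decision(space: DecisionSpace, decision_type: str) -> None:
    # Explicit checks stay active in every build mode, unlike assertions.
    # The decision type strings below are illustrative only.
    if decision_type == "SUBMIT_JOB" and not space.allow_dynamic_submissions:
        raise InvalidDecision("dynamic submissions are not allowed")
    if decision_type == "KILL_JOB" and not space.allow_job_kills:
        raise InvalidDecision("job kills are not allowed")


def comparable(a: DecisionSpace, b: DecisionSpace) -> bool:
    # In this sketch, two instances are comparable only if their
    # decision spaces are identical.
    return a == b
```

With something along these lines, an external tool could compare the decision spaces of two instances without parsing logs, and a decision process could inspect the same structure before choosing its behavior.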
Batsim currently only supports rigid jobs.
Rigid jobs define a number of required resources n, and can only be executed on allocations composed of n resources.
Some hacks are currently possible to have moldable and evolving jobs.
Supporting non-rigid jobs in an unambiguous and clearly-defined way would be much better for persistence and connection with existing (production) tools.
What is a resource?
Resources are currently undefined. Depending on the context:
- One may see a SimGrid host as a (computational) resource
- (One may see a core of a SimGrid host as a (computational) resource)
- One may see an executor (distinct thread of execution, MPI rank...) as a resource. This seems wrong!
Currently, some events must concern hosts (SET_RESOURCE_STATE). Others may concern hosts, cores or executors (EXECUTE_JOB).
What is a job?
Job requests are neither specified nor documented.
I think that the job request should be unambiguous at the job-definition level
(currently, a JSON object in a workload file or submitted dynamically).
This would allow workloads to have a fully-defined behavior (regardless of the resource manager used, regardless of the underlying simulation profile).
What does the "number of resources" mean?
- Is this number required?
Should the decision process give an allocation of this size exactly?
- Is this number required at least?
Should the decision process give an allocation of size greater than or equal to this size?
- Is this number just an indication?
Should the decision process be allowed to give an allocation of any size?
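The three interpretations above could be made explicit in the job definition. The following Python sketch is only an illustration: the `ResourceCountSemantics` enum and the `allocation_is_valid` helper are hypothetical names, and rejecting empty allocations under the "indication" interpretation is an assumption rather than something decided here.

```python
from enum import Enum


class ResourceCountSemantics(Enum):
    EXACT = "exact"            # allocation size must equal the requested number
    AT_LEAST = "at_least"      # allocation size must be >= the requested number
    INDICATION = "indication"  # the requested number is only a hint


def allocation_is_valid(requested: int, allocated: int,
                        semantics: ResourceCountSemantics) -> bool:
    # Assumption: an allocation must contain at least one resource.
    if allocated < 1:
        return False
    if semantics is ResourceCountSemantics.EXACT:
        return allocated == requested
    if semantics is ResourceCountSemantics.AT_LEAST:
        return allocated >= requested
    # INDICATION: any non-empty allocation is accepted.
    return True
```

Under this reading, today's rigid jobs correspond to `EXACT`, while the other two values would cover some of the moldable cases that currently rely on hacks.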
The wall-time is now optional. As jobs now have a return value, what they should return when reaching wall-time is not defined.
Other things to discuss
Take a look at every issue on gitlab.inria and GitHub.