Milestone ID: 2405
Milestone for the next major Batsim version.
- Serialization: flatbuffers (binary or JSON) -- was JSON.
Serialization libraries provided in batprotocol-
- Protocol cleanup.
- Probe support.
- Decision process call: C API or RPC -- was RPC (zmq).
- Simulation core: custom-fast or SimGrid -- was SimGrid.
- Cleanup/rework on jobs/profiles.
Batprotocol and flatbuffers rationale
The main benefit is improved protocol maintainability.
Pre-5.0.0 in a nutshell:
- New users need to understand a relatively big project (pybatsim, batsched...) to implement their own scheduler.
- Untyped JSON objects are sent over the network.
- Protocol management code has duplication (in C++ between batsim and batsched).
- No reference implementation has all the protocol features (batsched has most of them).
What we should have with 5.0.0:
- All code needed to talk with Batsim will be put in batprotocol libraries. This means new users can fork a small dedicated project (FCFS or EASY scheduler) that uses the library, and play with their own scheduler without struggling with pybatsim/batsched/etc.
- Protocol messages are typed with flatbuffers. Messages will remain easy to extend, but types will help avoid mistakes. Batsim will still support sending/receiving JSON messages (the format will change though).
- Different implementations are easier to compare.
Determining which features are present or absent in each batprotocol-<lang> is quite easy: read all the buffers generated by the batprotocol tests and detect what is missing.
- Performance: flatbuffers should be faster than rapidjson.
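The feature comparison described above boils down to a set difference. The sketch below only illustrates the idea and is not batprotocol's actual tooling: the event-type names and the `missing_features` helper are hypothetical, and the test buffers are assumed to have already been decoded into lists of event-type names.

```python
# Hypothetical event-type names; the real protocol defines its own set.
ALL_EVENT_TYPES = {
    "job_submitted",
    "job_completed",
    "execute_job",
    "reject_job",
    "create_probe",
}


def missing_features(decoded_buffers):
    """Return the event types that never appear in the decoded test buffers.

    `decoded_buffers` is an iterable of messages, each message being a list of
    event-type names (assumed to be extracted from the flatbuffers beforehand).
    """
    seen = set()
    for message in decoded_buffers:
        seen.update(message)
    return sorted(ALL_EVENT_TYPES - seen)


# Buffers produced by the tests of one imaginary batprotocol implementation.
buffers = [["job_submitted", "execute_job"], ["job_completed"]]
print(missing_features(buffers))  # ['create_probe', 'reject_job']
```

Running the same check on the buffers of each implementation gives a directly comparable list of gaps.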
Protocol cleanup rationale
As the JSON->flatbuffers transition will break everything, this is a good opportunity to improve the protocol. In particular, we want:
- To remove features that make the protocol a lot more complicated with no obvious benefit. In particular, Redis could be used to exchange some data between Batsim and a decision process. Nobody used it in practice, it doubles the complexity of many messages, and it is probably very detrimental to performance (control message overhead + data message overhead on Redis).
- To provide better abstraction for complex features, such as better profile composition instead of mixing additional IO parallel tasks with classical parallel tasks.
Probe support rationale
Pre-5.0.0, only energy on hosts could be probed, with a minimalistic API (get the cumulative energy consumption of ALL computation machines from t=0 to now). With 5.0.0, many more metrics and ways to aggregate/cumulate them will be provided.
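To make the idea of "ways to aggregate" concrete, here is a minimal sketch of a probe that aggregates one metric over a chosen subset of hosts rather than over all machines. Everything in it is an assumption made for illustration: the `probe` function, the aggregation names and the host names do not come from Batsim.

```python
def probe(values_by_host, hosts, aggregate="sum"):
    """Aggregate one metric over a chosen subset of hosts.

    `values_by_host` maps a host name to its current metric value (for example
    power in watts). The aggregation names are hypothetical.
    """
    selected = [values_by_host[h] for h in hosts]
    if aggregate == "sum":
        return sum(selected)
    if aggregate == "max":
        return max(selected)
    if aggregate == "mean":
        return sum(selected) / len(selected)
    raise ValueError(f"unknown aggregation: {aggregate}")


# Made-up instantaneous power values, in watts.
power = {"host0": 100.0, "host1": 150.0, "host2": 95.0}
print(probe(power, ["host0", "host1"], "sum"))   # 250.0
print(probe(power, list(power), "max"))          # 150.0
```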
External decision process call API rationale
The main goal is to let users keep the whole simulation in a single process if they want to. This has several benefits:
- Many tools (debuggers, performance analysis...) only work (or work way better) on single-process programs.
- Running a set of simulation instances will be easier (previously you had to make sure your instances were independent, as network resources could be shared by mistake).
- Performance: As the protocol is completely synchronous, having Batsim and the decision component in different processes is pure overhead (context switches depending on process placement, inter-process communication...).
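Because the exchange is a synchronous request/reply, it can be expressed as a plain function call when both sides live in the same process. The sketch below illustrates this under stated assumptions: `execute_all_decide`, `run_step` and the dictionary-based event format are hypothetical and do not reflect the planned C API.

```python
def execute_all_decide(now, events):
    """Hypothetical decision function: execute every submitted job at once."""
    return [
        {"type": "execute_job", "job_id": event["job_id"]}
        for event in events
        if event["type"] == "job_submitted"
    ]


def run_step(decide, now, events):
    """Perform one request/reply exchange as a direct in-process call.

    With an RPC transport, this call would instead serialize `events`, send
    them over a socket, and block until the reply arrives.
    """
    return decide(now, events)


decisions = run_step(
    execute_all_decide, 0.0, [{"type": "job_submitted", "job_id": "job1"}]
)
print(decisions)  # [{'type': 'execute_job', 'job_id': 'job1'}]
```

A debugger or profiler attached to this single process sees both the simulator side and the decision side of the exchange.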
Custom simulation core rationale
The goal is to improve performance when you want to focus on scheduling only, not on infrastructure simulation. SimGrid simulates all hosts even when delay profiles are used; we should be able to achieve greater performance with a custom (stupid) core that simulates jobs as black boxes without interference.
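A black-box core of this kind can be as small as an event queue ordered by completion time. The sketch below is only an illustration, assuming every job is described by a start time and a fixed delay; the `simulate` function and its tuple format are hypothetical, not Batsim's design.

```python
import heapq


def simulate(jobs):
    """Simulate jobs as black boxes: each job only has a start time and a delay.

    `jobs` is a list of (job_id, start_time, delay) tuples. Returns the list of
    (completion_time, job_id) pairs in completion order. No host, network, or
    interference model is involved.
    """
    queue = [(start + delay, job_id) for job_id, start, delay in jobs]
    heapq.heapify(queue)
    completions = []
    while queue:
        # The earliest completion is always at the top of the heap.
        completions.append(heapq.heappop(queue))
    return completions


print(simulate([("a", 0.0, 10.0), ("b", 2.0, 3.0), ("c", 1.0, 20.0)]))
# [(5.0, 'b'), (10.0, 'a'), (21.0, 'c')]
```

Since a job's completion time depends only on its own start time and delay, no per-host state needs to be updated while the simulation advances.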
Job/profile rework rationale
Many fields around jobs and profiles are not well defined pre-5.0.0. We notably plan to fix:
- Does this job request 4 cores or 4 hosts? Should Batsim check if the allocated resources exactly match what the scheduler returned?
- What fields are read by Batsim? Where to put my custom fields?
- Complex profile composition: parallel (fork join) or merge of several parallel tasks.