batsim issues
https://gitlab.inria.fr/batsim/batsim/-/issues
Updated: 2022-06-27T17:45:18+02:00

Should we keep using ZMQ?
https://gitlab.inria.fr/batsim/batsim/-/issues/141
Updated: 2022-06-27T17:45:18+02:00
Author: Millian Poquet

I feel that [ZMQ](https://zeromq.org/) has more drawbacks than benefits for Batsim and that we should consider not using it anymore.
What do you think?
Pros of using ZMQ
-----------------
- Handles language-agnostic network messages.
This is easy to implement ourselves by prefixing messages with a size encoded in a well-defined endianness.
- Handles lower level communication protocols. This enables the transparent use of TCP or unix sockets.
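To give an idea of how little code the size-prefix approach from the first point needs, here is a minimal sketch. It assumes a 4-byte unsigned size in network byte order (big-endian); the prefix width and the function names are illustrative choices, not an existing Batsim or pybatsim API.

```python
import io
import struct

# Assumption: 4-byte unsigned size, network byte order (big-endian).
HEADER = struct.Struct("!I")


def pack_message(payload: bytes) -> bytes:
    """Return the payload prefixed with its size."""
    return HEADER.pack(len(payload)) + payload


def read_exact(stream, size: int) -> bytes:
    """Read exactly `size` bytes, or raise if the stream ends early."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError("peer closed the connection mid-message")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def read_message(stream) -> bytes:
    """Read one size-prefixed message from a binary stream."""
    (size,) = HEADER.unpack(read_exact(stream, HEADER.size))
    return read_exact(stream, size)


# Round trip through an in-memory stream standing in for a socket.
stream = io.BytesIO(pack_message(b'{"now": 0.0}') + pack_message(b""))
first = read_message(stream)
second = read_message(stream)
```

With a real connection, the same functions should work on the file object returned by `socket.makefile("rb")`. A lost peer then shows up as an `EOFError`, which matches the "stop the simulation when the other side is gone" behavior we want.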
Cons of using ZMQ
-----------------
- Adds a dependency to Batsim and to all projects that communicate with Batsim.
This dependency is a binding for non-C/C++ projects, which is quite annoying as language-specific packages do not work properly with bindings (e.g., `pybatsim` is bundled with `pyzmq` on PyPI but without `libzmq.so`, which may result in `dlopen`-like errors for end users)
- ZMQ tries hard to handle connection loss (either from servers or clients in REQ/REP). We do not want this at all for Batsim: we'd rather stop the simulation when either Batsim or the scheduler is lost. AFAIK disabling this feature is not possible, and this has several impacts.
  - We need to make sure connections do not already exist before starting a simulation, to avoid hindering simulations that are already running. This is currently done by [robin](https://framagit.org/batsim/batexpe) but this lacks robustness, especially when several simulations are to be launched on the same machine.
  - We need some black magic to detect when a connection is lost. If one of the two processes fails, robin currently handles the situation by killing the other process. In case of an infinite loop, neither Batsim nor most scheduler implementations do anything, but pybatsim has a timeout mechanism that can be detrimental to the simulation (if for whatever reason SimGrid takes a very long time to simulate a simulation step, the simulation will stop).
- ZMQ uses a thread, which makes performance analysis of the whole simulation more complex. This is not so important for optimized code as we plan to use scheduler libraries for it (thus without ZMQ), but pybatsim will remain annoying to analyze because of ZMQ.

Git: add mailmap
https://gitlab.inria.fr/batsim/batsim/-/issues/138
Updated: 2022-01-26T18:43:36+01:00
Author: Millian Poquet

Here is the current result of `git shortlog --summary --numbered` as of eb69df3.
```
1537 Millian Poquet
201 MERCIER Michael
115 clement-dell
96 Michael Mercier
85 David Glesser
70 Pierre-François
43 henricasanova
38 Olivier Richard
31 Steffen Lackner
22 Mommessc
16 Henri Casanova
10 ramdsc
5 Faure Adrien
5 Mercier Michael
3 Pierre-Francois Dutot
3 mpoquet
2 Adrien FAURE
2 Henri C
2 MOMMESSIN Clement
2 RICHARD Olivier
1 Anderson Andrei
1 Mael Madon
1 Maël Madon
1 adfaure
```
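Several entries in this list are spelling variants of the same person. As a rough aid for spotting them, the sketch below groups names while ignoring case, accents, hyphens and word order. The helper names are made up for illustration, and the heuristic is an assumption of mine: it will not match usernames such as `henricasanova` or `mpoquet`, which still need to be mapped by hand.

```python
import unicodedata
from collections import defaultdict


def name_key(name: str) -> tuple:
    """Key that ignores case, accents, hyphens and word order."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(
        c for c in decomposed if not unicodedata.combining(c)
    )
    words = without_accents.lower().replace("-", " ").split()
    return tuple(sorted(words))


def group_variants(names):
    """Return the lists of names that share the same key."""
    groups = defaultdict(list)
    for name in names:
        groups[name_key(name)].append(name)
    return [variants for variants in groups.values() if len(variants) > 1]


names = [
    "Millian Poquet", "MERCIER Michael", "Michael Mercier", "Olivier Richard",
    "Faure Adrien", "Mercier Michael", "Adrien FAURE", "RICHARD Olivier",
    "Mael Madon", "Maël Madon", "henricasanova", "Henri Casanova",
]
candidates = group_variants(names)
```

Each resulting group is only a candidate: the commit emails still have to be checked before merging identities.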
Writing a [gitmailmap](https://git-scm.com/docs/gitmailmap) would help keep names consistent and avoid redundancy.
Milestone: 5.0.0

Help scheduler error diagnostic on SimGrid deadlock
https://gitlab.inria.fr/batsim/batsim/-/issues/109
Updated: 2022-01-20T15:41:22+01:00
Author: Millian Poquet

It is common to face SimGrid deadlocks in Batsim when one develops a scheduler.
As SimGrid now enables the reaction to deadlock events thanks to the `on_deadlock` callback,
we should print a more user-friendly error to users when it happens.
- Common mistakes
- Current scheduler state
- ...
Milestone: 5.0.0

batprotocol: enable EDC to break sharing constraints
https://gitlab.inria.fr/batsim/batsim/-/issues/135
Updated: 2022-01-20T09:06:26+01:00
Author: Millian Poquet

This was previously done in Batsim's CLI. This should now be done for each EDC.
Milestone: 5.0.0
Assignee: Millian Poquet

Add scheduler version in log (protocol handshake)
https://gitlab.inria.fr/batsim/batsim/-/issues/39
Updated: 2022-01-20T09:03:16+01:00
Author: Millian Poquet

The Batsim version is currently written in the log and in the ``_schedule.csv`` output file.
It would be interesting to ask the scheduler for its version (i.e., as an ACK to ``SIMULATION_BEGINS``) and to log it.

Rework data staging job profiles
https://gitlab.inria.fr/batsim/batsim/-/issues/103
Updated: 2022-01-20T08:59:19+01:00
Author: MOMMESSIN Clement

There are multiple problems with this profile:
- At the moment the alloc (`res` in the `EXECUTE_JOB` protocol event) is allowed to contain a number of resources different from 2.
- The communication matrix seems to be inverted when the `from` resource id is greater than the `to` resource id (i.e., when you want a communication between resources 14 -> 12), but the simulated communication goes in the correct direction.
- Some other ugly stuff that @mmercier can explain better?

Improve execution tracability
https://gitlab.inria.fr/batsim/batsim/-/issues/137
Updated: 2022-01-20T06:34:16+01:00
Author: Millian Poquet

- [ ] implement `--batsim-git-commit` CLI option
- [ ] implement `--simgrid-git-commit` CLI option
- [ ] similarly to what is suggested in #104, traceability information should be written by Batsim in the logs and in a parsable output file.
- [ ] define what information should exactly be put into the file
- [ ] define what file format should be used
- [ ] define file name
- [ ] implement this in Batsim
This should close #39.
Milestone: 5.0.0
Assignee: Millian Poquet

Add parallel job composition
https://gitlab.inria.fr/batsim/batsim/-/issues/64
Updated: 2022-01-20T06:26:32+01:00
Author: MERCIER Michael

Only the sequence composition (a list of tasks that are executed one after the other) is implemented, but we lack the possibility to compose tasks in parallel.
Making the composed profile support this would be great.
Milestone: 5.0.0
Assignee: Millian Poquet

Add a new batsim output file that contains the simulation begins data
https://gitlab.inria.fr/batsim/batsim/-/issues/104
Updated: 2022-01-20T06:26:13+01:00
Author: MERCIER Michael

Some of the information that is available in the simulation begin message is not exported by Batsim but is needed for post analysis (e.g., workload hash/filename mapping, configuration, resource list and properties, etc.).
I propose that Batsim dumps the content of this message directly in a json file, so it can be used afterward without parsing simulation logs (this is what I'm doing right now...).
I propose `<prefix>_metadata.json`, so we can extend this with any kind of information in the future.

Protocol doc: Automatize JSON examples
https://gitlab.inria.fr/batsim/batsim/-/issues/110
Updated: 2022-01-20T06:04:12+01:00
Author: Millian Poquet

Currently (669c383), examples of protocol events are hardcoded in the RST documentation file.
This increases the likelihood of documentation/implementation mismatch.
It would be better to do the same as in the tutorials:
- Generate files from a reproducible simulation.
Here, at least one JSON file per event type.
- Include the generated files instead of hardcoding examples in the RST file.
- Check in CI that there is no mismatch:
- Run the simulations and get a new copy of the example files.
- Make sure the result files match the ones included in the doc.
Milestone: 5.0.0
Assignee: Millian Poquet

Composed tasks may lead to stack overflow
https://gitlab.inria.fr/batsim/batsim/-/issues/127
Updated: 2022-01-20T06:03:58+01:00
Author: Millian Poquet

Currently (6eae4b3), composed profiles are executed by a recursive function, which may easily lead to stack overflows if a large number of subtasks is requested. This can easily happen with sequential tasks if one wants to repeat the sequence many times.
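To illustrate the difference, here is a minimal Python sketch of a traversal that keeps its pending work in an explicit stack rather than on the call stack. The profile layout (`type`, `repeat`, `seq`) is only loosely modeled on composed profiles, and `run_profile` and `run_leaf` are hypothetical names. This is not Batsim's C++ implementation.

```python
import itertools


def run_profile(profiles, root, run_leaf):
    """Run `root` and all nested subprofiles without recursion."""
    # Each stack entry is an iterator over the profile names still to run
    # at one nesting level.
    stack = [iter([root])]
    while stack:
        name = next(stack[-1], None)
        if name is None:
            stack.pop()  # this level is finished
            continue
        profile = profiles[name]
        if profile["type"] == "composed":
            # Lazily repeat the sequence: nothing is materialized up front.
            repeated = itertools.chain.from_iterable(
                itertools.repeat(profile["seq"], profile.get("repeat", 1))
            )
            stack.append(repeated)
        else:
            run_leaf(name)


profiles = {
    "delay": {"type": "delay"},
    "compute": {"type": "compute"},
    "inner": {"type": "composed", "repeat": 2, "seq": ["delay", "compute"]},
    "outer": {"type": "composed", "repeat": 100000, "seq": ["inner", "delay"]},
}
executed = []
run_profile(profiles, "outer", executed.append)
```

The explicit stack grows with the nesting depth of profiles, not with the number of repetitions, and it lives on the heap, so large `repeat` values are harmless.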
Solution: Do not use the call stack to implement this behavior; use a non-recursive function with a stack data structure instead.
Milestone: 5.0.0
Assignee: Millian Poquet

Protocol: Use flatbuffers instead of JSON?
https://gitlab.inria.fr/batsim/batsim/-/issues/120
Updated: 2022-01-20T06:03:37+01:00
Author: Millian Poquet

For performance reasons, I would like to change Batsim's architecture:
- Do not force the use of a network socket (enable the use of schedulers as libraries instead)
- Enable the use of a custom stupid and fast discrete event simulator (instead of SimGrid)
These two changes require some modularity within Batsim's code itself, notably to separate the protocol de/serialization from the injection of events in the simulation.
Currently, Batsim deserializes protocol messages manually (via calls to [rapidjson](https://rapidjson.org/) functions) and directly injects events in the simulation. There are no C++ structures corresponding to the protocol messages yet, and writing them would be quite painful.
**Proposition**: Use a serialization library that can generate tedious code for us. In particular, [FlatBuffers](https://google.github.io/flatbuffers/index.html) seems to nicely fit our needs as it focuses on performance and usability.
FlatBuffers in a nutshell
-------------------------
Open source (Apache license), developed and maintained by Google, already packaged on most distros (including NixOS).
Concept very similar to [protobuf](https://developers.google.com/protocol-buffers):
1. Describe data structures in a domain-specific language.
2. Generate source code (in the programming language of your choice) corresponding to the desired data structures, as well as serialization/deserialization functions.
3. Use generated functions in your code.
Pros
----
- I think that protocol maintainability would be improved.
- Less boilerplate in Batsim and in schedulers, notably in Batsim since JSON is not pleasant to use in C++.
- All the data structures involved in the protocol would be defined in a single versioned file. Currently, this is split between Batsim's code (real code and C++ comments) and the Sphinx documentation.
- Propagating protocol updates to schedulers would become very simple, as most updates would consist of copying the new description file (from Batsim) to the various scheduler implementations.
- Sphinx doc could focus on pedagogical aspects, rather than being forced to describe data structures (comments are of course possible in the description language).
- Not forced to use the binary format, JSON can still be used. This would allow a `--json` CLI flag so that Batsim generates JSON protocol messages instead of binary ones. This way, writing schedulers in funny languages (that have no FlatBuffers support yet) would remain possible.
- De/serialization performance would be greatly improved, which is consistent with current focus.
Cons
----
- Big protocol break. Even if JSON will remain possible, format will most probably break.
- Some work needed to update the scheduler libraries.
FlatBuffers is available for the languages of all our scheduler libraries (C++, Python and Rust with official support, D via an unofficial package), so the required amount of work would not be huge.
Concern 1: Is JSON usable?
--------------------------
**Yes**. [Here](https://stackoverflow.com/questions/48215929/can-i-serialize-dserialize-flatbuffers-to-from-json) are code examples to de/serialize into JSON with FlatBuffers. This seems simple: we just need some compilation magic to put the description file into Batsim's memory (e.g., as a C string) so the `--json` variant is convenient to use.
Concern 2: Won't this make schedulers annoying to compile?
----------------------------------------------------------
**Not much**. The FlatBuffers compiler deterministically generates source code in the target language. Let's assume the target language is Python. The generated Python files can be versioned in pybatsim, so that the project can still be built as pure Python, transparently for users. In particular, it will not annoy users of language-specific package managers: `pip install pybatsim` will not require installing the FlatBuffers compiler.
Access to the FlatBuffers compiler will still be required to update de/serialization functions, but this is not a problem as we use Nix for our development environments.
Milestone: 5.0.0
Assignee: Millian Poquet

no-shed option is not working as expected
https://gitlab.inria.fr/batsim/batsim/-/issues/107
Updated: 2022-01-20T05:47:15+01:00
Author: MERCIER Michael

The `--no-sched` CLI option documentation reads:
```
If set, the jobs in the workloads are
computed one by one, one after the other,
without scheduler nor Redis.
```
But currently, all the jobs launch at time 0 and share the resources.
I've made some changes to make the jobs start at their submission time and not before, but we still have resource sharing and all the jobs are placed on the first hosts and not dispatched on the resources or queued.
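For reference, the documented behavior (jobs run one after the other, none before its submission time) amounts to something like the sketch below. The job tuple layout and the function name are made up for illustration and do not reflect Batsim's internals. Durations are assumed to be known in advance, which is a simplification: in Batsim they come out of the simulation.

```python
def run_one_by_one(jobs):
    """Compute start and finish times for strictly sequential execution.

    `jobs` is a list of (job_id, submit_time, duration) tuples.
    Returns a dict mapping job_id to (start, finish).
    """
    schedule = {}
    now = 0.0
    # Jobs are served in submission order; sorted() is stable, so jobs
    # submitted at the same time keep their workload order.
    for job_id, submit_time, duration in sorted(jobs, key=lambda j: j[1]):
        start = max(now, submit_time)  # not before submission, no overlap
        finish = start + duration
        schedule[job_id] = (start, finish)
        now = finish
    return schedule


schedule = run_one_by_one([("j1", 0, 10), ("j2", 0, 5), ("j3", 30, 2)])
```

Here `j2` waits for `j1` to finish even though both are submitted at time 0, and the machine stays idle until `j3` is submitted.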
The question is: what should we do? The `no-sched-fix` branch contains my patch and this is the behavior I wanted, but maybe we should consider having multiple very simple policies as an argument to the `no-sched` option...

clean batexec implementation
https://gitlab.inria.fr/batsim/batsim/-/issues/133
Updated: 2022-01-20T05:47:15+01:00
Author: Millian Poquet

Once we are sure that EDC libraries work well, we can implement a basic EDC library for simple general-purpose tasks (typically, run all jobs one after the other, or apply a static predefined schedule).
- Package this EDC library with Batsim
- CLI option (named `--no-external-sched`, with hidden aliases `--no-sched` and `--batexec`) that calls Batsim with this EDC library, for simplicity's sake for end users
This should enable us to remove the batexec-specific code in Batsim, which has been left unmaintained and buggy for years.
If done, this should close #70 and #107.
Milestone: 5.0.0
Assignee: Millian Poquet

update --dump-execution-context syntax for multiple EDC and libraries
https://gitlab.inria.fr/batsim/batsim/-/issues/136
Updated: 2022-01-20T05:05:15+01:00
Author: Millian Poquet

Batsim CLI has changed for 5.0.0.
As of current batprotocol (commit eb69df3), the execution context that can be dumped has not been updated accordingly, which prevents some external tools such as robin from using Batsim correctly.
Tasks.
- [ ] define the JSON format that should be dumped now
- [ ] implement it in Batsim
- [ ] update robin to use the new JSON format
Milestone: 5.0.0
Assignee: Millian Poquet

Rework code around basic FS manipulation (e.g., absolute_filename)
https://gitlab.inria.fr/batsim/batsim/-/issues/134
Updated: 2022-01-19T20:51:01+01:00
Author: Millian Poquet

Batsim already uses C++17's `<filesystem>`, so functions like `file_exists` and `absolute_filename` in `cli.cpp` should disappear.
Milestone: 5.0.0
Assignee: Millian Poquet

Batexec do not manage properly jobs that are exeeding the number of available resources
https://gitlab.inria.fr/batsim/batsim/-/issues/70
Updated: 2022-01-19T20:16:28+01:00
Author: MERCIER Michael

When a job in the workload exceeds the number of available resources, batexec tries to allocate non-existent machines here:
https://gitlab.inria.fr/batsim/batsim/blob/master/src/job_submitter.cpp#L407
Leading to an inconsistent message:
```Cannot get machine 4: it does not exist```
Adding a proper assert to check that the job fits in the total number of available machines would be great.

batprotocol: enable EDC to select its allocation validation strategy
https://gitlab.inria.fr/batsim/batsim/-/issues/132
Updated: 2022-01-19T20:06:45+01:00
Author: Millian Poquet
Milestone: 5.0.0
Assignee: Millian Poquet

Migrate CI from Framasoft to Inria
https://gitlab.inria.fr/batsim/batsim/-/issues/131
Updated: 2021-12-14T17:42:17+01:00
Author: FAURE Adrien (adrien.faure@inria.fr)

As I understand it, the CI is currently on framagit for historical reasons that are no longer relevant.
We should migrate it to the Inria infrastructure ASAP.

Enable dynamic host/link cumulative usage probing
https://gitlab.inria.fr/batsim/batsim/-/issues/114
Updated: 2021-08-31T19:38:53+02:00
Author: Millian Poquet

Goal
====
Let the scheduler measure host or link usage so it can adapt its decisions depending on saturation, etc.
Implementation plan
===================
Create SimGrid plugins to monitor hosts/links usage
---------------------------------------------------
- Accumulate usage on state change (examples in energy plugins)
- Define a reset() function in the plugin API (to reset counters to 0)
- (perf issue: dynamically enable the plugin for the specified resources rather than for all of them all the time)
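The accumulation logic from the first two points can be sketched as follows. This is a Python illustration of the idea only, not SimGrid plugin code: the class, its method names, and the assumption that usage is a piecewise-constant rate (e.g., bytes per second on a link) are all mine.

```python
class CumulativeUsage:
    """Integrates a piecewise-constant usage rate over simulated time."""

    def __init__(self, now=0.0):
        self._last_change = now
        self._rate = 0.0   # current usage rate, e.g., bytes per second
        self._total = 0.0  # usage accumulated since the last reset

    def on_state_change(self, now, new_rate):
        """Account for the time spent at the old rate, then switch rate."""
        self._total += self._rate * (now - self._last_change)
        self._last_change = now
        self._rate = new_rate

    def reset(self):
        """Reset the counter to 0."""
        self._total = 0.0

    def probe(self, now, reset=False):
        """Return the cumulative usage up to `now`, optionally resetting."""
        self.on_state_change(now, self._rate)  # flush time since last change
        value = self._total
        if reset:
            self.reset()
        return value


link = CumulativeUsage()
link.on_state_change(2.0, 100.0)  # a transfer starts at t=2
link.on_state_change(5.0, 0.0)    # it ends at t=5
consumed = link.probe(10.0, reset=True)
```

Updating the counter only on state changes keeps the cost proportional to the number of rate changes rather than to simulated time, which is why the third point (enabling it only for the requested resources) matters.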
Expose it in the batprotocol
----------------------------
Something like:
```json
{
  "timestamp": 10.0,
  "type": "QUERY",
  "data": {
    "requests": {"consumed_bytes": {
      "resources": "link42",
      "reset_after_probe": true
    }}
  }
}
```
```json
{
  "timestamp": 10.1,
  "type": "ANSWER",
  "data": {
    "requests": {
      "consumed_bytes": {
        "link42": 4096
      }
    }
  }
}
```
Convenient and powerful probes