refactor some message exchanges between researcher and node(s)
Short description:
- add a Status of the computation thread on node side
- add the TrainingReceived message management (node and researcher side)
- add the TrainingStart message management (node and researcher side)
- add the StatusRequest/StatusReply messages management (node and researcher side)
- add a MessageId to all messages (and a ReferedMessageId to all replies to a given MessageId's message)
===========================================================================================
Long description
node modelling
global architecture of the node component
At the moment, a node component is composed of two threads, communicating through a message queue:
- a communication thread, which receives all (MQTT) messages:
  - this thread deals directly with messages requesting an urgent and quick answer
  - this thread stores the computation requests in a queue
- the computation thread does all computation (described in the TrainingRequest messages stored in the queue)
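This two-thread layout can be sketched as follows (a minimal sketch: the real transport is MQTT, replaced here by a plain list of incoming messages, and all names are hypothetical):

```python
import queue
import threading

task_queue = queue.Queue()  # communication thread -> computation thread

def communication_thread(incoming):
    """Receive messages; answer urgent ones directly, queue computation requests."""
    for message in incoming:
        if message["command"] == "ping":      # urgent: answer immediately
            print("pong")
        elif message["command"] == "train":   # computation: defer to the queue
            task_queue.put(message)

def computation_thread():
    """Consume queued TrainingRequest messages one by one."""
    while True:
        request = task_queue.get()
        print(f"training on {request['model']}")
        task_queue.task_done()

worker = threading.Thread(target=computation_thread, daemon=True)
worker.start()
communication_thread([{"command": "ping"}, {"command": "train", "model": "m1"}])
task_queue.join()  # wait until all queued work is done
```

The `queue.Queue` both decouples the two threads and provides the thread-safe hand-off.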
node's computation thread description
The computation thread can be described as a (very simple) finite state machine following this algorithm:
- while a new TrainingRequest is present in the queue:
  - verify the validity of the request (e.g. data validation, model approval)
  - do the computation and send the results (TrainingReply)
  - remove the TrainingRequest from the queue
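A minimal sketch of this loop, with validation and training stubbed out (all helper names are assumptions); note that the request stays in the queue until the computation is done:

```python
from collections import deque

def is_valid(request):
    # hypothetical check standing in for data validation / model approval
    return request.get("approved", False)

def train(request):
    # hypothetical training step; returns a TrainingReply-like dict
    return {"command": "train_reply", "success": True}

def computation_loop(task_queue, send):
    """Process TrainingRequest messages; remove each one only after replying."""
    while task_queue:
        request = task_queue[0]       # peek: the request stays queued while running
        if is_valid(request):
            send(train(request))      # TrainingReply carrying the results
        else:
            send({"command": "train_reply", "success": False})
        task_queue.popleft()          # remove the TrainingRequest from the queue

replies = []
tasks = deque([{"approved": True}, {"approved": False}])
computation_loop(tasks, replies.append)
```

Keeping the request queued during the computation means a status query can still report it as pending work.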
computation thread status
We propose to add a proper Status for each state of the computation thread FSM:
- Ready: the thread is ready to work
- Training: the thread is computing the model
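As a sketch, this status could be a simple enum (class name and values are assumptions):

```python
from enum import Enum

class ComputationStatus(Enum):
    READY = "Ready"        # the thread is ready to work
    TRAINING = "Training"  # the thread is computing the model

# the computation thread would update this as it moves between states
status = ComputationStatus.READY
```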
node to researcher messages
We propose that the node sends more messages to the researcher, to give it more information about the node state:
- upon receiving a TrainingRequest, the node (communication thread) sends a TrainingReceived message to inform the researcher that the request has been taken into account
- when starting a new computation, the node (computation thread) sends a TrainingStart message
- at the end of the computation, the node (computation thread) sends a TrainingReply message (this is already the case)
In order to inform the researcher about node status, we propose a new exchange message:
- StatusRequest/StatusReply: the researcher asks for a node's status, and the node sends back information about the computation thread (e.g. current state, length of the queue, ...)
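One possible shape for this exchange, as a sketch (all field names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class StatusRequest:
    message_id: str          # MessageId of this request

@dataclass
class StatusReply:
    message_id: str          # MessageId of this reply
    refered_message_id: str  # MessageId of the StatusRequest it answers
    state: str               # current FSM state: "Ready" or "Training"
    queue_length: int        # number of pending TrainingRequest messages

request = StatusRequest(message_id="msg-001")
reply = StatusReply(
    message_id="msg-002",
    refered_message_id=request.message_id,
    state="Ready",
    queue_length=0,
)
```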
Remark:
- if the node does not agree to run the experiment (e.g. the model is not approved/signed), the TrainingReply message is used to inform the researcher (we do not add a new message for this)
- each "node to researcher" message should contain information about which message it refers to. We may need to add a new ID to all communication messages (this may already be partly implemented for some of them (e.g. Ping/Pong) but it should be extended to all messages)
- AddScalar messages may also be sent during the training, depending on the experiment parameters
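The ID mechanism could look like this minimal sketch (uuid-based ids and field names are assumptions):

```python
import uuid

def new_message_id():
    """Generate a fresh, unique MessageId."""
    return str(uuid.uuid4())

def make_reply(request, payload):
    """Attach a fresh MessageId and the ReferedMessageId of the request."""
    return {
        "message_id": new_message_id(),
        "refered_message_id": request["message_id"],
        **payload,
    }

ping = {"message_id": new_message_id(), "command": "ping"}
pong = make_reply(ping, {"command": "pong"})
```

With such ids, the researcher can match any reply (Pong, TrainingReceived, StatusReply, ...) to the request that triggered it.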
researcher to node messages (future work)
We may also consider adding more "control" messages from the researcher to the node, such as:
- stop current computation
- empty the task queue (aka: stop all future computation)
- reset the node (aka: stop current computation and empty the queues...)
- change the LogLevel of a node
global architecture of the researcher component
Just as a remark, the researcher component has exactly the same infrastructure:
- a communication thread, which deals with some messages (e.g. log messages) and delegates the computation messages to the main thread
- the main thread, which deals with the computation process (node selection, computation distribution, result collection, data federation)
The main thread can also be described as a simple FSM.
Remark:
- considering this, we may in the future propose to run the researcher on a dedicated server and control it remotely with control messages