refactor some message exchanges between researcher and node(s)
Short description:
- add a Status of the computation thread on node side
- add the TrainingReceived message management (node and researcher side)
- add the TrainingStart message management (node and researcher side)
- add the StatusRequest/StatusReply messages management (node and researcher side)
- add a MessageId to all messages (and a ReferedMessageId to all replies to a given MessageId's message)
===========================================================================================
Long description
node modelling
global architecture of the node component
At the moment, a node component is composed of two threads, communicating through a message queue:
- a communication thread, which receives all (MQTT) messages:
  - this thread deals directly with messages requesting an urgent and quick answer
  - this thread stores the computation requests in a queue
- the computation thread does all computation (described in the TrainingRequest messages stored in the queue)
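This two-thread layout can be sketched as follows (a minimal sketch: the real transport is MQTT, replaced here by a plain list of incoming messages, and all names are hypothetical):

```python
import queue
import threading

task_queue = queue.Queue()  # communication thread -> computation thread

def communication_thread(incoming):
    """Receive messages; answer urgent ones directly, queue computation requests."""
    for message in incoming:
        if message["command"] == "ping":      # urgent: answer immediately
            print("pong")
        elif message["command"] == "train":   # computation: defer to the queue
            task_queue.put(message)

def computation_thread():
    """Consume queued TrainingRequest messages one by one."""
    while True:
        request = task_queue.get()
        print(f"training on {request['model']}")
        task_queue.task_done()

worker = threading.Thread(target=computation_thread, daemon=True)
worker.start()
communication_thread([{"command": "ping"}, {"command": "train", "model": "m1"}])
task_queue.join()  # wait until all queued work is done
```

The `queue.Queue` both decouples the two threads and provides the thread-safe hand-off.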
node's computation thread description
The computation thread can be described as a (very simple) finite state machine following this algorithm:
- while a new TrainingRequest is present in the queue:
  - verify the validity of the request (e.g. data validation, model approval)
  - do the computation and send the results (TrainingReply)
  - remove the TrainingRequest from the queue
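A minimal sketch of this loop, with validation and training stubbed out (all helper names are assumptions); note that the request stays in the queue until the computation is done:

```python
from collections import deque

def is_valid(request):
    # hypothetical check standing in for data validation / model approval
    return request.get("approved", False)

def train(request):
    # hypothetical training step; returns a TrainingReply-like dict
    return {"command": "train_reply", "success": True}

def computation_loop(task_queue, send):
    """Process TrainingRequest messages; remove each one only after replying."""
    while task_queue:
        request = task_queue[0]       # peek: the request stays queued while running
        if is_valid(request):
            send(train(request))      # TrainingReply carrying the results
        else:
            send({"command": "train_reply", "success": False})
        task_queue.popleft()          # remove the TrainingRequest from the queue

replies = []
tasks = deque([{"approved": True}, {"approved": False}])
computation_loop(tasks, replies.append)
```

Keeping the request queued during the computation means a status query can still report it as pending work.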
computation thread status
We propose to add a proper Status for each state of the computation thread FSM:
- Ready: the thread is ready to work
- Training: the thread is computing the model
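As a sketch, this status could be a simple enum (class name and values are assumptions):

```python
from enum import Enum

class ComputationStatus(Enum):
    READY = "Ready"        # the thread is ready to work
    TRAINING = "Training"  # the thread is computing the model

# the computation thread would update this as it moves between states
status = ComputationStatus.READY
```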
node to researcher messages
We propose that the node sends more messages to the researcher, to give it more information about the node state:
- upon receiving a TrainingRequest, the node (communication thread) sends a TrainingReceived message to inform the researcher that the request has been taken into account
- when starting a new computation, the node (computation thread) sends a TrainingStart message
- at the end of the computation, the node (computation thread) sends a TrainingReply message (this is already the case)
In order to inform the researcher about node status, we propose a new exchange message:
- StatusRequest/StatusReply: the researcher asks for a node's status, and the node sends back information about the computation thread (e.g. current state, length of the queue, ...)
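One possible shape for this exchange, as a sketch (all field names are assumptions):

```python
from dataclasses import dataclass

@dataclass
class StatusRequest:
    message_id: str          # MessageId of this request

@dataclass
class StatusReply:
    message_id: str          # MessageId of this reply
    refered_message_id: str  # MessageId of the StatusRequest it answers
    state: str               # current FSM state: "Ready" or "Training"
    queue_length: int        # number of pending TrainingRequest messages

request = StatusRequest(message_id="msg-001")
reply = StatusReply(
    message_id="msg-002",
    refered_message_id=request.message_id,
    state="Ready",
    queue_length=0,
)
```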
Remark:
- if the node does not agree to run the experiment (e.g. the model is not approved/signed), the TrainingReply message is used to inform the researcher (we do not add a new message for this)
- each "node to researcher" message should contain information about which message it refers to. We may need to add a new ID to all communication messages (this may already be partly implemented for some of them (e.g. Ping/Pong) but it should be extended to all messages)
- AddScalar messages may also be sent during the training, depending on the experiment parameters
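The ID mechanism could look like this minimal sketch (uuid-based ids and field names are assumptions):

```python
import uuid

def new_message_id():
    """Generate a fresh, unique MessageId."""
    return str(uuid.uuid4())

def make_reply(request, payload):
    """Attach a fresh MessageId and the ReferedMessageId of the request."""
    return {
        "message_id": new_message_id(),
        "refered_message_id": request["message_id"],
        **payload,
    }

ping = {"message_id": new_message_id(), "command": "ping"}
pong = make_reply(ping, {"command": "pong"})
```

With such ids, the researcher can match any reply (Pong, TrainingReceived, StatusReply, ...) to the request that triggered it.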
researcher to node messages (future work)
We may also consider adding more "control" messages from the researcher to the node, such as:
- stop current computation
- empty the task queue (aka: stop all future computation)
- reset the node (aka: stop current computation and empty the queues...)
- change the LogLevel of a node
global architecture of the researcher component
Just as a remark, the researcher component has exactly the same infrastructure:
- a communication thread, which deals with some messages (e.g. log messages) and delegates the computation messages to the main thread
- the main thread, which deals with the computation process (node selection, computation distribution, result collection, data federation)
The main thread can also be described as a simple FSM.
Remark:
- considering this, we may in the future propose to run the researcher on a dedicated server and control it remotely with control messages