Job Failed #936904 - not killing runner
Job #936904 failed for 63283d1b:
The following behaviour is observed (and is hard to reproduce):
- 5 runners are up and running.
- The testcase crashes runners 3 and 4.
- The launcher recognizes runners 3 and 4 as crashed and starts 2 new runners with the ids 5 and 6, which connect successfully to the server and participate in the DA.
- The server sees that runner 4 times out.
- BUT: the server is still getting messages from a runner with id 3.
What is happening?
- Is there a zombie runner with id 3 left over from an earlier testcase?
- Was the runner kill unsuccessful? (Maybe it only killed the mpiexec job but not its children!)
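To illustrate the second hypothesis: a signal sent only to the parent process does not necessarily reach its children. The following is a minimal Python sketch of one common way around this, namely starting the job in its own process group and signalling the whole group. It is not the launcher's actual code and not the fix described in this issue; `start_in_own_group` and `kill_group` are hypothetical helper names, and a POSIX system is assumed.

```python
import os
import signal
import subprocess


def start_in_own_group(cmd):
    """Start cmd as the leader of a new session and process group.

    Children that cmd spawns inherit the group unless they change it
    themselves, so the group id equals the pid of the returned process.
    """
    return subprocess.Popen(cmd, start_new_session=True)


def kill_group(proc, grace=5.0):
    """Signal the whole process group of proc, not just proc itself.

    Assumption: proc was created by start_in_own_group, so its pid is
    also its process group id.
    """
    pgid = proc.pid
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(pgid, sig)
        except (ProcessLookupError, PermissionError):
            # No signalable process is left in the group.
            break
        try:
            # Reap the group leader; remaining members get the next signal.
            proc.wait(timeout=grace)
        except subprocess.TimeoutExpired:
            continue
    return proc.wait()
```

With this approach, a call such as `kill_group(start_in_own_group(["sh", "-c", "sleep 60 & wait"]))` signals both the shell and the background `sleep`. Whether it fits the launcher depends on how mpiexec itself spawns the runners, which is not known from this issue.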
The following 2 fixes are implemented:
- Before the testcase in question, a ps aux is now always executed for diagnostics.
- Furthermore, we make sure not to send the kill to a command that contains the string mpi, so that we won't kill only the mpiexec.
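The two fixes could be sketched roughly as follows in Python. This is an illustration under assumptions, not the code that was committed: the function names are hypothetical, the `ps aux` output is assumed to have the usual 11 columns with the command last, and the filter is one reading of the second fix (skip every process whose command line contains the string mpi).

```python
import subprocess


def ps_aux_snapshot():
    """Return the raw text output of `ps aux` for the diagnostics log."""
    result = subprocess.run(
        ["ps", "aux"], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_ps_aux(text):
    """Return a list of (pid, command) pairs parsed from `ps aux` output."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 10)  # the 11th field is the full command
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        rows.append((int(fields[1]), fields[10]))
    return rows


def kill_candidates(rows, skip="mpi"):
    """Keep only processes whose command does not contain the skip string."""
    return [(pid, cmd) for pid, cmd in rows if skip not in cmd]
```

Logging the snapshot before the testcase would also show whether a stale runner with id 3 is already present at that point.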
We will see whether it occurs again.