Review time.sleep()s in virtual-cluster/launch-virtual-cluster.py
The virtual cluster with Slurm was not starting up correctly. After debugging for a while, the following change fixed a non-working Slurm installation in the virtual cluster:
diff --git a/virtual-cluster/launch-virtual-cluster.py b/virtual-cluster/launch-virtual-cluster.py
index 04c32c4..bbcf46f 100644
--- a/virtual-cluster/launch-virtual-cluster.py
+++ b/virtual-cluster/launch-virtual-cluster.py
@@ -256,7 +256,7 @@ def setup_slurm_cluster(dist: Distribution, num_containers: int, user: str) -> N
start_services(dist, master, ["slurmdbd"])
# wait a second before launching slurmctld to ensure slurmdbd is listening;
# if slurmctld is started too early, it will exit with a non-zero status
- time.sleep(1)
+ time.sleep(10)
start_services(dist, master, ["slurmctld"])
# The setup below relies on the existance of a Slurm cluster called
So there was a race condition: with only a one-second wait, slurmctld could be started before slurmdbd was ready. We should review all uses of time.sleep in this script and try to get rid of them (e.g. by querying the status of slurmdbd until it is up before starting slurmctld).
Querying the services as shown here is probably a good way to check whether they are up:
lxc exec slurm-0 -- systemctl status slurmctld
lxc exec slurm-1 -- systemctl status slurmd
sinfo # Inside the virtual cluster
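A minimal sketch of how a fixed time.sleep could be replaced by polling. The helper names wait_until and service_is_active are hypothetical and not part of launch-virtual-cluster.py; the container name slurm-0 in the usage comment is taken from the commands above. It assumes that `systemctl is-active --quiet` returning 0 is a good enough readiness signal. That may not hold if slurmdbd is reported active before it is actually listening, in which case the predicate would need to check the port or run a Slurm command instead.

```python
import subprocess
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout: float = 30.0,
               interval: float = 0.5) -> bool:
    """Poll predicate() until it returns True or `timeout` seconds elapse.

    Returns True on success and False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def service_is_active(container: str, service: str) -> bool:
    """Return True if systemd reports `service` as active inside `container`.

    Hypothetical helper: shells out to lxc directly, which is an assumption
    about how the script reaches the containers.
    """
    result = subprocess.run(
        ["lxc", "exec", container, "--",
         "systemctl", "is-active", "--quiet", service],
        check=False,
    )
    return result.returncode == 0


# Possible use in setup_slurm_cluster (sketch only):
#
#   start_services(dist, master, ["slurmdbd"])
#   if not wait_until(lambda: service_is_active("slurm-0", "slurmdbd")):
#       raise RuntimeError("slurmdbd did not become active in time")
#   start_services(dist, master, ["slurmctld"])
```

Keeping the polling loop separate from the check means the same wait_until could be reused for slurmctld, slurmd, or an sinfo-based check, each with its own timeout.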
Edited by FRIEDEMANN Sebastian