Virtual Cluster with LXD containers
The idea is to set up several LXD containers that are managed by one batch scheduler instance. In comparison to oardocker, we would not be limited to a particular batch scheduler (unless it clashes with some LXD container restrictions), and we can easily debug these images. It should be easy to run these containers outside of GitLab's continuous integration.
- make OAR work
  - do not run SSHD with nice value -20, because this requires a privileged container
    - remove `-n "-20"` in `/etc/init.d/oar-node`, line 87: `start-daemon -p $PIDFILE -n "-20" /usr/bin/sshd... $SSHD_OPTS`
  - test the setup with `oarsub -l 'cpu=2' -- mpirun -n 2 ./main` (edit: add `-machinefile $OAR_FILE_NODES`)
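The init-script edit above can be scripted; a minimal sketch, assuming the `/etc/init.d/oar-node` layout quoted above (`strip_nice` is a hypothetical helper name):

```shell
# strip_nice FILE: delete the hard-coded nice value -20 from the
# start-daemon line, so sshd can start in an unprivileged container.
strip_nice() {
  # turns: start-daemon -p $PIDFILE -n "-20" /usr/bin/sshd ... $SSHD_OPTS
  # into:  start-daemon -p $PIDFILE /usr/bin/sshd ... $SSHD_OPTS
  sed -i 's/ -n "-20"//' "$1"
}

# Usage on a node image:
#   strip_nice /etc/init.d/oar-node
```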
- make Slurm work
  - test the setup with `srun -n 4 -- my-mpi-program`
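The same smoke test could also be submitted as a batch script; a sketch, with `my-mpi-program` standing in for any MPI binary:

```shell
# Generate a minimal sbatch script for the smoke test.
cat > smoke-test.sbatch <<'EOF'
#!/bin/sh
#SBATCH --job-name=lxd-smoke-test
#SBATCH --ntasks=4
srun -n 4 ./my-mpi-program
EOF
# Submit with: sbatch smoke-test.sbatch
```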
- move the home directories in the containers to the shared LXD storage
  - move the home directory into shared LXD storage (OAR image)
  - move the home directory into shared LXD storage (Slurm image)
  - Motivation: this mimics supercomputer setups and makes it easier to run tests
  - do not set up the user `ubuntu` when setting up the image (or remove it)
  - check that `/home` is empty
  - for each container:
    - start the container
    - mount the shared storage as `/home`
    - the first container (e.g., `slurm-master`) creates the user `ubuntu`
    - every container except the first creates the user `ubuntu` without a home directory
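The per-container steps above might be scripted roughly like this (a sketch: the storage source path `/var/lib/shared/home`, the helper name `setup_home`, and the container names are assumptions; setting `LXC=echo` gives a dry run):

```shell
LXC=${LXC:-lxc}   # set LXC=echo for a dry run

# setup_home MASTER [NODE...]: mount the shared storage as /home in every
# container; only the first (master) container actually creates /home/ubuntu.
setup_home() {
  master=$1
  for c in "$@"; do
    # attach the shared storage volume as /home
    $LXC config device add "$c" home disk source=/var/lib/shared/home path=/home
    if [ "$c" = "$master" ]; then
      $LXC exec "$c" -- useradd -m ubuntu   # creates /home/ubuntu
    else
      $LXC exec "$c" -- useradd -M ubuntu   # reuses the shared /home/ubuntu
    fi
  done
}

# Example: setup_home slurm-master slurm-node1 slurm-node2
```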
- make a script setting up a virtual cluster with `N` nodes
  - for OAR
  - for Slurm
  - `N` can be fixed initially; later, `N` is user-provided
  - check for running nodes; if there are running nodes, abort before doing anything
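A sketch of such a script (assumptions: an image alias `oar-node`, containers named `node1`..`nodeN`, a hypothetical helper name `launch_cluster`, and a dry-run override via `LXC=echo`):

```shell
LXC=${LXC:-lxc}   # set LXC=echo for a dry run

# launch_cluster N: start N node containers, but abort before doing
# anything if any container is already running.
launch_cluster() {
  n=$1
  if $LXC list -c ns -f csv | grep -q ',RUNNING$'; then
    echo 'error: running containers found, aborting' >&2
    return 1
  fi
  i=1
  while [ "$i" -le "$n" ]; do
    $LXC launch oar-node "node$i"
    i=$((i + 1))
  done
}

# Example: launch_cluster 4
```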
Test files:
Update December 15
- How much work is this to implement?
  - @cconrads: best guess one week for OAR, one week for Slurm with a batch script setting up a virtual cluster of `N` nodes, where `N` is user-provided
- How robust is this?
  - Slurm does not seem to care at all about being in a container
  - OAR tries to launch an SSH server with nice value -20, which is not possible in unprivileged LXD containers, and I want to keep these containers unprivileged
  - There may be problems with respect to job management; Slurm uses cgroups by default (robust, but this may clash with container privileges), but one can easily change this in the configuration
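The configuration change mentioned above is a small edit in `slurm.conf`; a sketch, using plugins from the standard Slurm plugin set:

```
# slurm.conf: avoid the cgroup plugins inside unprivileged containers
ProcTrackType=proctrack/linuxproc   # instead of proctrack/cgroup
TaskPlugin=task/none                # instead of task/cgroup
```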
- The computer running the continuous integration tests is running LXD version 3. Would this task benefit from more recent LXD releases?
  - @cconrads runs LXD 3.0 on the machine with the continuous integration tests, he runs LXD 4.0 on his laptop, and he has not observed differences yet