xtof authored
8512ab8f

PLM4All

Slurm configuration

Most scripts

For most scripts, you need to know how many GPUs your tasks need. Set --ntasks-per-node to the number of GPUs per node (one task per GPU), then call the Python file with srun.
Here is an example with 4 GPUs:

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --output=example.out
#SBATCH --error=example.out
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=4
#SBATCH --nodes=1
#SBATCH --hint=nomultithread
#SBATCH --time=00:50:00
#SBATCH --qos=qos_gpu-dev
#SBATCH --cpus-per-task=8
#SBATCH --account=example@a100
#SBATCH -C a100

module purge
module load cpuarch/amd
module load pytorch-gpu/py3/2.0.1

srun python example.py
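Inside example.py, each task started by srun can recover its rank from the environment that Slurm exports. A minimal sketch (the helper name is hypothetical; the SLURM_* variables are the standard ones srun sets per task):

```python
import os

def slurm_dist_env(environ=os.environ):
    """Map Slurm's per-task environment to the values a distributed
    framework such as torch.distributed expects.

    srun exports SLURM_PROCID (global rank), SLURM_NTASKS (world size)
    and SLURM_LOCALID (rank within the node) for every task it starts.
    """
    rank = int(environ.get("SLURM_PROCID", 0))
    world_size = int(environ.get("SLURM_NTASKS", 1))
    local_rank = int(environ.get("SLURM_LOCALID", 0))
    return rank, world_size, local_rank

# With --nodes=1 --ntasks-per-node=4, task 2 would see:
print(slurm_dist_env({"SLURM_PROCID": "2", "SLURM_NTASKS": "4", "SLURM_LOCALID": "2"}))
```

Each process would then typically pass local_rank to torch.cuda.set_device and rank/world_size to torch.distributed.init_process_group.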

Here is another example with 16 GPUs (2 nodes with 8 GPUs each):

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --output=example.out
#SBATCH --error=example.out
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=8
#SBATCH --nodes=2
#SBATCH --hint=nomultithread
#SBATCH --time=00:50:00
#SBATCH --qos=qos_gpu-dev
#SBATCH --cpus-per-task=8
#SBATCH --account=example@a100
#SBATCH -C a100

module purge
module load cpuarch/amd
module load pytorch-gpu/py3/2.0.1

srun python example.py
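Once saved (for example as example.slurm, an assumed filename), the script is submitted and monitored with the usual Slurm commands:

```shell
# Submit the job script (example.slurm is an assumed filename)
sbatch example.slurm

# Check the state of your queued/running jobs
squeue -u $USER

# Follow the job output as it is written
tail -f example.out
```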

Inference with accelerate

For inference with accelerate, --ntasks-per-node must be set to 1, regardless of the number of GPUs you use. Example with 4 GPUs:

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --output=example.out
#SBATCH --error=example.out
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
#SBATCH --hint=nomultithread
#SBATCH --time=00:50:00
#SBATCH --qos=qos_gpu-dev
#SBATCH --cpus-per-task=8
#SBATCH --account=example@a100
#SBATCH -C a100

module purge
module load cpuarch/amd
module load pytorch-gpu/py3/2.0.1

srun python example.py

DDP with accelerate

For DDP with accelerate, --ntasks-per-node must be set to 1, regardless of the number of GPUs you use. You can also use idr_accelerate to launch your script: it replaces accelerate launch and automatically creates a config file from the Slurm parameters. This config file is saved in the .accelerate_config directory. Example with 16 GPUs:

#!/bin/bash
#SBATCH --job-name=example
#SBATCH --output=example.out
#SBATCH --error=example.out
#SBATCH --gres=gpu:8
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=2
#SBATCH --hint=nomultithread
#SBATCH --time=00:50:00
#SBATCH --qos=qos_gpu-dev
#SBATCH --cpus-per-task=8
#SBATCH --account=example@a100
#SBATCH -C a100

module purge
module load cpuarch/amd
module load pytorch-gpu/py3/2.0.1

srun idr_accelerate example.py
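As a sanity check on this layout (plain arithmetic, not part of idr_accelerate): with --nodes=2 and --gres=gpu:8, accelerate ends up with one worker process per GPU, so:

```python
# Values taken from the sbatch header above
nodes = 2
gpus_per_node = 8

# One worker process per GPU, so the world size is:
world_size = nodes * gpus_per_node
print(world_size)  # 16 processes in total

# With a per-process batch of 1, the effective global batch is:
per_process_batch = 1
global_batch = world_size * per_process_batch
```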

Mini Benchmark

Fine-tuning on the IMDB dataset

| Optimization | Model | Nb GPUs | Global Batch Size | Batch Size per GPU | Max GPU Memory Allocated | Estimated Epoch Time |
| --- | --- | --- | --- | --- | --- | --- |
| DDP | bloom-1b7 | 4 | 4 | 1 | 22.6 GB | 15min 45s |
| Accelerate DDP | bloom-1b7 | 4 | 4 | 1 | 22.6 GB | 15min 56s |
| DeepSpeed ZeRO-3 | bloom-1b7 | 4 | 4 | 1 | 12.3 GB | 34min 25s |
| FSDP | bloom-1b7 | 4 | 4 | 1 | 6.0 GB | 13min 24s |
| QLoRA | bloom-1b7 | 4 | 1 | 4 | 3.6 GB | 1h 33min 30s |
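A few derived figures from the benchmark (simple arithmetic on the numbers above; the rounding is mine):

```python
# Peak memory per GPU from the table above, in GB
mem = {"DDP": 22.6, "Accelerate DDP": 22.6, "DeepSpeed ZeRO-3": 12.3,
       "FSDP": 6.0, "QLoRA": 3.6}

# Memory reduction relative to plain DDP
reduction = {k: round(mem["DDP"] / v, 1) for k, v in mem.items()}
print(reduction)
# FSDP cuts peak memory by ~3.8x while also posting the fastest epoch time;
# ZeRO-3 saves ~1.8x but roughly doubles the epoch time versus DDP.
```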