Commit d35c971f authored by Baptiste Jonglez

[big] [puppet] [nvidia] Add nvidia-smi service to create /dev/nvidiaX when required

Both cuda and prometheus have services that must start only on machines
with a GPU.  They rely on /dev/nvidia0 being present on such machines to
detect the presence of a GPU.

However, /dev/nvidia0 is not created when the nvidia driver is loaded, but
only when `nvidia-smi` is first run.

The new service calls `nvidia-smi` once during boot to create /dev/nvidia0,
which allows the other services to start.  On machines without a GPU,
`nvidia-smi` will fail, so we ignore its exit code.
parent f46b9b88
[Unit]
Description=NVIDIA DCGM prometheus exporter service
After=network.target
# Ensure that /dev/nvidia0 is created by first calling nvidia-smi.
# If no GPU is found, nvidia-smi will not create /dev/nvidia0 and we will not run.
Wants=nvidia-smi.service
After=nvidia-smi.service
ConditionPathExists=/dev/nvidia0
[Service]
......
[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target
# Ensure that /dev/nvidia0 is created by first calling nvidia-smi.
# If no GPU is found, nvidia-smi will not create /dev/nvidia0 and we will not run.
Wants=nvidia-smi.service
After=nvidia-smi.service
ConditionPathExists=/dev/nvidia0
[Service]
......
[Unit]
Description=Call nvidia-smi once to create /dev/nvidiaX
[Service]
Type=oneshot
# Ignore the exit code: the command fails when no GPU is found
ExecStart=-/usr/bin/nvidia-smi
[Install]
WantedBy=multi-user.target
@@ -6,6 +6,8 @@ class env::big::configure_nvidia_gpu () {
  include 'env::big::configure_nvidia_gpu::modules'
  # Install nvidia drivers
  include 'env::big::configure_nvidia_gpu::drivers'
  # Install additional services (currently nvidia-smi, needed by cuda and prometheus)
  include 'env::big::configure_nvidia_gpu::services'
  # Install cuda
  include 'env::big::configure_nvidia_gpu::cuda'
  # Install nvidia ganglia plugins
......
class env::big::configure_nvidia_gpu::services () {
  # We only install the service but do not enable it.
  # Services that depend on it can add "Wants=nvidia-smi.service"
  # and "After=nvidia-smi.service", and this will automatically start
  # this service.
  file{
    '/etc/systemd/system/nvidia-smi.service':
      ensure => file,
      owner => root,
      group => root,
      mode => '0644',
      source => 'puppet:///modules/env/big/nvidia/nvidia-smi.service';
  }
}
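
For illustration, the dependency pattern described in the comment above could be applied to another GPU-only service roughly as follows. This is a minimal sketch, not part of the commit: the unit name `gpu-consumer.service` and the `ExecStart` path are hypothetical placeholders. Only `nvidia-smi.service` and `/dev/nvidia0` come from the change itself.

```ini
# Hypothetical unit file: /etc/systemd/system/gpu-consumer.service
[Unit]
Description=Example service that should only run on machines with a GPU
# Wants= pulls in nvidia-smi.service even though it is not enabled.
Wants=nvidia-smi.service
# After= delays this unit until the oneshot nvidia-smi run has finished.
After=nvidia-smi.service
# If /dev/nvidia0 is absent, the unit is skipped rather than marked failed.
ConditionPathExists=/dev/nvidia0

[Service]
# Placeholder command (assumption, not from the commit)
ExecStart=/usr/local/bin/gpu-consumer
```

Because `nvidia-smi.service` is `Type=oneshot`, ordering with `After=` should mean the condition is evaluated only once `nvidia-smi` has exited. `Wants=` is used rather than `Requires=`, so this unit's start is not tied to the outcome of `nvidia-smi.service`, which in any case ignores a non-zero exit code via the `-` prefix on `ExecStart`.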