The fix is using an init script to modprobe the NVIDIA driver and create the necessary /dev files. My script for doing that is below. In addition to loading the driver and creating the necessary device files, it also sets compute-exclusive mode on each card and loops nvidia-smi to keep a persistent connection for faster kernel launches.
#!/bin/bash
#
# /etc/init.d/cuda startup script for nvidia driver
# symlink from /etc/rc3.d/S80cuda in non xdm environments
#
# Creates devices, sets persistent and compute-exclusive mode
# Useful for compute nodes in runlevel 3 w/o X11 running
#
# chkconfig: 345 80 20
# Source function library
. /lib/lsb/init-functions
# Alias RHEL's success and failure functions
success() {
log_success_msg $@
}
failure() {
log_failure_msg $@
}
# Create /dev nodes
function createdevs() {
# Count the number of NVIDIA controllers
N=`/sbin/lspci -m | /bin/egrep -c '(3D|VGA).+controller.+nVidia'`
# Create Devices, exit on failure
while [ ${N} -gt 0 ]
do
let N-=1
/bin/mknod -m 666 /dev/nvidia${N} c 195 ${N} || exit $?
done
/bin/mknod -m 666 /dev/nvidiactl c 195 255 || exit $?
}
# Remove /dev nodes
function removedevs() {
/bin/rm -f /dev/nvidia*
}
# Set compute-exclusive
function setcomputemode() {
# Count the number of NVIDIA controllers
N=`/sbin/lspci -m | /bin/egrep -c '(3D|VGA).+controller.+nVidia'`
# Set Compute-exclustive mode, continue on failures
while [ $N -gt 0 ]
do
let N-=1
/usr/bin/nvidia-smi -c 1 -g ${N} > /dev/null
done
}
# Start daemon
function start() {
echo -n $"Loading nvidia kernel module: "
/sbin/modprobe nvidia && success || { failure ; exit 1 ;}
echo -n $"Creating CUDA /dev entries: "
createdevs && success || { failure ; exit 1 ;}
echo $"Setting CUDA compute-exclusive mode."
setcomputemode
echo $"Starting nvidia-smi for persistence."
/usr/bin/nvidia-smi -l -i 60 > /dev/null &
}
# Stop daemon
function stop() {
echo $"Killing nvidia-smi."
/usr/bin/killall nvidia-smi
echo -n $"Unloading nvidia kernel module: "
sleep 1
/sbin/rmmod -f nvidia && success || failure
echo -n $"Removing CUDA /dev entries: "
removedevs && success || failure
}
# See how we were called
case "$1" in
start)
start
;;
stop)
stop
;;
restart)
stop
start
;;
*)
echo $"Usage: $0 {start|stop|restart}"
exit 1
esac
exit 0
This script is fairly well-tested on RHEL/CentOS systems, but probably works on other distros with no or minor modifications.
No comments:
Post a Comment