Add ability to specify multiple GPU devices
In the training config file we have an option `cuda_device` that specifies which GPU to use. If it's set to `null`, training uses all available GPUs in parallel; it can also be set to an integer to pick a single GPU, e.g. `cuda_device: 1`.
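For reference, the two forms the option currently takes (shown out of context, with the rest of the config omitted):

```
// run on GPU 1 only
"cuda_device": 1

// run on all available GPUs in parallel
"cuda_device": null
```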
However, there is a bug: if you specify `cuda_device: 1`, model training doesn't work. When the trainer wraps the network in `DataParallel`, it passes `device_ids` for all available devices, including ones that we've just explicitly configured pytorch not to use, and crashes.
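Roughly, the failure looks like this (a minimal standalone sketch of the mismatch, not the actual trainer code):

```python
import torch
from torch import nn

torch.cuda.set_device(1)          # simplified stand-in for cuda_device: 1 in the config
model = nn.Linear(10, 10).cuda()  # parameters land on the new default device, cuda:1

# DataParallel is handed every visible device, so device_ids[0] is cuda:0,
# but the parameters live on cuda:1, and the forward pass raises a RuntimeError
parallel_model = nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))
parallel_model(torch.randn(4, 10).cuda())  # crashes here
```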
I think this is related to the call to `torch.cuda.set_device` here, which actually sets pytorch's default device to a different one from the one just assigned to `self.device` (which is the previous default, typically `cuda:0`).
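I'm not sure of the exact mechanism, but one way that kind of disagreement arises is if the device is captured before the default is switched; a toy illustration (hypothetical names, not the real code):

```python
import torch

cuda_device = 1  # value from the config

# the device is recorded while the default is still cuda:0 ...
device = torch.device("cuda", torch.cuda.current_device())

# ... and only afterwards is the default switched, so the two now disagree
torch.cuda.set_device(cuda_device)

print(device, torch.cuda.current_device())  # cuda:0 1
```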
It would also be good if `cuda_device` could be set to an explicit list of devices, e.g. `cuda_device: [1, 3]`. Sometimes we might be running on a multi-GPU cluster where we have access to multiple GPUs, but only a subset of all the ones on the machine.
I think both of these issues could be fixed at the same time.
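For example, if the trainer passed the configured devices straight through to `DataParallel`, both the crash and the multi-device case would fall out naturally. A rough sketch of the shape I have in mind (not a patch):

```python
import torch
from torch import nn

cuda_device = [1, 3]  # the proposed list form; a single int would be the one-GPU case

model = nn.Linear(10, 10).cuda(cuda_device[0])  # parameters go on the first listed GPU
parallel_model = nn.DataParallel(
    model,
    device_ids=cuda_device,        # only the GPUs named in the config
    output_device=cuda_device[0],  # gather outputs on the same GPU
)
parallel_model(torch.randn(4, 10).cuda(cuda_device[0]))
```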