Add ability to specify multiple GPU devices
In the training config file we have an option `cuda_device` that specifies which GPU to use. If it's set to `null`, training uses all available GPUs in parallel; it can also be set to an integer to pick a single GPU, e.g. `cuda_device: 1`.
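For reference, the two forms the option currently takes (shown out of context, with the rest of the config omitted):

```
// run on GPU 1 only
"cuda_device": 1

// run on all available GPUs in parallel
"cuda_device": null
```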
However, there is a bug: if you specify `cuda_device: 1`, model training doesn't work. When the trainer wraps the network in `DataParallel`, it passes `device_ids` for all available devices, including ones that we've just explicitly configured pytorch not to use, and crashes.
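Roughly, the failure looks like this (a minimal standalone sketch of the mismatch, not the actual trainer code):

```python
import torch
from torch import nn

torch.cuda.set_device(1)          # simplified stand-in for cuda_device: 1 in the config
model = nn.Linear(10, 10).cuda()  # parameters land on the new default device, cuda:1

# DataParallel is handed every visible device, so device_ids[0] is cuda:0,
# but the parameters live on cuda:1, and the forward pass raises a RuntimeError
parallel_model = nn.DataParallel(model, device_ids=list(range(torch.cuda.device_count())))
parallel_model(torch.randn(4, 10).cuda())  # crashes here
```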
I think this is related to the call to `torch.cuda.set_device` here, which actually sets pytorch's default device to a different one from the one just assigned to `self.device` (which is the previous default, typically `cuda:0`).
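I'm not sure of the exact mechanism, but one way that kind of disagreement arises is if the device is captured before the default is switched; a toy illustration (hypothetical names, not the real code):

```python
import torch

cuda_device = 1  # value from the config

# the device is recorded while the default is still cuda:0 ...
device = torch.device("cuda", torch.cuda.current_device())

# ... and only afterwards is the default switched, so the two now disagree
torch.cuda.set_device(cuda_device)

print(device, torch.cuda.current_device())  # cuda:0 1
```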
It would also be good if `cuda_device` could be set to an explicit list of devices, e.g. `cuda_device: [1, 3]`. Sometimes we might be running on a multi-GPU cluster where we have access to multiple GPUs, but only a subset of all the ones on the machine.
I think both of these issues could be fixed at the same time.
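For example, if the trainer passed the configured devices straight through to `DataParallel`, both the crash and the multi-device case would fall out naturally. A rough sketch of the shape I have in mind (not a patch):

```python
import torch
from torch import nn

cuda_device = [1, 3]  # the proposed list form; a single int would be the one-GPU case

model = nn.Linear(10, 10).cuda(cuda_device[0])  # parameters go on the first listed GPU
parallel_model = nn.DataParallel(
    model,
    device_ids=cuda_device,        # only the GPUs named in the config
    output_device=cuda_device[0],  # gather outputs on the same GPU
)
parallel_model(torch.randn(4, 10).cuda(cuda_device[0]))
```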