I’m running into a memory allocation error when submitting a job to be run on one of the GPU nodes, specifically:
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.17 GiB total capacity; 237.94 MiB already allocated; 19.62 MiB free; 242.00 MiB reserved in total by PyTorch)
Is there a way to specify a GPU that is not in use, or set the amount of memory my job requires beforehand so that the scheduler doesn’t attempt to run my job on a GPU node that doesn’t have adequate resources?
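In the meantime, one common stopgap (not specific to this cluster) is to ask nvidia-smi which GPU currently has the most free memory and mask the others with CUDA_VISIBLE_DEVICES before CUDA initializes. A rough sketch, assuming nvidia-smi is on the PATH on the GPU nodes:

```python
import os
import subprocess

def pick_freest_gpu(smi_output=None):
    """Return the index of the GPU with the most free memory.

    Parses the output of
      nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits
    which is one integer (MiB free) per line, one line per GPU.
    `smi_output` can be passed in directly for testing.
    """
    if smi_output is None:
        smi_output = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.free",
             "--format=csv,noheader,nounits"],
            text=True,
        )
    free_mib = [int(line) for line in smi_output.strip().splitlines()]
    # Index of the GPU with the largest free-memory value
    return max(range(len(free_mib)), key=free_mib.__getitem__)

# Must be set before the first CUDA call (e.g. before importing torch
# in some setups, and certainly before any tensor lands on the GPU):
# os.environ["CUDA_VISIBLE_DEVICES"] = str(pick_freest_gpu())
```

This only dodges contention at launch time; it doesn’t reserve anything, so another job can still grab the same GPU a moment later. A real fix needs the scheduler to track GPUs as resources.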
Not sure, I think @tiankang has also been having this problem. @arnsong might be able to help. I understand that there will be some new GPUs available soon accompanying a migration to the SLURM scheduler. This is also supposed to help with some of the wonky things we’ve all been experiencing with the GPUs on discovery.
Also, in case it’s helpful, here is the manual for the version of Torque Discovery uses. There’s a full list with descriptions of all resources that can be requested using the #PBS -l directive starting on pg. 74.
You can also view the list on Discovery with man pbs_resources.
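For what it’s worth, a typical Torque submission script using those -l directives looks something like the sketch below. The exact resource names (whether gpus=1 is honored, node feature names, queue names) depend on how Discovery’s Torque is configured, so treat this as a template to check against man pbs_resources rather than something guaranteed to work as-is. Note that -l mem requests host RAM, not GPU memory; as far as I know Torque has no way to reserve a specific amount of GPU memory.

```shell
#!/bin/bash
#PBS -N gpu-train            # job name (placeholder)
#PBS -l nodes=1:ppn=1:gpus=1 # one node, one CPU core, one GPU
#PBS -l mem=16gb             # host RAM for the job (not GPU memory)
#PBS -l walltime=04:00:00    # max run time

# Torque starts the job in $HOME; move to the submission directory
cd "$PBS_O_WORKDIR"

python train.py              # placeholder for the actual workload
```

Submitted with qsub script.sh; the scheduler should then only place the job on a node with an unallocated GPU, assuming GPUs are tracked as a consumable resource in the site config.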