How do I specify the amount of memory my job requires?

I’m running into a memory allocation error when submitting a job to be run on one of the GPU nodes, specifically:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.17 GiB total capacity; 237.94 MiB already allocated; 19.62 MiB free; 242.00 MiB reserved in total by PyTorch)

Is there a way to specify a GPU that is not in use, or set the amount of memory my job requires beforehand so that the scheduler doesn’t attempt to run my job on a GPU node that doesn’t have adequate resources?

I’m not sure, but I think @tiankang has also been having this problem. @arnsong might be able to help. I understand that there will be some new GPUs available soon, accompanying a migration to the SLURM scheduler, which should also help with some of the wonky things we’ve all been experiencing with the GPUs on Discovery.


It turns out it is as simple as using the mem argument. In the PBS script you would add:

#PBS -l mem=2gb

The argument is formatted as a positive integer followed by b, kb, mb or gb. Source
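In case a fuller example helps, here’s a minimal sketch of a job script with the memory request in context. The queue name, job name, script name, and the other resource values are placeholders I made up, so check them against Discovery’s actual queues and limits before using:

#!/bin/bash
# Hypothetical example; the queue, job name, and resource values are placeholders.
#PBS -N my_gpu_job
#PBS -q gpuq
#PBS -l nodes=1:ppn=1:gpus=1
#PBS -l walltime=01:00:00
#PBS -l mem=2gb

cd $PBS_O_WORKDIR
python train.py

One caveat, as far as I understand it: mem requests host RAM rather than GPU memory, so it helps the scheduler place the job on a node with enough free memory but won’t by itself prevent a CUDA out-of-memory error.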


Also, in case it’s helpful, here is the manual for the version of Torque Discovery uses. There’s a full list with descriptions of all resources that can be requested using the #PBS -l directive starting on pg. 74.

You can also view the list on Discovery with man pbs_resources.
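If it helps for choosing a sensible value, you can also compare what a job requested against what it actually used with qstat -f. The job ID below is made up, and the exact field names may vary a bit between Torque versions:

qstat -f 123456 | grep -E "Resource_List.mem|resources_used.mem"
# Resource_List.mem  = the amount requested (e.g. 2gb)
# resources_used.mem = the amount the job has actually consumed so far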
