I was testing both the nltools and Brainiak implementations of ISC and was getting wildly different p-values between the two. I was running it on Discovery with 1 node and 16ppn, and set the n_jobs parameter in the nltools implementation to the default value of -1 (which uses all available processors for the computation). I played around with the number of processors to compare speed, and realized that specifying fewer processors (around 1-4) gave me results more similar to the Brainiak implementation, while increasing the number of processors gave me smaller and smaller p-values (and, strangely, slower computation). This happens both when I change the n_jobs parameter and when I request different numbers of processors in the job script I submit to Discovery. I'm wondering if this is a bug and the implementation wasn't optimized to be parallelized across that many processors. I'm hoping someone can shed some light on this or look into it! Thanks in advance!
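For reference, here's a minimal sketch of the kind of comparison I mean, on randomly generated data rather than my actual dataset. Apart from n_jobs, the argument names are just my best reading of the nltools 0.4.x docs, so treat them as approximate:

```python
import numpy as np
import pandas as pd
from nltools.stats import isc

rng = np.random.default_rng(0)
# Fake time series: 200 timepoints x 20 subjects (subjects as columns)
data = pd.DataFrame(rng.standard_normal((200, 20)))

# Run the same bootstrap ISC with different levels of parallelization and
# compare the reported statistics/p-values across n_jobs settings
for n_jobs in [1, 4, 16]:
    results = isc(data, n_samples=5000, method='bootstrap', n_jobs=n_jobs)
    print(n_jobs, results)
```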
Hi @josie.equita, thanks for posting your experience and debugging. Can you say more about the versions of nltools and joblib you are using? We just encountered a different problem with another permutation function that ended up being an issue with using an old version of joblib.
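For example, running something like this from inside the environment you're using should print them (assuming both packages expose the standard __version__ attribute, which I believe they do):

```python
# Print the installed versions from inside the active environment
import joblib
import nltools

print("joblib:", joblib.__version__)
print("nltools:", nltools.__version__)
```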
Hi @ljchang, thank you for responding! I have joblib version 0.17.0 and nltools version 0.4.2 installed in the environment I'm using. From what I've just searched, those both seem to be the latest versions, but please correct me if I'm mistaken. Let me know if there's anything else I could check.
Do you mind posting your experience as a GitHub issue? @ejolly or I will take a look and see if we can figure out what's going on. https://github.com/cosanlab/nltools/issues
@josiequita Can you provide a little bit more information about how you're calling ISC from within `nltools`? Also, are you able to reproduce this on a non-cluster computer? Here's a link to a notebook running on my local machine where I'm not able to reproduce this error with randomly generated data: notebook link
Also, in general I tend to avoid using `n_jobs=-1` on the cluster because of how resource sharing works. To avoid jobs being killed prematurely, and to be a good citizen to other users, it's preferable to be explicit (e.g. `n_jobs=16`). For example, if you requested 16ppn and your job lands on a node with 64 cores, the scheduler on Discovery will mark the remaining 48 cores (64 - 16) as available for others to use. However, `n_jobs=-1` will try to use everything on that node, potentially causing issues for yourself or other users. This isn't an issue if your job lands on a machine with exactly the number of ppn you requested and no one else is using that node.
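If it's useful, here's one rough way to keep n_jobs tied to the CPUs your job was actually given instead of hard-coding a number. It assumes a Linux node where the scheduler pins your job to its allocated cores; the isc() arguments are the same loose sketch of the nltools call as above, not an exact signature:

```python
import os
import numpy as np
import pandas as pd
from nltools.stats import isc

# Number of CPUs this process is actually allowed to use (Linux only). If your
# scheduler doesn't set CPU affinity for jobs, just hard-code the ppn you requested.
n_jobs = len(os.sched_getaffinity(0))

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.standard_normal((200, 20)))  # placeholder data

results = isc(data, n_samples=5000, method='bootstrap', n_jobs=n_jobs)
```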