Common Cluster Pitfalls

These are some of the common problems people run into when using the cluster. We hope this list helps you avoid them.

Asking for multiple cores but forgetting to specify one node: -n 4 -N 1 is very different from -n 4. Without -N 1, SLURM may spread the four cores across several nodes.

Problem	Symptom/Reason
Throwing multiple cores at Python and R code	Without special programming, code written for Python and R is single-threaded. That means giving more cores in SLURM to your code will do nothing except waste resources. If you wish to use multiple cores, you must explicitly write your code to use them, using modules such as 'multiprocessing' for Python or packages like 'parallel' or 'Rmpi' for R.
Jobs PENDing for >48 hrs	Very large resource requests (cores and memory): lower the request and try again. A very low Fairshare score can also cause this.
Job runs briefly and FAILs: -t parameter not included	Without -t, the job gets the shortest possible time limit in all partitions: 10 min.
Not specifying enough cores	A pipeline such as prog1 | prog2 | prog3 > outfile runs three programs at once, so it should run with 3 cores!
Causing massive disk I/O on home folders/lab disk shares	Your work and that of others on the same filesystem slows to a crawl: even simple commands like ls take forever.
Hundreds/thousands of jobs access one common file	Your work and that of others on the same filesystem slows to a crawl. Make several copies of the file and have each job access one of the copies.
Don’t pack more than 5K files in one directory	I/O for your jobs will slow to a crawl
Bundle your work into ~10 min jobs	Kinder for us, kinder for you, kinder for the cluster
Please understand your software -- look at the options	Who knows what could happen?? You wouldn't use an instrument without reading the instructions, would you?
Trying to sudo when installing software	Please don’t -- we admin the boxes for you.
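
As an illustration of the first row of the table, here is a minimal Python sketch of what "explicitly writing your code to use multiple cores" can look like with the standard 'multiprocessing' module. It assumes the job was submitted with something like -n 4 -N 1, so that SLURM sets the SLURM_NTASKS environment variable; the function names worker_count and simulate are hypothetical placeholders, and simulate merely stands in for your own CPU-bound work.

```python
import os
from multiprocessing import Pool


def worker_count(default=1):
    """Return how many worker processes to start.

    Assumption: the job was submitted with -n <cores> -N 1, so SLURM
    exports SLURM_NTASKS. Outside a SLURM job, fall back to `default`.
    """
    try:
        n = int(os.environ.get("SLURM_NTASKS", ""))
    except ValueError:
        return default
    return n if n > 0 else default


def simulate(seed):
    """Hypothetical CPU-bound task standing in for real work."""
    total = 0
    for i in range(100_000):
        total = (total + seed * i) % 1_000_003
    return total


if __name__ == "__main__":
    seeds = range(8)
    # One worker process per requested core; the pool spreads the
    # eight independent tasks across those workers.
    with Pool(processes=worker_count()) as pool:
        results = pool.map(simulate, seeds)
    print(results)
```

Sizing the pool from the SLURM request, rather than hard-coding a number, keeps the script from starting more processes than the cores the job was given. All worker processes run on one machine, which is why the single-node request (-N 1) matters for this approach.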

Last Updated: July 15, 2024
