Frequently Asked Questions (FAQ)
LOGIN AND AUTHENTICATION
My login is slow or my batch commands are slow
Nine times out of ten, slowness at login, when starting file transfers, failed SFTP sessions, or slow batch command starts is caused by unneeded module loads in your .bashrc.
We do not recommend putting multiple module loads in your .bashrc, as every new shell you or your jobs create will run those module loads. Instead, put your module loads in your job scripts so that you are not loading unneeded modules and waiting on those module calls to complete before the job begins. Alternatively, you can create a login script or alias containing your frequently used modules that you can run when you need them.
Either way, keep any module loads in your .bashrc to a bare minimum, calling only those modules that you absolutely need in every login or job.
Additionally, as time goes on, modules change or are removed. Please ensure you remove any deprecated modules from your .bashrc or other scripts. For example, the legacy modules no longer exist; if you still have a call to module load legacy (or any of the legacy modules), or if you have source new-modules.sh, your login will be delayed while the module system searches for and then times out on those non-existent modules.
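As a minimal sketch of moving module loads into the job script itself (the module name and program below are hypothetical placeholders):
#!/bin/bash
#SBATCH -p test
#SBATCH -t 10
#SBATCH --mem 1000
# Load only what this particular job needs, here instead of in ~/.bashrc
module load MYMODULE/1.0.0-fasrc01
./my_program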
My alternate shell (csh, tcsh, etc.) doesn’t work right
Having a non-standard default shell will cause problems and does not allow us to set global environment defaults for everyone. As of 2019, we no longer change the default shell on any account or support the use of alternate shells as the default login shell.
Users can, of course, still launch an alternate shell once logged in. Built-in shells such as sh, zsh, and csh already exist on most nodes.
SSH key error, DNS spoofing message
If you are getting SSH key or host errors, see this page.
SFTP exits after a few seconds
When connecting via an SFTP client like FileZilla, if you experience a short delay and then disconnection, this is most likely an issue caused by your .bashrc.
During SFTP connections, your .bashrc will be evaluated just as if you were logging in via SSH. If you’ve added anything to your .bashrc that attempts to echo to the terminal/standard out, this will cause your SFTP client to hang and then disconnect.
You can either remove the statement in your .bashrc that is writing output (an echo statement, a call to an app or module that sends a message to standard out, etc.) -or- you can put the offending statement into an evaluation clause that first checks whether this is an interactive login, like so:
if [ "$SSH_TTY" ]
then
    echo "SFTP connections won't evaluate the things inside this clause."
    echo "Only real login sessions will."
fi
What happens to my account when I leave/graduate?
Please see this page: What happens to my FASRC account when I leave Harvard?
How do I request membership in additional lab groups?
Please see Additional Group Membership
Can I use SSH keys to log in without a password?
No. Our cluster login relies on two-factor authentication. This makes using key-based authentication impractical.
How do I get a Research Computing account?
Before You Sign Up
If you are unsure whether you qualify for an RC account, please see Qualifications and Affiliations. More information on using the signup tool can be found here.
Please Note: You may have only one RC account. If you need to add cluster access or membership in a different/additional lab group, please submit a help ticket. Please do not sign up for a second account. This is unnecessary and against our account policies.
The Process
To request an account to access resources operated by Research Computing (Cluster, Storage, Software Downloads, Workstation access, Instrument sign-up, etc.), please proceed to the account signup tool.
PLEASE NOTE: Do not select FACULTY as your job type if you do not have a faculty appointment. If you are a researcher with additional rights (fellowship, PI-like rights, funding, etc.), please select STAFF or POSTDOC. Faculty accounts are intended only for those holding an active Associate Professor or higher appointment.
Once you’ve submitted the request, the process is:
If You Selected: Internal/Using Harvard Key to verify your information and qualifications:
- The request is on hold while the PI is asked to approve or reject it.
- Once approved, the account is finalized and set up.
- Once finalized, you receive an automated email confirmation with your new account information and instructions for setting the password.
If You Selected: External/Not using Harvard Key to verify your information and qualifications:
- The request goes to RC personnel to check that it is complete and meets affiliation requirements.
- Once approved by RC, an email is sent to your PI to approve/reject the request.
- The request is on hold while the PI is asked to approve or reject it.
- Once approved, we finalize the account on our side (during business hours).
- Once finalized, you receive an automated email confirmation with your new account information and instructions for setting the password.
You can then proceed to set up your OpenAuth token and get connected to the cluster. The turnaround time is directly related to the PI/Sponsor’s approval of the account. External accounts are reviewed by RC staff during business hours and are generally vetted and sent on to the PI/Sponsor for approval within one business day.
NOTE! If you request "Cluster Use" (the ability to run jobs on the cluster), please attend one of our monthly New User Trainings or watch our Introduction to the Cluster videos.
Can someone else approve my account request?
Initially, only the PI for a lab can sponsor and approve new accounts under their lab group. The PI may, at any point, designate one or more additional account holders in their lab, such as a lab admin or faculty assistant, as additional approvers by contacting FASRC directly. Approval to add additional approvers can only come directly from the PI to FASRC (a forwarded email is not sufficient; the PI needs to contact us directly).
Can I share an account? – Account Security Policies
The sharing of passwords or login credentials is not allowed under RC and Harvard information security policies. Please bear in mind that this policy also protects the end-user.
Sharing credentials removes auditability and accountability for the account holder in case of account misuse. Accounts in violation of this policy may be disabled or otherwise limited, and accounts knowingly skirting this policy may be banned.
If you find that you need to share resources among multiple individuals, faculty can approve accounts for outside collaborators in their lab groups. Otherwise, please contact us and we will be happy to help you find a safe and secure way to do so.
How do I login to the FASRC cluster?
See our Access and Login page and/or our terminal access page.
How do I reset my Research Computing account password?
Please click here to reset your Research Computing account password using your email address.
This will send an email to you with a one-time use link to set a new password.
Please note: Your username is not your email address. Your email address is used here only for password resets and to contact you.
How do I unlock my locked Research Computing account?
Typically, after entering the incorrect password multiple times, your account will become locked. Once locked, it will automatically unlock after roughly 5-10 minutes. If your account remains locked for longer, please contact us.
How do I install and launch OpenAuth?
If you do not yet have an account, see: How do I get a Research Computing account? For additional instructions, see: Account Signup
Setting Up Your OpenAuth Token
- Visit https://two-factor.rc.fas.harvard.edu/ to start setup of OpenAuth.
- A login box will appear. Log in with your FAS RC username and password (your username is not your email address or Harvard Key, it is the short username you initially set up when requesting an account. Example: jsmith )
- After logging in, allow a few seconds as the site generates your token.
- A page will be displayed outlining next steps
- Await an email. This email will contain a link to your personalized token. You can download the Java applet or use the QR code on that page to add your RC token in Google Authenticator or Duo Mobile
Since the site uses email verification to authenticate you, you must also have a valid account and email address on record with Research Computing. All OpenAuth tokens are software-based, and you will choose whether to use a smart phone or java desktop app to generate your verification codes. Java 1.6 or higher is required for the desktop app.
You will need to use OpenAuth when accessing the Research Computing VPN and logging into the FAS RC cluster.
How do I logon to the Research Computing VPN?
Please see our VPN setup guide here.
Linux users please see our guide to using OpenVPN here.
I need an AWS account and/or Amazon AWS virtual machine
AWS offerings are through HUIT. Please see https://cloud.huit.harvard.edu/ or contact ithelp@harvard.edu
FILESYSTEMS AND AUTHORIZATION
Where is ftp?
Modern secure transfer protocols like SFTP and SCP secure data during transit and should be used when moving files from one place to another. However you may still need to use plain, un-secured FTP to download data sets or other files from remote locations while logged into the cluster.
While we do not offer the largely outmoded ‘ftp’ program on the cluster, we do offer the feature-rich and largely command compatible ‘lftp’. From any login or compute node type ‘man lftp’ to see its usage and options.
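A minimal example (the host and file names are hypothetical), run from a login or compute node:
lftp ftp.example.com                                  # open an interactive session; then use 'get FILE' and 'bye'
lftp -e "get dataset.tar.gz; bye" ftp.example.com     # or fetch a single file non-interactively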
How do I request membership in additional lab groups?
Please see Additional Group Membership
What’s the best way to transfer my data?
INTERNAL
See our ‘Transferring data on the cluster‘ page for a list of options and best practices for data transfer within the cluster.
EXTERNAL
For transferring data to and from the cluster, see ‘Transferring data externally‘.
How do I access my cluster home directory from my laptop?
FASRC cluster home directories are available through SAMBA and so can be mounted as a network drive on Mac, Windows, and Linux computers. See the Mounting Storage page for specific instructions on how to mount the directory.
How do I check how much space I’ve used?
While logged into a login node, you can check your home directory usage by issuing the following command:
df -h ~
(the ~ character is a POSIX shortcut for your home directory) Example output:
[jharvard@holylogin03 ~]$ df -h ~
Filesystem Size Used Avail Use% Mounted on
rcstorenfs:/ifs/rc_homes/home12 95G 11G 85G 12% /n/home12
There is a hard limit of 100GB on home directories and a warning threshold of 95GB, hence the size showing as 95G in the example above.
Sorry, but we cannot increase this allotment. Please use disk shares associated with your lab or one of our scratch filesystems if you require more space. Please see our Storage document for more information.
To check your lab share’s total usage, you can use the same command, but provide the path to your lab share:
df -h /n/jharvard_lab
For Lustre filesystems (e.g. holylfs*, boslfs*, holystore01) the above command will show you the whole filesystem instead of just your part. To get your group’s quota you can run:
lfs quota -hg jharvard_lab /n/filesystem
I accidentally deleted my data, how do I get it back?
Your home directory has periodic snapshots taken. These snapshots are of your home directory files from various recent points in time. They are in a hidden directory named .snapshot, within every other directory in your home directory. The command ls -a will not show these, but you can ls .snapshot directly, and cd .snapshot to go into the directory.
In the .snapshot folder you will see “hourly” “daily” “monthly” folders with the date of the snapshots. Traverse (cd) to the snapshot folder corresponding to the period you wish to restore data from. From there you can simply copy the relevant files back into your home folder using your favorite file copy tool (rsync, cp, etc.)
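As a sketch (the snapshot and file names below are hypothetical; the actual snapshot folder names will differ):
cd ~/.snapshot
ls                                          # e.g. hourly.2024-05-01_1200, daily.2024-04-30_0010, ...
cp daily.2024-04-30_0010/results.csv ~/     # copy the deleted file back into your home directory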
Lab directory backups are for disaster recovery only, as they are handled separately and do not have snapshot capabilities. Additionally, these backups, given the size, run over a period of many days and are not ‘up-to-the-minute’ backups. As such, they are not intended to recover accidental file deletions. Please contact FASRC if you have any questions.
Please also see our Storage document for more info.
Why are all my files executable?
You may notice that the x (execute) bit is set on all your files, and that running chmod does not remove it.
This is a feature, a result of the storage system doing mixed Unix-style and Windows-style permissions. If this is causing a problem for you, please contact FASRC.
Why does my UMASK not work?
You may also notice that your UMASK environment variable does not work as expected: where the outcome would normally be -rw-rw-r--, new files may be created with different permissions. If this is causing a problem for you, please contact FASRC.
Is my home directory available as a network filesystem share?
Yes, your cluster home directory is available as a network filesystem share to which you can directly connect your own desktop or laptop. The technical protocol for this is called CIFS or Samba, so you will often hear us refer to it in that way. On Windows, this is also referred to as mapping a network drive, and on a Mac it is called connecting to a server.
In all cases, you need your RC username, password, server name, and path. Please see the Mounting Storage document for detailed information.
SOFTWARE
I need cluster access to Gaussian
Please contact us if you require Gaussian access. It is controlled on a case-by-case basis and requires membership in a security group.
To see all available versions of Gaussian, visit the All Modules page and Search for ‘gaussian’.
I need to download GaussView or MOE
FASRC users can download these clients from our Downloads page. You must be connected to the FASRC VPN to access this page. Your FASRC username and password are required to log in.
I need JMP Pro or JMP Genomics
FASRC no longer has access to a JMP Pro/Genomics license. Please see the JMP site for licensing details. FASRC does provide SAS 9.4 for use in jobs on the cluster.
I need to download Geneious Pro or MOE (only available for FAS users)
FAS members can download these clients from our Downloads page. You must be connected to the FASRC VPN to access this page. Your FASRC username and password are required to log in. Not for use by members of other schools or external users.
I can’t search for R
Unfortunately, having a single letter as the name of an application makes searching problematic.
Here are links to our R Basics and R Packages pages.
Where is FTP?
Modern secure transfer protocols like SFTP and SCP secure data during transit and should be used when moving files from one place to another. However you may still need to use plain, un-secured FTP to download data sets or other files from remote locations while logged into the cluster.
While we do not offer the largely outmoded ‘ftp’ program on the cluster, we do offer the feature-rich and largely command compatible ‘lftp’. From any login or compute node type ‘man lftp’ to see its usage and options.
How do I load a module or software on FASRC cluster?
Step 1: Login to the cluster through your Terminal window. Please see here for login instructions.
Step 2: Load a module/software by typing: module load MODULENAME. Replace MODULENAME with the specific software you want to use. A complete listing of modules can be found on the module list page.
To see what modules you have loaded, type: module list
To unload a module, type: module unload MODULENAME
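For example (the module name and version below are hypothetical; see the module list page for real names):
module load MYMODULE/1.2.3-fasrc01     # load a specific version
module list                            # confirm what is currently loaded
module unload MYMODULE/1.2.3-fasrc01   # unload it when finished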
Details can be found in the modules section of the Running Jobs page.
FileZilla: I have to enter my OpenAuth code every 30 seconds
If you are using FileZilla to transfer files to the cluster and you are prompted frequently (like every 30 seconds!) to enter your username and/or OpenAuth token code, then most likely you did not configure FileZilla according to our instructions. You must limit the number of connections to 1; otherwise FileZilla will spawn more connections, each requiring you to authenticate.
Please see this document on how to set the connection limit and avoid the OpenAuth challenge frustration while transferring files to and from the cluster.
Git/Github: 403 Forbidden while accessing https://github.com…
If you issue a git push to a cloned repository, you might receive the following error:
error: The requested URL returned error: 403 Forbidden while accessing https://github.com/yourusername/planets.git/info/refs
fatal: HTTP request failed
Authorization to Github repositories on the cluster can be a little tricky. Please follow our instructions at Git and Github on the FASRC cluster.
How do I run a Matlab script on the FASRC cluster?
To run a Matlab script (with no graphical interface component) on the cluster, log in using your preferred terminal application, then activate the application by loading the module. Then, assuming your script is named calc.m, either run it through an interactive session:
salloc --mem 1000 -p test matlab -nojvm -nodisplay -nosplash < calc.m
or use the matlab command in a batch script:
#!/bin/bash
#SBATCH -o calc.out
#SBATCH -e calc.err
#SBATCH -p serial_requeue
#SBATCH -n 1
#SBATCH --mem 1000
#SBATCH -t 1000

matlab -nojvm -nodisplay -nosplash < calc.m
Make sure that `calc.m` finishes with an `exit` command. Otherwise, the process will hang waiting for further input.
Perl modules: Can’t locate XX.pm in @INC
Perl modules have been developed over the past 15 to 20 years, and the installation method has changed significantly. Unfortunately, you might run into a program that needs to install a really old Perl module, and its installation is just not behaving properly under the new installation methods. You might see something like the following:
[bfreeman@holylogin01 PfamScan]$ ./pfam_scan.pl --help
Can't locate Data/Printer.pm in @INC (@INC contains: /n/sw/fasrcsw/apps/Core/perl-modules.....
The remedy can be rather simple:
1. Follow our new lmod – Perl instructions here on setting up your home directory for installing Perl modules ‘locally’. Note that the export PERL5LIB command must include both $LOCALPERL and $LOCALPERL/lib/perl5 (its subdirectory), as some installation routines honor one and some the other.
2. Sometimes, you might need to install the module manually. Try both the Makefile.PL build and the Build.PL build if one or the other doesn’t work.
3. In CPAN, you can do this manual install method without the hassle of the download process:
cpan
look Data::Printer
This latter command will download the module and unpack it for you, and leave you at a shell, where you can try either the Makefile.PL or Build.PL build process.
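A rough sketch of that manual build from the shell cpan leaves you in (the INSTALL_BASE/--install_base settings are assumptions based on the $LOCALPERL setup described above):
perl Makefile.PL INSTALL_BASE=$LOCALPERL    # ExtUtils::MakeMaker style
make && make test && make install
# or, if the module ships a Build.PL instead:
perl Build.PL --install_base $LOCALPERL
./Build && ./Build test && ./Build install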
Illegal Instruction
If you are getting an error indicating an illegal instruction, that likely means your code was built on a different processor type than the one it is running on. The cluster has a variety of hardware, and if your code leverages instructions specific to one type of hardware, it cannot run on other types. To resolve this error, either rebuild your code without the hardware-specific instruction sets, or tell the scheduler via the --constraint option to run your jobs only on the specific hardware types you built your code for. A full list of constraints can be found on the Running Jobs page.
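For example (FEATURENAME is a placeholder; use the constraint names listed on the Running Jobs page):
#SBATCH --constraint="FEATURENAME"   # run only on nodes that advertise this hardware feature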
JOBS AND SLURM
How do I know what partitions I have access to?
The spart command can be used to find a quick summary of this information. scontrol show partition and sinfo will also give more detailed information about the various partitions you have rights to use.
How do I know what memory limit to put on my job?
Add to your job submission:
#SBATCH --mem X
where X is the maximum amount of memory your job will use per node, in MB. The larger your working data set, the larger this needs to be, but the smaller the number the easier it is for the scheduler to find a place to run your job. To determine an appropriate value, start relatively large (job slots on average have about 4000 MB per core, but that’s much larger than needed for most jobs) and then use sacct to look at how much your job is actually using or used:
sacct -o MaxRSS -j JOBID
where JOBID is the one you’re interested in. The number is in KB, so divide by 1024 to get a rough idea of what to use with --mem (set it to something a little larger than that, since you’re defining a hard upper limit).
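A worked example with hypothetical numbers:
sacct -o MaxRSS -j 12345678
#    MaxRSS
# ----------
#   3200000K
# 3200000 KB / 1024 is roughly 3125 MB, so "#SBATCH --mem 4000" would leave reasonable headroom.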
For more information see here.
How do I figure out how efficient my job is?
You can see your job efficiency by comparing Elapsed, CPUTime, and NCPUS in sacct. For example:
[user@boslogin01 home]# sacct -j 1234567 -o Elapsed,CPUTime,NCPUS
Elapsed CPUTime NCPUS
---------- ---------- ----------
13:22:35 35-16:05:20 64
13:22:35 17-20:02:40 32
13:22:41 35-16:11:44 64
13:21:39 1-02:43:18 2
In this job you see that the user used 64 cores and their job ran for about 13 hours. Their CPUTime is roughly 35.7 days, which is 64 * 13.4 hours. If your code is scaling effectively, CPUTime = NCPUS * Elapsed. If it is not, those numbers will diverge. The best way to test this is to do some scaling tests. There are two styles you can do. Strong scaling is where you leave the problem size the same but increase the number of cores; if your code scales well, it should take less time in proportion to the number of cores you use. The other is weak scaling, where the amount of work per core remains the same but you increase the number of cores, so the size of the job scales proportionally to the number of cores; if your code scales well in this case, the run time should remain the same.
Typically most codes have a point where the scaling breaks down due to inefficiencies in the code. Thus beyond that point there is not any benefit to increasing the number of cores you throw at the problem. That’s the point you want to look for. This is most easily seen by plotting log of the number of cores vs. log of the runtime.
The other factor that is important in a scheduling environment is that the more cores you ask for the longer your job will pend for as the scheduler has to find more room for you. Thus you need to find the sweet spot where you minimize both your runtime and how long you pend in the queue for. For example it may be the case that if you asked for 32 cores your job would take a day to run but pend for 2 hours, but if you ask for 64 cores your job would take half a day to run but would pend for 2 days. Thus it would have been better to ask for 32 cores even though the job is slower.
Will single core/thread jobs run faster on the cluster?
The cluster cores, in general, will not be any faster than the ones in your workstation; in fact, they may be slower if your workstation is relatively new. While we have a variety of chipsets available on the cluster, most of the cores are AMD and will be slower than many Intel chips, which are most common in modern desktops and laptops. The reason we use so many AMD chips is that we could purchase a larger number of cores and RAM this way. This is the power of the cluster. The cluster isn’t designed to run a single-core code as fast as possible, as the chips to do that are expensive. Rather, you trade off raw chip speed for core count, and then gain speed and efficiency via parallelism. So the cluster excels at multicore jobs (using threads or MPI ranks) or at running many single-core jobs (such as parameter sweeps or image processing). This way you leverage the parallel nature of the cluster and the 60,000 cores available.
So if you have a single job, the cluster isn’t really a gain. If you have lots of jobs you need to get done, or your job is too large to fit on a single machine (due to RAM or its parallel nature), the cluster is the place to go. The cluster can also be useful for offloading work from your workstation. That way you can use your workstation cores for other tasks and offload the longer running work onto the cluster.
In addition, since the cluster cores are a different architecture from your workstation’s, be aware that the code will need to be optimized differently. This is where compiler choice and compiler flags can come in handy, so that you can get the most out of both sets of cores. Even then, you may not get the same performance out of the cluster as your local machine. The main processor we have on the cluster is now 4 years old, and if you are using serial_requeue you could end up on anything from hardware bought today to hardware purchased 7 years ago. There is about a factor of 2-4 in performance in just the natural development of processor technology.
My login is slow or my batch commands are slow
Nine times out of ten, slowness at login, when starting file transfers, failed SFTP sessions, or slow batch command starts is caused by unneeded module loads in your .bashrc.
We do not recommend putting multiple module loads in your .bashrc, as every new shell you or your jobs create will run those module loads. Instead, put your module loads in your job scripts so that you are not loading unneeded modules and waiting on those module calls to complete before the job begins. Alternatively, you can create a login script or alias containing your frequently used modules that you can run when you need them.
Either way, keep any module loads in your .bashrc to a bare minimum, calling only those modules that you absolutely need in every login or job.
Additionally, as time goes on, modules change or are removed. Please ensure you remove any deprecated modules from your .bashrc or other scripts. For example, the legacy modules no longer exist; if you still have a call to module load legacy (or any of the legacy modules), or if you have source new-modules.sh, your login will be delayed while the module system searches for and then times out on those non-existent modules.
How do I request membership in additional lab groups?
Please see Additional Group Membership
Can I query SLURM programmatically?
I’m writing code to keep an eye on my jobs. How can I query SLURM programmatically?
We highly recommend that people writing meta-schedulers, or who wish to interrogate SLURM in scripts, do so using the squeue and sacct commands. We strongly recommend that your code performs these queries once every 60 seconds or longer. These commands contact the master controller directly, the same process responsible for scheduling all work on the cluster. Polling more frequently, especially across all users on the cluster, will slow down response times and may bring scheduling to a crawl. Please don’t.
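A minimal polling sketch that respects the 60-second guidance (JOBID is a placeholder):
#!/bin/bash
JOBID=$1
while true; do
    state=$(squeue -h -j "$JOBID" -o "%T")   # prints nothing once the job has left the queue
    [ -z "$state" ] && break
    echo "$(date): job $JOBID is $state"
    sleep 60                                  # never poll more often than this
done
sacct -j "$JOBID" -o JobID,State,Elapsed      # final accounting record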
SLURM also has an API that is documented on the website of our developer partners SchedMD.com.
Are there policies or guidelines for using the cluster responsibly?
Yes. Please see our Customs and Responsibilities page.
How do I submit a batch job to the FASRC cluster queue with SLURM?
Step 1: Login to cluster through your Terminal window. Please see the Access and Login page for login instructions.
Step 2: Run a batch job by typing: sbatch RUNSCRIPT. Replace RUNSCRIPT with the batch script (a text file) you will use to run your code.
The batch script should contain #SBATCH comments that tell SLURM how to run the job.
#!/bin/bash
#SBATCH -n 1                 #Number of cores
#SBATCH -t 5                 #Runtime in minutes
#SBATCH -p serial_requeue    #Partition to submit to
#SBATCH --mem-per-cpu=100    #Memory per cpu in MB (see also --mem)
#SBATCH -o hostname.out      #File to which standard out will be written
#SBATCH -e hostname.err      #File to which standard err will be written
See the batch submission section of the Running Jobs page for detailed instructions and sample batch submission scripts.
Note: You must declare how much memory and how many cores you are using for your job. By default SLURM assumes you need 100 MB. The script assumes that it is running in the current directory and will load your .bashrc.
How do I submit an interactive job on the cluster?
Step 1: Log in to the cluster through your Terminal window. Please see here for login instructions.
Step 2: Run an interactive job by typing: salloc -p test MYPROGRAM
This will open up an interactive run for you to use. If you want a bash prompt, type: salloc --mem 500 -p test
If you need X11 forwarding type: salloc --mem 500 -p test --x11 MYPROGRAM
This will initiate an X11 tunnel to the first node on your list.
See also the interactive jobs section of the Running Jobs page.
How do I view or monitor a submitted job?
Step 1: Login to the cluster through your Terminal window. Please see the Access and Login page for login instructions.
Step 2: From the command line, type one of three options: smap, squeue, or showq-slurm.
If you want more details about your job, from the command line type: sacct -j JOBID
You can view the runtime and memory usage for a past job by typing: sacct -j JOBID --format=JobID,JobName,MaxRSS,Elapsed, where JOBID is the numeric job ID of a past job.
See the Running Jobs page for more details on job monitoring.
My job is PENDING. How can I fix this?
How soon a job is scheduled is due to a combination of factors: the time requested, the resources requested (e.g. RAM, # of cores, etc), the partition, and one’s FairShare score.
Quick solution? The Reason column in the squeue output can give you a clue:
- If there is no reason, the scheduler hasn’t attended to your submission yet.
- Resources means your job is waiting for an appropriate compute node to open.
- Priority indicates your priority is lower relative to others being scheduled.
There are other Reason codes; see the SLURM squeue documentation for full details.
Your priority is partially based on your FairShare score and determines how quickly your job is scheduled relative to others on the cluster. To see your FairShare score, enter the command sshare -u RCUSERNAME. Your effective score is the value in the last column and, as a rule of thumb, scores below 0.5 mean lower priority and scores above 0.5 mean higher priority.
In addition, you can see the status of a given partition and your position relative to other pending jobs in it by entering the command showq-slurm -p PARTITION -o. This will order the pending queue by priority, where jobs listed at the top are next to be scheduled.
For both the Resources and Priority squeue Reason codes, consider shortening the runtime or reducing the requested resources to increase the likelihood that your job will start sooner.
Please see this document for more information and this presentation for a number of troubleshooting steps.
SLURM Errors: Job Submission Limit (per user)
If you attempt to schedule more than 10,000 jobs (all inclusive, both running and pending) you will receive an error like the following:
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user’s size and/or time limits)
For more info about being a good cluster neighbor, see: https://docs.rc.fas.harvard.edu/kb/responsibilities/
SLURM Errors: Device or resource busy
What’s up? My SLURM output file terminates early with the following error:
"slurmstepd: error: _slurm_cgroup_destroy: problem deleting step cgroup
path /cgroup/freezer/slurm/uid_57915/job_25009017/step_batch: Device or
resource busy"
Well, usually this is a problem in which your job is trying to write to a network storage device that is busy — probably overloaded by someone doing high amounts of I/O (input/output) where they shouldn’t, usually on low throughput storage like home directories or lab disk shares.
Please contact RCHelp about this problem, giving us the jobID, the filesystem you are working on, and additional details that may be relevant. We’ll use this info to track down the problem (and, perhaps, the problem user(s)).
(If you know who it is, tap them on the shoulder and show them our Cluster Storage page.)
SLURM errors: Job cancelled due to preemption
If you’ve submitted a job to the serial_requeue partition, it is more than likely that your job will be scheduled on a purchased node that is idle. If the node owner submits jobs, SLURM will kill your job and automatically requeue it. This message will appear in the STDOUT or STDERR files you indicated with the -o or -e options. It is simply an informative message from SLURM.
SLURM Errors: Memory limit
Job <jobid> exceeded <mem> memory limit, being killed:
Your job is attempting to use more memory than you’ve requested for it. Either increase the amount of memory requested via --mem or --mem-per-cpu, or, if possible, reduce the amount your application is trying to use. For example, many Java programs set heap space using the -Xmx JVM option, which could potentially be reduced.
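For instance (hypothetical numbers and program name), keep the Java heap comfortably below the job’s memory request:
#SBATCH --mem 8000              # 8000 MB for the whole job
java -Xmx6g -jar mytool.jar     # heap capped at 6 GB, leaving room for JVM overhead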
For jobs that require truly large amounts of memory (>256 GB), you may need to use the bigmem SLURM partition. Genome and transcript assembly tools are commonly in this camp.
See this FAQ on determining how much memory your completed batch job used under SLURM.
SLURM Errors: Node Failure
JOB <jobid> CANCELLED AT <time> DUE TO NODE FAILURE:
This message may arise for a variety of reasons, but it indicates that the host on which your job was running can no longer be contacted by SLURM. Not a good sign. Contact RCHelp to help with this problem.
SLURM errors: Socket timed out. What?
If the SLURM master (the process that listens for SLURM requests) is busy, you might receive the following error:
[bfreeman@holylogin02 ~]$ squeue -u bfreeman
squeue: error: slurm_receive_msg: Socket timed out on send/recv operation
slurm_load_jobs error: Socket timed out on send/recv operation
Since SLURM is scheduling 1 job every second (let alone doing the calculations to schedule this job on 1 of approximately 100,000 compute nodes), it’s going to be a bit busy at times. Don’t worry. Get up, stretch, pet your cat, grab a cup of coffee, and try again.
SLURM Errors: Time limit
JOB <jobid> CANCELLED AT <time> DUE TO TIME LIMIT:
(or you may also see ‘Job step aborted’ when using salloc/srun)
Either you did not specify enough time in your batch submission script, or you didn’t specify the amount of time and SLURM assigned the default time of 10 minutes. The -t option sets time in minutes, or can also take D-HH:MM form (0-12:30 for 12.5 hours). Submit your job again with a longer time window.
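For example, either of these forms works (use one or the other):
#SBATCH -t 90        # 90 minutes
#SBATCH -t 2-00:00   # two days, in D-HH:MM form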
What is Fair-Share?
FairShare is a score that determines what priority you have in the scheduling queue for your jobs. The more jobs you run, the lower your score becomes, temporarily. A number of factors are used to determine this score; please read this Fairshare document for more information.
To find out what your score is, enter `sshare -U` in your terminal session on the cluster to see a listing for your group (this is not your individual score, but an aggregate for your group). In general, a score of 0.5 or above means you have higher priority for scheduling.
Example of a fairly full Fairshare:
$ sshare -U
Account User RawShares NormShares RawUsage EffectvUsage FairShare
------------ ----- ------ -------- ------- ------------- ----------
jharvard2_lab jharv parent 0.000936 171281 0.000003 0.997620
Example of a depleted Fairshare:
$ sshare -U
Account User RawShares NormShares RawUsage EffectvUsage FairShare
------------ ----- ------ -------- ------- ------------- ----------
jharvard_lab johnh parent 0.000936 361920733 0.007145 0.005046
See also: Managing FairShare for Multiple Groups if you belong to more than one lab group
For further information, see the RC fairshare document.
Can I send mail from the cluster?
The short answer is no. You can receive job emails as covered in our Running Jobs doc, but you cannot send email from cluster nodes.
The longer answer is that the FASRC cluster could easily be weaponized to send bulk email if we allowed this and could cause a portion of Harvard’s IP range (or even all of Harvard’s IP range) to be blacklisted. The cluster is intended as a research compute platform and its nodes, while running Linux, are not the same as workstation or server nodes one might be used to. Any post-processing or use of such tools as email or printing should be done using another system.
I see dummy4XD jobs, but I didn’t submit them?
We use a tool called XDMoD for record keeping. In order to ensure our usage statistics are correct, dummy jobs are submitted on behalf of users. You do not need to delete them; they run very quickly and your fairshare is not used for them. It is safe to ignore these jobs.
I see nodes marked as DRAINING, DOWN, or COMPLETING in the partition that I am using. What can I do?
When you see nodes in this state there is nothing you need to do, and there is no need to notify FASRC staff. At any given time there will be a number of nodes in a state of DRAINING, DOWN, or seemingly stuck in the COMPLETING state. This generally means that the scheduler has identified one or more problems with these nodes and has set these states so that the nodes will not accept any jobs until the problem is resolved. FASRC staff patrol the cluster for broken nodes and will open the nodes once they are fixed. If you notice a node is still closed, that just means that FASRC staff have deemed it not ready for service yet. A reason for the node closure is noted in SLURM, which you can see by running scontrol show node NODENAME. If you are curious what these reasons mean, or if you see INCXXXXXX (which indicates a hardware issue we are dealing with), you can contact us to find out more details.
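For example (NODENAME is a placeholder):
scontrol show node NODENAME | grep -i reason   # show the closure reason recorded by the scheduler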
BILLING
VDI (Open OnDemand)
Why is my Jupyter notebook VDI session terminated right after it starts?
This problem is common when there is a conda initialize section in your .bashrc file located in your home directory (more about .bashrc). The conda initialize section was added when, at some point, you used the command conda init. We strongly discourage the use of conda init. Instead, use source activate environment_name; for more details, refer to our Python (Anaconda) page.
To solve this problem, delete or comment out the conda initialize section of your .bashrc and create a new Jupyter notebook VDI session.
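The block to remove or comment out is delimited by marker lines that conda init writes into .bashrc; it generally looks like this (the contents between the markers vary):
# >>> conda initialize >>>
# ... lines added by 'conda init' ...
# <<< conda initialize <<<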