Go to the FAQ for: Investors | Users | Illinois Computes | RCaaS
Investor FAQ
What is the minimum investment for a researcher to ‘buy in’ to the cluster?
The minimum buy-in is one (1) compute node. There will be options for differing memory sizes on the nodes and interconnect type (InfiniBand or 10GbE).
Is the buy-in cost a one-time ‘per node’ cost, or is there a yearly fee also to account for oversight/administration?
The buy-in cost is a one-time cost. System administration is subsidized by the campus.
Is there a similar model for a unit to ‘buy in’?
A unit will be treated no differently than a research team. A unit will presumably have a larger set of researchers associated with the buy-in, but the process is the same. It is anticipated that various units will want to buy in this way to provide some level of support at the unit level for their researchers.
Is it possible to add nodes to an existing primary resource as a node purchase option?
Yes—visit Buy Compute to see options and submit an order request.
Who do I contact with more questions or if I need help?
Contact help@campuscluster.illinois.edu.
What is Infiniband and do I need it?
Low latency matters for HPC applications in which fast communication and response times between compute nodes are critical to performance, such as weather modeling and computational fluid dynamics (CFD). Where that is the case and the cost is justified, InfiniBand can be a better choice of interconnect than Ethernet. For most investors in the campus cluster, however, high-speed Ethernet options suffice for the majority of workloads. If you have questions about whether your application would benefit from InfiniBand as a new or existing investor, please don't hesitate to contact us for a consultation.
User FAQ
Access and Accounts
How can I get access to the Campus Cluster?
Individuals, groups, and campus units at the University of Illinois can invest in compute and storage equipment, purchase compute time on demand, or rent storage space on the Campus Cluster. See Available Resources and Services for details.
Illinois researchers can also get access to the Illinois Campus Cluster via the Illinois Computes program, free of charge! Learn more on the Illinois Computes website.
Which primary queue should I select on the Access Request Form?
The primary queue is mapped to the investor group that has purchased nodes. See the Investors page for the current list of investors.
How can I provide external collaborators access to the Campus Cluster?
The best solution is to get a guest NetID for the collaborators. See the Technology Services page with information on requesting an Affiliate NetID for details.
Once the collaborator has a guest NetID, they can request access via the Access Request for Research form. Alternatively, your ICCP allocation manager/technical representative can add them to your existing group via the investor portal.
What is the lifespan for the Campus Cluster hardware I purchase?
Both compute nodes and storage disks have a lifespan of five years from in-service date.
I am in multiple groups and would like to change my default group.
On the Campus Cluster, the default group is determined by the group ownership of the home directory. To change the group of the home directory, the command is:
chgrp defgroupname $HOME
where defgroupname is the name of the group that you wish to be the default. Log off and log back on for the new default group to take effect.
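For example, a quick check-and-change sequence (the group name my_lab is hypothetical; the groups command lists the groups you belong to):
ls -ld $HOME         # shows the current group ownership of your home directory
groups               # lists the groups you are a member of
chgrp my_lab $HOME   # make my_lab the default group
ls -ld $HOME         # confirm the new group ownership; log off and back on for it to take effect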
Why does my SFTP connection attempt fail with error message: ‘Received message too long 1751714304’?
This error is usually caused by commands in a shell run-control file (.bashrc, .profile, .cshrc, etc.) that produce output to the terminal. To solve this problem, you should place any commands that will produce output in a conditional statement that is executed only if the shell is interactive.
For example, in the .bashrc file:
if tty -s; then echo "This only prints in an interactive shell"; fi
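A slightly fuller sketch of the same pattern for a ~/.bashrc (the specific commands are illustrative only):
# silent setup, such as module loads, can run for every shell
module load vim
# anything that writes to the terminal belongs inside the interactive check
if tty -s; then
    echo "Reminder: scratch files older than 30 days are purged"
fi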
I have problems transferring data to my home directory on the Campus Cluster – I get 0 byte files, partial files, or the files do not transfer at all.
A common reason for this is being over quota in your home directory. Use the quota command to check your usage.
Will I be able to access the Campus Cluster after graduation?
Your access to the cluster after graduation depends on how long your University of Illinois account stays active and on your group membership. If you are a member of a group and your PI removes your access after graduation, you lose access to those resources.
Otherwise, your access remains until your University of Illinois account is removed. You can read campus policy for that information.
System Policies
Are there any user disk quotas in place?
The home directory has a 5GB soft quota and a 7GB hard quota with a 7 day grace period. There is a 10TB/user quota on the cluster’s scratch directory. Project directories have capacity quotas based on investors’ disk purchases, and a 20 million inode/user quota.
I accidentally deleted files in my home directory. Is there any way to get them back?
Nightly snapshots are available for the last 30 days—see the Storage and Data Guide for location of the snapshots.
What are my options for additional disk space?
The scratch filesystem is shared storage space available to all users.
Investor groups may also have project space on the Illinois Campus Cluster that can be used for data storage. Please consult with your technical representative regarding availability and access.
I get error ‘/usr/bin/xauth: error in locking authority file /home//.Xauthority’ when I log in.
This usually indicates that you are over your home directory quota.
Is there a disk purge policy in place?
Files in scratch that are older than 30 days are purged daily to ensure adequate space for all users of the system.
Is there a backup for the storage space, so that data are not lost in case of disk failure?
The file system is configured to protect data in the event of server or drive failures. An investor's data is striped across all drives in the storage appliance to ensure maximum performance and data availability. Purchasing a "single disk" is merely an accounting artifact that provides a purchase method for investors whose grant money must be spent on physical hardware.
The campus cluster program offers a disaster recovery (DR) service that allows users to pay an additional rate (lower than the primary storage rate) to have their data backed up daily to an off-site/off-campus location in case of catastrophic file system failure or a disastrous environmental event (large power surge, severe storm damage to the data center, etc.). If this option is of interest to you or your team, please see the Buy Storage page for more information. While Campus Cluster operators will do everything possible to make the service reliable, keeping only one copy of data is a risk, and no data on the Campus Cluster system should be considered safe with a single copy. Campus Cluster customers are encouraged to back up their important data themselves or by enrolling in the DR program.
Does the cluster provide a long-term storage option for finished projects (e.g., on a tape)?
No, we currently do not offer long-term storage. At the end of a storage investment, we ask investors to make a new storage purchase or migrate data to other locations. We may offer that type of service in the future, but do not at this time. Please check back at a later date for more information.
Programming, Software, and Libraries
Is there a math library available on the Campus Cluster?
The Intel Math Kernel Library (MKL) is available as part of the Intel Compiler Suite. The OpenBLAS library is also available. See the software section of the User Guide for details.
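As an illustration, one common way to find and use the Intel compilers with MKL might look like the following (the module name is an assumption; check module avail for the versions installed on the cluster):
module avail intel                  # list the available Intel compiler modules
module load intel                   # load one (the exact module name may differ)
icc -O2 -mkl myprog.c -o myprog     # -mkl links MKL with the classic Intel compilers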
Is there information on running MATLAB on the Campus Cluster?
See the software section of the User Guide for information on running MATLAB.
What is the process for installing investor-specific software on the Campus Cluster?
See the document Investor Specific Software Installation for recommended guidelines to follow when installing software.
How do I install R packages specific to my needs that are not available in the Campus Cluster installation?
See the document R on the Campus Cluster for information about installing R add-on packages.
How can I access numeric and scientific Python modules on the Campus Cluster?
Python versions 2 and 3 are available. See the output from the commands module avail python and module avail anaconda for the specific modules. Load the needed module into your environment with module load modulefile_name. Note: use the command python3 for Python 3.
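For example, a possible session (the module name shown is an assumption; pick one from the module avail output):
module avail anaconda                # list the available Anaconda modules
module load anaconda3                # load one of them (the exact name may differ)
python3 -c "import numpy; print(numpy.__version__)"   # confirm the scientific stack is available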
How do I enable syntax highlighting in the vi editor on the Campus Cluster?
The default vi/vim installed with the OS on the campus cluster does not include syntax highlighting. You can load a newer version that includes this into your environment with the module command: module load vim
See the Managing Your Environment section of the User Guide for information on adding this version of vi into your regular user environment.
Running Jobs
Why is the wait time of my batch job so long?
There can be various reasons that contribute to job wait times.
- If your job is in your primary queue: all nodes in your investor group may be in use by other primary queue jobs. In addition, since the Campus Cluster allows users access to any idle nodes via the secondary queue, jobs submitted to a primary queue can wait up to the secondary queue's maximum wall time of 4 hours.
- If your job is in the secondary queue: this queue makes use of idle nodes not being used by investor primary queues, so it is almost entirely opportunity scheduled. Secondary jobs will only run if there is a large enough scheduling hole for the number and type of nodes requested.
- Preventative Maintenance (PM) on the Campus Cluster is generally scheduled quarterly on the third Wednesday of the month. If the wall time requested by a job will not allow it to complete before an upcoming PM, the job will not start until after the PM.
- Your job has requested a specific type of resource, for example nodes with 96GB of memory.
- Your job has requested a combination of resources that are incompatible, for example 96GB memory and the cse queue. In this case, the job will never run.
How can I get more than four hours of wall clock time in my batch jobs?
The secondary queue is the default queue on the campus cluster, so batch jobs that do not specify a queue name are routed to it. This queue has a maximum wall time of 4 hours. Specify your primary queue using the --partition option to sbatch for access to longer batch job wall times. The commands sinfo -a --format="%.16R %.4D %.14l" and qstat -q show the maximum wall time for all queues on the Campus Cluster.
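For example, a job script header along these lines requests a longer run in a primary queue (the partition name, wall time, and program name are placeholders; substitute your investor group's queue):
#!/bin/bash
#SBATCH --partition=my_primary_queue   # your investor group's primary queue
#SBATCH --time=24:00:00                # wall times beyond 4 hours require a primary queue
#SBATCH --nodes=1
./my_program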
What does the error in my batch job: ‘=>> PBS: job killed: swap rate due to memory oversubscription is too high Ctrl-C caught… cleaning up processes’ mean?
This indicates that your job used more memory than is available on the node(s) allocated to the batch job. If possible, you can submit jobs to nodes with a larger amount of memory; see Running Jobs for details. For MPI jobs, you can also resolve the issue by running fewer processes on each node (so each MPI process has more memory) or using more MPI processes in total (so each MPI process needs less memory).
What does the error in my batch job: ‘Job exceeded a memory resource limit (vmem, pvmem, etc.). Job was aborted’ mean?
This indicates that your job used more memory than is available on the node(s) allocated to the batch job. If possible, you can submit jobs to nodes with a larger amount of memory; see Running Jobs for details. For MPI jobs, you can also resolve the issue by running fewer processes on each node (so each MPI process has more memory) or using more MPI processes in total (so each MPI process needs less memory).
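As a sketch of the MPI case, assuming nodes with 16 cores and a hypothetical executable mpi_app, you could leave half the cores idle so each rank gets roughly twice the memory:
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8   # 8 of 16 cores per node, so each MPI rank has about twice the memory
srun ./mpi_app                # srun shown here; use your usual MPI launcher if different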
I need to run a large number of single-core (serial) jobs and the jobs are moving very slowly through the batch system.
See here for information on combining multiple serial processes within a single batch job to help expedite job turnaround time.
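One common pattern, sketched below, is to launch several serial tasks in the background within one batch job and wait for them all to finish (the program name, input files, and resource requests are hypothetical):
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --time=04:00:00
# run 16 independent serial tasks concurrently on the allocated node
for i in $(seq 1 16); do
    ./serial_app input_${i}.dat > output_${i}.log &
done
wait    # block until every background task has completed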
I get the following error when I run multi-node batch jobs: ‘Permission denied (publickey,gssapi-keyex,gssapi-with-mic).’
Possible causes:
- The file $HOME/.ssh/authorized_keys has been removed, incorrectly modified, or zeroed out. To resolve, remove or rename the .ssh directory and then log off and log back on; this regenerates a default .ssh directory along with its contents (see the sketch after this list). If you need to add an entry to $HOME/.ssh/authorized_keys, make sure to leave the original entry in place.
- Group writable permissions are set on the user's home directory:
[golubh1 ~]$ ls -ld ~jdoe
drwxrwx--- 15 jdoe SciGrp 32768 Jun 16 14:20 /home/jdoe
To resolve, remove the group writable permissions:
[golubh1 ~]$ chmod g-w ~jdoe
[golubh1 ~]$ ls -ld ~jdoe
drwxr-x--- 15 jdoe SciGrp 32768 Jun 16 14:20 /home/jdoe
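For the first cause, a minimal sketch of the recovery steps (the backup directory name is arbitrary):
mv ~/.ssh ~/.ssh.bak    # set the broken directory aside rather than deleting it
# log off and log back on; a default ~/.ssh and authorized_keys file are regenerated
ls -l ~/.ssh            # confirm the new directory and its contents exist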
How can I move my queued batch job from one queue to another?
Use the scontrol update command. The syntax is:
scontrol update jobid=[JobID] partition=[queue_name]
where queue_name is the queue that you want to move the job to. Note that the operation will not be permitted if the resources requested do not fit the queue limits.
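For instance, to move a hypothetical job 123456 into the secondary queue:
scontrol update jobid=123456 partition=secondary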
What is the maximum wall time on the secondary/secondary-eth queues?
The maximum wall time for the secondary and secondary-eth queues is 4 hours.
Illinois Computes
Illinois Computes Queue Policy on ICCP
The Illinois Computes partition of 16 CPU nodes and 4 GPU nodes (each with 4x NVIDIA A100, for 16 GPUs total) is allocated via the Illinois Computes program, a separate entity from the Campus Cluster. Illinois Computes is an investor in the Campus Cluster, one that grants compute allocations to researchers at no cost.
The standard allocation granted from Illinois Computes to researchers on the Campus Cluster is as follows:
- 100,000 CPU core hours
- 1000 GPU hours
- 5TB Campus Cluster parallel filesystem storage
When a researcher exceeds their allocation, they are placed into a deprioritized primary queue within SLURM. This enables them to continue their work beyond their allocation limit if there are free cycles in the queue. However, researchers still within their allocation limits receive priority over those in the deprioritized queue.
Illinois Computes queue job priority (in order)
- Primary
- De-prioritized Primary
- Secondary (can be used by anyone with access to the Campus Cluster; 4-hour max runtime)
Research Computing as a Service (RCaaS)
How many RCaaS nodes are there? In what configurations?
There are currently 32 compute nodes in the RCaaS queue. These have various hardware configurations, so core counts and memory may differ. Existing node configurations include:
Node Count | Cores (per node) | Memory (per node) | Interconnect
---|---|---|---
4 | 16 | 64GB | Ethernet
4 | 16 | 128GB | Ethernet
8 | 16 | 64GB | InfiniBand
10 | 20 | 64GB | InfiniBand
6 | 20 | 128GB | InfiniBand
How quickly can I get access to RCaaS resources?
Once we have all the appropriate account setup and billing information, the typical turnaround time for RCaaS access is 1-2 business days.
How will this work for GPU jobs?
Currently there are no GPU or Intel Xeon Phi nodes in the RCaaS queue. Down the road, we may introduce GPU, Xeon Phi (MIC), or large memory nodes to RCaaS. At that point there will be additional queues with a different charge rate for jobs that need the special hardware configurations. If you have a current need, however, we would like to document it for our planning purposes. Please send us an email and let us know what kind of hardware nodes you need and for how long you need them.
Can multiple RCaaS requests share resources on the same node, and if so are they each charged for the full node?
The default configuration is for one job per node. There are ways to allow sharing of the resource, but you would have to specify that. We are currently charging for the whole node because the current implementation doesn’t have a way to tell how much of the node you are using. If multiple jobs were to share the node, then all the jobs would be charged for using the whole node. This might be changed in the future, but for now we recommend not sharing.
How is access priority handled for the RCaaS queue?
The RCaaS queue functions as a primary queue for the set of nodes that make up the RCaaS resource pool. Priority is handled in the same way as when multiple users submit jobs to the same primary queue. While secondary queue jobs will still make use of the RCaaS nodes if they are available, the RCaaS queue has a priority boost over secondary jobs.
What is the cost for RCaaS?
Using RCaaS costs $0.0217 per core-hour, which works out to about $15.84 per core-month (a month being roughly 730 hours). This rate incorporates all the hardware, software, and infrastructure; it is approved through Government Costing for charging to grants and is exempt from overhead. Please note that you will be charged for all the cores on the nodes allocated to your job whether you utilize all of them or not, so plan to make as much use of each node as you can.
Will there be a way to check the current RCaaS charges to my group other than waiting for the monthly bill?
We can set up your account to send a notification email when your usage approaches an estimate you provide. Further, your account can be set up to stop access if you reach a specified dollar amount. Notify the Campus Cluster staff if you would like this service enabled on your account.
Does any storage (disk) space come with RCaaS?
Each active RCaaS customer is allocated a minimum of 25GB of User/Group storage while they use the service. If you need additional storage beyond the 25GB allocation, talk with ICCP staff about adding a storage allocation through Research Storage as a Service (RSaaS). RSaaS allows you to rent storage space by the TB per month for as long as you need it.
Is there a time limit for access to this storage space once I’ve finished using RCaaS?
Once an RCaaS account has gone inactive, and assuming no other arrangements have been made for storage (e.g., a hardware investment or the monthly storage service), the customer is expected to remove their data within the next calendar month. After that time period, all data will be removed.
After I have purchased time in the RCaaS queue, is there a time frame in which I must use the resources?
It’s important to note that you can’t “purchase time” before it is used with RCaaS. Rather, you estimate your monthly usage and the duration of time you will use the service. From there, the program will bill you monthly for actual time used. That said, any RCaaS account that sees no usage activity for two consecutive monthly periods will be notified of account termination which will occur at the end of the third consecutive monthly period with no activity. In the event of termination and for a Customer with no other storage investment on the system, the Customer will have the duration of the subsequent billing cycle to move their data off of the system.
Is it possible to add groups to the RCaaS user management screen?
Yes, we can add as many groups as needed. One of the administrators would need to create the group and add an allocation manager. To get this process started, send in a help ticket.
Is there a way for me to allow other users from my campus group or outside collaborators to use my RCaaS resources?
The allocation manager of your RCaaS group can add anyone with a University of Illinois NetID to your RCaaS group, allowing them to make use of your purchase. In order to add outside collaborators they would need to first obtain a NetID. Instructions for requesting a NetID may be found at https://answers.uillinois.edu/illinois/128347.
Is there a limit to how long jobs can run?
The RCaaS queue is configured for a maximum wall time of 120 hours. If you need jobs to run for a longer duration, let us know so we can evaluate if we might be able to help you with these requirements.
How do I submit jobs to the RCaaS resources?
Simply specify the rcaas queue in your job submission. From the command line, specify sbatch --partition=rcaas, or in the job script specify #SBATCH --partition=rcaas.
Note that capitalization matters.
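A minimal RCaaS job script might look like the following (the wall time, node count, and program name are placeholders):
#!/bin/bash
#SBATCH --partition=rcaas       # capitalization matters: the partition name is lowercase
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16    # RCaaS nodes have 16 or 20 cores; see the table above
#SBATCH --time=48:00:00         # must be at or below the 120-hour queue maximum
./my_program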
Does secondary (or test) queue access come with my RCaaS queue time purchase/access?
RCaaS users are able to submit jobs to the Secondary queue if they wish, but they will be charged for that time just like they are charged for jobs submitted to the RCaaS queue.
Can you install (make available) a specific software that I need for use with my RCaaS time?
While we do not install software for individual users/groups, you may install software in your group storage space. If you run into problems you can submit a help ticket and we can attempt to guide your installation, but there are so many software packages in use that we are unlikely to have experience using any particular package. Hopefully, with your knowledge of the package, and our knowledge of the system, we can help get most software working.