Go to the FAQ for: Investors | Users | Illinois Computes | RCaaS

Investor FAQ

The minimum buy-in is one (1) compute node. There are options for different node memory sizes and interconnect types (InfiniBand or 10GbE).

The buy-in cost is a one-time cost. System administration is subsidized by the campus.

A unit will be treated no differently than a research team. A unit will presumably have a larger set of researchers associated with the buy-in, but the process is the same. It is anticipated that various units would want to buy in this way to provide some level of support at the unit level for researchers.

Low latency matters in HPC applications where fast communication between compute nodes is crucial for performance. Where this is highly desired and the cost is justified, InfiniBand can be a great choice of interconnect over Ethernet. HPC applications that take the most advantage of InfiniBand include weather modeling and computational fluid dynamics (CFD). However, for most investors in the campus cluster, high-speed Ethernet options typically suffice for the majority of their workloads. Nonetheless, if you have questions about whether your application would benefit from InfiniBand as a new or existing investor, please don't hesitate to contact us for a consultation.

User FAQ

Access and Accounts

Individuals, groups, and campus units at the University of Illinois can invest in compute and storage equipment, purchase compute time on demand, or rent storage space on the Campus Cluster. See Available Resources and Services for details.

Illinois researchers can also get access to the Illinois Campus Cluster via the Illinois Computes program, free of charge! Learn more on the Illinois Computes website.

The primary queue is mapped to the investor group that has purchased nodes. See the Investors page for the current list of investors.

The best solution is to get a guest NetID for the collaborators. See the Technology Services page with information on requesting an Affiliate NetID for details.

Once you have the guest NetID, you can request access via the Access Request for Research form. Or, have your ICCP allocation manager/technical representative add you to your existing group via the investor portal.

Both compute nodes and storage disks have a lifespan of five years from in-service date.

On the Campus Cluster, the default group is determined by the group ownership of the home directory. To change the group of the home directory, the command is:

chgrp defgroupname $HOME

where defgroupname is the name of the group that you wish to be the default. Log off and log back on for the new default group to take effect.
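
To confirm the change after logging back in, you can check the group ownership of your home directory; the listed group should now be the one you passed to chgrp:

ls -ld $HOME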

This error is usually caused by commands in a shell run-control file (.bashrc, .profile, .cshrc, etc.) that produce output to the terminal. To solve this problem, you should place any commands that will produce output in a conditional statement that is executed only if the shell is interactive.

For example, in the .bashrc file:

if tty -s; then
   # commands that write to the terminal go here, for example:
   echo "Interactive login"
fi
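
If you prefer, an equivalent guard tests the shell's interactive flag rather than the attached terminal; the echoed message is just an example:

if [[ $- == *i* ]]; then
   # interactive shell: safe to write to the terminal
   echo "Interactive login"
fi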

A common cause of this is being over quota in your home directory. Use the quota command to check your usage.

Your access to the cluster post-graduation depends on how long your Illinois account stays active and on your group access. If you are a member of a group and your PI removes your access after graduation, you will lose access to those resources.

Otherwise, your access would remain until your IL account is removed. You can read campus policy for that information: https://techservices.illinois.edu/2023/05/08/when-do-i-lose-access-to-my-university-accounts/

System Policies

The home directory has a 5GB soft quota and a 7GB hard quota with a 7 day grace period. There is a 10TB/user quota on the cluster's scratch directory. Project directories have capacity quotas based on investors' disk purchases, and a 20 million inode/user quota.

Nightly snapshots are available for the last 30 days—see the User Guide for location of the snapshots.

The scratch filesystem is shared storage space available to all users.

Investor groups may also have project space on the Illinois Campus Cluster that can be used for data storage. Please consult with your technical representative regarding availability and access.

Files in scratch that are older than 30 days are purged daily to ensure adequate space for all users of the system.

The file system is configured to protect data in the event of server or drive failures.  An investor’s data is chunked up and stored across all drives in the storage appliance to ensure maximum performance and data availability.  The purchasing of a “single disk” is merely an artifact of accounting to provide a purchase method for investors who have grant money that must be spent on physical hardware.

The campus cluster program offers a DR (Disaster recovery) service that allows users to pay an additional rate (lower than the primary storage rate) for their data to be backed up daily to an off-site/off-campus location in case of catastrophic file system failure or a disastrous environmental event (large power surge, severe storm damage to data center, etc.).  If this option is of interest to you or your team, please check out the Buy Storage page for more information.  While Campus Cluster Operators will do everything to make the service as reliable as possible, having only one copy of data is a risk. So no data on the Campus Cluster system should be considered safe with a single copy. Campus Cluster customers are encouraged to back up their important data themselves or by enrolling in the DR program.

No, we currently do not offer long-term storage. At the end of a storage investment, we ask investors to make a new storage purchase or migrate data to other locations. We may offer that type of service in the future, but do not at this time. Please check back at a later date for more information.

Programming, Software, and Libraries

The Intel Math Kernel Library (MKL) is available as part of the Intel Compiler Suite. The OpenBLAS library is also available. See the Libraries section of the User Guide for details.

Python versions 2 and 3 are available.

See the output from the command module avail python or module avail anaconda for the specific modules.

Load the needed module into your environment with: module load modulefile_name

Note: Use the command python3 for Python 3.
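
For example, a session that loads one of the Anaconda modules might look like the following; the module name is illustrative, so use module avail to see what is actually installed:

module avail anaconda        # list the Anaconda modules installed on the cluster
module load anaconda3        # load one of them (name is illustrative)
python3 --version            # confirm which Python 3 is now in your environment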

The default vi/vim installed with the OS on the campus cluster does not include syntax highlighting. You can load a newer version that includes this into your environment with the module command: module load vim

See the Managing Your Environment section of the User Guide for information on adding this version of vi into your regular user environment.

Running Jobs

There can be various reasons that contribute to job wait times.

  • If your job is in your primary queue:
    All nodes in your investor group are in use by other primary queue jobs. In addition, since the Campus Cluster allows users access to any idle nodes via the secondary queue, jobs submitted to a primary queue could have a wait time of up to the secondary queue maximum wall time of 4 hours.
  • If your job is in the secondary queue:
    Since this queue makes use of idle nodes not being used by investor primary queues, it is almost entirely opportunity scheduled. This means that secondary jobs will only run if there is a big enough scheduling hole on the number and type of nodes requested.
  • Preventative Maintenance (PM) on the Campus Cluster is generally scheduled quarterly on the third Wednesday of the month. If the wall time requested by a job will not allow it to complete before an upcoming PM, the job will not start until after the PM.
  • Your job has requested a specific type of resource—for example, nodes with 96GB memory.
  • Your job has requested a combination of resources that are incompatible—for example, 96GB memory and the cse queue.
    [In this case, the job will never run.]

The secondary queue is the default queue on the campus cluster, so batch jobs that do not specify a queue name are routed to this queue. This queue has a maximum wall time of 4 hours. Specify your primary queue using the --partition option to sbatch for access to longer batch job wall times. The commands sinfo -a --format="%.16R %.4D %.14l" and qstat -q let users view the maximum wall time for all queues on the Campus Cluster.
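
For example, to submit a job to your investor (primary) queue rather than the default secondary queue, you might run something like the following (the queue and script names here are illustrative; substitute your own):

sbatch --partition=my_investor_queue myjob.sbatch

or place the equivalent directive in the batch script itself:

#SBATCH --partition=my_investor_queue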

This indicates that your job used more memory than is available on the node(s) allocated to the batch job. If possible, you can submit jobs to nodes with a larger amount of memory—see Memory needs for details. For MPI jobs, you can also resolve the issue by running fewer processes on each node (so each MPI process will have more memory) or by using more MPI processes in total (so each MPI process will need less memory).
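
For example, with Slurm you can keep the same node count but place fewer MPI ranks on each node so that each rank has more memory available. The values and program name below are illustrative:

#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8   # fewer ranks per node leaves more memory per rank

srun ./my_mpi_app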

See here for information on combining multiple serial processes within a single batch job to help expedite job turnaround time.

Possible causes:

  • when the file $HOME/.ssh/authorized_keys has been removed, incorrectly modified, or zeroed out. To resolve, remove or rename the .ssh directory (see the example after this list), and then log off and log back on. This will regenerate a default .ssh directory along with its contents. If you need to add an entry to $HOME/.ssh/authorized_keys, make sure to leave the original entry in place.
  • when group writable permissions are set for the user's home directory.
    [golubh1 ~]$ ls -ld ~jdoe
    drwxrwx--- 15 jdoe SciGrp 32768 Jun 16 14:20 /home/jdoe
    

    To resolve, remove the group writable permissions:

    [golubh1 ~]$ chmod g-w ~jdoe
    [golubh1 ~]$ ls -ld ~jdoe
    drwxr-x--- 15 jdoe SciGrp 32768 Jun 16 14:20 /home/jdoe
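
For the first cause listed above (a missing or damaged authorized_keys file), a minimal sketch of the fix is to rename the .ssh directory and then log off and back on; the backup directory name is just an example:

[golubh1 ~]$ mv $HOME/.ssh $HOME/.ssh.bak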
    

Use the scontrol update command. The syntax is:

scontrol update jobid=[JobID] partition=[queue_name]

where queue_name is the queue that you want to move the job to. Note that the operation will not be permitted if the resources requested do not fit the queue limits.
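
For example, to move a pending job into the secondary queue (the job ID shown is illustrative):

scontrol update jobid=1234567 partition=secondary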

The secondary and secondary-eth queues have a maximum wall time of 4 hours.

Illinois Computes

The Illinois Computes partition consists of 16 CPU nodes and 4 GPU nodes, each GPU node with 4x NVIDIA A100 GPUs (16 GPUs total). It is allocated via the Illinois Computes program, a separate entity from the Campus Cluster. Illinois Computes is an investor in the Campus Cluster, one that gives away allocations for compute.

The standard allocation granted from Illinois Computes to researchers on the Campus Cluster is as follows:

  • 100,000 CPU core hours
  • 1000 GPU hours
  • 5TB Campus Cluster parallel filesystem storage

When a researcher exceeds their allocation, they are placed into a deprioritized primary queue within SLURM. This enables them to continue their work beyond their allocation limit if there are free cycles in the queue. However, researchers still within their allocation limits receive priority over those in the deprioritized queue.

Illinois Computes queue job priority (in order)

  • Primary
  • De-prioritized Primary
  • Secondary (can be used by anyone with access to campus cluster, 4H max runtime)

Research Computing as a Service (RCaaS)

There are currently 32 compute nodes in the RCaaS queue. These have various hardware configurations, so core counts and memory may differ. Existing node configurations include:

Node Count   Cores (per node)   Memory (per node)   Interconnect
4            16                 64GB                Ethernet
4            16                 128GB               Ethernet
8            16                 64GB                InfiniBand
10           20                 64GB                InfiniBand
6            20                 128GB               InfiniBand

Once we have all the appropriate account setup and billing information, the typical turnaround time for RCaaS access is 1-2 business days.

Currently there are no GPU or Intel Xeon Phi nodes in the RCaaS queue. Down the road, we may introduce GPU, Xeon Phi (MIC), or large memory nodes to RCaaS. At that point there will be additional queues with a different charge rate for jobs that need the special hardware configurations. If you have a current need, however, we would like to document it for our planning purposes. Please send us an email and let us know what kind of hardware nodes you need and for how long you need them.

The default configuration is for one job per node. There are ways to allow sharing of the resource, but you would have to specify that. We are currently charging for the whole node because the current implementation doesn't have a way to tell how much of the node you are using. If multiple jobs were to share the node, then all the jobs would be charged for using the whole node. This might be changed in the future, but for now we recommend not sharing.

The RCaaS queue functions as a primary queue for the set of nodes that make up the RCaaS resource pool. Priority is controlled in the same way as when multiple users submit jobs to the same primary queue. While secondary queue jobs will still make use of the RCaaS nodes if they are available, the RCaaS queue will have a priority boost over secondary jobs.

Using RCaaS costs $0.0217 per core-hour, which works out to $15.84 per core-month. This incorporates all the hardware, software, and infrastructure. This rate is approved through Government Costing for charging to grants and is exempt from overhead. Please note that you will be charged for all the cores on the nodes allocated to your job whether you utilize all of them or not, so plan to make as much use of each node as you can.
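
For example, a job that occupies one 16-core node for 24 hours is billed for 16 × 24 = 384 core-hours, or roughly $8.33 at the current rate, regardless of how many of those cores it actually uses.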

We can set up your account to send a notification email when your usage approaches an estimate you provide. Further, your account can be set up to stop access if you reach a specified dollar amount. Notify the Campus Cluster staff if you would like this service enabled on your account.

Each active RCaaS customer will be allocated a minimum of 25GB of User/Group storage while they use the service. If you need additional storage beyond the 25GB allocation, talk with ICCP staff about adding an additional storage allocation through Research Storage as a Service (RSaaS). RSaaS allows you to rent storage space by the TB/month for as long as you need it.

Once an RCaaS account has gone inactive, and assuming no other arrangements have been made for storage (e.g. hardware investment or monthly storage service), the customer will be expected to remove their data within the next calendar month. After that time period, all data will be removed.

It's important to note that you can't "purchase time" before it is used with RCaaS. Rather, you estimate your monthly usage and the duration of time you will use the service. From there, the program will bill you monthly for actual time used. That said, any RCaaS account that sees no usage activity for two consecutive monthly periods will be notified of account termination, which will occur at the end of the third consecutive monthly period with no activity. In the event of termination, and for a Customer with no other storage investment on the system, the Customer will have the duration of the subsequent billing cycle to move their data off of the system.

Yes, we can add as many groups as needed. One of the administrators would need to create the group and add an allocation manager. To get this process started, send in a help ticket.

The allocation manager of your RCaaS group can add anyone with a University of Illinois NetID to your RCaaS group, allowing them to make use of your purchase. In order to add outside collaborators they would need to first obtain a NetID. Instructions for requesting a NetID may be found at https://answers.uillinois.edu/illinois/47711.

The RCaaS queue is configured for a maximum wall time of 120 hours. If you need jobs to run for a longer duration, let us know so we can evaluate if we might be able to help you with these requirements.

Specify the rcaas queue when you submit your job. From the command line, use "sbatch --partition=rcaas"; in a batch script, include the line "#SBATCH --partition=rcaas".

Note that capitalization matters.
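
For reference, a minimal batch script targeting the RCaaS queue might look like the following; the resource requests, module name, and program name are illustrative, so adjust them for your own work:

#!/bin/bash
#SBATCH --partition=rcaas
#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --time=04:00:00
#SBATCH --job-name=rcaas_example

# load whatever modules your application needs, then run it
module load anaconda3
python3 my_analysis.py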

RCaaS users are able to submit jobs to the Secondary queue if they wish, but they will be charged for that time just like they are charged for jobs submitted to the RCaaS queue.

While we do not install software for individual users/groups, you may install software in your group storage space. If you run into problems you can submit a help ticket and we can attempt to guide your installation, but there are so many software packages in use that we are unlikely to have experience using any particular package. Hopefully, with your knowledge of the package, and our knowledge of the system, we can help get most software working.