Investor FAQ

The minimum buy-in is one (1) compute node. Options are offered for different node memory sizes and interconnect types (InfiniBand or GigE).

The buy-in cost is a one-time cost. System administration is subsidized by the campus.

A unit is treated no differently from a research team. A unit will presumably have a larger set of researchers associated with the buy-in, but the process is the same. It is anticipated that various units will want to buy in this way to provide some level of support for their researchers at the unit level.

User FAQ

Access and Accounts

Individuals, groups, and campus units at the University of Illinois can invest in compute and storage equipment, purchase compute time on demand, or rent storage space on the Campus Cluster. See Available Resources and Services for details.

The primary queue is mapped to the investor group that has purchased nodes. See the Investors page for the current list of investors.

The best solution is to get a guest NetID for the collaborators. See the Technology Services page Requesting a Guest NetID for details. Once you have the guest NetID, request access via the Access Request for Research form.

Both compute nodes and storage disks have a lifespan of five years from in-service date.

On the Campus Cluster, the default group is determined by the group ownership of the home directory. To change the group of the home directory, the command is:

chgrp defgroupname $HOME

where defgroupname is the name of the group that you wish to be the default. Log off and log back on for the new default group to take effect.
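
As a quick check before and after making the change, you can list the groups your account belongs to and the current group ownership of your home directory with standard commands:

groups           # list all groups your account belongs to
ls -ld $HOME     # the fourth field shows the current group ownership of your home directory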

This error is usually caused by commands in a shell run-control file (.bashrc, .profile, .cshrc, etc.) that produce output to the terminal. To solve this problem, you should place any commands that will produce output in a conditional statement that is executed only if the shell is interactive.

For example, in the .bashrc file:

if tty -s; then
   <your echo statements>;
fi
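
For instance, a .bashrc that prints a reminder only for interactive logins might look like the sketch below (the echo line is just a placeholder for your own output-producing commands):

# Settings that produce no output can go anywhere in ~/.bashrc
export EDITOR=vim

# Commands that write to the terminal belong inside the interactive check
if tty -s; then
    echo "Reminder: scratch files older than 30 days are purged"
fi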

A common cause of this is being over quota in your home directory. Use the quota command to check your usage.

System Policies

The home directory has a 2GB quota. There are currently no quotas in the scratch directory. Project directories have quotas based on investors' disk purchases.

Nightly snapshots are available for the last 30 days—see the User Guide for location of the snapshots.

The scratch filesystem is shared storage space available to all users.

Investor groups may also have project space on the Illinois Campus Cluster that can be used for data storage. Please consult with your technical representative regarding availability and access.

Files in scratch that are older than 30 days are purged daily Monday through Friday to ensure adequate space for all users of the system.

Programming, Software, and Libraries

The Intel Math Kernel Library (MKL) is available as part of the Intel Compiler Suite. The OpenBLAS library is also available. See the Libraries section of the User Guide for details.
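
As a minimal sketch (the module names below are assumptions; confirm them with module avail, and note that the modules are expected to set the necessary compiler and library paths), compiling against MKL or OpenBLAS might look like:

# Intel MKL via the Intel compiler
module load intel
icc -o myprog myprog.c -mkl

# OpenBLAS with GCC
module load openblas
gcc -o myprog myprog.c -lopenblas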

See the document Investor Specific Software Installation for recommended guidelines to follow when installing software.

Python versions 2 and 3 are available.

See the output from the command module avail python for the specific modules.

Load the needed module into your environment with: module load <module_name>, where <module_name> is one of the modules listed by module avail python.

Note: Use the command python3 for Python 3.
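
For example (the module name below is a placeholder; substitute one listed by module avail python):

module avail python          # list the available Python modules
module load python/3         # placeholder name; use a module listed above
python3 --version            # confirm the interpreter now on your PATH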

The default vi/vim installed with the OS on the campus cluster does not include syntax highlighting. You can load a newer version that includes this into your environment with the module command: module load vim

See the Managing Your Environment section of the User Guide for information on adding this version of vi into your regular user environment.
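
One simple way to make this persistent, sketched here assuming your login shell is bash, is to load the module from your .bashrc:

# In ~/.bashrc: load the newer vim at every login
module load vim
# Optional: have "vi" start the newer vim as well
alias vi=vim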

Running Jobs

Various factors can contribute to job wait times:

  • If your job is in your primary queue:
    All nodes in your investor group are in use by other primary queue jobs. In addition, since the Campus Cluster allows users access to any idle nodes via the secondary queue, jobs submitted to a primary queue could have a wait time of up to the secondary queue maximum wall time of 4 hours.
  • If your job is in the secondary queue:
    Since this queue makes use of idle nodes not being used by investor primary queues, it is almost entirely opportunity scheduled. This means that secondary jobs will only run if there is a large enough scheduling hole for the number and type of nodes requested.
  • Preventative Maintenance (PM) on the Campus Cluster is generally scheduled on the third Wednesday of each month. If the wall time requested by a job will not allow it to complete before an upcoming PM, the job will not start until after the PM.
  • Your job has requested a specific type of resource—for example, nodes with 96GB memory.
  • Your job has requested a combination of resources that are incompatible (for example, 96GB memory and the cse queue); in this case, the job will never run.

The secondary queue is the default queue on the campus cluster, so batch jobs that do not specify a queue name are routed to this queue. This queue has a maximum wall time of 4 hours. Specify your primary queue using the -q option to qsub for access to longer job wall times. The qstat -q command gives the maximum wall time for all queues on the campus cluster.
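
As a sketch (the queue name, resource counts, and walltime are placeholders; substitute your own investor queue and requirements), a batch script that targets a primary queue for a longer run might look like:

#!/bin/bash
#PBS -q myqueue              # placeholder: the name of your investor (primary) queue
#PBS -l nodes=1:ppn=16       # example resource request
#PBS -l walltime=24:00:00    # longer than the 4-hour secondary queue limit
#PBS -N example_job

cd $PBS_O_WORKDIR            # start in the directory the job was submitted from
./my_program                 # placeholder for your executable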

This indicates that your job used more memory than is available on the node(s) allocated to the batch job. If possible, submit the job to nodes with a larger amount of memory (see Memory needs for details). For MPI jobs, you can also resolve the issue by running fewer processes on each node (so each MPI process has more memory) or by using more MPI processes in total (so each MPI process needs less memory).
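
As an illustration (node counts, core counts, and launcher flags are assumptions; -ppn follows MPICH/Intel MPI conventions, and other MPI stacks use different per-node options), requesting four 16-core nodes but launching only 8 ranks per node gives each rank roughly twice the memory:

#!/bin/bash
#PBS -l nodes=4:ppn=16       # example: 4 nodes, 16 cores each
#PBS -l walltime=04:00:00

cd $PBS_O_WORKDIR
# 32 ranks spread 8 per node instead of the full 64, so each rank has ~2x the memory
mpiexec -n 32 -ppn 8 ./my_mpi_program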

See the Campus Cluster documentation for information on combining multiple serial processes within a single batch job to help improve job turnaround time.

Possible causes:

  • when the file $HOME/.ssh/authorized_keys has been removed, incorrectly modified, or zeroed out. To resolve, remove or rename the .ssh directory, and then log off and log back on. This will regenerate a default .ssh directory along with its contents. If you need to add an entry to $HOME/.ssh/authorized_keys, make sure to leave the original entry in place.
  • when group writable permissions are set for the user's home directory.
    [golubh1 ~]$ ls -ld ~jdoe
    drwxrwx--- 15 jdoe SciGrp 32768 Jun 16 14:20 /home/jdoe
    

    To resolve, remove the group writable permissions:

    [golubh1 ~]$ chmod g-w ~jdoe
    [golubh1 ~]$ ls -ld ~jdoe
    drwxr-x--- 15 jdoe SciGrp 32768 Jun 16 14:20 /home/jdoe
    

Use the qmove command. The syntax is: qmove queue_name JobID

where queue_name is the queue that you want to move the job to. Note that the operation will not be permitted if the resources requested do not fit the queue limits.
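
For example (the job ID and queue name are illustrative):

qmove secondary 123456       # move job 123456 into the secondary queue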

Research Computing as a Service (RCaaS)

There are currently 24 compute nodes in the RCaaS queue. These have various hardware configurations, so core counts and memory may differ. Existing node configurations include:

Node Count    Cores (per node)    Memory (per node)
8             16                  64GB
10            20                  64GB
6             20                  128GB

Currently there are no GPU or Intel Xeon Phi nodes in the RCaaS queue. Down the road, we may introduce GPU, Xeon Phi (MIC), or large memory nodes to RCaaS. At that point there will be additional queues with a different charge rate for jobs that need the special hardware configurations. If you have a current need, however, we would like to document it for our planning purposes. Please send us an email and let us know what kind of hardware nodes you need and for how long you need them.

The default configuration is for one job per node. There are ways to allow sharing of the resource, but you would have to specify that. We are currently charging for the whole node because the current implementation doesn't have a way to tell how much of the node you are using. If multiple jobs were to share the node, then all the jobs would be charged for using the whole node. This might be changed in the future, but for now we recommend not sharing.

The RCaaS queue functions as a primary queue for the set of nodes that make up the RCaaS resource pool. Priority is handled the same way as when multiple users submit jobs to the same primary queue. While secondary queue jobs will still make use of the RCaaS nodes if they are available, the RCaaS queue has a priority boost over secondary jobs.

Using RCaaS costs $0.0217 per core-hour, which works out to $15.84 per core-month (roughly 730 hours in a month × $0.0217). This rate covers all of the hardware, software, and infrastructure; it has been approved through Government Costing for charging to grants and is exempt from overhead. Please note that you will be charged for all the cores on the nodes allocated to your job whether you use all of them or not, so plan to make as much use of each node as you can.
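
For example, a job that runs for 10 hours on one 20-core node is charged for the whole node: 20 cores × 10 hours × $0.0217 = $4.34, regardless of how many of those cores the job actually uses.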

We can set up your account to send a notification email when your usage approaches an estimate you provide. Further, your account can be set up to stop access if you reach a specified dollar amount. Notify the Campus Cluster staff if you would like this service enabled on your account.

Each active RCaaS customer is allocated a minimum of 25GB of User/Group storage while they use the service. If you need additional storage beyond the 25GB allocation, talk with ICCP staff about adding a storage allocation through Research Storage as a Service (RSaaS), which lets you rent storage space by the TB per month for as long as you need it.

Once an RCaaS account has gone inactive, and assuming no other storage arrangements have been made (e.g., a hardware investment or the monthly storage service), the customer is expected to remove their data within the next calendar month. After that time period, all data will be removed.

It's important to note that you can't "purchase time" in advance with RCaaS. Rather, you estimate your monthly usage and how long you will use the service, and the program then bills you monthly for actual time used. That said, any RCaaS account that sees no usage activity for two consecutive monthly periods will be notified of account termination, which will occur at the end of the third consecutive monthly period with no activity. In the event of termination, a customer with no other storage investment on the system will have the duration of the subsequent billing cycle to move their data off of the system.

Yes, we can add as many groups as needed. One of the administrators would need to create the group and add an allocation manager. To get this process started, send in a help ticket.

The allocation manager of your RCaaS group can add anyone with a University of Illinois NetID to your RCaaS group, allowing them to make use of your purchase. In order to add outside collaborators they would need to first obtain a NetID. Instructions for requesting a NetID may be found at https://answers.uillinois.edu/illinois/47711.

The RCaaS queue is configured for a maximum wall time of 120 hours. If you need jobs to run for a longer duration, let us know so we can evaluate if we might be able to help you with these requirements.

Simply specify the rcaas queue in your job submission: from the command line, use "qsub -q rcaas"; in a batch script, use "#PBS -q rcaas".

Note that capitalization matters.
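
Putting it together, a minimal RCaaS batch script might look like the sketch below (the resource request and program name are placeholders):

#!/bin/bash
#PBS -q rcaas                # the RCaaS queue; lowercase, since capitalization matters
#PBS -l nodes=1:ppn=16       # example request; remember you are charged for every core on the node
#PBS -l walltime=24:00:00    # must be within the 120-hour RCaaS limit

cd $PBS_O_WORKDIR
./my_program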

RCaaS users are able to submit jobs to the Secondary queue if they wish, but they will be charged for that time just like they are charged for jobs submitted to the RCaaS queue.

While we do not install software for individual users/groups, you may install software in your group storage space. If you run into problems you can submit a help ticket and we can attempt to guide your installation, but there are so many software packages in use that we are unlikely to have experience using any particular package. Hopefully, with your knowledge of the package, and our knowledge of the system, we can help get most software working.
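
As a rough sketch for an autotools-style package (the install path below is a placeholder; substitute your group's actual project or group storage directory):

# Build and install into group-owned storage instead of system locations
./configure --prefix=/projects/mygroup/sw/mypackage    # placeholder path
make
make install

# Make the installation visible in your environment (e.g., from ~/.bashrc)
export PATH=/projects/mygroup/sw/mypackage/bin:$PATH
export LD_LIBRARY_PATH=/projects/mygroup/sw/mypackage/lib:$LD_LIBRARY_PATH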