Investor FAQ

What is the minimum investment for a researcher to "buy in" to the cluster?

The minimum buy-in is one (1) compute node. There will be options for node memory size and interconnect type (InfiniBand or GigE).

Is the buy-in cost a one-time "per node" cost, or is there also a yearly fee to cover oversight/administration?

The buy-in cost is a one-time cost. System administration is subsidized by the campus.

Is there a similar model for a unit to "buy in"?

A unit will be treated no differently than a research team. A unit will presumably have a larger set of researchers associated with the buy-in, but the process is the same. It is anticipated that various units will want to buy in this way to provide some unit-level support for their researchers.

Is it possible to add nodes to an existing primary resource as a node purchase option?

Yes—visit Buy Compute to see options and submit an order request.

Who do I contact with more questions or if I need help?

 

User FAQ

Access and Accounts

How can I get access to the Campus Cluster?

Individuals, groups, and campus units at the University of Illinois can invest in compute and storage equipment, purchase compute time on demand, or rent storage space on the Campus Cluster. See Available Resources and Services for details.

Which primary queue should I select on the Access Request Form?

The primary queue is mapped to the investor group that has purchased nodes. See the Investors page for the current list of investors.

How can I provide external collaborators access to the Campus Cluster?

The best solution is to get a guest NetID for the collaborators. See the Technology Services page Requesting a Guest NetID for details. Once you have the guest NetID, request access via the Access Request for Research form.

What is the lifespan for the Campus Cluster hardware I purchase?

Both compute nodes and storage disks have a lifespan of 5 years from the in-service date.

I am in multiple groups and would like to change my default group.

On the Campus Cluster, the default group is determined by the group ownership of your home directory. To change the group of your home directory, run:

chgrp defgroupname $HOME

where defgroupname is the name of the group that you wish to be the default. Log off and log back on for the new default group to take effect.

Why does my SFTP connection attempt fail with error message: Received message too long 1751714304?

This error is usually caused by commands in a shell run-control file (.bashrc, .profile, .cshrc, etc.) that produce output to the terminal. To solve this problem, you should place any commands that will produce output in a conditional statement that is executed only if the shell is interactive.

For example, in the .bashrc file:

if tty -s; then
   <your echo statements>;
fi
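
As an alternative to tty -s, the shell records its option flags in the special parameter $-, which contains the letter i only in interactive shells. The following is a minimal sketch of that check, not part of the Campus Cluster configuration; the function name is_interactive and the greeting text are illustrative assumptions.

```shell
# Hypothetical helper for ~/.bashrc: succeeds only when the given
# option-flag string (normally "$-") contains "i" (interactive shell).
is_interactive() {
  case "$1" in
    *i*) return 0 ;;
    *)   return 1 ;;
  esac
}

# Produce output only in interactive sessions, so SFTP and scp
# connections see no unexpected text from the run-control file.
if is_interactive "$-"; then
  echo "Welcome to the cluster"
fi
```

Passing "$-" as an argument keeps the check easy to try out by hand with sample flag strings.
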

I have problems transferring data to my home directory on the Campus Cluster: I get 0-byte files, partial files, or the files do not transfer at all.

A common cause is being over quota in your home directory. Use the quota command to check your usage.

System Policies

Are there any user disk quotas in place?

The home directory has a 2 GB quota. There are currently no quotas in the scratch directory. Project directories have quotas based on investors' disk purchases.

I accidentally deleted files in my home directory. Is there any way to get them back?

Nightly snapshots are available for the last 30 days. See the User Guide for the location of the snapshots.

What are my options for additional disk space?

The scratch filesystem is shared storage space available to all users.

Investor groups may also have project space on the Illinois Campus Cluster that can be used for data storage. Please consult with your technical representative regarding availability and access.

I get error /usr/bin/xauth: error in locking authority file /home/<user>/.Xauthority when I log in.

This usually indicates that you are over your home directory quota.

Is there a disk purge policy in place?

Files in scratch that are older than 30 days are purged daily Monday through Friday to ensure adequate space for all users of the system.
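
To see which of your scratch files might be affected, the find command can list files that have not been modified in more than 30 days. This is only a sketch under assumptions: the policy above does not say whether file age is judged by modification or access time, the function name list_purge_candidates is hypothetical, and the scratch path must be replaced with your own.

```shell
# Hypothetical helper: list regular files under a directory whose
# modification time is more than N days old (default 30).
# Assumption: purge eligibility is judged by modification time.
list_purge_candidates() {
  dir=$1
  days=${2:-30}
  # find counts age in whole 24-hour periods, so +30 matches files
  # that are at least 31 days old.
  find "$dir" -type f -mtime +"$days" -print
}

# Example call (replace the argument with your own scratch directory):
# list_purge_candidates /path/to/your/scratch
```
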

Programming, Software, and Libraries

Is there a math library available on the Campus Cluster?

The Intel Math Kernel Library (MKL) is available as part of the Intel Compiler Suite. The OpenBLAS library is also available. See the Libraries section of the User Guide for details.

Is there information on running MATLAB on the Campus Cluster?

See the document Using MATLAB on the Campus Cluster for information.

What is the process for installing investor-specific software on the Campus Cluster?

See the document Investor Specific Software Installation for recommended guidelines to follow when installing software.

How do I install R packages specific to my needs that are not available in the Campus Cluster installation?

See the document R on the Campus Cluster for information about installing R add-on packages.

How can I access numeric and scientific Python modules on the Campus Cluster?

Python versions 2 and 3 are available.

See the output from the command module avail python for the specific modules.

Load the needed module into your environment with: module load <module name>

Note: Use the command python3 for Python 3.

How do I enable syntax highlighting in the vi editor on the Campus Cluster?

The default vi/vim installed with the OS on the Campus Cluster does not include syntax highlighting. You can load a newer version that includes it into your environment with the module command: module load vim

See the Managing Your Environment section of the User Guide for information on adding this version of vi into your regular user environment.

Running Jobs

Why is the wait time of my batch job so long?

Several factors can contribute to job wait times.

  • If your job is in your primary queue:
    All nodes in your investor group are in use by other primary queue jobs. In addition, since the Campus Cluster allows users access to any idle nodes via the secondary queue, jobs submitted to a primary queue could have a wait time of up to the secondary queue maximum wall time of 4 hours.
  • If your job is in the secondary queue:
    Since this queue makes use of idle nodes not being used by investor primary queues, it is almost entirely opportunity scheduled. This means that secondary jobs will only run if there is a big enough scheduling hole on the number and type of nodes requested.
  • Preventative Maintenance (PM) on the Campus Cluster is generally scheduled on the third Wednesday of each month. If the wall time requested by a job will not allow it to complete before an upcoming PM, the job will not start until after the PM.
  • Your job has requested a specific type of resource—for example, nodes with 96 GB memory.
  • Your job has requested an incompatible combination of resources, for example 96 GB memory and the cse queue. In this case, the job will never run.

How can I get more than four hours of wall clock time in my batch jobs?

The secondary queue is the default queue on the Campus Cluster, so batch jobs that do not specify a queue name are routed to this queue. This queue has a maximum wall time of 4 hours. Specify your primary queue using the -q option to qsub for access to longer job wall times. The qstat -q command lists the maximum wall time for all queues on the Campus Cluster.
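
As an illustration, the queue and wall time can also be set with #PBS directives inside the job script instead of on the qsub command line. The sketch below prints such a script. The queue name myqueue, the 24-hour wall time, and ./my_program are placeholders, and make_job_script is a hypothetical helper rather than a cluster-provided command. The wall time you request must still fit the limit that qstat -q reports for your queue.

```shell
# Hypothetical helper: print a minimal PBS job script.
#   $1 = queue name, $2 = wall time (HH:MM:SS), $3 = command to run
make_job_script() {
  cat <<EOF
#!/bin/bash
#PBS -q $1
#PBS -l walltime=$2
cd "\$PBS_O_WORKDIR"
$3
EOF
}

# Placeholder values; substitute your investor group's primary queue.
make_job_script myqueue 24:00:00 ./my_program
# To use it: redirect the output to a file such as job.pbs, then run
#   qsub job.pbs
```
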

What does the error in my batch job: =>> PBS: job killed: swap rate due to memory oversubscription is too high Ctrl-C caught... cleaning up processes mean?

This indicates that your job used more memory than was available on the node(s) allocated to the batch job. If possible, submit the job to nodes with more memory; see Memory needs for details. For MPI jobs, you can also resolve the issue by running fewer processes on each node (so each MPI process has more memory) or by using more MPI processes in total (so each MPI process needs less memory).

What does the error in my batch job: Job exceeded a memory resource limit (vmem, pvmem, etc.). Job was aborted mean?

This indicates that your job used more memory than was available on the node(s) allocated to the batch job. If possible, submit the job to nodes with more memory; see Memory needs for details. For MPI jobs, you can also resolve the issue by running fewer processes on each node (so each MPI process has more memory) or by using more MPI processes in total (so each MPI process needs less memory).
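
The effect of running fewer MPI processes per node can be estimated with simple division. The sketch below is a rough illustration only: it ignores memory used by the operating system, the node size of 64 GB is just an example, and mem_per_process_gb is a hypothetical helper name.

```shell
# Hypothetical helper: approximate memory available to each MPI process.
#   $1 = node memory in GB, $2 = MPI processes per node
mem_per_process_gb() {
  awk -v mem="$1" -v ppn="$2" 'BEGIN { printf "%.1f\n", mem / ppn }'
}

mem_per_process_gb 64 16   # 16 processes on a 64 GB node: 4.0 GB each
mem_per_process_gb 64 8    # 8 processes on the same node: 8.0 GB each
```
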

I need to run a large number of single-core (serial) jobs and the jobs are moving very slowly through the batch system.

See here for information on combining multiple serial processes within a single batch job to help expedite job turnaround time.

I get the following error when I run multi-node batch jobs: Permission denied (publickey,gssapi-keyex,gssapi-with-mic).

Possible causes:

  • The file $HOME/.ssh/authorized_keys has been removed, incorrectly modified, or zeroed out. To resolve, remove or rename the .ssh directory, then log off and log back on. This regenerates a default .ssh directory along with its contents. If you need to add an entry to $HOME/.ssh/authorized_keys, make sure to leave the original entry in place.
  • Group-writable permissions are set on the user's home directory:
    [golubh1 ~]$ ls -ld ~jdoe
    drwxrwx--- 15 jdoe SciGrp 32768 Jun 16 14:20 /home/jdoe
    

    To resolve, remove the group writable permissions:

    [golubh1 ~]$ chmod g-w ~jdoe
    [golubh1 ~]$ ls -ld ~jdoe
    drwxr-x--- 15 jdoe SciGrp 32768 Jun 16 14:20 /home/jdoe
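
The same condition can be tested without reading the ls output by eye, using a find test on the permission bits. This is a minimal sketch, assuming a find that supports -maxdepth (GNU and BSD find do); is_group_writable is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if the given directory has the
# group-write permission bit (octal 020) set.
is_group_writable() {
  [ -n "$(find "$1" -maxdepth 0 -perm -020 -print)" ]
}

if is_group_writable "$HOME"; then
  echo "Home directory is group-writable; run: chmod g-w \"\$HOME\""
else
  echo "Home directory is not group-writable"
fi
```
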
    

How can I move my queued batch job from one queue to another?

Use the command qmove. The syntax is: qmove queue_name JobID

where queue_name is the queue that you want to move the job to. Note that the operation will not be permitted if the resources requested do not fit the queue limits.

 

Research Computing as a Service (RCaaS)

How many RCaaS nodes are there? In what configurations?

There are currently 24 compute nodes in the RCaaS queue. These have various hardware configurations, so core counts and memory may differ. Existing node configurations include:

Node Count    Cores (per node)    Memory (per node)
8             16                  64 GB
10            20                  64 GB
6             20                  128 GB

How will this work for GPGPU jobs?

Currently there are no GPU or Intel Xeon Phi nodes in the RCaaS queue. In the future, we may introduce GPU, Xeon Phi (MIC), or large-memory nodes to RCaaS. At that point there will be additional queues with a different charge rate for jobs that need these special hardware configurations. If you have a current need, however, we would like to document it for planning purposes. Please send us an email describing the kind of nodes you need and for how long you need them.

Can multiple RCaaS requests share resources on the same node, and if so are they each charged for the full node?

The default configuration is one job per node. Node sharing is possible, but you must request it explicitly. We currently charge for the whole node because the current implementation has no way to tell how much of the node you are using. If multiple jobs share a node, each of those jobs is charged for the whole node. This might change in the future, but for now we recommend not sharing.

How is access priority handled for the RCaaS queue?

The RCaaS queue functions as a primary queue for the set of nodes that make up the RCaaS resource pool. Priority is handled in the same way as when multiple users submit jobs to the same primary queue. While secondary queue jobs will still make use of the RCaaS nodes if they are available, RCaaS queue jobs have a priority boost over secondary jobs.

What is the cost for RCaaS?

Using RCaaS costs $0.0217 per core-hour, which works out to $15.84 per core-month. This rate covers all hardware, software, and infrastructure. It is approved through Government Costing for charging to grants and is exempt from overhead. Please note that you are charged for all the cores on the nodes allocated to your job, whether or not you use them all, so plan to make as much use of each node as you can.
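
Because every core on an allocated node is billed, the charge for a job can be estimated as nodes × cores per node × hours × rate. The sketch below applies the $0.0217 rate quoted above; rcaas_cost is a hypothetical helper, not a billing tool provided by the cluster, and actual invoices may round differently. For reference, the quoted per-core-month figure appears to assume about 730 hours per month (365 × 24 / 12), since 730 × $0.0217 is roughly $15.84.

```shell
# Hypothetical helper: estimate the RCaaS charge for one job in dollars.
#   $1 = nodes, $2 = cores per node, $3 = wall time in hours
# Every core on each allocated node is billed, whether used or not.
rcaas_cost() {
  awk -v n="$1" -v c="$2" -v h="$3" \
    'BEGIN { printf "%.2f\n", n * c * h * 0.0217 }'
}

rcaas_cost 1 16 10     # one 16-core node for 10 hours: prints 3.47
rcaas_cost 2 20 120    # two 20-core nodes for 120 hours: prints 104.16
```
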

Will there be a way to check the current RCaaS charges to my group other than waiting for the monthly bill?

We can set up your account to send a notification email when your usage approaches an estimate you provide. Further, your account can be set up to stop access if you reach a specified dollar amount. Notify the Campus Cluster staff if you would like this service enabled on your account.

Does any storage (disk) space come with RCaaS?

Each active RCaaS customer is allocated at least 25 GB of User/Group storage while they use the service. If you need storage beyond the 25 GB allocation, talk with ICCP staff about adding a storage allocation through Research Storage as a Service (RSaaS). RSaaS allows you to rent storage space by the TB/month for as long as you need it.

Is there a time limit for access to this storage space once I've finished using RCaaS?

Once an RCaaS account has gone inactive, and assuming no other arrangements have been made for storage (e.g., hardware investment or monthly storage service), the customer is expected to remove their data within the next calendar month. After that period, all data will be removed.

After I have purchased time in the RCaaS queue, is there a time frame in which I must use the resources?

Note that with RCaaS you cannot "purchase time" before it is used. Instead, you estimate your monthly usage and how long you will use the service, and the program bills you monthly for the time actually used. That said, any RCaaS account with no usage activity for two consecutive monthly periods will be notified of account termination, which will occur at the end of the third consecutive monthly period with no activity. In the event of termination, a customer with no other storage investment on the system will have the duration of the subsequent billing cycle to move their data off the system.

Is it possible to add groups to the RCaaS user management screen?

Yes, we can add as many groups as needed. One of the administrators would need to create the group and add an allocation manager. To get this process started, send in a help ticket.

Is there a way for me to allow other users from my campus group or outside collaborators to use my RCaaS resources?

The allocation manager of your RCaaS group can add anyone with a University of Illinois NetID to your RCaaS group, allowing them to make use of your purchase. Outside collaborators must first obtain a NetID. Instructions for requesting a NetID may be found at https://answers.uillinois.edu/illinois/47711.

Is there a limit to how long jobs can run?

The RCaaS queue is configured for a maximum wall time of 120 hours. If you need jobs to run longer, let us know so we can evaluate whether we can accommodate your requirements.

How do I submit jobs to the RCaaS resources?

Specify the rcaas queue in your job submission. Note that the queue name is case-sensitive.

Does secondary (or test) queue access come with my RCaaS queue time purchase/access?

RCaaS users are able to submit jobs to the Secondary queue if they wish, but they will be charged for that time just like they are charged for jobs submitted to the RCaaS queue.

Can you install (make available) a specific software that I need for use with my RCaaS time?

While we do not install software for individual users/groups, you may install software in your group storage space. If you run into problems you can submit a help ticket and we can attempt to guide your installation, but there are so many software packages in use that we are unlikely to have experience using any particular package. Hopefully, with your knowledge of the package, and our knowledge of the system, we can help get most software working.