Table of Contents

Storage Areas
Storage Policies
Accessing Storage
Managing Your Data Usage
Guides and Tutorials
Optimizing Data & I/O Access

Illinois Research Storage

Welcome to the documentation for the Research Storage offerings managed by Research IT @ Illinois in partnership with engineers at the National Center for Supercomputing Applications. As the storage-related offerings of the program grow in scope and capability, this documentation provides a centralized place to find information about how to get the most out of these services.

We hope this information is a helpful guide in your utilization of this service. If you have any questions related to the information covered in this documentation please feel free to email us at help@campuscluster.illinois.edu.

Storage Areas

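| Name          | Mount Path     | Quota Policy                                     | Purge Policy                            | Snapshots              |
|---------------|----------------|--------------------------------------------------|-----------------------------------------|------------------------|
| home          | /home          | 5GB Soft/7GB Hard; 7 Day Grace; No inode quota   | No Purge                                | Yes; Daily for 30 days |
| project       | /project       | Based on investment; 20 million inodes per user  | No Purge                                | Yes; Daily for 30 days |
| scratch       | /scratch       | 10TB/User; No inode quota                        | Daily Purge of Files Older than 30 days | No                     |
| local scratch | /scratch.local | None                                             | Following job completion                | No                     |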

Home

The /home area of the file system is where users land upon logging in to the cluster via SSH, and is where a user’s $HOME environment variable points. This area has a fairly small quota and is meant to contain a user’s configuration files, job output/error files, and smaller software installations. A user’s /home area is automatically provisioned during the account provisioning process, and the space is provided by the program. It is not possible to request an expansion of home directory quota. The 7-day grace period on this area means that if the usage in a user’s home directory stays over the 5GB soft quota for 7 days, the home directory will stop accepting new writes until enough data has been cleared to get below the 5GB threshold.

Project

The /project area of the file system is where an investment group’s storage capacity resides, whether that group is a single faculty member, a lab group, a department, or an entire college. Users can have access to multiple project subdirectories if they are members of several investment groups and have been granted access to the space by the investment’s PI or Tech Rep. This storage area is where the bulk of the file system’s capacity is allocated.

Scratch

The /scratch area of the file system is where users can place data while it is under active work. The scratch area is provisioned by the campus cluster program and is available to all users. As noted in the table above, files older than 30 days are purged from this space, based on a file’s last access time. The admin team maintains various tools to detect users abusing the scratch area (such as by artificially modifying files’ access times) in an attempt to circumvent the purge policy. Doing so is a violation of the cluster’s policy and is not allowed. If you believe you have a legitimate need to retain data in scratch for longer than 30 days, please contact the program office.

Scratch — Local

The /scratch.local area is allocated on an individual compute node on the campus cluster or HTC system. This space is provided by the compute node’s local disk, not the shared file system. The size of /scratch.local varies across nodes from different investments, so be careful about assuming a particular size, especially when running in the secondary queue, where you have less control over which node your job lands on. Data placed in /scratch.local is purged following a job’s completion, prior to the next job beginning on the node.

Storage Policies

Scratch Purge

Files in the /scratch area of the file system will be purged daily, based on the file’s access time as recorded in the file system’s metadata. Once data is purged via the purge policy, there is no recovering the data. It has been permanently destroyed. Please move data that is of high value out of this space to make sure it doesn’t get forgotten.
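Because the purge is driven by access time, it can help to check which of your files are approaching the 30-day limit before they are removed. The following is a minimal sketch, not an official cluster tool: the function name list_purge_candidates, the 23-day default (chosen here to give roughly a week of warning), and the example path are all assumptions for illustration.

```shell
# Hypothetical helper (not a cluster-provided command): list regular files
# under a directory that were last accessed more than N days ago.
# Usage: list_purge_candidates <directory> [days]
list_purge_candidates() {
    dir="$1"
    days="${2:-23}"   # assumed default: about a week before the 30-day purge
    # -atime +N matches files whose last access was more than N whole days ago
    find "$dir" -type f -atime +"$days" -print
}

# Example (the per-user path is an assumption; substitute your own directory):
#   list_purge_candidates "/scratch/$USER" 23
```

Anything listed that you still need should be copied out of /scratch to a longer-term location, since purged files cannot be recovered.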

File System Snapshots

On a daily basis snapshots are run on the file system for the /home and /project areas. These snapshots allow users to go back to a point in time and retrieve data they may have accidentally modified, deleted, or overwritten. These snapshots are not backups and reside on the same hardware as the primary copy of the data.

To access snapshots for /home visit /home/.snapshots/home_YYYYMMDD*/$USER

To access snapshots for /projects visit /projects/$PROJECT_NAME/.snapshots/YYYYMMDD*/
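As a rough illustration of retrieving data from one of these snapshot paths, the sketch below copies a single file or directory out of a snapshot directory. The helper name restore_from_snapshot and the file names in the usage comments are hypothetical; the actual snapshot directory names on the system follow the patterns given above.

```shell
# Hypothetical helper: copy one file or directory out of a snapshot,
# preserving permissions and timestamps where possible.
# Usage: restore_from_snapshot <snapshot_dir> <relative_path> <destination>
restore_from_snapshot() {
    snap_dir="$1"
    rel_path="$2"
    dest="$3"
    if [ ! -e "$snap_dir/$rel_path" ]; then
        echo "not found in snapshot: $snap_dir/$rel_path" >&2
        return 1
    fi
    # -R copies directories recursively, -p preserves mode and timestamps
    cp -Rp "$snap_dir/$rel_path" "$dest"
}

# Example (file names are placeholders):
#   ls -d /home/.snapshots/home_*        # pick one snapshot directory
#   restore_from_snapshot "<chosen snapshot directory>/$USER" notes.txt "$HOME/notes.txt.restored"
```

Restoring to a new name, as in the usage comment, avoids overwriting the current copy until you have checked the recovered version.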

Inode Quotas

An inode is a metadata record in the file system that tracks information about a file or directory (such as which blocks it occupies on disk, permissions, ACLs, extended attributes, etc.). For every file or directory in the file system there exists an inode record. For project directories, there is a 20 million inode per-user policy; since metadata is stored on NVMe for fast access, this quota ensures we stay within a tolerable ratio of data to metadata. If this quota becomes an issue for your team, please reach out to the program office to discuss a solution. For ways to help decrease inode usage, see the Data Compression section later in this guide.
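Since every file and directory consumes one inode, a simple way to estimate where your inode usage comes from is to count the entries under each directory. The sketch below is an approximation built on standard tools, not a cluster-provided command; the function names count_inodes and inode_report and the example path are assumptions for illustration.

```shell
# Hypothetical helper: count files and directories at or below a path
# (one inode each; hard links and unusual file names may skew the count).
count_inodes() {
    find "$1" -print | wc -l | tr -d ' '
}

# Hypothetical helper: per-entry inode counts for the visible (non-dot)
# entries of a directory, largest first.
inode_report() {
    for entry in "$1"/*; do
        [ -e "$entry" ] || continue
        printf '%s\t%s\n' "$(count_inodes "$entry")" "$entry"
    done | sort -rn
}

# Example (path is a placeholder for your own project subdirectory):
#   inode_report "/projects/<your_project>/$USER"
```

Directories near the top of the report are the best candidates for bundling into archives. Walking a very large tree generates metadata load of its own, so it is kinder to the file system to run this on specific subdirectories rather than an entire project area.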

Data Classifications

There are many special data classifications in use by researchers across campus. Some types are permitted to be stored on Illinois Research Storage, while others are not. Below are descriptions of some of those data types. If you have any questions, please contact the program office.

ITAR (International Traffic in Arms Regulation) Data – is permitted to be stored on Campus Research Storage, however all proper procedures and notifications must be followed. For more information about those procedures and who to contact, see the OVCR’s documentation here.

HIPAA (Health Insurance Portability and Accountability Act) / PHI (Protected Health Information) Data – is not permitted to be stored on Campus Research Storage.

Accessing Storage

There are a variety of different ways users can access Research Storage, and we continue to work with users to find more ways to help them access it better. Below is a table summary of where/how Research Storage is accessible with further descriptions below.

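| Location/Method                            | /home      | /projects   | /scratch |
|--------------------------------------------|------------|-------------|----------|
| HPC Head Nodes                             | Yes        | Yes         | Yes      |
| HPC Compute Nodes                          | Yes        | Yes         | Yes      |
| HTC Head Nodes                             | On Roadmap | Yes         | No       |
| HTC Compute Nodes                          | On Roadmap | On Roadmap  | No       |
| cc-xfer CLI DTN Nodes                      | Yes        | Yes         | Yes      |
| Illinois Research Storage Globus Endpoints | Yes        | Yes         | Yes      |
| Globus Shared Endpoints (external sharing) | No         | Yes         | No       |
| Research Storage S3 Endpoint               | No         | Coming Soon | No       |
| Lab Workstations (Research Labs)           | No         | Yes*        | No       |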

* Available Upon Request (see NFS/SAMBA section)

HPC Head Nodes & Compute Nodes

All file system areas are accessible via the Campus Cluster’s batch head nodes and compute nodes, for use in running jobs and interacting with data on the command line.

HTC Head Nodes

We are working to make investors’ /project areas accessible on the head nodes of the HTC subsystem. This work is in progress and this guide will be updated when that capability is available. The mounting of /home on the HTC head nodes (i.e., having a shared $HOME between the HPC and HTC subsystems) has been discussed and is in the process of being placed on the road map of new feature delivery.

HTC Compute Nodes

The mounting of /home and /projects on the HTC compute nodes has been discussed and is likely to be placed on our road map in the near future. An upgrade and architectural shift of some subsystem components will be required to provide this in a stable fashion, and planning for that effort is underway. This section will be updated as new information becomes available.

CLI DTN Nodes

All file system areas are available for access via the cluster’s cc-xfer DTN service. These DTN nodes provide users a target for transferring data to and from the cluster using common CLI data transfer methods such as rsync, scp, and others. They sit behind the round-robin alias of cc-xfer.campuscluster.illinois.edu. For information on how to use these tools, see the example commands in the tutorials section further down in this guide.

Globus Endpoints

All file system areas are available for access via the cluster’s Globus endpoint, Illinois Research Storage. This covers not just transfers between the system and other Globus endpoints, but also transfers to/from Box storage and Google Drive storage via the respective Globus connectors, available at Illinois Research Storage Box and Illinois Research Storage Google Drive. For information on how to use these connectors, see the Globus section further down in this guide.

S3 Endpoint

The storage engineering team is working on a reliable and secure method for making data on the Research Storage system available via an S3 endpoint. This will allow researchers to GET/PUT data into and out of the system from tools that support the S3 protocol. This is also helpful for moving data to/from the system from/to cloud resources such as AWS or Azure. Our efforts in this area are ongoing and we hope to have an update soon.

Lab Workstations/Laptops

Groups are able to request that their /projects area be made accessible for mounting on workstations and laptops in their research labs on campus. This access method is especially helpful for data acquisition from instruments straight on to the Research Storage system, and also for viewing files remotely for use in other GUI-based software that is run on local machines instead of in a clustered environment. This access method is only available to machines on campus, and some security restrictions must be followed. For more information see the NFS/Samba Section of this guide below.

Managing Your Data Usage

CLI Quota Command

The Research Storage admins created a command that users can run on the CLI to view a summary of their usage across all areas of the file system, as well as their team’s usage of the project space(s) they have access to. A user simply needs to run the command quota to view their usage. An example is provided below.

[testuser1@cc-login1 ~]$ quota
Directories quota usage for user testuser1:

-----------------------------------------------------------------------------------------------------------
|      Fileset       |  User   |  User   |  User   |  Project |  Project |   User   |   User   |   User   |
|                    |  Block  |  Soft   |  Hard   |  Block   |  Block   |   File   |   Soft   |   Hard   |
|                    |  Used   |  Quota  |  Limit  |  Used    |  Limit   |   Used   |   Quota  |   Limit  |
-----------------------------------------------------------------------------------------------------------
| labgrp1            | 160K    | 1T      | 1T      | 58.5G    | 1T       | 6        | 20000000 | 20000000 |
| home               | 36.16M  | 5G      | 7G      | 58.5G    | 1T       | 1180     | 0        | 0        |
| scratch            | 2.5M    | 20T     | 20T     | 58.5G    | 1T       | 15292    | 0        | 0        |
| labgrp2            | 0       | 60T     | 60T     | 54.14T   | 60T      | 1        | 20000000 | 20000000 |
| labgrp3            | 0       | 107T    | 107T    | 88.31T   | 107T     | 11       | 20000000 | 20000000 |
-----------------------------------------------------------------------------------------------------------

The “User Block Used” column shows how much capacity the user testuser1 is consuming in each of the areas. The “Project Block Used” column shows how much capacity the entire team is using in their project space. The “User File Used” column shows how many inodes the user testuser1 is consuming in a given space. The relevant soft and hard quotas for each of these areas are also shown in their respective columns.

The output of the quota command is updated every ~15 minutes on the system, so there will be a slight delay between the creation/deletion of data and the update to the output of the quota command. Also, as noted in the policy section, at times when the file system is under heavy load, quota data may get a bit out of sync with reality; thus daily in the early morning hours the system runs a quota verification script to force the quota data to sync up across the system.

Storage Web Dashboard Interface

When an investment group purchases storage on the Research Storage service, a dashboard is created for viewing information related to their usage. The dashboard shows not only point-in-time usage of the storage resources, but also trends over time and a per-user breakdown of both capacity and inodes. All members of a project should be able to log in and view their storage dashboard. Access to the dashboard is governed by group membership, which PIs and Tech Reps control via the online User Portal. There may be a delay of a few hours between when a user is added to or removed from the group and when their access to the dashboards is provisioned or removed.

Main Dashboard URL: https://ccmon.campuscluster.illinois.edu

The main Cluster Overview dashboard as well as the storage dashboards can be accessed starting at the link above. The landing dashboard shows an overall view of the state of the cluster including job counts, node health numbers, number of users logged in, file system activity and overall usage, etc. To access your storage dashboard follow the steps below.

– Go to the Main Dashboard URL noted above, then navigate to the login button in the lower left corner of the screen

– Log in with your campus AD credentials (the same ones you use to SSH into the cluster)

– Click on the “Campus Cluster Overview” title to reveal the dashboard search bar

– Search for your project’s name in the drop down, it should appear in the search results. Let’s use the “NCSA” project as an example

– Your team’s storage dashboard should look similar to the screenshot below. Trends for individuals will be displayed in the lower left for the username selected from the drop down, and the time period of the graphs can be adjusted in the Time Picker in the upper right corner. The utilization table in the lower right can be sorted by username (alphabetical), capacity used, or inodes used.

Guides/Tutorials

NFS Access to Research Storage

Investor groups are able to request that their /projects area be made available to them via NFS for mounting on machines local to their lab team. To request NFS access to your team’s area, please have the project’s PI or Tech Rep send an email to help@campuscluster.illinois.edu with the following information:

– Mount Type (Read-Only or Read-Write)

– Project Area being exported

– List of IPs or the IP CIDR range of the machines that need to mount the export (these machines must have a public IP address or a campus internal routed IP address to be able to reach the NFS servers)

NFS exports of the file system will be root-squashed which means that a user interacting with the storage on the remote machine via that local machine’s root account will have file access permissions that map to the nfsnobody user (generally UID 65534 on Linux systems).

When NFS mounting storage, it is advised to have users’ UIDs align with what they are on the Research Storage system. To get a user’s UID on a system, have them run the command “id” and look for the uid field. See the example below:

[testuser1@cc-login ~]$ id
uid=7861(testuser1) gid=7861(testuser1) groups=7861(testuser1) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
[testuser1@cc-login ~]$

Once exported, the file system can be NFS mounted via a couple of different methods. The first and preferred method is via autofs. The second is to manually add the mount to the host’s /etc/fstab file. The round-robin DNS entry for the Research Storage NFS clustered endpoint is nfs.campuscluster.illinois.edu

For the autofs method, check out this guide from Red Hat. To mount via the /etc/fstab method, check out this guide from Red Hat. Make sure you have the nfs-utils package installed on your machine. Our recommendations for NFS mount parameters are as follows:

[rw/ro],soft,timeo=20,vers=4,rsize=16384,wsize=16384
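To make the recommended parameters concrete, the sketch below assembles an /etc/fstab entry from them. The helper name nfs_fstab_line, the example export path /projects/labgrp1, and the mount point /mnt/labgrp1 are assumptions for illustration; use the export path confirmed by the Research Storage team when your request is fulfilled.

```shell
# Hypothetical helper: print an /etc/fstab line for a Research Storage NFS
# export using the recommended mount parameters.
# Usage: nfs_fstab_line <export_path> <mount_point> [ro|rw]
nfs_fstab_line() {
    export_path="$1"
    mount_point="$2"
    mode="${3:-ro}"   # assumed default: read-only unless rw is requested
    case "$mode" in
        ro|rw) ;;
        *) echo "mode must be ro or rw" >&2; return 1 ;;
    esac
    # fstab fields: device, mount point, type, options, dump, fsck pass
    printf '%s:%s %s nfs %s,soft,timeo=20,vers=4,rsize=16384,wsize=16384 0 0\n' \
        "nfs.campuscluster.illinois.edu" "$export_path" "$mount_point" "$mode"
}

# Example (paths are placeholders); review the line before adding it to /etc/fstab:
#   nfs_fstab_line /projects/labgrp1 /mnt/labgrp1 rw
```

The mode passed here should match the mount type (Read-Only or Read-Write) that was requested for the export.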

Samba Access to Research Storage

Investor groups are able to request that their /projects area be made available to them via Samba for mounting on machines local to their lab team. To request Samba access to your team’s area, please have the project’s PI or Tech Rep send an email to help@campuscluster.illinois.edu with the following information:

– Project Area being Exported

Following the request, the Research Storage team will export that area of the file system to all users who are in that project’s group as seen in the User Portal. To add/remove users who can mount the project area, add/remove them from the group in the portal.

Once exported, the round-robin DNS entry for the Samba node pool is samba.campuscluster.illinois.edu. A guide to mounting a Samba share on Windows machines can be found here, and a guide for mounting on macOS machines can be found here. Make sure the machine is connected to the campus network and has a campus public IP address or an internally routed private IP address.

For Windows the path to the share should look like:

\\samba.campuscluster.illinois.edu\$your_project_name

For macOS the Server Address to use is:

smb://samba.campuscluster.illinois.edu

For both operating systems, use your campus AD credentials (the same ones you use to access the cluster) to access these shares.

Data Compression and Consolidation

It is often handy to bundle many files into a single archive file. This can make data transport easier and more efficient. It also helps reduce the space the data takes up on disk, both in terms of capacity and inodes.

As mentioned in the policy section on inode quotas, bundling files into an archive and compressing it saves both inodes and space. To do this with tar + gzip, see the example below, where the images folder is run through tar + gzip to create images_bundle.tar.gz:

## Just for illustration, this folder has 4,896 image files in it
[testuser1@cc-login hubble]$ ls images/ | wc -l
4896

## tar and compress the folder, example:
[testuser1@cc-login hubble]$ tar -zcvf images_bundle.tar.gz images

## There should now be a single archive file that contains all the images
[testuser1@cc-login hubble]$ ls
images images_bundle.tar.gz

## After confirming the archive lists correctly (for example with tar -tzf images_bundle.tar.gz),
## you can remove the original folder as all its contents are in the tar.gz file
[testuser1@cc-login hubble]$ rm -rf images
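Before deleting the original folder, it is worth confirming that the archive is complete. The sketch below creates the archive and compares the number of entries in it with the number of files and directories on disk. The function name bundle_and_verify is hypothetical, and the sketch assumes the source folder is given as a relative path, as in the example above.

```shell
# Hypothetical helper: create a gzip-compressed tar archive of a folder and
# check that it holds as many entries as the folder has files and directories.
# Usage: bundle_and_verify <source_dir> <archive.tar.gz>
bundle_and_verify() {
    src="$1"
    archive="$2"
    tar -czf "$archive" "$src" || return 1
    expected=$(find "$src" -print | wc -l | tr -d ' ')
    actual=$(tar -tzf "$archive" | wc -l | tr -d ' ')
    if [ "$expected" -eq "$actual" ]; then
        echo "OK: $actual entries archived"
    else
        echo "MISMATCH: $expected on disk, $actual in archive" >&2
        return 1
    fi
}

# Example:
#   bundle_and_verify images images_bundle.tar.gz && rm -rf images
```

A matching entry count is a quick sanity check rather than a byte-for-byte comparison, so for irreplaceable data consider extracting the archive elsewhere and comparing before removing the originals.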

CLI Transfer Method: rsync

## User wants to transfer the images directory
[testuser1@users-machine hubble]$ ls
images

## Transfer using rsync to a project directory
[testuser1@users-machine hubble]$ rsync -avP images testuser1@cc-xfer.campuscluster.illinois.edu:/projects/$teams_directory/

CLI Transfer Method: scp

## User wants to transfer the images directory
[testuser1@users-machine hubble]$ ls
images

## Transfer using scp to a project directory
[testuser1@users-machine hubble]$ scp -rp images testuser1@cc-xfer.campuscluster.illinois.edu:/projects/$teams_directory/

CLI Transfer Method: sftp

Users can transfer data using sftp via the command line or via one of many common sftp transfer utilities. Two of our favorites are WinSCP and Cyberduck; both are free to download and install.

Cyberduck

– Click the “Open Connection” button in the upper left corner.

– Select the SFTP protocol in the drop down menu, fill in the DTN Endpoint URL and your credentials.

– Once connected you should now see a listing of your home directory, and you can navigate around the file system via the GUI, downloading and uploading files as needed.

WinSCP

– After installing and opening WinSCP, it should prompt you to enter information for a new login. Fill in the DTN endpoint host name (cc-xfer.campuscluster.illinois.edu) and your credentials, then click “Login”.

– After successfully connecting, you will see your home directory on the right half of the window and your local machine’s Explorer on the left half. You should be able to navigate the folder structure on both sides and transfer data back and forth as needed.

CLI Transfer Method: bbcp

Transferring data via bbcp requires the tool to be installed on both sides of the transfer. Research Storage admins have it installed on the cc-xfer.campuscluster.illinois.edu endpoint. For information on how to install bbcp on your local machine, including pre-compiled binaries, look here

## User wants to transfer the images directory
[testuser1@users-machine hubble]$ ls
images

## Transfer using bbcp to a project directory
[testuser1@users-machine hubble]$ bbcp -r -w 4m images testuser1@cc-xfer.campuscluster.illinois.edu:/projects/$teams_directory/

CLI Transfer Method: rclone

The Research Storage admins have the rclone file transfer utility installed on the cc-xfer.campuscluster.illinois.edu endpoint for users who wish to use it. rclone can be configured for a large variety of storage backends. If you are transferring data between Research Storage and a local machine, it is best to set up the Research Storage endpoint using the SFTP connector on your local machine. If you are transferring data between the cluster and another cloud storage service (Amazon S3, Box, Dropbox, Google Drive, OneDrive, etc.), it is best to set up that service as a source/target on the cc-xfer.campuscluster.illinois.edu endpoint itself. For documentation on configuring rclone to send/receive data from your desired location, check out the relevant section in their documentation here.

Globus Endpoint: POSIX Endpoint

– Navigate to globus.org and click “Log In” in the upper right corner

– Choose “University of Illinois Urbana-Champaign” as your Identity Provider and click “Continue”

– If prompted, click “Allow” when asked to authorize the Globus Web App

– Log in via the Illinois Shibboleth service; this will include a Duo 2FA prompt

– Once logged in you should be taken to the File Manager section. On one side, search for the Illinois Research Storage collection; you should see a list of endpoints. Click on the “plain” endpoint

– The system will prompt you to authenticate to the endpoint; click “Continue”. Globus may prompt you to link your netid@illinois.edu identity; go ahead and do so

– You should then get dropped back into the “File Manager” view and be able to see your home directory in the explorer window

– Then, in a similar manner (in the right half of the “File Manager” view), search for and authenticate to the collection you are planning to transfer data to/from, then use the GUI to transfer the data; you can choose transfer settings. Also on the left is a button to view your current transfer activity

Globus Endpoint: Box

– Follow the same steps as the POSIX endpoint above until searching for a collection; in that step choose the Illinois Research Storage – Box Collection

– Again Globus will ask you to link your netid@illinois.edu identity; go ahead and do so, and then grant the Globus Web App permissions. You should then be dropped into the root of your UIUC Box folder.

– Choose another endpoint on the other side of the “File Manager” view and transfer data in/out of your Illinois Box account to/from another Globus Collection, such as the Illinois Research Storage Collection or one from another site/institution

Globus Endpoint: Google Drive

– Follow the same steps as the POSIX endpoint above until searching for a collection; in that step choose the Illinois Research Storage – Google Drive Collection

– Link your identity and grant the Globus Web App permissions

– You will then be asked to further authorize the Google Drive Endpoint; go ahead and do so. On the page that follows, click on “Continue”

– Choose your Illinois Google Account and grant the Globus Connect Server access

– You should then be taken back to the “File Manager” view and be in your Google Drive directory; authenticate to the other Collection you want to transfer data to/from and then Start the transfer like a transfer between two normal endpoints

Globus Endpoint: Creating a Shared Endpoint

Globus Shared Endpoint functionality is a great way to share data with people at institutions who are not affiliated with the University of Illinois system. If you are granting access to data you manage to another person at an external organization, all the other person needs is a free Globus account and an endpoint on their side to transfer the data to. To set up a shared endpoint follow the instructions below:

– Log in to Globus and connect to the Illinois Research Storage Collection following the steps in the POSIX Endpoint tutorial above, then navigate to and select the directory you want to share with external users (currently only data in /projects is allowed to be shared externally) and click the “Share” option to the right of the directory

– Select the “Add Guest Collection” option

– Fill in the information about the share; the more detail you provide, the easier it is for others to find

– After creating the collection you will get dropped into the permissions tab for that shared endpoint, go ahead and start adding the people you want to share the data with. You have fine grained control if you want them to have access to even a smaller subset of your dataset, read vs read/write access, etc.

– Once added you should see the person in the share’s permissions section

Optimizing Data and I/O Access

Understanding File System Block Size

The Research Storage sub-system currently is based on IBM’s Spectrum Scale File System (formerly known as GPFS) v5.1. This file system, like others, is formatted at a given block size, however it offers the advanced feature of sub-block allocation. A file system’s block size is the name for the smallest unit of allocation on the file system.

For example, on a normal file system with a block size of 512KB, files smaller than that size (let’s say 200KB) will sit in a single block, using only part of its available capacity. The remaining 312KB in that block is rendered unusable, because a block can only be mapped to a single file and the first file is already associated with it, leaving that block’s use efficiency at less than 50%. That wasted space is counted against a user’s quota, as no other user can leverage that space. Thus setting a smaller block size generally improves the efficiency of the file system and wastes less space. However, it comes at the cost of performance. The smaller the block size, the slower the file system will perform on high-bandwidth I/O applications.

The ability to have sub-block allocation on the file system allows the file system to be formatted at a given block size, but have the minimum allocatable block size be much smaller. In this way, you can get performance benefits of large block sizes, with much less efficiency loss on small files.

The Research Storage sub-system is formatted at a 16MB block size, which allows for very high system throughput (which many user applications demand). However, each block can be sub-divided into 1,024 sub-blocks, so the minimum allocatable block size is 16KB. Thus, for example, a 12KB file will orphan ~4KB of space.
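The arithmetic in this section can be checked with a few lines of shell. This is only an illustration of the rounding described above; the function name allocated_kb is hypothetical, and real allocation on the file system may involve details not covered in this guide.

```shell
# Space consumed by a file when allocation happens in fixed-size units:
# round the file size up to the next multiple of the unit (all values in KB).
# Usage: allocated_kb <file_size_kb> <unit_kb>
allocated_kb() {
    size_kb="$1"
    unit_kb="$2"
    echo $(( (size_kb + unit_kb - 1) / unit_kb * unit_kb ))
}

# Sub-block size for a 16MB block split into 1,024 sub-blocks
subblock_kb=$(( 16 * 1024 / 1024 ))
echo "sub-block size: ${subblock_kb}KB"

# A 12KB file with 16KB sub-blocks
echo "12KB file allocates $(allocated_kb 12 "$subblock_kb")KB"

# A 200KB file on a file system that allocates whole 512KB blocks
echo "200KB file allocates $(allocated_kb 200 512)KB"
```

The results match the figures in the text: the 12KB file occupies one 16KB sub-block (about 4KB unused), while the 200KB file on the whole-block file system occupies 512KB (312KB unused).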

Impact of I/O Size on Throughput performance

In light of the above section, there are implications for the I/O performance applications will see when running compute jobs on the HPC and HTC systems. When applications are doing I/O on the file system, the size of their I/O requests is a key factor in determining how much performance they will receive. Where possible, users should configure their workflows to use larger files.

For applications that can’t escape using tiny files for their work, sometimes the use of HDF files can help improve application performance by creating effectively a virtual file system within a file that contains all the data. Information on HDF files and their use can be found here.