Requesting Research Storage Solutions
Tufts TTS provides a networked storage solution to faculty researchers who request it based on their identified needs. This solution is often used in support of grant-based projects that produce or require large data sets that must be both accessible to multiple users and backed up. Networked storage solutions often support a small lab where several people interact with research data from computers at different locations. Another possibility is to request that the storage be made available to one or more accounts on the research cluster. Please allow up to 2 weeks for requests to be processed.
Estimation of storage needs
Please take some time to estimate your storage needs, both near term and longer term. For example, consider a hypothetical experiment with the following parameters:
- number of subjects, including potential drop-outs (N)
- number of trials per subject (n1)
- duration of acquisition per trial in seconds (T)
- acquisition rate in Hz (F)
- size of a sample in megabytes (s)
Size of dataset in megabytes: N*n1*T*F*s
Example: Recruiting 20 subjects for three 5-minute trials each, acquiring black and white images of 1 MB each at 60 Hz, gives a data set of 20*3*(5*60)*60*1 = 1,080,000 MB, or roughly 1 TB.
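The estimate above can be reproduced with a few lines of shell arithmetic. This is only a sketch: the function name estimate_mb is made up for illustration, and shell arithmetic is integer-only, so the sample size must be given in whole megabytes.

```shell
#!/usr/bin/env bash
# Estimate dataset size as N * n1 * T * F * s, with the result in megabytes.
# estimate_mb is a hypothetical helper name, not an existing cluster tool.
estimate_mb() {
  local subjects="$1" trials="$2" seconds="$3" rate_hz="$4" sample_mb="$5"
  echo $(( subjects * trials * seconds * rate_hz * sample_mb ))
}

# Values from the example in the text: 20 subjects, 3 trials,
# 5 minutes = 300 seconds per trial, 60 Hz, 1 MB per image.
total_mb=$(estimate_mb 20 3 300 60 1)
echo "Estimated size: ${total_mb} MB (~$(( total_mb / 1000 )) GB)"
```

Adjust the five arguments to your own study design, and consider adding headroom for derived or intermediate files.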
For additional information, please contact Lionel Zupan, Associate Director for Research Technology, at x74933 or via email firstname.lastname@example.org.
Please work with your IT support person to submit or update a request for Cluster Research Storage or CIFS (desktop) storage. Note that Windows-based storage is not mounted on the cluster, and cluster storage is not re-exported to desktops.
Storage User guide
Cluster Specific Storage
As of Oct. 10 2013, all cluster accounts are created with a fixed 5 GB home directory disk quota.
In addition, a directory is automatically created in the /scratch filesystem on the head node and on each compute node. The directory is named with your Tufts UTLN, for example /scratch/utln/. There is no quota and no backup for files located there. Typically 100 GB or more is available. This file system is subject to automated cleaning based on a moving window of 14 days. You may use it for temporary storage supporting your research computing needs. Note: this storage is not available to your desktop as a mounted file system.
For additional temporary storage beyond what is offered in /scratch, Tufts TTS provides a 3 TB file system called /cluster/shared/. This file system is also subject to automated cleaning based on a moving window of 14 days. There is no quota and no backup for files located here. Every account has access to this storage. The naming convention is the same as for /scratch/ storage, so your directory would be /cluster/shared/utln/. Please work out of your named directory. Note that the /cluster/shared/ file system is mounted on all nodes, whereas /tmp and /scratch/ are local to each compute node and the login node.
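Because /scratch and /cluster/shared/ are cleaned on a 14-day moving window, it can help to list files that are approaching or past that age. The sketch below assumes the cleaning is driven by file modification time, which this page does not state; the real policy may use access time or another criterion. The function name list_stale_files is hypothetical.

```shell
#!/usr/bin/env bash
# List regular files under a directory that were last modified at least
# N days ago (default 14).
# ASSUMPTION: the automated cleaning looks at modification time.
list_stale_files() {
  local dir="$1" days="${2:-14}"
  # find counts age in whole days, so "-mtime +13" matches 14 days or older.
  find "$dir" -type f -mtime +"$(( days - 1 ))" -print
}

# Hypothetical usage; replace your_utln with your own directory name.
# list_stale_files /cluster/shared/your_utln 14
```

Passing a smaller number of days, for example 10, gives early warning for files you still need to move to backed-up storage.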
Another available file system is /tmp. This file system is used by applications and the operating system for temporary storage. Do not use this for temporary storage. If you fill it up, the node may stop working and affect all processes on that node.
Finally, filesystem /cluster/tufts is a GPFS Parallel file system available by request. Also known as Cluster Research Storage, this storage is only mounted on the cluster. Use this link to request or update an existing storage request. See below for details.
Cluster Storage Summary Table
> 100 gig
Not in the traditional sense, but daily and 4-hour snapshots are available. Look in the directory /cluster/shared/.snapshot/ for what is needed and drill down to your UTLN.
Optional GPFS research storage: backups are limited to 30- and 90-day moving windows, as snapshots. These files are located in the snapshot directory /cluster/kappa/.snapshots/
amount requested by user
Storage Archive for /cluster/tufts/ research storage area
How long before some of my files on /cluster/tufts/your_storage/ are archived?
Files older than 90 days are archived.
What data on which file systems are subject to the 90-day archive?
All data on /cluster/tufts/
Where are the snapshots for the /cluster/tufts/ filesystem?
> ls /cluster/kappa/.snapshots/
I think my file, my_data.dat, may have been archived. How can I tell?
> stat /cluster/tufts/your_storage/my_data.dat
If the file has zero blocks reported, it has been archived.
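The zero-block check can be scripted when several files need to be inspected. The sketch below assumes GNU stat (its -c format option) and uses hypothetical function names. Since an empty file also reports zero blocks, it additionally requires a non-zero size before flagging a file.

```shell
#!/usr/bin/env bash
# Heuristic from the text: an archived file reports zero allocated blocks.
# An empty file also has zero blocks, so a non-zero size is required too.
looks_archived() {
  local size_bytes="$1" blocks="$2"
  [ "$size_bytes" -gt 0 ] && [ "$blocks" -eq 0 ]
}

check_file() {
  local path="$1" size blocks
  # GNU stat format: %s = size in bytes, %b = number of allocated blocks.
  size=$(stat -c %s -- "$path") || return 2
  blocks=$(stat -c %b -- "$path") || return 2
  if looks_archived "$size" "$blocks"; then
    echo "$path: possibly archived (${size} bytes, 0 blocks)"
  else
    echo "$path: not flagged as archived (${size} bytes, ${blocks} blocks)"
  fi
}

# Hypothetical usage:
# check_file /cluster/tufts/your_storage/my_data.dat
```

Sparse files can also show a size with few or no blocks, so treat the result as a hint rather than proof.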
If my file is archived, can I get it back with snapshots?
You cannot recover a file from snapshots once it has been archived.
How does one get an archived file "back"?
Accessing the file with normal Linux commands should be transparent, with some delay. The delay depends on several factors, such as file size and time of day.
Is my /cluster/tufts/my_storage/ quota independent of what I have archived?
No, your old archived files are counted against your quota.
I deleted a file by mistake on /cluster/tufts/your_storage/ and it is less than 90 days old. How can I get it back?
Copy the file from the snapshot directory to /tmp or your home directory if it will fit, verify that it is intact, and then move it to the directory where you wish to use it.
> cp /cluster/kappa/.snapshots/monthly-2016-09-15_12\:51\:36/90-days-archive/your_storage/your_file.dat /tmp
> ls -lt /tmp
> mv /tmp/your_file.dat /cluster/tufts/your_storage/your_file.dat
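The copy, verify, and move steps can be combined into one small function. This is a sketch under assumptions: restore_copy is a hypothetical name, the verification is a byte-for-byte comparison with cmp rather than the ls check shown above, and the staging directory is whatever location you have room in.

```shell
#!/usr/bin/env bash
# Copy a file out of a snapshot into a staging directory, confirm the copy is
# byte-identical to the snapshot version, then move it into place.
restore_copy() {
  local src="$1" staging_dir="$2" dest="$3" staged
  staged="$staging_dir/$(basename -- "$src")"
  cp -p -- "$src" "$staged" || return 1
  # cmp -s prints nothing and returns 0 only when both files are identical.
  if ! cmp -s -- "$src" "$staged"; then
    echo "copy of $src does not match the snapshot; $dest left untouched" >&2
    rm -f -- "$staged"
    return 1
  fi
  mv -- "$staged" "$dest"
}

# Hypothetical usage with the placeholder paths from the text:
# restore_copy "/cluster/kappa/.snapshots/monthly-2016-09-15_12:51:36/90-days-archive/your_storage/your_file.dat" \
#   "$HOME" /cluster/tufts/your_storage/your_file.dat
```

The snapshot copy itself is never modified, so the function can be rerun if a step fails.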
How can I verify that my requested storage is mounted on the cluster?
Check the path of the created storage you requested. For example:
> ls /cluster/tufts/your_named_storage/
When using the /cluster/shared/your_utln/ directory with large files, how does one monitor overall usage so as not to fill this public storage to 100%?
> df -H | grep "cluster/shared"
> du --summarize /cluster/tufts/some_dir_you_own/
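For unattended jobs, the df check can be turned into a simple threshold warning. The sketch below uses hypothetical function names and an arbitrary 90% threshold; neither is a cluster policy.

```shell
#!/usr/bin/env bash
# Print the "Use%" value (without the % sign) of the filesystem holding a path.
# df -P selects the POSIX output format, one line per filesystem.
usage_percent() {
  df -P -- "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Warn when usage reaches a threshold. The default of 90 is illustrative only.
warn_if_full() {
  local path="$1" threshold="${2:-90}" pct
  pct=$(usage_percent "$path")
  # Bail out if df gave nothing usable (empty or non-numeric value).
  case "$pct" in
    ''|*[!0-9]*) echo "could not read usage for $path" >&2; return 2 ;;
  esac
  if [ "$pct" -ge "$threshold" ]; then
    echo "WARNING: $path is ${pct}% full"
    return 1
  fi
  echo "$path is ${pct}% full"
}

# Hypothetical usage:
# warn_if_full /cluster/shared 90
```

A job script could call warn_if_full before writing large output and stop early when it returns non-zero.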
How does one monitor the top 20 files and subdirectories in /cluster/shared/your_utln/ ?
> du -a --max-depth=10 /cluster/shared/your_utln/ | sort -k1n | tail -20
Since cluster node logins are unnecessary, how do I see my data in the /scratch filesystem area of a particular node?
Logins to compute nodes via ssh are not possible.
Cross mounting of compute-node-specific file systems such as /scratch is not available on the new cluster. It is best to use /cluster/shared/your_utln/ for most situations.
How do I reference my data that is located on the login node /scratch area from my process on a compute node?
This is related to the last question. If you need to do this, you should have access to either the public temp filesystem, /cluster/shared/ , or your particular research files on an optional research file share.
Sometimes I notice that my usage on a filesystem such as /cluster/tufts/ varies to the point that I am over quota and then later under it. This often happens when I do lots of file deletions. What is going on?
File system tracking, in the form of snapshots, is taking place. This supports possible restoration of a deleted file.
Storage and Data Transfers
Getting your data to cluster-mounted storage is normally straightforward. You may use any desktop file transfer program, such as WinSCP, that supports the sftp or scp protocols. For cluster inbound transfers, use the file transfer node, xfer.cluster.tufts.edu, instead of login.cluster.tufts.edu.
However, if you have a large amount of data (hundreds of gigabytes or more) located outside the Tufts network domain, the task may be nontrivial. Network traffic, hardware bottlenecks, and other issues beyond your control may cause dropped connections, and the transfer may take much longer than you expected. When transferring large amounts of data to the cluster, make sure you have enough storage to receive it, and consider using rsync instead of scp. rsync is available on the cluster as a command line tool, and its documentation is available as a cluster man page or on the web. A simple example follows. Note that the trailing . is not a speck on your screen; it is required to indicate the target path location.
> rsync -avz yourusername@remotehost:/path-to-your-data .
The same kind of transfer can also be submitted as a batch job:
> sbatch -c 1 -p batch --wrap="rsync -avz yourusername@remotehost:/path-to-your-data /some/new/path"
How can I transfer via rsync some files of a known pattern, for example?
> rsync -av --stats --progress ftp.ncbi.nih.gov::blast/db/nr.*.tar.gz YOUR_DEST_DIR
Note: embedded passwords are sometimes needed.
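Since long transfers may be interrupted by a dropped connection, one option is to wrap rsync in a small retry loop. The sketch below is not a cluster-provided tool: retry is a hypothetical helper, and the attempt count and delay are arbitrary. It relies on rsync's --partial option, which keeps partially transferred files so that a rerun does not start them from scratch.

```shell
#!/usr/bin/env bash
# Re-run a command until it succeeds, up to max_tries attempts, sleeping
# delay_s seconds between attempts. Returns the status of the last attempt.
retry() {
  local max_tries="$1" delay_s="$2" attempt=1 status=0
  shift 2
  while :; do
    "$@" && return 0
    status=$?
    if [ "$attempt" -ge "$max_tries" ]; then
      return "$status"
    fi
    echo "attempt ${attempt} failed (exit ${status}); retrying in ${delay_s}s" >&2
    sleep "$delay_s"
    attempt=$(( attempt + 1 ))
  done
}

# Hypothetical usage with the placeholder host and path from the text:
# retry 5 60 rsync -avz --partial yourusername@remotehost:/path-to-your-data .
```

Because rsync skips files that already match at the destination, each retry only repeats the work that is still outstanding.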
Should I use scp instead of cp or mv to move data around my filesystems?
On the cluster many filesystems that you may have access to are NFS or GPFS mounted on all nodes. From your shell, you can use cp and mv in the normal manner. Do not use scp to copy your data between filesystems. It is extra overhead that is not needed.
I am using the Tufts Box.com cloud storage option, can I transfer files that way to the cluster?
Yes. Firefox is available on the cluster. You must have X11 graphics support on your computer to receive the Firefox display. Note that your cluster home directory has a storage quota of 5 GB, while your Box.com account likely holds more than that. To transfer large files, you should use the /cluster/shared/your_utln directory or one of the other options above.
Cluster Mounted Storage "Ownership" and Succession
Often a request for mounted storage is to support an ongoing project. While the storage owner is most often a faculty member, it is the project members who typically use the storage and coordinate its use. When a project member's status changes due to graduation, etc., some account adjustments may be needed to secure access and effect appropriate file permission changes. TTS/RT requires the approval of the owner to make these changes. Please send a brief email to email@example.com describing what changes are needed.
If you own a research storage share such as /cluster/tufts/your_data (not your home directory) and you leave Tufts, the following steps are taken prior to storage removal:
1. Either 90-day archives or snapshots will capture the data.
2. The storage share is then deleted within 90 days.
Research Database (HPCdb) Node
Currently there is no database service attached to the cluster for research use. However, cluster users may inquire about access to a MySQL database for supporting their research computing needs. Under some circumstances we may be able to assist in finding a host that cluster nodes may query remotely. This option is not designed for heavy I/O requests, large-volume client traffic, or large storage.
Requests are treated like software requests. Please reference the Software Request Policy statement in this document.