
The Tufts High Performance Compute (HPC) cluster delivers 35,845,920 CPU hours and 59,427,840 GPU hours of free compute time per year to the user community.

Teraflops: 60+ (60+ trillion floating point operations per second)
CPU: 4000 cores
GPU: 6784 cores
Interconnect: 40GB low latency ethernet

For additional information, please contact Research Technology Services at tts-research@tufts.edu



2. Bioinformatics services

a. Timings for Genome Mapping

The newest update to SLURM handles backfill better: if you specify an expected run time for your program, the job can be placed earlier as nodes open up. With sbatch you can set a limit on the total run time with -t or --time (for example, d-h:m:s). Times can be specified as min, min:sec, hr:min:sec, day-hr, day-hr:min, or day-hr:min:sec. For example, -t 5 means five minutes and -t 5:00:00 means five hours.

Here are some approximate run times (minutes and seconds) for BWA-mem, Bowtie2, and samtools with 7, 15, 30, and 60 million fastq sequences. Samtools was used to convert the SAM files to BAM and then sort the resulting BAM files. These runs were done with 8 cores and 16 GB of memory. The timings were obtained with the time command (e.g. time bwa mem -t 8 <hg38 ref> <sequences (human)>), though some programs report their run times in the output. To use backfill effectively, note how long your programs run and request a bit more than that with the -t or --time parameter. Make sure to add up the run times for a whole workflow (e.g. a bwa mem run followed by samtools reformatting) and to account for multiple runs (e.g. processing 6 fastq files).

 

#Sequences   BWA mem   Bowtie2   Samtools
7 M          1' 29"    1' 39"    1' 52"
15 M         3' 8"     2' 30"    3' 57"
30 M         6' 36"    4' 53"    5' 38"
60 M         12' 32"   10' 6"    10' 18"
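
The advice about adding up run times can be turned into a small calculation. The sketch below sums per-step estimates given in seconds, multiplies by the number of input files, adds a safety margin, and prints the result as hr:min:sec for use with -t. The function name slurm_time and the 25% margin are illustrative assumptions, not cluster recommendations.

```shell
#!/bin/bash
# slurm_time MARGIN_PCT FILES STEP_SECONDS...
# Hypothetical helper: prints a padded time limit as hr:min:sec.
slurm_time() {
    local margin_pct=$1 files=$2 total=0 s
    shift 2
    # Add up the estimated seconds for each step of the workflow.
    for s in "$@"; do
        total=$((total + s))
    done
    # Scale by the number of input files, then add the safety margin.
    total=$((total * files))
    total=$((total + total * margin_pct / 100))
    printf '%d:%02d:%02d\n' $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

# 30 M reads, 6 fastq files: bwa mem 6' 36" (396 s) plus samtools 5' 38" (338 s)
slurm_time 25 6 396 338
# prints 1:31:45
```

The printed value can be passed directly to sbatch, for example -t 1:31:45.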

b. Genome Indexes on Cluster

Several mammalian and model system genomes, indexes, and annotations are located on the Tufts HPC cluster. The genomes currently available are listed below in the indicated directory tree. All are UCSC genome builds, except for canFam3, which is an NCBI build.

/cluster/tufts/genomes
  /HomoSapiens
    /hg18
    /hg19
    /hg38 (New)
  /MusMusculus
    /mm9
    /mm10
  /RattusNorvegicus
    /rn4
    /rn5
    /rn6 (New)
  /CanisFamiliaris
    /canFam2
    /canFam3
  /BosTaurus
    /bosTau7
  /EquusCaballus
    /equCab2
  /GallusGallus
    /galGal4
  /SusScrofa
    /susScr3
  /DrosophilaMelanogaster
    /dm3
    /dm6 (New)
  /CaenorhabditisElegans
    /ce10
  /DanioRerio
    /danRer7
    /danRer10 (New)
  /SaccharomycesCerevisiae (no Transcriptome)
    /sacCer2
    /sacCer3

 

  Within each build subdirectory, there are two subdirectories.

  /Annotation
  /Sequence

 

In the Annotation directory there are subdirectories for gene annotations (Gene) and, depending upon the degree of annotation, directories for smallRNA and Variation.

Under the Sequence directory, there are subdirectories containing indexes for popular short read sequence mapping programs.

  /AbundantSequences -- data files with over-represented sequences 
  /BlastDB  -- blast formatted genomic indexes: use genome.fa as reference name
  /Bowtie2Index  -- Bowtie2 formatted indexes: use genome as reference name
  /BowtieIndex  -- Bowtie formatted indexes: use genome as reference name
  /BWAIndex  -- BWA formatted indexes: use genome.fa as reference name
  /Chromosomes  -- individual chromosomes as fasta files
  /Transcriptome  -- Bowtie2 formatted index of transcriptome sites: use transcript as ref name
  /WholeGenomeFasta -- Genome as one file with accessory files

 

Please read the documentation for each mapping program to understand how it refers to reference indexes.
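
Because every build follows the same layout, the reference for a given mapper can be assembled from the species, build, and tool names. The following is a minimal sketch under that assumption; the function name index_ref and the GENOMES variable are hypothetical, and only the directory and reference names listed above are used.

```shell
#!/bin/bash
GENOMES=/cluster/tufts/genomes

# index_ref SPECIES BUILD TOOL
# Hypothetical helper: prints the reference to pass to the given mapper.
index_ref() {
    local seq="$GENOMES/$1/$2/Sequence"
    case "$3" in
        bwa)     echo "$seq/BWAIndex/genome.fa" ;;
        blast)   echo "$seq/BlastDB/genome.fa" ;;
        bowtie)  echo "$seq/BowtieIndex/genome" ;;
        bowtie2) echo "$seq/Bowtie2Index/genome" ;;
        *)       echo "unknown tool: $3" >&2; return 1 ;;
    esac
}

index_ref HomoSapiens hg38 bwa
# prints /cluster/tufts/genomes/HomoSapiens/hg38/Sequence/BWAIndex/genome.fa
```

The helper only builds the path string; it does not check that the build or index actually exists on disk.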

Examples

The examples listed below should be included in a script and then submitted with sbatch.

Example: BWA

It helps to set up environment variables to avoid typing long paths. Here a set of short reads (myreads.fq) is mapped to the mouse genome (mm10), with a SAM-formatted file as output. Note that bwa uses genome.fa as the reference index name and that the bwa mem analysis is used. See the BWA documentation for other ways to invoke bwa.

  module load bwa/0.7.9a
  
  export MM10=/cluster/tufts/genomes/MusMusculus/mm10/Sequence/BWAIndex
  export MYDATADIR=/cluster/shared/myutln/mmdata
  bwa mem $MM10/genome.fa $MYDATADIR/myreads.fq >$MYDATADIR/myreads.sam
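
One way to wrap the BWA commands in a batch script is sketched below. It writes the script to a file so it can be inspected before submission. The file name bwa_mm10.sh, the job name, and the 30 minute limit are assumptions for illustration; the 8 cores and 16 GB match the timing runs described earlier, and any partition settings your account needs are not shown.

```shell
#!/bin/bash
# Write a batch script for the BWA example; the file name is arbitrary.
cat > bwa_mm10.sh <<'EOF'
#!/bin/bash
# Assumed resource requests; base the time limit on your own timings.
#SBATCH --job-name=bwa_mm10
#SBATCH --time=0:30:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G

module load bwa/0.7.9a

export MM10=/cluster/tufts/genomes/MusMusculus/mm10/Sequence/BWAIndex
export MYDATADIR=/cluster/shared/myutln/mmdata

# Use as many bwa threads as cores requested above.
bwa mem -t 8 "$MM10/genome.fa" "$MYDATADIR/myreads.fq" > "$MYDATADIR/myreads.sam"
EOF

# Submit from a login node with:
#   sbatch bwa_mm10.sh
```

Keeping the thread count passed to bwa mem equal to --cpus-per-task avoids requesting cores the job does not use.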

 

Example: Bowtie2

Similarly, environment variables can be set up; in the case of bowtie2, a BOWTIE2_INDEXES variable must also be set. Here is an example of a paired-end analysis with minimal options. See the bowtie2 documentation for the complete set of command options. Note that Bowtie2 uses genome as the reference index name (-x genome).

  module load bowtie2
  export BOWTIE2_INDEXES=/cluster/tufts/genomes/MusMusculus/mm10/Sequence/Bowtie2Index
  export MYDATADIR=/cluster/shared/myutln/mmdata
  
  bowtie2 -q -x genome -1 $MYDATADIR/myreads_1.fq -2 $MYDATADIR/myreads_2.fq -S $MYDATADIR/myreads.sam
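
The SAM output is usually converted to BAM and sorted, as was done for the timings above. The sketch below shows one way to do that with the samtools 0.1.x versions listed among the cluster modules. The function name sam_to_sorted_bam and the output naming are assumptions; note that newer samtools releases use a different sort syntax.

```shell
#!/bin/bash
# sam_to_sorted_bam PREFIX
# Hypothetical helper: converts PREFIX.sam to PREFIX.bam, then sorts it into
# PREFIX.sorted.bam using the samtools 0.1.x argument order.
sam_to_sorted_bam() {
    local prefix=$1
    # -b writes BAM output; -S tells samtools 0.1.x that the input is SAM.
    samtools view -bS "$prefix.sam" > "$prefix.bam" || return 1
    # samtools 0.1.x takes an output prefix and appends .bam itself;
    # samtools 1.x would instead use: samtools sort -o out.bam in.bam
    samtools sort "$prefix.bam" "$prefix.sorted"
}

# Usage inside a job script, after "module load samtools/0.1.19":
#   sam_to_sorted_bam "$MYDATADIR/myreads"
```

Remember to include the time for this step when setting -t for the whole job.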

 

 

  

c. HPC Modules for Bioinformatics

  Note: some bioinformatics software, such as R packages like Bioconductor or Python packages, is not listed here because it is part of a larger module, for example R/3.1.0 or python/2.7.6. Load those modules to get to Bioconductor or to Python packages such as numpy or matplotlib.

  To list the entirety of the module collection use this command

       module avail

To load a module, use this command with a module name and version as listed below (default versions are marked with '*')

       module load modulename/version

To show the currently loaded modules, use this command

       module list

To unload a module use this command

       module unload modulename/version

 

 
Classification              Module
Align/Mapping               blast/2.2.24
                            blat/20140708
                            bowtie/0.12.7*
                            bowtie/1.0.1
                            bowtie/2.1.0
                            bowtie2/2.2.3*
                            bwa/0.7.9a
                            exonerate/2.2.0
Assembly                    AbySS/1.5.2
                            pandaseq/2.5
                            trinity/7.17.14
                            velvet/1.0.19
                            velvet/1.2.03
                            velvet/1.2.10
BioVisual                   Cytoscape/2.8.3
                            IGV/1.5.30
ChIP-Seq                    MACS/1.4.2-1
                            MAnorm/2014-04-03
General Purpose             R/2.15.3
                            R/3.01
                            R/3.0.2
                            R/3.0.3
                            R/3.1.0
                            mathematica/8.0
                            mathematica/8.0.4
                            mathematica/9.0.0
                            mathematica/9.0.1
                            mathematica/10.0.2
                            matlab/2012b
                            matlab/2013a*
                            matlab/2014a
                            matlab/2014b
Microbial ecology           QIIME/1.5.0*
                            QIIME/1.6.0
                            QIIME/1.7.0
                            QIIME/1.8.0
                            QIIME/1.9.0
                            mothur/1.25.1
                            mothur/1.29.1
NGS                         BCFtools/1.2
                            Picard/1.139
                            CirSeq/3
                            GATK/3.1-1
                            GATK/3.7
                            HTseq/0.5.4a*
                            HTseq/0.6.1p1
                            IGV/1.5.30
                            bedtools/2.17.0*
                            bedtools/2.19.1
                            bowtie/0.12.7*
                            bowtie/1.0.1
                            bowtie/2.1.0
                            bowtie/2.2.3*
                            bwa/0.7.9a
                            fastx/0.0.13
                            lastz
                            ngsplot
                            picard
                            samtools/0.1.18*
                            samtools/0.1.19
Phylogenetics               mrbayes/3.1.2
                            paml/4.8
                            PhyML/3.1
RNA                         ViennaRNA/2.1.6*
                            mirdeep2/2.0.0.5*
                            randfold/2.0*
RNA-Seq                     STAR/2.30e*
                            cufflinks/0.8.3
                            cufflinks/2.0.0
                            cufflinks/2.0.2*
                            cufflinks/2.1.1
                            misopy/0.5.2
                            rsem/1.2.4
                            tablemaker/2.1.1*
                            tophat/1.0.14
                            tophat/2.0.9*
                            tophat/2.0.10
                            tophat/2.0.13
                            transdecoder/2.0
                            trinity/7.17.14
Statistical Genetics/GWAS   ancestrymap/6210
                            fbat/2.0.3
                            haploview/4.1
                            impute/2.0.3
                            mach/1.0.16
                            merlin/1.1.2
                            pbat/3.61
                            pedcheck/1.1
                            plink/1.06
                            plinkseq/0.10
                            vcftools/0.1.12b

 

Performance Considerations using threads

In general, threads give useful performance gains, but the feature can be abused by requesting too many. Applications that support thread parallelism vary in their degree of internal support.

Application performance is not always well documented, so it may be worth doing some benchmarking yourself. Doing so puts you in a better position to use the cluster resources well. For example, here is a benchmark examination of blastp and other tools.
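
A simple thread benchmark can be sketched in a few lines of shell. The function below runs a command once for each of several thread counts and reports the elapsed wall-clock time with one-second resolution. The name bench_threads, the thread counts 1, 2, 4, and 8, and the convention of passing the count as the last argument are all assumptions made for illustration.

```shell
#!/bin/bash
# bench_threads CMD [ARGS...]
# Hypothetical helper: runs CMD once per thread count, passing the count as
# the last argument, and reports wall-clock seconds for each run.
bench_threads() {
    local n start end
    for n in 1 2 4 8; do
        start=$(date +%s)
        # Discard the command's standard output; only the timing is reported.
        "$@" "$n" > /dev/null
        end=$(date +%s)
        echo "threads=$n elapsed=$((end - start))s"
    done
}

# Example wrapper (assumes MM10 and MYDATADIR are set as in the BWA example):
#   run_bwa() { bwa mem -t "$1" "$MM10/genome.fa" "$MYDATADIR/myreads.fq"; }
#   bench_threads run_bwa
```

Run such a benchmark inside a job that has requested at least as many cores as the largest thread count, and use a small input file so that each run stays short.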

d. Tufts Center for Neuroscience Research Genomics Core

The Tufts CNR Genomics Core supplies links to bioinformatics resources related to their operation. See Tufts CNR Genomics Core Resources for more information.

 

 

Misc

A separate server is used to support these services in some cases. However, some software may require installation on the Linux research cluster. Check Installed Software for the bioinformatics software available on the cluster. To make a special request for software installation, please follow the instructions noted elsewhere on this page.
