Home

SABmark Dataset

The Sequence Alignment Benchmark (SABmark) provides sequences of low to intermediate similarity and covers the entire known fold space. For our studies we use proteins from the Superfamilies set. This set contains a representative set of sequences from each SCOP superfamily.
Link to SABmark paper: http://bioinformatics.oxfordjournals.org/cgi/content/full/21/7/1267
Link to datasets: http://bioinformatics.vub.ac.be/databases/databases.html

Comparison of Protein Sequences and Their Structures

We investigated the case in which two protein sequences are similar, i.e. are close hits in a BLAST search, yet differ in structure. The following file contains proteins whose sequences are similar yet whose structures differ.

Generating Confusion Matrix for Pfam HMMs

Pfam (http://www.sanger.ac.uk/Software/Pfam/) trains a hidden Markov model (HMM) on a set of aligned sequences and later uses that HMM to classify unknown sequences. Sometimes more than one HMM classifies a protein sequence to a domain. We investigated the cases in which more than one HMM matches a sequence, which would mean the sequence belongs to two Pfams. We then created a confusion matrix that counts the number of times each pair of Pfams classifies the same sequence. The files below list the pairs of Pfams that classify the same sequence at E-value thresholds of e-10, e-5 and e-2 respectively.

  • ResultConfusionMatrixe10.xls
  • ResultConfusionMatrixe5.xls
  • ResultConfusionMatrixe2.xls
    (The first two columns in each file are the Pfams, column 3 is the number of times the two Pfams are confused, columns 4 and 5 are the clans to which these Pfams belong, and column 6 says "match" when both Pfams belong to the same clan.)
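
The counting behind these files can be sketched as follows. This is a minimal illustration, not the program used to produce the spreadsheets: the input format (a list of sequence, family, E-value tuples), the function name confusion_counts, and the family names famA, famB, famC are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def confusion_counts(hits, evalue_cutoff):
    """Count how often each pair of families matches the same sequence.

    hits: iterable of (sequence_id, family_id, evalue) tuples.
          This layout is an assumption, not the actual HMM search output.
    evalue_cutoff: a hit is kept only if its E-value is <= this threshold.
    """
    # Collect, per sequence, the set of families that hit it under the cutoff.
    families_by_seq = {}
    for seq_id, family_id, evalue in hits:
        if evalue <= evalue_cutoff:
            families_by_seq.setdefault(seq_id, set()).add(family_id)

    # Every unordered pair of families sharing a sequence is one "confusion".
    counts = Counter()
    for families in families_by_seq.values():
        for pair in combinations(sorted(families), 2):
            counts[pair] += 1
    return counts


# Toy hits with made-up family names and E-values.
hits = [
    ("seq1", "famA", 1e-20),
    ("seq1", "famB", 1e-12),
    ("seq2", "famA", 1e-15),
    ("seq2", "famB", 1e-6),
    ("seq3", "famC", 1e-30),
]
strict = confusion_counts(hits, 1e-10)
loose = confusion_counts(hits, 1e-5)
```

With the toy data, famA and famB are confused on one sequence at the e-10 cutoff and on two sequences at e-5, which mirrors how the counts grow as the threshold is relaxed across the three files.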

    Possible approaches for picking correct pfam hmm from confused set

  • "Recursive Class Elimination"

    Information theory based measures

We want to apply the Discriminative Information Criterion (DIC) to pick the HMMs. The steps involved are:

  1. Pick the sets of HMMs that are confused
  2. Pull them out of the Pfam_fs file
  3. Generate a training dataset for these HMMs
  4. Compute P(X|T), the probability of seeing the data (X) given the model (T)
  5. Compute DIC and BIC for the confused models
  6. Pick the model with the highest DIC and BIC
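
Steps 4 to 6 can be sketched as below, assuming the log-likelihoods log P(X|T) have already been computed for each confused HMM on each HMM's training set. The formulas are one common formulation and an assumption about what is intended here: BIC in its "higher is better" form (log-likelihood minus a complexity penalty), and DIC as a model's log-likelihood on its own data minus its average log-likelihood on the competing models' data. The model names and all numbers are hypothetical.

```python
import math


def bic(log_likelihood, n_params, n_samples):
    """BIC in the 'higher is better' form: fit minus a complexity penalty."""
    return log_likelihood - 0.5 * n_params * math.log(n_samples)


def dic(loglik, model):
    """Discriminative score of `model` (assumed formulation).

    loglik[m][d] holds log P(X_d | T_m): the log-likelihood of the training
    data of model d under model m.
    """
    own = loglik[model][model]
    rivals = [ll for data, ll in loglik[model].items() if data != model]
    # A model that fits its own data well and the rivals' data poorly
    # gets a high score.
    return own - sum(rivals) / len(rivals)


# Made-up log-likelihoods for two confused HMMs.
loglik = {
    "hmmA": {"hmmA": -100.0, "hmmB": -180.0},
    "hmmB": {"hmmA": -120.0, "hmmB": -110.0},
}
dic_scores = {name: dic(loglik, name) for name in loglik}
best_by_dic = max(dic_scores, key=dic_scores.get)
```

In this toy case hmmA separates its own data from the rival's data more sharply than hmmB does, so it receives the higher DIC. BIC would be computed per model from its own log-likelihood, parameter count, and number of training sequences.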

Experiments on Confused Classifiers

GO Annotations of the proteins

GO (Gene Ontology) is one of the popular annotation schemes for proteins and other biological components. When available from a reliable source, GO annotations can be used to verify predicted annotations. GO annotations for proteins in PDB (http://www.rcsb.org/pdb/home/home.do) are available from EBI at http://www.ebi.ac.uk/GOA/goaHelp.html. The mapping file is gene_association.goaa_pdb. The file was downloaded and imported into a MySQL database.

Since the SABmark dataset is a subset of the proteins in PDB, GO annotations for the SABmark proteins can be retrieved by querying the database. A program (bio.go.Util.java) was written to get GO annotations for all SABmark proteins.

On the other hand, GO annotations for Pfams are available from the Pfam FTP site at ftp://ftp.sanger.ac.uk/pub/databases/Pfam/. The gene_ontology.txt.gz file was downloaded and imported into the database.

In this study we compare the GO annotations of the proteins in the SABmark dataset with the GO annotations of the Pfams predicted for those proteins.
The following spreadsheet contains the GO annotations for the proteins and their Pfams:
GOSabMarkPfam.xls
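
One simple way to compare the two annotation sets for a single protein is set overlap. The sketch below is an assumption about how such a comparison could be done, not a description of bio.go.Util.java: the function name go_overlap and the example term assignments are illustrative only, and the terms would in practice come from the database queries described above.

```python
def go_overlap(protein_terms, pfam_terms):
    """Return the shared GO terms and the Jaccard similarity of two term sets."""
    protein_terms = set(protein_terms)
    pfam_terms = set(pfam_terms)
    shared = protein_terms & pfam_terms
    union = protein_terms | pfam_terms
    # Two empty annotation sets are treated as having no measurable overlap.
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Illustrative term sets for one protein and its predicted Pfam.
protein_go = {"GO:0005524", "GO:0004672", "GO:0006468"}
pfam_go = {"GO:0005524", "GO:0004672"}
shared, jaccard = go_overlap(protein_go, pfam_go)
```

A Jaccard value near 1 means the predicted Pfam carries almost the same GO terms as the protein, while a value of 0 means no terms in common. Exact term matching ignores parent and child relationships in the GO hierarchy, so it is a conservative measure.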

HMMer

http://www.bioinformatics.org/pipermail/ssml-general/2004-October/000112.html

http://biowulf.nih.gov/apps/sam/
