The Classification of Protein Domains into Families and Clans through Entropy Measures based Methods

Rubem P. Mondaini

Chalmers Conferences, 9th European Conference on Mathematical and Theoretical Biology

Last modified: 2014-04-01

Abstract

The idea of a Clan of protein families came from the rule that a new family is not allowed to overlap with existing families as registered in a database: any residue in a given sequence can appear in only one family. The adopted rules for building these new families are the requirement of non-overlapping but related families and the impossibility of building a single Hidden Markov Model (HMM) which is able to detect all of these related families.
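The non-overlap rule can be illustrated with a minimal sketch. The record layout `(family_id, start, end)` with 1-based inclusive positions and the family names are hypothetical, chosen only for illustration; they are not taken from any database schema.

```python
from typing import List, Tuple

# Hypothetical record: (family_id, start, end), 1-based inclusive residue positions
# of a family match on ONE sequence.
Region = Tuple[str, int, int]


def find_overlaps(regions: List[Region]) -> List[Tuple[Region, Region]]:
    """Return every pair of regions from different families that share a residue."""
    ordered = sorted(regions, key=lambda r: (r[1], r[2]))
    clashes = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second[1] > first[2]:
                break  # sorted by start: no later region can overlap `first`
            if second[0] != first[0]:
                clashes.append((first, second))
    return clashes


if __name__ == "__main__":
    matches = [("FAM_A", 1, 120), ("FAM_B", 100, 200), ("FAM_C", 201, 300)]
    for first, second in find_overlaps(matches):
        print(f"{first[0]} and {second[0]} claim the same residues")
```

An empty result means every residue of the sequence belongs to at most one family, which is the condition stated above.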

From the families registered in a database, we can identify a Clan as a group of related families with a single evolutionary origin. The relationship between these families is decided in terms of related structure, related function, profile-profile comparisons with an admissible E-score, as well as the matching of a specified sequence to HMMs obtained from different families. To summarize, from a set of related families, we should try to merge them in order to obtain a comprehensive model which should be able to detect all of the proteins of this set. If this can be done as described, we have created a Clan. Otherwise, there is no Clan structure for this set of related families. The profile-profile E-score comparison referred to above is considered significant if E < 0.001.
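As a rough illustration of the E-score criterion only, the sketch below groups families that are linked by chains of significant profile-profile comparisons (E < 0.001). The input dictionary, the family names, and the use of a union-find structure are assumptions made for this example. The result is a list of candidate groups, not Clans: the other criteria listed above (structure, function, and the ability to build a merged model) are not checked here.

```python
from typing import Dict, List, Set, Tuple

E_THRESHOLD = 0.001  # significance cut-off quoted in the abstract


def candidate_groups(evalues: Dict[Tuple[str, str], float]) -> List[Set[str]]:
    """Group families connected by significant profile-profile E-scores."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (fam_a, fam_b), e in evalues.items():
        root_a, root_b = find(fam_a), find(fam_b)
        if e < E_THRESHOLD and root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    groups: Dict[str, Set[str]] = {}
    for fam in list(parent):
        groups.setdefault(find(fam), set()).add(fam)
    # Families with no significant partner are not candidates.
    return [g for g in groups.values() if len(g) > 1]


if __name__ == "__main__":
    # Hypothetical E-scores between four hypothetical families.
    scores = {("F1", "F2"): 1e-5, ("F2", "F3"): 5e-4, ("F3", "F4"): 0.2}
    print(candidate_groups(scores))
```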

The fundamental aim of the present work is to introduce a statistical analysis of the family and clan classification in a protein database. However, conventional statistical estimates are difficult to apply to sequences of symbols whose natural ordering or underlying metric is not yet known. Methods based on entropy measures seem to be the best suited for the assessment of protein databases, and we have obtained interesting results from the intensive analysis of regions of 100 x 100 residues. As far as the work with the selected set of entropy measures is concerned, these results amount to a disproof of the clan classification of protein families.
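The abstract does not name the entropy measures used. As one common example, the Shannon entropy of column j of a block of aligned sequences is H_j = -sum_a p_j(a) log2 p_j(a), where p_j(a) is the relative frequency of residue a in that column. The sketch below computes it in Python rather than MAPLE, and it assumes that a region is a rectangular block of equal-length rows (for instance 100 sequences by 100 columns); both choices are assumptions of this example, not the author's stated method.

```python
import math
from collections import Counter
from typing import List


def column_entropies(block: List[str]) -> List[float]:
    """Shannon entropy (in bits) of each column of an equal-length sequence block."""
    if not block or any(len(row) != len(block[0]) for row in block):
        raise ValueError("block must be non-empty with rows of equal length")
    n_rows = len(block)
    entropies = []
    for column in zip(*block):  # iterate over columns of the block
        counts = Counter(column)
        h = -sum((c / n_rows) * math.log2(c / n_rows) for c in counts.values())
        entropies.append(h)
    return entropies


if __name__ == "__main__":
    # Tiny hypothetical block: 4 sequences x 2 columns.
    print(column_entropies(["AC", "AD", "AC", "AD"]))
```

A fully conserved column has zero entropy, while a column split evenly between two residues has an entropy of one bit. No ordering or metric on the residue alphabet is needed, which is the property the paragraph above relies on.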

As a pedagogically useful approach for interested scientific practitioners who are not biologists, we have also succeeded in setting up all calculations in the MAPLE 16 computer algebra system.

References:

1. E.L.L. Sonnhammer, S.R. Eddy, R. Durbin - A Comprehensive Database of Protein Domain Families based on Seed Alignments - Proteins: Structure, Function and Genetics 28 (1997) 405 - 420.

2. E.L.L. Sonnhammer, S.R. Eddy, E. Birney, A. Bateman, R. Durbin - Pfam: Multiple Sequence Alignments and HMM-profiles of Protein Domains - Nucleic Acids Res. 26(1) (1998) 320 - 322.

3. R.D. Finn, E.L.L. Sonnhammer et al. - The Pfam Protein Families Database - Nucleic Acids Res. 36 (2008) D281 - D288.

4. A. Heger, L. Holm - Exhaustive Enumeration of Protein Domain Families - J. Mol. Biol. 328 (2003) 749 - 767.

5. J. Liu, B. Rost - Domains, Motifs and Clusters in the Protein Universe - Current Opinion in Chemical Biology 7 (2003) 5 - 11.

6. S-R Tzeng, C.G. Kalodimos - Protein activity regulation by Conformational Entropy - Nature 488 (2012) 236 - 240.

7. J.A. Jadwin, M. Ogiue-Ikeda, K. Machida - The application of Modular Protein Domains in Proteomics - FEBS Letters 586 (2012) 2586 - 2596.

Keywords

Protein Domains, Protein Databases, Entropy Measures