Accurate domain identification with structure-anchored hidden Markov models, saHMMs.
Proteins. 2009 Aug 01; 76(2):343-52.

Abstract

The ever increasing speed of DNA sequencing widens the discrepancy between the number of known gene products, and the knowledge of their function and structure. Proper annotation of protein sequences is therefore crucial if the missing information is to be deduced from sequence-based similarity comparisons. These comparisons become exceedingly difficult as the pairwise identities drop to very low values. To improve the accuracy of domain identification, we exploit the fact that the three-dimensional structures of domains are much more conserved than their sequences. Based on structure-anchored multiple sequence alignments of low identity homologues we constructed 850 structure-anchored hidden Markov models (saHMMs), each representing one domain family. Since the saHMMs are highly family specific, they can be used to assign a domain to its correct family and clearly distinguish it from domains belonging to other families, even within the same superfamily. This task is not trivial and becomes particularly difficult if the unknown domain is distantly related to the rest of the domain sequences within the family. In a search with full length protein sequences, harbouring at least one domain as defined by the structural classification of proteins database (SCOP), version 1.71, versus the saHMM database based on SCOP version 1.69, we achieve an accuracy of 99.0%. All of the few hits outside the family fall within the correct superfamily. Compared to Pfam_ls HMMs, the saHMMs obtain about 11% higher coverage. A comparison with BLAST and PSI-BLAST demonstrates that the saHMMs have consistently fewer errors per query at a given coverage. Within our recommended E-value range, the same is true for a comparison with SUPERFAMILY. Furthermore, we are able to annotate 232 proteins with 530 nonoverlapping domains belonging to 102 different domain families among human proteins labelled "unknown" in the NCBI protein database. 
Our results demonstrate that the saHMM database represents a versatile and reliable tool for identification of domains in protein sequences. With the aid of saHMMs, homology on the family level can be assigned, even for distantly related sequences. Due to the construction of the saHMMs, the hits they provide are always associated with high quality crystal structures. The saHMM database can be accessed via the FISH server at http://babel.ucmp.umu.se/fish/.

Authors

Tångrot JE, Kågström B, Sauer UH.

Umeå Centre for Molecular Pathogenesis, UCMP, Umeå University, Sweden.

Pub Type(s)

Journal Article
Research Support, Non-U.S. Gov't

Language

eng

PubMed ID

19173309

Citation

Tångrot, Jeanette E., et al. "Accurate Domain Identification with Structure-Anchored Hidden Markov Models, saHMMs." Proteins, vol. 76, no. 2, 2009, pp. 343-52.
Tångrot JE, Kågström B, Sauer UH. Accurate domain identification with structure-anchored hidden Markov models, saHMMs. Proteins. 2009;76(2):343-52.
Tångrot, J. E., Kågström, B., & Sauer, U. H. (2009). Accurate domain identification with structure-anchored hidden Markov models, saHMMs. Proteins, 76(2), 343-52. https://doi.org/10.1002/prot.22349
Tångrot JE, Kågström B, Sauer UH. Accurate domain identification with structure-anchored hidden Markov models, saHMMs. Proteins. 2009 Aug 1;76(2):343-52. PubMed PMID: 19173309.
TY  - JOUR
T1  - Accurate domain identification with structure-anchored hidden Markov models, saHMMs.
AU  - Tångrot, Jeanette E
AU  - Kågström, Bo
AU  - Sauer, Uwe H
PY  - 2009/1/29/entrez
PY  - 2009/1/29/pubmed
PY  - 2010/2/24/medline
SP  - 343
EP  - 52
JF  - Proteins
JO  - Proteins
VL  - 76
IS  - 2
SN  - 1097-0134
UR  - https://www.unboundmedicine.com/medline/citation/19173309/Accurate_domain_identification_with_structure_anchored_hidden_Markov_models_saHMMs_
L2  - https://doi.org/10.1002/prot.22349
DB  - PRIME
DP  - Unbound Medicine
ER  -