Structural characterization of the human proteome

Genome Res. 2002 Nov;12(11):1625-41. doi: 10.1101/gr.221202.

Abstract

This paper reports an analysis of the encoded proteins (the proteome) of the genomes of human, fly, worm, yeast, and representatives of bacteria and archaea in terms of the three-dimensional structures of their globular domains, together with a general sequence-based study. We show that 39% of the human proteome can be assigned to known structures. We estimate that for 77% of the proteome, there is some functional annotation, but only 26% of the proteome can be assigned to standard sequence motifs that characterize function. Of the human protein sequences, 13% are transmembrane proteins, but only 3% of the residues in the proteome form membrane-spanning regions. There are substantial differences in the composition of globular domains of transmembrane proteins between the proteomes we have analyzed. Commonly occurring structural superfamilies are identified within the proteome. The frequencies of these superfamilies enable us to estimate that 98% of the human proteome evolved by domain duplication, with four of the 10 most duplicated superfamilies specific to multicellular organisms. The zinc-finger superfamily is massively duplicated in human compared to fly and worm, and occurrence of domains in repeats is more common in metazoa than in single-celled organisms. Structural superfamilies over- and underrepresented in human disease genes have been identified. Data and results can be downloaded and analyzed via web-based applications at http://www.sbg.bio.ic.ac.uk.
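
The abstract does not say how the duplication estimate is derived from superfamily frequencies. One common way to make such an estimate, offered here only as an illustrative sketch and not as the authors' confirmed method, is to assume that each superfamily has a single founding domain and that every further member arose by duplication. The function name `duplication_fraction` and the toy superfamily labels and counts below are hypothetical and are not data from the paper.

```python
from collections import Counter


def duplication_fraction(domain_superfamilies):
    """Estimate the fraction of domains that arose by duplication.

    Assumption (not taken from the paper): each superfamily has exactly
    one founding domain, and all other members are duplicates of it.
    """
    counts = Counter(domain_superfamilies)
    total_domains = sum(counts.values())
    if total_domains == 0:
        raise ValueError("no domain assignments given")
    founders = len(counts)  # one assumed founder per superfamily
    return (total_domains - founders) / total_domains


# Toy input: one superfamily label per assigned domain (illustrative only).
toy_assignments = (
    ["zinc_finger"] * 6 + ["immunoglobulin"] * 3 + ["p_loop_ntpase"] * 1
)

if __name__ == "__main__":
    fraction = duplication_fraction(toy_assignments)
    print(f"Estimated duplicated fraction: {fraction:.2f}")
```

Under this assumption, a proteome dominated by a few large superfamilies gives a fraction close to 1, which is the qualitative pattern the abstract describes.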

Publication types

  • Comparative Study

MeSH terms

  • Algorithms
  • Animals
  • Archaeal Proteins / chemistry
  • Archaeal Proteins / classification
  • Archaeal Proteins / genetics
  • Archaeal Proteins / physiology
  • Bacterial Proteins / chemistry
  • Bacterial Proteins / classification
  • Bacterial Proteins / genetics
  • Bacterial Proteins / physiology
  • Caenorhabditis elegans Proteins / chemistry
  • Caenorhabditis elegans Proteins / classification
  • Caenorhabditis elegans Proteins / genetics
  • Caenorhabditis elegans Proteins / physiology
  • Databases, Genetic / statistics & numerical data
  • Drosophila Proteins / chemistry
  • Drosophila Proteins / classification
  • Drosophila Proteins / genetics
  • Drosophila Proteins / physiology
  • Escherichia coli Proteins / chemistry
  • Escherichia coli Proteins / classification
  • Escherichia coli Proteins / genetics
  • Escherichia coli Proteins / physiology
  • Gene Duplication
  • Genetic Diseases, Inborn / genetics
  • Humans
  • Markov Chains
  • Membrane Proteins / chemistry
  • Membrane Proteins / classification
  • Membrane Proteins / genetics
  • Membrane Proteins / physiology
  • Online Systems / statistics & numerical data
  • Phylogeny
  • Protein Structure, Quaternary / genetics
  • Protein Structure, Quaternary / physiology
  • Proteome / chemistry*
  • Proteome / classification
  • Proteome / physiology
  • Saccharomyces cerevisiae Proteins / chemistry
  • Saccharomyces cerevisiae Proteins / classification
  • Saccharomyces cerevisiae Proteins / genetics
  • Saccharomyces cerevisiae Proteins / physiology

Substances

  • Archaeal Proteins
  • Bacterial Proteins
  • Caenorhabditis elegans Proteins
  • Drosophila Proteins
  • Escherichia coli Proteins
  • Membrane Proteins
  • Proteome
  • Saccharomyces cerevisiae Proteins