Numbers of distinct gene families versus numbers of predicted genes and their duplicated copies in H. influenzae, S. cerevisiae, C. elegans, and D. melanogaster

Range	Table - link
Organism	Various
Reference	Rubin GM et al., Comparative genomics of the eukaryotes. Science. 2000 Mar 24; 287(5461):2204-15. p.2205 table 1. PubMed ID 10731134
Primary Source	See measurement method
Method	[Ref 63]: "Paralogous gene families (Table 1) were identified by running BLASTP. A version of NCBI-BLAST2 optimized for the Compaq Alpha architecture was used with the SEG filter and the effective search space length (Y option) set to 17,973,263. Each protein was used as a query against a database of all other proteins of that organism. A clustering algorithm was then used to extract protein families from these BLASTP results. Each protein sequence constitutes a vertex; each HSP [High Scoring Pair] between protein sequences is an arc, weighted by the BLAST Expect value. The algorithm identifies protein families by first breaking all arcs with an E value greater than some user-defined value (1×10^-6 was used for all of the analyses reported here). The resulting graph is then split into subgraphs that contain at least two-thirds of all possible arcs between vertices. The algorithm is “greedy”; that is, it arbitrarily chooses a starting sequence and adds new sequences to the subgraph as long as this criterion is met. An interesting property of this algorithm is that it inherently respects the multidomain nature of proteins: For example, two multidomain proteins may have significant similarity to one another but share only one or a few domains. In such a case, the two proteins will not be clustered if the unshared domains introduce a large number of other arcs."
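The clustering step quoted in the Method field can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `greedy_families`, the `hits` input format (a dict from ID pairs to E values), the order in which candidates are tried, and the collapsing of multiple HSPs between one pair into a single arc are all assumptions made for illustration. Only the E value cutoff of 1×10^-6 and the two-thirds arc criterion come from the quoted text.

```python
def greedy_families(proteins, hits, max_e=1e-6):
    """Greedily group proteins into families from pairwise BLAST-style hits.

    proteins: list of protein IDs; list order sets the arbitrary starting points.
    hits: dict mapping (id_a, id_b) to an E value (hypothetical input format);
          every ID in hits is assumed to appear in proteins.
    """
    # Break all arcs above the E value cutoff. Parallel HSPs between the same
    # pair collapse into one undirected arc (an assumption of this sketch).
    neighbors = {p: set() for p in proteins}
    for (a, b), e_value in hits.items():
        if a != b and e_value <= max_e:
            neighbors[a].add(b)
            neighbors[b].add(a)

    assigned = set()
    families = []
    for seed in proteins:
        if seed in assigned:
            continue
        family = [seed]
        assigned.add(seed)
        arcs = 0  # number of arcs currently inside the family
        grew = True
        while grew:  # rescan until no further protein can be added
            grew = False
            for cand in proteins:
                if cand in assigned:
                    continue
                links = sum(1 for member in family if member in neighbors[cand])
                if links == 0:
                    continue
                size = len(family) + 1
                possible = size * (size - 1) // 2
                # Integer form of (arcs + links) / possible >= 2/3.
                if 3 * (arcs + links) >= 2 * possible:
                    family.append(cand)
                    assigned.add(cand)
                    arcs += links
                    grew = True
        families.append(family)
    return families


# Toy data: a chain A-B-C-D plus one weak hit (A-E) that fails the cutoff.
proteins = ["A", "B", "C", "D", "E"]
hits = {("A", "B"): 1e-20, ("B", "C"): 1e-12, ("C", "D"): 1e-30, ("A", "E"): 1e-3}
families = greedy_families(proteins, hits)
print(families)       # [['A', 'B', 'C'], ['D'], ['E']]
print(len(families))  # 3
```

In the toy data, D matches only C, so adding it would leave 3 of 6 possible arcs, below two-thirds, and D stays a separate family. This mirrors the multidomain behavior described in the quote. Counting each returned family once, singletons included, gives the number of distinct families for the toy set.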
Comments	P.2204 middle column bottom paragraph: "The “Core Proteome”: How many distinct protein families are encoded in the genomes of D. melanogaster, C. elegans, and S. cerevisiae (ref 1), and how do these genomes compare with that of a simple prokaryote, Haemophilus influenzae? [Investigators] carried out an “all-against-all” comparison of protein sequences encoded by each genome using algorithms that aim to differentiate paralogs—highly similar proteins that occur in the same genome—from proteins that are uniquely represented (Table 1). Counting each set of paralogs as a unit reveals the “core proteome”: the number of distinct protein families in each organism. This operational definition does not include posttranslationally modified forms of a protein or isoforms arising from alternate splicing." See note above table.
Entered by	Uri M
ID	112751