Range |
Table - link
|
Organism |
Various |
Reference |
Rubin GM et al., Comparative genomics of the eukaryotes. Science. 2000 Mar 24; 287(5461):2204-15. p.2205 table 1. PubMed ID 10731134
|
Primary Source |
See measurement method |
Method |
[Ref 63]: "Paralogous gene families (Table 1) were identified by running BLASTP. A version of NCBI-BLAST2 optimized for the Compaq Alpha architecture was used with the SEG filter and the effective search space length (Y option) set to 17,973,263. Each protein was used as a query against a database of all other proteins of that organism. A clustering algorithm was then used to extract protein families from these BLASTP results. Each protein sequence constitutes a vertex; each HSP [High Scoring Pair] between protein sequences is an arc, weighted by the BLAST Expect value. The algorithm identifies protein families by first breaking all arcs with an E value greater than some user-defined value (1×10^-6 was used for all of the analyses reported here). The resulting graph is then split into subgraphs that contain at least two-thirds of all possible arcs between vertices. The algorithm is “greedy”; that is, it arbitrarily chooses a starting sequence and adds new sequences to the subgraph as long as this criterion is met. An interesting property of this algorithm is that it inherently respects the multidomain nature of proteins: For example, two multidomain proteins may have significant similarity to one another but share only one or a few domains. In such a case, the two proteins will not be clustered if the unshared domains introduce a large number of other arcs." |
Comments |
P.2204 middle column bottom paragraph: "The “Core Proteome” How many distinct protein families are encoded in the genomes of D. melanogaster, C. elegans, and S. cerevisiae (ref 1), and how do these genomes compare with that of a simple prokaryote, Haemophilus influenzae? [Investigators] carried out an “all-against-all” comparison of protein sequences encoded by each genome using algorithms that aim to differentiate paralogs—highly similar proteins that occur in the same genome—from proteins that are uniquely represented (Table 1). Counting each set of paralogs as a unit reveals the “core proteome”: the number of distinct protein families in each organism. This operational definition does not include posttranslationally modified forms of a protein or isoforms arising from alternate splicing." See note above table |
Entered by |
Uri M |
ID |
112751 |
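
Illustrative sketch (not part of the original entry): the greedy clustering described under Method can be approximated as below. This is one reading of the quoted description, not the authors' code. The function name `cluster_families`, the `(protein_a, protein_b, evalue)` hit format, and the choice to test the two-thirds criterion on the growing subgraph each time a candidate is added are all assumptions made for illustration. Multiple HSPs between the same pair are treated as a single arc, and every protein named in a hit is assumed to appear in the protein list.

```python
def cluster_families(proteins, hits, e_cutoff=1e-6):
    """Greedy family clustering (hypothetical sketch of the quoted method).

    proteins: list of protein IDs; list order stands in for the
              "arbitrary" choice of starting sequence.
    hits:     iterable of (protein_a, protein_b, evalue) BLASTP pairs.
    Returns a list of families, each a sorted list of protein IDs.
    """
    # Build the graph, breaking every arc whose E value exceeds the cutoff.
    neighbors = {p: set() for p in proteins}
    for a, b, evalue in hits:
        if a != b and evalue <= e_cutoff:
            neighbors[a].add(b)
            neighbors[b].add(a)

    remaining = set(proteins)
    families = []
    for seed in proteins:
        if seed not in remaining:
            continue
        remaining.discard(seed)
        members = {seed}
        arcs = 0  # arcs currently inside the growing subgraph
        grew = True
        while grew:
            grew = False
            # Unassigned proteins linked to at least one current member.
            frontier = sorted(
                {n for m in members for n in neighbors[m]} & remaining
            )
            for cand in frontier:
                new_arcs = arcs + len(neighbors[cand] & members)
                size = len(members) + 1
                possible = size * (size - 1) // 2
                # Keep at least two-thirds of all possible arcs
                # (integer arithmetic avoids floating-point rounding).
                if 3 * new_arcs >= 2 * possible:
                    members.add(cand)
                    remaining.discard(cand)
                    arcs = new_arcs
                    grew = True
        families.append(sorted(members))
    return families


# Toy data (invented): a chain A-B-C-D plus a weak D-E hit.
toy_proteins = ["A", "B", "C", "D", "E"]
toy_hits = [
    ("A", "B", 1e-30),
    ("B", "C", 1e-12),
    ("C", "D", 1e-9),
    ("D", "E", 1e-2),  # above the cutoff, so this arc is broken
]
toy_families = cluster_families(toy_proteins, toy_hits)
# Counting each family (singletons included) as one unit gives the
# "core proteome" count in the sense of the Comments field.
core_proteome_size = len(toy_families)
```

In this toy input, D is linked to the A-B-C subgraph by a single arc; adding it would leave 3 of 6 possible arcs, below two-thirds, so it stays separate. This mirrors the multidomain behavior noted in the quoted method, where a protein sharing only one link with a family is not pulled into it.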