zo's little phylogenetic experiment
.. where I will infer evolutionary relationships by analyzing homologous DNA sequences
(because hey, the animals can't tell us themselves...:))

the organisms:
human (Homo sapiens)
pig (Sus scrofa)
cow (Bos taurus)
mouse (Mus musculus)
frog (Xenopus laevis)
rat (Rattus norvegicus)
rainbow trout (Oncorhynchus mykiss)
zebrafish (Danio rerio)
fruit fly (Drosophila melanogaster)
a nematode (Caenorhabditis elegans)
mustard plant (Arabidopsis thaliana)
soybean (Glycine max)
baker's yeast (Saccharomyces cerevisiae)

(all from the NCBI taxonomy home)

And despite the pretense of academic neutrality, it is the author's sincere hope that the upcoming phylogenetic tree computation will display a sufficiently comforting divergence from R. norvegicus.


links:
The program I wrote to decipher the raw NCBI HomoloGene data file
and, the raw NCBI data file (warning: this is a *large* file)
and, the genes I am using


detail:
1 human gene had 20 nonhuman homologs (the winner: accession NM_002803.2: PSMC2 proteasome (prosome, macropain) 26S subunit, ATPase, 2)
5 human genes had at least 19 nonhuman homologs
6 human genes had at least 18 nonhuman homologs
8 human genes had at least 17 nonhuman homologs
8 human genes had at least 8 nonhuman homologs
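
The tallies above are cumulative counts ("at least N homologs"). A minimal sketch of that kind of tally, assuming the HomoloGene data has already been boiled down to a mapping from each human gene to the set of nonhuman species with a homolog. The function name, the mapping, and the toy gene names are all hypothetical and are not taken from the actual program linked above:

```python
def count_at_least(homologs, thresholds):
    """For each threshold t, count the human genes with >= t nonhuman homologs.

    homologs: dict mapping a human gene id to a set of species names
              (an assumed, simplified stand-in for the parsed data file).
    """
    sizes = [len(species) for species in homologs.values()]
    return {t: sum(1 for n in sizes if n >= t) for t in thresholds}


# Toy data only: these gene names and species sets are made up.
toy_homologs = {
    "geneA": {"pig", "cow", "mouse"},
    "geneB": {"pig", "cow"},
    "geneC": {"mouse"},
}

tally = count_at_least(toy_homologs, [1, 2, 3])
for threshold in sorted(tally, reverse=True):
    print(f"{tally[threshold]} human genes had at least {threshold} nonhuman homologs")
```

Because the counts are cumulative, a gene with 3 homologs is also counted under "at least 2" and "at least 1", which is why the numbers can only grow as the threshold drops.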



the homologous genes (with percent match; each gene id links to the corresponding NCBI GenBank page):

human | pig | cow | mouse | frog | rat | trout | zebrafish | fly | C. elegans | flower | soybean | yeast
1. NM_001694.2 | CB477173.1 (89.44%) | J03835.1 (83.37%) | NM_009729.1 (81.57%) | BC043805.1 (78.43%) | NM_130823.1 (82.71%) | CA356210.1 (82.87%) | AY099523.1 (82.72%) | NM_165475.1 (78.09%) | NM_067787.2 (73.97%) | NM_120052.1 (74.87%) | 44496740 (73.66%) | CUP5_6320808 (67.09%)
2. NM_005445.2 | BF703451.1 (88.42%) | AF072713.1 (94.62%) | NM_007790.1 (84.72%) | AJ535316.1 (80.56%) | NM_031583.1 (83.53%) | BX317925.1 (77.66%) | BC044408.1 (78.6%) | NM_078650.2 (71.52%) | NM_067052.1 (73.24%) | NM_128275.1 (73.35%) | BU763204.1 (70.97%) | SMC3_6322387 (74.55%)
3. NM_033301.1 | CB286761.1 (89.22%) | CB172061.1 (89.45%) | NM_012053.1 (86.86%) | BC043823.1 (79.97%) | XM_216948.1 (86.66%) | BX084619.1 (80.28%) | AY130440.1 (81.5%) | NM_167955.1 (76.17%) | NM_075539.1 (73.54%) | NM_127358.1 (74.41%) | AJ404848.1 (73.64%) | RPL2A_14318555 (69.57%)
4. NM_004526.1 | BE232333.1 (91.96%) | BE750275.1 (89.28%) | NM_008564.1 (86.75%) | BC046274.1 (78.94%) | XM_232168.1 (84.83%) | BX315929.1 (79.5%) | BC048026.1 (79.79%) | NM_057773.3 (73.71%) | NM_064157.1 (71.22%) | NM_103572.1 (73.02%) | BI785598.1 (73.9%) | MCM2_6319448 (72.26%)
5. NM_002790.2 | BE032267.1 (93.64%) | CB167145.1 (93.33%) | NM_011967.1 (90.45%) | BU910745.1 (80.68%) | NM_017282.1 (91.48%) | CA360681.1 (81.89%) | 57048396 (81.11%) | NM_057854.3 (75.67%) | NM_060364.1 (74.3%) | NM_104262.1 (72.62%) | AF255338.1 (74.03%) | PUP2_6321692 (76.64%)
6. NM_002080.1 | M11732.1 (84%) | Z25466.1 (88.64%) | NM_010325.1 (79.23%) | BQ884358.1 (78.16%) | NM_013177.1 (79.69%) | CA346608.1 (76.79%) | BC049435.1 (77.61%) | CG4233_24580970 (75.39%) | NM_171702.1 (75.8%) | NM_128651.1 (74%) | L40579.1 (74.41%) | AAT2_6323055 (74.66%)
7. NM_002804.3 | CB469136.1 (93.57%) | CB468704.1 (91.26%) | NM_008948.1 (86.41%) | BC046948.1 (81%) | NM_031595.1 (86.63%) | AF281342.1 (84.68%) | CA496015.1 (81.68%) | NM_079740.2 (78.23%) | NM_059271.1 (73.01%) | NM_111426.1 (75.73%) | 46753656 (73.11%) | RPT5_6324691 (72.66%)
8. NM_002803.2 | CB478360.1 (87.72%) | CB167140.1 (87.83%) | XM_204224.2 (88.06%) | X80157.1 (82.67%) | NM_033236.1 (88.89%) | CA355808.1 (81.01%) | CA470417.1 (78.64%) | NM_058125.3 (75.28%) | NM_073604.1 (73.76%) | NM_104252.1 (72.54%) | BI942294.1 (76.83%) | RPT1_6322704 (74.27%)
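
This page does not say exactly how the percent match figures were computed, so the following is only a sketch of one common definition: percent identity over the aligned, non-gap columns of two sequences (presumably each homolog against the human gene). The function name and the toy sequences are assumptions for illustration.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent of matching positions between two aligned sequences.

    Columns where either sequence has a gap are skipped. This is one
    common convention; other tools divide by the full alignment length
    or by the shorter sequence, which gives different numbers.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # ignore gapped columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Toy aligned pair (made up): 5 comparable columns, 4 of them identical.
print(round(percent_identity("ACGT-A", "ACGA-A"), 2))
```

The choice of denominator matters more than it looks: skipping gapped columns tends to flatter distant homologs, so numbers computed this way are not directly comparable to figures from a tool that penalizes gaps.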




the DNA of the genes, (in FASTA format):     (used in NCBI HomoloGene requests)
FASTA       FASTA-modified       pearson format response
(one link per format for each of genes 1 through 8)
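
For readers who have not met FASTA before: each record is a header line starting with ">" followed by one or more lines of sequence. A minimal reader might look like the sketch below. It assumes plain, well-formed input and keeps only the first word of each header as the record name; the function name and the sample records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {name: sequence}.

    The name is the first whitespace-delimited word after '>'.
    Sequence lines belonging to one record are concatenated.
    """
    records = {}
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Made-up sample input, two records with wrapped sequence lines.
sample = """>seq1 toy human record
ACGTAC
GGTA
>seq2 toy mouse record
ACGTTC
"""
for record_name, sequence in parse_fasta(sample).items():
    print(record_name, len(sequence))
```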




the DNA of the genes, multiply aligned with each other:
Multi-FASTA       Phylip Newick interleaved       boxshade (pdf)       H. sapiens pairwise alignment plot (about this)
(one link per format for each of genes 1 through 8)




phylogenetic trees (in text form)       PHYLIP files can be viewed with treeview
Mavid link       Phylip (Clustal-W)       Phylip (Mavid-Newick Format)       Neighbor Joining Tree (Newick Format)      
(one link per format for each of genes 1 through 8)
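
The Newick format used by these tree files is just nested parentheses with optional branch lengths after a colon, for example ((mouse:0.1,rat:0.1):0.3,human:0.4); A rough sketch of pulling the leaf names out of such a string follows. It assumes simple unquoted labels, so it is not a full Newick parser, and the function name is made up for this example.

```python
import re


def newick_leaves(newick):
    """Return the leaf names of a Newick string, left to right.

    Assumes unquoted labels without parentheses, commas, or semicolons.
    Branch lengths (the part after ':') are discarded.
    """
    structural = {"(", ")", ",", ";"}
    # Split into structural characters and runs of everything else.
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    leaves = []
    previous = None
    for token in tokens:
        if not token.strip():
            continue  # ignore whitespace-only runs
        if token not in structural:
            label = token.split(":")[0].strip()
            # A label directly after ')' names an internal node, not a leaf.
            if label and previous != ")":
                leaves.append(label)
        previous = token
    return leaves


print(newick_leaves("((mouse:0.1,rat:0.1):0.3,human:0.4);"))
```

A quick leaf listing like this is handy for checking that every tree file covers the same set of species before feeding them to a consensus program.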




phylogenetic trees, quartet puzzled (in text form)       Maximum Likelihood, via Quartet Puzzling
Summary       Clock-like Distances       PHYLIP Tree      
(one link per format for each of genes 1 through 8)




phylogenetic trees (in image form)
Clustal-W       Mavid       Mavid (trifurcated) ATV applet
(one link per format for each of genes 1 through 8)




the consensus trees:       (created via PHYLIP CONSENSE)
      in text format             in PHYLIP format             in image (jpg) format      
Mavid non-rooted Mavid non-rooted
Mavid rooted Mavid rooted Mavid rooted
Quartet Puzzled Quartet Puzzled Quartet Puzzled
Combined Distance-ML Combined Distance-ML Combined Distance-ML
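
The idea behind a consensus tree can be sketched in a few lines: collect the groupings (clades) found in each per-gene tree and keep the ones that show up in more than half of them. The sketch below is a simplified, strict majority rule on rooted trees written as nested tuples. It is not a reimplementation of PHYLIP CONSENSE, whose options and handling of unrooted trees are more involved, and all names and toy trees here are assumptions for illustration.

```python
from collections import Counter


def clades(tree):
    """Return (leaf_set, list_of_clades) for a rooted nested-tuple tree.

    A leaf is a string; an internal node is a tuple of subtrees.
    Every internal node contributes the frozenset of leaves below it.
    """
    if isinstance(tree, str):
        return frozenset([tree]), []
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found


def majority_clades(trees):
    """Return the clades present in more than half of the given trees."""
    counts = Counter()
    for tree in trees:
        _, found = clades(tree)
        counts.update(set(found))  # count each clade once per tree
    return {clade for clade, n in counts.items() if n > len(trees) / 2}


# Toy input: three made-up gene trees over the same three species.
toy_trees = [
    (("mouse", "rat"), "human"),
    (("mouse", "rat"), "human"),
    (("mouse", "human"), "rat"),
]
for clade in sorted(majority_clades(toy_trees), key=len):
    print(sorted(clade))
```

In the toy input the mouse-rat grouping survives because two of the three trees agree on it, while the mouse-human grouping from the third tree is dropped.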






and so, the conclusion of this experiment is... (as viewed in treeview)



not terribly surprising! that's good! :)
the mouse and the rat are closely related.
the cow and the pig are close, and the human is close to both of them.
the trout and the zebrafish (Danio) are close, and the frog is close to both of them. makes sense.
the flower is off by itself, ok. very different from everything else. (being the only *plant* in the study)
the only mild surprise is that the fly and C. elegans are close?? well.. hmm. maybe attribute that to an artifact of consensus-building?
since neither the neighbor-joining tree nor the original quartet-puzzled tree has that mildly peculiar quality, that explanation seems reasonable.





other generally useful links:
ClustalW
Phylip
MAVID
About Mavid Vista
ATV (Forester)
TreeTop
NCBI HomoloGene
bayes aligner
google phylogenetic fasta
MEGA (Molecular Evolutionary Genetics Analysis)
The Tree Of Life Web Project (tolweb.org)
The Sanger Institute

European Bioinformatics Institute
Oxford Bioinformatics Centre
NCBI
Homologous Vertebrate Genes Database (HOVERGEN)





bibliography:
Duret, L., Mouchiroud, D. and Gouy, M. (1994) HOVERGEN, a database of homologous vertebrate genes. Nucleic Acids Res. 22, 2360-2365.

Galtier, N., Gouy, M. and Gautier, C. (1996) SeaView and Phylo_win, two graphic tools for sequence alignment and molecular phylogeny. Comput. Applic. Biosci., 12, 543-548.

Bray, N., Dubchak, I. and Pachter, L. (2003) AVID: A global alignment program. Genome Research, 13, 97-102 (supplementary website).

Bray N, Pachter L: Maximum likelihood ancestral alignment of multiple large genomic regions, submitted.

Mayor C., Brudno M., Schwartz J. R., Poliakov A., Rubin E. M., Frazer K. A., Pachter L. S. and Dubchak I. (2000) VISTA: Visualizing Global DNA Sequence Alignments of Arbitrary Length. Bioinformatics, 16: 1046-1047.

Felsenstein, J. 1993. PHYLIP (Phylogeny Inference Package) version 3.5c. Distributed by the author. Department of Genetics, University of Washington, Seattle.

Felsenstein, J. 1989. PHYLIP -- Phylogeny Inference Package (Version 3.2). Cladistics 5: 164-166.

Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: A quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol. 13: 964-969.





any questions, comments, or suggestions, please email zo. (zoo at cs dot uchicago dot edu)