This project explores how to obtain minimum-entropy phylogenetic trees. It contains a Monte Carlo-based entropy minimization algorithm and an experiment that tests how minimizing entropy affects the stability of trees under small perturbations of the input data.
monte_carlo_entropy.py takes a FASTA file of aligned sequences and a phylogenetic tree in Newick format, then repeatedly moves randomly chosen subtrees, accepting only the changes that reduce the total tree entropy. It returns a phylogenetic tree whose entropy is a local minimum.
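The accept-only-if-better rule described above can be sketched generically. This is a minimal illustration, not the code in monte_carlo_entropy.py: the names `minimize`, `score`, and `propose` are hypothetical, and the toy usage below replaces "tree" and "subtree move" with an integer and a random step so the example stays self-contained.

```python
import random


def minimize(state, score, propose, steps=1000, seed=0):
    """Greedy Monte Carlo search: keep a random proposal only if it lowers the score.

    `score` and `propose` are placeholders (assumptions). In this project they
    would correspond to tree entropy and a random subtree move, respectively.
    """
    rng = random.Random(seed)
    best, best_score = state, score(state)
    for _ in range(steps):
        candidate = propose(best, rng)
        candidate_score = score(candidate)
        # Reject anything that does not strictly reduce the score.
        if candidate_score < best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy usage: walk an integer toward the minimum of (x - 7)^2.
best, best_score = minimize(
    0,
    score=lambda x: (x - 7) ** 2,
    propose=lambda x, rng: x + rng.choice((-1, 1)),
)
```

Because worse proposals are never accepted, a search of this kind stops at a local minimum, which matches the "locally minimal" result described above.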
The definition of tree entropy is not settled, but currently we treat each internal node as a cluster of the sequences in its descendant leaf nodes, and define tree entropy as sum(entropy(cluster) * size(cluster)) over all clusters, where size denotes the number of descendant leaves. The entropy of a cluster of aligned sequences is computed column-wise from the nucleotide frequencies.
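A minimal sketch of this definition follows. Several details are assumptions that the text above does not fix: the tree is a nested tuple of leaf names rather than a parsed Newick tree, column entropy is Shannon entropy in bits, per-column entropies are summed to get the cluster entropy, gap characters count as ordinary symbols, and the root counts as a cluster. The function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the symbol frequencies in one alignment column."""
    total = len(column)
    counts = Counter(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def cluster_entropy(sequences):
    """Sum of column entropies over a cluster of equal-length aligned sequences."""
    return sum(column_entropy(column) for column in zip(*sequences))


def tree_entropy(tree, sequences):
    """sum(entropy(cluster) * size(cluster)) over all internal nodes.

    `tree` is a nested tuple of leaf names (assumption); `sequences` maps
    each leaf name to its aligned sequence.
    """
    total = 0.0

    def visit(node):
        nonlocal total
        if isinstance(node, str):  # leaf: contributes no cluster of its own
            return [node]
        leaves = [leaf for child in node for leaf in visit(child)]
        total += cluster_entropy([sequences[leaf] for leaf in leaves]) * len(leaves)
        return leaves

    visit(tree)
    return total


# Toy alignment: "a" and "b" are identical, "c" differs in the second column.
seqs = {"a": "AC", "b": "AC", "c": "AG"}
grouped = tree_entropy((("a", "b"), "c"), seqs)  # identical sequences share a cluster
split = tree_entropy((("a", "c"), "b"), seqs)    # differing sequences share a cluster
```

In this toy case the tree that groups the two identical sequences has the lower entropy, since its inner cluster contributes zero.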
tree_stability.py is an experiment that measures how much trees change when a small percentage of nucleotides is randomly perturbed. It takes a FASTA file and perturbs some of its nucleotides, then computes SPHERE and RAxML trees on both the original and the perturbed sequences. Next, it runs monte_carlo_entropy.py to minimize the entropy of each tree. Finally, it computes the Robinson-Foulds distance between each pair of trees, that is, across both methods, with and without perturbation, and before and after entropy minimization. It also reports the parsimony score of each tree.
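For readers unfamiliar with the Robinson-Foulds distance, here is a small self-contained sketch. It is not the code used by tree_stability.py, which may rely on a phylogenetics library. It assumes trees are nested tuples of leaf names, treats them as unrooted, and counts the non-trivial leaf bipartitions present in exactly one of the two trees. The function names are hypothetical.

```python
def leaf_set(tree):
    """All leaf names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def splits(tree):
    """Non-trivial bipartitions of the leaves induced by the internal edges.

    Each split is stored as the side that does NOT contain a fixed reference
    leaf, so the same bipartition always has the same representation.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - clade if reference in clade else clade
        # Skip trivial splits (a single leaf, or nothing, on one side).
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Number of bipartitions found in one tree but not in the other."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must have the same leaf set")
    return len(splits(tree_a) ^ splits(tree_b))


t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
distance = robinson_foulds(t1, t2)  # the trees disagree on one split each
```

A distance of zero means the two trees have the same unrooted topology. Larger values mean more disagreeing splits, which is what the experiment compares across methods, perturbation, and entropy minimization.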