We prove theorems, develop algorithms, and write code to help understand problems in biology. The group has special interest in evolution and phylogenetics.
File formats are boring. BO-RING! However, bad file formats can be a real pain, leading to real suffering and bitterness. We put our heads together with Alexandros Stamatakis to design a sane file format for phylogenetic placements. It’s based on old friends Newick and JSON, and is reasonably compact and extensible. Our paper about it just came out in PLoS ONE. This wasn’t the most interesting work, but establishing standards is a win for both users and developers.
The FHCRC has fantastic scientists, and it’s been a pleasure getting to know them. We’ve recently begun working with the labs of Harmit Malik and Michael Emerman. This week marked the first product of these collaborations, with a paper in Cell Host and Microbe on the evolution of the Vpr and and Vpx genes in primate lentiviruses (the viral genus containing HIV). These genes (sometimes, in a host-specific way) degrade a host factor, SAMHD1, that restricts HIV infection. This work gives evidence of an “arms race” between SAMHD1 and Vpr/Vpx, and a phylogenetic analysis suggests that the Vpr gene obtained SAMHD1 degradation capability prior to the birth of the Vpx gene. We had a small part in helping with the phylogenetic analysis for this paper. The paper was evaluated on the F1000 site.
Chris Small joins today! Chris has a degree in math from Evergreen; he spent a few years after graduation working at Luxel and was freelancing after that. He goes by metasoarous on github and will be developing new ways of comparing high-throughput data sets in a phylogenetic setting.
As some of you know, we are currently improving methods to perform taxonomic classification using phylogenetic placements. In doing so, we noticed something not-so-surprising right away: when the taxonomy and the phylogeny disagree, it makes taxonomic classification difficult. I starting looking around for tools to evaluate discordance between a taxonomy and a phylogeny, but I didn’t find anything.
There is an appropriate formulation in the CS literature, in terms of trees with colored leaves. The minimal convex uncoloring of such a tree is the smallest set of leaves whose colors have to removed such that each remaining color appears in exactly one non-overlapping region of the tree. We can formalize the concordance between a taxonomy and a phylogeny at a given rank by taking the coloring that is induced by the taxonomic classifications at that rank and finding the corresponding minimal uncoloring. The leaves that have to get uncolored are the minimal such set that don’t “fit” according to the phylogeny.
Moran and Snir made this definition and did the foundational work in this area, showing that the problem is NP-hard and fixed parameter tractable. For leaf colorings they have a very elegant dynamic program that uses the Hungarian algorithm to solve the problem. It’s exponential in the total number of colors violating convexity, and works best for trees of large degree.
For our application to taxonomy, the colors aren’t spread uniformly over the tree, but rather are localized (for the most part, they aren’t that far off). Thus it makes sense to develop an algorithm that is exponential in a measure of local nonconvexity, rather than in the total number of nonconvex colors. Also, we usually deal with binary trees. Aaron G and I developed an algorithm that’s specialized to this setting, and it includes a branch-and-bound that can save orders of magnitude of computation time and space. It’s available as rppr convexify in the dev branch of the pplacer suite.
This was a fun project, and I’m happy that we have a practically useful tool for this problem. It’s Aaron’s first paper, which makes me proud, and I definitely needed his coding skills for the rather involved dynamic program. If you are interested, the paper is up on the arXiv.
Dear Microbial Ecology community:
Have you ever been annoyed with how the axes of principal components plots don’t come with a clear interpretation? Ever wondered why a hierarchical clustering algorithm chose to do a specific merge on your data?
Steve Evans and I decided to take on these issues and develop some more transparent ordination and clustering methods for phylogenetic placement data. Our new methods are called edge principal components analysis (edge PCA) and squash clustering. In edge PCA, the principal components are weights on the edges of the “reference” phylogenetic tree; these weights can be visualized as a thickening and coloring of the edges. Thus, using our visualization tool, thick red edges are those that move points in the positive direction along the corresponding PCA axis, and thick blue edges are those that move points in the negative direction. Edge PCA also uses the structure of the tree to find consistent differences between nearby species, which can result in greater resolution than distance-based methods. In squash clustering, the internal nodes of the clustering tree correspond to distributions of mass on the reference tree, and distances between internal nodes are Kantorovich-Rubinstein (earth-mover’s) distances between those distributions. These mass distributions can be visualized so users can view what is driving a particular merge.
It took us a little while to get it the manuscript in its final version, and I’ve just put it up on the arXiv preprint server. You can also see an example tree showing the edge PCA axes.