- 26 Jun 2017 » Probabilistic Path Hamiltonian Monte Carlo
- 07 Jun 2017 » Smart proposals win for online phylogenetics using sequential Monte Carlo.
- 26 Oct 2016 » Incorporating new sequences into posterior distributions using SMC
- 22 Jul 2016 » Summer high school and undergraduate students 2016
- 11 Jul 2016 » Analysis of a slightly gentler discretization of time-trees
- 30 Jun 2016 » A time-optimal algorithm to build the SPR subgraph on a set of trees
- 05 Jun 2016 » An information-theoretic analysis of phylogenetic regularization
- 18 Apr 2016 » Postdoctoral position to develop next-generation Bayesian phylogenetic methods

Hamiltonian Monte Carlo (HMC) is an exciting approach for sampling Bayesian posterior distributions. HMC is distinguished from typical MCMC because proposals are derived such that acceptance probability is very high, even though proposals can be distant from the starting state. These lovely proposals come from approximating the path of a particle moving without friction on the posterior surface.

Indeed, the situation is analogous to a bar of soap sliding around a bathtub, where in this case the bathtub is the posterior surface flipped upside down. When the soap hits an incline, i.e. a region of bad posterior, it slows down and heads back in another direction. We give the soap some random momentum to start with, let it slide around for some period, and where it is at the end of this period is our proposal. Calculating these proposals requires integrating out the soap dynamics, which is done by numerically integrating physics equations (hence the...
*(full post)*
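
For readers who want to see the sliding-soap idea in code, here is a minimal sketch of one HMC proposal using the standard leapfrog integrator. It is not the probabilistic path variant from the post: the target is a toy one-dimensional standard normal, and the names `neg_log_post`, `leapfrog`, and `hmc_step`, along with the step size and step count, are assumptions made for illustration.

```python
import math
import random


def neg_log_post(x):
    """Potential energy: negative log density of a toy standard normal target."""
    return 0.5 * x * x


def grad_neg_log_post(x):
    """Gradient of the potential energy above."""
    return x


def leapfrog(x, p, step_size, n_steps):
    """Approximate the frictionless path with the leapfrog integrator."""
    p = p - 0.5 * step_size * grad_neg_log_post(x)  # initial half step for momentum
    for i in range(n_steps):
        x = x + step_size * p  # full step for position
        if i < n_steps - 1:
            p = p - step_size * grad_neg_log_post(x)  # full step for momentum
    p = p - 0.5 * step_size * grad_neg_log_post(x)  # final half step for momentum
    return x, p


def hmc_step(x, rng, step_size=0.1, n_steps=20):
    """Draw a random momentum, slide, then accept or reject the end point."""
    p0 = rng.gauss(0.0, 1.0)
    x_new, p_new = leapfrog(x, p0, step_size, n_steps)
    h_old = neg_log_post(x) + 0.5 * p0 * p0
    h_new = neg_log_post(x_new) + 0.5 * p_new * p_new
    # Accept with probability min(1, exp(h_old - h_new)).
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return x_new, True
    return x, False
```

Because leapfrog nearly conserves the total energy, the acceptance probability stays close to one even though the proposal can land far from the starting point.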

Sometimes projects take years to bear fruit.

As I described previously, Aaron Darling and I have been thinking for a good while about using Sequential Monte Carlo (SMC) for *online* Bayesian phylogenetic inference, in which an existing posterior on trees can be updated with additional sequences. In fact, we had a good enough proof-of-concept implementation in 2013 to give a talk at the Evolution meeting. The other talk from my group that year was about a surrogate function for likelihoods parameterized by a single branch length, which we called “lcfit”. These two projects have just recently resulted in intertwined submitted papers.

The SMC implementation, which we called `sts` for Sequential Tree Sampler, lay dormant for a while after Connor McCoy left the group despite a few efforts to rekindle it. One of the things that kept us from wrapping it up was problems with *particle degeneracy*, which is as follows. I think of...
*(full post)*

The Bayesian approach is a beautiful means of statistical inference in general, and phylogenetic inference in particular. In addition to getting posterior distributions on trees and mutation model parameters, the Bayesian approach has been used to get posteriors on complex model parameters such as geographic location and ancestral population sizes of viruses, among other applications.

The bummer about Bayesian computation? It takes so darn long for those chains to converge. And what’s worse in this age of in-situ sequencing of viruses for rapidly unfolding epidemics? *If you get a new sequence you have to start over from scratch.*

I’ve been thinking about this for several years with Aaron Darling, and in particular about Sequential Monte Carlo (SMC) for this application. SMC, also called particle filtering, is a way to get a posterior estimate with a collection of “particles” in a sequential process of reproduction and selection. You can think about it as a probabilistically...
*(full post)*
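
To make the reproduction-and-selection picture concrete, here is a minimal sketch of one SMC generation on a toy problem in which each particle is a single real number rather than a tree. The function name `smc_update`, the Gaussian jitter, and the returned effective sample size are illustrative assumptions, and the jitter is a simplification standing in for a proper move step.

```python
import math
import random


def smc_update(particles, weights, log_likelihood, rng, jitter=0.05):
    """One SMC generation: reweight by new data, resample, then perturb."""
    # Selection: reweight each particle by how well it explains the new datum.
    log_w = [math.log(w) + log_likelihood(x) for x, w in zip(particles, weights)]
    top = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(lw - top) for lw in log_w]
    total = sum(w)
    w = [wi / total for wi in w]
    # Effective sample size: small values signal particle degeneracy.
    ess = 1.0 / sum(wi * wi for wi in w)
    # Reproduction: particles leave offspring in proportion to their weights.
    n = len(particles)
    resampled = rng.choices(particles, weights=w, k=n)
    # Simplification (assumption): a small jitter restores particle diversity.
    moved = [x + rng.gauss(0.0, jitter) for x in resampled]
    return moved, [1.0 / n] * n, ess
```

Calling `smc_update` once per new observation updates the particle population in place of restarting from scratch, which is the appeal for the online setting.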

I definitely didn’t set out to have three high school and two undergrad students this summer.

But they’re fantastic, and making real contributions to our scientific work! Anna and Apurva (left and right) have written a new C++ front-end to the essential Smith-Waterman pre-alignment step for Duncan’s partis software. Andrew and Lola (center) are investigating the traces of maximum-likelihood phylogenetic inference software packages. Thayer (back left) is writing a multi-threaded C++ program to systematically search for all of the trees above a given likelihood cutoff. All of them are learning about science and coding.

These students rock, and I can’t wait to see what great things they bring into the world with their talent.

Inferring a good phylogenetic tree topology, i.e. a tree without branch lengths, is the primary challenge for efficient tree inference. As such, we and others think a lot about how algorithms move between topologies, typically formalizing this movement as a path through a graph whose vertices are tree topologies and whose edges are moves from one tree to another. Removing all branch length information makes sense because algorithms are formulated in terms of these topologies: for classical unrooted tree inference, the set of trees that are tried from a specific tree is not determined by the branch lengths of the current tree.

But what about time-trees? Time-trees are rooted phylogenetic trees such that every event is given a time: each internal node is given a divergence time, and each leaf node is given a sampling time. These are absolute times of the sort one could put on a calendar. To first approximation, working with time-trees is...
*(full post)*

We would like to better understand the subtree-prune-regraft (SPR) graph, which is a graph underlying most modern phylogenetic inference methods. The nodes of this graph are the set of leaf-labeled phylogenetic trees, and the edges connect pairs of trees that can be transformed from one to another by moving a subtree from one place to another. Phylogenetic methods implicitly move around this graph, whether to sample trees or find the most likely tree. The work to understand this graph has been led by Chris Whidden, including learning about how the graph structure influences Bayesian phylogenetic inference and learning about the overall structure of the graph.

These projects required us to reconstruct the subgraph of the full SPR graph induced by a subset of the nodes. In the course of our work we have been getting progressively better at constructing this graph efficiently. In our latest work we develop a time-optimal algorithm.

Chris’...
*(full post)*
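
For orientation, here is a minimal sketch of what "the subgraph induced by a subset of the nodes" means in code. This is the naive all-pairs construction, not the time-optimal algorithm from the post. The predicate `are_adjacent` is hypothetical: for the SPR graph it would test whether two trees are one SPR move apart, which is not implemented here.

```python
from itertools import combinations


def induced_subgraph(nodes, are_adjacent):
    """Build the induced subgraph by testing every pair of nodes (quadratic)."""
    nodes = list(nodes)
    adjacency = {v: set() for v in nodes}
    for u, v in combinations(nodes, 2):
        # are_adjacent is a caller-supplied, hypothetical adjacency test.
        if are_adjacent(u, v):
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency


# Toy usage: integers stand in for trees, adjacent when they differ by one.
toy = induced_subgraph([1, 2, 3, 5], lambda a, b: abs(a - b) == 1)
```

The number of pairwise tests grows quadratically with the number of trees, which is why a faster construction matters for large tree sets.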

How frequently are genes transferred horizontally? A popular means of addressing this question involves building phylogenetic trees on many genes, and looking for genes that end up in surprising places. For example, if we have a lineage B that got a gene from lineage A, then a tree for that gene will have B’s version of that gene descending from an ancestor of A, which may be on the other side of the tree.

Using this approach requires that we have accurate trees for the genes. That means doing a good job with our modeling and inference, but it also means having data with plenty of the mutations which give signal for tree building. Unfortunately, sometimes we don’t have such rich data, but we’d still like to do such an analysis.

A naïve approach is just to run the sequences we have through maximum-likelihood tree estimation software and take the best tree for each...
*(full post)*

*Although we have recently made a hire in this area, we continue to look for strong junior scientists to work on this and related projects.*

There is a lot more sequence data than in the early 2000s, but inferential algorithms for Bayesian phylogenetic inference haven’t changed much since that time. There have definitely been advances, such as more clever proposal distributions, swapping out heated chains, and GPU-enabled likelihood calculations, but the core remains the same: propose a new state via a small branch length and/or tree structure perturbation, and accept or reject according to the Metropolis choice. Furthermore, the community has packed more and more complexity into priors and models, which would lead to a computational bottleneck even with a fixed number of sequences.
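
The propose-then-accept-or-reject core can be sketched in a few lines. This is a minimal illustration on a one-dimensional toy posterior with a symmetric Gaussian perturbation, not a phylogenetic sampler: the name `metropolis`, the step size, and the scalar state are assumptions, and a real implementation would perturb branch lengths and tree structure instead.

```python
import math
import random


def metropolis(log_post, x0, n_iter, rng, step=0.5):
    """Random-walk Metropolis: small perturbation, then accept or reject."""
    x = x0
    lp = log_post(x)
    samples = []
    n_accept = 0
    for _ in range(n_iter):
        x_prop = x + rng.gauss(0.0, step)  # small symmetric perturbation
        lp_prop = log_post(x_prop)
        # Metropolis choice: accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            x, lp = x_prop, lp_prop
            n_accept += 1
        samples.append(x)  # a rejected proposal repeats the current state
    return samples, n_accept / n_iter
```

Small perturbations keep the acceptance rate reasonable but make successive samples highly correlated, which is one reason chains take so long to converge.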

It’s time to improve inferential algorithms. Part of my inspiration lies in watching the revolution that has occurred in computational statistics in the last decade, in which the menu has...
*(full post)*
