Bayesian phylogenetic inference without sampling trees

Most every description of Bayesian phylogenetics I’ve read proceeds as follows:

“Bayesian phylogenetic analyses are conducted using a simulation technique known as Markov chain Monte Carlo (MCMC).” (Alfaro & Holder, 2006)
“Posterior probabilities are obtained by exploring tree space using a sampling technique, called Markov chain Monte Carlo (MCMC).” (Lemey et al, The Phylogenetic Handbook)
“Once the biologist has decided on the data, model and prior, the next step is to obtain a sample from the posterior. This is done by using MCMC…” (Nascimento et al, 2017.)

With statements like these in popular (and otherwise excellent!) reviews, it’s not surprising that people confuse Bayesian phylogenetics and Markov chain Monte Carlo (MCMC). Well, let’s be clear.

MCMC is one way to approximate a Bayesian phylogenetic posterior distribution. It is not the only way.

In this post I’ll describe two of our recent papers that together give a systematic, rather than random, means of approximating a phylogenetic posterior distribution.

Without a doubt MCMC is the most popular means of approximating the posterior. MCMC is wonderfully simple. To implement a sampler (assuming you have a computable likelihood and prior) all you have to do is devise a proposal distribution that tries out a new tree and/or model parameter, and then accept/reject based on the Metropolis-Hastings ratio. If your proposal is reasonably good, then MCMC will converge (in the limit of a large number of samples) to the posterior distribution.

However, the simplicity of MCMC implies inherent limitations. It is not a “smart” algorithm. For example, one cannot easily adapt the proposal distribution to reflect what the MCMC is learning about the posterior.

This is problematic because the number of trees grows super-exponentially, and most of those trees are terrible explanations of a given data set. Thus, naive random tree modification proposals are likely to take us out of the high-posterior region. This is manifested in either timid tree proposal distributions, or as an unacceptably high rejection rate for proposals in an MCMC run.

Before we start talking about alternatives, I want to emphasize that people have done wonderful and important work within the MCMC framework. It has brought us all of the biological insights we have learned using Bayesian phylogenetic methodology, from deep divergence inference with complex models to genetic epidemiology. Methodologically, many authors have made this possible, from developing and testing tree proposals to leveraging Metropolis-Coupled Monte Carlo, which although not a panacea certainly improves topological mixing.

Let’s start discussing alternatives by backing up a bit.

What is Bayesian phylogenetics trying to do?

In the Bayesian phylogenetic framework, we are interested in the posterior distribution \(p(\tau, \theta \mid D)\) on tree topologies \(\tau\) and model parameters \(\theta\) (including branch lengths) given sequence data \(D\). Bayes’ rule tells us that the posterior is proportional to the likelihood times the prior:

\[p(\tau, \theta \mid D) \propto p(D \mid \tau, \theta) \, p(\tau, \theta)\]

Because this is a proportionality, we can easily evaluate ratios of posterior probabilities (such as to compute the Metropolis-Hastings ratio), but getting the true value of the posterior is intractable for phylogenetics.

Now, it’s common to be interested in the tree topologies \(\tau\) rather than the joint distribution on the topology and all of the associated continuous parameters. For instance, one might want to test monophyly of a given clade. That is, we would like to know \(p(\tau \mid D)\).

The common way to do this with MCMC is to run your chain, count the number of times you saw each topology \(\tau_i\), then divide by the number of samples from your chain. By ignoring the continuous parameters, we effectively marginalize them out.

In our work we wondered if we could develop an alternative means of getting \(p(\tau \mid D)\) without a sampling-based method such as MCMC. Specifically, we wanted to avoid any randomized movement through tree space.

Consider that

\[p(\tau_i \mid D) = \frac{p(D \mid \tau_i) \, p(\tau_i)}{\sum_j p(D \mid \tau_j) \, p(\tau_j)}\]

where the sum in the denominator of the ratio is over all trees \(\tau_j\), \(p(D \mid \tau_j)\) is the marginal likelihood over continuous parameters

\[p(D \mid \tau_j) = \int_\theta p(D \mid \tau_j, \theta) \, p(\theta),\]

and \(p(\tau_j)\) is the prior on tree topology \(\tau_j\).

Thus we can approximate the per-topology posterior distribution \(p(\tau_i \mid D)\) given

some systematic way of identifying a collection of “good” trees \(\tau_j\) that contain most of the posterior probability weight in the denominator of the ratio
some efficient way of estimating the marginal likelihood \(p(D \mid \tau)\).

The question is, then, can we obtain these two ingredients?

Ingredient 1: efficiently finding a “good” set of trees

Current maximum-likelihood and Bayesian phylogenetic algorithms are opposites in terms of objective and method. Maximum-likelihood algorithms systematically zoom up to the top of the likelihood surface with no regard for trees that serve as nearly-as-good explanations of the data. Bayesian algorithms explore tree space randomly, wasting effort by returning to the same trees many times, but given enough time do a good job of exploring the whole posterior region.

We decided to combine the systematic search of ML algorithms with the Bayesian objective, such that we would systematically find all of the “good” trees. To do so, we did the tree rearrangements that one usually does with these algorithms, but keeping track of all of the trees that were above some likelihood threshold rather that just allowing rearrangements that result in an improvement of likelihood.

Note that I said “likelihood” and not “posterior” here. In fact, by “likelihood” I mean the likelihood of the tree with the maximum-likelihood assignment of branch lengths (and other model parameters). One of the surprising results of our work is that this likelihood acts as a surprisingly good proxy for the posterior when looking for this “good” set of trees.

We find that this strategy works reasonably well if one uses a collection of starting points obtained by running RAxML starting at several hundred random trees. These starting points appear to cover all of the local peaks one finds in the posterior distribution. We implemented our algorithm using the libpll library from the Stamatakis group, which was a pleasant foundation. We added a multithreading strategy that allowed different workers to spread out across tree space. We report these results in Systematic Exploration of the High Likelihood Set of Phylogenetic Tree Topologies, with Whidden, Claywell, Fisher, Magee, and Fourment.

Ingredient 2: efficiently estimating the per-tree marginal likelihood

The other component we needed was a way to evaluate the per-tree marginal likelihood \(p(D \mid \tau)\). Please note that we are doing this marginal likelihood estimation with a single fixed tree topology at a time, which is in contrast to many applications of phylogenetic marginal likelihood estimation in which one is comparing one evolutionary model to another while integrating out the tree topology as well.

Because marginal likelihood in this formulation is a problem with continuous parameters only, there are many existing methods for estimating it. We also developed a few of our own specifically for this application, giving 19 methods in total. This immediately brought to mind the classic 1978 paper by Moler & Van Loan: Nineteen Dubious Ways to Compute the Exponential of a Matrix. Accordingly, we named our paper 19 Dubious Ways to Compute the Marginal Likelihood of a Phylogenetic Tree Topology with Fourment, Magee, Whidden, Bilge, and Minin.

We were surprised to find that some fast methods could be quite accurate, and some slow methods showed rather poor accuracy. One of the star algorithms here was a “Gamma Laplus” method devised by the group that used Laplace-like approximation to fit a Gamma distribution, which can then be used directly to compute a marginal likelihood. One can then boost the accuracy with a relatively small computational cost by adding an importance sampling step.

So, does it work?

Yes, combining these two strategies does work as a proof of principle. There are some caveats, though. The first caveat is that we used the Jukes-Cantor model, so we didn’t have to marginalize out model parameters other than branch lengths. This seems tractable: it would take another papers-worth of work with more interesting variational parameterizations, but I think we could deal with substitution-model and rate-variation parameters.

The second caveat is a more inherent issue for any method that attempts to individually explore every “good” tree. Sometimes tree posteriors are just really diffuse! For example, there are some data sets for which our extremely long “Golden” MrBayes runs never once sampled the same topology twice.

In fact, thinking about what to do with these very diffuse posterior distributions is what led us down the road of thinking more about density estimation on the set of phylogenetic trees, which then in turn led us to investigate full variational Bayes phylogenetic inference, which I’ll write about in an upcoming post.

This was a big and complex project that required a lot of hard work to pull off. For the first paper, I’d like to highlight the stamina of Chris Whidden and the programming prowess of Brian Claywell, as well as thank Thayer Fisher for starting off the project as a summer undergraduate project. For the second paper, I’d like to thank Mathieu Fourment, who implemented every one of the 19 methods with almost unbelievable gumption and skill, Andy Magee, who is the unsung hero of the project for contributing implementations and analysis, and Arman Bilge, who did important work early on for the Laplace-type methods. Vladimir was as usual a wonderful collaborator.