- 25 Nov 2015 » New results on the subtree prune regraft distance for unrooted trees
- 01 Sep 2015 » Postdoctoral position to study molecular evolution and phylogenetics of immune cells
- 27 Aug 2015 » High school students 2015
- 22 Jul 2015 » Tanglegrams!
- 15 Jul 2015 » New paper on the shape of the phylogenetic likelihood function
- 02 Apr 2015 » First paper on the curvature of tree space
- 23 Mar 2015 » New paper on annotation of BCR sequences
- 21 Mar 2015 » Welcome Vu Dinh!

In our previous work, Chris Whidden and I have been working to understand properties of phylogenetic Markov chain Monte Carlo (MCMC) by learning about the graph formed by all phylogenetic trees (as vertices) and tree rearrangements (as edges). These tree rearrangements are ways of modifying one tree to get another. For example, here we have a picture of modifying a tree via so-called unrooted SPR.

It’s natural to ask for the shortest path in this graph between two vertices, i.e. the number of unrooted SPR moves required to modify one tree to make another. The number of such moves is called the unrooted SPR (uSPR) distance. It turns out that the problem is hard in general but fixed parameter tractable, meaning that the complexity isn’t too bad for pairs of trees that aren’t too different from one another. The current best algorithm for unrooted SPR distance cannot compute distances larger than 7, or reliably compare...
*(full post)*

*Although we have recently made a hire in this area, we continue to look for good people to work on this and related projects.*

Our adaptive immune systems continually update themselves to neutralize and destroy pathogens. The receptor sequences of antibody-making B cells undergo a Darwinian process of mutation and selection which improves their binding to antigen.^{1} It is now possible to sequence these B cell receptors (BCRs) in high throughput, giving a profound new perspective on how the immune system responds to infection.^{2} Although the elements of B cell affinity maturation are the same as molecular evolution in other settings, being based on recombination, point mutation, and selection, there are many important differences. These differences, along with the volume of sequence data available, bring new challenges for phylogenetics and molecular evolution.

The translational medical consequences of improved methods will be significant. Improved methods will especially help in understanding the...
*(full post)*

This summer we had two high school students, Andrew and Kate. They were quite sharp, so we threw them in the deep end with some real science projects. Andrew taught himself shell programming and Docker, and learned enough B cell analysis to work on making Bioboxes to be used for validation of B cell sequence analysis software. Kate taught herself shell programming and Python, and learned some undergraduate abstract algebra in order to study the SPR graph by characterizing its distribution of pairwise distances. They were both very independent, and a pleasure to have in the office!

Say we care about a function on pairs of trees (such as subtree-prune-regraft distance) that doesn’t make reference to the labels as such, but simply uses them as markers to ensure that the leaves end up in the right place. We’d like to calculate this function for all trees of a certain type. However, doing so for *every* pair of labeled trees is a waste, because if we just relabel the two trees in the same way, we will get the same result.

So, how many computations do we actually need to do?

It turns out that we only need to do one calculation per *tanglegram*. A tanglegram is a pair of trees along with a bijection between the leaves of those trees. Tanglegrams have been investigated in coevolutionary analyses before, and there’s a considerable literature concerning how to draw them in the plane with the minimal number of crossings. However, the symmetries and number of...
*(full post)*
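To make the relabeling redundancy in the tanglegram excerpt concrete, here is a minimal brute-force sketch in Python. It assumes rooted binary trees (the excerpt does not say which tree type is meant), represents each tree as nested `frozenset`s, and counts how many ordered pairs of labeled trees remain distinct once pairs that differ only by relabeling both trees in the same way are merged. The function names are hypothetical, and this is only an illustration for tiny leaf counts, not the method from the full post.

```python
from itertools import combinations, permutations


def rooted_binary_trees(leaves):
    """Return all rooted binary trees on `leaves` as nested frozensets."""
    leaves = tuple(leaves)
    if len(leaves) == 1:
        return [leaves[0]]
    first, rest = leaves[0], leaves[1:]
    trees = []
    # Put `first` on one side of the root split so each split is made once.
    for k in range(len(rest)):  # k < len(rest) keeps the other side non-empty
        for companions in combinations(rest, k):
            left = (first,) + companions
            right = tuple(x for x in rest if x not in companions)
            for l_tree in rooted_binary_trees(left):
                for r_tree in rooted_binary_trees(right):
                    trees.append(frozenset([l_tree, r_tree]))
    return trees


def relabel(tree, mapping):
    """Apply a leaf relabeling to a nested-frozenset tree."""
    if not isinstance(tree, frozenset):
        return mapping[tree]
    return frozenset(relabel(child, mapping) for child in tree)


def count_tanglegrams(n):
    """Return (number of labeled tree pairs, number of relabeling classes)."""
    labels = range(n)
    trees = rooted_binary_trees(labels)
    perms = [dict(zip(labels, p)) for p in permutations(labels)]
    seen = set()
    classes = 0
    for t1 in trees:
        for t2 in trees:
            if (t1, t2) in seen:
                continue
            classes += 1
            # Mark every pair reachable by relabeling both trees identically.
            for mapping in perms:
                seen.add((relabel(t1, mapping), relabel(t2, mapping)))
    return len(trees) ** 2, classes
```

Under these assumptions, `count_tanglegrams(3)` returns `(9, 2)` and `count_tanglegrams(4)` returns `(225, 13)`: with four leaves, 225 labeled pairs collapse to 13 classes, so a label-independent function only needs 13 evaluations rather than 225.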

Imagine we have a tree, sequence data for the leaves of that tree, and some fixed mutation rate matrix. Then we fix all of the branch lengths of that tree except for one. The likelihood function restricted to that branch gives a function from the positive real numbers to the unit interval. Question: what is the shape of that function?

I asked Vu this question when he arrived. As described in our new paper on arXiv, the answer is rather interesting, and more complex than I would have thought. Vu did a fantastic job with this project, taking (surprisingly to me) an algebraic approach, defining the *characteristic polynomial* of a likelihood function, defining an algebraic structure on *conditional frequency patterns*, then using a result about path-connected subgroups.

To summarize, if the model is quite simple (JC, F81), then the likelihood has a single maximum. However, more complex models such as K2P can take on arbitrarily...
*(full post)*
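As a small illustration of the one-branch likelihood described in this excerpt, the sketch below takes the simplest case I can think of: a tree with just two leaves joined by a single branch of length `t`, under the Jukes-Cantor (JC) model with a uniform stationary distribution. These choices, the function names, and the example counts are assumptions for illustration and are not taken from the paper. The code works with the log-likelihood, which has the same maxima as the likelihood itself.

```python
import math


def jc_two_taxon_log_likelihood(t, n_sites, n_diff):
    """Log-likelihood of branch length t for two aligned sequences under JC.

    n_sites is the alignment length and n_diff the number of differing sites.
    """
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay   # P(same base at both ends of the branch)
    p_diff = 0.25 - 0.25 * decay   # P(one specific different base)
    # Each site contributes stationary frequency 1/4 times a transition term.
    return ((n_sites - n_diff) * math.log(0.25 * p_same)
            + n_diff * math.log(0.25 * p_diff))


def jc_mle(n_sites, n_diff):
    """Closed-form JC maximum likelihood distance (needs n_diff/n_sites < 3/4)."""
    p = n_diff / n_sites
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def count_local_maxima(values):
    """Count strict interior local maxima in a sequence of numbers."""
    return sum(
        1
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] > values[i + 1]
    )


# Evaluate the curve on a grid of branch lengths for hypothetical data:
# 100 sites, 10 of which differ.
grid = [0.001 * i for i in range(1, 2001)]
curve = [jc_two_taxon_log_likelihood(t, 100, 10) for t in grid]
best_t = grid[max(range(len(curve)), key=curve.__getitem__)]
```

On this grid the curve has a single interior peak, and `best_t` sits next to the closed-form value from `jc_mle(100, 10)`, which is consistent with the single-maximum statement for JC above. The sketch says nothing about the more complex models mentioned in the excerpt.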

Imagine a graph with vertices representing the trees on a given number of taxa, and edges connecting pairs of trees that can be transformed into one another (see below for the “rSPR” example). All popular likelihood-based tree inference algorithms perform some traversal of this graph: Bayesian algorithms, for example, perform Markov chain Monte Carlo (MCMC) on it. In our recent Sys Bio paper, Chris Whidden and I demonstrated graph effects on phylogenetic MCMC: that the graph structure combined with the likelihood function led to bottlenecks in tree space where it was difficult to move from one peak of good trees to another.

These results motivated us to learn more about these graph structures. Consider the best-studied of these graphs, the rSPR graph, in which vertices represent *rooted* trees, and edges are rooted subtree-prune-regraft operations, in which a rooted subtree is cut off of a tree and then reattached somewhere else in the tree with the same...
*(full post)*
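The graph picture in this excerpt can be sketched directly for very small trees. The Python below is a toy, under my own assumptions: trees are rooted and binary, stored as nested `frozenset`s; a rooted SPR move prunes any proper subtree and regrafts it on any edge of what remains, including the edge above the root; and distance is found by plain breadth-first search. All function names are hypothetical. This brute-force search is not the algorithm used in the papers and is only feasible for a handful of leaves.

```python
from collections import deque


def subtrees(tree):
    """Yield every node (leaf or clade) of the tree, the root included."""
    yield tree
    if isinstance(tree, frozenset):
        for child in tree:
            yield from subtrees(child)


def prune(tree, sub):
    """Remove the proper subtree `sub` and suppress its former parent."""
    a, b = tuple(tree)
    if a == sub:
        return b
    if b == sub:
        return a
    if sub in set(subtrees(a)):
        return frozenset([prune(a, sub), b])
    return frozenset([a, prune(b, sub)])


def regraft(tree, sub, target):
    """Attach `sub` on the edge directly above the node `target`."""
    if tree == target:
        return frozenset([tree, sub])
    if not isinstance(tree, frozenset):
        return tree
    return frozenset(regraft(child, sub, target) for child in tree)


def rspr_neighbors(tree):
    """Return all trees one rooted SPR move away from `tree`."""
    result = set()
    for sub in subtrees(tree):
        if sub == tree:
            continue  # the whole tree cannot be pruned from itself
        rest = prune(tree, sub)
        for target in subtrees(rest):
            candidate = regraft(rest, sub, target)
            if candidate != tree:
                result.add(candidate)
    return result


def rspr_distance(start, goal):
    """Shortest number of rooted SPR moves from `start` to `goal` via BFS."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        tree, dist = queue.popleft()
        if tree == goal:
            return dist
        for neighbor in rspr_neighbors(tree):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None
```

For example, with `f = frozenset`, the caterpillar `f([f([f([0, 1]), 2]), 3])` and the balanced tree `f([f([0, 1]), f([2, 3])])` are one move apart: prune leaf `2` and regraft it on the edge above leaf `3`.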

The antigen binding properties of antibodies are determined by the sequences of their corresponding B cell receptors (BCRs). These BCR sequences are created in “draft” form by VDJ recombination, which randomly selects and trims the ends of V, D, and J genes, then joins them together with additional random nucleotides. If they pass initial screening and bind an antigen, these sequences then undergo an evolutionary process of mutation and selection, “revising” the BCR to improve binding to its cognate antigen.

Our first paper on BCRs concerned natural selection as part of the “revision” process, and when Duncan joined the group we got to work on the “drafting” part. Specifically, the first step was to work on the *annotation problem*: given a BCR sequence, which nucleotides came from which genes or non-templated insertions? We recently posted a paper on arXiv describing our approach. Like previous work, we use a hidden Markov model (HMM) for this...
*(full post)*

Vu Dinh joined as a postdoc in our group at the new year. Vu came to us from Purdue, where he got his PhD working in statistics and computational biology. He is especially interested in machine learning theory.

Vu has already been highly productive during his short time here. He has proven some theoretical results describing the shape of the phylogenetic likelihood function, and used these to prove convergence of our likelihood function approximation scheme. He has also made significant progress on effective sample size guarantees for online phylogenetic SMC. Stay tuned for upcoming arXiv submissions, and for other nice work from Vu!
