- 01 Dec 2017 » Per-sample immunoglobulin germline inference from B cell receptor deep sequencing data
- 14 Nov 2017 » Survival analysis of DNA mutation motifs with penalized proportional hazards
- 05 Sep 2017 » Using genotype abundance to improve phylogenetic inference
- 26 Jun 2017 » Probabilistic Path Hamiltonian Monte Carlo
- 07 Jun 2017 » Smart proposals win for online phylogenetics using sequential Monte Carlo.
- 26 Oct 2016 » Incorporating new sequences into posterior distributions using SMC
- 22 Jul 2016 » Summer high school and undergraduate students 2016
- 11 Jul 2016 » Analysis of a slightly gentler discretization of time-trees

Every B cell receptor (BCR) sequence in a repertoire came from a V(D)J recombination of germline genes. Each individual has only certain alleles of these genes in their germline, and knowing this set improves the accuracy of all aspects of BCR sequence analysis, from alignment to phylogenetic ancestral sequence reconstruction. This germline allele set can be estimated directly from BCR sequence data, and it’s time to treat such estimation as part of standard BCR sequence analysis pipelines.

This central message is not new, but it’s worth emphasizing because germline set inference is not part of most current studies of BCR sequences.

Indeed, the most common way to annotate sequences is to align them one by one to the full set of alleles present in the IMGT database, which has hundreds of alleles. Each individual has only a fraction of these alleles in their genome.

Unsurprisingly, aligning sequences one by one to the...
*(full post)*

We are equipped with purpose-built molecular machinery to mutate our genome so that we can become immune to pathogens. This is truly a thing of wonder.

More specifically, I’m talking about mutations in B cells, the cells that make antibodies. Once a randomly-generated antibody expressed on the outside of the B cell finds something it’s good at binding, the cell boosts the mutation rate of its antibody-coding region by about a million-fold. Cells whose antibodies bind better are rewarded by stimulation to divide further. The result of this Darwinian mutation and selection process is antibodies with improved binding properties.

The mutation process is wonderfully complex and interesting. Being statisticians, we paid the highest tribute we can to a process we think is beautiful: we developed a statistical model of it. This work was led by the dynamic duo of Jean Feng and David Shaw, while Vladimir Minin, Noah Simon...
*(full post)*

When doing computational biology, listen to biologists. I have found them to have remarkable intuition; this can be a gold mine of opportunity for us computational types.

In this particular case, the starting point was the stunningly beautiful work of Gabriel Victora’s lab visualizing germinal center dynamics in living mice. For those not yet initiated into the beauty of B cell repertoires, germinal centers are crucibles of evolution, in which B cells compete in an antigen-binding contest such that the best binders reproduce more. As part of the Victora lab work, they did single-cell extraction and sequencing, which enabled them to quantify the frequency of each B cell genotype without PCR bias or other artifacts. Such single-cell sequencing, and consequent abundance information, is now becoming commonplace. *How should we use this abundance information in phylogenetics?*

Well, the Victora lab knew, even if their algorithm implementation is not one we would have considered. Indeed, they...
*(full post)*

Hamiltonian Monte Carlo (HMC) is an exciting approach for sampling Bayesian posterior distributions. HMC is distinguished from typical MCMC because proposals are derived such that acceptance probability is very high, even though proposals can be distant from the starting state. These lovely proposals come from approximating the path of a particle moving without friction on the posterior surface.

Indeed, the situation is analogous to a bar of soap sliding around a bathtub, where in this case the bathtub is the posterior surface flipped upside down. When the soap hits an incline, i.e. a region of bad posterior, it slows down and heads back in another direction. We give the soap some random momentum to start with, let it slide around for some period, and where it is at the end of this period is our proposal. Calculating these proposals requires integrating out the soap dynamics, which is done by numerically integrating physics equations (hence the...
*(full post)*
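
To make the soap picture concrete, here is a minimal sketch of a single HMC update in plain Python. It is not the method from the post: the target is a one-dimensional standard normal chosen purely for illustration, the integrator is the standard leapfrog scheme, and all names (`neg_log_post`, `leapfrog`, `hmc_step`, `run_chain`) and default settings are assumptions made for this example.

```python
import math
import random


def neg_log_post(q):
    """Potential energy: negative log density of a standard normal (toy target)."""
    return 0.5 * q * q


def grad_neg_log_post(q):
    """Gradient of the potential energy, i.e. the slope of the bathtub wall."""
    return q


def leapfrog(q, p, step_size, n_steps):
    """Approximate frictionless sliding with the leapfrog integrator."""
    p -= 0.5 * step_size * grad_neg_log_post(q)  # initial half step for momentum
    for i in range(n_steps):
        q += step_size * p  # full step for position
        if i < n_steps - 1:
            p -= step_size * grad_neg_log_post(q)  # full step for momentum
    p -= 0.5 * step_size * grad_neg_log_post(q)  # final half step for momentum
    return q, p


def hmc_step(q, rng, step_size=0.1, n_steps=20):
    """Give the soap a random push, let it slide, then accept or reject."""
    p0 = rng.gauss(0.0, 1.0)  # random starting momentum
    q_new, p_new = leapfrog(q, p0, step_size, n_steps)
    h_old = neg_log_post(q) + 0.5 * p0 * p0
    h_new = neg_log_post(q_new) + 0.5 * p_new * p_new
    # Metropolis correction for the numerical integration error.
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new
    return q


def run_chain(n_samples=5000, seed=0):
    """Run a short chain starting from q = 0 and return the visited states."""
    rng = random.Random(seed)
    q = 0.0
    samples = []
    for _ in range(n_samples):
        q = hmc_step(q, rng)
        samples.append(q)
    return samples
```

Because leapfrog nearly conserves total energy, `h_new` stays close to `h_old` and the proposal is accepted with high probability even when it lands far from the starting state.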

Sometimes projects take years to bear fruit.

As I described previously, Aaron Darling and I have been thinking for a good while about using Sequential Monte Carlo (SMC) for *online* Bayesian phylogenetic inference, in which an existing posterior on trees can be updated with additional sequences. In fact, we had a good enough proof-of-concept implementation in 2013 to give a talk at the Evolution meeting. The other talk from my group that year was about a surrogate function for likelihoods parameterized by a single branch length, which we called “lcfit”. These two projects have just recently resulted in intertwined submitted papers.

The SMC implementation, which we called `sts` for Sequential Tree Sampler, lay dormant for a while after Connor McCoy left the group despite a few efforts to rekindle it. One of the things that kept us from wrapping it up was problems with *particle degeneracy*, which is as follows. I think of...
*(full post)*

The Bayesian approach is a beautiful means of statistical inference in general, and phylogenetic inference in particular. In addition to getting posterior distributions on trees and mutation model parameters, the Bayesian approach has been used to get posteriors on complex model parameters such as geographic location and ancestral population sizes of viruses, among other applications.

The bummer about Bayesian computation? It takes so darn long for those chains to converge. And what’s worse in this age of in-situ sequencing of viruses for rapidly unfolding epidemics? *If you get a new sequence you have to start over from scratch.*

I’ve been thinking about this for several years with Aaron Darling, and in particular about Sequential Monte Carlo (SMC) for this application. SMC, also called particle filtering, is a way to get a posterior estimate with a collection of “particles” in a sequential process of reproduction and selection. You can think about it as a probabilistically...
*(full post)*
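
As a rough illustration of the reweight-and-resample idea, here is a toy particle filter in plain Python. It has nothing phylogenetic in it: the "particles" are candidate values of a coin's heads probability, each new coin flip plays the role of a newly arrived sequence, and the function names and the `ess_fraction` threshold are assumptions for this sketch. It also omits the proposal or move step that a real sampler would need in order to explore new states.

```python
import random


def effective_sample_size(weights):
    """Effective number of particles carrying the weight; low values signal degeneracy."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def smc_update(particles, weights, observation, rng, ess_fraction=0.5):
    """Fold one new observation (1 = heads, 0 = tails) into the particle set."""
    # Selection: reweight each particle by the likelihood of the new observation.
    new_weights = [
        w * (theta if observation == 1 else 1.0 - theta)
        for theta, w in zip(particles, weights)
    ]
    total = sum(new_weights)
    new_weights = [w / total for w in new_weights]
    # Reproduction: if too few particles carry the weight, resample in
    # proportion to weight and reset to uniform weights.
    if effective_sample_size(new_weights) < ess_fraction * len(particles):
        particles = rng.choices(particles, weights=new_weights, k=len(particles))
        new_weights = [1.0 / len(particles)] * len(particles)
    return particles, new_weights


def posterior_mean(particles, weights):
    """Weighted average of the particles."""
    return sum(p * w for p, w in zip(particles, weights)) / sum(weights)


def run_filter(observations, n_particles=2000, seed=0):
    """Start from a uniform prior and update once per observation."""
    rng = random.Random(seed)
    particles = [rng.random() for _ in range(n_particles)]
    weights = [1.0 / n_particles] * n_particles
    for obs in observations:
        particles, weights = smc_update(particles, weights, obs, rng)
    return particles, weights
```

The point of the sketch is that each new observation only requires an update of the existing particles rather than a restart: calling `smc_update` once more on the output of `run_filter` incorporates one additional data point.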

I definitely didn’t set out to have three high school and two undergrad students this summer.

But they’re fantastic, and making real contributions to our scientific work! Anna and Apurva (left and right) have written a new C++ front-end to the essential Smith-Waterman pre-alignment step for Duncan’s partis software. Andrew and Lola (center) are investigating the traces of maximum-likelihood phylogenetic inference software packages. Thayer (back left) is writing a multi-threaded C++ program to systematically search for all of the trees above a given likelihood cutoff. All of them are learning about science and coding.

These students rock, and I can’t wait to see what great things they bring into the world with their talent.

Inferring a good phylogenetic tree topology, i.e. a tree without branch lengths, is the primary challenge for efficient tree inference. As such, we and others think a lot about how algorithms move between topologies, typically formalizing this information as a path through a graph whose vertices are tree topologies and whose edges are moves from one tree to another. Removing all branch length information makes sense because algorithms are formulated in terms of these topologies: for classical unrooted tree inference, the set of trees that are tried from a specific tree is not determined by the branch lengths of the current tree.

But what about time-trees? Time-trees are rooted phylogenetic trees such that every event is given a time: each internal node is given a divergence time, and each leaf node is given a sampling time. These are absolute times of the sort one could put on a calendar. To first approximation, working with time-trees is...
*(full post)*
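
For readers who like to see definitions as data structures, the description of a time-tree above can be sketched as follows. This is only an illustration: the `Node` class, the use of decimal years as calendar times, and the helper names are assumptions for the example, not anything from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A node in a rooted tree with an absolute (calendar) time attached."""
    name: str
    time: float  # assumed here to be a decimal year, e.g. 2014.5
    children: List["Node"] = field(default_factory=list)


def is_valid_time_tree(node: Node) -> bool:
    """Check that every divergence happens strictly before its descendants."""
    return all(
        node.time < child.time and is_valid_time_tree(child)
        for child in node.children
    )


def branch_durations(node: Node) -> Dict[Tuple[str, str], float]:
    """Elapsed time along each (parent, child) branch."""
    durations = {}
    for child in node.children:
        durations[(node.name, child.name)] = child.time - node.time
        durations.update(branch_durations(child))
    return durations


# A hypothetical three-leaf example: internal nodes carry divergence times,
# leaves carry sampling times.
example_tree = Node("root", 2010.0, [
    Node("anc", 2012.5, [Node("A", 2014.0), Node("B", 2015.25)]),
    Node("C", 2013.0),
])
```

Unlike the unrooted case, the node times here constrain one another, since a parent can never be younger than its child.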
