- 19 Apr 2018 » Predicting B cell receptor substitution profiles using public repertoire data
- 10 Jan 2018 » Postdoc opening to learn about antibody development during HIV superinfection
- 01 Dec 2017 » Per-sample immunoglobulin germline inference from B cell receptor deep sequencing data
- 14 Nov 2017 » Survival analysis of DNA mutation motifs with penalized proportional hazards
- 05 Sep 2017 » Using genotype abundance to improve phylogenetic inference
- 26 Jun 2017 » Probabilistic Path Hamiltonian Monte Carlo
- 07 Jun 2017 » Smart proposals win for online phylogenetics using sequential Monte Carlo.
- 26 Oct 2016 » Incorporating new sequences into posterior distributions using SMC

Can we predict how sites of an antibody will tolerate amino acid substitutions? Kristian Davidsen posed this question shortly after he arrived in my group, pointing out that being able to make such predictions would be quite useful. For example, engineered antibodies sometimes aggregate into clumps or have other properties that make them useless for mass production. If we could figure out ways to change the amino acid sequence of an antibody without changing binding properties, that could help us avoid aggregation and make a more useful antibody.

How do we start to address this complex and high-dimensional question? Although people have started to do deep mutational scanning on antibodies, this type of data is hard to come by. On the other hand, B cell repertoire (i.e. antibody-coding) sequence data is becoming plentiful. B cells undergo affinity maturation to improve binding in collections of sequences called “clonal families” grouped by naive ancestor sequence (more...
*(full post)*

Please see https://b-t.cr/t/506 for details.

Every B cell receptor sequence in a repertoire came from a V(D)J recombination of germline genes. Each individual has only certain alleles of these genes in their germline, and knowing this set improves the accuracy of all aspects of BCR sequence analysis, from alignment to phylogenetic ancestral sequence reconstruction. This germline allele set can be estimated directly from BCR sequence data, and it’s time to treat such estimation as part of standard BCR sequence analysis pipelines.

This central message is not new, but it’s worth emphasizing because doing germline set inference is not part of most current studies of B cell receptor (BCR) sequences.

Indeed, the most common way to annotate sequences is to align them one by one to the full set of alleles present in the IMGT database, which has hundreds of alleles. Each individual has only a fraction of these alleles in their genome.

Unsurprisingly, aligning sequences one by one to the...
*(full post)*

We are equipped with purpose-built molecular machinery to mutate our genome so that we can become immune to pathogens. This is truly a thing of wonder.

More specifically, I’m talking about mutations in B cells, the cells that make antibodies. Once a randomly-generated antibody expressed on the outside of the B cell finds something it’s good at binding, the cell boosts the mutation rate of its antibody-coding region by about a million-fold. Cells whose mutated antibodies bind better are rewarded by stimulation to divide further. The result of this Darwinian mutation and selection process is antibodies with improved binding properties.

The mutation process is wonderfully complex and interesting. Being statisticians, we paid the highest tribute we can to a process we think is beautiful: we developed a statistical model of it. This work was led by the dynamic duo of Jean Feng and David Shaw, while Vladimir Minin, Noah Simon...
*(full post)*

When doing computational biology, listen to biologists. I have found them to have remarkable intuition; this can be a gold mine of opportunity for us computational types.

In this particular case, the starting point was the stunningly beautiful work of Gabriel Victora’s lab visualizing germinal center dynamics in living mice. For those not yet initiated into the beauty of B cell repertoire, germinal centers are crucibles of evolution, in which B cells compete in an antigen-binding contest such that the best binder reproduces more. As part of the Victora lab work, they did single-cell extraction and sequencing, which enabled them to quantify the frequency of each B cell genotype without PCR bias or other artifacts. Such single-cell sequencing, and consequent abundance information, is now becoming commonplace. *How should we use this abundance information in phylogenetics?*

Well, the Victora lab knew, even if their algorithm implementation is not one we would have considered. Indeed, they...
*(full post)*

Hamiltonian Monte Carlo (HMC) is an exciting approach for sampling Bayesian posterior distributions. HMC is distinguished from typical MCMC because proposals are derived such that acceptance probability is very high, even though proposals can be distant from the starting state. These lovely proposals come from approximating the path of a particle moving without friction on the posterior surface.

Indeed, the situation is analogous to a bar of soap sliding around a bathtub, where in this case the bathtub is the posterior surface flipped upside down. When the soap hits an incline, i.e. a region of bad posterior, it slows down and heads back in another direction. We give the soap some random momentum to start with, let it slide around for some period, and where it is at the end of this period is our proposal. Calculating these proposals requires integrating out the soap dynamics, which is done by numerically integrating physics equations (hence the...
*(full post)*
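
To make the soap-in-a-bathtub picture concrete, here is a minimal sketch of ordinary HMC on a toy one-dimensional target (a standard normal). It is not the probabilistic path variant from the post, which works on phylogenetic tree space; the target, step size, number of steps, and all function names are assumptions chosen for illustration. The "physics equations" are integrated with the standard leapfrog scheme.

```python
import math
import random

def neg_log_post(q):
    """Potential energy: negative log density of a toy standard normal target."""
    return 0.5 * q * q

def grad_neg_log_post(q):
    """Gradient of the potential energy with respect to position q."""
    return q

def leapfrog(q, p, step_size, n_steps):
    """Numerically integrate Hamiltonian dynamics with the leapfrog scheme."""
    p -= 0.5 * step_size * grad_neg_log_post(q)   # initial half step for momentum
    for i in range(n_steps):
        q += step_size * p                        # full step for position
        if i < n_steps - 1:
            p -= step_size * grad_neg_log_post(q) # full step for momentum
    p -= 0.5 * step_size * grad_neg_log_post(q)   # final half step for momentum
    return q, p

def hmc_step(q, rng, step_size=0.1, n_steps=20):
    """One HMC transition: random momentum, slide, then accept or reject."""
    p0 = rng.gauss(0.0, 1.0)                      # give the soap a random push
    q_new, p_new = leapfrog(q, p0, step_size, n_steps)
    h_old = neg_log_post(q) + 0.5 * p0 * p0       # total energy before
    h_new = neg_log_post(q_new) + 0.5 * p_new * p_new  # total energy after
    # Metropolis correction for the numerical integration error.
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return q_new, True
    return q, False

def run_hmc(n_samples=2000, seed=0):
    """Run the sampler and report the samples and the acceptance rate."""
    rng = random.Random(seed)
    q = 0.0
    samples, accepted = [], 0
    for _ in range(n_samples):
        q, ok = hmc_step(q, rng)
        samples.append(q)
        accepted += ok
    return samples, accepted / n_samples
```

Because leapfrog nearly conserves total energy, the accept/reject step rarely rejects, even though each proposal can land far from where it started. Calling `run_hmc()` returns the samples along with the fraction of proposals accepted.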

Sometimes projects take years to bear fruit.

As I described previously, Aaron Darling and I have been thinking for a good while about using Sequential Monte Carlo (SMC) for *online* Bayesian phylogenetic inference, in which an existing posterior on trees can be updated with additional sequences. In fact, we had a good enough proof-of-concept implementation in 2013 to give a talk at the Evolution meeting. The other talk from my group that year was about a surrogate function for likelihoods parameterized by a single branch length, which we called “lcfit”. These two projects have just recently resulted in intertwined submitted papers.

The SMC implementation, which we called `sts` for Sequential Tree Sampler, lay dormant for a while after Connor McCoy left the group despite a few efforts to rekindle it. One of the things that kept us from wrapping it up was problems with *particle degeneracy*, which is as follows. I think of...
*(full post)*

The Bayesian approach is a beautiful means of statistical inference in general, and phylogenetic inference in particular. In addition to getting posterior distributions on trees and mutation model parameters, the Bayesian approach has been used to get posteriors on complex model parameters such as geographic location and ancestral population sizes of viruses, among other applications.

The bummer about Bayesian computation? It takes so darn long for those chains to converge. And what’s worse in this age of in-situ sequencing of viruses for rapidly unfolding epidemics? *If you get a new sequence you have to start over from scratch.*

I’ve been thinking about this for several years with Aaron Darling, and in particular about Sequential Monte Carlo (SMC) for this application. SMC, also called particle filtering, is a way to get a posterior estimate with a collection of “particles” in a sequential process of reproduction and selection. You can think about it as a probabilistically...
*(full post)*
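
As a rough illustration of the "reproduction and selection" idea, here is a toy resample-move SMC sketch in which the posterior on a single scalar (the mean of a unit-variance normal) is updated as each new observation arrives, without starting over. This is not `sts` and involves no trees: the model, the prior, the Metropolis-Hastings move, and every name below are assumptions made for the example. In the phylogenetic setting the particles would be trees and the new data would be a sequence.

```python
import math
import random

def log_lik(mu, y):
    """Log likelihood (up to a constant) of observation y under Normal(mu, 1)."""
    return -0.5 * (y - mu) ** 2

def log_prior(mu, prior_sd=2.0):
    """Log density (up to a constant) of the assumed Normal(0, prior_sd^2) prior."""
    return -0.5 * (mu / prior_sd) ** 2

def resample(particles, log_weights, rng):
    """Selection: draw particles with replacement in proportion to their weights."""
    m = max(log_weights)
    weights = [math.exp(lw - m) for lw in log_weights]  # stabilised weights
    return rng.choices(particles, weights=weights, k=len(particles))

def mh_move(mu, seen, rng, scale=0.3):
    """One Metropolis-Hastings move targeting the posterior given all data seen."""
    def log_post(x):
        return log_prior(x) + sum(log_lik(x, y) for y in seen)
    proposal = mu + rng.gauss(0.0, scale)
    diff = log_post(proposal) - log_post(mu)
    if rng.random() < math.exp(min(0.0, diff)):
        return proposal
    return mu

def smc_update(particles, seen, y_new, rng):
    """Fold one new observation into the existing particle approximation."""
    log_weights = [log_lik(mu, y_new) for mu in particles]  # reweight by new data
    survivors = resample(particles, log_weights, rng)       # select and reproduce
    seen = seen + [y_new]
    # Moving the survivors restores diversity lost to duplication.
    return [mh_move(mu, seen, rng) for mu in survivors], seen

def run_smc(data, n_particles=500, seed=1):
    """Start from prior draws and absorb the observations one at a time."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 2.0) for _ in range(n_particles)]
    seen = []
    for y in data:
        particles, seen = smc_update(particles, seen, y, rng)
    return particles
```

For this conjugate toy model the exact posterior is known, so the particle mean from `run_smc([1.2, 0.8, 1.0, 1.1, 0.9])` can be checked against the analytic posterior mean of 5.0 / 5.25.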

all posts