(Hopefully) final version of Kantorovich-Rubinstein paper on arXiv

Steve Evans and I just finished a revision of our paper about how to compare probability distributions on a phylogenetic tree. (Note that the tree here acts as a metric space, and the probability distributions sit on top of that.) In it, we show that the Kantorovich-Rubinstein metric (KR, a.k.a. earth mover’s distance) has a closed form solution on a phylogenetic tree, and that the commonly-used randomization procedure to attach a p-value to an observed distance has a central limit theorem approximation as a Gaussian process. We also generalize KR to an L^p version. The p=2 case is especially nice: it has an ANOVA-type description and the Gaussian approximation ends up being a computable sum of chi squared random variables. We also prove the existence of and describe an algorithm for finding the center of mass of a probability distribution on a tree.

Now, why would one want to compare probability distributions on a tree? For the microbial ecology crowd out there, we have shown that the “weighted UniFrac” distance developed by Lozupone and Knight is actually a special case of this more general framework (with roots going back to the 18th century!). For those who don’t know this lingo, once we have mapped a collection of DNA sequences sampled from the environment to a phylogenetic tree using pplacer or something similar, we can think of those reads as a probability distribution on the tree. Then we want to compare them!

I’m happy with this paper. It’s applied, not 100% trivial, and working with Steve is always great fun.