The challenge
In the current SARS-CoV-2 outbreak, we have all learned the importance of epidemiology. Epidemiology is currently undergoing a revolution due to easy and cheap access to viral genetic sequences, available in real time as the epidemics unfold. This helps us understand viral spread because viruses mutate as they are passed between individuals, and by using shared mutations we can infer transmission history.
Such detective work using viral genetic sequences is beautifully expressed on the nextstrain.org platform. This platform allows people to understand viral transmission between geographic areas using the relationships among viral genomes. Specifically, it represents these relationships in terms of a phylogenetic tree, analogous to a “tree of life” but for a single viral outbreak.
Although nextstrain is wonderful, and perfect for some audiences, it has some limitations. Some of these limitations derive from the fact that it infers a single tree structure, when in fact each tree structure is a hypothesis with a certain level of support that we can obtain with our finite data. A preferable way to proceed is called Bayesian phylogenetics, in which we infer a so-called posterior distribution on trees: the ensemble of trees that can credibly explain the data, along with a probability that each one is correct.
The price for the rigor and flexibility of Bayesian phylogenetics is speed. The current (non-Bayesian) pipeline for processing sequence data and building SARS-CoV-2 trees on nextstrain takes several hours to run. In contrast, traditional methods for Bayesian phylogenetics simply cannot handle this scale of data: even a reasonably sized subsample would take months. Recent work uses strong tree constraints, which is a completely appropriate response to an avalanche of data. Our work will enable more flexible and probabilistic versions of such external guidance, as well as dynamic “online” approaches as sequences are added to data sets. We believe that the ensemble of all of these methods will make Bayesian inferences actionable for epidemic control.
Our project
My research group has been working diligently for 5 years on developing categorically faster methods for Bayesian phylogenetics. Our best new inferential framework uses a technique called variational Bayes. If you want to learn more about it, read on my blog here and here. We are implementing these algorithms in a C++17 library with a Python interface, located at https://github.com/phylovi/libsbn. This library aims to enable integration with packages such as TensorFlow and PyTorch for flexible modeling.
Responsibilities
We will work together to develop and implement new algorithms in our C++ library, as well as maintain/refactor our codebase as it evolves in response to what we learn in our research.
We will collaborate with research groups across the world, including the groups of
- Marc Suchard, a leading biostatistician especially known for his work in “phylodynamics”: the intersection of phylogenetics, immunology, and epidemiology
- Mathieu Fourment, who has led the development of fixed-tree variational inference for time-tree models
- Cheng Zhang (张成), who has led the development of flexible-tree variational inference.
We are also lucky to have a network of helpful C++ mentors.
Environment
This position is budgeted for a junior-level programmer, but more experienced developers are encouraged to get in touch and we will see what we can do. The environment is lively yet casual, with a strong emphasis on collaborative work. The Center is housed on a lovely campus on Lake Union a short walk from downtown, and a slightly longer walk from the University of Washington. The Matsen group is in the newly remodeled Steam Plant building overlooking the lake. Powerful computing resources and helpful IT staff await. Ideally you’d want to be on campus (when that’s possible again) but long-term remote work is possible from these states: Alabama, Alaska, Arizona, California, Colorado, Hawaii, Idaho, Maryland, Minnesota, Montana, New York, Ohio, Oregon, South Carolina, and Texas. Remote work from outside of the USA isn’t inconceivable, but it would probably prove challenging: at minimum you would need to obtain a business license in Washington state, and we’d need to pay state and federal taxes.
We believe that science is for everyone. We have had researchers with a variety of backgrounds, including Latinx, Black, Asian, and Middle Eastern. We have had women, men, gay, and straight, and we welcome people of all sexual orientations and gender identities. We have had successful high schoolers, postdocs, people who were the first in their family to attend college, and one who had decided that college wasn’t for them. We have had researchers with backgrounds in biology, physics, statistics, math, and computer science.
We acknowledge the historical and present barriers for underrepresented groups, and work to increase diversity, equity and inclusion in computational biology. Members of underrepresented groups are especially encouraged to apply.
You can find out more about our group by visiting:
Qualifications
This position has no formal education or academic publication requirements. We welcome people to apply whatever their background.
Essential skills
- demonstrated coding ability
- desire to produce clean (clear, factored, and tested) code
- motivation to learn new topics and dive into a complex programming language (C++17)
- ability to work in a team and communicate
- ability to find solutions independently
Additional helpful skills
- experience with C++, and with modern C++ idioms
- background in probability and Bayesian inference
- experience with Python
- experience with a modern git-based workflow
- experience with PyTorch or TensorFlow
- experience with Docker and continuous integration
- experience developing in a Linux environment
Applying
If you are interested in this position, please submit the following materials:
- A CV summarizing your work experience so far.
- A code sample showing work that you are proud of. This has to be nontrivial, but doesn’t have to be long. Ideally it would be publicly accessible, e.g. on GitHub, but if that’s not possible an emailed attachment is fine too.
- The names and email addresses of three references.
Please send these materials to: if you’re interested.