The Mathematics of Evolution
A Special Session at the Fall Central Sectional Meeting of the AMS
Loyola University, Chicago, IL, 10/3-10/4
Room Assignment: TBA

The special session co-organizers are Ruth Davidson (email: redavid2@illinois.edu) and Ruriko Yoshida (email: ruriko.yoshida@uky.edu)

## Talk reminders

All of this information has been distributed in email announcements, but here are some reminders: AMS special session talks are usually 20 minutes, with 5 minutes for questions and 5 minutes for a break and movement between sessions. We will be practicing soft enforcement of this: we will give a 2-minute warning at 18 minutes and a 0-minute signal at 20 minutes, and at 24 minutes we will stand and expect you to finish. Please plan on stopping at 20 minutes to allow for questions.

The 4 minutes is to allow you to run over your allotted time; we hope for a lively Q and A among participants in the audience after the speaker "exits the stage" at 25 minutes. The 5-minute lapse between talks is to ensure successful technology switches between speakers, and it is also AMS tradition to allow people to take bio breaks and move between sessions.

We have been promised (1) a whiteboard and a couple of whiteboard pens, (2) a projector, (3) a screen, (4) a computer that runs Windows 7, and (5) the ability to connect a laptop. We will also have a flash drive to upload talks directly to the computer during breaks if you prefer.

## Other location info

The Registration Desk and the AMS Book Exhibit will be located in the first floor lobby of Cuneo Hall, 6430 N. Kenmore Ave. Invited Addresses will be held in the Quinlan Life Sciences Building, 1050 W. Sheridan Road in the Quinlan Auditorium. These buildings are close to one another.

## Social Event

We have organized a social event on Saturday night (10/3) jointly with the Special Session on Algebraic Statistics.

This will be dinner at Niu Japanese Fusion Lounge at 7:30 p.m. The restaurant is located at 332 E Illinois St, Chicago IL 60611, 4 blocks from a Red Line stop. Ruth Davidson will lead a group on the CTA Red Line from Loyola to the restaurant, departing at 6:20 SHARP from our home room, 403 Mundelein Center. The restaurant will be seating us in smaller groups to allow for individual forms of payment. There is no funding for this event.

## Plenary Talks

Full scheduling information about the invited addresses is available here. In particular, everyone in this Special Session will likely be interested in the plenary talk by Sebastien Roch at 11:05 a.m. on Sunday, 10/4: "Mathematics of the Tree of Life---From Genomes to Phylogenetic Trees and Beyond" (see the abstract here). This talk is included in the schedule below.

## Main Session Website

The main session website is here. This will take you to links about Registration (advance registration is recommended), hotels, and so on. Due to the strict deadline enforcement of the AMS abstract system, this webpage is the authoritative source for the content of our special session, but there is much other useful information available there.

## Schedule for 10/3 (Abstracts are Listed Below)

| Talk Number and Time | Speaker | Affiliation | Title |
|---|---|---|---|
| 1. 8:30 a.m. | Tandy Warnow | University of Illinois Urbana-Champaign | Species Tree Estimation in the presence of Incomplete Lineage Sorting. |
| 2. 9:00 a.m. | Luay Nakhleh | Rice University | Gene trees in phylogenetic networks. |
| 3. 9:30 a.m. | Andrew Francis | University of Western Sydney | Tree-based phylogenetic networks. |
| 4. 10:00 a.m. | James Degnan | University of New Mexico | Displayed Trees Do Not Determine Distinguishability Under the Network Multispecies Coalescent |
| 5. 10:30 a.m. | Noah Rosenberg | Stanford University | Coalescent histories for caterpillar and lodgepole families |
| Plenary Talks and Lunch | | | |
| 6. 2:30 p.m. | Amelia Taylor | Colorado College | Developing a Statistically Powerful Measure for Phylogenetic Tree Inference using Phylogenetic and Markov Invariants. |
| 7. 3:00 p.m. | Joe Rusinko | Hobart and William Smith Colleges | Nearest Point Phylogenetic Reconstruction using Numerical Algebraic Geometry. |
| 8. 3:30 p.m. | Colby Long | North Carolina State University | Tying Up Loose Strands: Defining Equations of the Strand Symmetric Model |
| 9. 4:00 p.m. | Jesús Fernández-Sánchez | Universitat Politècnica de Catalunya | CANCELLED |
| 10. 4:30 p.m. | Seth Sullivant | North Carolina State University | Statistically consistent k-mer methods for phylogenetic tree reconstruction |

## Schedule for 10/4 (Abstracts are Listed Below)

| Talk Number and Time | Speaker | Affiliation | Title |
|---|---|---|---|
| 11. 8:30 a.m. | Laszlo A. Szekely | University of South Carolina | Number of gene duplication episodes and Gallai’s min-max theorem on intervals. |
| 12. 9:00 a.m. | Jin Xie | University of Kentucky | Distributions of topological tree metrics between a species tree and a gene tree. |
| 13. 9:30 a.m. | Eric Stone | North Carolina State University | A Spectral Approach to Phylogenetic Reconstruction |
| 14. 10:00 a.m. | Bret Larget | University of Wisconsin | Recent Advances in Bayesian Concordance Analysis. |
| 15. 10:30 a.m. | Jing Xi | North Carolina State University | Stochastic safety radius on Neighbor-Joining method and Balanced Minimal Evolution on small trees. |
| Plenary Talk: 11:05-11:55 a.m. | Sebastien Roch | University of Wisconsin-Madison | Mathematics of the Tree of Life---From Genomes to Phylogenetic Trees and Beyond. |
| 12:00 p.m.-1:30 p.m.: Lunch | | | |
| 16. 1:30 p.m. | Stefan Forcey | The University of Akron | Facets of Balanced Minimal Evolution polytopes. |
| 17. 2:00 p.m. | Megan Owen | CUNY-Lehman College | Polyhedral subdivisions and a partial CLT for tree space. |
| 18. 2:30 p.m. | Katherine L. Thompson | University of Kentucky | A phylogenetic model for quantitative trait mapping with complex data sets. |
| 19. 3:00 p.m. | Laura Kubatko | The Ohio State University | Parameter Identifiability and Inference for Species Phylogenies Under the Coalescent. |

## Abstracts Listed by Talk Number

1. (Tandy Warnow) Estimating the Tree of Life will likely involve a two-step procedure, where in the first step trees are estimated on many genes, and then the gene trees are combined into a tree on all the taxa. However, the true gene trees may not agree with the species tree due to biological processes such as deep coalescence, gene duplication and loss, and horizontal gene transfer. Statistically consistent methods based on the multi-species coalescent model have been developed to estimate species trees in the presence of incomplete lineage sorting (ILS); however, the relative accuracy of these methods compared to the usual "concatenation" approach is a matter of substantial debate within the research community. I will present results showing that coalescent-based estimation methods are impacted by gene tree estimation error, so that they can be less accurate than concatenation in many cases. I will also present weighted and unweighted statistical binning (see Mirarab et al., Science 2014, and Bayzid et al., PLOS One 2015), methods that improve gene tree estimation and enable more accurate estimations of species trees in the presence of gene tree conflict due to ILS.

2. (Luay Nakhleh) Phylogenetic networks model reticulate evolutionary histories arising due to processes such as horizontal gene transfer in prokaryotes and hybridization in eukaryotes. The topology of a phylogenetic network (the "evolutionary" version of networks) is a rooted, directed, acyclic graph, whose leaves are bijectively labeled by a set of taxa and in which each node (except for the root) has in-degree 1 or 2. Each edge in the network has parameters that include the population size and length in generations along that edge. Furthermore, given a gene, or genomic region of interest, we associate with each edge in the network a function to capture the probability that the gene has evolved inside that edge. This model constitutes a generative model of gene trees under the multi-species coalescent model with reticulation. In this talk, I will describe the model, the distribution of genes under it, and a maximum likelihood framework for inferring phylogenetic networks from genome-wide data. This is joint work with James Degnan, Jianrong Dong, Kevin Liu, and Yun Yu.

3. (Andrew Francis) A binary phylogenetic network may or may not be obtainable from a tree by the addition of directed edges (arcs) between tree arcs. In this talk I will present a precise and easily tested criterion that efficiently determines whether or not any given network can be realized in this way. The proof provides a polynomial-time algorithm for finding one or more trees (when they exist) on which the network can be based. I will also talk about a number of interesting consequences, and some further relevant questions and observations. Joint work with Mike Steel.

4. (James Degnan) Recent work in estimating species relationships from gene trees has included inferring networks assuming that past hybridization has occurred between species. Probabilistic models using the multispecies coalescent can be used in this framework for likelihood-based inference of both network topologies and parameters, including branch lengths and hybridization parameters. A difficulty for such methods is that it is not always clear whether, or to what extent, networks are identifiable, i.e., whether there could be two distinct networks that lead to the same distribution of gene trees. We present a new representation of the species network likelihood that expresses the probability distribution of the gene tree topologies as a linear combination of gene tree distributions given a set of species trees. This representation makes it clear that in some cases in which two distinct networks give the same distribution of gene trees when sampling one allele per species, the two networks can be distinguished theoretically when multiple individuals are sampled per species. This result means that network identifiability is not only a function of the trees displayed by the networks.

5. (Noah Rosenberg) A coalescent history is an assignment of branches of a gene tree to branches of a species tree on which coalescences in the gene tree occur. The number of distinct coalescent histories for a pair consisting of a labeled gene tree topology and a labeled species tree topology on the same label set is important in gene tree probability computations, and more generally, in studying evolutionary possibilities for gene trees on species trees. A recursion can be used to compute the number of coalescent histories for an arbitrary choice of gene tree and species tree, and closed-form formulas are available in a few cases. We examine the asymptotic properties of the number of coalescent histories for a matching gene tree and species tree, for two special tree shapes: caterpillar-like families $T_n$, in which a sequence of $n$-taxon trees is constructed by replacing the $r$-taxon subtree of $n$-taxon caterpillar trees with a specific subtree $T_r$, and lodgepole families $\lambda_n$, in which lodgepole tree $\lambda_0$ has a single leaf and $\lambda_{n+1}$ is inductively defined by appending $\lambda_n$ and a pair of branches – a cherry – to a common root. The asymptotic behavior of the number of coalescent histories differs substantially between the two families, providing insight into the circumstances that produce large and small numbers of coalescent histories for matching gene trees and species trees. Joint work with Filippo Disanto.
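As a small illustration of the recursion mentioned in abstract 5, here is a minimal Python sketch for the matching case only (gene tree topology equal to species tree topology). It assumes one standard form of the recursion from the literature, in which a leaf contributes 1 and an internal node with `k` available branch positions sums over the position `d` of its own coalescence; the tuple encoding of trees and the helper names `caterpillar` and `lodgepole` are choices made here for illustration, not taken from the talk.

```python
from functools import lru_cache

# A leaf is a string; an internal node is a pair (left, right).


def count_histories(tree):
    """Coalescent histories when the gene tree and species tree both equal `tree`."""

    @lru_cache(maxsize=None)
    def a(node, k):
        # k = number of branch positions (counting upward from the root of
        # `node`) on which the coalescence at that root may be placed.
        if isinstance(node, str):
            return 1
        left, right = node
        # If the root coalescence sits at position d, each child subtree
        # has d + 1 positions available for its own root coalescence.
        return sum(a(left, d + 1) * a(right, d + 1) for d in range(1, k + 1))

    return a(tree, 1)


def caterpillar(n):
    """Caterpillar tree on n >= 2 leaves."""
    tree = ("x1", "x2")
    for i in range(3, n + 1):
        tree = (tree, f"x{i}")
    return tree


def lodgepole(n):
    """lambda_0 is a single leaf; lambda_{n+1} joins lambda_n and a cherry at a new root."""
    tree = "y0"
    for i in range(1, n + 1):
        tree = (tree, (f"a{i}", f"b{i}"))
    return tree


# Caterpillars on 2..6 leaves give the Catalan numbers.
print([count_histories(caterpillar(n)) for n in range(2, 7)])  # [1, 2, 5, 14, 42]
print([count_histories(lodgepole(n)) for n in range(1, 5)])
```

The pure caterpillar case reproduces the Catalan numbers, which is one of the closed forms available in the literature; the lodgepole line simply evaluates the same recursion on the family defined in the abstract.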

6. (Amelia Taylor) In the late 1980s, Cavender and Felsenstein, and Lake, introduced the idea of phylogenetic invariants, a class of polynomials useful in the study of phylogenetic trees. Allman and Rhodes renewed interest in these polynomials, taking the point of view of algebraic geometry and giving a comprehensive description of the set of polynomials, which led to their use in studying numerous analytical questions such as identifiability. As part of this renaissance, Casanellas and Fernández-Sánchez provided one of the first simulation studies exploring the use of the polynomials for tree inference, leaving many open questions about using the polynomials directly for tree inference. Around the same time, Sumner and coauthors suggested an alternative perspective using group representation theory. We briefly present the two perspectives for the two-state general Markov model on quartet trees, then describe our study of using polynomials from each perspective to build a statistically powerful measure for tree inference, and argue for one particular measure, including simulation results.

7. (Joe Rusinko) We propose a phylogenetic reconstruction algorithm which uses the distance to the nearest point on the phylogenetic model to select the tree of best fit. Our implementation is currently for quartet trees, which can then be used to reconstruct larger trees using a quartet amalgamation algorithm. This algorithm allows for data-dependent hypothesis testing, which helps identify when trees have been accurately reconstructed, an important feature since quartet amalgamation algorithms are sensitive to error in input trees but do not require a complete set of quartet trees as inputs. We present initial findings for the Jukes-Cantor, Kimura 2- and 3-parameter models and the general Markov model of evolution.

8. (Colby Long) The strand symmetric model is a phylogenetic model designed to reflect the symmetry inherent in the double-stranded structure of DNA. We show that the set of known phylogenetic invariants for the general strand symmetric model of the three-leaf claw tree entirely defines the ideal. This knowledge allows one to determine the vanishing ideal of the general strand symmetric model of any trivalent tree. Our proof of the main result is computational. We use the fact that the Zariski closure of the strand symmetric model is the secant variety of a toric variety to compute its dimension. We then show, using elimination theory, that the known equations generate a prime ideal of the correct dimension.

9. (Jesús Fernández-Sánchez) CANCELLED. Evolutionary models are needed to study the evolution between nucleotide sequences. Some of the most commonly used models fit into the equivariant definition introduced in terms of the action of a permutation group on the set of nucleotides. Phylogenetic invariants are constraints satisfied by the joint probabilities of nucleotide patterns at the leaves of a phylogenetic tree evolving under a given evolutionary model. They have been shown to be useful to characterize the model as well as to design methods for phylogenetic inference. We study and construct phylogenetic invariants of some well-known equivariant phylogenetic models and the general Markov model. These invariants allow us to describe a (Zariski open) neighbourhood of the no-evolution points in the model as a complete intersection. In other words, we provide a minimal possible number of explicitly constructed phylogenetic invariants that determine the model at biologically meaningful points. Our work is inspired by previous inductive constructions of phylogenetic invariants. It is motivated mostly by applications, as the number of phylogenetic invariants we construct is much lower than the number needed to generate the ideal of the corresponding variety.

10. (Seth Sullivant) Algorithms based on k-mer distances are used to reconstruct phylogenetic trees without first constructing a multiple sequence alignment. We show that the standard method for reconstructing trees based on k-mer distances is statistically inconsistent (that is, it reconstructs the wrong tree even with increasing amounts of data), and we also derive statistically consistent model-based distance corrections in the case of sequences without gaps. We report on numerous simulations which show that the new formulas significantly outperform older (statistically inconsistent) methods, even in sequences with gaps. These results also have implications for multiple sequence alignment, since many widely used multiple sequence alignment programs use the statistically inconsistent methods to construct a guide tree.
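For readers unfamiliar with alignment-free methods, the following Python sketch shows what an uncorrected k-mer distance can look like. It assumes one common choice, the squared Euclidean distance between k-mer frequency vectors; the function names are hypothetical, and this is not the corrected, statistically consistent formula from abstract 10.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every length-k substring (k-mer) of seq."""
    if k < 1 or len(seq) < k:
        raise ValueError("sequence must contain at least one k-mer")
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(seq_a, seq_b, k):
    """Squared Euclidean distance between k-mer frequency vectors.

    This is an uncorrected distance, shown only to illustrate the idea.
    """
    ca, cb = kmer_counts(seq_a, k), kmer_counts(seq_b, k)
    na, nb = sum(ca.values()), sum(cb.values())
    # Counter returns 0 for k-mers absent from one of the sequences.
    return sum((ca[w] / na - cb[w] / nb) ** 2 for w in set(ca) | set(cb))


print(kmer_distance("ACGTACGT", "ACGTTCGT", 2))
```

A matrix of such pairwise values can be passed to any distance-based tree method; no alignment is needed because only substring counts are compared.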

11. (Laszlo A. Szekely) In 1996, Guigo et al. [Mol. Phylogenet. Evol., 6 (1996), 189–203] posed the following problem: for a given species tree and a number of gene trees, what is the minimum number of duplication episodes, where several genes could have undergone duplication together, needed to generate the observed situation? (Gene order is neglected, but duplication of genes could have happened only on certain segments that duplicated.) We study two versions of this problem, one of which was algorithmically solved not long ago by Bansal and Eulenstein. We provide min-max theorems for both versions that generalize Gallai’s archetypal min-max theorem on intervals, allowing simplified proofs of the correctness of the algorithms (as always happens with duality) and deeper understanding. An interesting feature of our approach is that its recursive nature requires a generality that bioinformaticians attempting to solve a particular problem usually avoid.
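The classical interval statement behind abstract 11 can be seen in a few lines of Python. The sketch below assumes the usual formulation for closed intervals on a line (the maximum number of pairwise disjoint intervals equals the minimum number of points that hit every interval) and uses a textbook greedy sweep; it illustrates only the archetypal theorem, not the generalizations in the talk, and the function name is hypothetical.

```python
def greedy_intervals(intervals):
    """Return (disjoint, points) for a list of closed intervals (left, right).

    `disjoint` is a family of pairwise disjoint intervals and `points` is a
    set of points hitting every input interval. Both lists have the same
    length, which certifies that each is optimal.
    """
    disjoint, points = [], []
    # Sweep by right endpoint; open a new point only when forced to.
    for left, right in sorted(intervals, key=lambda iv: iv[1]):
        if not points or left > points[-1]:
            disjoint.append((left, right))
            points.append(right)
    return disjoint, points


disjoint, points = greedy_intervals([(1, 4), (2, 3), (5, 7), (6, 8), (3, 5)])
print(disjoint, points)  # [(2, 3), (5, 7)] [3, 7]
```

Because any hitting set needs a separate point for each of the disjoint intervals, equal sizes prove both sides optimal at once, which is the duality flavor the abstract refers to.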

12. (Jin Xie) In order to conduct a statistical analysis on a given set of phylogenetic gene trees, we often use a distance measure between two trees. In a statistical distance-based method to analyze discordance between gene trees, it is key to choose a “biologically meaningful” and “statistically well-distributed” distance between trees. Thus, in this paper, we study the distributions of three tree distance metrics between two trees: the edge difference, the path difference, and the precise K interval cospeciation distance. First, we focus on the distributions of the three tree distances between two random unrooted trees with n leaves (n ≥ 4); then we focus on the distributions of the three tree distances between a fixed rooted species tree with n leaves and a random gene tree with n leaves generated under the coalescent process given the species tree. We show some theoretical results as well as a simulation study on these distributions.

13. (Eric Stone) In this talk, I introduce a graph-theoretical approach to reconstructing phylogenies from distance data. Toward that end, I discuss the use of eigendecomposition to learn about the features of a graph. The graphs I consider are not graphs in the usual sense; rather, they are "graphs" obtained after a subset of vertices have been removed in a prescribed way. For a phylogeny, this subset contains the extinct ancestral taxa for which no data is available. I show how spectral methods can be used to nevertheless ascertain the positions of these taxa on the tree.

14. (Bret Larget) Bayesian concordance analysis, as implemented in the software BUCKy, is a method to simultaneously estimate multiple gene trees with prior information that trees from different genes relating the same set of taxa are likely to be similar, if not identical. Information is shared between genes, but discordance among genes is allowed. Recent advances include a cluster model where discordant gene trees are expected to be topologically similar, as many biological processes that create gene tree discordance predict. We review recent advances in this research area.

15. (Jing Xi) A distance-based method to reconstruct a phylogenetic tree with n leaves takes a distance matrix, an n×n symmetric matrix with 0s on the diagonal, as its input and reconstructs a tree with n leaves using tools in combinatorics. A safety radius is a radius from a tree metric (a distance matrix realizing a true tree) within which the input distance matrices must all lie in order to satisfy a precise combinatorial condition under which the distance-based method is guaranteed to return a correct tree. A stochastic safety radius is a safety radius under which the distance-based method is guaranteed to return a correct tree with a certain probability. In this paper we investigate stochastic safety radii for the neighbor-joining (NJ) method and the balanced minimal evolution (BME) method for n=5.
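To make the notion of a tree metric in abstract 15 concrete, here is a minimal Python sketch of the four-point condition, the standard test for whether a distance matrix is realized by some tree with nonnegative edge lengths. The function name, tolerance, and 5-leaf example matrix are assumptions made for illustration; it does not compute a safety radius.

```python
from itertools import combinations_with_replacement


def is_tree_metric(d, tol=1e-9):
    """Check the four-point condition on a symmetric matrix with zero diagonal.

    For every choice of i, j, k, l (repeats allowed), the two largest of the
    three pairing sums must be equal. Allowing repeated indices also enforces
    nonnegativity and the triangle inequality.
    """
    n = len(d)
    for i, j, k, l in combinations_with_replacement(range(n), 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Path distances on a 5-leaf tree with cherries {0, 1} and {3, 4},
# leaf 2 attached in the middle, and all edge lengths equal to 1.
d = [
    [0, 2, 3, 4, 4],
    [2, 0, 3, 4, 4],
    [3, 3, 0, 3, 3],
    [4, 4, 3, 0, 2],
    [4, 4, 3, 2, 0],
]
print(is_tree_metric(d))  # True

# Perturbing a single entry (symmetrically) breaks the condition.
noisy = [row[:] for row in d]
noisy[0][3] = noisy[3][0] = 5
print(is_tree_metric(noisy))  # False
```

Real input matrices are noisy and rarely pass this test exactly, which is why one asks how far a matrix may stray from a tree metric before a method such as NJ or BME stops returning the correct tree.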

Plenary Talk (Sebastien Roch) The reconstruction of the Tree of Life is an old problem in evolutionary biology which has benefited from various branches of mathematics, including probability, combinatorics, algebra, and geometry. Modern DNA sequencing technologies are producing a deluge of new data on a vast array of organisms, transforming how we view the Tree of Life and how it is reconstructed. I will survey recent progress on some mathematical and computational questions that arise in this context. No biology background will be assumed.

16. (Stefan Forcey) I’ll review how the balanced minimal evolution (BME) method reconstructs a phylogenetic tree from a given distance matrix, by performing the simplex algorithm for a simple example. New and improved algorithms might be available if we had enough facets of the BME polytope in order to pose a relaxed linear programming problem. So far we have found all the facets up to dimension 5 and several classes of large facets that extend to all dimensions. We’ll go over facets from caterpillars, necklaces, intersecting cherries, and big splits. We’ll mention some interesting connections to matching polytopes like the Birkhoff polytope, and list some open questions about counting and geometry.
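For the smallest interesting case of the BME method in abstract 16, four leaves, the objective can be evaluated by hand. The Python sketch below assumes Pauplin's formula for the balanced tree length, the sum over leaf pairs of 2^(1 - p_ij) d_ij with p_ij the number of edges between leaves i and j, and simply tries all three quartet topologies. It is a brute-force illustration with a hypothetical function name, not the simplex or facet-based approach of the talk.

```python
def bme_quartet(d, leaves=(0, 1, 2, 3)):
    """Return (length, split) for the quartet topology minimizing BME length.

    `d` is a symmetric distance matrix; a split ((p, q), (r, s)) denotes the
    quartet tree in which p, q form one cherry and r, s form the other.
    """
    a, b, c, e = leaves
    splits = [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]

    def length(split):
        (p, q), (r, s) = split
        # Leaves in the same cherry are 2 edges apart: weight 2**(1 - 2) = 1/2.
        within = d[p][q] + d[r][s]
        # Leaves in different cherries are 3 edges apart: weight 1/4.
        across = d[p][r] + d[p][s] + d[q][r] + d[q][s]
        return within / 2 + across / 4

    return min((length(s), s) for s in splits)


# Distances from the quartet 01|23 with every edge of length 1.
d = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
print(bme_quartet(d))  # (5.0, ((0, 1), (2, 3)))
```

On an exact tree metric the minimum equals the total edge length of the true tree (five unit edges here). Since the objective is linear in the distances, each topology corresponds to a vertex of a polytope, which is the starting point for the linear programming view described in the abstract.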

17. (Megan Owen) The space of metric phylogenetic trees introduced by Billera, Holmes, and Vogtmann (2001) is a polyhedral cone complex. It is also non-positively curved, so there is a unique shortest path (geodesic) between any two trees. I will show how the combinatorics of geodesics with a specified fixed endpoint give rise to a finer polyhedral subdivision, and how this subdivision can be used to prove a partial Central Limit Theorem on the tree space. This talk is a combination of joint work with Ezra Miller and Scott Provan, and with Huiling Le and Dennis Barden.

18. (Katherine L. Thompson) Recent developments in association mapping methods along with improvements in sequencing technology have made it possible to link locations along the genome (single nucleotide polymorphisms, or SNPs) with quantitative traits. Although this goal is central in the biological sciences, progress has been limited by the inability of existing methods to consider complex, but relevant, scenarios such as the simultaneous influence of genetic and external effects on the quantitative trait(s) under study. Regression-based methods are computationally feasible and perform well in detecting external effects, but may miss weaker genetic signals since they fail to consider uneven evolutionary relatedness among samples. Previous work has shown promise in improving detection of associated SNPs by using the uneven relatedness within a SNP to estimate the underlying covariance structure among trait values. Here, a method including phylogenetic analysis is proposed to search for genetic and external effects on a quantitative trait. Parameter estimates from mixed model and Bayesian approaches are compared. The proposed method aims to estimate and separate effects while remaining computationally feasible.

19. (Laura Kubatko) The rapid increase in availability of DNA sequence data, coupled with gains in computational power, has led to the use of increasingly complex models for inferring the evolutionary relationships among collections of species. These models have largely made use of the coalescent to capture the within-population dynamics of the speciation process along a phylogenetic tree. Though at least a dozen inference methods/software packages have been developed in this setting in the last 10 years, there has been little attention given to identifiability of model parameters. We have previously established identifiability of the species tree topology using techniques from algebraic statistics. We now extend our methodology to consider the estimation of parameters along that tree. In particular, we consider identifiability of the times of the speciation events along a fixed phylogeny. We apply our methods to both simulated and empirical data sets.