$ \def\R{\mathbb{R}} \def\Real{\mathbb{R}} \def\X{\mathbb{X}} \def\Y{\mathbb{Y}} \def\N{\mathbb{N}} \def\Z{\mathbb{Z}} \def\e{\varepsilon} \def\h{\mathsf{H}} \def\F{\mathbb{F}} \def\dgm{\mathrm{Dgm}} \def\pers{\mathrm{pers}} \def\Pers{\mathrm{Pers}} \def\lip{\mathrm{Lip}} \def\amp{\mathrm{Amp}} \def\meas{\mathcal{P}} \def\Fr{Fr\'echet } \def\Cech{{\rm Čech}} \def\Rips{{\rm Rips}} \def\nodes{\mathbf{N}} \def\image{\mathtt{Im}\,} \def\ker{\mathtt{Ker}\,} \def\ph{\mathbf{P}H} \def\bbr{\mathtt{Br}} \def\bm{\mathtt{B}} \def\bmv{\bm^v} \def\phpp{\mathbf{b}} \def\dimph{\mathrm{dim}_{\ph}} \def\class{\mathcal{F}} $

Tutorial on Topological Data Analysis

Yuliy Baryshnikov and Yuriy Mileyko

ISIT 2014


  • Motivation: Topological Data Analysis as a part of the data analysis pipeline
  • Background in Topology
  • Shape of the data and Persistence
  • Manifold learning
  • Stochastic Topology
  • Applications:
    • Klein Bottle in plain views
    • Persistence in chaos: detecting changes in dynamics
    • Local Homology or what is the dimension of the Internet

dimensionality reduction and manifold learning

From a high-level viewpoint, statistical theory starts with a model into which the data are fit (non-parametric statistics included): one looks for a parametric family

$$ x_i=\sum a_{ij} y_j+\epsilon_i, $$
or some nonlinear generalizations, and seeks the values of the parameters fitting the data. (In this example, the parameters are the linear mappings from the $y$'s to the $x$'s.)
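A minimal numerical sketch of this fitting problem (the dimensions, data, and noise level here are all made up for illustration): generate observations from the linear latent model above and recover the parameter subspace by SVD, i.e. classical linear dimensionality reduction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: latent parameters y in R^2, observations x in R^5,
# generated by the linear model x_i = sum_j a_ij y_j + eps_i.
A = rng.standard_normal((5, 2))                    # unknown mapping a_ij
Y = rng.standard_normal((2, 200))                  # latent coordinates y
X = A @ Y + 0.01 * rng.standard_normal((5, 200))   # noisy observations

# Fitting the parametric family: the best 2-dimensional subspace through
# the data is recovered by SVD (classical PCA).
U, s, _ = np.linalg.svd(X, full_matrices=False)
subspace = U[:, :2]                                # estimated span of A

# The estimated subspace should (nearly) contain the columns of A:
residual = A - subspace @ (subspace.T @ A)
print(np.linalg.norm(residual))                    # small: model recovered
```

The point of the example is what comes next: the procedure succeeds only because we assumed the linear "shape" of the data in advance.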
More generally, one might want to look for the "laws" governing the data:

$$ \begin{array}{rl} f_1(x_1,\ldots,x_n)&=0,\\ \cdots&\\ f_m(x_1,\ldots,x_n)&=0\\ \end{array} $$

In any case, these approaches imply that we know in advance the rough "shape" of the data.

Do we?

The very promise of Topological Data Analysis is to be able to recover the underlying model - i.e. the parametric family to which the data conform - from the observations.

Topology, classically, deals with the properties of geometric objects invariant under the broadest classes of transformations ("coordinate changes").
Topological Data Analysis is the (traditionally, implicit) step in the data analysis which recovers these properties, constructing a "generally covariant" model.

We will give an overview of the relevant notions later, but start here with a motivating example.

Clustering is a classical example of topological data analysis: we reduce the shape of the data to its connected components. The equivalences between spaces here are quite rough.
Clustering can be interpreted as a mapping with limited distortion into a discrete metric space.
In some sense, clustering is the data analysis with zero-dimensional model.

Kleinberg's paradox deals with natural axioms on clustering procedures on metric spaces: he proved that these axioms are incompatible.

Carlsson and Memoli countered this by considering families of clustering procedures parameterized by scale.

One should think of this procedure as archetypal for topological data analysis: one recovers the underlying topology of the data, taking the scale into account, in a uniform, consistent fashion.
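The scale-parameterized point of view is realized by single-linkage hierarchical clustering: the dendrogram encodes the clusterings at all scales at once, and cutting it at scale $t$ merges points joined by chains of steps shorter than $t$. A toy sketch (the point cloud and scales are made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy point cloud: two well-separated blobs (an assumed example).
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(0, 0.1, (20, 2)),
                    rng.normal(5, 0.1, (20, 2))])

# Single linkage builds the whole one-parameter family of clusterings
# at once; the matrix Z is the dendrogram.
Z = linkage(pdist(points), method="single")

# At a small scale the two blobs are separate components;
# at a large scale everything merges into one.
for t in (0.5, 10.0):
    labels = fcluster(Z, t=t, criterion="distance")
    print(f"scale {t}: {labels.max()} cluster(s)")
```

The number of clusters as a function of the scale $t$ is exactly the zeroth Betti number of the data seen at that scale, foreshadowing persistence.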
To proceed to more complicated models of shapes, we need first to introduce appropriate language.

background in topology

topological spaces

The underlying idea of topology is to reflect proximity between points. Hence, to consider a set $X$ as a topological space, we need to formalize the notion of "being close to each other".
Proximity comes intrinsically with the notion of scale; to account for all scales simultaneously one has to codify the subsets of $X$ of points close to $x\in X$ on different scales.
So, a neighborhood of $x$ is some subset of $X$. One might ask for some degree of stability (friend of a friend...): if $U$ is a neighborhood of $x$ and $y\in U$, then $U$ is also a neighborhood of $y$.
Such neighborhoods are called open, and fixing a collection of open neighborhoods (subject to some natural axioms) is exactly the notion of topology on $X$.

metric spaces

One natural way to think about proximities is by assigning a distance to pairs of points, and the notion of metric space is a convenient way to fix a topology.
The triangle inequality satisfied by the distance function is, again, a formalization of the "friend of my friend..." intuition.

Open balls form a natural basis for topology. The same topology can be encoded by different distance functions. So, if your algorithm for recovering a topological model results in a rigid metric, you might want to loosen your assumptions!
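For instance, on $\Real^2$ the distances $d_1$ and $d_\infty$ are different functions but induce the same topology, since their open balls nest inside each other. A quick numerical check of the underlying inequalities (a made-up sketch):

```python
import numpy as np

# Two different distance functions on R^2 encoding the same topology:
# the ball nesting  B_inf(x, r) >= B_1(x, r) >= B_inf(x, r/2)
# follows from the inequalities  d_inf <= d_1 <= 2 * d_inf.
rng = np.random.default_rng(2)
for _ in range(1000):
    u, v = rng.standard_normal((2, 2))
    d1 = np.abs(u - v).sum()      # taxicab distance
    dinf = np.abs(u - v).max()    # sup distance
    assert dinf <= d1 <= 2 * dinf
print("d_inf <= d_1 <= 2*d_inf: both metrics induce the same topology")
```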

nice classes of topological spaces

General topological spaces can be very bad. The world surrounding us can be described using reasonable spaces. Useful classes are:

simplicial spaces - spaces assembled from simplices (with exactly the gluing rules one could expect).

Simplicial spaces are nice because one can work with them in a purely combinatorial way: just list all collections of points that are vertices of a simplex.
This is the key property that enables effective computing of the topological invariants.
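A sketch of this combinatorial encoding (the complex here is a made-up toy example, a filled triangle with an extra edge attached): list the maximal simplices, close under taking faces, and combinatorial invariants such as the Euler characteristic fall out by counting.

```python
from itertools import combinations

# A simplicial complex is a list of vertex collections closed under
# taking faces.  Hypothetical example: filled triangle {0,1,2} plus
# an extra edge {2,3}.
maximal = [(0, 1, 2), (2, 3)]

# Generate every face of every maximal simplex.
simplices = set()
for s in maximal:
    for k in range(1, len(s) + 1):
        simplices.update(combinations(s, k))

# Euler characteristic: alternating sum of simplex counts by dimension.
chi = sum((-1) ** (len(s) - 1) for s in simplices)
print(sorted(simplices))
print("Euler characteristic:", chi)  # this complex is contractible: chi = 1
```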


Manifolds - spaces locally modeled on Euclidean spaces. There are two ways to define them: via charts and atlases, and via the implicit function theorem: a manifold is given by a system of equations $$ M=\{F(x)=0\},\quad F\in C^k(\Real^m,\Real^n) $$ such that the Jacobian $DF$ has maximal rank on $M$.

If one allows degenerations, the zero sets can be arbitrarily involved closed subsets of $\Real^m$ - but if one restricts oneself to polynomial equations - and inequalities - then the resulting sets (called semialgebraic) are good for computing: both manifolds and semialgebraic sets can be triangulated, that is, they look like simplicial complexes.
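A concrete instance of the implicit-function-theorem definition: the circle $S^1=\{x^2+y^2-1=0\}\subset\Real^2$, where the Jacobian is the $1\times 2$ row $(2x,\,2y)$, which never vanishes on the circle. A numerical sanity check:

```python
import numpy as np

# Implicit description of S^1 as M = {F = 0}, with F: R^2 -> R^1.
def F(p):
    x, y = p
    return np.array([x**2 + y**2 - 1.0])

def DF(p):
    x, y = p
    return np.array([[2 * x, 2 * y]])   # the 1x2 Jacobian

# On M the Jacobian has maximal rank (= 1), so M is a smooth 1-manifold.
for theta in np.linspace(0, 2 * np.pi, 8, endpoint=False):
    p = np.array([np.cos(theta), np.sin(theta)])
    assert abs(F(p)[0]) < 1e-12
    assert np.linalg.matrix_rank(DF(p)) == 1
print("DF has maximal rank everywhere on M")
```

By contrast, $F(x,y)=x^2+y^2$ (a degeneration) has vanishing Jacobian at its only zero, the origin.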


Given two topological spaces, when do we deem them to be the same, i.e. as representing in essence the same model? In the case of metric spaces, one would ask for a bijection that preserves distances - this is obviously too restrictive. One might relax this to bi-Lipschitz bijections, or some other control of the distortion.

For the topological spaces, one just asks for continuity: a bicontinuous bijection is called a homeomorphism.

homotopy equivalence

Homeomorphism is sometimes still too rigid an equivalence for truly scale-independent model building. One might want to identify a circle and a solid torus if only the periodicity of the model is important.

To capture this, one introduces the notion of homotopy equivalence. Two (continuous) mappings $f,g:X\to Y$ are homotopic if there exists a continuous family of continuous mappings $H_t:X\to Y$ such that $H_0=f$, $H_1=g$.
We say that $X$ and $Y$ are homotopy equivalent if there exist mappings $$ X \underset{g}{\overset{f}{\rightleftarrows}} Y $$ such that $f\circ g$ is homotopic to $id_Y$ and $g\circ f$ is homotopic to $id_X$.
Homotopy equivalence is the most general equivalence between models we will be dealing with.
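A standard worked example (not spelled out on the slide): the circle $S^1$ is homotopy equivalent to the punctured plane $\Real^2\setminus\{0\}$. Take the inclusion $f:S^1\hookrightarrow\Real^2\setminus\{0\}$ and the radial retraction $g(x)=x/|x|$. Then $g\circ f=id_{S^1}$ exactly, while $$ H_t(x)=(1-t)\,x+t\,\frac{x}{|x|} $$ is a continuous family connecting $id_{\Real^2\setminus\{0\}}$ to $f\circ g$; it never passes through the origin, since $H_t(x)$ stays on the ray through $x$.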


Selecting a topological space as a model, when the space itself is defined only up to some equivalence, requires a supply of "fingerprints" of topological spaces - invariants that survive the equivalences.
Defining and computing the topological invariants is one of the key occupations of algebraic topologists.
Topological Data Analysis is in essence the craft of defining and computing topological invariants for point clouds and other data collections, in search of an underlying parametrization-free model.

$$\beta_0=1, \beta_1=10,\beta_2=1?$$ Surface of genus $5$!


Homology groups of a topological space $X$, roughly, account for the un-patchable holes. For that, one takes hole-less subspaces (cycles), but declares immaterial all those that can be patched (boundaries).
It is instrumental to allow differences (and sums) of the cycles - hence the resulting object is naturally an Abelian group (or a vector space, for added convenience).


One needs quite a bit of preparatory work to define homologies (and they come in quite a few sizes and flavors, agreeing, though, on reasonable spaces).
The resulting homology groups (one in each dimension) are denoted as $H_k(\cdot), k=0,1,\ldots$, and have many useful properties - in particular, they are homotopy invariant.

simplicial complexes and homologies

One important aspect of homologies on simplicial complexes is their completely combinatorial description.
$$ \partial_1=\left( \begin{array}{ccccc} -1&0&0&1&0\\ 1&-1&0&0&1\\ 0&1&-1&0&0\\ 0&0&1&-1&-1\\ \end{array} \right);\quad \partial_2=\left( \begin{array}{c} 0\\1\\1\\0\\1 \end{array} \right) $$
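From these two matrices the Betti numbers follow by linear algebra alone: $\beta_k = \dim\ker\partial_k - \mathrm{rank}\,\partial_{k+1}$. A sketch in code, using the matrices from the slide (the interpretation as 4 vertices, 5 edges, and 1 triangle is read off from their shapes):

```python
import numpy as np

# The boundary matrices from the slide: 4 vertices, 5 edges, 1 triangle.
d1 = np.array([[-1,  0,  0,  1,  0],
               [ 1, -1,  0,  0,  1],
               [ 0,  1, -1,  0,  0],
               [ 0,  0,  1, -1, -1]])
d2 = np.array([[0], [1], [1], [0], [1]])

# The defining identity of a chain complex: boundaries have no boundary.
assert not np.any(d1 @ d2)

# Betti numbers by rank-nullity: beta_k = dim ker d_k - rank d_{k+1}.
r1, r2 = np.linalg.matrix_rank(d1), np.linalg.matrix_rank(d2)
beta0 = d1.shape[0] - r1             # ker d_0 is everything: 4 - rank d1
beta1 = (d1.shape[1] - r1) - r2      # dim ker d1 - rank d2
print(beta0, beta1)                  # -> 1 1: connected, with one loop
```

Over the rationals (or any field) this rank computation is the whole story; this is what makes homology of simplicial complexes effectively computable.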