- Motivation: Topological Data Analysis as a part of the data analysis pipeline
- Background in Topology
- Shape of the data and Persistence
- Manifold learning
- Stochastic Topology
- Applications:
- Klein Bottle in plain views
- Persistence in chaos: detecting changes in dynamics
- Local Homology or what is the dimension of the Internet

From a high-level viewpoint, statistical theory starts with a model into which
the data are fit (non-parametric statistics included):
for example, one looks for a parametric family

$$
x_i=\sum a_{ij} y_j+\epsilon_i,
$$

or some nonlinear generalization, and seeks the values of the
parameters fitting the data. (In this example, the
parameters are the linear mappings from the $y$'s to the
$x$'s.)
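A minimal sketch of fitting such a linear parametric family by least squares; the generating matrix and the noise level below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: observations x are noisy linear images of latent y's,
# i.e. x_i = sum_j a_ij y_j + eps_i.
A_true = np.array([[2.0, -1.0],
                   [0.5,  1.5],
                   [1.0,  0.0]])                      # unknown mapping (3 x 2)
Y = rng.normal(size=(100, 2))                         # latent parameters y_j
X = Y @ A_true.T + 0.01 * rng.normal(size=(100, 3))   # noisy observations x_i

# Least-squares estimate of the parameters a_ij from the data.
A_hat, *_ = np.linalg.lstsq(Y, X, rcond=None)
A_hat = A_hat.T

print(np.round(A_hat, 2))  # close to A_true
```

Note that this only estimates the parameters *within* the assumed family; the shape of the family itself is an input, not an output.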

More generally, one might want to look for the "laws"
governing the data:

$$
\begin{array}{rl}
f_1(x_1,\ldots,x_n)&=0,\\
\vdots&\\
f_m(x_1,\ldots,x_n)&=0\\
\end{array}
$$

In any case, these approaches imply that we know in advance the rough "shape" of the data.

Do we?

The very promise of Topological Data Analysis is to be
able to recover the underlying *model* - i.e. the
parametric family to which the data conform - from the
observations.

Topology, classically, deals with the properties of
geometric objects invariant under the broadest classes
of transformations ("coordinate changes").

Topological
Data Analysis is the (traditionally, implicit) step in
the data analysis which recovers these properties,
constructing a "generally covariant" model.

We will give an overview of the relevant notions later, but start here with a motivating example.

Clustering is a classical example of topological
data analysis: we reduce the shape of the data to
its connected components. The resulting equivalences
between spaces are quite rough.

Clustering can be interpreted as a
mapping with limited distortion
into a discrete metric space.

In some sense, clustering is data analysis with a
zero-dimensional model.

Kleinberg's paradox deals with natural axioms (scale invariance, richness, consistency) on clustering procedures on metric spaces: he proved that these axioms are incompatible.

Carlsson and Memoli countered
this by considering families of clustering
procedures parameterized by the *scale*.

One should think of this procedure as archetypal
for topological data analysis: one recovers
the underlying topology of the data, taking the
scale into account, in a uniform, consistent fashion.
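A minimal sketch of scale-parameterized clustering: single-linkage components of the graph connecting points closer than a threshold $\epsilon$, for varying $\epsilon$. The two-blob point cloud and the scales are made up for illustration.

```python
import numpy as np

def components_at_scale(points, eps):
    """Number of connected components of the graph linking points at
    distance < eps (single-linkage clustering at scale eps), via union-find."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) < eps:
                parent[find(i)] = find(j)

    return len({find(i) for i in range(n)})

# Two well-separated blobs, centers 5 apart, spread 0.1.
rng = np.random.default_rng(1)
cloud = np.vstack([rng.normal([0, 0], 0.1, size=(20, 2)),
                   rng.normal([5, 0], 0.1, size=(20, 2))])

print(components_at_scale(cloud, 1.0))   # fine scale: the two blobs stay apart
print(components_at_scale(cloud, 10.0))  # coarse scale: everything merges
```

Tracking how the answer changes with `eps`, rather than fixing one scale, is exactly the move that sidesteps Kleinberg's axioms.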

To proceed to more complicated models
of shapes, we need first to introduce appropriate
language.

The underlying idea of topology is to reflect proximity
between points. Hence, to consider a set $X$ as a
topological space, we need to formalize the notion of
"being close to each other".

Proximity comes intrinsically with the notion of scale;
to account for all scales simultaneously one has to
codify the subsets of $X$ of points close to $x\in X$ on
different scales.

So, a neighborhood of $x$ is some subset of $X$. One might
ask for some degree of stability: (friend of a friend...)
- if $U$ is a neighborhood of $x$, and $y\in U$, then $U$
is also a neighborhood of $y$.

Such neighborhoods are
called *open*, and fixing a collection of open neighborhoods
(subject to some natural axioms)
is exactly the notion of topology on $X$.
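The "natural axioms" are short enough to state in full: a topology on $X$ is a collection $\tau$ of subsets of $X$ (the open sets) such that

$$
\emptyset, X\in\tau;\qquad
\bigcup_\alpha U_\alpha\in\tau \text{ for any family } U_\alpha\in\tau;\qquad
U\cap V\in\tau \text{ for any } U,V\in\tau.
$$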

One natural way to think about proximities is by
assigning a *distance* to pairs of points, and
the notion of *metric space* is a convenient way to fix a
topology.

The triangle
inequality satisfied by the distance function is, again,
a formalization of the "friend of my
friend..." intuition.

Open balls form a natural basis for the topology. The same topology can be encoded by different distance functions. So, if your algorithm for recovering a topological model results in a rigid metric, you might want to loosen your assumptions!
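A small numerical illustration of that last point: the Euclidean and the max-norm distances on the plane are different functions, but they bound each other within a factor $\sqrt{2}$, so every ball of one contains a ball of the other and the two metrics define the same topology.

```python
import numpy as np

rng = np.random.default_rng(2)

# Random pairs of points in the plane, compared under two distance functions.
P = rng.normal(size=(1000, 2))
Q = rng.normal(size=(1000, 2))

d_euclid = np.linalg.norm(P - Q, axis=1)  # L2 distance
d_max = np.abs(P - Q).max(axis=1)         # L-infinity distance

# Bi-Lipschitz bounds: d_max <= d_euclid <= sqrt(2) * d_max,
# hence the two metrics induce the same open sets, i.e. the same topology.
ok = (np.all(d_max <= d_euclid + 1e-12)
      and np.all(d_euclid <= np.sqrt(2) * d_max + 1e-12))
print(ok)  # True: the metrics are equivalent
```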

General topological spaces can be very pathological. Fortunately, the world surrounding us can be described using reasonable spaces. Useful classes are:

- Simplicial complexes: spaces assembled from simplices (glued by exactly the rules one would expect).

Simplicial complexes are nice because one can work with them
in a purely combinatorial way: just list all
collections of points that are vertices of a simplex.

This is the key property that enables effective
computing of the topological invariants.
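As an illustration of this combinatorial encoding (the complex below, the hollow boundary of a triangle, is chosen just for the example): list the simplices by their vertex sets and compute, say, the Euler characteristic.

```python
# A simplicial complex is just a collection of vertex sets,
# closed under taking subsets.
# Example: the boundary of a triangle (a combinatorial circle).
complex_ = [
    frozenset({0}), frozenset({1}), frozenset({2}),           # vertices
    frozenset({0, 1}), frozenset({1, 2}), frozenset({0, 2}),  # edges
]

# Euler characteristic: alternating sum of simplex counts by dimension.
chi = sum((-1) ** (len(s) - 1) for s in complex_)
print(chi)  # 3 vertices - 3 edges = 0, as expected for a circle
```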

- Manifolds: spaces locally modeled on Euclidean spaces. There are two ways to define them: via charts and atlases, or via the implicit function theorem: a manifold is given by a system of equations $$ M=\{F(x)=0\},\quad F\in C^k(\Real^m,\Real^n), $$ such that the Jacobian $DF$ has maximal rank on $M$.

If one allows degenerations, the zero sets can be arbitrarily involved closed subsets of $\Real^m$ - but if one restricts oneself to polynomial equations - and inequalities - then the resulting sets (called semialgebraic) are good for computing: both manifolds and semialgebraic sets can be triangulated, that is, they look like simplicial complexes.

Given two topological spaces, when do we deem them the same, i.e. representing in essence the same model? In the case of metric spaces, one could ask for a bijection that preserves distances - but this is obviously too restrictive. One might relax this to bi-Lipschitz bijections, or to some other control of the distortion.

For the topological spaces, one just asks for continuity: a bicontinuous bijection is called a homeomorphism.

Homeomorphism is sometimes still too rigid an equivalence for truly scale-independent model building. One might want to identify a circle and a solid torus if only the periodicity of the model is important.

To capture this, one introduces the notion of homotopy
equivalence.
Two (continuous) mappings $f,g:X\to Y$ are *homotopic* if
there exists a continuous family of continuous mappings
$H_t:X\to Y$ such that $H_0=f, H_1=g$.

We say that $X$ and $Y$ are homotopy equivalent if there
exist mappings
$$
X \underset{g}{\overset{f}{\rightleftarrows}} Y
$$
such that $f\circ g$ is homotopic to $id_Y$ and
$g\circ f$ is homotopic to $id_X$.
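For instance, the punctured plane $\Real^2\setminus\{0\}$ is homotopy equivalent to the circle $S^1$: take $f:S^1\to\Real^2\setminus\{0\}$ to be the inclusion and $g(x)=x/|x|$ the radial projection. Then $g\circ f=id_{S^1}$ exactly, while $f\circ g$ is homotopic to the identity via

$$
H_t(x)=\Big((1-t)+\frac{t}{|x|}\Big)\,x,\qquad H_0=id,\quad H_1=f\circ g;
$$

the coefficient $(1-t)+t/|x|$ is always positive, so $H_t(x)$ never hits the removed point.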

Homotopy equivalence is the most general equivalence
between models we will be dealing with.

Selecting a topological space as a model, when the
space itself is defined only up to some
equivalence, requires a supply of "fingerprints" of
topological spaces: invariants that survive the
equivalences.

Defining and computing the topological invariants is one
of the key occupations of algebraic
topologists.

Topological Data Analysis is in essence the
craft of defining and computing topological invariants for
point clouds and other data collections, in search of an
underlying parametrization-free model.

Homology groups of a topological space $X$, roughly,
account for the un-patchable holes. For that one takes
hole-less subspaces (cycles), and declares immaterial all those that
can be patched (boundaries).

It is instrumental to allow differences (and sums) of the
cycles - hence the resulting object is naturally an
Abelian group (or a vector space, for added convenience).

One needs quite a bit of preparatory work to define
homologies (and they come in quite a few sizes and
flavors, agreeing though on reasonable spaces).

The resulting homology groups (one in each dimension) are
denoted as
$H_k(\cdot), k=0,1,\ldots$,
and have many useful properties - in particular, they are
homotopy invariant.

One important aspect of homologies on simplicial
complexes is their completely combinatorial
description: the boundary operators become integer matrices. For instance,
for a complex with four vertices, five edges, and one (filled) triangle:

$$
\partial_1=\left(
\begin{array}{ccccc}
-1&0&0&1&0\\
1&-1&0&0&1\\
0&1&-1&0&0\\
0&0&1&-1&-1\\
\end{array}
\right);\quad \partial_2=\left(
\begin{array}{c}
0\\1\\1\\0\\1
\end{array}
\right)
$$
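A sketch of extracting the Betti numbers over the rationals from such boundary matrices, using the standard rank formula $b_k = \dim\ker\partial_k - \operatorname{rank}\partial_{k+1}$:

```python
import numpy as np

# Boundary matrices of a complex with 4 vertices, 5 edges, one triangle.
d1 = np.array([[-1,  0,  0,  1,  0],
               [ 1, -1,  0,  0,  1],
               [ 0,  1, -1,  0,  0],
               [ 0,  0,  1, -1, -1]])
d2 = np.array([[0], [1], [1], [0], [1]])

r1 = np.linalg.matrix_rank(d1)
r2 = np.linalg.matrix_rank(d2)

# Betti numbers: b_k = dim ker(d_k) - rank(d_{k+1}).
b0 = d1.shape[0] - r1         # d_0 = 0, so its kernel is all 4 vertex chains
b1 = (d1.shape[1] - r1) - r2  # independent 1-cycles minus patched ones
b2 = d2.shape[1] - r2         # dim ker(d_2), nothing above to quotient by

print(b0, b1, b2)  # 1 1 0: one component, one unpatched loop, no voids
```

(Over the integers one would use the Smith normal form instead of real ranks, to see the torsion as well; for this small example the ranks suffice.)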

- J.M. Kleinberg. An impossibility theorem for clustering, in NIPS, ed. by S. Becker, S. Thrun, K. Obermayer (MIT Press, Cambridge, 2002), pp. 446-453.
- G. Carlsson, F. Memoli. Classifying Clustering Schemes, Foundations of Computational Mathematics, Volume 13, Issue 2, pp 221-252, 2013
- G. Bredon. Topology and geometry. Vol. 139. Springer, 1993.
- J.R. Munkres. Elements of algebraic topology. Vol. 2. Reading: Addison-Wesley, 1984.
- J.R. Munkres. Topology: a first course. Vol. 23. Englewood Cliffs, NJ: Prentice-Hall, 1975.