$ \def\Int{\mathbb{Z}} \def\R{\mathbb{R}} \def\Real{\mathbb{R}} \def\X{\mathbb{X}} \def\Y{\mathbb{Y}} \def\N{\mathbb{N}} \def\Z{\mathbb{Z}} \def\e{\varepsilon} \def\h{\mathsf{H}} \def\F{\mathbb{F}} \def\dgm{\mathrm{Dgm}} \def\pers{\mathrm{pers}} \def\Pers{\mathrm{Pers}} \def\lip{\mathrm{Lip}} \def\amp{\mathrm{Amp}} \def\meas{\mathcal{P}} \def\Fr{Fr\'echet } \def\Cech{{\rm Čech}} \def\Rips{{\rm Rips}} \def\nodes{\mathbf{N}} \def\image{\mathtt{Im}\,} \def\ker{\mathtt{Ker}\,} \def\ph{\mathbf{P}H} \def\bbr{\mathtt{Br}} \def\bm{\mathtt{B}} \def\bmv{\bm^v} \def\phpp{\mathbf{b}} \def\dimph{\mathrm{dim}_{\ph}} \def\class{\mathcal{F}} $

persistence on random sets and functions

persistence for random samples

To understand the performance of the algorithms of TDA, it is important to investigate the model cases, where the samples are random.
A typical example is the result by Niyogi-Smale-Weinberger, a stochastic version of the results on the homotopy equivalence of off-sets of samples approximating a manifold. In essence, this result asserts that for balls of small enough (compared to the feature size) radius $r$, the sample of size $$n=\Theta_{k,\tau}\left[r^{-k}\left( \log{r}+|\log\delta|\right)\right] $$ is enough to have the union of balls sampled from the Lebesgue measure on a smooth submanifold $M$ of $\Real^D$ homotopy equivalent to $M$, with probability at least $1-\delta$.

homological concentration

More precise results concerning the short bars in random samples from manifolds are nontrivial. One can show that most of the short-living classes in dimension $k$ have the birth-death times near $$ r_{k,d}(n)\approx c_{k,d} n^{-1/d} $$ with last spurious classes holding for a $\log{n}$ longer time (Adler-Bobrowski, Kahle ...).
Remarkably, it seems that those short-living homologies are coming in waves, in increasing dimensions.

One can show that for large $d$, the range of $r$ for which homologies are present behaves like $d^{-1/2}$.

Another witness to this conjectured behavior is the Euler curve describing the Euler characteristics (alternating sum of ranks of homology groups) in random unions of balls:

crackle: persisting noise in persistence homology

Samples from a manifold are known to faithfully reconstruct the topology of the underlying space, at least when the samples lie exactly on the manifold: the Čech complex at a carefully chosen scale has the homotopy type of the manifold. Adding noise complicates matters.
In some situations, these complications can be overcome: for example if the noise is Gaussian.
If not, a new phenomenon arises, the crackle:

crackle for 1D exponential noise

A special case is of $0$-dimensional manifolds, where the highly connected core emerges. On the surface of this core protuberances flare.

Altogether, one has the following

As $N\to \infty$, the $\mathbf{P}H_0$ (supported on $[0,\cdot]$) converged on any compact subset of $\Real_+$ to a point process with the density $$ {e^{-y}}{(1-e^{-y})^2} $$

crackle for noise with exponential tails

Assume that

  • A manifold $M^m$ is embedded smoothly into $\Real^D$;
  • The scale $s$ "inflates" the manifold to $sM$ (so that its volume grows as $s^m$);
  • The sample size $N\approx \mu s^m$, where $\mu$ grows, but not too fast, $\log{\mu}\ll s$.

In this case (Adler-Bobrowski-Weinberger)
$$ \frac{\beta_k([x,y])}{{\mathrm{vol}(M)} \log{\mu}^{D-m-1}s^m}\to b_k(x,y). $$
Here $\beta_k[x,y]$ is the rank of persistent $k$-homology, and $b_k$ is a function depending on the dimension and codimension of $M$ only.

Crackle happens also for subexponential distributions: even more scales are involved.

persistence for Brownian trajectories

Let \[ \bbr:[0,1]\to \Real \] be a sample trajectory of the Brownian bridge (Brownian motion conditioned on $\bm(0)=\bm(1)=0$, or, equivalently, Brownian motion on the unit circle).

fractal nature of persistence homology for Brownian motion

Denote by $N_k^f(a,b)$ the number of bars in $k$-th persistent homology of $f$ overlapping the interval $[a,b]$, and by $B(x,y)=\mathbb{E} N_o^\bbr (x,y)$. Then

for $0 \leq x \leq y $ one has $$ B(x,y)=\sum_{n\geq 1} e^{-2(n(y-x)+x)^2} $$

When $x$ is close to $y$, the number of the bars overlapping $[x,y], x \gt 0$ is close to
$$ (y-x)^{-1}\sqrt{\pi/2} \mathtt{erfc}(2x) $$

fractal nature of persistence homology for Brownian motion

In other words, the density $\beta_0$ of the persistence point process near the diagonal is exploding like
$$ \beta_0^{\bbr}(x,y)\sim (y-x)^{-3} $$

fractal nature, cont'd

Similarly, for the Brownian motion with drift, $$ \bmv_t=\sigma\bm_t+vt, $$ the number of bars overlapping any given interval is finite a.s. and is geometrically distributed with parameter $$ \exp(-v(y-x)/\sigma^2). $$

As a corollary, if $f$ is smooth, then for the small perturbations of $f$ by the Brownian motion one has $$ \lim_{\sigma\to 0} \frac{1}{\sigma^2} \lim_{\Delta\to 0} \Delta \beta_0^{f+\sigma \bm}(x, x+\Delta)=\sum_{s:f(s)=x} \frac{1}{f'(s)}=\frac{f_*(dx)}{dx}. $$

In particular, the point process of short bars of a small perturbation of a polynomial $f$ fix this polynomial up to shifts and reflections.

Persistence dimension

Somewhat more generally, one can define the persistence dimension of a function $f$ on a manifold $M$ as $$ \dimph(f):=\inf\{k: \langle (y-x)^k \cdot \phpp \rangle \lt\infty\}, $$ and the persistence dimension of a class of functions $\class$ as the $\sup_{f\in \class} \dimph(f)$.

There is a general bound on the total $k$-persistence for $d$-dimensional simplicial complexes implying for them $$ \dimph(f)\leq d. $$
The results for the Brownian excursion show that it has the persistence dimension $2$...

applications of TDA

classical example: image textures

Classifying textures based on material is a well known and important problem.

Three scales of each of ten materials in the KTH-TIPS data set

Exemplars from SIPI Rotated Texture data set

Four exemplars from each of the 25 classes in the UIUCTex data set

The 61 materials in the CUReT data set

Classification of textures is typically done by studying distributions of image patches.

The topology comes into play through the observation done during the study of patches of natural images

Examples from the van Hateren Natural Images data set

It turned out (Carlsson et al) that most of the $3\times3$ patches concentrate around a 1-dimensional singular set ("three circles") spanned by two-dimensional cells forming a two-dimensional complex that has the topology of a Klein bottle.

Animation of an immersoin of the Klein bottle into 3-space (by Konard Polthier)

A model of the Klein bottle.

A lattice of patches in the Klein bottle model.

Using this fact and an appropriate orthonormal basis for $L_2(K)$ one can obtain a vector of estimated of Fourier coefficients of the distribution density of patches from a particular texture (Perea-Carlsson).

First 44 estimated $K$-Fourier coefficients for the texture of Bricks

The $K$-Fourier coefficients can be made rotation invariant. Combining this with patches of different size one obtains a rotation invariant, multiscale descriptor amenable to classical machine learning analyses.

applications: local topology of time series

Even observing a single trajectory of a dynamical system leaves room for applications of topology: one can detect periodicity, or quasiperiodicity (toric behavior), or capture a strange attractor.
Topological data analysis helps dramatically when there is no model for the dynamics, and the scale at which the fluctuations are expected to manifest themselves is unknown.

A typical area where such phenomena arise are modern power grids, increasingly supporting large number of volatile active elements.
Below is an illustration oft this concept: a simulation of a 30+ bus power network, subject to a disruption is shown. The measurements represent an aggregated output of a collection of Phasor Measurement Units (PMUs)

Given a trace, how to detect emergence of oscillations, how to classify the oscillations into stable and unstable, how to detect bifurcations, of the scale of the noise is not known in advance? Persistence homology helps.
The first step, however, is to lift the observed traces using the Takens' embedding:

If $\{x(t)\}_{t=\ldots, -1,0,1,\ldots}, x(\cdot)\in\Real^d$ is a time series, the window $w$ Takens embedding of $x$ is the time series in $\Real^{wd}$ given by

$$ x_T(t):=(x(t), x(t+1),\ldots,x(t+w-1)). $$

In this way we might want to detect bifurcations on an undefined scale.

emergence of oscillations

period doubling

An implementation of the persistence homology for the Takens embeddings of noisy time series shows that one need not rely on the eye-balling (which is possible only for carefully chosen projections into low-dimensional spaces), like in the plots below:

200 points of Takens embedding progressing in time
200 points of Takens embedding progressing in time

Here, the plot on the left clearly shows emergence of a limit cycle in the traces of a process after a period of transition; the right plot exhibits the period doubling bifurcation.

Persistence diagrams evolving in time
Persistence diagrams evolving in time

applications: dimension of the internet

Internet as a metric space

The structure of the Internet has been studies for some time now.

ASN network

More specifically, it is typically the network of the Aunotomous System Nodes whose structure is considered. It is a large-scale network consisting of large players, like ISPs, Universities, large companies and various unknown entities.

There about $2.5\times 10^4$ (self-registered) nodes, interacting using BGP.

does the Internet live in the hyperbolic plane?

Over the past few years, several authors (Boguñá, Papadopoulos, Krioukov ...) suggested that the Internet can be modelled on a sample from a hyperbolic space - more specifically, from a large disk in the hyperbolic plane.

Topological data analysis suggests a natural test for this conjecture: one should be able to check the dimension of a manifold from a dense enough sample from this manifold. If the model is realistic, the dimension of the Internet were $2$.

There is a standard way to measure the dimension of a manifold via local homology.

Recall that the (reduced) homology group of a sphere $S^k$ are $$ \bar{H}_l(S^k,\Int)=\left\{\begin{array}{lll}\Int, &\mathrm{if}& k=l,\\ 0 &\mathrm{otherwise.}&\\ \end{array}\right. $$

This allows one to recover the dimension of a manifold $M^m$ via checking the homologies $$ \bar{H}_*(M, M-\{x\}): $$ it should be of rank $1$ precisely in dimension $m$ and nowhere else.

Note that the dimension is a function of a point - a separate theorem asserts that it is a (locally) constant function.

One can also recognize the boundary points (in a manifold with boundary or, more generally, with corners): if $x\in\partial X$, $$ \bar{H}_*(M,M-x)\cong 0 $$ in all dimensions.

To recover the dimension of a manifold $M$ at a point $x$, we need to have a dense enough sample in a small ball around $x$. A "stochastic version" of Hausmann-Latschev's theorem tells us that in this case, the local homology will be equivalent those of a sphere.

This tell us what to expect from what to expect from local dimensions of the Rips complexes of the samples from the hyperbolic spaces $H^d$ (or any other manifold):

  • If one computes the homologies of $$ \bar{H}_\cdot (R(\epsilon,Y), R(\epsilon,Y-B(y,k\epsilon)), $$ then
  • the points "in the bulk" will have reduced homologies of a $d$-sphere;
  • the points near the boundary will have trivial homologies;
  • there will be some number of pathologies (points with neither trivial homologies nor those of $S^d$), but their fraction will be small.

Here is an impressionistic view of the ASN

And here are some plots

Cumulative homology rank Node index

1-st Homology rank Node index
2-nd Homology rank Node index

It seems quite clear that the ASN networks are not sampled from the hyperbolic space - at least not from a regular enough density.

It seems safe to say that the Internet does not resemble a (homological) manifold.

It does not mean that the hyperbolic spaces cannot be used to embed networks - but they definitely are bad as a template for the underlying geometry.

On the other hand, the local homology is a brand new characteristic of a node - we should investigate those.

whither TDA?

Topological Data Analysis is a young discipline, with a lot of appeal and dynamic trajectory.

Does it have indeed a future as one of the essential components of modern data science?

  • TDA is has a fascinating intellectual appeal: many problems are at the interface of algebraic topology, manifold learning, image and shape analysis, stochastic geometry...
  • On the dark side there is a tendency to ever increasing degrees of generalizations.
  • From the viewpoint of applications, there is a tremendous demand for the baseline, topological base models.
  • The delivery is hindered however by often prohibitive costs of homological computations.
  • Still, a TDA-based startup, Ayadsi, got its first VC funding two years ago.
  • It is dynamic area: a thematic year on applied topology at IMA just finished (TDA was a major focus area); a journal is established; TDA has standing presence at SIAG conferences, SoCG, SODA and STOC.

  • give it a try!


the end