Topological Data Analysis (TDA) serves as the core toolkit utilized by mathematicians in the field of Data Science.
TDA is primarily driven by the belief that topology and geometry offer a robust approach to extracting qualitative and sometimes quantitative information about the underlying structure of data. The objective of TDA is to establish mathematically rigorous, statistical, and algorithmic methods for inferring, analyzing, and leveraging the intricate topological and geometric structures present in data, often represented as point clouds (each dot of the pot) in Euclidean or more general metric spaces.
![]() |
![]() |
---|---|
The Torus shape given as a point cloud | A Pot in the Euclidean (3D) Space |
A metric space (M, ρ) is a set M with a function
- ρ(x, y) ≥ 0 and ρ(x, y) = 0 if and only if x = y,
- ρ(x, y) = ρ(y, x) and,
- ρ(x, z) ≤ ρ(x, y) + ρ(y, z).
#1
The input data is considered to be a finite collection of points, and there exists a
measure of distance or similarity between these points. This distance can be determined by the metric of the surrounding space, such as the Euclidean metric if the data is embedded in
The specification of the metric for the data is typically provided as an input or guided by the specific application. However, it is crucial to acknowledge that the choice of metric plays a significant role in uncovering noteworthy topological and geometric characteristics of the data
#2a
A continuous shape is constructed to emphasize the underlying topology or geometry of the data
| ![Pot](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/0f1d5b70-8583-4b3f-932b-115ebc509e54){width=40%} | |:---:| | *Topology: This looks like a pot!* |#2b
This shape is typically a simplicial complex or a sequence of nested simplicial complexes known as a filtration, which captures the data's structure at various scales.
| ![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/b12b29db-8b84-4332-8183-e5d50c09325b){width=40%} | |:---:| | *simplicial complex* | Image:wikipedia#2c
This shape is typically a simplicial complex or a sequence of nested simplicial complexes known as a filtration, which captures the data's structure at various scales.
| ![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/0f793745-5849-4a5b-a2eb-8cbabf97fefc) | |:---:| | Filtration | Image: Cavanna, N.J., Jahanseir, M. and Sheehy, D.R., 2015. A geometric perspective on sparse filtrations. arXiv preprint arXiv:1506.03797.#2d
Simplicial complexes can be seen as higher-dimensional extensions of neighborhood graphs commonly used in standard data analysis and learning algorithms.
The task at hand is to define these structures in a way that is mathematically proven to reflect meaningful information about the data's structure and can be efficiently constructed and manipulated in practical applications.
| ![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/7e204b00-cb60-48d7-a14b-79115b6a90f9){wdith=20%} | |:---:| Image: Meng, Z. and Xia, K., 2021. Persistent spectral–based machine learning (PerSpect ML) for protein-ligand binding affinity prediction. Science advances, 7(19), p.eabc5329.#3
The structures constructed on the data provide a means to extract topological or geometric information.It may involve generating rough summaries or approximations that require specific methods, such as persistent homology, to extract relevant information.
| ![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/0f793745-5849-4a5b-a2eb-8cbabf97fefc) | |:---:| | *What do you see in this filtration?* |#4
The topological and geometric information extracted offers novel sets of features and descriptors for the data. These features can enhance the understanding of the data, especially through visualization techniques. They can also be combined with other types of features to enable more comprehensive analysis and facilitate machine learning tasks. Furthermore, this information can be utilized to design data analysis and machine learning models that are well-suited to the specific characteristics of the data. An essential consideration at this stage is demonstrating the added value and complementarity of the information provided by TDA tools compared to other features.
Understanding how the extracted information complements existing features and showcases its advantages is a key aspect of this process.
#Simplicial complexes - 1
Simplicial complexes can be regarded as an extension of graphs to higher dimensions. They are mathematical entities that possess both topological and combinatorial characteristics, which makes them particularly valuable for Topological Data Analysis.
Two important aspects of these mathematical objects:
- Topological characteristics: Simplicial complexes have a topological nature, meaning they capture information about the connectivity and spatial relationships between their constituent simplices (e.g., vertices, edges, higher-dimensional faces). They provide a framework to study topological properties such as connectedness, boundaries, holes, and higher-dimensional structures.
- Combinatorial characteristics: Simplicial complexes have combinatorial properties, which deal with the combinatorial arrangements and relationships between their simplices. These properties involve how the simplices are combined, their intersections, and the inclusion relations between different simplices.
#Simplicial Complexes - 2
Given a set
Affinely independent points refers to a set of points in a vector space that do not lie on the same affine subspace, except when the set consists of a single point. In other words, a set of points
In layman's terms can be understood as a set of points that are not in a straight line or a flat plane.
#Simplicial Complexes - 3
Given a set
The convex hull of a set of points is the smallest convex shape or region that completely encloses all the points. In simpler terms, it is like finding the "outer boundary" that wraps around a given set of points.
Mathematically, an n-dimensional simplex is defined as the convex hull of (n + 1) affinely independent points in n-dimensional space. In other words, it is the smallest convex shape or region that contains these points. For example, a 1-dimensional simplex is a line segment, a 2-dimensional simplex is a triangle, a 3-dimensional simplex is a tetrahedron, and so on.
| ![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/6515b9b0-26f1-4cdf-a0fe-a372f11dffa2){width=60%} | |:---:| Image: Robert Laurini#Simplicial Complexes - 4
Given a set
Mathematically, an n-dimensional simplex is defined as the convex hull of (n + 1) affinely independent points in n-dimensional space. In other words, it is the smallest convex shape or region that contains these points. For example, a 1-dimensional simplex is a line segment, a 2-dimensional simplex is a triangle, a 3-dimensional simplex is a tetrahedron, and so on.
| ![image](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/2f6433c5-29e0-4fee-9681-b127e14b3d92){width=60%} | |:---:| Image: https://velog.io/@shlee0125#Simplicial Complexes - 5
A geometric simplicial complex K in
- Every face of a simplex in K is also a simplex in K.
- The intersection of any two simplices in K is either empty or a common face shared by both simplices.
The face of a simplex refers to a lower-dimensional simplex that is a subset of the original simplex. In other words, it is a subset of the vertices of the simplex that defines a smaller simplex within it.
#Simplicial Complexes - 6
For example, in a 3-dimensional simplex (tetrahedron), the faces are triangles, and in a 4-dimensional simplex, the faces are tetrahedra.
The faces of a simplex are simplices themselves, but of lower dimension.
| ![](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/7f9a35de-dd0c-4805-b3f1-c989035f8d15){width=40%} |#Simplicial Complexes - 7
The collection of simplices in a simplicial complex K forms a subset of
- When we say that a simplicial complex K can be regarded as a topological space based on its underlying space , we mean that we can associate a topological structure to K by considering the topology inherited from its underlying space.
- The underlying space of K is a subset of Rd that contains all the points corresponding to the vertices, edges, faces, and higher-dimensional simplices of K.
#Building simplicial complexes from data
There exist many ways to build simplicial complexes. We present here a few classical examples that are widely used in practice. An initial example is an immediate extension of the concept of an a-neighboring graph.
Suppose we have a set of points, X, in a metric space (M, p), and a real number α ≥ 0. The Vietoris-Rips complex, denoted as
Each circle (called a closed ball) has a data point at its center, and the circle radius is the distance from the data point
![image](https://private-user-images.githubusercontent.com/88677758/247305333-7b961b85-311a-459c-ab7a-f8de1f7c22fa.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxNzQ1ODMsIm5iZiI6MTczOTE3NDI4MywicGF0aCI6Ii84ODY3Nzc1OC8yNDczMDUzMzMtN2I5NjFiODUtMzExYS00NTljLWFiN2EtZjhkZTFmN2MyMmZhLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDA3NTgwM1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTIyZWM4OWU1MTBjYWFlZTc5OTNhYzA1Njk0ZjEzZTI5Y2MxNTdkMjA5MDc1NmY4NjY1ZDdlNGFiNDExNmI5OTUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.iJRrzzwJKt4ZPDNwNzd1ymm5cBfWewugWDbaTgn1LCI)
However, it's important to note that in general, even when X is a finite subset of $R^d$, $Rips_a(X)$ may not have a geometric realization in $R^d$. In other words, it may have a dimension higher than d.Can you figure out why?
Closely connected to the Vietoris-Rips complex is the Čech complex, denoted as
#Main Directions in TDA - simplified classification
Persistent Homology (PH): study of topological shapes that can fit into your data
| ![](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/e4d9bb48-f3cc-4db7-ae83-368a6e4e8d71){width=50%} | |:---:|Image: De Lara, M.L.D., 2023. Persistent homology classification algorithm. PeerJ Computer Science, 9, p.e1195.
TDAMapper: the overall shape of your data
- Both have filtrations and distances, but Mapper studies about how your entire data is shaped and deduces insights from that shape.
#What is Homology?
Homology, in a nutshell, is a fundamental concept in algebraic topology that provides a powerful tool for formalizing and analyzing topological features of spaces or simplicial complexes in an algebraic manner. It enables us to capture and quantify the presence of different-dimensional "holes" within a given space.
In homology, each dimension k corresponds to a vector space
Homology provides a rigorous framework for understanding and characterizing topological structures by associating vector spaces to each dimension, giving us valuable information about the presence and properties of various topological features within a space or simplicial complex.
In simple terms, we are defining topological structures (i.e., k-dimensional holes) and counting them in a simplicial complex (SC).
What are these holes?
- The term "dimensional holes" refers to the topological features captured by the homology groups of a simplicial complex.
- These holes represent voids or empty spaces of different dimensions within the data.
- In the context of TDA, the term "hole" does not refer to a literal physical hole but rather to a topological concept.
- A dimensional hole can be understood as a region or subspace of the data that is missing or empty, characterized by the absence of certain simplices or simplicial structures.
O-dimensional holes are connected components in a SC:
-
$K_1$ has 12 0-dimensional holes -
$K_2$ has larger balls, but there are still 12 holes. -
$K_3$ now has edges between simplices. There are 6 0-dimensional holes. -
$K_4$ has just one connected component; 1 0-dimensional hole.
1-dimensional holes are loops in a SC:
-
$K_1$ does not have a 1-dimensional hole -
$K_2$ has larger balls, but no 1-dimensional holes. -
$K_3$ now has edges between simplices, but no 1-dimensional holes. -
$K_4$ has a loop now; it has a 1-dimensional hole.
#Homology
To extract summaries of such topological features at a mesoscopic level, we use Betti numbers. Betti-p number of a simplicial complex C of dimension d, denoted by
![image](https://private-user-images.githubusercontent.com/88677758/247307957-8891bcb0-ae6f-47b0-879a-662c20fcafec.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxNzQ1ODMsIm5iZiI6MTczOTE3NDI4MywicGF0aCI6Ii84ODY3Nzc1OC8yNDczMDc5NTctODg5MWJjYjAtYWU2Zi00N2IwLTg3OWEtNjYyYzIwZmNhZmVjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDA3NTgwM1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWU3NGUyY2IzMDFkOTBhZjE5YTJjYTZjZGNjMDc0MGJiYTk0YzViMDhkODExYjNmNjQ4ZmI1MWVmZGE5OGQyN2EmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.EaWVHGfdGTTXaq_JslCkW7fghciJf3I7Yk19QcTPQ50)
#Persistent Homology
We can increase k as much as we want, however we do not see k>2 holes often in our experiments.
Formal definitions for the holes can be found in works such as
- Edelsbrunner, H. and Harer, J., 2008. Persistent homology-a survey. Contemporary mathematics, 453(26), pp.257-282.
- Otter, N., Porter, M.A., Tillmann, U., Grindrod, P. and Harrington, H.A., 2017. A roadmap for the computation of persistent homology. EPJ Data Science, 6, pp.1-38.
#What is Persistent in PH?
Persistent homology in a nutshell:
- Given X data points in
$R^d$ - Find pairwise distances
-
Use a filtration threshold on distances from 0 to maximum
a. Create simplices and a simplicial complex out of them,
b. Identify certain topological features - what are they? We need to talk about shapes to answer this question.
c. Identify which holes persist (i.e., born and survive) - Use the vectorization in ML tasks
#What is filtration?
A filtration of a simplicial complex K is a collection of subcomplexes $(K_r){t\in T}$ that are nested, meaning that for any $r_0$ and rp in the set T, if r is less than or equal to $r_0$, then $K{r0}$ is a subset of Kro.
The entire complex K is obtained by taking the union of all the subcomplexes
#Persistence Diagrams
In practical situations, the parameter
We alternate between scale and distance to refer to r. We mostly use the term distance and use Є or α to refer to the distance threshold in our articles.
Filtration tracks the appearance of shapes along the distance threshold α.
The information gathered through filtration is stored in persistence diagrams (PDs), which consist of a multiset of points supported on the open half-plane
#PD for 1-dimension holes
| ![](https://github.com/SelimC06/TDA-on-Graphs/assets/88677758/76649d35-389d-413b-9608-c0276e6f39ca){width=60%} | |:---:|Krishnapriyan, A.S., Montoya, J., Haranczyk, M., Hummelshøj, J. and Morozov, D., 2021. Machine learning with persistent homology and chemical word embeddings improves prediction accuracy and interpretability in metal-organic frameworks. Scientific reports, 11(1), p.8888.
#Vectorizations for 0 and 1-dimensional holes
![image](https://private-user-images.githubusercontent.com/88677758/247310452-98675fae-b2c1-40f3-9176-35e565de0ab9.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxNzQ1ODMsIm5iZiI6MTczOTE3NDI4MywicGF0aCI6Ii84ODY3Nzc1OC8yNDczMTA0NTItOTg2NzVmYWUtYjJjMS00MGYzLTkxNzYtMzVlNTY1ZGUwYWI5LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTAlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjEwVDA3NTgwM1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWJkNmNkNWJkOTY0MjVhYTlkODk5OTk5Y2Y5ZWJmYzEwZGI0YjE5MWE1YTQyN2VjYmQ2NThlNDU3MDg2Y2I4OTcmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.0ge7xg1s0KyoAfUbNEQ3Ly-adbVWmALfNGk9QQBvz_I)
Next Part