This repository contains the code for the paper on the application of the Cartesian quadtree to the Barnes-Hut approximation for hyperbolic t-SNE. It is a fork of the original repository for the polar quadtree implementation, which is distributed under the MIT license, and therefore also contains code unrelated to the Cartesian quadtree.
To set up the package, perform the following steps:
- Install conda (we recommend using miniconda)
- Create environment:
conda create --name=htsne python=3.9.16
- Activate environment:
conda activate htsne
- Install dependencies with pip:
pip install -r requirements.txt
- Build Cython extensions:
python setup.py build_ext --inplace
- Install hyperbolic-tsne package:
pip install .
- To test the installation, run
python -c "from hyperbolicTSNE import HyperbolicTSNE"
No errors should be raised, and you should see the output
Please note that 'empty_sequence' uses the KL divergence with Barnes-Hut approximation (angle=0.5) by default.
- To reproduce the experiments and figures from the paper, run the scripts in
experiments_and_plots
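The installation check from the steps above can also be scripted, for example as a quick guard at the top of an experiment script. The following is a minimal sketch using only the standard library; the helper name `is_importable` is hypothetical and not part of this package.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    name = "hyperbolicTSNE"
    if is_importable(name):
        print(f"{name} is installed")
    else:
        print(f"{name} was not found; re-run the installation steps")
```

Note that this only locates the package. It does not load the compiled Cython extensions, so the one-line import shown in the steps remains the more thorough check.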
Note 1: On macOS, the build process of the Cython extensions might yield an error if it cannot find OpenMP. This error can be ignored and the package will still be correctly installed and able to run. The main consequence of this error is that the optimization iterations run slower.
To run with either the polar or the Cartesian quadtree, set polar_or_cartesian to "polar" or "cartesian", respectively.
See the examples in code.py.
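Because the setting is a plain string, a typo such as "cartesion" is easy to miss. The sketch below shows one way to validate the value before passing it on. The helper `check_quadtree_choice` is hypothetical and not part of the package; where exactly polar_or_cartesian is passed is shown in code.py, not here.

```python
# The two values named in this README.
VALID_QUADTREES = ("polar", "cartesian")


def check_quadtree_choice(polar_or_cartesian):
    """Return the value unchanged if it is valid, otherwise raise ValueError."""
    if polar_or_cartesian not in VALID_QUADTREES:
        raise ValueError(
            f"polar_or_cartesian must be one of {VALID_QUADTREES}, "
            f"got {polar_or_cartesian!r}"
        )
    return polar_or_cartesian
```

Calling `check_quadtree_choice("cartesian")` returns the string unchanged, while any other spelling fails immediately with a readable message.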
You can run hyperbolic t-SNE on your own high-dimensional data.
However, the examples and experiments in this repository rely on specific datasets.
Below, we provide download links for each.
We recommend putting all datasets in a datasets directory at the root of this repository.
The load_data function expects this path (data_home) to resolve the dataset.
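As a sketch of the recommended layout, the snippet below builds the path to the datasets directory and fails early if it is missing. The helper `resolve_data_home` and its `create` flag are hypothetical and not part of the package; only the directory name and the data_home term come from the text above.

```python
from pathlib import Path


def resolve_data_home(repo_root=".", create=False):
    """Return the absolute path of <repo_root>/datasets.

    If create is True, the directory is created when missing; otherwise a
    FileNotFoundError is raised so that a wrong path fails early.
    """
    data_home = Path(repo_root).resolve() / "datasets"
    if create:
        data_home.mkdir(parents=True, exist_ok=True)
    elif not data_home.is_dir():
        raise FileNotFoundError(f"Expected a datasets directory at {data_home}")
    return data_home
```

The returned path is what this README calls data_home. How it is handed to load_data is best taken from the examples in the repository.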
Individual instructions per dataset:
There are two ways of getting started with the hyperbolicTSNE package.
First, code.py offers a step-by-step guide showing how to use the HyperbolicTSNE package to embed a high-dimensional dataset.
This folder contains three types of files:
- Scripts to generate experimental data by embedding different datasets into hyperbolic space. These are prefixed with "data generation".
- Scripts to create plots from the data, as they appear in the publication.
- Scripts to create tables from the data, as they appear in the publication.
The general workflow to reproduce the results from the paper is:
- Run the scripts to generate data.
- Run the scripts to plot the data.
- Run the scripts to generate tables.
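The three-step workflow above could be automated roughly as follows. This is a sketch under stated assumptions: the file-name prefixes in `PHASE_PREFIXES` are guesses (only a "data generation" prefix is mentioned in this README, and its exact spelling is not given), and running each script from inside the folder is also an assumption. Adjust both to the actual repository layout.

```python
import subprocess
import sys
from pathlib import Path

# Assumed file-name prefixes, in execution order; adjust to the actual names.
PHASE_PREFIXES = ("data_generation", "plot", "table")


def order_scripts(script_names, prefixes=PHASE_PREFIXES):
    """Sort script names by phase, keeping only names that match a prefix."""
    ordered = []
    for prefix in prefixes:
        ordered.extend(sorted(n for n in script_names if n.startswith(prefix)))
    return ordered


def run_workflow(folder="experiments_and_plots"):
    """Run each matching script with the current interpreter, stopping on failure."""
    names = [p.name for p in Path(folder).glob("*.py")]
    for name in order_scripts(names):
        # check=True raises CalledProcessError if a script exits with an error.
        subprocess.run([sys.executable, name], cwd=folder, check=True)
```

Separating `order_scripts` from `run_workflow` makes it possible to inspect the planned order before launching anything, which is useful because the data generation step can take a long time.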
Note that the data generation scripts assume a top-level folder, i.e., a folder next to "examples", "experiments", etc., called "datasets" that holds the datasets to be embedded.
The source code in this repository is released under the MIT License. However, all used third-party software libraries are governed by their respective licenses. Without the following libraries, this project would have been considerably harder: scipy, numpy, scikit-learn, hnswlib, pandas, anndata, seaborn, setuptools, Cython, tqdm, ipykernel.