This repository contains Jupyter notebooks to analyse the results of the experimental evaluation of kit-algo/distributed_clustering_thrill.
If you want to explore our data on your own you can download it here.
This archive contains a bunch of json
files in which the output from the experiments (runtimes, quality scores, comparison scores) is stored in a database-like way.
Our notebooks assume that the content of this archive live in a folder next to this repository, but you can change the paths if you put it somewhere else.
To run the notebooks cd
into the notebooks
folder and run jupyter notebook
(or jupyter lab
).
See our existing notebooks on how to work with the data.
We also plan to make the actual raw data of our experiments available (input graphs, obtained clusterings, raw output of the programs), but so far we haven't found a good place to host this 1TB blob of data 🙈. You can of course contact us to obtain it directly.
The notebooks use pandas
for data analysis.
We used v0.20.3 but anything above and some below should do as well.
For plotting, matplotlib 2.0
and seaborn 0.8
were used.
Finally, you need jupyter
in version 1.0.0
or above.
The graph generation script makes use of Networkit. To be able to write our binary graphs you will need at least version 4.6.
Our massive weak scaling graphs were generate using a external memory LFR generator. If you obtained it you can recreate our graphs like so:
for x in 5 6 7 8 9
do
n=$[10**x]
echo $n
./pa_lfr -b 100Gi -s 1165388768 -n $n -c $[n/50] -i 50 -a 10000 -x 50 -y 12000 -z -1 -m 0.4 -o <graph_output_path>/graph_50_10000_mu_0.4_${n}-sorted.bin -p <griund_truth_output_path>/LFR/part_50_10000_mu_0.4_${n}-sorted.bin 2>&1 > <log_path>/graph_50_10000_mu_0.4_${n}-sorted-seq.log
done
Smaller graphs can also be generated with networkit.