This project implements the K-nearest-neighbors (KNN) classification algorithm in Java under the MapReduce framework, runs it on a Hadoop cluster built on virtual machines, and verifies its correctness on two small-scale datasets. For a detailed description of this project, please read the following documentation:
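The K-Nearest Neighbors classifier is a non-parametric model. Its basic idea is: given the training data set Train and a test case $x$, find the $k$ training cases closest to $x$ under a chosen distance metric and assign $x$ the label that occurs most often among them.
The KNN classifier can be formulated as follows (written here in standard notation):
$$\hat{y} = \arg\max_{c} \sum_{(x_i, y_i) \in N_k(x)} \mathbb{1}[y_i = c]$$
This formula shows that given a test case $x$ and its neighborhood $N_k(x)$, the set of the $k$ training cases in Train nearest to $x$, the predicted label $\hat{y}$ is the class $c$ that receives the most votes within $N_k(x)$.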
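As a point of reference, the single-machine decision rule can be sketched in a few lines of Java. This is a minimal illustration, assuming Euclidean distance and simple majority voting; the class name `KnnSketch` and its methods are hypothetical and are not taken from this project's source.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of the project's source tree.
class KnnSketch {

    // Squared Euclidean distance. The square root is skipped because it
    // does not change the ordering of the neighbors.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Predicts the label of `query` by majority vote among its k nearest
    // training cases.
    static String classify(double[][] trainX, String[] trainY, double[] query, int k) {
        Integer[] order = new Integer[trainX.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort training indices by distance to the query.
        // A full sort is O(n log n), which is acceptable for a sketch.
        Arrays.sort(order,
                Comparator.comparingDouble((Integer i) -> squaredDistance(trainX[i], query)));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (int r = 0; r < Math.min(k, order.length); r++) {
            String label = trainY[order[r]];
            int count = votes.merge(label, 1, Integer::sum);
            // Strict comparison: on a tie, the label that reached the
            // winning count first (via nearer neighbors) is kept.
            if (count > bestCount) {
                bestCount = count;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] x = {{0, 0}, {0, 1}, {5, 5}, {6, 5}, {5, 6}};
        String[] y = {"a", "a", "b", "b", "b"};
        System.out.println(classify(x, y, new double[] {0.2, 0.1}, 3)); // prints "a"
    }
}
```

Every prediction scans the whole training set, which is the cost that motivates distributing the computation.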
Since the KNN classifier must compute the distance between each test case and every training case, classifying a single test case takes time linear in the size of the training set, so the algorithm scales poorly to big data. With the help of the Hadoop Distributed File System (HDFS) and the parallel computing framework MapReduce, we can accelerate the KNN classification algorithm.
The basic idea of the MapReduce KNN classifier is to distribute the training data across the servers and compute the distances between the training instances and the test instance in parallel. Because the distance from each training instance to the test instance is computed independently of the others, the problem fits the MapReduce framework well. The approach therefore achieves good speedup and is easy to understand.
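The map and reduce roles described above can be sketched without a cluster. The following is a minimal in-memory simulation, not the project's actual Mapper and Reducer classes, and it does not use the Hadoop API. It assumes that each map task emits only the `k` candidates nearest to the test instance from its own split and that a single reduce step merges them and votes; the project itself may divide the work differently. All names (`MapReduceKnnSketch`, `mapSplit`, `reduce`, `Neighbor`) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory simulation of the map and reduce phases.
class MapReduceKnnSketch {

    // A (distance, label) pair: the intermediate value a map task would emit.
    static final class Neighbor {
        final double distance;
        final String label;

        Neighbor(double distance, String label) {
            this.distance = distance;
            this.label = label;
        }
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // "Map" phase, one call per data split: returns the (at most) k training
    // cases of this split that are nearest to the query.
    static List<Neighbor> mapSplit(double[][] splitX, String[] splitY, double[] query, int k) {
        List<Neighbor> local = new ArrayList<>();
        for (int i = 0; i < splitX.length; i++) {
            local.add(new Neighbor(squaredDistance(splitX[i], query), splitY[i]));
        }
        local.sort(Comparator.comparingDouble((Neighbor n) -> n.distance));
        return new ArrayList<>(local.subList(0, Math.min(k, local.size())));
    }

    // "Reduce" phase: merges the candidates of all splits, keeps the global
    // k nearest, and takes a majority vote over their labels.
    static String reduce(List<List<Neighbor>> mapOutputs, int k) {
        List<Neighbor> all = new ArrayList<>();
        for (List<Neighbor> output : mapOutputs) {
            all.addAll(output);
        }
        all.sort(Comparator.comparingDouble((Neighbor n) -> n.distance));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (Neighbor n : all.subList(0, Math.min(k, all.size()))) {
            int count = votes.merge(n.label, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = n.label;
            }
        }
        return best;
    }

    // Driver: runs the map step on every split, then the reduce step.
    // The loop is sequential here; on a cluster the map tasks run in parallel.
    static String classify(double[][][] splitsX, String[][] splitsY, double[] query, int k) {
        List<List<Neighbor>> outputs = new ArrayList<>();
        for (int s = 0; s < splitsX.length; s++) {
            outputs.add(mapSplit(splitsX[s], splitsY[s], query, k));
        }
        return reduce(outputs, k);
    }

    public static void main(String[] args) {
        double[][][] splitsX = {{{0, 0}, {5, 5}}, {{0, 1}, {6, 5}}, {{5, 6}}};
        String[][] splitsY = {{"a", "b"}, {"a", "b"}, {"b"}};
        System.out.println(classify(splitsX, splitsY, new double[] {0.2, 0.1}, 3)); // prints "a"
    }
}
```

Keeping only the `k` nearest candidates per split is safe because any training case among the global `k` nearest is also among the `k` nearest within its own split, and it keeps the data passed to the reduce step small.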
This project uses Maven as its build tool, so the build process is automated. The commands to build the project are as follows (taking the Ubuntu operating system as an example):
$ sudo apt-get install maven
$ cd hadoop-knn-classifier && mvn package
In addition to building on the command line, this project can also be built with mainstream IDEs (such as Eclipse or IntelliJ IDEA): just import it into the IDE as a Maven project.
To run the classifier and the experiments of this project, please refer to the following scripts:
data/iris/run-demo.sh
data/iris/run-exp.sh
data/iris/run-finetune.sh
data/iris/upload-data.sh
If you find our work useful in your research, please cite us as:
@misc{cong_hadoop-based_2021,
title = {A {Hadoop}-based {MapReduce} $k$-nearest-neighbors {Classifier}},
shorttitle = {hadoop-knn-classifier},
url = {https://github.com/cgsdfc/hadoop-knn-classifier.git},
abstract = {We implemented the K-nearest-neighbors classifier algorithm under the MapReduce framework in Java and verified its correctness on two small-scale datasets. The basic idea of the MapReduce KNN classifier is to distribute the training data to each server and calculate the distance between the training instance and the test instance at the same time. Since the distance between different training instances and the test instance is calculated independently of each other, it conforms to the characteristics of the MapReduce framework. Thus it can achieve good acceleration and is easy to understand.},
author = {Cong, Feng},
year = {2021},
}