The dataset is available in the dataset folder and is encoded in the ArnetMiner v8 format.
To unzip it, join the split archive parts and extract the result:
cat merged-dataset-v8-splitted.z* > merged-dataset-v8-splitted-complete.zip
unzip merged-dataset-v8-splitted-complete.zip
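If cat is not available (for example on Windows), the joining step can be sketched in Python. This is a minimal sketch, not part of the repository: the helper name concatenate_parts is hypothetical, and it assumes the parts sort correctly by file name (.z01, .z02, ..., .zip). Extraction is still left to unzip, since Python's zipfile module may reject archives that were created as multi-part files.

```python
import glob
import shutil


def concatenate_parts(pattern, output_path):
    """Join every file matching `pattern`, in sorted name order, into `output_path`.

    Hypothetical helper: mirrors `cat parts* > output` byte for byte.
    """
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(output_path, "wb") as out:
        for part in parts:
            # Stream each part so large archives are never fully loaded in memory.
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)
    return parts


# Example call, using the file names from the commands above:
# concatenate_parts("merged-dataset-v8-splitted.z*",
#                   "merged-dataset-v8-splitted-complete.zip")
```

The output name does not match the input pattern, so the joined file is never read back as one of its own parts.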
Install the dependencies:
pip install -r requirements.txt
Download the NLTK data:
a) open a Python shell
python
b) import nltk and download the lemmatizer (wordnet) and tokenizer (punkt) data
import nltk
nltk.download('wordnet')
nltk.download('punkt')
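The same download can also be done from a script instead of an interactive shell. The following is a minimal sketch: the function name download_nltk_data and its download parameter are hypothetical, added only so the logic can be exercised without network access. It assumes nltk.download returns a truthy value on success.

```python
NLTK_PACKAGES = ("wordnet", "punkt")


def download_nltk_data(packages=NLTK_PACKAGES, download=None):
    """Download each NLTK package and report which downloads succeeded.

    `download` defaults to nltk.download; passing another callable is a
    hypothetical hook for dry runs and tests.
    """
    if download is None:
        import nltk  # imported lazily so the module loads without nltk installed
        download = nltk.download
    return {name: bool(download(name)) for name in packages}


# Example call (requires nltk and network access):
# status = download_nltk_data()
# missing = [name for name, ok in status.items() if not ok]
```

Any package listed in missing would need to be downloaded again before running the ingestion.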
Install python3-dev:
sudo apt-get install python3-dev
- Complete the config.py file with the requested information.
- Ingest the dataset by executing:
data_ingestion.main()
- Run the notebook:
jupyter notebook notebook
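The ingestion step above, data_ingestion.main(), can be called from a Python shell or from a small runner script. The sketch below is one possible way to do it, assuming it is run from the directory that contains data_ingestion.py and the completed config.py; the run_ingestion helper is hypothetical and not part of the repository.

```python
import importlib


def run_ingestion(module_name="data_ingestion"):
    """Import the ingestion module by name and call its main() function.

    Assumes the module is importable from the current working directory
    and that main() takes no arguments, as in the step above.
    """
    module = importlib.import_module(module_name)
    return module.main()


# Example call, from the repository root:
# run_ingestion()
```

If the import fails, the working directory is the first thing to check, since Python only finds data_ingestion when its folder is on the module search path.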