This repository contains the code underlying our term paper Meta-Science & Evaluation: Trend Prediction.
The volume of scientific work and related publications has increased sharply over the last two decades. This flood of text makes it harder for academia and industry to evaluate the quality and importance of individual papers and to participate in this field of research. The ever-advancing power of NLP can help us process these texts more efficiently than ever before. With this work, we contribute to the ongoing research in metascience and scientometric analysis. In a first step, we derive text embeddings to create unsupervised topic clusters of recent publications and use their citation counts to train a DNN that then forecasts relevant topic spaces for new or future research.
Create a new virtual environment:
$ python -m venv venv
Activate the environment:
$ source venv/bin/activate
Install the required Python packages:
$ pip install -r requirements.txt
To view the Jupyter notebooks (.ipynb), run:
$ jupyter-notebook
Download the data archives from https://hessenbox.tu-darmstadt.de/getlink/fiB2mrQTRZTrjWcmCVySL58H/.
We provide a full data set with all results for papers of the ACL Anthology from 1990 to 2020.
- Extract data.zip into the project folder. It contains the complete ACL Anthology as well as the anthology filtered by the years and conferences we use, with all additional information such as links to files and topics (anthology_conferences.csv).
- Extract pdfs.zip into the new data/ folder to add the full paper pdfs.
- Extract json.zip into the data/ folder to add the papers parsed with science-parse, structured in JSON format.
- Extract embeddings.zip into the data/ folder to add all tested embeddings created with SentenceBERT and different pretrained models.
- Extract semantic_scholar.zip into the data/ folder to add information about papers and authors fetched from Semantic Scholar.
- Extract clusters.zip into the data/ folder to add the intermediate and final clustering results. Our best and final clustering is saved in final_best_one_clustering.json (loaded in the sketch below).
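If you only want to inspect the final clustering, you can load this file directly. The sketch below is an assumption based on the description later in this README (a mapping from cluster index to paper indices); the exact key layout in your copy may differ.

```python
import json

# Load the final clustering result (assumed layout: a mapping from
# cluster index to the paper indices belonging to that cluster).
with open("data/clusters/final_best_one_clustering.json") as f:
    clustering = json.load(f)

for cluster_id, paper_ids in clustering.items():
    print(f"cluster {cluster_id}: {len(paper_ids)} papers")
```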
The data is based on the ACL Anthology. We use additional sources to enrich this basis, e.g. with the topics of papers. Run the following scripts:
- parse_data.ipynb: Downloads and filters the anthology, downloads the papers' PDFs, and adds abstracts from the parsed PDFs (see next point).
- parse_pdf.sh: Parses the papers' PDFs with science-parse by AllenAI to get the abstracts.
- parse_semanticscholar.ipynb: Downloads Semantic Scholar information about papers and authors, and adds topics to the anthology entries.
- parse_cso_classifier.ipynb: Adds topics to each anthology entry based on the abstract, using the Python library of the CSO classifier (see the sketch after this list).
- embeddings.ipynb: Creates the embeddings of the papers' titles and abstracts with SentenceBERT that are used for clustering.
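For orientation, a rough sketch of the CSO classification step. This is not code from parse_cso_classifier.ipynb; it assumes the cso-classifier package in its 3.x API (a CSOClassifier class with a run() method), so check the version pinned in requirements.txt.

```python
from cso_classifier import CSOClassifier

# Assumed cso-classifier 3.x API: classify one paper's title/abstract into CSO topics.
paper = {
    "title": "An example anthology title",
    "abstract": "An example abstract about neural machine translation ...",
    "keywords": "",
}

cc = CSOClassifier(modules="both", enhancement="first")
topics = cc.run(paper)  # syntactic, semantic, union and enhanced topic lists
print(topics)
```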
We use clustering based on the embeddings to group papers that share topics. The following steps describe how we found the most appropriate algorithm. In the end, we use K-Means clustering with 20 clusters. The embeddings are based on the pretrained model paraphrase-distilroberta-base-v2 with titles as input.
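A minimal sketch of this final setup, assuming the sentence-transformers and scikit-learn packages are installed; the CSV path and the "title" column name are assumptions, and the actual pipeline lives in embeddings.ipynb and clustering.ipynb.

```python
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Load the filtered anthology (path and column name are assumptions,
# adjust them to your extracted data).
anthology = pd.read_csv("data/anthology_conferences.csv")
titles = anthology["title"].tolist()

# Encode the titles with SentenceBERT (paraphrase-distilroberta-base-v2).
encoder = SentenceTransformer("paraphrase-distilroberta-base-v2")
embeddings = encoder.encode(titles, show_progress_bar=True)

# Cluster the title embeddings into 20 topic clusters with K-Means.
kmeans = KMeans(n_clusters=20, random_state=0)
labels = kmeans.fit_predict(embeddings)  # one cluster index per paper
```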
- clustering.ipynb: Runs an extensive search over different clustering algorithms (see clustering_algorithms.py), runs a second extensive search over the algorithm/configuration pairs filtered in clustering_evaluation.ipynb, and runs the final best clustering.
- clustering_evaluation.ipynb: Manually filters the best algorithm/configuration pairs after the first and second extensive search, using the evaluation metrics defined in clustering_metrics.py.
- cluster_presentation.ipynb: Search for the clusters that best match given keywords/topics and create plots for the clusters showing the development of citations and papers in the past and the future predicted by the DNN. Figures are stored in the folder figures/.
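Conceptually, the keyword search can be thought of as comparing an embedded query keyword against the K-Means cluster centroids and picking the most similar cluster. The following is a hedged illustration of that idea, not the code in cluster_presentation.ipynb.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-distilroberta-base-v2")

def most_similar_cluster(keyword: str, centroids: np.ndarray) -> int:
    """Return the index of the cluster centroid most similar to the keyword.

    centroids: array of shape (n_clusters, dim), e.g. kmeans.cluster_centers_.
    """
    query = encoder.encode([keyword])[0]
    sims = centroids @ query / (np.linalg.norm(centroids, axis=1) * np.linalg.norm(query))
    return int(np.argmax(sims))

# Example: most_similar_cluster("machine translation", kmeans.cluster_centers_)
```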
The model predicts the citation count of given papers for the next five years (see the illustrative sketch below), taking as input:
- the embedding (SentenceBERT, paraphrase-distilroberta-base-v2),
- the age of the paper since publication,
- the accumulated h-indices of all authors, and
- the number of authors.
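The README does not specify the network itself, so the following is only an illustrative Keras sketch of a regressor over these four inputs; the layer sizes and the exact output format are assumptions.

```python
from tensorflow import keras

EMB_DIM = 768               # size of a paraphrase-distilroberta-base-v2 embedding
N_FEATURES = EMB_DIM + 3    # + paper age, accumulated h-index, number of authors

# Illustrative architecture only: a small feed-forward regressor that outputs
# a citation count for each of the next five years.
model = keras.Sequential([
    keras.Input(shape=(N_FEATURES,)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(5),
])
model.compile(optimizer="adam", loss="mse")
```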
To create the model, data/semantic_scholar/papers/ must exist and contain one JSON file for each paper. Additionally, to assign the per-paper predictions to the clusters, data/clusters/final_best_one_clustering.json has to contain a mapping from cluster index to paper index.
Run:
$ python predict_citations_next_few_years.py
This program writes intermediate results to the cache/ folder and the resulting model (saved via keras.callbacks.ModelCheckpoint) to model/best_model.hdf5, alongside a figure showing how the training and development loss evolve during training.
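To reuse the trained model without rerunning the script, the checkpoint can be loaded with the standard Keras API; a minimal sketch (the feature layout must match the inputs described above):

```python
from tensorflow import keras

# Load the best checkpoint written by predict_citations_next_few_years.py.
model = keras.models.load_model("model/best_model.hdf5")
model.summary()

# predictions = model.predict(features)  # features: embedding + age + h-index + author count
```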