Custom Tokenization and NLI Performance

By:

Yme Boland
Shane Siwpersad
Eduard Köntöş
Hugo Voorheijen
Julius Bijkerk

Project Overview

This repository is part of a group project for the Logic and Language course at Utrecht University, taught by Dr. Lasha Abzianidze. We explore the impact of different tokenization techniques on Natural Language Inference (NLI), using pre-trained models such as BERT, RoBERTa, and other (foundational) models.

Objectives

Investigate how different tokenization strategies affect NLI model performance.
Use the SNLI dataset (and possibly other NLI datasets, such as MNLI) for testing and comparison.
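
To make the first objective concrete, here is a minimal sketch of what comparing tokenization strategies can look like. It is not the code used in this repository: the function names, the toy vocabulary, and the example sentence are all hypothetical and chosen for illustration only. It contrasts plain whitespace tokenization with a greedy longest-match subword split, and includes a small accuracy helper of the kind one might use to score NLI predictions produced under each strategy.

```python
from typing import List, Sequence, Set


def whitespace_tokenize(text: str) -> List[str]:
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def greedy_longest_match(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into vocabulary pieces, always taking the longest match first."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matches at this position, so the whole word is unknown.
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


def subword_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Apply the greedy longest-match split to every whitespace token."""
    tokens: List[str] = []
    for word in whitespace_tokenize(text):
        tokens.extend(greedy_longest_match(word, vocab))
    return tokens


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predicted NLI labels that equal the gold labels."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical toy vocabulary and sentence, for illustration only.
TOY_VOCAB = {"a", "man", "is", "play", "ing", "guitar", "s"}
premise = "A man is playing guitars"

print(whitespace_tokenize(premise))
print(subword_tokenize(premise, TOY_VOCAB))
```

In an actual experiment, the same premise and hypothesis pairs would be tokenized under each strategy, passed to the same NLI model, and the resulting label predictions scored with a function like `accuracy` so that the strategies can be compared on equal footing.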

Repository Structure

Main folder: LoLaTokenization

Getting Started

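The Jupyter notebooks in this repository (for example eval.ipynb, eval_accuracy.ipynb, shap_importance.ipynb, and tokenization_verification.ipynb) can be downloaded individually.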

Contributions

Work on assigned tasks in personal branches. Push updates regularly for review in weekly meetings.

Notes

This README will be updated as the project progresses to reflect more specific details and instructions.

Folders and files

.ipynb_checkpoints
plots
results
tokenization_methods
undertrained
.DS_Store
.gitignore
README.md
accuracy.py
custom_models.py
custom_tokenizer_abstract.py
eval.ipynb
eval_accuracy.ipynb
eval_accuracy_greedy_length.ipynb
eval_all.sh
eval_scripts.py
greedy_results_snli.json
nli-MiniLM2-L6-H768_multinli_1.0_dev_matched.pickle
nli-MiniLM2-L6-H768_multinli_1.0_dev_mismatched.pickle
nli-MiniLM2-L6-H768_snli_1.0_test.pickle
nli-MiniLM2-L6-H768_snli_1.0_test_contradiction_.png
nli-MiniLM2-L6-H768_snli_1.0_test_entailment_.png
nli-MiniLM2-L6-H768_snli_1.0_test_neutral_.png
nli-distilroberta-base_multinli_1.0_dev_matched.pickle
nli-distilroberta-base_multinli_1.0_dev_mismatched.pickle
nli-distilroberta-base_snli_1.0_test.pickle
nli-distilroberta-base_snli_1.0_test_contradiction_.png
nli-distilroberta-base_snli_1.0_test_entailment_.png
nli-distilroberta-base_snli_1.0_test_neutral_.png
prediction_utilities.py
robertasnligreedy:length.png
shap_importance.ipynb
test.ipynb
tokenization_verification.ipynb
