Hugohuuts/LoLaTokenization

Custom Tokenization and NLI Performance

By:

  • Yme Boland
  • Shane Siwpersad
  • Eduard Köntöş
  • Hugo Voorheijen
  • Julius Bijkerk

Project Overview

This repository is part of a group project for the Logic and Language course at Utrecht University, taught by Dr. Lasha Abzianidze. We explore the impact of different tokenization techniques on Natural Language Inference (NLI), using pre-trained models such as BERT, RoBERTa, and other foundation models.

Objectives

  • Investigate how different tokenization strategies affect NLI model performance.
  • Use the SNLI dataset (and possibly other NLI datasets, such as MNLI) for testing and comparison.
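
To make the first objective concrete, here is a minimal sketch of how three tokenization strategies split the same premise/hypothesis pair. It is illustrative only and is not taken from the project notebooks: the function names, the toy vocabulary `TOY_VOCAB`, and the example sentences are all assumptions, and the subword tokenizer is a simplified greedy longest-match scheme in the style of WordPiece rather than the tokenizer shipped with BERT or RoBERTa.

```python
from typing import Dict, List, Set


def whitespace_tokenize(text: str) -> List[str]:
    """Split on whitespace after lowercasing."""
    return text.lower().split()


def char_tokenize(text: str) -> List[str]:
    """Treat every non-space character as its own token."""
    return [ch for ch in text.lower() if not ch.isspace()]


# Hypothetical toy vocabulary, chosen only for this example.
TOY_VOCAB: Set[str] = {"a", "man", "is", "play", "##ing", "guitar", "##s"}


def subword_tokenize(text: str, vocab: Set[str] = TOY_VOCAB) -> List[str]:
    """Greedy longest-match subword split (WordPiece-style sketch).

    Non-initial pieces carry a '##' prefix. A word that cannot be
    fully covered by the vocabulary becomes a single '[UNK]' token.
    """
    tokens: List[str] = []
    for word in text.lower().split():
        pieces: List[str] = []
        start = 0
        while start < len(word):
            end = len(word)
            match = None
            # Shrink the candidate from the right until it is in the vocab.
            while end > start:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate
                if candidate in vocab:
                    match = candidate
                    break
                end -= 1
            if match is None:
                pieces = ["[UNK]"]
                break
            pieces.append(match)
            start = end
        tokens.extend(pieces)
    return tokens


def token_counts(premise: str, hypothesis: str) -> Dict[str, int]:
    """Total number of tokens for the pair under each strategy."""
    strategies = {
        "whitespace": whitespace_tokenize,
        "character": char_tokenize,
        "subword": subword_tokenize,
    }
    return {
        name: len(fn(premise)) + len(fn(hypothesis))
        for name, fn in strategies.items()
    }


if __name__ == "__main__":
    premise = "A man is playing guitars"
    hypothesis = "A man plays drums"
    print(subword_tokenize(premise))
    print(subword_tokenize(hypothesis))
    print(token_counts(premise, hypothesis))
```

The same pair produces sequences of different length and granularity under each strategy, and out-of-vocabulary words such as "drums" collapse to `[UNK]` in the subword case. Differences of this kind are what an NLI model sees at its input, which is why the choice of tokenizer may affect performance.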

Repository Structure

Main folder: LoLaTokenization

Getting Started

Several Jupyter notebooks are available for download from this repository.

Contributions

Work on assigned tasks in personal branches. Push updates regularly for review in weekly meetings.

Notes

This README will be updated as the project progresses to reflect more specific details and instructions.