This project implements a deep learning-based Named Entity Recognition (NER) tool for the Turkish language. The tool identifies and classifies named entities in text into predefined categories: Person (PER), Location (LOC), Organization (ORG), and Date (DATE). Users can perform NER tasks with three different models:
- BiLSTM
- BiGRU
- Fine-Tuned BERT
The system consists of three main components:
- Trained Models: Models trained on a specially prepared dataset.
- Server: Manages communication between the models and the user interface.
- User Interface: A web-based interface developed using Flutter.
The dataset used for training consists of Turkish Wikipedia articles (link to dataset), which have been re-labeled, cleaned, and augmented. The final dataset comprises columns for words, their corresponding labels, and sentence numbers. The labels include categories such as PERSON, LOC, ORG, and DATE, using the IOB2 format.
- Initial Dataset: The initial dataset included various entity types beyond PERSON, LOC, ORG, and DATE, such as events, artworks, currencies, and races. These were re-labeled to 'O' (Other) due to their infrequent occurrence.
- Sentence Numbering: Sentences were numbered to facilitate model input.
- Label Adjustment: Entities not relevant to PERSON, LOC, ORG, and DATE were re-labeled as 'O'.
- Labeling Format: IOB2 tagging was applied for entity representation.
- Manual Corrections: Incorrect labels and misspellings were manually corrected.
- Padding and Indexing: Sentences were padded to ensure uniform input length, and words were indexed using pre-trained embeddings or vocabularies derived from the dataset.
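The word/label/sentence-number layout described above can be sketched with pandas (listed among the project's libraries). The miniature DataFrame below is hypothetical example data, not rows from the actual dataset:

```python
import pandas as pd

# Hypothetical miniature of the dataset layout: one row per word,
# with its IOB2 label and the number of the sentence it belongs to.
df = pd.DataFrame({
    "sentence_no": [1, 1, 1, 2, 2],
    "word":  ["Ahmet", "Ankara'ya", "gitti", "Türkiye", "güzeldir"],
    "label": ["B-PER", "B-LOC", "O", "B-LOC", "O"],
})

# Regroup rows into (words, labels) pairs per sentence for model input.
sentences = [
    (g["word"].tolist(), g["label"].tolist())
    for _, g in df.groupby("sentence_no")
]
```

Grouping by the sentence number is what makes the per-word rows usable as sequence inputs.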
The final dataset consists of 26,403 words and 2,067 sentences, with entities labeled as B- (beginning) and I- (inside).
- Total Words: 26,403
- Total Sentences: 2,067
- Training Set: 20,959 words, 1,653 sentences
- Test Set: 5,444 words, 414 sentences
| Entities | Train | Test | Total |
| --- | --- | --- | --- |
| B-PER | 1243 | 249 | 1492 |
| I-PER | 792 | 166 | 954 |
| B-LOC | 1137 | 350 | 1487 |
| I-LOC | 309 | 127 | 436 |
| B-ORG | 566 | 210 | 776 |
| I-ORG | 573 | 224 | 797 |
| B-DATE | 606 | 191 | 797 |
| I-DATE | 462 | 166 | 628 |
| O | 15271 | 3765 | 19036 |
- Word: "Türkiye"
- Label: "B-LOC"
- Tokenization and Padding: Sentences were tokenized, stripped of punctuation, and padded to ensure uniform input lengths for the models.
- Embedding Indexing: Words were indexed using FastText embeddings for BiLSTM and a custom vocabulary for BiGRU.
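The indexing and padding logic can be sketched in plain Python (the project itself uses Keras utilities for this, but the idea is the same; the vocabulary and sentences below are illustrative):

```python
# A minimal sketch of vocabulary indexing and padding.
PAD, UNK = 0, 1  # reserved indices for padding and unknown words

def build_vocab(sentences):
    vocab = {"<PAD>": PAD, "<UNK>": UNK}
    for sent in sentences:
        for word in sent:
            vocab.setdefault(word, len(vocab))
    return vocab

def encode_and_pad(sentences, vocab, max_len):
    encoded = []
    for sent in sentences:
        ids = [vocab.get(w, UNK) for w in sent][:max_len]
        ids += [PAD] * (max_len - len(ids))  # pad to uniform length
        encoded.append(ids)
    return encoded

sents = [["Türkiye", "güzeldir"], ["Ahmet", "Ankara'ya", "gitti"]]
vocab = build_vocab(sents)
X = encode_and_pad(sents, vocab, max_len=4)
```

Every encoded sentence then has the same length, which is what the fixed-size model inputs require.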
Three models were developed and trained:
- BiLSTM Model:
  - Embedding: Used FastText pre-trained word vectors. You can find them in this link.
  - Dimensionality Reduction: Applied PCA to reduce the embeddings from 300 to 100 dimensions.
- BiGRU Model:
  - Embedding: Used a custom vocabulary derived from the dataset.
- BERT Model:
  - Fine-tuned the pre-trained "dbmdz/bert-base-turkish-cased" model.
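The PCA step for the BiLSTM's embeddings can be sketched with scikit-learn (not listed among the project's dependencies, so this is an assumed tool; the random matrix stands in for the real FastText vectors):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the FastText embedding matrix:
# one 300-dimensional row per vocabulary word.
rng = np.random.default_rng(0)
embeddings_300d = rng.normal(size=(200, 300))

# Reduce 300 -> 100 dimensions before feeding the BiLSTM's
# embedding layer, as described above.
pca = PCA(n_components=100)
embeddings_100d = pca.fit_transform(embeddings_300d)
```

Reducing the embedding size cuts the embedding layer's parameter count to a third while keeping the directions of highest variance.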
The web application was developed using Flutter and comprises the following components:
- User Interface: Developed with Flutter using bloc-cubit state management with dependency injection, featuring input fields for text and model selection, and displaying the results.
- Server: A Flask server handles requests and communicates with the models.
- API Communication: Uses Retrofit and Dio for network requests.
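A Flask endpoint of the kind the server exposes might look like the sketch below. The route name, request fields, and response schema are assumptions for illustration; the real server's API may differ, and the dummy tagger stands in for actual model inference:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical endpoint: accepts a text and a model name,
# returns one label per token.
@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()
    text = data.get("text", "")
    model_name = data.get("model", "bilstm")
    # The real server would run the selected model here;
    # this sketch just returns 'O' for every token.
    tokens = text.split()
    labels = ["O"] * len(tokens)
    return jsonify({"model": model_name, "tokens": tokens, "labels": labels})
```

The Flutter client would POST to this endpoint via Retrofit/Dio and render the returned token/label pairs.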
- Python Libraries: TensorFlow, Keras, Pandas, NumPy, Flask, pymongo, transformers, nltk
- Flutter Packages: Retrofit, Dio, Bloc-Cubit, Get_it
- MongoDB: For the cloud database
- FastText: For word embeddings in the BiLSTM model
- BERT: Fine-tuned "dbmdz/bert-base-turkish-cased" model
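One detail specific to the fine-tuned BERT model is aligning word-level IOB2 labels to subword tokens. The alignment scheme below (first subword keeps the label, continuations of a B- entity become the matching I- label) is a common convention assumed here, not something the source specifies, and the toy tokenizer only mimics WordPiece splitting:

```python
# Align word-level IOB2 labels to subword tokens for BERT-style input.
def align_labels(words, labels, tokenize):
    """tokenize(word) -> list of subword pieces for that word."""
    sub_tokens, sub_labels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        # Continuation pieces of a B- entity get the matching I- label.
        cont = "I-" + label[2:] if label.startswith("B-") else label
        sub_tokens.extend(pieces)
        sub_labels.extend([label] + [cont] * (len(pieces) - 1))
    return sub_tokens, sub_labels

# Toy tokenizer (illustration only): splits words of 6+ chars in half.
toy = lambda w: [w] if len(w) < 6 else [w[:4], "##" + w[4:]]

tokens, labels = align_labels(["Ahmet", "Ankara'ya"], ["B-PER", "B-LOC"], toy)
```

With the real "dbmdz/bert-base-turkish-cased" tokenizer, the same function would map each word's pieces to consistent IOB2 tags before fine-tuning.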
The performance of each model was evaluated using the F1 metric:
- BiLSTM:
- BiGRU:
- BERT:
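For NER, F1 is typically computed at the entity level: spans are extracted from the IOB2 tags and compared exactly. The plain-Python sketch below shows that computation; the project may instead use a library such as seqeval, which is not confirmed by the source:

```python
# Entity-level F1 for IOB2 tag sequences.
def spans(tags):
    """Extract (entity_type, start, end) spans from an IOB2 sequence."""
    out, start = [], None
    for i, t in enumerate(tags + ["O"]):  # sentinel closes a final span
        if start is not None and (t == "O" or t.startswith("B-")
                                  or t[2:] != tags[start][2:]):
            out.append((tags[start][2:], start, i))
            start = None
        if t.startswith("B-"):
            start = i
    return set(out)

def entity_f1(gold, pred):
    g, p = spans(gold), spans(pred)
    tp = len(g & p)  # spans matching in type and exact boundaries
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Exact-span matching means a prediction gets no credit for finding only part of an entity, which is stricter than token-level F1.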
This project was developed by Aslıhan Yoldaş and Zeynep Demirtaş.