IndoELECTRA: Pre-Trained Language Model for Indonesian Language Understanding
ELECTRA is a new method for self-supervised language representation learning. This repository contains a pre-trained ELECTRA-Base model (TensorFlow 1.15.0) trained on a large Indonesian corpus (~16 GB of raw text, ~2B Indonesian words).
According to the original authors' description:
Inspired by generative adversarial networks (GANs), ELECTRA trains the model to distinguish between “real” and “fake” input data. Instead of corrupting the input by replacing tokens with “[MASK]” as in BERT, our approach corrupts the input by replacing some input tokens with incorrect, but somewhat plausible, fakes. For example, the word “cooked” could be replaced with “ate”. While this makes a bit of sense, it doesn’t fit as well with the entire context. The pre-training task requires the model (i.e., the discriminator) to then determine which tokens from the original input have been replaced or kept the same.
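As a rough illustration of this replaced-token-detection objective (a minimal sketch only; in the actual method the replacements are sampled from a small generator network, not hard-coded):

```python
# Illustrative sketch of ELECTRA's replaced-token-detection task.
# In the real setup a small masked-LM generator proposes the replacements;
# here a single swap is hard-coded for clarity.
original  = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # "cooked" -> "ate"

# The discriminator predicts, for every position, whether the token was
# replaced (1) or kept from the original input (0).
labels = [int(o != c) for o, c in zip(original, corrupted)]
print(labels)  # -> [0, 0, 1, 0, 0]
```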
Requirements:

- Python 3
- TensorFlow 1.15 (although we hope to support TensorFlow 2.0 at a future date)
- NumPy
- scikit-learn and SciPy (for computing some evaluation metrics).
All models are trained with the same tokenizer as BERT (the BERT WordPiece tokenizer). The vocabulary file was built using the WordPiece library.
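For example, the released vocabulary can be loaded with a standard BERT WordPiece tokenizer. A minimal sketch using the Hugging Face `transformers` implementation (the `vocab.txt` path and the lower-casing setting are assumptions; point them at the released files):

```python
from transformers import BertTokenizer

# Load the WordPiece vocabulary shipped with the checkpoint (assumed filename).
tokenizer = BertTokenizer("vocab.txt", do_lower_case=True)

tokens = tokenizer.tokenize("Saya sedang belajar pemrosesan bahasa alami.")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
```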
- The TensorFlow model can be downloaded here
- The PyTorch model can be downloaded here, or used directly via the Transformers library provided by Hugging Face (see the sketch below)
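A minimal sketch for loading the PyTorch checkpoint through `transformers` (the model identifier below is a placeholder; substitute the actual Hugging Face Hub ID linked above):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder identifier; replace with the published Hugging Face Hub ID.
model_id = "indoelectra-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode an Indonesian sentence and extract contextual representations.
inputs = tokenizer("Ibu kota Indonesia adalah Jakarta.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```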
Please follow the root repository for instructions on pre-training the model.