IndoELECTRA: Pre-Trained Language Model for Indonesian Language Understanding
ELECTRA is a new method for self-supervised language representation learning. This repository contains a pre-trained ELECTRA-Base model (TensorFlow 1.15.0) trained on a large Indonesian corpus (~16 GB of raw text, ~2B Indonesian words).
According to the original authors' description:
Inspired by generative adversarial networks (GANs), ELECTRA trains the model to distinguish between “real” and “fake” input data. Instead of corrupting the input by replacing tokens with “[MASK]” as in BERT, our approach corrupts the input by replacing some input tokens with incorrect, but somewhat plausible, fakes. For example, the word “cooked” could be replaced with “ate”. While this makes a bit of sense, it doesn’t fit as well with the entire context. The pre-training task requires the model (i.e., the discriminator) to then determine which tokens from the original input have been replaced or kept the same.
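As a rough illustration of this replaced-token-detection objective (a minimal sketch only; in the actual method the replacements are sampled from a small generator network, not hard-coded):

```python
# Illustrative sketch of ELECTRA's replaced-token-detection task.
# In the real setup a small masked-LM generator proposes the replacements;
# here a single swap is hard-coded for clarity.
original  = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]  # "cooked" -> "ate"

# The discriminator predicts, for every position, whether the token was
# replaced (1) or kept from the original input (0).
labels = [int(o != c) for o, c in zip(original, corrupted)]
print(labels)  # -> [0, 0, 1, 0, 0]
```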
Requirements:

- Python 3
- TensorFlow 1.15 (although we hope to support TensorFlow 2.0 at a future date)
- NumPy
- scikit-learn and SciPy (for computing some evaluation metrics).
All models are trained with the same tokenizer as BERT (the BERT WordPiece tokenizer). The vocabulary file was built using the WordPiece library.
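For example, the released vocabulary can be loaded with a standard BERT WordPiece tokenizer. A minimal sketch using the Hugging Face `transformers` implementation (the `vocab.txt` path and the lower-casing setting are assumptions; point them at the released files):

```python
from transformers import BertTokenizer

# Load the WordPiece vocabulary shipped with the checkpoint (assumed filename).
tokenizer = BertTokenizer("vocab.txt", do_lower_case=True)

tokens = tokenizer.tokenize("Saya sedang belajar pemrosesan bahasa alami.")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)
```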
- The TensorFlow model can be downloaded here
- The PyTorch model can be downloaded here, or used directly via the Transformers library provided by Hugging Face (see the sketch below)
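A minimal sketch for loading the PyTorch checkpoint through `transformers` (the model identifier below is a placeholder; substitute the actual Hugging Face Hub ID linked above):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder identifier; replace with the published Hugging Face Hub ID.
model_id = "indoelectra-base"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode an Indonesian sentence and extract contextual representations.
inputs = tokenizer("Ibu kota Indonesia adalah Jakarta.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```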
Please follow the root repository for instructions on pre-training the model.