The goal of this experiment is to evaluate how well a large language model (LLM) can learn the conversion between English and Blissymbolics sentence structures. To leverage the LLM's knowledge of English, Blissymbolics sentences are composed using English words while adhering to the grammatical and syntactical rules of Blissymbolics. For instance, the English sentence "I slowly move towards the blue lake" would be expressed in Blissymbolics as "present: I move slowly towards lake blue". Without delving into the linguistic intricacies of Blissymbolics, it is essential to note that the language follows a specific ordering and structure to indicate verb tenses, as well as the relationships between verbs and adverbs, and nouns and adjectives.
The experiment uses the 7B-parameter Llama2 model pretrained by Meta, converted to the Hugging Face Transformers format. This model is chosen as a starting point because it requires less training time and fewer GPU resources than its larger counterparts, although it potentially sacrifices some capability. The Hugging Face Transformers format is selected because of its extensive community support and standardized APIs.
This experiment is performed on the Cedar cluster provided by the Digital Research Alliance of Canada. See its technical documentation regarding the content of job scripts and the job submission steps described below.
- Request access to Llama2 models on the Meta website;
- Follow the instructions on the Hugging Face website to request access to its Llama2 model;
- Request a Hugging Face access token on this page;
- Log in to the Cedar cluster;
- Create a "llama2" directory and run these commands to download the model:
mkdir llama2
cd llama2
# Load git-lfs first for downloading via Git large file storage
module load StdEnv/2020
module load git-lfs/3.3.0
git lfs install
git clone https://{hugging_face_id}:{hugging_face_access_token}@huggingface.co/meta-llama/Llama-2-7b-hf
# Fetch git large files in the repo directory
cd Llama-2-7b-hf
git lfs fetch
- Copy the content of requirements.txt for setting up the Llama2 models into a new file named requirements-llama2.txt in the "llama2" directory.
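A quick sanity check after the download step is to confirm that no large file is still a Git LFS pointer stub. The sketch below relies only on the fact that a pointer file starts with the Git LFS spec line; the function name `find_lfs_pointers` is hypothetical and not part of this repository.

```python
from pathlib import Path

# First line of every Git LFS pointer file, as defined by the Git LFS pointer spec.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(repo_dir):
    """Return the files under repo_dir that are still Git LFS pointer stubs."""
    root = Path(repo_dir)
    pointers = []
    for path in sorted(root.rglob("*")):
        # Skip Git's own metadata directory and anything that is not a regular file.
        if ".git" in path.relative_to(root).parts or not path.is_file():
            continue
        with path.open("rb") as handle:
            if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                pointers.append(path)
    return pointers
```

Calling `find_lfs_pointers("llama2/Llama-2-7b-hf")` should return an empty list once the weights are fully downloaded; any path it returns points to a file whose real content has not been checked out yet.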
In the jobs/Llama2/original_use directory, there are two scripts:
- original_use_7b_hf.py: The script that loads the downloaded model and tokenizer to perform text generation, word prediction and inference
- job_original_use_7b_hf.sh: The job script submitted to Cedar to run original_use_7b_hf.py
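For orientation, loading a local checkpoint and generating text with Hugging Face Transformers can look like the sketch below. It is not the content of original_use_7b_hf.py: the function name, the half-precision setting and `device_map="auto"` (which needs the `accelerate` package) are assumptions made for illustration.

```python
def generate_text(model_dir, prompt, max_new_tokens=50):
    """Load a local Hugging Face checkpoint and continue the given prompt."""
    # Imported inside the function so this file can be imported on a login node
    # where torch and transformers may not be installed.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    # Assumption: half precision and automatic device placement to fit one GPU.
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, torch_dtype=torch.float16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # The decoded text contains the prompt followed by the generated continuation.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

On a GPU node, `generate_text("llama2/Llama-2-7b-hf", "The capital of Canada is")` would return the prompt plus the model's continuation.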
Note that the job script must be copied to the user's scratch directory and submitted from there using the sbatch command.
Use FTP to transfer the above scripts to the Cedar cluster in the user's llama2/original_use directory. Run the following commands to submit the job.
cp llama2/original_use/job_original_use_7b_hf.sh scratch/.
cd scratch
sbatch job_original_use_7b_hf.sh
The result is written to llama2/original_use/result.txt.
In the jobs/Llama2/finetune directory, there are these scripts:
- bliss.json: The dataset of conversions from English text to the Conceptual Bliss structure
- finetune_7b_hf.py: The script that fine-tunes the downloaded model
- job_finetune_7b_hf.sh: The job script submitted to Cedar to run finetune_7b_hf.py
Use FTP to transfer the above scripts to the Cedar cluster in the user's llama2/finetune directory. Run the following commands to submit the job.
cp llama2/finetune/job_finetune_7b_hf.sh scratch/.
cd scratch
sbatch job_finetune_7b_hf.sh
The fine-tuning script:
- Creates an instruction dataset using bliss.json. This dataset contains bi-directional conversion between English and Conceptual Bliss.
- Uses the dataset to fine-tune the Llama2 model. See finetune_7b_hf.py for the fine-tuning parameters.
- Evaluates the fine-tuned model by testing a few sentence conversions between English and Bliss.
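The bi-directional instruction dataset described above can be pictured with the following sketch. The layout of bliss.json (a list of objects with `english` and `bliss` keys), the two instruction strings and the function names are all assumptions for illustration; finetune_7b_hf.py defines the real ones.

```python
import json

# Hypothetical instruction wording; the actual wording lives in finetune_7b_hf.py.
TO_BLISS = "Convert the input English sentence to a Bliss sentence."
TO_ENGLISH = "Convert the input Bliss sentence to an English sentence."


def load_pairs(path):
    """Read (english, bliss) pairs, assuming a JSON list of {"english", "bliss"} objects."""
    with open(path, encoding="utf-8") as handle:
        return [(item["english"], item["bliss"]) for item in json.load(handle)]


def build_instruction_records(pairs):
    """Turn each (english, bliss) pair into two records, one per conversion direction."""
    records = []
    for english, bliss in pairs:
        records.append({"instruction": TO_BLISS, "input": english, "output": bliss})
        records.append({"instruction": TO_ENGLISH, "input": bliss, "output": english})
    return records


# The example pair from the introduction yields one record per direction.
pairs = [("I slowly move towards the blue lake",
          "present: I move slowly towards lake blue")]
records = build_instruction_records(pairs)
```

Emitting both directions from each pair doubles the number of training examples without requiring any additional annotated sentences.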
Please note that because the dataset derived from bliss.json is relatively small, the fine-tuning script was run four times, with the epoch number in the script set to 1, 2, 3 and 4 in turn. As a result, four models were generated, one for each epoch count.
This section describes how to evaluate a fine-tuned model with instructions and input sentences.
In the jobs/Llama2/finetune directory, there are these scripts:
- eval_7b_hf.py: The script that evaluates the fine-tuned model. Common variables to adjust:
  - model_dir: The location of the model directory
  - instruction: At the bottom of the script, defines the instruction part of a prompt
  - input: At the bottom of the script, defines the sentence to be converted
- job_eval_7b_hf.sh: The job script submitted to Cedar to run eval_7b_hf.py
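To illustrate how the instruction and input variables might be combined into a prompt, here is a minimal sketch. The Alpaca-style template and both helper names are assumptions; the template actually used in eval_7b_hf.py may differ and must match the one used during fine-tuning.

```python
# Assumed Alpaca-style template; not taken from eval_7b_hf.py.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
RESPONSE_MARKER = "### Response:"


def build_prompt(instruction, input_text):
    """Fill the template with the instruction and the sentence to convert."""
    return PROMPT_TEMPLATE.format(instruction=instruction, input=input_text)


def extract_response(generated_text):
    """Keep only the text after the last response marker, stripped of whitespace."""
    return generated_text.rsplit(RESPONSE_MARKER, 1)[-1].strip()


prompt = build_prompt(
    "Convert the input English sentence to a Bliss sentence.",
    "I slowly move towards the blue lake",
)
```

A causal language model returns the prompt followed by its continuation, so a helper such as `extract_response` is useful for isolating the converted sentence before it is compared with the expected one.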
Use FTP to transfer the above scripts to the Cedar cluster in the user's llama2/finetune directory. Run the following commands to submit the job.
cp llama2/finetune/job_eval_7b_hf.sh scratch/.
cd scratch
sbatch job_eval_7b_hf.sh
This section describes how to evaluate the generated sentences and compare them with original or expected sentences. It evaluates the generated sentence in these aspects:
- Semantic Coherence
- Novelty and Creativity
- Fluency and Readability
In the jobs/Llama2/finetune directory, there are these scripts:
- eval_generated_sentence.py: The script that evaluates a generated sentence. Common variables to adjust:
  - sentence_orig: The original sentence
  - sentence_expected: The expected sentence
  - sentence_generated: The sentence generated by the fine-tuned model
- job_eval_generated_sentence.sh: The job script submitted to Cedar to run eval_generated_sentence.py
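As a rough illustration of comparing the three sentences, the sketch below computes two purely lexical scores with the Python standard library. These are crude stand-ins chosen for this example, not the metrics implemented in eval_generated_sentence.py; fluency is left out because it usually requires a language model.

```python
from difflib import SequenceMatcher


def tokens(sentence):
    """Lowercase the sentence and split it on whitespace."""
    return sentence.lower().split()


def similarity_to_expected(expected, generated):
    """Order-sensitive token similarity in [0, 1]; 1.0 means identical token sequences."""
    return SequenceMatcher(None, tokens(expected), tokens(generated)).ratio()


def novelty(original, generated):
    """Fraction of generated tokens that do not occur in the original sentence."""
    generated_tokens = tokens(generated)
    if not generated_tokens:
        return 0.0
    original_vocab = set(tokens(original))
    new_tokens = sum(token not in original_vocab for token in generated_tokens)
    return new_tokens / len(generated_tokens)


sentence_orig = "I slowly move towards the blue lake"
sentence_expected = "present: I move slowly towards lake blue"
sentence_generated = "present: I move slowly towards lake blue"

similarity = similarity_to_expected(sentence_expected, sentence_generated)
novelty_score = novelty(sentence_orig, sentence_generated)
```

Because Bliss word order carries grammatical information, an order-sensitive comparison is used for the expected sentence, while the novelty score only looks at which words were introduced.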
Use FTP to transfer the above scripts to the Cedar cluster in the user's llama2/finetune directory. Run the following commands to submit the job.
cp llama2/finetune/job_eval_generated_sentence.sh scratch/.
cd scratch
sbatch job_eval_generated_sentence.sh
- Diversified Dataset Expansion: Currently, the bliss.json dataset consists of 967 pairs of conversions between English and Bliss, focusing on specific ordering and sentence structures. To enhance the model's versatility, a key improvement is to enrich the dataset with a wider variety of sentence types and structures.
- Comprehensive Model Evaluation: The evaluation of the fine-tuned model is not comprehensive. Individual converted sentences are assessed, but a more thorough evaluation method is needed. This includes comparing the expected and actual conversion results on a percentage of the dataset, and checking for underfitting or overfitting. Because the fine-tuning runs from 1 to 4 epochs on a small dataset, the risk of overfitting may increase with more epochs, which requires a robust evaluation process.
- Understanding Bliss Language: The current fine-tuned model responds effectively to two fixed instructions, converting between English and Bliss. However, it lacks a deep understanding of the Bliss language itself. The next step involves fine-tuning a model that comprehends broader queries in Bliss, going beyond instructional conversion tasks. Tests show that the model performs well in converting Bliss to English, likely because of its extensive knowledge of English, but its performance in the reverse direction is not ideal. This difference suggests a need for additional fine-tuning, potentially by enhancing the model's understanding of the unique linguistic features of Bliss.
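The more thorough evaluation suggested above could start from a held-out split and an exact-match score, as in this sketch. The 10% hold-out fraction, the fixed seed and both function names are assumptions for illustration, not part of the current scripts.

```python
import random


def split_pairs(pairs, eval_fraction=0.1, seed=0):
    """Shuffle deterministically and hold out a fraction of the pairs for evaluation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    # Returns (training pairs, held-out pairs).
    return shuffled[n_eval:], shuffled[:n_eval]


def exact_match_rate(expected, generated):
    """Share of generated sentences equal to the expected one after whitespace normalization."""
    if len(expected) != len(generated):
        raise ValueError("expected and generated must have the same length")
    if not expected:
        return 0.0
    matches = sum(
        " ".join(e.split()) == " ".join(g.split())
        for e, g in zip(expected, generated)
    )
    return matches / len(expected)
```

Computing the same score on the training pairs and on the held-out pairs for each of the four epoch counts would show overfitting as a training score that keeps rising while the held-out score stalls or drops.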
Although the fine-tuning uses a fairly small dataset, the fine-tuned model performs well in converting between English and Conceptual Bliss sentence structures, especially the two-epoch and three-epoch models.