Fortran has been a widely used programming language for scientific computation since 1957. With technological advancements, modern languages like C++ have become preferable for some projects due to their greater flexibility and features. However, the lack of an accurate and comprehensive Fortran-to-C++ translation dataset means that existing large models, including GPT-4, often struggle to perform this task effectively, resulting in translations that may fail to compile or pass unit tests. Fortran2Cpp aims to address this issue.
This work builds on our previous work:
We fine-tuned several popular pre-trained models, including
- WizardCoder-15B-V1.0,
- CodeLlama-13b-Instruct-hf,
- starcoder,
- starcoder2,
- Magicoder-S-DS-6.7B, and
- deepseek-coder-33b-instruct.
After fine-tuning, deepseek-coder-33b-instruct showed the greatest improvement as measured by the CodeBLEU score, so we use deepseek-coder-33b-instruct as the backbone of Fortran2Cpp.
The model is available on Hugging Face: Fortran2Cpp
NOTE: Currently, the model is trained on a dataset of paired Fortran (f90) and C++ code. We are still training the model and will continue to update Fortran2Cpp.
We compared Fortran2Cpp with various models (WizardCoder-15B-V1.0, CodeLlama-13b-Instruct-hf, starcoder, Magicoder-S-DS-6.7B, deepseek-coder-33b-instruct, and GPT-4) on HPC_Fortran_CPP and compared the CodeBLEU scores of the generated results.
The CodeBLEU Score Comparison is shown in the figure below:
We recommend using a virtual environment to set up the Python environment and install the required packages:
python3.9 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
- Enter the Evaluation folder:
cd Evaluation
- To generate the results, use the script text_generation_pipline.py.
You can modify, for example:
- the model that you want to test: defined between line 9 and line 14.
- the file path where you want to store your results: defined in line 59; the default is log.txt.
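For reference, here is a minimal sketch of what that configurable section might look like. The variable names are illustrative assumptions and may not match text_generation_pipline.py exactly; only the model ID and the default log.txt path come from this README.

```python
# Hypothetical sketch of the configurable part of text_generation_pipline.py;
# names here are illustrative, not the script's exact code.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Lines 9-14: the model you want to test.
model_id = "Bin12345/F2C-Translator"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Line 59: where the translated C++ results are written (default log.txt).
output_path = "log.txt"
```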
Run:
export HUGGINGFACE_TOKEN="your_access_token_here"
python text_generation_pipline.py
# sample output of the text generation pipeline:
input_prompt:
Translate this Fortran code to C++:
program DRB096_doall2_taskloop_collapse_orig_no\n use omp_lib\n use DRB096\n implicit none\n\n integer :: len, i, j\n len = 100\n\n allocate (a(len,len))\n\n !$omp parallel\n !$omp single\n !$omp taskloop collapse(2)\n do i = 1, len\n do j = 1, len\n a(i,j) = a(i,j)+1\n end do\n end do\n !$omp end taskloop\n !$omp end single\n !$omp end parallel\n\n print 100, a(50,50)\n 100 format ('a(50,50) =',i3)\n\nend program
#1 Fortran Code has been translated.
Translated C++ Code:
DRB096_doall2_taskloop_collapse_orig_no
#include <omp.h>
#include <iostream>
int main() {
    int len = 100;
    double **a = new double*[len];
    for (int i = 0; i < len; ++i) {
        a[i] = new double[len];
    }
    #pragma omp parallel
    {
        #pragma omp single
        {
            #pragma omp taskloop collapse(2)
            for (int i = 1; i <= len; ++i) {
                for (int j = 1; j <= len; ++j) {
                    a[i-1][j-1] += 1;
                }
            }
        }
    }
    std::cout << "a(50,50) = " << a[49][49] << std::endl;
    for (int i = 0; i < len; ++i) {
        delete[] a[i];
    }
    delete[] a;
    return 0;
}
This C++ code does the same thing as the Fortran code. It allocates a 2D array `a`, then uses OpenMP to parallelize the task of incrementing each element of `a`. The `collapse(2)` clause is used to collapse the two loops into a single loop, which is then parallelized by OpenMP. The result is printed to the console.
This will generate the results and compress each result to one line for the subsequent CodeBLEU score test.
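For illustration, here is a minimal sketch (not the project's exact code) of what compressing a translation to one line means:

```python
# Collapse a multi-line C++ translation into a single whitespace-separated
# line, so each hypothesis occupies exactly one row in the results file.
def flatten(code: str) -> str:
    return " ".join(code.split())

cpp = "int main() {\n    return 0;\n}\n"
print(flatten(cpp))  # -> int main() { return 0; }
```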
This script does the following:
- model: Bin12345/F2C-Translator (configurable to use any other model)
- dataset: Bin12345/HPC_Fortran_CPP
- translates the Fortran code in the dataset to C++ code
- writes the results to a log
- Test the CodeBLEU score by using the following commands:
cd CodeBLEU
python calc_code_bleu.py --refs Fortran2Cpp/Evaluation/Groundtruth_C++.txt --hyp <path/to/your/results/txt/file> --lang cpp --params 0.25,0.25,0.25,0.25
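Since the score is computed against the references line by line, a quick sanity check (our own suggestion, not part of calc_code_bleu.py) is to confirm that the reference and hypothesis files contain the same number of rows; the hypothesis path below is a placeholder.

```python
# Sanity check: one flattened translation per line, same count in both files.
refs = open("Fortran2Cpp/Evaluation/Groundtruth_C++.txt").read().splitlines()
hyps = open("path/to/your/results.txt").read().splitlines()  # placeholder path
assert len(refs) == len(hyps), f"{len(refs)} references vs {len(hyps)} hypotheses"
```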
- Run inference on a Slurm cluster: you should use this script to start the inference:
sbatch <The/following/script>
#!/bin/bash
#SBATCH -N 1
#SBATCH -C gpu&hbm80g
#SBATCH -G 4
#SBATCH -q regular
#SBATCH -J model_training
#SBATCH --mail-user=<Your/Email>
#SBATCH --mail-type=ALL
#SBATCH -t 00:30:00
#SBATCH -A <Your/Project>
# Load conda
echo "loading conda..."
module load conda
conda activate <Your/Conda/env>
# Huggingface Setting
echo "Setting Huggingface..."
export HF_HOME=$SCRATCH/huggingface
export HF_TOKEN=<Your/HF/ToKen>
# OpenMP settings:
echo "Setting OMP..."
export OMP_NUM_THREADS=1
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
# Set CFLAGS and LDFLAGS and CUTLASS
export CFLAGS="-I/<Your/Conda/env>/include $CFLAGS"
export LDFLAGS="-L/<Your/Conda/env>/lib $LDFLAGS"
export CUTLASS_PATH=$HOME/cutlass
# run the application:
echo "Start to run the inference..."
chmod +x <Your/inference/file/path>
srun -n 1 -c 8 --cpu_bind=cores -G 4 --gpu-bind=none <Your/inference/file/path> > <Your/log/file/path> 2>&1
- Set up your OpenAI key:
cd dataset_generation
export OPENAI_API_KEY="sk...."
Modify engine_F2C.py to customize the input dataset range, the teacher model ID, the output file, etc. The default values are the following:
if __name__ == "__main__":
    key = get_openai_api_key()  # Obtain your OpenAI key from an environment variable named OPENAI_API_KEY
    # You can also use other datasets
    Fortran_dataset = load_dataset("codeparrot/github-code", "FORTRAN-all")
    data = Fortran_dataset["train"][78000:85000]
    output_file = ""  # Output JSON file
    generate_data(key, data, output_file, gpt_model="gpt-4o")  # Need to change
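Here, get_openai_api_key() is assumed to simply read the key exported in the previous step; a minimal sketch of that assumption:

```python
import os

# Assumed behavior of get_openai_api_key(): read the OPENAI_API_KEY
# environment variable set with `export OPENAI_API_KEY=...` above.
def get_openai_api_key() -> str:
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Please export OPENAI_API_KEY before running engine_F2C.py")
    return key
```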
- Start the dataset generation
python engine_F2C.py
A sample log file and JSON file are stored in dataset_generation/example_results.
The dataset that we used is included in the F2C-Translator/data/F2C_dialogue_2.5K.json file.
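If you want to inspect the dataset, it can be loaded with the standard json module. This sketch assumes the file is a JSON list of conversations with id and messages fields, as in the examples shown further below.

```python
import json

# Load the 2.5K-dialogue dataset and peek at the first conversation.
with open("F2C-Translator/data/F2C_dialogue_2.5K.json") as f:
    dialogues = json.load(f)
print(len(dialogues))
print(dialogues[0]["messages"][0])
```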
The demo code is modified from OpenCodeInterpreter. We appreciate their great project!
- Create a conda environment and install the packages:
cd Web_demo
conda create -n demo python=3.10
conda activate demo
pip install -r requirements.txt
- Start the demo
python chatbot.py
- Install packages
deepspeed==0.12.3
pyarrow==13.0.0
torch==2.0.1
numpy==1.26.4
- Collect the data. If you only want to use the code pairs, putting them into a two-column CSV file is fine.
If you would like to fine-tune on the multi-turn dialogue dataset, handle the data like this:
<User1><Assistant1><User2><Assistant2><User3><Assistant3>
->
1. <User1> <Assistant1>
2. <User1><Assistant1><User2> <Assistant2>
3. <User1><Assistant1><User2><Assistant2><User3> <Assistant3>
Put the first part of each message into the first column of the CSV file and the last part (<Assistant1,2,3>) into the second column.
You can achieve that by running this command: python training/utils/data/splitting_multiturns_dialogue.py (a sketch of this splitting logic is shown after the JSON example below).
The input file is data/F2C_dialogue_2.5K.json.
The converted JSON file is shown in data/F2C_dialogue_2.5K_test.json.
For example, the input JSON
[
  {
    "id": "conv1",
    "messages": [
      {"role": "user", "content": "Hi"},
      {"role": "assistant", "content": "Hello!"},
      {"role": "user", "content": "How are you?"},
      {"role": "assistant", "content": "I'm good, thank you."}
    ]
  }
]
will be split into the following output JSON:
[
  {
    "id": "conv1",
    "messages": [
      {"role": "user", "content": "Hi"},
      {"role": "assistant", "content": "Hello!"}
    ]
  },
  {
    "id": "conv1",
    "messages": [
      {"role": "user", "content": "Hi"},
      {"role": "assistant", "content": "Hello!"},
      {"role": "user", "content": "How are you?"},
      {"role": "assistant", "content": "I'm good, thank you."}
    ]
  }
]
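A rough sketch of this splitting logic, under the assumption that training/utils/data/splitting_multiturns_dialogue.py expands each conversation into one sample per assistant turn (as in the JSON above):

```python
# Expand one multi-turn conversation into several samples, each ending at a
# different assistant reply; this mirrors the input/output JSON shown above.
def split_dialogue(conversation):
    samples = []
    messages = conversation["messages"]
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            samples.append({"id": conversation["id"], "messages": messages[: i + 1]})
    return samples

conv = {
    "id": "conv1",
    "messages": [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
        {"role": "user", "content": "How are you?"},
        {"role": "assistant", "content": "I'm good, thank you."},
    ],
}
for sample in split_dialogue(conv):
    print(len(sample["messages"]))  # 2, then 4
```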
- Start training by running:
export RANK=0
export WORLD_SIZE=1
export MASTER_ADDR="localhost"
export MASTER_PORT="12355"
export LOCAL_RANK=8 # GPU number
python NLP_task_training.py --model_name_or_path Qwen/Qwen2.5-72B-Instruct
NOTE: This demo will not use the interpreter function. This feature is a potential extension for this work.
We used 6 A100 GPUs with 80 GB memory for training (with LoRA).
We used 2 A100 GPUs with 80 GB memory for inference.
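Since training uses LoRA, here is a minimal sketch of attaching a LoRA adapter with the peft library. The rank, alpha, dropout, and target modules below are our own illustrative assumptions, not the exact configuration used for Fortran2Cpp.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative only: hyperparameters are assumptions, not the exact
# Fortran2Cpp training configuration.
base = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-coder-33b-instruct")
lora_cfg = LoraConfig(
    r=16,                                 # assumed LoRA rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # shows the small fraction of trainable weights
```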
If you have any inquiries, please feel free to raise an issue or reach out to leib2765@gmail.com.
We will complete the technical introduction paper before mid-May.
The dataset at https://huggingface.co/datasets/Bin12345/HPC_Fortran_CPP has 315 rows of data, but Groundtruth_C++.txt has only 296 rows because we filtered out some long data samples.
We thank Lawrence Livermore National Laboratory (technical contact: liao6@llnl.gov) for their financial support of this project.