
Solving the Enem exam with LLMs
Live Demo
This repository aims to run LLMs on the Enem, a Brazilian University Admission Exam.
It employs the approach from the paper Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams, with the dataset they released, named ENEM 2022.
Evaluated models: GPT-3.5, GPT-4, Falcon 7B, LLaMA2 7B, and MariTalk.
The code was written aiming to have few dependencies and facilitate the use of LLMs other than OpenAI-based ones.
The ENEM 2022 dataset is available under the folder dataset/enem in a processed format, ready to use with the LLMs. The processing followed the instructions given by the authors, with minor modifications. To replicate it, replace the original write_out.py file with the dataset/enem/write_out.py file.
The original Enem exam used to build the ENEM 2022 dataset can be downloaded here and here.
Note: This project was developed using Windows 11 with Python 3.10.9.
Clone this repository, create a new environment (recommended) and install the dependencies:
pip install -r requirements.txt
Visit OpenAI to retrieve your API key and add it to your environment variables.
On Windows:
$Env:OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
On Linux:
export OPENAI_API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
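If you want to confirm the key is visible before running the evaluator, a minimal smoke test like the sketch below works. It uses the openai Python package's v1 client directly and is not the repository's own code, which may call the API differently.

```python
# Minimal smoke test (not the repository's code): checks that OPENAI_API_KEY
# is picked up from the environment and that a chat model answers.
import os
from openai import OpenAI  # assumes openai >= 1.0

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Reply with OK if you can read this."}],
)
print(response.choices[0].message.content)
```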
You can run any model whose name starts with gpt-3.5-turbo or gpt-4. The results reported in this repository were obtained with the gpt-3.5-turbo-0613 and gpt-4-0613 versions.
For the dataset, the options are Zero-shot, Few-shot, and Few-shot with Chain-of-Thought, which evaluate the dataset files enem_2022_0_shot.json, enem_2022_3_shot.json, and enem_cot_2022_3_shot.json, respectively.
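For reference, the mapping between these dataset names and the files under dataset/enem can be written as a small dictionary. The sketch below is only illustrative (the evaluator resolves the names internally), and it assumes the JSON files load as plain Python objects:

```python
# Illustrative mapping of --dataset_names values to the files they evaluate
# (paths relative to the repository root). Not the evaluator's internal code.
import json

DATASET_FILES = {
    "Zero-shot": "dataset/enem/enem_2022_0_shot.json",
    "Few-shot": "dataset/enem/enem_2022_3_shot.json",
    "Few-shot with Chain-of-Thought": "dataset/enem/enem_cot_2022_3_shot.json",
}

# Load one split to check it parses; the exact record structure is defined
# by the processed ENEM 2022 dataset, not assumed here.
with open(DATASET_FILES["Zero-shot"], encoding="utf-8") as f:
    data = json.load(f)
print(type(data), len(data))
```

The evaluate command below selects splits by these names.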
python evaluator.py evaluate --models "['gpt-3.5-turbo-0613', 'gpt-4-0613']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']"
The results will be placed in the reports folder (beware, this will overwrite the current files).
To produce an HTML file with a summary table, as shown in the results section, run:
python evaluator.py build_results_table --models "['gpt-3.5-turbo-0613', 'gpt-4-0613']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']" --output_filename "gpt_results.html"
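As a rough illustration of what such a summary contains, per-area counts can be rendered to HTML with pandas. This sketch is not how build_results_table is implemented; it only reuses the gpt-3.5-turbo-0613 zero-shot numbers from the results section as example data:

```python
# Illustrative only: render a per-area summary to HTML with pandas.
# This is NOT the repository's build_results_table implementation.
import pandas as pd

summary = pd.DataFrame(
    {
        "Area": ["Languages and Codes", "Human Sciences", "Natural Sciences", "Mathematics"],
        "Correct": [25, 34, 19, 11],
        "Total": [33, 37, 26, 22],
    }
)
summary["Accuracy (%)"] = (100 * summary["Correct"] / summary["Total"]).round(2)
summary.to_html("results_sketch.html", index=False)
```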
MariTalk is currently free, so my API key is written explicitly in the code.
python evaluator.py evaluate --models "['MariTalk']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']"
python evaluator.py build_results_table --models "['MariTalk']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']" --output_filename "maritalk_results.html"
Select an LLM from the Hugging Face model hub.
This repository was tested with the following models:
- tiiuae/falcon-7b-instruct
- tiiuae/falcon-40b-instruct
- meta-llama/Llama-2-7b-chat-hf
- meta-llama/Llama-2-70b-chat-hf
Create an endpoint on the Hugging Face Inference Endpoints platform.
Visit the endpoint UI to retrieve your token, namespace, endpoint name, and URL, and add them to your environment variables:
$Env:huggingface_token="hf_xxxxxxxxxxxxxxxxxx"
$Env:huggingface_namespace="xxxxxxxxxxxxxxxxxx"
Using the Falcon-7B model as an example, set the endpoint name and URL environment variables following this pattern:
$Env:huggingface_Falcon7B_name="xxxxxxxxxxxxxxxxxx"
$Env:huggingface_Falcon7B_url="https://xxxxxxxxxxxxxxxxxx.endpoints.huggingface.cloud"
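The evaluator reads these variables to reach your endpoints. As a sanity check, an endpoint can also be queried directly over the standard Inference Endpoints HTTP interface; the sketch below is not the repository's client code, and the prompt and max_new_tokens parameter are only examples:

```python
# Minimal sketch (not the repository's code): query a Hugging Face Inference
# Endpoint directly using the token and URL from the environment variables.
import os
import requests

url = os.environ["huggingface_Falcon7B_url"]
headers = {
    "Authorization": f"Bearer {os.environ['huggingface_token']}",
    "Content-Type": "application/json",
}
payload = {
    "inputs": "Qual é a capital do Brasil?",  # example prompt
    "parameters": {"max_new_tokens": 50},
}

response = requests.post(url, headers=headers, json=payload, timeout=60)
response.raise_for_status()
print(response.json())
```

The evaluation commands below then run against those endpoints.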
python evaluator.py evaluate --models "['Falcon-7B', 'LLaMA-2-7B']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']"
python evaluator.py build_results_table --models "['Falcon-7B', 'LLaMA-2-7B']" --dataset_names "['Zero-shot', 'Few-shot', 'Few-shot with Chain-of-Thought']" --output_filename "falcon_llama_results.html"
The streamlit demo is available for MariTalk and the OpenAI models.
streamlit run streamlit_app.py
Evaluation on the ENEM 2022 dataset with the models gpt-3.5-turbo-0613 and gpt-4-0613:
| Area | gpt-3.5-turbo-0613 zero-shot | gpt-3.5-turbo-0613 three-shot | gpt-3.5-turbo-0613 three-shot with CoT | gpt-4-0613 zero-shot | gpt-4-0613 three-shot | gpt-4-0613 three-shot with CoT |
|---|---|---|---|---|---|---|
| Languages and Codes | 25/33 (75.76%) | 28/33 (84.85%) | 25/33 (75.76%) | 30/33 (90.91%) | 29/33 (87.88%) | 30/33 (90.91%) |
| Human Sciences | 34/37 (91.89%) | 33/37 (89.19%) | 33/37 (89.19%) | 35/37 (94.59%) | 36/37 (97.30%) | 35/37 (94.59%) |
| Natural Sciences | 19/26 (73.08%) | 19/26 (73.08%) | 19/26 (73.08%) | 20/26 (76.92%) | 22/26 (84.62%) | 21/26 (80.77%) |
| Mathematics | 11/22 (50.00%) | 3/22 (13.64%) | 6/22 (27.27%) | 8/22 (36.36%) | 10/22 (45.45%) | 16/22 (72.73%) |
| Total | 89/118 (75.42%) | 83/118 (70.34%) | 83/118 (70.34%) | 93/118 (78.81%) | 97/118 (82.20%) | 102/118 (86.44%) |
Detailed results can be seen in the reports folder.
Evaluation on the ENEM 2022 dataset with the model MariTalk:
| Area | MariTalk zero-shot | MariTalk three-shot | MariTalk three-shot with CoT |
|---|---|---|---|
| Languages and Codes | 15/33 (45.45%) | 20/33 (60.61%) | 18/33 (54.55%) |
| Human Sciences | 22/37 (59.46%) | 22/37 (59.46%) | 31/37 (83.78%) |
| Natural Sciences | 15/26 (57.69%) | 10/26 (38.46%) | 15/26 (57.69%) |
| Mathematics | 6/22 (27.27%) | 1/22 (4.55%) | 5/22 (22.73%) |
| Total | 58/118 (49.15%) | 53/118 (44.92%) | 69/118 (58.47%) |
Detailed results can be seen in the reports folder.
The evaluation on the ENEM 2022 dataset with the models Falcon-7B and LLaMA-2-7B was done using the Hugging Face Inference Endpoints. These models require further investigation into how to build better prompts and how to automate the interpretation of their outputs. As can be seen in the detailed reports folder, there are several issues, such as mixing English with Portuguese, answering with gibberish text, and badly formatted answers. Thus, the results table should not be relied upon; it is kept in the repository for informational purposes only.
If you use the ENEM 2022 dataset in your research, even the processed version released in this repository, please cite the original work.
Also, if you use this code or the results published in this repository in your research, please cite:
@misc{arruda2023,
  author = {Vinicius Arruda},
  title = {Solving the Enem exam with LLMs},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/viniciusarruda/llm-enem}},
}