Given a LaTeX expression, train an LLM to generate valid Python code.
This project focuses on developing a model that converts LaTeX mathematical expressions into executable Python code. It was part of a competition in which participants trained language models to translate LaTeX expressions into valid Python code that preserves the intended functionality.
The primary challenge is to generate functionally correct Python code from LaTeX expressions. Participants were provided with a dataset of LaTeX-Python code pairs, which served as the training data for their models. The goal is to ensure that the generated Python code correctly implements the operations described by the LaTeX input.
Participants in the competition were tasked to:
- Train language models using the provided LaTeX-Python pairs.
- Generate Python code for a set of test LaTeX inputs (a minimal generation sketch follows this list).
- Execute the generated Python code on specific test inputs and save the results along with task IDs in a CSV file.
- Submit the CSV file for evaluation.
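As a rough illustration of the generation step, the sketch below loads a fine-tuned causal language model and produces a Python code string for a single LaTeX input. The checkpoint path, prompt template, and decoding settings are assumptions for illustration only and were not part of the competition materials.

from transformers import AutoModelForCausalLM, AutoTokenizer

# "my-latex2python-model" is a hypothetical fine-tuned checkpoint path.
tokenizer = AutoTokenizer.from_pretrained("my-latex2python-model")
model = AutoModelForCausalLM.from_pretrained("my-latex2python-model")

def generate_code(latex_expression: str) -> str:
    # The prompt format is an assumption; use whatever format the model was trained on.
    prompt = f"LaTeX: {latex_expression}\nPython:\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Keep only the newly generated tokens, dropping the prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

code_string = generate_code(r"3 \log{\left(10 x \right)} + 10")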
The dataset includes a wide variety of equations across 14 types:
- Algebraic
- Multivariable Algebraic
- Trigonometric
- Logarithmic
- Exponential
- Exponential Decay
- Geometric
- Diophantine
- Summation
- Rational
- Fractional
- Derivative
- Integration
- Differential Equations
Example from the dataset:
Equation type: Logarithmic
Mathematical Expression:
3log(10x) + 10
LaTeX Expression:
3 \log{\left(10 x \right)} + 10
Solution (Python Code):
String: "from sympy import log\n\ndef logrithmic_function(x):\n return 3log(10x) + 10\n"
from sympy import log

def logrithmic_function(x):
    return 3*log(10*x) + 10
Participants were provided with sample code demonstrating how to execute these code strings, but they were free to use their own execution code.
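The provided sample code is not reproduced here; the snippet below is a minimal sketch, using only the built-in exec, of one way such a code string can be executed and its function called on a test input.

def run_code_string(code_string: str, function_name: str, **inputs):
    # Execute the generated code in an isolated namespace, then call the
    # function it defines with the test-case inputs.
    namespace = {}
    exec(code_string, namespace)
    return namespace[function_name](**inputs)

solution = "from sympy import log\n\ndef logrithmic_function(x):\n    return 3*log(10*x) + 10\n"
print(run_code_string(solution, "logrithmic_function", x=9.044000633870175))  # approximately 23.514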
Test cases included expected outputs under the key "output" in the training set but not in the test set, so participants could validate their models on their own validation or training splits without relying on the leaderboard (a validation sketch follows the example test cases below).
"test_cases":
[
{
"input": {
"x": 9.044000633870175
},
"output": 23.51406015249141
},
{
"input": {
"x": 1.6090229862443897
},
"output": 18.334636740837986
}
]
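As a minimal validation sketch, assuming test cases in the format above, the helper below executes a generated code string and reports the fraction of test cases it reproduces. The relative tolerance used for comparing floats is an assumption, since the official comparison rules were not published.

import math

def task_accuracy(code_string, function_name, test_cases, rel_tol=1e-6):
    # Execute the generated code once, then score it against every test case.
    namespace = {}
    exec(code_string, namespace)
    func = namespace[function_name]
    correct = 0
    for case in test_cases:
        try:
            result = func(**case["input"])
            ok = math.isclose(float(result), float(case["output"]), rel_tol=rel_tol)
        except Exception:
            ok = False  # crashes and non-numeric results count as incorrect
        correct += ok
    return correct / len(test_cases)

test_cases = [
    {"input": {"x": 9.044000633870175}, "output": 23.51406015249141},
    {"input": {"x": 1.6090229862443897}, "output": 18.334636740837986},
]
solution = "from sympy import log\n\ndef logrithmic_function(x):\n    return 3*log(10*x) + 10\n"
print(task_accuracy(solution, "logrithmic_function", test_cases))  # 1.0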
Submissions were evaluated based on the accuracy of the generated Python code. The custom scoring function, model_code_accuracy, compared the output of the generated code with the expected output for each test case. The accuracy determined the ranking of the submissions.
This metric evaluated the accuracy of the code generated for each problem and then took the mean of those per-task accuracies. For each problem/task_id (one LaTeX expression), if there are N test cases and the generated code produces the correct output for C of them, then the accuracy for this task_id is:
accuracy_task_id = C / N
Then, model_code_accuracy is the mean of these per-task accuracies over all task_ids:
model_code_accuracy = (accuracy_task_id_1 + ... + accuracy_task_id_M) / M, where M is the number of task_ids
Score function docstring:
def score(solution: pd.DataFrame, submission: pd.DataFrame, row_id_column_name: str) -> float:
    '''
    This metric calculates the mean accuracy of predictions compared to expected outputs.

    Inputs:
    - solution ==> df ==> Columns: row_id_column, expected_outputs, usage
        row_id_column ==> unique identifiers for each problem/task_id
        expected_outputs ==> Each row is a list of length num_test_cases. Each item in the list can be an
            int, float or string (for complex outputs). These are the ground-truth expected outputs.
        usage ==> "Public" or "Private", depending on whether this row contributes to the public or private leaderboard
    - submission ==> df ==> Columns: row_id_column, outputs
        row_id_column ==> unique identifiers for each problem/task_id
        outputs ==> Each row is a list of length num_test_cases. Each item in the list can be an
            int, float or string (for complex outputs).
            These are the predicted outputs produced by running the LLM-generated code on the unit test inputs.
    '''
    # rest of the score function code
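The body of the official score function was not included above; the following is a rough sketch of how such a metric could be computed, assuming exact comparison for string outputs and a small relative tolerance for numeric outputs (both of which are assumptions, since the official comparison rules were not published).

import math
import pandas as pd

def score(solution: pd.DataFrame, submission: pd.DataFrame, row_id_column_name: str) -> float:
    # Align rows by task id, compute per-task accuracy, then average over tasks.
    merged = solution.merge(submission, on=row_id_column_name)
    task_accuracies = []
    for _, row in merged.iterrows():
        expected, predicted = row["expected_outputs"], row["outputs"]
        correct = 0
        for exp, pred in zip(expected, predicted):
            if isinstance(exp, str) or isinstance(pred, str):
                correct += exp == pred
            else:
                correct += math.isclose(float(pred), float(exp), rel_tol=1e-6)
        task_accuracies.append(correct / len(expected))
    return sum(task_accuracies) / len(task_accuracies)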
For each task ID in the test set, participants returned a list under the column "outputs". The submission file had to be a CSV with the following format:
id,outputs
bdfdc594,"[1,2,3,4,5]"
adfdc512,"[1.1,2.2,inf,30.2,2.2]"
bffac789,"[1.34,2.43,1.24,1.111,1.245]"
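A minimal sketch of writing such a file with pandas is shown below; the output lists are serialized with str(), which matches the inf formatting in the example rows, but the exact serialization the grader expects is an assumption.

import pandas as pd

# task_outputs maps each task_id to the list of outputs produced by its generated code
# (values here are illustrative, taken from the example rows above).
task_outputs = {
    "bdfdc594": [1, 2, 3, 4, 5],
    "adfdc512": [1.1, 2.2, float("inf"), 30.2, 2.2],
}

submission = pd.DataFrame({
    "id": list(task_outputs.keys()),
    "outputs": [str(v) for v in task_outputs.values()],
})
submission.to_csv("submission.csv", index=False)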
At the end of the competition, participants submitted the following for evaluation on the private leaderboard:
- Saved model
  - tokenizer
  - model weights checkpoint
- Training script used to train the model
- Training data used, along with the split, so that training can be reproduced
- A script which can load the tokenizer and saved model, generate Python code for each LaTeX expression in private_test_data.json (the user only needs to enter the path; the format is similar to public_test_no_sol_no_out.json), and run this code to get outputs on the test cases from this data JSON. The script should include:
  - a) A function to extract the Python function string from the generated output of the model (a sketch follows this list)
  - b) generated_code_lists ==> list of lists ==> [[python_code_string_task_id_1], [python_code_string_task_id_2], …] in the order they appear in private_test_data.json
  - c) compile_code and run_code functions to take test_inputs from private_test_data.json and return the outputs
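How the function string is extracted in (a) depends on the model's output format; the helper below, extract_function_string, is a hypothetical heuristic that keeps import lines and the first def block found in free-form model output, and it would need to be adapted to the actual generation format.

def extract_function_string(generated_text: str) -> str:
    # Heuristic sketch: keep import lines and the first "def ..." block,
    # stopping at the first non-indented, non-blank line after the body starts.
    kept, in_body = [], False
    for line in generated_text.splitlines():
        if not in_body:
            if line.startswith(("import ", "from ")):
                kept.append(line)
            elif line.strip() == "":
                if kept:
                    kept.append(line)  # keep blank lines between imports and def
            elif line.startswith("def "):
                kept.append(line)
                in_body = True
            else:
                kept = []  # prose before the code: discard and keep scanning
        elif line.startswith((" ", "\t")) or line.strip() == "":
            kept.append(line)
        else:
            break  # first dedented, non-blank line ends the function body
    return "\n".join(kept).rstrip() + "\n"

example_output = "Here is the code:\nfrom sympy import log\n\ndef logrithmic_function(x):\n    return 3*log(10*x) + 10\nDone."
print(extract_function_string(example_output))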