
Training LLMs: LaTeX Mathematical Expression to Valid Python Code

Given a LaTeX expression, train an LLM to generate valid Python code.

Description

This project develops a model that converts LaTeX mathematical expressions into executable Python code. It was built as part of a competition in which participants trained language models to translate LaTeX expressions into valid Python code that preserves the intended functionality.

Problem Statement

The primary challenge is to generate functionally correct Python code from LaTeX expressions. Participants were provided with a dataset containing LaTeX-Python code pairs, which served as the training data for their models. The goal is to ensure the generated Python code correctly implements the operations described by the LaTeX input.

Objectives

Participants in the competition were tasked with the following:

  • Train language models using the provided LaTeX-Python pairs.
  • Generate Python code for a set of test LaTeX inputs (a minimal generation sketch follows this list).
  • Execute the generated Python code on specific test inputs and save the results along with task IDs in a CSV file.
  • Submit the CSV file for evaluation.
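
As a rough illustration of the generation step, here is a minimal sketch assuming a causal language model fine-tuned on the LaTeX-Python pairs with Hugging Face Transformers; the checkpoint path and prompt template are placeholders, not part of the competition material.

# Hedged sketch: generate a Python code string from a LaTeX expression.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/finetuned-latex2py"           # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

latex_expr = r"3 \log{\left(10 x \right)} + 10"
prompt = f"LaTeX: {latex_expr}\nPython:\n"          # assumed prompt template

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))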

Types of Equations

The dataset includes a wide variety of equations across 14 types:

  1. Algebraic
  2. Multivariable Algebraic
  3. Trigonometric
  4. Logarithmic
  5. Exponential
  6. Exponential Decay
  7. Geometric
  8. Diophantine
  9. Summation
  10. Rational
  11. Fractional
  12. Derivative
  13. Integration
  14. Differential Equations

Example Conversion

Equation type: Logarithmic

Mathematical Expression:

3log(10x) + 10

LaTeX Expression:

3 \log{\left(10 x \right)} + 10

Solution (Python Code):

String: "from sympy import log\n\ndef logrithmic_function(x):\n return 3log(10x) + 10\n"

from sympy import log

def logrithmic_function(x):
    return 3*log(10*x) + 10

Execution of Code String

Participants were provided with sample code demonstrating how to execute these code strings, but they were free to use their own execution code; a rough sketch is shown below.
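
As an illustration (not the official sample code), a generated code string can be executed by running it in an isolated namespace and then calling the function it defines; the helper name run_code_string is made up for this sketch.

code_string = "from sympy import log\n\ndef logrithmic_function(x):\n    return 3*log(10*x) + 10\n"

def run_code_string(code_string, func_name, **kwargs):
    namespace = {}
    exec(code_string, namespace)                    # define imports and the function
    return float(namespace[func_name](**kwargs))    # call it with the test inputs

print(run_code_string(code_string, "logrithmic_function", x=9.044000633870175))
# ~23.514 for the logarithmic example above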

Test Cases (2 Sample Cases)

Test cases in the training set included expected outputs under the key "output"; the test set did not. This allowed participants to validate their models on their own validation or training splits without relying on the leaderboard (a validation sketch follows the sample below).

"test_cases": 
[
    {
        "input": {
            "x": 9.044000633870175
        },
        "output": 23.51406015249141
    },
    {
        "input": {
            "x": 1.6090229862443897
        },
        "output": 18.334636740837986
    }
]
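
For local validation, the generated function can be run on each test case's input and the result compared with the expected output. The sketch below reuses the run_code_string helper from the earlier sketch; the numerical tolerance is an assumption, not the official comparison rule.

import math

test_cases = [
    {"input": {"x": 9.044000633870175}, "output": 23.51406015249141},
    {"input": {"x": 1.6090229862443897}, "output": 18.334636740837986},
]

passed = 0
for case in test_cases:
    predicted = run_code_string(code_string, "logrithmic_function", **case["input"])
    if math.isclose(predicted, case["output"], rel_tol=1e-6):   # tolerance is an assumption
        passed += 1

print(f"code accuracy: {passed}/{len(test_cases)}")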

Evaluation

Submissions were evaluated based on the accuracy of the generated Python code. The custom scoring function, model_code_accuracy, compared the output of the generated code with the expected output for each test case. The accuracy determined the ranking of the submissions.

Model Code Accuracy

This metric evaluated the accuracy of each piece of generated code and then took the mean of those accuracies. For each problem/task_id (one LaTeX expression), suppose there are N test cases and the code produces the correct output for C of them. The accuracy for that task_id is then:

$$ \text{code accuracy} = \frac{\text{Passed test cases}}{\text{Total test cases}} = \frac{C}{N} $$

Then, model_code_accuracy is:

$$ \text{model code accuracy} = \frac{\sum \text{code accuracies}}{\text{number of problems}} $$
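
A minimal sketch of this aggregation is shown below; the function and variable names are illustrative, and exact equality is a simplification (the real metric presumably tolerates small floating-point differences).

def code_accuracy(predicted_outputs, expected_outputs):
    # fraction of test cases whose predicted output matches the expected output
    passed = sum(p == e for p, e in zip(predicted_outputs, expected_outputs))
    return passed / len(expected_outputs)

def model_code_accuracy(all_predicted, all_expected):
    # mean of the per-task code accuracies
    accuracies = [code_accuracy(p, e) for p, e in zip(all_predicted, all_expected)]
    return sum(accuracies) / len(accuracies)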

Score Function Docstring

def score(solution: pd.DataFrame, submission: pd.DataFrame, row_id_column_name: str) -> float:
    '''
    This metric calculates the mean accuracy of predictions compared to expected outputs.
    Inputs:
    - solution ==> df ==> Columns: row_id_column, expected_outputs, usage
        row_id_column    ==> unique identifiers for each problem/task_id
        expected_outputs ==> Each row is a list of length num_test_cases. Each item in the list can be an
                             int, float or string (for complex outputs). These are the ground-truth expected outputs.
        usage            ==> "Public" or "Private", depending on whether the row contributes to the public or private leaderboard

    - submission ==> df ==> Columns: row_id_column, outputs
        row_id_column    ==> unique identifiers for each problem/task_id
        outputs          ==> Each row is a list of length num_test_cases. Each item in the list can be an
                             int, float or string (for complex outputs).
                             These are the predicted outputs produced by running the LLM-generated code on the unit test inputs.
    '''
    # rest of the score function code

Submission File

For each task ID in the test set, participants returned a list of test-case outputs under the column "outputs". The submission file had to be a CSV in the following format (a short pandas sketch for writing it follows the example):

id,outputs
bdfdc594,"[1,2,3,4,5]"
adfdc512,"[1.1,2.2,inf,30.2,2.2]"
bffac789,"[1.34,2.43,1.24,1.111,1.245]"
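
Below is a minimal sketch of writing such a file with pandas; the IDs and output values are placeholders, and the exact list serialization should match whatever the scoring code expects.

import pandas as pd

rows = [
    {"id": "bdfdc594", "outputs": [1, 2, 3, 4, 5]},                      # placeholder outputs
    {"id": "adfdc512", "outputs": [1.1, 2.2, float("inf"), 30.2, 2.2]},
]
submission = pd.DataFrame(rows)
# str() keeps each list in a single quoted CSV cell, as in the example above
submission["outputs"] = submission["outputs"].apply(str)
submission.to_csv("submission.csv", index=False)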

Final Project Submissions

At the end of the competition, participants submitted the following for evaluation on the private leaderboard:

  1. Saved model
     • tokenizer
     • model weights checkpoint

  2. Training script used to train the model

  3. Training data used, along with the split, so that training can be reproduced.

  4. A script which loads the tokenizer and saved model, generates Python code for each LaTeX expression in private_test_data.json (the user only needs to provide the path; the format is similar to public_test_no_sol_no_out.json), and runs this code to get outputs on the test cases from that JSON file.

Key parts of the script pipeline:

  • a) A function to extract the Python function string from the model's generated output
  • b) generated_code_lists ==> list of lists ==> [[python_code_string_task_id_1], [python_code_string_task_id_2], …], in the same order as the tasks appear in private_test_data.json
  • c) compile_code and run_code, which take the test_inputs from private_test_data.json and return the outputs (a rough sketch follows)
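
The following is a rough sketch of what parts (a) and (c) might look like; the regular expressions and helper signatures are illustrative, not the competition's reference code, while the "input" key follows the test-case format shown earlier.

import re

def extract_function_string(generated_text):
    # (a) pull the Python code block out of the raw model output
    match = re.search(r"(from[\s\S]*?def[\s\S]*?return[^\n]*\n)", generated_text)
    return match.group(1) if match else generated_text

def compile_code(code_string):
    # (c) execute the code string and return the function it defines
    namespace = {}
    exec(code_string, namespace)
    func_name = re.findall(r"def\s+(\w+)\s*\(", code_string)[-1]
    return namespace[func_name]

def run_code(func, test_cases):
    # call the compiled function with each test case's keyword arguments
    return [float(func(**case["input"])) for case in test_cases]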