Refer to the [pre-requisites](https://github.com/AyeshaAmjad0828/LLMs-Table-Predictions) section for the environment set-up prior to starting this experiment.
Here are the results of this experiment on fraud detection data.
**Tools used:** Hugging Face, Axolotl, Modal, Llama, Weights & Biases, OpenAI, FLAML
The objective of this project is to assess the quality of LLM training and prediction results on tabular datasets. The inspiration is drawn from [clinicalml/TabLLM](https://github.com/clinicalml/TabLLM), which performed few-shot classification of tabular data with LLMs.
The project consists of two experiments:
- Fine-tuning an LLM on serialized training data and using it to generate predictions on the test dataset.
- Generating vector embeddings on serialized training data and using the vectors as features for an AutoML algorithm (FLAML).
Let's go over the design and execution of each experiment step by step.
Here is a diagram showing the high-level set-up of the first experiment. All the related code is in the `src` folder.
This experiment contains three steps:
- Serializing the data (both train and test) from table to text (list serialization, text template, manual template).
- Fine-tuning LLMs (Llama-2 7B, 13B, 70B). All logic related to fine-tuning is in src/finetune.py.
- Performing inference/prediction on the test data using the fine-tuned model. All logic related to inference is in src/inference.py.
Three methods are used for converting tabular data into text strings. The most promising results were given by manual conversion and bloom-560m-finetuned-totto-table-to-text.
The text sentences are then manually labeled (using expressions) with input for the features and output for the target-variable value, in a structure compatible with a JSONL file. A subset of the final dataset is included in my_data.jsonl.
The actual fraud.jsonl used for training contains 1,296,675 records.
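To make the serialization step concrete, here is a minimal sketch; the `serialize_row` helper, the template wording, and the sample values are illustrative assumptions, not the repo's actual code:

```python
import json

# Hypothetical manual-template serializer over a subset of the fraud
# dataset's columns (the template wording is illustrative).
def serialize_row(row: dict) -> str:
    return (
        f"The transaction at {row['merchant']} in category {row['category']} "
        f"was for an amount of {row['amt']} by {row['first']} {row['last']} "
        f"from {row['city']}, {row['state']}."
    )

row = {
    "merchant": "Rippin, Kub and Mann", "category": "misc_net", "amt": 4.97,
    "first": "Jennifer", "last": "Banks", "city": "Moravian Falls",
    "state": "NC", "is_fraud": 0,
}

# Each record becomes one JSONL line with `data` (features) and `output`
# (target) fields, matching the field_input/field_output mapping shown below.
print(json.dumps({"data": serialize_row(row), "output": str(row["is_fraud"])}))
```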
The src/finetune.py script runs three Modal functions in the cloud:

- `launch` prepares a new folder in the /runs volume with the training config and data for a new training job. It also ensures the base model is downloaded from Hugging Face.
- `train` takes a prepared folder and performs the training job using finalconfig.yml and my_data.jsonl.
- `Inference.completion` can spawn a vLLM inference container for any pre-trained or fine-tuned model from a previous training job.

The rest of the code consists of helpers for calling these three functions.
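For orientation, the three functions follow roughly this shape; this is a hypothetical skeleton, not the repo's actual code, assuming the `stub` and `VOLUME_CONFIG` objects defined in common.py:

```python
# Hypothetical skeleton, for orientation only; see src/finetune.py and
# src/inference.py for the real implementations.
from .common import stub, VOLUME_CONFIG

@stub.function(volumes=VOLUME_CONFIG)
def launch(config_raw: str, data_raw: str) -> str:
    """Write config + data into a new folder under /runs and ensure the
    base model is cached under /pretrained; return the run folder path."""
    ...

@stub.function(gpu="A100", volumes=VOLUME_CONFIG, timeout=24 * 3600)
def train(run_folder: str):
    """Run the axolotl training job with the config and data in run_folder."""
    ...

# In the repo, completion lives on an Inference class and serves the model
# with vLLM; it is sketched here as a plain function.
@stub.function(gpu="A100", volumes=VOLUME_CONFIG)
def completion(run_folder: str, prompt: str) -> str:
    ...
```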
The training job uses finalconfig.yml, where the training configuration is defined. Here is a breakdown of this file:
- **Base model**

  The `base_model` value is changed for experimenting with different Llama models.

  ```yaml
  base_model: meta-llama/Llama-2-7b-chat-hf  # meta-llama/Llama-2-13b-chat-hf, meta-llama/Llama-2-70b-chat-hf
  model_type: LlamaForCausalLM
  tokenizer_type: LlamaTokenizer
  is_llama_derived_model: true
  ```
- **Dataset**

  The following format is set according to the layout of my_data.jsonl and the nature of the experiment. Different formats supported by axolotl can be found here. (A rendered example of this prompt appears after this list.)

  ```yaml
  datasets:
    - path: my_data.jsonl
      ds_type: json
      type:
        # JSONL file contains data and output fields per line.
        # These get mapped to the input and output axolotl tags.
        field_input: data
        field_output: output
        # Format is used by axolotl to generate the prompt.
        format: |-
          [INST] For the given data values of trans_date_trans_time, cc_num, merchant, category, amt, first, last, gender, street, city, state, zip, lat, long, city_pop, job, dob, trans_num, unix_time, merch_lat, merch_long in input, what is the output value of is_fraud?
          {input} [/INST]
  ```
- **Adapter**

  ```yaml
  adapter: lora
  lora_model_dir:
  lora_r: 16
  lora_alpha: 32  # alpha = 2 x rank is a good starting point.
  lora_dropout: 0.05
  lora_target_linear: true  # target all linear layers
  lora_fan_in_fan_out:
  ```
- **Weights & Biases**

  This enables the tracking of all runs with Weights & Biases. During training, a clickable link to the W&B dashboard is generated.

  ```yaml
  wandb_project: tabllm-7b-prediction-output
  wandb_watch: all
  wandb_entity:
  wandb_run_id:
  ```
- **Multi-GPU training**

  DeepSpeed is used for multi-GPU training, as recommended, and is easy to set up. Axolotl provides several default DeepSpeed JSON configurations, and Modal makes it easy to attach multiple GPUs of any type in code.

  ```yaml
  deepspeed: /root/axolotl/deepspeed/zero3.json
  ```
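To see what the dataset config above produces, here is a sketch of how the format template expands for one record; the record values are illustrative:

```python
template = (
    "[INST] For the given data values of trans_date_trans_time, cc_num, merchant, "
    "category, amt, first, last, gender, street, city, state, zip, lat, long, "
    "city_pop, job, dob, trans_num, unix_time, merch_lat, merch_long in input, "
    "what is the output value of is_fraud? {input} [/INST]"
)

# One serialized record in the my_data.jsonl layout (values are illustrative).
record = {"data": "The transaction at Rippin, Kub and Mann ...", "output": "0"}

# Axolotl substitutes the record's `data` field into {input}; the model is
# trained to emit the record's `output` field as the completion.
prompt = template.format(input=record["data"])
print(prompt)
```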
Navigate to the location containing the `src` folder in the CLI and start the fine-tuning with a simple command:

```
modal run --detach src.finetune
```

The `--detach` flag lets the app continue running even if the client disconnects.
The script reads two local files: finalconfig.yml and my_data.jsonl. Their contents are passed as arguments to the remote `launch` function, which writes them to the /runs volume. Next, `train` reads the config and data from the new folder for reproducible training runs.
The default configuration fine-tunes meta-llama/Llama-2-7b-chat-hf for 20 epochs. It uses DeepSpeed ZeRO-3 to shard the model state across 2 A100s.
The command first installs all the required packages and then creates objects for the functions defined in finetune.py.
A link to the app logs maintained by Modal is provided, which store information on CPU and GPU consumption. Here is an example screenshot.
A link to the Weights & Biases dashboard is also provided, along with the info passed to it.
Once the model fine-tuning completes, check the Modal volume for the latest file name; this is the LLM weights file. In the CLI, perform inference using a simple command:

```
modal run -q src.inference::inference_main --run-folder /runs/axo-2024-01-08-15-39-45-c7ef
```
Here, /runs/axo-2024-01-08-15-39-45-c7ef is the path in my volume that I can fetch with the following command in the CLI, where tablellm-runs-vol is defined in common.py:

```
modal volume ls tablellm-runs-vol
```
Here is a screenshot from the experiment:
Here is a diagram showing the high-level set-up of the second experiment. All the related code is in the `serialization` folder.
This experiment involves three steps:
- **Data Serialization**
  - Method: Manual Template
  - Description: Sentence-like strings of tabular data are created using the Manual Template method. A sample file is manual_template_serialization.jsonl.
- **Embedding Vectors Generation**
  - Model: OpenAI embedding model `text-embedding-ada-002`
  - Description: The serialized sentences are embedded using the specified OpenAI model for ML model prediction.
- **AutoML Evaluation**
  - Class: AutoML class of `flaml`
  - Description: The embedded data is passed to the AutoML class to determine the best-performing estimator and hyperparameter configuration.
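To make steps 2 and 3 concrete, here is a minimal sketch, assuming the serialized file uses the same `data`/`output` fields as my_data.jsonl (the field names and time budget are illustrative assumptions):

```python
import json
import numpy as np
from openai import OpenAI
from flaml import AutoML

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Step 1 output: serialized sentences plus labels.
texts, labels = [], []
with open("manual_template_serialization.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        texts.append(rec["data"])
        labels.append(int(rec["output"]))

# Step 2: embed the sentences (batch the requests for large datasets).
resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
X = np.array([item.embedding for item in resp.data])
y = np.array(labels)

# Step 3: let FLAML search for the best estimator and hyperparameters.
automl = AutoML()
automl.fit(X, y, task="classification", time_budget=300)  # budget in seconds
print(automl.best_estimator, automl.best_config)
```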
1. Create an account on OpenAI and generate an API key.
2. Save this API key as an environment variable under the name OPENAI_API_KEY.
3. Create an account on Modal.
4. Create an account on Hugging Face and agree to the terms and conditions for accessing Llama models.
5. Get the Hugging Face access token.
6. Create a new secret for Hugging Face in your Modal account. This secret is a way to mask the Hugging Face access token. Once created, your keys will be displayed in the same location.
7. Install Modal in your current Python environment: `pip install modal`.
8. Open cmd and navigate to the Python scripts folder ...\AppData\Local\Programs\Python\Python310\Scripts.
9. Set up the Modal token in your Python environment: `modal setup`.
10. (Optional) To monitor LLM fine-tuning performance visually, set up a Weights & Biases account, get its authorization key, and create its secret in the same way as the Hugging Face secret on Modal.

    Install the Weights & Biases library in your current Python environment: `pip install wandb`.

    Add your wandb config to your config.yml script (you will find this in my finalconfig.yml):

    ```yaml
    wandb_project: tabllm-7b-prediction-output
    wandb_watch: all
    wandb_entity:
    wandb_run_id:
    ```

    You may have to perform `modal setup` again in your Python environment, as shown in step 9.
The script for initializing the required applications in Modal is common.py. This is where we define the application name that appears in Modal:
```python
APP_NAME = "tablefinetune-axolotl"
```
This is followed by the axolotl image, which is used to fetch the relevant Hugging Face model:
```python
import os
from modal import Image, Secret, Stub, Volume

# AXOLOTL_REGISTRY_SHA and TRANSFORMERS_SHA are pinned SHA constants
# defined elsewhere in common.py.
axolotl_image = (
    Image.from_registry(f"winglian/axolotl@sha256:{AXOLOTL_REGISTRY_SHA}")
    .run_commands(
        "git clone https://github.com/OpenAccess-AI-Collective/axolotl /root/axolotl",
        "cd /root/axolotl && git checkout a581e9f8f66e14c22ec914ee792dd4fe073e62f6",
    )
    .pip_install("huggingface_hub==0.19.4", "hf-transfer==0.1.4")
    .pip_install(
        f"transformers @ git+https://github.com/huggingface/transformers.git@{TRANSFORMERS_SHA}",
        "--force-reinstall",
    )
    .env(dict(HUGGINGFACE_HUB_CACHE="/pretrained", HF_HUB_ENABLE_HF_TRANSFER="1"))
)
```
The stub is initialized with the Hugging Face and W&B secret keys:

```python
stub = Stub(
    APP_NAME,
    secrets=[
        Secret.from_name("my-huggingface-secret1"),
        Secret.from_name("my-wandb-secret1"),
    ],
)
```
Finally, it defines volumes for pre-trained models and training runs:

```python
pretrained_volume = Volume.persisted("example-pretrained-vol")
runs_volume = Volume.persisted("example-runs-vol")
VOLUME_CONFIG: dict[str | os.PathLike, Volume] = {
    "/pretrained": pretrained_volume,
    "/runs": runs_volume,
}
```
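For context, here is a hedged sketch of how a function attaches these volumes and persists what it writes; the function and folder names are illustrative, not from the repo:

```python
import os

@stub.function(volumes=VOLUME_CONFIG)
def write_run_config(config_raw: str):
    # Files written under /runs are stored in runs_volume.
    os.makedirs("/runs/axo-example", exist_ok=True)
    with open("/runs/axo-example/config.yml", "w") as f:
        f.write(config_raw)
    # Commit so later functions (e.g. train) see the new files.
    VOLUME_CONFIG["/runs"].commit()
```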
Here is a link to the Weights & Biases report on the results of fine-tuning Llama-2 7B on the fraud detection data:
https://api.wandb.ai/links/tab-llm-finetuning/r8wimmyv
Find the detailed log of the fine-tuning in the runs folder.