
FineTuning BLIP2 - various issues #376

Closed
iliasmiraoui opened this issue Apr 27, 2023 · 9 comments

Comments


iliasmiraoui commented Apr 27, 2023

Hello,

Thank you again for the fantastic work on this library and all the examples you are including!
Big thanks to @younesbelkada for all the support as well...

I have been trying to play around with BLIP2 and PEFT using the example notebook (https://colab.research.google.com/drive/16XbIysCzgpAld7Kd9-xz-23VPWmqdWmW?usp=sharing#scrollTo=6cCVhsmJxxjH), and a few things came up that I was hoping to get your help with:

  1. When trying to fine-tune with "salesforce/blip2-flan-t5-xl", I ran into a number of issues with this config:

    config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        bias="none",
        target_modules=["q_proj", "k_proj"],
    )

The q_proj and k_proj layers don't exist in the T5 language model, so I used "q","v" (or just the default values), and that made the loss converge to 0 extremely quickly. However, the model was really just outputting gibberish, so I'm likely not using the right target_modules... How are you supposed to choose this parameter? More generally, is there a heuristic for it, such as T5 -> q,v and OPT -> q_proj,k_proj, and is it different for the standalone language model vs. BLIP2?
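
One way to check which names are valid for a given checkpoint is to list the nn.Linear leaf names inside the language model; those leaf names are what LoraConfig.target_modules matches against. A minimal sketch along those lines, assuming the "salesforce/blip2-flan-t5-xl" checkpoint mentioned above (it loads the full model just to print the names):

import torch
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained("salesforce/blip2-flan-t5-xl")

# Leaf names of every nn.Linear in the language model; LoraConfig.target_modules
# matches module names by their suffix, so these are the strings it can target.
leaf_names = sorted({name.split(".")[-1]
                     for name, module in model.language_model.named_modules()
                     if isinstance(module, torch.nn.Linear)})
print(leaf_names)
# T5-based checkpoints expose q/k/v/o (attention) plus wi_0/wi_1/wo (MLP),
# while OPT-based ones expose q_proj/k_proj/v_proj/out_proj plus fc1/fc2,
# which is why "q","v" work for flan-t5 and "q_proj","k_proj" only for OPT.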

  • I tried using a bigger OPT checkpoint (e.g. "ybelkada/blip2-opt-2.7b-fp16-sharded") and the loss was "nan" the whole time, regardless of what I tried.
  2. Something seemed really odd in the training loop, specifically: outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=input_ids)
  • From my understanding, this would imply that we are passing the very labels we want the model to predict into it as an input?

  • I also tried to modify the notebook to go beyond image captioning and train a VQA model by changing the following:

class ImageCaptioningDataset(Dataset):
    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        encoding = self.processor(images=item["image"],text=item['prompt'], padding="max_length", return_tensors="pt")
        # remove batch dimension
        encoding = {k: v.squeeze() for k, v in encoding.items()}
        encoding["text"] = item["text"]
        return encoding

def collate_fn(batch):
    # pad the input_ids and attention_mask
    processed_batch = {}
    for key in batch[0].keys():
        if key in ["pixel_values",'input_ids']:
            processed_batch[key] = torch.stack([example[key] for example in batch])
        elif key == 'text':
            text_inputs = processor.tokenizer(
                [example["text"] for example in batch], padding=True, return_tensors="pt"
            )
            processed_batch["input_ids_label"] = text_inputs["input_ids"]
            processed_batch["attention_mask_label"] = text_inputs["attention_mask"]
    return processed_batch

# ...and, inside the training loop, the batch is consumed as:
input_ids = batch.pop("input_ids").to(device)
input_ids_label = batch.pop("input_ids_label").to(device)
pixel_values = batch.pop("pixel_values").to(device, torch.float16)

outputs = model(input_ids=input_ids,
                pixel_values=pixel_values,
                labels=input_ids_label)

But then it didn't seem to converge as well as the regular image captioning did, despite the prompt being identical across my whole dataset... Anything I could be doing wrong?

Thanks in advance!
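
Regarding question 2: for the OPT-based (decoder-only) checkpoints, passing labels=input_ids is the standard teacher-forcing setup, because the loss is computed on internally shifted labels, so the prediction at position t is only ever scored against token t+1. A minimal sketch of that shift (the generic causal-LM loss, not BLIP-2-specific code):

import torch
import torch.nn.functional as F

def shifted_causal_lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size), labels: (batch, seq_len) token ids
    shift_logits = logits[:, :-1, :].contiguous()  # prediction at position t ...
    shift_labels = labels[:, 1:].contiguous()      # ... is scored against token t+1
    return F.cross_entropy(shift_logits.view(-1, shift_logits.size(-1)),
                           shift_labels.view(-1))

For the VQA variant with an OPT-based checkpoint, one common pattern (an assumption here, not the notebook's code) is to tokenize prompt and answer together as input_ids, copy them to labels, and set the prompt positions to -100 so only the answer tokens contribute to the loss; the T5 checkpoints are encoder-decoder, which is what the next comment addresses.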

@betterftr

I have tried messing around with BLIP-2 T5 XXL with the same LoraConfig settings (BLIP-2 OPT 6.7B was working fine); it outputs gibberish and the loss converges to 0 way too quickly.

@betterftr

Figured it out: the T5 model expects input_ids as the instructions and labels (which become the decoder_input_ids) as your captions.
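
A minimal sketch of that pattern, assuming the processor, model, image, and device objects from the notebook (the prompt and caption strings are placeholders):

import torch

# Encoder side: the image plus the instruction/prompt.
inputs = processor(images=image,
                   text="Question: what is shown in the picture? Answer:",
                   return_tensors="pt").to(device)

# Decoder side: the caption/answer becomes the labels (T5 derives decoder_input_ids from them).
labels = processor.tokenizer("a cat sitting on a couch", return_tensors="pt").input_ids.to(device)

outputs = model(input_ids=inputs["input_ids"],
                pixel_values=inputs["pixel_values"].to(torch.float16),  # fp16 cast follows the notebook; drop it for full-precision loading
                labels=labels)
loss = outputs.loss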

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

github-actions bot closed this as completed Jul 7, 2023
@bryanchiaws

Figured it out: the T5 model expects input_ids as the instructions and labels (which become the decoder_input_ids) as your captions.

I am getting the following error (only when I use PEFT):

TypeError: forward() got an unexpected keyword argument 'inputs_embeds'

I was wondering if you knew what might be the issue?

Or do you have an example notebook I could look at?
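
One thing to compare against the config pasted at the top of this issue (an assumption about the cause, not a confirmed fix): that LoraConfig does not set a task_type, so PEFT wraps the model with the generic PeftModel, which forwards your kwargs to Blip2ForConditionalGeneration unchanged. If a task_type such as SEQ_2_SEQ_LM or CAUSAL_LM is set, PEFT's task-specific wrapper passes inputs_embeds through, which BLIP-2's forward may not accept, and that can raise exactly this TypeError. A minimal sketch of the task_type-free setup:

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules=["q", "v"],  # T5-style names; use q_proj/v_proj for OPT checkpoints
)

model = get_peft_model(model, config)  # `model` is the loaded Blip2ForConditionalGeneration
model.print_trainable_parameters()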


z3ugma commented Dec 4, 2023

I'm also getting the issue where the loss ends up being all nan after an epoch or two of training; I documented it in huggingface/notebooks#454
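
A hedged sketch of a more defensive training step, assuming the notebook's model, optimizer, train_dataloader, and device objects; the finite-loss guard and the fp16-overflow suspicion are assumptions, not a confirmed diagnosis of huggingface/notebooks#454:

import torch

for batch in train_dataloader:
    input_ids = batch.pop("input_ids").to(device)
    pixel_values = batch.pop("pixel_values").to(device, torch.float16)  # as in the notebook

    outputs = model(input_ids=input_ids, pixel_values=pixel_values, labels=input_ids)
    loss = outputs.loss

    # If the loss is already non-finite, skip the update instead of corrupting the weights.
    # If this triggers regularly, fp16 overflow is a likely suspect; switching the compute
    # dtype to bf16 or fp32 (hardware and memory permitting) is a common mitigation to try.
    if not torch.isfinite(loss):
        optimizer.zero_grad(set_to_none=True)
        continue

    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)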

@NielsRogge
Contributor

pinging @younesbelkada here


z3ugma commented Dec 24, 2023

Still an issue for me after trying various versions of PEFT and PyTorch. A currently non-working system setup:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
Torch 2.0.1+cu117
Datasets 2.16.0
Python 3.11.7 | packaged by conda-forge | (main, Dec 15 2023, 08:38:37) [GCC 12.3.0]
PEFT 0.5.0

@pribadihcr

Hi @z3ugma, have you found a solution yet?

@ChristopheYe

Hi @bryanchiaws,
I have the same error. Did you figure out a way to fix it?
Thanks!
