Unleashing the Power of Phi-3-mini's 4K Context Window with LoRA Distillation (and a Trick for Budget GPUs!) #781
Replies: 2 comments
That is very cool, thanks for sharing! It makes intuitive sense to me that you can trade off in-context learning (e.g. from a long prompt) with fine-tuning. There are a lot of very interesting questions there that I think are not yet explored (or at least I don't know the answers to) around which is more efficient, which works better, and in general in what settings you should prefer one over the other.
Oh wow, thanks for the kind words, @awni! I'm a big fan of MLX and stoked to see how this can push things forward. I totally agree, there's so much more to explore here.
Introduction
Phi-3-mini has been a game-changer, packing impressive performance into a tiny model. The 128K context window is especially exciting (although not yet supported by the MLX library), opening up possibilities for the kind of many-shot learning that recent research has shown can be incredibly effective. But let's face it, most of us don't have access to high-end GPUs with enough VRAM to fully leverage such a long context window.
So, here's my take on a potential workaround: distilling the knowledge from those long contexts into LoRA adapters.
Why This Matters
How It Works (The Short Version)
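The gist, as a rough sketch rather than the exact code from the repo: run the base model with the long, many-shot prompt acting as its own "teacher", then save its answers paired with short zero-shot prompts as training data for a LoRA adapter. The snippet below assumes mlx_lm's `load`/`generate` helpers; the demonstrations, questions, prompt format, and file names are placeholders.

```python
import json
from mlx_lm import load, generate

# Base model acts as its own "teacher" when given the long many-shot context.
model, tokenizer = load("microsoft/Phi-3-mini-4k-instruct")

# Placeholder demonstrations that would normally be stuffed into a long prompt.
demos = [
    {"q": "What distinguishes migraine with aura from migraine without aura?",
     "a": "The presence of transient focal neurological symptoms preceding the headache."},
]
few_shot_block = "\n\n".join(f"Q: {d['q']}\nA: {d['a']}" for d in demos)

# Placeholder questions we want the adapter to answer zero-shot later.
new_questions = ["How long can a typical aura last?"]

with open("train.jsonl", "w") as f:
    for q in new_questions:
        # Teacher pass: answer with the long, many-shot context in the prompt.
        long_prompt = f"{few_shot_block}\n\nQ: {q}\nA:"
        teacher_answer = generate(model, tokenizer, prompt=long_prompt, max_tokens=256)
        # Student target: the same answer paired with a short zero-shot prompt,
        # so the LoRA adapter learns to reproduce it without the long context.
        f.write(json.dumps({"text": f"Q: {q}\nA: {teacher_answer}"}) + "\n")
```

The resulting `train.jsonl` can then go through the usual mlx_lm LoRA fine-tuning flow, which is cheap enough to run on modest hardware.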
Results (So Far)
I've been experimenting with this on a few different tasks, and the initial results are promising! For example, here's a comparison (admittedly cherry-picked for its impressiveness) between zero-shot, n-shot, and LoRA zero-shot performance on a medical question answering task.
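For reference, the comparison itself is just three generations per question, something like the sketch below. It assumes the `adapter_path` keyword of mlx_lm's `load` (available in recent mlx-lm versions) and uses placeholder prompts; the actual benchmark questions and numbers live in the repo.

```python
from mlx_lm import load, generate

MODEL = "microsoft/Phi-3-mini-4k-instruct"
base_model, tokenizer = load(MODEL)
# Same base weights plus the distilled adapter produced in the previous step.
lora_model, _ = load(MODEL, adapter_path="adapters")

few_shot_block = "Q: ...\nA: ..."                   # the long many-shot context (placeholder)
question = "How long can a typical aura last?"      # placeholder eval question

def answer(model, prompt):
    return generate(model, tokenizer, prompt=prompt, max_tokens=256)

zero_shot = answer(base_model, f"Q: {question}\nA:")                      # no context, no adapter
n_shot    = answer(base_model, f"{few_shot_block}\n\nQ: {question}\nA:")  # long context, no adapter
lora_zero = answer(lora_model, f"Q: {question}\nA:")                      # no context, with adapter

print(zero_shot, n_shot, lora_zero, sep="\n---\n")
```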
Benefits
Potential Use Case
Picture this: a library of LoRA adapters on your hard drive, each fine-tuned to enhance a specific skillset of your base LLM. One adapter turns it into an MLX library guru, while another equips it with expert knowledge of ICHD-3 headache classifications. This modular approach enables efficient, granular updates, so your AI's expertise stays current without retraining the entire model. And with an sLLM like Phi-3-mini, this can run on an everyday cellphone or a Raspberry Pi!
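In code, swapping "experts" would just mean pointing `load` at a different adapter directory. A sketch of the idea, with made-up adapter names and assuming mlx_lm's `adapter_path` keyword:

```python
from mlx_lm import load, generate

# Hypothetical local library of skill-specific adapters for the same base model.
ADAPTER_LIBRARY = {
    "mlx-guru": "adapters/mlx_library",
    "headache-expert": "adapters/ichd3",
}

def load_expert(skill: str):
    """Load the shared Phi-3-mini base together with the adapter for one skill."""
    return load("microsoft/Phi-3-mini-4k-instruct", adapter_path=ADAPTER_LIBRARY[skill])

model, tokenizer = load_expert("headache-expert")
print(generate(model, tokenizer,
               prompt="Q: What distinguishes migraine with aura?\nA:",
               max_tokens=128))
```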
GitHub
I'm sharing my code and initial findings on GitHub, and I would love to hear your thoughts, ideas, and feedback!