
VLM support for image and video processing with SmolVLM support #206

Open · wants to merge 37 commits into main

Conversation

cyrilzakka

Hey all,

@pcuenca and I are submitting a PR to add support for image and video inference, along with built-in support for SmolVLM. Would love a second pair of eyes on this!

cyrilzakka and others added 30 commits February 12, 2025 10:21
Text inputs, with hardcoded values and considering a single image.
Image patching still not done.

You need to define HF_TOKEN in the environment to be able to download
the model.
I believe pre-processing matches transformers', but inference fails
because of some dimension mismatch.
The configuration fixes that make this work have been applied.
Generation (single image) works now 🔥
Also changed the input type to `image` to keep the sequence of frames
untouched :)
Additional smolvlm changes and adjustments
Images are always upscaled, so always tiled.
Fix single image pre-processing
@awni requested a review from davidkoski · February 20, 2025 14:24
@awni (Member) commented Feb 20, 2025

Wow awesome PR! Thanks! @davidkoski is out for a few more days so apologies for the delay reviewing and getting this merged but we'll definitely get it landed as soon as possible.

@pcuenca (Contributor) commented Feb 20, 2025

No rush! Happy to iterate when David is back!

@chenemii

Came from the Hugging Face blog, very cool! Tried it on a 13 Pro Max; it works for some videos but crashes a lot. Is there a device requirement?

@pcuenca (Contributor) commented Feb 22, 2025

Hi @chenemii! We have tested on iPhone 14 to 16, and haven't had time to work on much optimization yet. It probably crashes on your iPhone because of peak RAM use while processing video. The problem is not the amount of RAM in the device, but the per-process limits enforced by iOS, which vary per model family.

We'll run more tests when we open the TestFlight beta; if you want, you can sign up here.

@pcuenca (Contributor) commented Feb 22, 2025

@chenemii A couple of ideas though:

  • Limit the max total number of frames to something like 20 here. The default configuration uses 64.
  • Limit Metal memory with something like the following. It could result in slower execution:
                import MLX
                import os  // provides os_proc_available_memory()

                // Cap Metal memory at ~82% of the per-process limit iOS grants.
                let maxMetalMemory = Int(round(0.82 * Double(os_proc_available_memory())))
                MLX.GPU.set(memoryLimit: maxMetalMemory, relaxed: false)

@chenemii commented Feb 22, 2025

@pcuenca Good point, signed up for testing. I can help validate on the 13.

@davidkoski (Collaborator)

I am back -- I will look at this today or tomorrow. Very exciting!

var maxProcessingImageSize: CGFloat { CGFloat(config.size.longestEdge) } // 2048
var fixedImageSize: CGFloat { CGFloat(config.maxImageSize.longestEdge) } // 384 for big models, 512 for small models (200-500M)
var imageSequenceLength: Int { config.imageSequenceLength }
var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ }
Contributor

Suggested change:
-   var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ }
+   var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ } // Limited to reduce memory consumption on phones

Contributor

This is ugly. I think the property should reflect the configuration, but then we'd need to be able to override it when needed.
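
For illustration, one hedged way to do that (the config path follows the commented-out code above; the override property is hypothetical):

    // Keep the configuration as the source of truth; an optional override
    // lets memory-constrained callers clamp the frame count without
    // touching the config.
    var maxVideoFramesOverride: Int? = nil
    var maxVideoFrames: Int { maxVideoFramesOverride ?? config.videoSampling.maxFrames }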

@davidkoski (Collaborator)


Apologies, I have been tied up but I will get to this as soon as I can.

@davidkoski (Collaborator)

I synced it up with main and am trying it out now.

let videoSystemPrompt =
"Focus only on describing the key dramatic action or notable event occurring in this video segment. Skip general context or scene-setting details unless they are crucial to understanding the main action."
let imageSystemPrompt =
"You are an image understanding model capable of describing the salient features of any image."
Collaborator

I think these are fine for now, but this + the message formatting needs to be figured out (later :-) )

Contributor

Agreed!

@@ -205,6 +210,9 @@ struct ContentView: View {
.disabled(llm.running)
}
}
.onAppear {
selectedVideoURL = Bundle.main.url(forResource: "test", withExtension: "mp4")!
}
Collaborator

This is nice for testing, but I think we should probably remove the example asset from the example app -- force people to use their own images & videos. Also, I don't know the license on this video :-)

On the other hand, this is meant as an example for developers to build on, and maybe it is good to have something ready to go? Anyone have any thoughts on this?

Contributor

I'm fine either way. cc @cyrilzakka on the video rights (but also for opinion) :)
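
If the sample asset stays, one hedged tweak to the snippet above (same identifiers as the diff; the behavior change is ours) avoids crashing when the file is missing:

    .onAppear {
        // Fall back to no preselected video instead of force-unwrapping.
        if let url = Bundle.main.url(forResource: "test", withExtension: "mp4") {
            selectedVideoURL = url
        }
    }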

@@ -322,10 +330,10 @@ class VLMEvaluator {

/// This controls which model loads. `qwen2VL2BInstruct4Bit` is one of the smaller ones, so this will fit on
/// more devices.
-   let modelConfiguration = ModelRegistry.qwen2VL2BInstruct4Bit
+   let modelConfiguration = ModelRegistry.smolvlm
Collaborator

We should revert this before merging, or update the comment if we think this is a better default model.

Contributor

Sure, will revert; this was meant for our own testing.

let totalDuration: String
}

// TODO: verify working color space, rendering color space
Collaborator

This is a good idea. I think the Python processing code roughly preserves the colorspace of the input: no conversion to linear, and what is called "device RGB" (don't touch my colors). In other words it isn't managed color, but that is what we have here.

We could certainly do something like use the non-linear form of the input colorspace and output to the same. In practice I am not sure it matters that much. These models are probably trained on consistent colorspace inputs (though sRGB is likely, displayP3 from iPhone images is pretty likely, and videos are much more diverse).

Maybe this should turn into an issue?

Collaborator

That said: I don't think we should try to replicate the unmanaged colorspace of the python version. I think we should pick a colorspace (sRGB or displayP3) and be consistent.

Contributor

Yes, it makes sense to turn this into an issue. I also think the Python pre-processing is mostly oblivious to colorspace.
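
For illustration, a minimal sketch of the "pick one colorspace and be consistent" option (choosing sRGB is an assumption, not a decision from this thread):

    import CoreImage

    // Pinning both the working and output colorspaces means sRGB photos,
    // Display P3 iPhone shots, and video frames all render to the same
    // pixel values before pre-processing.
    let srgb = CGColorSpace(name: CGColorSpace.sRGB)!
    let context = CIContext(options: [.workingColorSpace: srgb, .outputColorSpace: srgb])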

@@ -87,6 +94,35 @@ public enum MediaProcessing {
return rescaled.cropped(to: CGRect(origin: .zero, size: size))
}

/// Resample the image using Lanczos interpolation.
static public func resampleLanczos(_ image: CIImage, to size: CGSize) -> CIImage {
Collaborator

Smol uses Lanczos? I agree it is the better resampling method for humans, but the sinc it simulates has an edge-strengthening effect -- I am surprised to see it used here.

Contributor

Yes, it does. I was surprised too when I saw it, but didn't follow up with the team. cc @mfarre, just curious if there's any insight :)


This is inherited from the Idefics3 image processor :)

// set the aspect ratio to match the aspect ratio of the target
let inputAspectRatio = extent.width / extent.height
let desiredAspectRatio = size.width / size.height
filter.aspectRatio = Float(1 / inputAspectRatio * desiredAspectRatio)
Collaborator
The reason will be displayed to describe this comment to others. Learn more.

I wonder if this size/aspect ratio code should be refactored to be shared between the resampling methods?
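
For illustration, a hedged sketch of such a helper (the name is hypothetical; the math mirrors the diff above):

    /// Scale and aspect-ratio parameters shared by the CI scale filters:
    /// `scale` maps the height to the target, and `aspectRatio` corrects
    /// the width so the output hits the requested size.
    static func scaleParameters(from extent: CGRect, to size: CGSize) -> (scale: Float, aspectRatio: Float) {
        let inputAspectRatio = extent.width / extent.height
        let desiredAspectRatio = size.width / size.height
        let scale = Float(size.height / extent.height)
        return (scale, Float(1 / inputAspectRatio * desiredAspectRatio))
    }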

@@ -199,4 +236,108 @@ public enum MediaProcessing {

return ciImages
}

static public func asCIImageSequence(
Collaborator

This is pretty similar to the method above it. I wonder if we should have a VideoParameters struct that holds these values, plus a single method that takes it. That method would be concerned with timing and any properties needed to read out the video (e.g. if we wanted to convert pixel formats).

See also #223 -- maybe we can factor this part out, as some of the other PRs would make use of it.
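
For illustration, a hedged sketch of that struct (all names and defaults are assumptions):

    import CoreVideo

    /// Everything a single frame-extraction method would need, so the two
    /// near-duplicate methods could collapse into one.
    public struct VideoParameters {
        public var maxFrames: Int = 20                   // cap on sampled frames
        public var samplesPerSecond: Double = 1          // sampling rate
        public var pixelFormat: OSType = kCVPixelFormatType_32BGRA
    }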

public var ropeTraditional: Bool { _ropeTraditional ?? false }
-   public let tieWordEmbeddings: Bool
+   public var tieWordEmbeddings: Bool { _tieWordEmbeddings ?? false }
Collaborator

No change required here, but I saw there are some macro packages that help with default values and Codable -- maybe we should adopt one.
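
For context, the hand-rolled pattern the diff uses (the type name and CodingKeys mapping are assumptions about the actual code):

    public struct TextConfiguration: Codable {
        private let _tieWordEmbeddings: Bool?
        public var tieWordEmbeddings: Bool { _tieWordEmbeddings ?? false }

        enum CodingKeys: String, CodingKey {
            case _tieWordEmbeddings = "tie_word_embeddings"
        }
    }

A macro package could collapse the backing property, computed default, and key mapping into one annotated declaration.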

}
}

public class SmolVLMProcessor: UserInputProcessor {
Collaborator

This isn't that small (smol?) -- I wonder if the Processor (and its config) belongs in its own file? I see how it uses the same model implementation (via VLMModelFactory), but it might be clearer to people browsing if it had its own file named after Smol.

Contributor

Yes, it grew a bit more than we expected, and I was hesitant about whether to split it.


/// parameters controlling the output
-   let generateParameters = MLXLMCommon.GenerateParameters(temperature: 0.6)
+   let generateParameters = MLXLMCommon.GenerateParameters(temperature: 0.7, topP: 0.9)
Contributor

These parameters are also SmolVLM-specific.
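
For illustration, one hedged way to keep those out of the shared UI (ModelDefaults is hypothetical; GenerateParameters and its labels come from the diff above):

    // Carry per-model sampling defaults alongside the registry entry so the
    // example app doesn't hardcode SmolVLM-specific values.
    struct ModelDefaults {
        let generateParameters: MLXLMCommon.GenerateParameters
    }

    let smolvlmDefaults = ModelDefaults(
        generateParameters: .init(temperature: 0.7, topP: 0.9))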

@@ -663,9 +672,12 @@ public class Idefics3: Module, VLMModel, KVCacheDimensionProvider {
return final
}

// inputs_merger
// TODO: why did we need to do changes here? Do we need a new modelling class, or did this never work (for tiling)?
Contributor

This is actually a pending to-do for Idefics3. We can remove the comment here, but we should revisit whether this works for the previous SmolVLM.
