VLM support for image and video processing with SmolVLM support #206
base: main

Conversation
Video/image fixes
Text inputs, with hardcoded values, considering a single image. Image patching is still not done. You need to define `HF_TOKEN` in the environment to be able to download the model.
I believe pre-processing matches transformers', but inference fails because of some dimension mismatch.
The configuration fixes that make this work have been applied.
Generation (single image) works now 🔥
Also changed the input type to `image` to keep the sequence of frames untouched :)
smolvlm processing
Some cleanup
Additional smolvlm changes and adjustments
Images are always upscaled, so always tiled.
Fix single image pre-processing
Wow, awesome PR! Thanks! @davidkoski is out for a few more days, so apologies for the delay in reviewing and getting this merged, but we'll definitely get it landed as soon as possible.
No rush! Happy to iterate when David is back!
Came from the Hugging Face blog, very cool! Tried on a 13 Pro Max: it works for some videos but crashes a lot. Is there a device requirement?
Hi @chenemii! We have tested on iPhone 14 through 16, and haven't had time to work on much optimization yet. It probably crashes on your iPhone because of peak RAM use while processing video. The problem is not the amount of RAM in the device, but the per-process limits enforced by iOS, which vary per model family. We'll run more tests when we open the TestFlight beta; if you want, you can sign up here.
@chenemii A couple of ideas though -- for example, capping MLX's Metal memory limit at a fraction of the memory iOS makes available to the process:
```swift
import MLX
import os  // for os_proc_available_memory()

// Cap Metal allocations at ~82% of the memory iOS allows this process,
// leaving headroom for the rest of the app.
let maxMetalMemory = Int(round(0.82 * Double(os_proc_available_memory())))
MLX.GPU.set(memoryLimit: maxMetalMemory, relaxed: false)
```
@pcuenca Good point, signed up for testing. I can help validate on the 13.
I am back -- I will look at this today or tomorrow. Very exciting!
```swift
var maxProcessingImageSize: CGFloat { CGFloat(config.size.longestEdge) }  // 2048
var fixedImageSize: CGFloat { CGFloat(config.maxImageSize.longestEdge) }  // 384 for big models, 512 for small models (200-500M)
var imageSequenceLength: Int { config.imageSequenceLength }
var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ }
```
Suggested change:
```diff
-var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ }
+var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ } // Limited to reduce memory consumption on phones
```
This is ugly. I think the property should reflect the configuration, but then we'd need to be able to override it when needed.
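One possible shape for that (a minimal sketch; `maxVideoFramesOverride` and the exact config path are assumptions, not code from this PR):

```swift
// Hypothetical: default to the configuration value, but allow callers to
// override it, e.g. to cap frames on memory-constrained phones.
var maxVideoFramesOverride: Int? = nil
var maxVideoFrames: Int { maxVideoFramesOverride ?? config.videoSampling.maxFrames }
```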
Apologies, I have been tied up, but I will get to this as soon as I can.
I synced it up with
```swift
let videoSystemPrompt =
    "Focus only on describing the key dramatic action or notable event occurring in this video segment. Skip general context or scene-setting details unless they are crucial to understanding the main action."
let imageSystemPrompt =
    "You are an image understanding model capable of describing the salient features of any image."
```
I think these are fine for now, but this + the message formatting needs to be figured out (later :-) )
Agreed!
```diff
@@ -205,6 +210,9 @@ struct ContentView: View {
             .disabled(llm.running)
         }
     }
+    .onAppear {
+        selectedVideoURL = Bundle.main.url(forResource: "test", withExtension: "mp4")!
+    }
```
This is nice for testing, but I think we should probably remove the example asset from the example -- force people to use their own images & videos. Also, I don't know the license on this video :-)
On the other hand, this is meant as an example for developers to build on, and maybe it is good to have something ready to go? Does anyone have any thoughts on this?
I'm fine either way. cc @cyrilzakka on the video rights (but also for opinion) :)
```diff
@@ -322,10 +330,10 @@ class VLMEvaluator {

     /// This controls which model loads. `qwen2VL2BInstruct4Bit` is one of the smaller ones, so this will fit on
     /// more devices.
-    let modelConfiguration = ModelRegistry.qwen2VL2BInstruct4Bit
+    let modelConfiguration = ModelRegistry.smolvlm
```
We should revert this before merging, or, if we think this is a better default model, update the comment.
Sure, will revert, this was meant for our own testing.
```diff
     let totalDuration: String
 }

+// TODO: verify working color space, rendering color space
```
This is a good idea. I think the Python processing code roughly preserves the colorspace of the input: no conversion to linear, and what is called "device RGB" (don't touch my colors). In other words, it isn't color managed, but that is what we have here.
We could certainly do something like use the non-linear form of the input colorspace and output to the same. In practice I am not sure it matters that much. These models are probably trained on consistent colorspace inputs (sRGB is likely, Display P3 from iPhone images is pretty likely, and videos are much more diverse).
Maybe this should turn into an issue?
That said: I don't think we should try to replicate the unmanaged colorspace of the Python version. I think we should pick a colorspace (sRGB or Display P3) and be consistent.
Yes, it makes sense to turn this into an issue. I also think the Python pre-processing is mostly oblivious to colorspace.
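For reference, a minimal sketch of what "pick one and be consistent" could look like with Core Image (choosing sRGB here is an assumption, not a decision made in this thread):

```swift
import CoreImage

// Render through a fixed working/output color space so pre-processing
// doesn't depend on the input's embedded profile.
let srgb = CGColorSpace(name: CGColorSpace.sRGB)!
let context = CIContext(options: [
    .workingColorSpace: srgb,
    .outputColorSpace: srgb,
])
```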
```diff
@@ -87,6 +94,35 @@ public enum MediaProcessing {
         return rescaled.cropped(to: CGRect(origin: .zero, size: size))
     }

+    /// Resample the image using Lanczos interpolation.
+    static public func resampleLanczos(_ image: CIImage, to size: CGSize) -> CIImage {
```
Smol uses Lanczos? I agree it is the better resampling method for humans, but the sinc it simulates has an edge-strengthening effect -- I am surprised to see it used here.
Yes, it does, I was surprised too when I saw it but didn't follow up with the team. cc @mfarre, just curious if there's any insight :)
This is inherited from the Idefics3 image processor :)
```swift
// set the aspect ratio to match the aspect ratio of the target
let inputAspectRatio = extent.width / extent.height
let desiredAspectRatio = size.width / size.height
filter.aspectRatio = Float(1 / inputAspectRatio * desiredAspectRatio)
```
I wonder if this size/aspect ratio code should be refactored to be shared between the resampling methods?
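For context, a hedged sketch of how the fragments above could fit together using Core Image's built-in Lanczos filter (the scale computation is an assumption; only the signature and the aspect-ratio lines appear in the diff):

```swift
import CoreImage
import CoreImage.CIFilterBuiltins

/// Sketch only, not the PR's exact implementation.
func resampleLanczos(_ image: CIImage, to size: CGSize) -> CIImage {
    let filter = CIFilter.lanczosScaleTransform()
    filter.inputImage = image
    // scale the height to the target, then correct the width below
    filter.scale = Float(size.height / image.extent.height)
    // set the aspect ratio to match the aspect ratio of the target
    let inputAspectRatio = image.extent.width / image.extent.height
    let desiredAspectRatio = size.width / size.height
    filter.aspectRatio = Float(1 / inputAspectRatio * desiredAspectRatio)
    return filter.outputImage!
}
```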
```diff
@@ -199,4 +236,108 @@ public enum MediaProcessing {

         return ciImages
     }

+    static public func asCIImageSequence(
```
This is pretty similar to the method above it. I wonder if we should have a `VideoParameters` struct that holds these values, and one method that takes it. This method should be concerned with timing and any properties needed to read out the video (e.g. if we wanted to convert pixel formats).
See also #223 -- maybe we can factor this part out, as some of the other PRs would make use of it. A rough sketch of the idea follows.
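A rough sketch of that struct (all names and fields are hypothetical):

```swift
import CoreGraphics

// Hypothetical: the knobs the two frame-extraction methods currently take,
// gathered into one value a single shared method could accept.
struct VideoParameters {
    var maxFrames: Int = 20
    var samplesPerSecond: Double = 1.0
    var targetSize: CGSize? = nil
    // room for pixelFormat, colorspace, etc. if conversion is needed
}
```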
```diff
 public var ropeTraditional: Bool { _ropeTraditional ?? false }
-public let tieWordEmbeddings: Bool
+public var tieWordEmbeddings: Bool { _tieWordEmbeddings ?? false }
```
No change required here, but I saw there were some macro packages that help with default values and Codable -- maybe we should adopt one. A sketch of the pattern they replace follows.
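For reference, a sketch of the non-macro property-wrapper pattern such packages formalize (names are hypothetical; this is not code from this repo):

```swift
// A wrapper that decodes to `false` when the key is absent, so call
// sites don't need the `_foo ?? false` dance.
@propertyWrapper
struct DefaultFalse: Codable {
    var wrappedValue: Bool
    init(wrappedValue: Bool = false) { self.wrappedValue = wrappedValue }
    init(from decoder: Decoder) throws {
        wrappedValue = try decoder.singleValueContainer().decode(Bool.self)
    }
    func encode(to encoder: Encoder) throws {
        var container = encoder.singleValueContainer()
        try container.encode(wrappedValue)
    }
}

extension KeyedDecodingContainer {
    // Missing keys fall back to the wrapper's default instead of throwing.
    func decode(_ type: DefaultFalse.Type, forKey key: Key) throws -> DefaultFalse {
        try decodeIfPresent(type, forKey: key) ?? DefaultFalse()
    }
}

struct SomeConfig: Codable {
    @DefaultFalse var tieWordEmbeddings: Bool
}
```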
```diff
     }
 }

+public class SmolVLMProcessor: UserInputProcessor {
```
This isn't that small (smol?) -- I wonder if the Processor (and its config) belongs in its own file? I see how it uses the same model implementation (via VLMModelFactory), but it might be clearer to people browsing if it had its own file named after Smol.
Yes, it grew a bit more than we expected, and I was hesitant about whether to split it.
```diff
 /// parameters controlling the output
-let generateParameters = MLXLMCommon.GenerateParameters(temperature: 0.6)
+let generateParameters = MLXLMCommon.GenerateParameters(temperature: 0.7, topP: 0.9)
```
These parameters are also SmolVLM-specific.
```diff
@@ -663,9 +672,12 @@ public class Idefics3: Module, VLMModel, KVCacheDimensionProvider {
         return final
     }

+    // inputs_merger
+    // TODO: why did we need to do changes here? Do we need a new modelling class, or did this never work (for tiling)?
```
This is actually a pending to-do for Idefics3. We can remove the comment here, but revisit whether this works for the previous SmolVLM.
Hey all,
@pcuenca and I are submitting a PR to add support for image and video inference, along with built-in support for SmolVLM. Would love a second pair of eyes on this!