VLM support for image and video processing with SmolVLM support #206
base: main

Conversation
Video/image fixes
Text inputs, with hardcoded values, considering a single image. Image patching is still not done. You need to define `HF_TOKEN` in the environment to be able to download the model.
I believe pre-processing matches transformers', but inference fails because of some dimension mismatch.
The configuration fixes that make this work have been applied.
Generation (single image) works now 🔥
Also changed the input type to `image` to keep the sequence of frames untouched :)
smolvlm processing
Some cleanup
Additional smolvlm changes and adjustments
Images are always upscaled, so always tiled.
Fix single image pre-processing
Wow, awesome PR! Thanks! @davidkoski is out for a few more days, so apologies for the delay in reviewing and getting this merged, but we'll definitely get it landed as soon as possible.
No rush! Happy to iterate when David is back!
Came from the Hugging Face blog, very cool! Tried on a 13 Pro Max: it works for some videos but crashes a lot. Is there a device requirement?
Hi @chenemii! We have tested on iPhone 14 through 16, and haven't had time to work on much optimization yet. It probably crashes on your iPhone because of peak RAM use while processing video. The problem is not the amount of RAM in the device, but the per-process limits enforced by iOS, which vary per model family. We'll run more tests when we open the TestFlight beta; if you want, you can sign up here.
@chenemii A couple of ideas though -- for example, capping MLX's Metal memory limit at a fraction of the memory iOS makes available to the process:
```swift
import MLX
import os  // for os_proc_available_memory()

// Cap Metal allocations at ~82% of the memory iOS allows this process,
// leaving headroom for the rest of the app.
let maxMetalMemory = Int(round(0.82 * Double(os_proc_available_memory())))
MLX.GPU.set(memoryLimit: maxMetalMemory, relaxed: false)
```
@pcuenca Good point, signed up for testing. I can help validate on the 13.
I am back -- I will look at this today or tomorrow. Very exciting!
```swift
var maxProcessingImageSize: CGFloat { CGFloat(config.size.longestEdge) }  // 2048
var fixedImageSize: CGFloat { CGFloat(config.maxImageSize.longestEdge) }  // 384 for big models, 512 for small models (200-500M)
var imageSequenceLength: Int { config.imageSequenceLength }
var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ }
```
Suggested change:
```diff
-var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ }
+var maxVideoFrames: Int { 20 /*config.videoSampling.maxFrames*/ } // Limited to reduce memory consumption on phones
```
This is ugly. I think the property should reflect the configuration, but then we'd need to be able to override it when needed.
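One possible shape for that (a minimal sketch; `maxVideoFramesOverride` and the exact config path are assumptions, not code from this PR):

```swift
// Hypothetical: default to the configuration value, but allow callers to
// override it, e.g. to cap frames on memory-constrained phones.
var maxVideoFramesOverride: Int? = nil
var maxVideoFrames: Int { maxVideoFramesOverride ?? config.videoSampling.maxFrames }
```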
Apologies, I have been tied up, but I will get to this as soon as I can.
I synced it up with
```swift
let videoSystemPrompt =
    "Focus only on describing the key dramatic action or notable event occurring in this video segment. Skip general context or scene-setting details unless they are crucial to understanding the main action."
let imageSystemPrompt =
    "You are an image understanding model capable of describing the salient features of any image."
```
I think these are fine for now, but this + the message formatting needs to be figured out (later :-) )
Agreed!
```diff
@@ -205,6 +210,9 @@ struct ContentView: View {
             .disabled(llm.running)
         }
     }
+    .onAppear {
+        selectedVideoURL = Bundle.main.url(forResource: "test", withExtension: "mp4")!
+    }
```
This is nice for testing, but I think we should probably remove the example asset from the example -- force people to use their own images & videos. Also, I don't know the license on this video :-)
On the other hand, this is meant as an example for developers to build on, and maybe it is good to have something ready to go? Does anyone have any thoughts on this?
I'm fine either way. cc @cyrilzakka on the video rights (but also for opinion) :)
```diff
@@ -322,10 +330,10 @@ class VLMEvaluator {

     /// This controls which model loads. `qwen2VL2BInstruct4Bit` is one of the smaller ones, so this will fit on
     /// more devices.
-    let modelConfiguration = ModelRegistry.qwen2VL2BInstruct4Bit
+    let modelConfiguration = ModelRegistry.smolvlm
```
We should revert this before merging, or, if we think this is a better default model, update the comment.
Sure, will revert, this was meant for our own testing.
```diff
     let totalDuration: String
 }

+// TODO: verify working color space, rendering color space
```
This is a good idea. I think the Python processing code roughly preserves the colorspace of the input: no conversion to linear, and what is called "device RGB" (don't touch my colors). In other words, it isn't color managed, but that is what we have here.
We could certainly do something like use the non-linear form of the input colorspace and output to the same. In practice I am not sure it matters that much. These models are probably trained on consistent colorspace inputs (sRGB is likely, Display P3 from iPhone images is pretty likely, and videos are much more diverse).
Maybe this should turn into an issue?
That said: I don't think we should try to replicate the unmanaged colorspace of the Python version. I think we should pick a colorspace (sRGB or Display P3) and be consistent.
Yes, it makes sense to turn this into an issue. I also think the Python pre-processing is mostly oblivious to colorspace.
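For reference, a minimal sketch of what "pick one and be consistent" could look like with Core Image (choosing sRGB here is an assumption, not a decision made in this thread):

```swift
import CoreImage

// Render through a fixed working/output color space so pre-processing
// doesn't depend on the input's embedded profile.
let srgb = CGColorSpace(name: CGColorSpace.sRGB)!
let context = CIContext(options: [
    .workingColorSpace: srgb,
    .outputColorSpace: srgb,
])
```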
```diff
@@ -87,6 +94,35 @@ public enum MediaProcessing {
         return rescaled.cropped(to: CGRect(origin: .zero, size: size))
     }

+    /// Resample the image using Lanczos interpolation.
+    static public func resampleLanczos(_ image: CIImage, to size: CGSize) -> CIImage {
```
Smol uses Lanczos? I agree it is the better resampling method for humans, but the sinc it simulates has an edge-strengthening effect -- I am surprised to see it used here.
Yes, it does, I was surprised too when I saw it but didn't follow up with the team. cc @mfarre, just curious if there's any insight :)
This is inherited from the Idefics3 image processor :)
```swift
// set the aspect ratio to match the aspect ratio of the target
let inputAspectRatio = extent.width / extent.height
let desiredAspectRatio = size.width / size.height
filter.aspectRatio = Float(1 / inputAspectRatio * desiredAspectRatio)
```
I wonder if this size/aspect ratio code should be refactored to be shared between the resampling methods?
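For context, a hedged sketch of how the fragments above could fit together using Core Image's built-in Lanczos filter (the scale computation is an assumption; only the signature and the aspect-ratio lines appear in the diff):

```swift
import CoreImage
import CoreImage.CIFilterBuiltins

/// Sketch only, not the PR's exact implementation.
func resampleLanczos(_ image: CIImage, to size: CGSize) -> CIImage {
    let filter = CIFilter.lanczosScaleTransform()
    filter.inputImage = image
    // scale the height to the target, then correct the width below
    filter.scale = Float(size.height / image.extent.height)
    // set the aspect ratio to match the aspect ratio of the target
    let inputAspectRatio = image.extent.width / image.extent.height
    let desiredAspectRatio = size.width / size.height
    filter.aspectRatio = Float(1 / inputAspectRatio * desiredAspectRatio)
    return filter.outputImage!
}
```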
```diff
@@ -199,4 +236,108 @@ public enum MediaProcessing {

         return ciImages
     }

+    static public func asCIImageSequence(
```
This is pretty similar to the method above it. I wonder if we should have a `VideoParameters` struct that holds these values, and one method that takes it. This method should be concerned with timing and any properties needed to read out the video (e.g. if we wanted to convert pixel formats).
See also #223 -- maybe we can factor this part out, as some of the other PRs would make use of it. A rough sketch of the idea follows.
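A rough sketch of that struct (all names and fields are hypothetical):

```swift
import CoreGraphics

// Hypothetical: the knobs the two frame-extraction methods currently take,
// gathered into one value a single shared method could accept.
struct VideoParameters {
    var maxFrames: Int = 20
    var samplesPerSecond: Double = 1.0
    var targetSize: CGSize? = nil
    // room for pixelFormat, colorspace, etc. if conversion is needed
}
```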
```diff
 public var ropeTraditional: Bool { _ropeTraditional ?? false }
-public let tieWordEmbeddings: Bool
+public var tieWordEmbeddings: Bool { _tieWordEmbeddings ?? false }
```
No change required here, but I saw there were some macro packages that help with default values and Codable -- maybe we should adopt one. A sketch of the pattern they replace follows.
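For reference, a sketch of the non-macro property-wrapper pattern such packages formalize (names are hypothetical; this is not code from this repo):

```swift
// A wrapper that decodes to `false` when the key is absent, so call
// sites don't need the `_foo ?? false` dance.
@propertyWrapper
struct DefaultFalse: Codable {
    var wrappedValue: Bool
    init(wrappedValue: Bool = false) { self.wrappedValue = wrappedValue }
    init(from decoder: Decoder) throws {
        wrappedValue = try decoder.singleValueContainer().decode(Bool.self)
    }
    func encode(to encoder: Encoder) throws {
        var container = encoder.singleValueContainer()
        try container.encode(wrappedValue)
    }
}

extension KeyedDecodingContainer {
    // Missing keys fall back to the wrapper's default instead of throwing.
    func decode(_ type: DefaultFalse.Type, forKey key: Key) throws -> DefaultFalse {
        try decodeIfPresent(type, forKey: key) ?? DefaultFalse()
    }
}

struct SomeConfig: Codable {
    @DefaultFalse var tieWordEmbeddings: Bool
}
```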
```diff
     }
 }

+public class SmolVLMProcessor: UserInputProcessor {
```
This isn't that small (smol?) -- I wonder if the Processor (and its config) belongs in its own file? I see how it uses the same model implementation (via VLMModelFactory), but it might be clearer to people browsing if it had its own file named after Smol.
Yes, it grew a bit more than we expected, and I was hesitant about whether to split it.
```diff
 /// parameters controlling the output
-let generateParameters = MLXLMCommon.GenerateParameters(temperature: 0.6)
+let generateParameters = MLXLMCommon.GenerateParameters(temperature: 0.7, topP: 0.9)
```
These parameters are also SmolVLM-specific.
```diff
@@ -663,9 +672,12 @@ public class Idefics3: Module, VLMModel, KVCacheDimensionProvider {
         return final
     }

+    // inputs_merger
+    // TODO: why did we need to do changes here? Do we need a new modelling class, or did this never work (for tiling)?
```
This is actually a pending to-do for Idefics3. We can remove the comment here, but revisit whether this works for the previous SmolVLM.
Hey all,
@pcuenca and I are submitting a PR to add support for image and video inference, along with built-in support for SmolVLM. Would love a second pair of eyes on this!