
Experimental: tokenizers with and without templates #168

Open · wants to merge 5 commits into main

Conversation

pcuenca (Member) commented Jan 30, 2025

This builds on top of #166 by @greenrazer.

The goal is to be able to opt in to the chat template feature, which carries the Jinja dependency, which in turn depends on swift-collections and so on. It works in its current state, but it's ugly: I found far more cross-module quirks than I was expecting. I wanted to make this as easy as possible for consumers and allow them to use either version (with or without templates) from their SPM manifests.

Opinions, corrections, and alternative ideas are encouraged and most welcome!


How to use:

  • If you want to use the core Tokenizers functionality, without the chat templates:

Add the following dependency to your target, as usual (except that Tokenizers is now a standalone product, so there's no need to depend on the full Transformers library). This applies to projects such as WhisperKit.

            dependencies: [
                .product(name: "Tokenizers", package: "swift-transformers"),
            ],
  • To opt in to chat templates:

This applies to projects such as mlx-swift-examples.

            dependencies: [
                .product(name: "Tokenizers", package: "swift-transformers"),
+               .product(name: "TokenizersTemplates", package: "swift-transformers"),
            ],

That's it: you don't need to import TokenizersTemplates or do anything other than declare it as a dependency.
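
For reference, a complete consumer manifest with the opt-in product declared could look like the sketch below. The package URL, branch, platform list, and target name are placeholders rather than anything from this PR; only the two .product entries come from the examples above.

    // swift-tools-version:5.9
    import PackageDescription

    // Sketch of a consumer package that opts in to chat templates.
    let package = Package(
        name: "MyApp",
        platforms: [.macOS(.v13), .iOS(.v16)],
        dependencies: [
            // Placeholder URL/branch: point this at swift-transformers as usual.
            .package(url: "https://github.com/huggingface/swift-transformers", branch: "main"),
        ],
        targets: [
            .executableTarget(
                name: "MyApp",
                dependencies: [
                    .product(name: "Tokenizers", package: "swift-transformers"),
                    // Declaring this product is all that's needed to opt in;
                    // no `import TokenizersTemplates` is required in source files.
                    .product(name: "TokenizersTemplates", package: "swift-transformers"),
                ]
            ),
        ]
    )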


Known issues:

  • I haven't looked at the tests yet; they probably won't compile.
  • I know there are conflicts because of last night's tools PR. Let's agree on the general direction before addressing them.

try applyChatTemplate(messages: messages, chatTemplate: .literal(chatTemplate), addGenerationPrompt: true, truncation: false, maxLength: nil, tools: nil)
}
}

pcuenca (Member, Author):

See comment in TokenizersTemplates

]

open class PreTrainedTokenizerWithTemplates : PreTrainedTokenizer {
// I don't know why these need to be here. They are implemented in the protocol, **and** in the superclass.
pcuenca (Member, Author):

Yes, if these overrides don't exist, the linker can't find the implementations.
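
For illustration, here is a minimal sketch of the kind of redundant override being discussed, modeled on the applyChatTemplate call shown in the diff context above; the exact method list, parameter types, and return type live in the PR diff and may differ.

    import TokenizersCore

    open class PreTrainedTokenizerWithTemplates: PreTrainedTokenizer {
        // Sketch only: this forwarding override already exists in the protocol
        // extension and in the superclass, but without restating it here the
        // linker can't resolve the implementation across modules.
        open override func applyChatTemplate(messages: [[String: String]], chatTemplate: String) throws -> [Int] {
            try applyChatTemplate(
                messages: messages,
                chatTemplate: .literal(chatTemplate),
                addGenerationPrompt: true,
                truncation: false,
                maxLength: nil,
                tools: nil
            )
        }
    }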

import Foundation
import Hub

@_exported import TokenizersCore
pcuenca (Member, Author):

This is a good chunk of the magic. The new Tokenizers implementation is just this wrapper file, which exposes the imported TokenizersCore types.

Comment on lines +6 to +9
#if canImport(TokenizersTemplates)
import TokenizersTemplates
public typealias PreTrainedTokenizer = PreTrainedTokenizerWithTemplates
#endif
pcuenca (Member, Author):

So if TokenizersTemplates is available (because users have declared it as a dependency), we override the definition so the factory below uses the subclass.
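
From the consumer's point of view, nothing changes in the calling code. A sketch of the effect, assuming the usual AutoTokenizer.from(pretrained:) factory entry point (the model id is just an example):

    import Tokenizers

    func loadTokenizer() async throws {
        // With TokenizersTemplates declared as a dependency, PreTrainedTokenizer
        // resolves to PreTrainedTokenizerWithTemplates, so the factory returns the
        // template-capable subclass; without it, the core class is used.
        let tokenizer = try await AutoTokenizer.from(pretrained: "HuggingFaceTB/SmolLM2-135M-Instruct")
        print(type(of: tokenizer))
    }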

}

// See https://github.com/xenova/transformers.js/blob/1a9964fb09b8f54fcbeac46dc6aae8d76795809d/src/tokenizers.js#L3203 for these exceptions
class LlamaPreTrainedTokenizer: PreTrainedTokenizer {
pcuenca (Member, Author):

This could be moved back to TokenizersCore, but then we'd need a typealias here as well (and a subclass).
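
A rough sketch of what that would mean in the wrapper, reusing the canImport pattern shown above; the subclass name and the empty templates subclass are hypothetical, not from this PR.

    #if canImport(TokenizersTemplates)
    import TokenizersTemplates

    // Hypothetical: if LlamaPreTrainedTokenizer moved back to TokenizersCore,
    // the wrapper would still need a templates-capable subclass plus a
    // typealias so the name keeps resolving to it.
    open class LlamaPreTrainedTokenizerWithTemplates: PreTrainedTokenizerWithTemplates {}
    public typealias LlamaPreTrainedTokenizer = LlamaPreTrainedTokenizerWithTemplates
    #endif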

pcuenca (Member, Author) commented Jan 30, 2025

cc @greenrazer @FL33TW00D @Vaibhavs10 for opinions and feedback.

greenrazer (Collaborator) commented:

I like this solution. However, I cannot get the canImport approach in TokenizersWrapper to work for the tests, due to how Swift Package Manager handles module dependencies and conditional imports within the same package. All the tests work great with almost no modification (besides changing Tokenizers to TokenizersCore), except for ChatTemplatesTests.swift.

I've tried a few different options:

  • The best option: just adding TokenizersTemplates as a dependency to the test target, similar to how you would for an external package. This doesn’t work because it doesn’t trigger recompilation for the test.
  • Adding swiftSettings: [.define("USE_TEMPLATES")] to the test and another build condition to TokenizersWrapper. Same issue as above: it doesn’t trigger recompilation for the test.
  • Creating two targets that point to the same TokenizersWrapper source file, with one defining USE_TEMPLATES. This doesn’t work because two targets can't reference the same source file.
  • Adding .target(name: "TokenizersTemplates", condition: .when(platforms: nil)) as a dependency, hoping it would apply only during testing. This doesn’t work in newer Swift versions.
  • Adding swiftSettings: [.define("ENABLE_TEMPLATES", .when(configuration: .debug))] to the Tokenizers target. This almost works, but TokenizersTemplates must then be a dependency of Tokenizers, which defeats the whole purpose.

Since users must explicitly import TokenizersTemplates anyway, I think the best approach is to create two wrappers and two targets. That way, users only need to import either TokenizersTemplates or Tokenizers, depending on their needs.
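
For concreteness, the two-wrappers/two-targets layout could look roughly like this in the package manifest. Target names and the elided Jinja declaration are illustrative, not the actual swift-transformers manifest.

    // swift-tools-version:5.9
    import PackageDescription

    let package = Package(
        name: "swift-transformers",
        products: [
            // Core tokenization only: no chat templates, no Jinja.
            .library(name: "Tokenizers", targets: ["Tokenizers"]),
            // Core tokenization plus chat templates (pulls in Jinja).
            .library(name: "TokenizersTemplates", targets: ["TokenizersTemplates"]),
        ],
        targets: [
            // Shared implementation.
            .target(name: "TokenizersCore"),
            // Thin wrapper that re-exports TokenizersCore.
            .target(name: "Tokenizers", dependencies: ["TokenizersCore"]),
            // Thin wrapper that re-exports TokenizersCore and adds template
            // support; the Jinja package dependency is omitted from this sketch.
            .target(name: "TokenizersTemplates", dependencies: ["TokenizersCore"]),
        ]
    )

Consumers would then depend on exactly one of the two library products and import that module, matching the "import either TokenizersTemplates or Tokenizers" behavior described above.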
