Grammars compiled from JSON schemas accept invalid JSON input #286

Open
sfc-gh-azawlocki opened this issue Apr 1, 2025 · 1 comment

sfc-gh-azawlocki commented Apr 1, 2025

I'm not sure if this is a bug or intended behavior.

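The JSON specification forbids unescaped control characters (U+0000 through U+001F) inside strings. For example, the following is not valid JSON when \t stands for a literal tab character (U+0009) rather than the two-character escape sequence:

{"text": "\tab char is illegal here"}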
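
For reference, the difference between a literal tab and the escaped form can be checked with Python's standard json module, independently of xgrammar. This is only a minimal sketch: the names raw_tab, escaped_tab, and is_valid_json are made up for illustration and do not come from xgrammar.

```python
import json

# Python turns "\t" into a literal TAB (U+0009), so the JSON string below
# contains an unescaped control character.
raw_tab = '{"text": "\tab char is illegal here"}'

# Here the JSON text contains backslash + t, which is a valid JSON escape
# sequence that decodes to a TAB.
escaped_tab = '{"text": "\\tab char is legal here"}'


def is_valid_json(text: str) -> bool:
    """Return True if json.loads accepts `text` in its default strict mode."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


print(is_valid_json(raw_tab))      # False
print(is_valid_json(escaped_tab))  # True

# With strict=False the standard parser tolerates control characters in
# strings, which is comparable to the lenient behavior reported here.
lenient = json.loads(raw_tab, strict=False)
print(lenient["text"].startswith("\t"))  # True
```

The standalone snippet further down passes the first form, a Python string that holds a literal tab, to the matcher.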

A grammar compiled with GrammarCompiler.compile_builtin_json_grammar() correctly rejects this input, but a grammar compiled with GrammarCompiler.compile_json_schema() accepts it as valid JSON.

Standalone code snippet, tested with xgrammar-0.1.17:

import xgrammar
from transformers import AutoTokenizer

INVALID_INPUT = '{"text": "\tab char is illegal here"}'


def check_if_rejects_invalid_json(grammar: xgrammar.CompiledGrammar) -> None:
    matcher = xgrammar.GrammarMatcher(grammar, terminate_without_stop_token=True)
    assert not matcher._debug_accept_string(INVALID_INPUT, debug_print=True)


if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained("google-t5/t5-small")
    tokenizer_info = xgrammar.TokenizerInfo.from_huggingface(tokenizer)

    grammar_compiler = xgrammar.GrammarCompiler(tokenizer_info)

    # The builtin JSON grammar correctly rejects the INVALID_INPUT
    builtin_json_grammar = grammar_compiler.compile_builtin_json_grammar()
    check_if_rejects_invalid_json(builtin_json_grammar)
    # prints: [11:13:09] /Users/runner/work/xgrammar/xgrammar/cpp/grammar_matcher_base.cc:301: Character 9 "\t" Rejected

    # A grammar compiled from a JSON schema accepts it
    json_schema_grammar = grammar_compiler.compile_json_schema(
        {
            "type": "object",
            "properties": {
                "text": {"type": "string"},
            },
            "required": ["text"],
        }
    )
    check_if_rejects_invalid_json(json_schema_grammar)
    # raises AssertionError

Ubospica (Collaborator) commented Apr 4, 2025

@sfc-gh-azawlocki Thanks for mentioning that! We were not aware of this previously. We will try to support it later.
