Skip to content

Lexer generator for Rust

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT
Notifications You must be signed in to change notification settings

0x2a-42/herring

Repository files navigation

Herring

Crates.io MIT/Apache 2.0 Crates.io Rust

Herring (Highly Efficient Rust Regex-based lexer ImplementatioN Generator) is a lexer generator for Rust implementing a subset of the Logos API.

The key differences compared to Logos are the following.

  • The Herring trait must be derived on a unit-only enum.
  • Only Result<TokenType, ErrorType> is allowed for a regex or token callback return type.
  • Lexer modes (morph method) are not supported (use callbacks and extras instead).
  • There is no ignore(ascii_case), only ignore(case).

These changes are mostly due to the use case as a lexer for the Lelwel parser generator, where unit-only enums are required. There are also additional features that are not available in Logos.

  • There is an ignore callback that can be used to skip input before it is passed to the automaton (e.g. for lexing escaped newlines in C).
  • There is an initial callback that can be used for generating tokens without consuming input (e.g. for indent and dedent tokens in Python).
  • A callback with unit return type can be specified for skip regexes.

Warning

At the moment you should almost certainly use Logos instead of Herring, as it is more mature and provides better performance. There are however also valid reasons for using Herring. There are currently some bugs in Logos related to backtracking and stack overflows that are not present in Herring. You may also use Herring if you require the ignore or initial callbacks.

Performance

Runtime

Herring implements some of the optimizations also present in Logos. These are jump tables for states with many outgoing transitions and stacked lookup tables for transitions with complicated byte ranges. It does currently not implement some other optimizations such as loop unrolling or transitioning on strings.

DFA Jump Threading

Unlike Logos, which encodes its state machine with mutually tail recursive functions, Herring uses a match inside of a loop. This has the advantage, that the state machine will not overflow the stack on long inputs, if it is compiled without optimizations. On the other hand a disadvantage is, that Rust currently does not optimize away the extra jumps (see rust-lang/rust#80630 and rust-lang/rfcs#3720). By passing -enable-dfa-jump-thread to LLVM it is however possible to enable this optimization on the LLVM level. This can be achieved by adding a .cargo/config.toml file to your crate.

Caution

Be aware, that adding LLVM passes to the rustc optimization pipeline is probably not well tested and may increase the probability for running into compiler bugs.

Benchmark

The results of the Logos benchmark on an Intel Core i7-8550U CPU are shown in the following table. Currently in most cases Logos is faster than Herring with DFA jump threading.

Benchmark Without DFA Jump Threading With DFA Jump Threading Logos
iterate/identifiers 1.2541 µs (592.40 MiB/s) 824.49 ns (901.06 MiB/s) 951.53 ns (780.75 MiB/s)
iterate/keywords_operators_and_punctators 6.2495 µs (325.19 MiB/s) 2.9077 µs (698.94 MiB/s) 2.6915 µs (755.08 MiB/s)
iterate/strings 1.0770 µs (771.29 MiB/s) 830.49 ns (1000.2 MiB/s) 717.31 ns (1.1309 GiB/s)

Compile Time

Using complex unicode ranges or finite repetitions may result in large NFAs, for which the transformation to a minimized DFA may take a noticeable amount of time. To improve build times you can add the following to your Cargo.toml file, so the procedural macro code is optimized.

[profile.dev.build-override]
opt-level = 3

[profile.release.build-override]
opt-level = 3

Example

The Herring API is almost identical to the Logos API, so it can be used as a drop in replacement, if the above mentioned restrictions apply.

The following example shows a lexer for JSON (string escape sequences are not validated, as this is better handled after lexing, to avoid tripping up the parser with a missing string token).

use herring::{Herring, Lexer};

#[derive(Debug, Clone, PartialEq, Default)]
pub enum LexerError {
    #[default]
    Invalid,
    UnterminatedString,
}
fn parse_string(lexer: &mut Lexer<'_, Token>) -> Result<Token, LexerError> {
    let mut it = lexer.remainder().chars();
    while let Some(c) = it.next() {
        match c {
            '"' => {
                lexer.bump(1);
                return Ok(Token::String);
            }
            '\\' => {
                lexer.bump(1);
                if let Some(c) = it.next() {
                    lexer.bump(c.len_utf8());
                }
            }
            c => lexer.bump(c.len_utf8()),
        }
    }
    Err(LexerError::UnterminatedString)
}
#[derive(Herring, Debug, PartialEq, Copy, Clone)]
#[herring(error = LexerError)]
pub enum Token {
    #[regex("[\u{0020}\u{000A}\u{000D}\u{0009}]+")]
    Whitespace,
    #[token("true")]
    True,
    #[token("false")]
    False,
    #[token("null")]
    Null,
    #[token("{")]
    LBrace,
    #[token("}")]
    RBrace,
    #[token("[")]
    LBrak,
    #[token("]")]
    RBrak,
    #[token(",")]
    Comma,
    #[token(":")]
    Colon,
    #[regex("\"", parse_string)]
    String,
    #[regex(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")]
    Number,
}
fn main() {
    for tok in Token::lexer(r#"{"test": [1,2,3]}"#) {
        println!("{tok:?}");
    }
}

Debugging

You can inspect the generated code by running cargo expand.

You can generate Graphviz or Mermaid graphs by setting the environment variable HERRING_DEBUG to either graphviz or mermaid during the execution of the procedural macro. This generates a file for the NFA (nfa.dot or nfa.mmd), the subset construction DFA (dfa.dot or dfa.mmd), and the minimized DFA (min.dot or min.mmd) in the specified format. Graphviz is useful when debugging locally with tools like xdot. Mermaid is useful for directly embedding graphs in Github issues.

Setting HERRING_DEBUG to log makes the lexer write all visited states to stderr.

Note

When changing the HERRING_DEBUG environment variable you have to make sure that the build of the lexer is not skipped by cargo build. If you made no changes since the last build you can safe the containing file without any changes to trigger a rebuild.

JSON Example

NFA

flowchart LR
style start fill:#FFFFFF00, stroke:#FFFFFF00
start-->0;
0@{shape: circ}
0 -- "ε" --> 1
0 -- "ε" --> 3
0 -- "ε" --> 8
0 -- "ε" --> 14
0 -- "ε" --> 19
0 -- "ε" --> 21
0 -- "ε" --> 23
0 -- "ε" --> 25
0 -- "ε" --> 27
0 -- "ε" --> 29
0 -- "ε" --> 31
0 -- "ε" --> 33
1@{shape: circ}
1 -- "{0x9-0xA, 0xD, 0x20}" --> 2
2@{shape: dbl-circ}
2 -- "ε" --> 1
3@{shape: circ}
3 -- "{'t'}" --> 4
4@{shape: circ}
4 -- "{'r'}" --> 5
5@{shape: circ}
5 -- "{'u'}" --> 6
6@{shape: circ}
6 -- "{'e'}" --> 7
7@{shape: dbl-circ}
8@{shape: circ}
8 -- "{'f'}" --> 9
9@{shape: circ}
9 -- "{'a'}" --> 10
10@{shape: circ}
10 -- "{'l'}" --> 11
11@{shape: circ}
11 -- "{'s'}" --> 12
12@{shape: circ}
12 -- "{'e'}" --> 13
13@{shape: dbl-circ}
14@{shape: circ}
14 -- "{'n'}" --> 15
15@{shape: circ}
15 -- "{'u'}" --> 16
16@{shape: circ}
16 -- "{'l'}" --> 17
17@{shape: circ}
17 -- "{'l'}" --> 18
18@{shape: dbl-circ}
19@{shape: circ}
19 -- "{'{'}" --> 20
20@{shape: dbl-circ}
21@{shape: circ}
21 -- "{'}'}" --> 22
22@{shape: dbl-circ}
23@{shape: circ}
23 -- "{'['}" --> 24
24@{shape: dbl-circ}
25@{shape: circ}
25 -- "{']'}" --> 26
26@{shape: dbl-circ}
27@{shape: circ}
27 -- "{','}" --> 28
28@{shape: dbl-circ}
29@{shape: circ}
29 -- "{':'}" --> 30
30@{shape: dbl-circ}
31@{shape: circ}
31 -- "{'#34;'}" --> 32
32@{shape: dbl-circ}
33@{shape: circ}
33 -- "{'-'}" --> 34
33 -- "ε" --> 34
34@{shape: circ}
34 -- "ε" --> 35
35@{shape: circ}
35 -- "ε" --> 36
35 -- "ε" --> 38
36@{shape: circ}
36 -- "{'0'}" --> 37
37@{shape: circ}
37 -- "ε" --> 44
38@{shape: circ}
38 -- "{'1'-'9'}" --> 39
39@{shape: circ}
39 -- "ε" --> 40
40@{shape: circ}
40 -- "ε" --> 41
40 -- "ε" --> 43
41@{shape: circ}
41 -- "{'0'-'9'}" --> 42
42@{shape: circ}
42 -- "ε" --> 41
42 -- "ε" --> 43
43@{shape: circ}
43 -- "ε" --> 44
44@{shape: circ}
44 -- "ε" --> 45
45@{shape: circ}
45 -- "{'.'}" --> 46
45 -- "ε" --> 48
46@{shape: circ}
46 -- "ε" --> 47
47@{shape: circ}
47 -- "{'0'-'9'}" --> 48
48@{shape: circ}
48 -- "ε" --> 47
48 -- "ε" --> 49
49@{shape: circ}
49 -- "{'E', 'e'}" --> 50
49 -- "ε" --> 54
50@{shape: circ}
50 -- "ε" --> 51
51@{shape: circ}
51 -- "{'+', '-'}" --> 52
51 -- "ε" --> 52
52@{shape: circ}
52 -- "ε" --> 53
53@{shape: circ}
53 -- "{'0'-'9'}" --> 54
54@{shape: dbl-circ}
54 -- "ε" --> 53
String_0[String]@{shape: rect}
32 .-> String_0
Colon_0[Colon]@{shape: rect}
30 .-> Colon_0
Number_0[Number]@{shape: rect}
54 .-> Number_0
LBrace_0[LBrace]@{shape: rect}
20 .-> LBrace_0
LBrak_0[LBrak]@{shape: rect}
24 .-> LBrak_0
RBrak_0[RBrak]@{shape: rect}
26 .-> RBrak_0
Null_0[Null]@{shape: rect}
18 .-> Null_0
False_0[False]@{shape: rect}
13 .-> False_0
True_0[True]@{shape: rect}
7 .-> True_0
Whitespace_0[Whitespace]@{shape: rect}
2 .-> Whitespace_0
RBrace_0[RBrace]@{shape: rect}
22 .-> RBrace_0
Comma_0[Comma]@{shape: rect}
28 .-> Comma_0
Loading

Subset Construction DFA

flowchart LR
style start fill:#FFFFFF00, stroke:#FFFFFF00
start-->0;
0@{shape: circ}
0 -- "{0x9-0xA, 0xD, 0x20}" --> 1
0 -- "{'#34;'}" --> 2
0 -- "{','}" --> 3
0 -- "{'-'}" --> 4
0 -- "{'0'}" --> 5
0 -- "{'1'-'9'}" --> 6
0 -- "{':'}" --> 7
0 -- "{'['}" --> 8
0 -- "{']'}" --> 9
0 -- "{'f'}" --> 10
0 -- "{'n'}" --> 11
0 -- "{'t'}" --> 12
0 -- "{'{'}" --> 13
0 -- "{'}'}" --> 14
1@{shape: dbl-circ}
1 -- "{0x9-0xA, 0xD, 0x20}" --> 1
2@{shape: dbl-circ}
3@{shape: dbl-circ}
4@{shape: circ}
4 -- "{'0'}" --> 5
4 -- "{'1'-'9'}" --> 6
5@{shape: dbl-circ}
5 -- "{'.'}" --> 25
5 -- "{'0'-'9'}" --> 30
5 -- "{'E', 'e'}" --> 27
6@{shape: dbl-circ}
6 -- "{'.'}" --> 25
6 -- "{'0'-'9'}" --> 26
6 -- "{'E', 'e'}" --> 27
7@{shape: dbl-circ}
8@{shape: dbl-circ}
9@{shape: dbl-circ}
10@{shape: circ}
10 -- "{'a'}" --> 21
11@{shape: circ}
11 -- "{'u'}" --> 18
12@{shape: circ}
12 -- "{'r'}" --> 15
13@{shape: dbl-circ}
14@{shape: dbl-circ}
15@{shape: circ}
15 -- "{'u'}" --> 16
16@{shape: circ}
16 -- "{'e'}" --> 17
17@{shape: dbl-circ}
18@{shape: circ}
18 -- "{'l'}" --> 19
19@{shape: circ}
19 -- "{'l'}" --> 20
20@{shape: dbl-circ}
21@{shape: circ}
21 -- "{'l'}" --> 22
22@{shape: circ}
22 -- "{'s'}" --> 23
23@{shape: circ}
23 -- "{'e'}" --> 24
24@{shape: dbl-circ}
25@{shape: circ}
25 -- "{'0'-'9'}" --> 30
26@{shape: dbl-circ}
26 -- "{'.'}" --> 25
26 -- "{'0'-'9'}" --> 26
26 -- "{'E', 'e'}" --> 27
27@{shape: circ}
27 -- "{'+', '-'}" --> 28
27 -- "{'0'-'9'}" --> 29
28@{shape: circ}
28 -- "{'0'-'9'}" --> 29
29@{shape: dbl-circ}
29 -- "{'0'-'9'}" --> 29
30@{shape: dbl-circ}
30 -- "{'0'-'9'}" --> 30
30 -- "{'E', 'e'}" --> 27
LBrak_0[LBrak]@{shape: rect}
8 .-> LBrak_0
String_0[String]@{shape: rect}
2 .-> String_0
Number_0[Number]@{shape: rect}
30 .-> Number_0
Colon_0[Colon]@{shape: rect}
7 .-> Colon_0
Number_0[Number]@{shape: rect}
6 .-> Number_0
True_0[True]@{shape: rect}
17 .-> True_0
Null_0[Null]@{shape: rect}
20 .-> Null_0
False_0[False]@{shape: rect}
24 .-> False_0
Number_0[Number]@{shape: rect}
5 .-> Number_0
Number_0[Number]@{shape: rect}
29 .-> Number_0
Whitespace_0[Whitespace]@{shape: rect}
1 .-> Whitespace_0
RBrak_0[RBrak]@{shape: rect}
9 .-> RBrak_0
Number_0[Number]@{shape: rect}
26 .-> Number_0
RBrace_0[RBrace]@{shape: rect}
14 .-> RBrace_0
Comma_0[Comma]@{shape: rect}
3 .-> Comma_0
LBrace_0[LBrace]@{shape: rect}
13 .-> LBrace_0
Loading

Minimized DFA

flowchart LR
style start fill:#FFFFFF00, stroke:#FFFFFF00
start-->0;
0@{shape: circ}
0 -- "{0x9-0xA, 0xD, 0x20}" --> 1
0 -- "{'#34;'}" --> 2
0 -- "{','}" --> 3
0 -- "{'-'}" --> 4
0 -- "{'0'}" --> 5
0 -- "{'1'-'9'}" --> 6
0 -- "{':'}" --> 7
0 -- "{'['}" --> 8
0 -- "{']'}" --> 9
0 -- "{'f'}" --> 10
0 -- "{'n'}" --> 11
0 -- "{'t'}" --> 12
0 -- "{'{'}" --> 13
0 -- "{'}'}" --> 14
1@{shape: dbl-circ}
1 -- "{0x9-0xA, 0xD, 0x20}" --> 1
2@{shape: dbl-circ}
3@{shape: dbl-circ}
4@{shape: circ}
4 -- "{'0'}" --> 5
4 -- "{'1'-'9'}" --> 6
5@{shape: dbl-circ}
5 -- "{'.'}" --> 25
5 -- "{'0'-'9'}" --> 29
5 -- "{'E', 'e'}" --> 26
6@{shape: dbl-circ}
6 -- "{'.'}" --> 25
6 -- "{'0'-'9'}" --> 6
6 -- "{'E', 'e'}" --> 26
7@{shape: dbl-circ}
8@{shape: dbl-circ}
9@{shape: dbl-circ}
10@{shape: circ}
10 -- "{'a'}" --> 21
11@{shape: circ}
11 -- "{'u'}" --> 18
12@{shape: circ}
12 -- "{'r'}" --> 15
13@{shape: dbl-circ}
14@{shape: dbl-circ}
15@{shape: circ}
15 -- "{'u'}" --> 16
16@{shape: circ}
16 -- "{'e'}" --> 17
17@{shape: dbl-circ}
18@{shape: circ}
18 -- "{'l'}" --> 19
19@{shape: circ}
19 -- "{'l'}" --> 20
20@{shape: dbl-circ}
21@{shape: circ}
21 -- "{'l'}" --> 22
22@{shape: circ}
22 -- "{'s'}" --> 23
23@{shape: circ}
23 -- "{'e'}" --> 24
24@{shape: dbl-circ}
25@{shape: circ}
25 -- "{'0'-'9'}" --> 29
26@{shape: circ}
26 -- "{'+', '-'}" --> 27
26 -- "{'0'-'9'}" --> 28
27@{shape: circ}
27 -- "{'0'-'9'}" --> 28
28@{shape: dbl-circ}
28 -- "{'0'-'9'}" --> 28
29@{shape: dbl-circ}
29 -- "{'0'-'9'}" --> 29
29 -- "{'E', 'e'}" --> 26
False_0[False]@{shape: rect}
24 .-> False_0
Whitespace_0[Whitespace]@{shape: rect}
1 .-> Whitespace_0
Number_0[Number]@{shape: rect}
6 .-> Number_0
Number_0[Number]@{shape: rect}
29 .-> Number_0
LBrace_0[LBrace]@{shape: rect}
13 .-> LBrace_0
Number_0[Number]@{shape: rect}
5 .-> Number_0
RBrace_0[RBrace]@{shape: rect}
14 .-> RBrace_0
String_0[String]@{shape: rect}
2 .-> String_0
Colon_0[Colon]@{shape: rect}
7 .-> Colon_0
RBrak_0[RBrak]@{shape: rect}
9 .-> RBrak_0
True_0[True]@{shape: rect}
17 .-> True_0
Null_0[Null]@{shape: rect}
20 .-> Null_0
LBrak_0[LBrak]@{shape: rect}
8 .-> LBrak_0
Number_0[Number]@{shape: rect}
28 .-> Number_0
Comma_0[Comma]@{shape: rect}
3 .-> Comma_0
Loading

License

Herring is licensed under either of

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

About

Lexer generator for Rust

Topics

Resources

License

Apache-2.0, MIT licenses found

Licenses found

Apache-2.0
LICENSE-APACHE
MIT
LICENSE-MIT

Stars

Watchers

Forks

Packages

No packages published

Languages