Herring (Highly Efficient Rust Regex-based lexer ImplementatioN Generator) is a lexer generator for Rust implementing a subset of the Logos API.
The key differences compared to Logos are the following.
- The
Herring
trait must be derived on a unit-only enum. - Only
Result<TokenType, ErrorType>
is allowed for aregex
ortoken
callback return type. - Lexer modes (
morph
method) are not supported (use callbacks andextras
instead). - There is no
ignore(ascii_case)
, onlyignore(case)
.
These changes are mostly due to the use case as a lexer for the Lelwel parser generator, where unit-only enums are required. There are also additional features that are not available in Logos.
- There is an
ignore
callback that can be used to skip input before it is passed to the automaton (e.g. for lexing escaped newlines in C). - There is an
initial
callback that can be used for generating tokens without consuming input (e.g. for indent and dedent tokens in Python). - A callback with unit return type can be specified for
skip
regexes.
Warning
At the moment you should almost certainly use Logos instead of Herring, as it is more mature and provides better performance.
There are however also valid reasons for using Herring.
There are currently some bugs in Logos related to backtracking and stack overflows that are not present in Herring.
You may also use Herring if you require the ignore
or initial
callbacks.
Herring implements some of the optimizations also present in Logos. These are jump tables for states with many outgoing transitions and stacked lookup tables for transitions with complicated byte ranges. It does currently not implement some other optimizations such as loop unrolling or transitioning on strings.
Unlike Logos, which encodes its state machine with mutually tail recursive functions, Herring uses a match
inside of a loop
.
This has the advantage, that the state machine will not overflow the stack on long inputs, if it is compiled without optimizations.
On the other hand a disadvantage is, that Rust currently does not optimize away the extra jumps (see rust-lang/rust#80630 and rust-lang/rfcs#3720).
By passing -enable-dfa-jump-thread
to LLVM it is however possible to enable this optimization on the LLVM level.
This can be achieved by adding a .cargo/config.toml
file to your crate.
Caution
Be aware, that adding LLVM passes to the rustc optimization pipeline is probably not well tested and may increase the probability for running into compiler bugs.
The results of the Logos benchmark on an Intel Core i7-8550U CPU are shown in the following table. Currently in most cases Logos is faster than Herring with DFA jump threading.
Benchmark | Without DFA Jump Threading | With DFA Jump Threading | Logos |
---|---|---|---|
iterate/identifiers |
1.2541 µs (592.40 MiB/s) | 824.49 ns (901.06 MiB/s) | 951.53 ns (780.75 MiB/s) |
iterate/keywords_operators_and_punctators |
6.2495 µs (325.19 MiB/s) | 2.9077 µs (698.94 MiB/s) | 2.6915 µs (755.08 MiB/s) |
iterate/strings |
1.0770 µs (771.29 MiB/s) | 830.49 ns (1000.2 MiB/s) | 717.31 ns (1.1309 GiB/s) |
Using complex unicode ranges or finite repetitions may result in large NFAs, for which the transformation to a minimized DFA may take a noticeable amount of time.
To improve build times you can add the following to your Cargo.toml
file, so the procedural macro code is optimized.
[profile.dev.build-override]
opt-level = 3
[profile.release.build-override]
opt-level = 3
The Herring API is almost identical to the Logos API, so it can be used as a drop in replacement, if the above mentioned restrictions apply.
The following example shows a lexer for JSON (string escape sequences are not validated, as this is better handled after lexing, to avoid tripping up the parser with a missing string token).
use herring::{Herring, Lexer};
#[derive(Debug, Clone, PartialEq, Default)]
pub enum LexerError {
#[default]
Invalid,
UnterminatedString,
}
fn parse_string(lexer: &mut Lexer<'_, Token>) -> Result<Token, LexerError> {
let mut it = lexer.remainder().chars();
while let Some(c) = it.next() {
match c {
'"' => {
lexer.bump(1);
return Ok(Token::String);
}
'\\' => {
lexer.bump(1);
if let Some(c) = it.next() {
lexer.bump(c.len_utf8());
}
}
c => lexer.bump(c.len_utf8()),
}
}
Err(LexerError::UnterminatedString)
}
#[derive(Herring, Debug, PartialEq, Copy, Clone)]
#[herring(error = LexerError)]
pub enum Token {
#[regex("[\u{0020}\u{000A}\u{000D}\u{0009}]+")]
Whitespace,
#[token("true")]
True,
#[token("false")]
False,
#[token("null")]
Null,
#[token("{")]
LBrace,
#[token("}")]
RBrace,
#[token("[")]
LBrak,
#[token("]")]
RBrak,
#[token(",")]
Comma,
#[token(":")]
Colon,
#[regex("\"", parse_string)]
String,
#[regex(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")]
Number,
}
fn main() {
for tok in Token::lexer(r#"{"test": [1,2,3]}"#) {
println!("{tok:?}");
}
}
You can inspect the generated code by running cargo expand
.
You can generate Graphviz or Mermaid graphs by setting the environment variable HERRING_DEBUG
to either graphviz
or mermaid
during the execution of the procedural macro.
This generates a file for the NFA (nfa.dot
or nfa.mmd
), the subset construction DFA (dfa.dot
or dfa.mmd
), and the minimized DFA (min.dot
or min.mmd
) in the specified format.
Graphviz is useful when debugging locally with tools like xdot
.
Mermaid is useful for directly embedding graphs in Github issues.
Setting HERRING_DEBUG
to log
makes the lexer write all visited states to stderr.
Note
When changing the HERRING_DEBUG
environment variable you have to make sure that the build of the lexer is not skipped by cargo build
.
If you made no changes since the last build you can safe the containing file without any changes to trigger a rebuild.
flowchart LR
style start fill:#FFFFFF00, stroke:#FFFFFF00
start-->0;
0@{shape: circ}
0 -- "ε" --> 1
0 -- "ε" --> 3
0 -- "ε" --> 8
0 -- "ε" --> 14
0 -- "ε" --> 19
0 -- "ε" --> 21
0 -- "ε" --> 23
0 -- "ε" --> 25
0 -- "ε" --> 27
0 -- "ε" --> 29
0 -- "ε" --> 31
0 -- "ε" --> 33
1@{shape: circ}
1 -- "{0x9-0xA, 0xD, 0x20}" --> 2
2@{shape: dbl-circ}
2 -- "ε" --> 1
3@{shape: circ}
3 -- "{'t'}" --> 4
4@{shape: circ}
4 -- "{'r'}" --> 5
5@{shape: circ}
5 -- "{'u'}" --> 6
6@{shape: circ}
6 -- "{'e'}" --> 7
7@{shape: dbl-circ}
8@{shape: circ}
8 -- "{'f'}" --> 9
9@{shape: circ}
9 -- "{'a'}" --> 10
10@{shape: circ}
10 -- "{'l'}" --> 11
11@{shape: circ}
11 -- "{'s'}" --> 12
12@{shape: circ}
12 -- "{'e'}" --> 13
13@{shape: dbl-circ}
14@{shape: circ}
14 -- "{'n'}" --> 15
15@{shape: circ}
15 -- "{'u'}" --> 16
16@{shape: circ}
16 -- "{'l'}" --> 17
17@{shape: circ}
17 -- "{'l'}" --> 18
18@{shape: dbl-circ}
19@{shape: circ}
19 -- "{'{'}" --> 20
20@{shape: dbl-circ}
21@{shape: circ}
21 -- "{'}'}" --> 22
22@{shape: dbl-circ}
23@{shape: circ}
23 -- "{'['}" --> 24
24@{shape: dbl-circ}
25@{shape: circ}
25 -- "{']'}" --> 26
26@{shape: dbl-circ}
27@{shape: circ}
27 -- "{','}" --> 28
28@{shape: dbl-circ}
29@{shape: circ}
29 -- "{':'}" --> 30
30@{shape: dbl-circ}
31@{shape: circ}
31 -- "{'#34;'}" --> 32
32@{shape: dbl-circ}
33@{shape: circ}
33 -- "{'-'}" --> 34
33 -- "ε" --> 34
34@{shape: circ}
34 -- "ε" --> 35
35@{shape: circ}
35 -- "ε" --> 36
35 -- "ε" --> 38
36@{shape: circ}
36 -- "{'0'}" --> 37
37@{shape: circ}
37 -- "ε" --> 44
38@{shape: circ}
38 -- "{'1'-'9'}" --> 39
39@{shape: circ}
39 -- "ε" --> 40
40@{shape: circ}
40 -- "ε" --> 41
40 -- "ε" --> 43
41@{shape: circ}
41 -- "{'0'-'9'}" --> 42
42@{shape: circ}
42 -- "ε" --> 41
42 -- "ε" --> 43
43@{shape: circ}
43 -- "ε" --> 44
44@{shape: circ}
44 -- "ε" --> 45
45@{shape: circ}
45 -- "{'.'}" --> 46
45 -- "ε" --> 48
46@{shape: circ}
46 -- "ε" --> 47
47@{shape: circ}
47 -- "{'0'-'9'}" --> 48
48@{shape: circ}
48 -- "ε" --> 47
48 -- "ε" --> 49
49@{shape: circ}
49 -- "{'E', 'e'}" --> 50
49 -- "ε" --> 54
50@{shape: circ}
50 -- "ε" --> 51
51@{shape: circ}
51 -- "{'+', '-'}" --> 52
51 -- "ε" --> 52
52@{shape: circ}
52 -- "ε" --> 53
53@{shape: circ}
53 -- "{'0'-'9'}" --> 54
54@{shape: dbl-circ}
54 -- "ε" --> 53
String_0[String]@{shape: rect}
32 .-> String_0
Colon_0[Colon]@{shape: rect}
30 .-> Colon_0
Number_0[Number]@{shape: rect}
54 .-> Number_0
LBrace_0[LBrace]@{shape: rect}
20 .-> LBrace_0
LBrak_0[LBrak]@{shape: rect}
24 .-> LBrak_0
RBrak_0[RBrak]@{shape: rect}
26 .-> RBrak_0
Null_0[Null]@{shape: rect}
18 .-> Null_0
False_0[False]@{shape: rect}
13 .-> False_0
True_0[True]@{shape: rect}
7 .-> True_0
Whitespace_0[Whitespace]@{shape: rect}
2 .-> Whitespace_0
RBrace_0[RBrace]@{shape: rect}
22 .-> RBrace_0
Comma_0[Comma]@{shape: rect}
28 .-> Comma_0
flowchart LR
style start fill:#FFFFFF00, stroke:#FFFFFF00
start-->0;
0@{shape: circ}
0 -- "{0x9-0xA, 0xD, 0x20}" --> 1
0 -- "{'#34;'}" --> 2
0 -- "{','}" --> 3
0 -- "{'-'}" --> 4
0 -- "{'0'}" --> 5
0 -- "{'1'-'9'}" --> 6
0 -- "{':'}" --> 7
0 -- "{'['}" --> 8
0 -- "{']'}" --> 9
0 -- "{'f'}" --> 10
0 -- "{'n'}" --> 11
0 -- "{'t'}" --> 12
0 -- "{'{'}" --> 13
0 -- "{'}'}" --> 14
1@{shape: dbl-circ}
1 -- "{0x9-0xA, 0xD, 0x20}" --> 1
2@{shape: dbl-circ}
3@{shape: dbl-circ}
4@{shape: circ}
4 -- "{'0'}" --> 5
4 -- "{'1'-'9'}" --> 6
5@{shape: dbl-circ}
5 -- "{'.'}" --> 25
5 -- "{'0'-'9'}" --> 30
5 -- "{'E', 'e'}" --> 27
6@{shape: dbl-circ}
6 -- "{'.'}" --> 25
6 -- "{'0'-'9'}" --> 26
6 -- "{'E', 'e'}" --> 27
7@{shape: dbl-circ}
8@{shape: dbl-circ}
9@{shape: dbl-circ}
10@{shape: circ}
10 -- "{'a'}" --> 21
11@{shape: circ}
11 -- "{'u'}" --> 18
12@{shape: circ}
12 -- "{'r'}" --> 15
13@{shape: dbl-circ}
14@{shape: dbl-circ}
15@{shape: circ}
15 -- "{'u'}" --> 16
16@{shape: circ}
16 -- "{'e'}" --> 17
17@{shape: dbl-circ}
18@{shape: circ}
18 -- "{'l'}" --> 19
19@{shape: circ}
19 -- "{'l'}" --> 20
20@{shape: dbl-circ}
21@{shape: circ}
21 -- "{'l'}" --> 22
22@{shape: circ}
22 -- "{'s'}" --> 23
23@{shape: circ}
23 -- "{'e'}" --> 24
24@{shape: dbl-circ}
25@{shape: circ}
25 -- "{'0'-'9'}" --> 30
26@{shape: dbl-circ}
26 -- "{'.'}" --> 25
26 -- "{'0'-'9'}" --> 26
26 -- "{'E', 'e'}" --> 27
27@{shape: circ}
27 -- "{'+', '-'}" --> 28
27 -- "{'0'-'9'}" --> 29
28@{shape: circ}
28 -- "{'0'-'9'}" --> 29
29@{shape: dbl-circ}
29 -- "{'0'-'9'}" --> 29
30@{shape: dbl-circ}
30 -- "{'0'-'9'}" --> 30
30 -- "{'E', 'e'}" --> 27
LBrak_0[LBrak]@{shape: rect}
8 .-> LBrak_0
String_0[String]@{shape: rect}
2 .-> String_0
Number_0[Number]@{shape: rect}
30 .-> Number_0
Colon_0[Colon]@{shape: rect}
7 .-> Colon_0
Number_0[Number]@{shape: rect}
6 .-> Number_0
True_0[True]@{shape: rect}
17 .-> True_0
Null_0[Null]@{shape: rect}
20 .-> Null_0
False_0[False]@{shape: rect}
24 .-> False_0
Number_0[Number]@{shape: rect}
5 .-> Number_0
Number_0[Number]@{shape: rect}
29 .-> Number_0
Whitespace_0[Whitespace]@{shape: rect}
1 .-> Whitespace_0
RBrak_0[RBrak]@{shape: rect}
9 .-> RBrak_0
Number_0[Number]@{shape: rect}
26 .-> Number_0
RBrace_0[RBrace]@{shape: rect}
14 .-> RBrace_0
Comma_0[Comma]@{shape: rect}
3 .-> Comma_0
LBrace_0[LBrace]@{shape: rect}
13 .-> LBrace_0
flowchart LR
style start fill:#FFFFFF00, stroke:#FFFFFF00
start-->0;
0@{shape: circ}
0 -- "{0x9-0xA, 0xD, 0x20}" --> 1
0 -- "{'#34;'}" --> 2
0 -- "{','}" --> 3
0 -- "{'-'}" --> 4
0 -- "{'0'}" --> 5
0 -- "{'1'-'9'}" --> 6
0 -- "{':'}" --> 7
0 -- "{'['}" --> 8
0 -- "{']'}" --> 9
0 -- "{'f'}" --> 10
0 -- "{'n'}" --> 11
0 -- "{'t'}" --> 12
0 -- "{'{'}" --> 13
0 -- "{'}'}" --> 14
1@{shape: dbl-circ}
1 -- "{0x9-0xA, 0xD, 0x20}" --> 1
2@{shape: dbl-circ}
3@{shape: dbl-circ}
4@{shape: circ}
4 -- "{'0'}" --> 5
4 -- "{'1'-'9'}" --> 6
5@{shape: dbl-circ}
5 -- "{'.'}" --> 25
5 -- "{'0'-'9'}" --> 29
5 -- "{'E', 'e'}" --> 26
6@{shape: dbl-circ}
6 -- "{'.'}" --> 25
6 -- "{'0'-'9'}" --> 6
6 -- "{'E', 'e'}" --> 26
7@{shape: dbl-circ}
8@{shape: dbl-circ}
9@{shape: dbl-circ}
10@{shape: circ}
10 -- "{'a'}" --> 21
11@{shape: circ}
11 -- "{'u'}" --> 18
12@{shape: circ}
12 -- "{'r'}" --> 15
13@{shape: dbl-circ}
14@{shape: dbl-circ}
15@{shape: circ}
15 -- "{'u'}" --> 16
16@{shape: circ}
16 -- "{'e'}" --> 17
17@{shape: dbl-circ}
18@{shape: circ}
18 -- "{'l'}" --> 19
19@{shape: circ}
19 -- "{'l'}" --> 20
20@{shape: dbl-circ}
21@{shape: circ}
21 -- "{'l'}" --> 22
22@{shape: circ}
22 -- "{'s'}" --> 23
23@{shape: circ}
23 -- "{'e'}" --> 24
24@{shape: dbl-circ}
25@{shape: circ}
25 -- "{'0'-'9'}" --> 29
26@{shape: circ}
26 -- "{'+', '-'}" --> 27
26 -- "{'0'-'9'}" --> 28
27@{shape: circ}
27 -- "{'0'-'9'}" --> 28
28@{shape: dbl-circ}
28 -- "{'0'-'9'}" --> 28
29@{shape: dbl-circ}
29 -- "{'0'-'9'}" --> 29
29 -- "{'E', 'e'}" --> 26
False_0[False]@{shape: rect}
24 .-> False_0
Whitespace_0[Whitespace]@{shape: rect}
1 .-> Whitespace_0
Number_0[Number]@{shape: rect}
6 .-> Number_0
Number_0[Number]@{shape: rect}
29 .-> Number_0
LBrace_0[LBrace]@{shape: rect}
13 .-> LBrace_0
Number_0[Number]@{shape: rect}
5 .-> Number_0
RBrace_0[RBrace]@{shape: rect}
14 .-> RBrace_0
String_0[String]@{shape: rect}
2 .-> String_0
Colon_0[Colon]@{shape: rect}
7 .-> Colon_0
RBrak_0[RBrak]@{shape: rect}
9 .-> RBrak_0
True_0[True]@{shape: rect}
17 .-> True_0
Null_0[Null]@{shape: rect}
20 .-> Null_0
LBrak_0[LBrak]@{shape: rect}
8 .-> LBrak_0
Number_0[Number]@{shape: rect}
28 .-> Number_0
Comma_0[Comma]@{shape: rect}
3 .-> Comma_0
Herring is licensed under either of
- Apache License, Version 2.0 (LICENSE-APACHE or https://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or https://opensource.org/licenses/MIT)
at your option.
Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.