Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regexgen bug in minimize() / output depends on order of Trie#addAll input #31

Open
mathiasbynens opened this issue Mar 15, 2021 · 4 comments · May be fixed by #32
Open

regexgen bug in minimize() / output depends on order of Trie#addAll input #31

mathiasbynens opened this issue Mar 15, 2021 · 4 comments · May be fixed by #32

Comments

@mathiasbynens
Copy link
Contributor

mathiasbynens commented Mar 15, 2021

See mathiasbynens/emoji-test-regex-pattern#1.

There seems to be an issue where regexgen produces incorrect output. The exact output depends on the order of the Trie#addAll input, which seems like a bug in and of itself.

Test case:

const assert = require('assert');

const Trie = require('regexgen').Trie;
const trie = new Trie();

const strings = [
  '\u{1F9D1}\u{1F3FF}\u200D\u2695\uFE0F',
  '\u{1F468}\u{1F3FE}\u200D\u2708\uFE0F',
  '\u{1F468}\u{1F3FF}\u200D\u2708\uFE0F',
  '\u{1F469}\u200D\u2764\u200D\u{1F469}',
  '\u{1F9D1}\u{1F3FF}\u200D\u2695',
  '\u{1F468}\u{1F3FE}\u200D\u2708',
  '\u{1F468}\u{1F3FF}\u200D\u2708',
  '\u{1F468}\u200D\u2708',
  '\u{1F469}\u200D\u2708',
  '\u2708\uFE0F',
  '\u{1F469}',
  '\u2708',
];

const STRINGS_THAT_MUST_BE_INCLUDED = [
  '\u{1F9D1}\u200D\u2708\uFE0F', // 🧑‍✈️ E12.1 pilot
  '\u{1F468}\u200D\u2708\uFE0F', // 👨‍✈️ E4.0 man pilot
  '\u{1F469}\u200D\u2708\uFE0F', // 👩‍✈️ E4.0 woman pilot
];

strings.push(...STRINGS_THAT_MUST_BE_INCLUDED);

// Uncommenting this results in regexgen generating a different pattern
// that passes the tests below, but is wrong in a different way:
//strings.sort();

trie.addAll(strings);
const pattern = trie.toString();
console.log(pattern);

const re = new RegExp(pattern, 'g');
assert(strings.includes('\u{1F9D1}\u200D\u2708\uFE0F')); // 🧑‍✈️ E12.1 pilot
assert(strings.includes('\u{1F468}\u200D\u2708\uFE0F')); // 👨‍✈️ E4.0 man pilot
assert(strings.includes('\u{1F469}\u200D\u2708\uFE0F')); // 👩‍✈️ E4.0 woman pilot
const text = `
\u{1F9D1}\u200D\u2708\uFE0F  # E12.1 pilot
\u{1F468}\u200D\u2708\uFE0F  # E4.0 man pilot
\u{1F469}\u200D\u2708\uFE0F  # E4.0 woman pilot
`;
console.log('3 pilots expected', text.match(re));
// → expected: 3 pilots expected [ '🧑‍✈️', '👨‍✈️', '👩‍✈️' ]
// → actual: 3 pilots expected [ '🧑‍✈️', '👨‍✈️', '👩', '✈️' ]

Uncomment the strings.sort() line results in a different pattern (which appears to work correctly for this particular case, but it actually still doesn't match all expected strings). Here are the patterns regexgen generates for both cases:

# Without sort(), not working correctly:
\uD83D\uDC69(?:\u200D\u2764\u200D\uD83D\uDC69)?|\uD83E\uDDD1(?:\uD83C\uDFFF\u200D\u2695\uFE0F?|\u200D\u2708\uFE0F)|(?:\uD83D\uDC68(?:\uD83C[\uDFFE\uDFFF])?\u200D|\uD83D\uDC69\u200D)?\u2708\uFE0F?
# With sort(), seemingly working correctly:
\uD83D\uDC69\u200D\u2764\u200D\uD83D\uDC69|(?:\uD83E\uDDD1\uD83C\uDFFF\u200D\u2695|\uD83D\uDC68(?:\uD83C[\uDFFE\uDFFF])?\u200D\u2708|\uD83D\uDC69\u200D\u2708|\u2708)\uFE0F?|\uD83E\uDDD1\u200D\u2708\uFE0F|\uD83D\uDC69

I think there's a bug in regexgen:

  • it should produce the same pattern regardless of the sort order of the input strings
  • it should correctly match all of the input strings
@mathiasbynens
Copy link
Contributor Author

mathiasbynens commented Mar 16, 2021

Turns out that .sort()ing before passing to regexgen doesn't actually fix the issue, it just moves it around (other strings are now no longer matched). So ignore that part of my post — it’s not a workaround that fixes the problem in general (even though it helps in this particular test case).

@mathiasbynens
Copy link
Contributor Author

mathiasbynens commented Mar 16, 2021

Simpler test case with ASCII characters only:

const assert = require('assert');

const Trie = require('regexgen').Trie;
const trie = new Trie();

const STRING_TO_MATCH = 'FBCD';
const strings = [
  'AGBHD',
  'EIBCD',
  'EGBCD',
  'FBJBF',
  'AGBH',
  'EIBC',
  'EGBC',
  'EBC',
  'FBC',
  'CD',
  'F',
  'C',
  'ABCD',
  'EBCD',
  STRING_TO_MATCH,
];

// Uncommenting this results in regexgen generating a different pattern
// that passes the tests below (but still produces incorrect results in other cases):
//strings.sort();

trie.addAll(strings);
const pattern = trie.toString();
//console.log(pattern);
// → 'F(?:BJBF)?|(?:E[GI]?B|FB)?CD?|A(?:GBHD?|BCD)'
// Or with sort() first:
// → 'FBJBF|(?:E[GI]?B|FB)?CD|A(?:GBHD?|BCD)|(?:E[GI]?B|FB)?C|F'

const re = new RegExp(pattern, 'g');
assert(strings.includes(STRING_TO_MATCH));

// Verify that every string we told regexgen to match, is actually
// matched by the generated pattern.
for (const string of strings) {
  const actual = string.match(re)[0];
  assert(string === actual);
}

@mathiasbynens
Copy link
Contributor Author

This patch results in correct output:

diff --git a/src/trie.js b/src/trie.js
index 8e363e1..f938633 100644
--- a/src/trie.js
+++ b/src/trie.js
@@ -42,7 +42,7 @@ class Trie {
   * @return {State} - the starting state of the minimal DFA
   */
  minimize() {
-    return minimize(this.root);
+    return this.root;
  }

  /**

So (unsurprisingly) the bug is somewhere in minimize().

mathiasbynens added a commit to mathiasbynens/regexgen that referenced this issue Mar 16, 2021
We can re-enable it once the upstream bugs are fixed:
devongovett#31
@mathiasbynens mathiasbynens linked a pull request Mar 16, 2021 that will close this issue
mathiasbynens added a commit to mathiasbynens/regexgen that referenced this issue Mar 16, 2021
We can re-enable it once the bugs are fixed:
devongovett#31
@mathiasbynens mathiasbynens changed the title regexgen output depends on order of Trie#addAll input regexgen bug in minimize() / output depends on order of Trie#addAll input Mar 16, 2021
mathiasbynens added a commit to mathiasbynens/emoji-test-regex-pattern that referenced this issue Mar 16, 2021
Due to an upstream bug in the regular expression minimizing implementation, I’m temporarily disabling this functionaity. Unfortunately this results in significantly larger regular expression patterns, but at least the output is (presumably) correct now.

Issue: #1
Issue: devongovett/regexgen#31
mathiasbynens added a commit to mathiasbynens/emoji-test-regex-pattern that referenced this issue Mar 17, 2021
@mathiasbynens
Copy link
Contributor Author

I patched regexgen to allow for easier inspection of its internal state. Here‘s the state for the above test case.

Without sort (incorrect):

A => G => B => H
A => G => B => H => D
A => B => C => D
E => I => B => C
E => I => B => C => D
E => G => B => C
E => G => B => C => D
E => B => C
E => B => C => D
F
F => B => J => B => F
F => B => C
F => B => C => D
C
C => D

==>
F(?:BJBF)?|(?:E[GI]?B|FB)?CD?|A(?:GBHD?|BCD)

With sort (which in the case of this particular test case, gives 100% correct results, matching all the strings completely — so we can use it as a reference):

A => B => C => D
A => G => B => H
A => G => B => H => D
C
C => D
E => B => C
E => B => C => D
E => G => B => C
E => G => B => C => D
E => I => B => C
E => I => B => C => D
F
F => B => C
F => B => C => D
F => B => J => B => F

==>
FBJBF|(?:E[GI]?B|FB)?CD|A(?:GBHD?|BCD)|(?:E[GI]?B|FB)?C|F

F => B => C => D is the problematic one in this test case — 'FBCD' is the string that does not get matched, despite being included in the input to regexgen. In the broken output, it doesn't get matched because F(?:BJBF)? appears first in the pattern, and so only the F is matched. Within the generated pattern, it should never happen that something on the left matches a prefix of something that's further on the right, because then the latter can never match. It seems regexgen should either

  • sort the states, if the minimizer depends on any particular order (in src/minimize.js), or
  • be more careful about how it orders the different parts of the pattern (in src/regex.js), or
  • do both.

mathiasbynens added a commit to mathiasbynens/emoji-test-regex-pattern that referenced this issue Mar 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant