regexgen

Generates regular expressions that match a set of strings.

Installation

regexgen can be installed using npm:

npm install regexgen

Example

The simplest use is to pass an array of strings to regexgen:

const regexgen = require('regexgen');

regexgen(['foobar', 'foobaz', 'foozap', 'fooza']); // => /foo(?:zap?|ba[rz])/
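
The pattern in the output above carries no anchors, so it can also match inside a longer string. The following is a minimal sketch, assuming you want whole-string matches: it wraps the pattern's source in `^(?:...)$`. The `anchored` helper is hypothetical and not part of regexgen, and the regex literal is copied from the output above so the snippet runs without the package installed.

```javascript
// Regex copied from the example output above.
const re = /foo(?:zap?|ba[rz])/;

// Hypothetical helper (not part of regexgen): anchor a pattern so it
// only matches an entire string.
function anchored(regex) {
  return new RegExp(`^(?:${regex.source})$`, regex.flags);
}

const exact = anchored(re);
const inputs = ['foobar', 'foobaz', 'foozap', 'fooza'];

console.log(inputs.every(s => exact.test(s))); // true
console.log(re.test('xfoobarx'));    // true: the unanchored pattern matches a substring
console.log(exact.test('xfoobarx')); // false: the anchored pattern requires a full match
```

The non-capturing group around the source keeps the anchors applied to the whole pattern even when the generated regex has a top-level alternation.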

You can also use the Trie class directly:

const {Trie} = require('regexgen');

const t = new Trie();
t.add('foobar');
t.add('foobaz');

t.toRegExp(); // => /fooba[rz]/

API

regexgen(inputs, flags)

Returns a regular expression that matches the given input strings.

Parameter	Type	Description
`inputs`	String Array	List of strings used to generate the regex
`flags`	String	Optional flags to add to the regex

The `Trie` Class:

add(string)

Adds the given string to the trie.

Parameter	Type	Description
`string`	String	The string to add

addAll(strings)

Adds the given array of strings to the trie.

Parameter	Type	Description
`strings`	String Array	The array of strings to add

minimize()

Returns a minimal DFA representing the strings in the trie.

toString(flags)

Returns a regex pattern that matches the strings in the trie.

Parameter	Type	Description
`flags`	String	Optional flags to add to the regex

toRegExp(flags)

Returns a regex that matches the strings in the trie.

Parameter	Type	Description
`flags`	String	Optional flags to add to the regex

CLI

regexgen also has a simple CLI to generate regexes using inputs from the command line.

$ regexgen
Usage: regexgen [-gimuy] string1 string2 string3...

The optional first argument sets the flags to add to the regex (e.g. -i for a case-insensitive match).

ES2015 and Unicode

By default, regexgen outputs a standard JavaScript regular expression, with Unicode code points above U+FFFF converted into UTF-16 surrogate pairs.

If desired, you can request an ES2015-compatible Unicode regular expression by supplying the -u flag, which keeps those code points intact.

$ regexgen 👩 👩‍💻 👩🏻‍💻 👩🏼‍💻 👩🏽‍💻 👩🏾‍💻 👩🏿‍💻
/\uD83D\uDC69(?:(?:\uD83C[\uDFFB-\uDFFF])?\u200D\uD83D\uDCBB)?/

$ regexgen -u 👩 👩‍💻 👩🏻‍💻 👩🏼‍💻 👩🏽‍💻 👩🏾‍💻 👩🏿‍💻
/\u{1F469}(?:[\u{1F3FB}-\u{1F3FF}]?\u200D\u{1F4BB})?/u

Such regular expressions are compatible with current versions of Node and the latest browsers, and may be more transferable to other languages.
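
To make the difference concrete, here is a small sketch of how a code point above U+FFFF splits into a surrogate pair, followed by a check that both patterns shown above accept the same inputs. `toSurrogatePair` is a hypothetical helper written for illustration and is not part of regexgen. The two regex literals are copied from the CLI output above, with `^` and `$` added so that whole strings are tested.

```javascript
// Hypothetical helper: split an astral code point (> 0xFFFF) into its
// UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return [high, low];
}

const hex = n => n.toString(16).toUpperCase();
console.log(toSurrogatePair(0x1F469).map(hex)); // [ 'D83D', 'DC69' ]

// Patterns copied from the CLI output above, anchored for whole-string tests.
const legacy = /^\uD83D\uDC69(?:(?:\uD83C[\uDFFB-\uDFFF])?\u200D\uD83D\uDCBB)?$/;
const unicode = /^\u{1F469}(?:[\u{1F3FB}-\u{1F3FF}]?\u200D\u{1F4BB})?$/u;

// Build the inputs from code points to avoid copy/paste encoding issues.
const woman = 0x1F469, zwj = 0x200D, laptop = 0x1F4BB;
const samples = [
  String.fromCodePoint(woman),
  String.fromCodePoint(woman, zwj, laptop),
  String.fromCodePoint(woman, 0x1F3FB, zwj, laptop),
  String.fromCodePoint(woman, 0x1F3FF, zwj, laptop),
];

const bothMatch = samples.every(s => legacy.test(s) && unicode.test(s));
console.log(bothMatch); // true
```

The first pattern works on UTF-16 code units, which is why it spells out `\uD83D\uDC69`. The second works on whole code points, which is why `\u{1F469}` needs the `u` flag.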

How does it work?

1. Generate a trie containing all of the input strings. This is a tree structure where each edge represents a single character. This removes redundancies at the start of the strings, but common branches further down are not merged.
2. A trie can be seen as a tree-shaped deterministic finite automaton (DFA), so DFA algorithms can be applied. In this case, we apply Hopcroft's DFA minimization algorithm to merge the indistinguishable states.
3. Convert the resulting minimized DFA to a regular expression. This is done using Brzozowski's algebraic method, which expresses the DFA as a system of equations that can be solved for a resulting regex. Along the way, some additional optimizations are made, such as hoisting common substrings out of an alternation and using character class ranges. This produces an Abstract Syntax Tree (AST) for the regex, which is then converted to a string and compiled to a JavaScript RegExp object.
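
The first step above can be illustrated with a short sketch. This is not regexgen's implementation: it builds a plain trie and converts it directly to a pattern, skipping both Hopcroft's minimization and Brzozowski's method, so shared suffixes are not merged and character classes are not produced. The names `buildTrie`, `toPattern`, and `escapeRegex` are hypothetical.

```javascript
// Escape characters that have special meaning in a regex.
const escapeRegex = s => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Build a trie. Each node maps a character to a child node;
// `end` marks that an input string terminates at this node.
function buildTrie(strings) {
  const root = { children: new Map(), end: false };
  for (const str of strings) {
    let node = root;
    for (const ch of str) {
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), end: false });
      }
      node = node.children.get(ch);
    }
    node.end = true;
  }
  return root;
}

// Naive conversion: emit an alternation of the child branches, made
// optional when an input string may also end at this node.
function toPattern(node) {
  const branches = [];
  for (const [ch, child] of node.children) {
    branches.push(escapeRegex(ch) + toPattern(child));
  }
  if (branches.length === 0) return '';
  if (branches.length === 1 && !node.end) return branches[0];
  const group = `(?:${branches.join('|')})`;
  return node.end ? group + '?' : group;
}

const words = ['foobar', 'foobaz', 'foozap', 'fooza'];
const pattern = toPattern(buildTrie(words));
const re = new RegExp(`^${pattern}$`);

console.log(pattern);                      // foo(?:ba(?:r|z)|za(?:p)?)
console.log(words.every(w => re.test(w))); // true
```

Compare this output with `/foo(?:zap?|ba[rz])/` from the example section: the later steps are what turn `(?:r|z)` into `[rz]` and `(?:p)?` into `p?`.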

License

regexgen is distributed under the MIT License.
