Add support for UTF-32 matching, and other fixes #38

ridiculousfish · 2024-02-04T02:25:52Z

fish-shell currently uses a fork of rust-pcre2 with added UTF-32 support. We would like to merge this support upstream, as it is generally useful, and will allow fish-shell to depend on the official rust-pcre2 crate. This UTF-32 support is behind a crate flag, off by default.

To avoid code duplication between the bytes and utf32 modules, the approach is:

1. Add a new trait `CodeUnitWidth`. This encapsulates differences between matching bytes and UTF-32. It also provides hooks to use the `_8` vs `_32` suffixes of PCRE2.
2. Migrate the `bytes` module into a `regex_impl` module and make it generic over this trait. Now the `bytes` module can simply be `RegexImpl<CodeUnitWidth8>`, and the UTF-32 module can be `RegexImpl<CodeUnitWidth32>`.
3. Provide type aliases in the `bytes` and `utf32` modules for better names.
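
For readers skimming the thread, the shape of that design can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the associated items `Unit` and `PCRE2_SUFFIX`, the `backend_function` helper, and the alias names are hypothetical, and the "regex" here is only a literal substring finder standing in for the PCRE2 calls.

```rust
/// Hypothetical trait capturing what differs between code unit widths.
pub trait CodeUnitWidth {
    /// Element type of patterns and subjects (assumed: u8 for bytes, char for UTF-32).
    type Unit: Copy + PartialEq;
    /// Suffix of the PCRE2 C functions this width would dispatch to.
    const PCRE2_SUFFIX: &'static str;
}

pub struct CodeUnitWidth8;
pub struct CodeUnitWidth32;

impl CodeUnitWidth for CodeUnitWidth8 {
    type Unit = u8;
    const PCRE2_SUFFIX: &'static str = "_8";
}

impl CodeUnitWidth for CodeUnitWidth32 {
    type Unit = char;
    const PCRE2_SUFFIX: &'static str = "_32";
}

/// Stand-in for the generic regex type: it only finds a literal pattern,
/// which is enough to show one implementation serving both widths.
pub struct RegexImpl<W: CodeUnitWidth> {
    pattern: Vec<W::Unit>,
}

impl<W: CodeUnitWidth> RegexImpl<W> {
    pub fn new(pattern: &[W::Unit]) -> Self {
        Self { pattern: pattern.to_vec() }
    }

    /// Offset (in code units) of the first occurrence of the pattern.
    pub fn find(&self, subject: &[W::Unit]) -> Option<usize> {
        if self.pattern.is_empty() {
            return Some(0);
        }
        subject
            .windows(self.pattern.len())
            .position(|w| w == self.pattern.as_slice())
    }

    /// Name of the width-specific C function, e.g. "pcre2_compile_8".
    pub fn backend_function(&self, name: &str) -> String {
        format!("pcre2_{}{}", name, W::PCRE2_SUFFIX)
    }
}

// In the crate these aliases would live in the `bytes` and `utf32` modules
// (both presumably named `Regex`); they are flattened here to keep the sketch short.
pub type BytesRegex = RegexImpl<CodeUnitWidth8>;
pub type Utf32Regex = RegexImpl<CodeUnitWidth32>;
```

With this shape, `BytesRegex::new(&b"fish"[..])` and `Utf32Regex::new(&chars)` share every method body, and only the trait impls differ per width.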

Each commit is independently reviewable and testable, or it can be reviewed in aggregate.

One casualty of this approach is that the documentation has somewhat regressed - documentation is now attached to the RegexImpl type and linked from the respective modules. I'm not sure how to fix that.

`cargo install bindgen-cli` is how it's done these days.

The PCRE2 jit is disabled dependent on the platform in build.rs. If it is disabled, then tests which assume the jit is available will fail. Fix these tests by switching to jit_if_available. This fixes the static build on macOS.
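
The difference between the two builder options can be modeled in a few lines. This is a self-contained sketch of the behavior only, assuming that `jit` treats a missing JIT as an error while `jit_if_available` falls back silently; the names `JitPolicy`, `Engine`, and `select_engine` are hypothetical and do not exist in rust-pcre2.

```rust
/// Hypothetical model of the two JIT policies a test can ask for.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum JitPolicy {
    /// Models `jit(true)`: a build without JIT support is an error.
    Required,
    /// Models `jit_if_available(true)`: fall back to the interpreter.
    IfAvailable,
}

#[derive(Debug, PartialEq)]
pub enum Engine {
    Jit,
    Interpreter,
}

/// Decide which engine runs, given whether build.rs enabled the JIT.
pub fn select_engine(policy: JitPolicy, jit_available: bool) -> Result<Engine, String> {
    match (policy, jit_available) {
        (_, true) => Ok(Engine::Jit),
        (JitPolicy::IfAvailable, false) => Ok(Engine::Interpreter),
        (JitPolicy::Required, false) => Err("JIT requested but not available".to_string()),
    }
}
```

Under this model, a test written against `Required` fails on a platform where the JIT is disabled, while the same test written against `IfAvailable` still runs, just without the JIT.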

This moves the "bytes" module into regex_impl, and equips it with a trait, in preparation for UTF-32 matching.

This adds a module `utf32` which mirrors the module in `bytes`. It uses `CodeUnitWidth32` to provide the implementation.

This adds a new crate feature `jit` to enable the JIT. It is on by default.

By default, PCRE2 enables the strange sequence "(*UTF)" which turns on UTF validity checking for both patterns and subjects. This is hinted at as a potential security concern in the man page: "If the data string is very long, such a check might use sufficiently many resources as to cause your application to lose performance". For this reason, pcre2 provides a flag to avoid interpreting this sequence. Re-expose that in rust-pcre2, under the clearer name `block_utf_pattern_directive`.

Prior to this commit, rust-pcre2 would wrap pcre2's error messages with a prefix like "PCRE2: error compiling pattern:". However some clients want the raw error message as returned by pcre2. Allow access to this.

This both respects the PCRE2 API better and allows us to compile without UTF8 support.

This adds a new regex function `capture`, which captures matching substrings, using a new type `Captures`. It also adds new functions `replace` and `replace_all`, allowing substring replacement.

BurntSushi · 2024-03-22T17:03:53Z

Thanks for this PR. Unfortunately, I'm not sure if I'm ever going to be able to merge something like this. It's an enormous change and adds a fair bit of complexity to this crate that I'm not sure I'll be able to maintain.

I might be able to stomach something like this if this change could be broken down into more digestible patches, but even then I can't guarantee that I'll have the bandwidth to review them in a timely manner.

To be clear, I am open to the idea of supporting this, but I do have to say that matching on UTF-32 is a bit of an "odd duck" scenario IMO. I feel a little weird asking at this point given all the work that has been done, but have you considered alternative approaches? (I'm sure you have and I'm sure there are reasons why they won't work or are too costly, but I think it would still be useful for me to understand them.)

ridiculousfish · 2024-04-12T18:12:41Z

Yeah, this is a big pill to swallow for sure. To your point about "digestible patches", I have made an effort to break it down into independently reviewable, all-tests-pass commits, but the design is to make the API generic over widths (8/16/32) to reduce duplication, and so there is inevitably one very large commit that switches from hard-coded 8 to the generic.

If you are interested in adding UTF-16/32 support to rust-pcre2, I think this is a pretty good approach and I'm happy to work to get it mergeable. If you're not interested in that feature, that's fine; go ahead and close, and fish can ship a fork.

As to why fish-shell uses UTF-32: the answer is partly historical, and partly to handle invalid byte sequences. Rust cannot represent a file named 0xFF in a &str, so fish has a custom scheme to map between OsStr <=> Unicode which allows round-tripping invalid byte sequences. UTF-8 runs the risk of forgetting to decode: we might accidentally use Rust's native &str => OsStr conversion, which would be bad. UTF-32 means we can't forget.
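
To make the round-tripping idea concrete, here is a minimal sketch of one way such a mapping could work. It is not fish's actual code: the function names are hypothetical, and the choice of a private-use range starting at `0xF600` to carry raw bytes is an assumption made for illustration.

```rust
/// Assumed base of a 256-code-point private-use range used to carry raw bytes.
const ESCAPE_BASE: u32 = 0xF600;

fn escape_byte(b: u8) -> char {
    // 0xF600..=0xF6FF lies inside the BMP private-use area, so this never fails.
    char::from_u32(ESCAPE_BASE + b as u32).unwrap()
}

fn is_escape(c: char) -> bool {
    (ESCAPE_BASE..ESCAPE_BASE + 256).contains(&(c as u32))
}

/// Decode arbitrary bytes into UTF-32 code points, escaping anything that
/// is not valid UTF-8 so it can be recovered later.
pub fn decode_lossless(mut bytes: &[u8]) -> Vec<char> {
    let mut out = Vec::new();
    while !bytes.is_empty() {
        let (valid, bad_len) = match std::str::from_utf8(bytes) {
            Ok(s) => (s, 0),
            Err(e) => {
                let valid = std::str::from_utf8(&bytes[..e.valid_up_to()]).unwrap();
                // error_len() is None when the input ends in a truncated sequence.
                let bad = e.error_len().unwrap_or(bytes.len() - e.valid_up_to());
                (valid, bad)
            }
        };
        for c in valid.chars() {
            if is_escape(c) {
                // A genuine code point in the escape range would be ambiguous,
                // so its UTF-8 bytes are escaped one by one instead.
                let mut buf = [0u8; 4];
                out.extend(c.encode_utf8(&mut buf).bytes().map(escape_byte));
            } else {
                out.push(c);
            }
        }
        let consumed = valid.len() + bad_len;
        out.extend(bytes[valid.len()..consumed].iter().copied().map(escape_byte));
        bytes = &bytes[consumed..];
    }
    out
}

/// Inverse of `decode_lossless`: escaped code points become single raw bytes.
pub fn encode_lossless(chars: &[char]) -> Vec<u8> {
    let mut out = Vec::new();
    for &c in chars {
        if is_escape(c) {
            out.push((c as u32 - ESCAPE_BASE) as u8);
        } else {
            let mut buf = [0u8; 4];
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
        }
    }
    out
}
```

In this sketch a file named with the single byte 0xFF decodes to one escaped code point and encodes back to the same byte. Because the decoded form is a `Vec<char>` rather than a `&str`, it cannot be handed to Rust's native string-to-`OsStr` conversion by accident, which is the "can't forget" property described above.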

We may move to UTF-8 eventually, but likely not Rust's native strings, for that reason.

I'm curious if you have other ideas for how to handle invalid byte sequences, while also doing Unicode-y things like text measurement.

krobelus · 2024-04-14T19:11:15Z

I think these patches are very easy to review; the question is whether you want to support this use case (interop with languages that use UTF-16 or, as in this concrete case, C/C++ code that doesn't use UTF-8).
fish could probably convert to UTF-8 and back, but that would be silly given that we already have this.

mqudsi · 2024-04-16T17:47:58Z

I think @ridiculousfish is already aware of it, but @BurntSushi's answer to the invalid Unicode chars question was the bstr crate, which provides str-like ops over input that may contain invalid UTF-8 chars (so it would be used in lieu of &str altogether).

I think this PR is worthwhile if only to maximize the options available for compatibility with the system PCRE2 library, but it's inarguably true that UTF-32 adoption in the Rust world is even lower than in the C/C++ one. The majority of the changes from this PR should compile away, and a type alias for the default UTF-8 wrapper would theoretically go a long way to hide that complexity from users, but I think currently (unfortunately) most tooling around the language (LSP, docs, etc.) tends to "see through" the aliases, so completions etc. are inevitably going to be messier.

One option that minimizes the size of the diff and possibly the maintenance burden (but does not reduce the complexity!) would be to merge the generic façade but not the UTF-32 backend, and fish could <somehow> plug in its own UTF-32 backend but that seems like it would ultimately be the worst of both worlds.

ridiculousfish and others added 10 commits January 28, 2024 16:42

Update generate-bindings instructions to install bindgen-cli

28d312a

`cargo install bindgen-cli` is how it's done these days.

Switch from jit to jit_if_available in tests

0637728

The PCRE2 jit is disabled dependent on the platform in build.rs. If it is disabled, then tests which assume the jit is available will fail. Fix these tests by switching to jit_if_available. This fixes the static build on macOS.

Factor bytes into regex_impl and prepare for UTF-32

41a8085

This moves the "bytes" module into regex_impl, and equips it with a trait, in preparation for UTF-32 matching.

Add support for UTF-32 matching

17fb76a

This adds a module `utf32` which mirrors the module in `bytes`. It uses `CodeUnitWidth32` to provide the implementation.

Add crate feature for JIT

27c6eb0

This adds a new crate feature `jit` to enable the JIT. It is on by default.

Mark Error::error_message as public

338a966

Prior to this commit, rust-pcre2 would wrap pcre2's error messages with a prefix like "PCRE2: error compiling pattern:". However some clients want the raw error message as returned by pcre2. Allow access to this.

Make is_jit_available() require a CodeUnitWidth

f933dc9

This both respects the PCRE2 API better and allows us to compile without UTF8 support.

Add support for capture groups and substring replacement

f56601b

This adds a new regex function `capture`, which captures matching substrings, using a new type `Captures`. It also adds new functions `replace` and `replace_all`, allowing substring replacement.

Update the CI workflow to build and test UTF32

f0e5adb

ridiculousfish force-pushed the utf32 branch from 529780e to f0e5adb Compare February 4, 2024 03:32

ridiculousfish marked this pull request as ready for review February 4, 2024 03:56

LeoniePhiline mentioned this pull request Mar 24, 2024

Feature request: supporting lookarounds (PCRE2 or fancy-regex) facebookincubator/fastmod#49

Open

ridiculousfish mentioned this pull request Jul 27, 2024

Let's release Rust-based fish fish-shell/fish-shell#10633

Open

25 tasks

ridiculousfish mentioned this pull request Sep 22, 2024

Incorporate UTF-32 changes fish-shell/rust-pcre2#3

Merged
