-
-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for UTF-32 matching, and other fixes #38
base: master
Are you sure you want to change the base?
Conversation
`cargo install bindgen-cli` is how it's done these days.
The PCRE2 jit is disabled dependent on the platform in build.rs. If it is disabled, then tests which assume the jit is available will fail. Fix these tests by switching to jit_if_available. This fixes the static build on macOS.
This moves the "bytes" module into regex_impl, and equips it with a trait, in preparation for UTF-32 matching.
This adds a module `utf32` which mirrors the module in `bytes`. It uses `CodeUnitWidth32` to provide the implementation.
This adds a new crate feature `jit` to enable the JIT. It is on by default.
By default, PCRE2 enables the strange sequence "(*UTF)" which turns on UTF validity checking for both patterns and subjects. This is hinted at as a potential security concern in the man page: "If the data string is very long, such a check might use sufficiently many resources as to cause your application to lose performance" For this reason, pcre2 provides a flag to avoid interpreting this sequence. Re-expose that in rust-pcre2, under the clearer name `block_utf_pattern_directive`.
Prior to this commit, rust-pcre2 would wrap pcre2's error messages with a prefix like "PCRE2: error compiling pattern:". However some clients want the raw error message as returned by pcre2. Allow access to this.
This both respects the PCRE2 API better and allows us to compile without UTF8 support.
This adds a new regex function `capture`, which captures matching substrings, using a new type `Captures`. It also adds new functions `replace` and `replace_all`, allowing substring replacement.
Thanks for this PR. Unfortunately, I'm not sure if I'm ever going to be able to merge something like this. It's an enormous change and adds a fair bit of complexity to this crate that I'm not sure I'll be able to maintain. I might be able to stomach something like this if this change could be broken down into more digestible patches, but even then I can't guarantee that I'll have the bandwidth to review them in a timely manner. To be clear, I am open to the idea of supporting this, but I do have to say that matching on UTF-32 is a bit of an "odd duck" scenario IMO. I feel a little weird asking at this point given all the work that has been done, but have you considered alternative approaches? (I'm sure you have and I'm sure there are reasons why they won't work or are too costly, but I think it would still be useful for me to understand them.) |
Yeah this a big pill to swallow for sure. To your point about "digestible patches", I have made an effort to break it down into independently reviewable, all-tests-pass commits, but the design is to make the API generic over widths (8/16/32) to reduce duplication, and so there is inevitably one very large commit that switches from hard-coded 8 to the generic. If you are interested in adding UTF-16/32 support to rust-pcre2, I think this is a pretty good approach and I'm happy to work to get it mergeable. If you're not interested in that feature, that's fine; go ahead and close, and fish can ship a fork. As to why fish-shell uses UTF-32: the answer is partly historical, and partly to handle invalid byte sequences. Rust cannot represent a file named 0xFF in a &str, so fish has a custom scheme to map between OsStr <=> Unicode which allows round-tripping invalid byte sequences. UTF-8 runs the risk of forgetting to decode: we might accidentally use Rust's native &str => OsStr conversion, which would be bad. UTF-32 means we can't forget. We may move to UTF-8 eventually, but likely not Rust's native strings, for that reason. I'm curious if you have other ideas for how to handle invalid byte sequences, while also doing Unicode-y things like text measurement. |
I think these patches are very easy to review, the question is whether you want to support this use case (interop with languages that use UTF-16 or as in this concrete case, C/C++ code that doesn't use UTF-8). |
I think @ridiculousfish is already aware of it, but @BurntSushi's answer to the invalid Unicode chars question was the bstr crate, which provides str-like ops over input that may contain invalid UTF-8 chars (so it would be used in lieu of I think this PR is worthwhile if only to maximize the options available for compatibility with the system PCRE2 library, but it's inarguably true that UTF-32 adoption in the rust world is even lower than in the C/C++ one. The majority of the changes from this PR should compile away and a type alias for the default UTF-8 wrapper would theoretically go a long way to hide that complexity from users, but I think currently (unfortunately) most tooling around the language (lsp, docs, etc) tend to "see through" the aliases and completions, etc are inevitably going to be messier. One option that minimizes the size of the diff and possibly the maintenance burden (but does not reduce the complexity!) would be to merge the generic façade but not the UTF-32 backend, and fish could <somehow> plug in its own UTF-32 backend but that seems like it would ultimately be the worst of both worlds. |
fish-shell currently uses a fork of rust-pcre2 with added UTF-32 support. We would like to merge this support upstream, as it is generally useful, and will allow fish-shell to depend on the official rust-pcre2 crate. This UTF-32 support is behind a crate flag, off by default.
To avoid code duplication between the
bytes
andutf32
modules, the approach is:CodeUnitWidth
. This encapsulates differences between matching bytes and UTF-32. It also provides hooks to use the_8
vs_32
suffixes of PCRE2.bytes
module into aregex_impl
module and make it generic over this trait. Now thebytes
module can simply beRegexImpl<CodeUnitWidth8>
, and the UTF-32 module can beRegexImpl<CodeUnitWidth32>
.bytes
andutf32
modules for better names.Each commit is independently reviewable and testable, or it can be reviewed in aggregate.
One casualty of this approach is that the documentation has somewhat regressed - documentation is now attached to the
RegexImpl
type and linked from the respective modules. I'm not sure how to fix that.