[WIP] Regex lookbehind support #760

JacobHearst · 2024-08-18T20:45:17Z

A mostly complete implementation of lookbehind support for Regex via reverse matching.

Forewarning

I'm a casual user of regex, not an expert. I wanted lookbehinds in Swift and was willing to dive into a project I knew nothing about. As a result, I may have made assumptions or choices that don't seem to make sense to more knowledgeable readers.

Some notes

There's a lot of duplication of code between emitReverseXYZ functions and their emitXYZ counterparts.
The code for forwards matching makes heavy use of the fact that a String's endIndex is still a valid index even though there's no character at that index. This means that the reverse variants of these matching functions need to take a different approach. I've tried to address this in some places (reverse quantifications), but others I haven't gotten to yet (^ assertions)

milseman

This is looking really awesome! I'm going to be trying to think about ways to clean the code up a bit (and avoid as much duplication).

Below are some quick questions. If code review detects a bug then it's a great idea to add a test case that gets fixed when the bug gets fixed. Similarly, if something's unclear in the review and a test case can formalize it, that would be great to add to.

@natecook1000 can you help me review this and integrate it with me?

I also want to get much more sanity checking in place. For example, we run every test both optimized and unoptimized, which helps catch a lot of bugs. We are so close to generalized reverse matching that I wonder if we can similarly do something for reverse matching. For example, every test that matches forwards from the start of a string will also match when compiled in reverse from the end of a string. That will also give us an extra layer of nesting.

milseman · 2024-08-19T16:26:24Z

Sources/RegexBuilder/Anchor.swift

+/// A regex component that allows a match to continue only if its contents
+/// match at the given location.
+///
+/// A lookbehind is a zero-length assertion that its included regex matches at
+/// a particular position. Lookbehinds do not advance the overall matching
+/// position in the input string — once a lookbehind succeeds, matching continues
+/// in the regex from the same position.
+@available(SwiftStdlib 5.7, *) // TODO: How should this be gated?


We probably want to adjust the wording to denote the "behind" part of the assertion. CC @natecook1000

milseman · 2024-08-19T16:29:44Z

Sources/_StringProcessing/ByteCodeGen.swift

@@ -16,6 +16,7 @@ import Swift

 extension Compiler {
  struct ByteCodeGen {
+    var reverse = false


@natecook1000 what do you think about modeling reverse as an option? Do you think that would clean up the implementation?

milseman · 2024-08-19T16:31:48Z

Sources/_StringProcessing/ByteCodeGen.swift

@@ -326,6 +382,9 @@ fileprivate extension Compiler.ByteCodeGen {

    builder.buildSave(success)
    builder.buildSave(intercept)
+    if reverse {
+       builder.buildReverse(1)


What's the logic behind emitting this reverse?

All the match instructions match the current position, so my logic was in order to start matching in reverse, we needed to step back once and then start the match

Are we going to be emitting the reverse variant of those matching instructions?

I believe so. Assuming the save point machinery doesn't have to change, we call emitNode for the pattern inside the lookbehind so it should be reversed so long as reverse is true.

So long as the reverse flag is set, yes. Thinking about nested lookarounds though, I think my current implementation of the reverse flag will have a bug here because of what you mentioned in another comment

Are nested lookarounds possible? If so, do we need to restore the prior reverse value instead of clearing it?

milseman · 2024-08-19T16:32:27Z

Sources/_StringProcessing/ByteCodeGen.swift

    }
+    reverse = false


Are nested lookarounds possible? If so, do we need to restore the prior reverse value instead of clearing it?

I haven't tried yet but I suspect with this implementation they are not. Seems like a good thing to support though

milseman · 2024-08-19T16:35:57Z

Sources/_StringProcessing/ByteCodeGen.swift

@@ -743,22 +917,22 @@ fileprivate extension Compiler.ByteCodeGen {
        guard let val = c._singleScalarAsciiValue else {
          return false
        }
-        builder.buildQuantify(asciiChar: val, kind, minTrips, maxExtraTrips, isScalarSemantics: isScalarSemantics)
+        builder.buildReverseQuantify(asciiChar: val, kind, minTrips, maxExtraTrips, isScalarSemantics: isScalarSemantics)


Should these be guarded by an if reverse? (With the diff I can't quite see the full logic here).

Nope, this is part of tryEmitFastReverseQuant. The diff is showing tryEmitFastQuant as an addition and the reverse variant as a change to the original

milseman · 2024-08-19T16:42:37Z

Sources/_StringProcessing/Engine/MEBuiltins.swift

+  /// - Returns: The character at `pos`, bounded by `start`, if it exists, along
+  ///   with the upper bound of that character. The upper bound is always
+  ///   scalar-aligned.
+  func characterAndStart(at pos: String.Index, limitedBy start: String.Index) -> (Character, String.Index)? {


It's actually the lower bound of the character. Let's add a label to the return tuple value that's something like:
(Character, characterStart: String.Index)

milseman · 2024-08-19T16:44:26Z

Sources/_StringProcessing/Engine/MEBuiltins.swift

@@ -367,4 +546,87 @@ extension String {
    }
    return next
  }
+
+  @inline(never)


I'll think over how much code duplication we can avoid here later (and otherwise skip these implementation functions in the review). If we're implementing this much duplication, we'll want to make sure we have very, very solid testing in place.

milseman · 2024-08-19T16:47:37Z

Tests/RegexTests/MatchTests.swift

    firstMatchTest(
-      #"\d{3}(?<=USD\d{3})"#, input: "Price: USD100", match: "100", xfail: true)
+      #"\d{3}(?<=USD\d{3})"#, input: "Price: USD100", match: "100", validateOptimizations: false)


In the unoptimized case we emit reverse(1), I wonder if that's related?

I ended up fixing this test case by checking if we're at the start during reverseMatch and reverseScalarMatch

milseman · 2024-08-19T16:49:09Z

Tests/RegexTests/MatchTests.swift

+      ("1234abc", nil), // FIXME: Shouldn't match but does because `^` assertions are broken
+      ("z123abc", nil), // FIXME: Same as above
+      validateOptimizations: false
+    )


Let's see if we can have nested lookarounds (like a positive inside a negative inside a positive, etc). You can cross reference other engine's behavior such as Javascript and .NET (e.g., via https://regex101.com).

milseman · 2024-08-19T16:49:42Z

Tests/RegexTests/MatchTests.swift

-    // engines generally enforce that lookbehinds are fixed width
+      input: "Price: JYP100", match: "100")
+
+    firstMatchTest(


I expect us to add lots more tests, so maybe extract into a separate function.

natecook1000 · 2024-08-19T21:10:55Z

Thanks so much for this contribution, @JacobHearst!

Two additional notes outside of the implementation:

We'll need to have a discussion about whether and how to limit the kinds of patterns that can be used inside a lookbehind. Some other regex engines only allow fixed-length patterns (Perl, Python, e.g. /(?<=aa)b/) or finite-length patterns (Java, e.g. /(?<=aa?)b/), while others are more open-ended (.NET, Javascript). If we do want to limit the kinds of patterns, we might need an additional protocol to constrain the kinds of patterns that can appear in a Lookbehind or NegativeLookbehind regex builder block. (more info)
We'll also need to look at how to manage availability with literals, since older Swift runtimes won't have the necessary support for lookbehinds. (This is an issue that we would need to manage with support for any additional regex syntax — this is just the first one to raise the question.)

milseman · 2024-08-20T16:02:18Z

I think we want .NET/JS style lookbehind, that is you have the full expressive power of regex inside. This PR includes the standard optimizations we have (such as a run of scalars, compile time character classes, or quantifications of scalars or CCs).
That's a good point about availability, I'll find someone to follow up with.

I'm a little concerned about our bounds checking, as that's where we've introduced bugs in the past. I see that this checks for the start-of-string, but what about the start-of-substring for repeated searches? If we want to enumerate all the matches in a string, then lookbehind should be able to view previously-matched text, but we don't want to it to look outside of a Substring's bounds. This seems analogous to anchors, what do you think @natecook1000

milseman · 2024-08-25T17:05:45Z

Here's a couple examples of nesting (https://regex101.com/r/ZnXq00/1). We'll want to expand this testing to cover even deeper nesting, test nesting alternations, custom character classes, etc., to make sure we get full coverage of our complex constructs:

input = "abcdefg"

// firstMatch results:

/abcd(?<=c(?=d)d)/     // abcd
/abcd(?<=cd(?=d).)/    // no match (. consumed the d)
/abcd(?<=c(?=e)d)/     // no match (local position expecting a lookahead of d)
/abcd(?<=bc(?=de)d)/   // abcd
/abcd(?<=bc(?=de).)/   // abcd

I'd suggest switching to the extended syntax (https://github.com/swiftlang/swift-evolution/blob/main/proposals/0354-regex-literals.md#extended-delimiters-- ) or (https://github.com/swiftlang/swift-evolution/blob/main/proposals/0355-regex-syntax-run-time-construction.md#matching-options) for the more complex nesting at some point

milseman · 2024-08-26T15:32:42Z

Sources/_StringProcessing/ByteCodeGen.swift

@@ -125,9 +125,9 @@ fileprivate extension Compiler.ByteCodeGen {
        let boundaryCheck = idx == lastIdx
        let scalar = s.unicodeScalars[idx]
        if options.isCaseInsensitive && scalar.properties.isCased {
-          builder.buildMatchScalarCaseInsensitive(scalar, boundaryCheck: boundaryCheck)
+          builder.buildMatchScalarCaseInsensitive(scalar, boundaryCheck: boundaryCheck, reverse: false)


Let's add an assertion at the top of this function that we're not in reverse mode.

assert(!reverse)

That will help us catch times we're calling this in error

milseman · 2024-08-26T15:36:37Z

Sources/_StringProcessing/ByteCodeGen.swift

@@ -153,9 +153,9 @@ fileprivate extension Compiler.ByteCodeGen {
        let boundaryCheck = idx == lastIdx
        let scalar = s.unicodeScalars[idx]
        if options.isCaseInsensitive && scalar.properties.isCased {
-          builder.buildReverseMatchScalarCaseInsensitive(scalar, boundaryCheck: boundaryCheck)
+          builder.buildMatchScalarCaseInsensitive(scalar, boundaryCheck: boundaryCheck, reverse: true)


Similarly, an assertion that we are in reverse mode

milseman · 2024-08-27T14:17:54Z

Here's what we need to ultimately ship:

Expand testing. The large amount of semi-duplicated code in the engine for handling reverse match variants is acceptable if we have really good testing.
Reject custom regex consumers in the builder syntax from being embedded in a lookbehind, as these are not reversible.
Limit availability. We need to error out at parse time for literals that use lookbehind when targeting an OS that may not support them.

milseman · 2024-08-28T15:13:13Z

@JacobHearst could you rebase this work on top of the current swift/main?

@natecook1000 when I try to build these tests, sometimes it works and sometimes I get the error:

Undefined symbol: static RegexBuilder.AlternationBuilder.buildExpression<A where A: _StringProcessing.RegexComponent>(A) -> A
... 31 more errors

have you seen this before?

JacobHearst · 2024-08-31T14:52:33Z

Sources/_StringProcessing/ByteCodeGen.swift

@@ -110,7 +111,8 @@ fileprivate extension Compiler.ByteCodeGen {
  mutating func emitQuotedLiteral(_ s: String) {


@milseman Curious to get your thoughts on combining forwards and reverse variants of functions like I've done here. I think this same approach could be applied to the duplicated functions in the processor but I'm a bit torn on whether it's better to have to completely separate functions or to add a bit more complexity to the forwards matching functions

Obviously this doesn't preclude the need for unit tests but I think it'll make maintenance easier in the long term if there aren't two nearly identical copies of functions all over the place

Avoiding duplication in the compilation makes sense. For the processor itself, I don't want a dynamic branch at the lowest-most part of matching to choose between forwards and reverse, so there's more nuance there. We might be able to extract functionality into common shared code, or we might do some kind of static meta programming.

JacobHearst · 2025-01-01T18:01:00Z

Just for transparency, the main reasons for the long gap in my work on this are:

Most of my brainpower is soaked up by my job these days leaving me with little to no energy for outside projects
The longer I go without working on this, the more intimidating it is to start up again

Knowing this, my current plan to bring this work to completion revolves around identifying functions I wrote reverse variants of and writing unit tests for the original and my variant. My goal right now is to unit test 1 pair of functions a week so we'll see how that goes. I would really like to get this into Swift 6.2 (I suspect I've already missed or will miss the boat for 6.1) and I think this plan will let me keep some level of momentum to achieve this.

milseman reviewed Aug 19, 2024

View reviewed changes

milseman reviewed Aug 26, 2024

View reviewed changes

JacobHearst force-pushed the reverse-matching branch from 2d7cadc to c9aa07c Compare August 30, 2024 13:12

JacobHearst commented Aug 31, 2024

View reviewed changes

milseman mentioned this pull request Sep 12, 2024

DO NOT MERGE: Make a Regex toolchain in CI swiftlang/swift#76430

Open

JacobHearst force-pushed the reverse-matching branch from 00bbc12 to b3f706f Compare December 27, 2024 00:16

JacobHearst changed the base branch from swift/main to main December 27, 2024 00:16

Jacob Hearst added 4 commits January 20, 2025 11:48

Reapply changes on top of main

e0a73ab

Squash some bugs

cf53ddd

Add ASCII _quickMatch and _quickReverseMatch tests

2d3c691

Unit test matchScalar and reversMatchScalar

f28e9fa

JacobHearst force-pushed the reverse-matching branch from 6543f78 to f28e9fa Compare January 20, 2025 17:49

Add reverseMatchUTF8

41f331d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Regex lookbehind support #760

[WIP] Regex lookbehind support #760

JacobHearst commented Aug 18, 2024

milseman left a comment •

edited

Loading

milseman Aug 19, 2024

milseman Aug 19, 2024

milseman Aug 19, 2024

JacobHearst Aug 25, 2024

milseman Aug 26, 2024

JacobHearst Aug 26, 2024

JacobHearst Aug 26, 2024

milseman Aug 19, 2024

JacobHearst Aug 25, 2024

milseman Aug 19, 2024

JacobHearst Aug 25, 2024

milseman Aug 19, 2024

milseman Aug 19, 2024

milseman Aug 19, 2024

JacobHearst Aug 31, 2024

milseman Aug 19, 2024

milseman Aug 19, 2024

natecook1000 commented Aug 19, 2024

milseman commented Aug 20, 2024

milseman commented Aug 25, 2024 •

edited

Loading

milseman Aug 26, 2024

milseman Aug 26, 2024

milseman commented Aug 27, 2024

milseman commented Aug 28, 2024

JacobHearst Aug 31, 2024 •

edited

Loading

JacobHearst Aug 31, 2024

milseman Sep 2, 2024

JacobHearst commented Jan 1, 2025

		@@ -110,7 +111,8 @@ fileprivate extension Compiler.ByteCodeGen {
		mutating func emitQuotedLiteral(_ s: String) {

[WIP] Regex lookbehind support #760

Are you sure you want to change the base?

[WIP] Regex lookbehind support #760

Conversation

JacobHearst commented Aug 18, 2024

Forewarning

Some notes

milseman left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

natecook1000 commented Aug 19, 2024

milseman commented Aug 20, 2024

milseman commented Aug 25, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

milseman commented Aug 27, 2024

milseman commented Aug 28, 2024

JacobHearst Aug 31, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JacobHearst commented Jan 1, 2025

milseman left a comment •

edited

Loading

milseman commented Aug 25, 2024 •

edited

Loading

JacobHearst Aug 31, 2024 •

edited

Loading