Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Regex lookbehind support #760

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

JacobHearst
Copy link

A mostly complete implementation of lookbehind support for Regex via reverse matching.

Forewarning

I'm a casual user of regex, not an expert. I wanted lookbehinds in Swift and was willing to dive into a project I knew nothing about. As a result, I may have made assumptions or choices that don't seem to make sense to more knowledgeable readers.

Some notes

  • There's a lot of duplication of code between emitReverseXYZ functions and their emitXYZ counterparts.
  • The code for forwards matching makes heavy use of the fact that a String's endIndex is still a valid index even though there's no character at that index. This means that the reverse variants of these matching functions need to take a different approach. I've tried to address this in some places (reverse quantifications), but others I haven't gotten to yet (^ assertions)

Copy link
Member

@milseman milseman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really awesome! I'm going to be trying to think about ways to clean the code up a bit (and avoid as much duplication).

Below are some quick questions. If code review detects a bug then it's a great idea to add a test case that gets fixed when the bug gets fixed. Similarly, if something's unclear in the review and a test case can formalize it, that would be great to add to.

@natecook1000 can you help me review this and integrate it with me?

I also want to get much more sanity checking in place. For example, we run every test both optimized and unoptimized, which helps catch a lot of bugs. We are so close to generalized reverse matching that I wonder if we can similarly do something for reverse matching. For example, every test that matches forwards from the start of a string will also match when compiled in reverse from the end of a string. That will also give us an extra layer of nesting.

Comment on lines +230 to +237
/// A regex component that allows a match to continue only if its contents
/// match at the given location.
///
/// A lookbehind is a zero-length assertion that its included regex matches at
/// a particular position. Lookbehinds do not advance the overall matching
/// position in the input string — once a lookbehind succeeds, matching continues
/// in the regex from the same position.
@available(SwiftStdlib 5.7, *) // TODO: How should this be gated?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to adjust the wording to denote the "behind" part of the assertion. CC @natecook1000

@@ -16,6 +16,7 @@ import Swift

extension Compiler {
struct ByteCodeGen {
var reverse = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@natecook1000 what do you think about modeling reverse as an option? Do you think that would clean up the implementation?

@@ -326,6 +382,9 @@ fileprivate extension Compiler.ByteCodeGen {

builder.buildSave(success)
builder.buildSave(intercept)
if reverse {
builder.buildReverse(1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the logic behind emitting this reverse?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the match instructions match the current position, so my logic was in order to start matching in reverse, we needed to step back once and then start the match

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we going to be emitting the reverse variant of those matching instructions?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe so. Assuming the save point machinery doesn't have to change, we call emitNode for the pattern inside the lookbehind so it should be reversed so long as reverse is true.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So long as the reverse flag is set, yes. Thinking about nested lookarounds though, I think my current implementation of the reverse flag will have a bug here because of what you mentioned in another comment

Are nested lookarounds possible? If so, do we need to restore the prior reverse value instead of clearing it?

}
reverse = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are nested lookarounds possible? If so, do we need to restore the prior reverse value instead of clearing it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tried yet but I suspect with this implementation they are not. Seems like a good thing to support though

@@ -743,22 +917,22 @@ fileprivate extension Compiler.ByteCodeGen {
guard let val = c._singleScalarAsciiValue else {
return false
}
builder.buildQuantify(asciiChar: val, kind, minTrips, maxExtraTrips, isScalarSemantics: isScalarSemantics)
builder.buildReverseQuantify(asciiChar: val, kind, minTrips, maxExtraTrips, isScalarSemantics: isScalarSemantics)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be guarded by an if reverse? (With the diff I can't quite see the full logic here).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, this is part of tryEmitFastReverseQuant. The diff is showing tryEmitFastQuant as an addition and the reverse variant as a change to the original

Comment on lines 204 to 207
/// - Returns: The character at `pos`, bounded by `start`, if it exists, along
/// with the upper bound of that character. The upper bound is always
/// scalar-aligned.
func characterAndStart(at pos: String.Index, limitedBy start: String.Index) -> (Character, String.Index)? {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's actually the lower bound of the character. Let's add a label to the return tuple value that's something like:
(Character, characterStart: String.Index)

@@ -367,4 +546,87 @@ extension String {
}
return next
}

@inline(never)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll think over how much code duplication we can avoid here later (and otherwise skip these implementation functions in the review). If we're implementing this much duplication, we'll want to make sure we have very, very solid testing in place.

firstMatchTest(
#"\d{3}(?<=USD\d{3})"#, input: "Price: USD100", match: "100", xfail: true)
#"\d{3}(?<=USD\d{3})"#, input: "Price: USD100", match: "100", validateOptimizations: false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the unoptimized case we emit reverse(1), I wonder if that's related?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up fixing this test case by checking if we're at the start during reverseMatch and reverseScalarMatch

("1234abc", nil), // FIXME: Shouldn't match but does because `^` assertions are broken
("z123abc", nil), // FIXME: Same as above
validateOptimizations: false
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see if we can have nested lookarounds (like a positive inside a negative inside a positive, etc). You can cross reference other engine's behavior such as Javascript and .NET (e.g., via https://regex101.com).

// engines generally enforce that lookbehinds are fixed width
input: "Price: JYP100", match: "100")

firstMatchTest(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect us to add lots more tests, so maybe extract into a separate function.

@natecook1000
Copy link
Member

Thanks so much for this contribution, @JacobHearst!

Two additional notes outside of the implementation:

  1. We'll need to have a discussion about whether and how to limit the kinds of patterns that can be used inside a lookbehind. Some other regex engines only allow fixed-length patterns (Perl, Python, e.g. /(?<=aa)b/) or finite-length patterns (Java, e.g. /(?<=aa?)b/), while others are more open-ended (.NET, Javascript). If we do want to limit the kinds of patterns, we might need an additional protocol to constrain the kinds of patterns that can appear in a Lookbehind or NegativeLookbehind regex builder block. (more info)

  2. We'll also need to look at how to manage availability with literals, since older Swift runtimes won't have the necessary support for lookbehinds. (This is an issue that we would need to manage with support for any additional regex syntax — this is just the first one to raise the question.)

@milseman
Copy link
Member

  1. I think we want .NET/JS style lookbehind, that is you have the full expressive power of regex inside. This PR includes the standard optimizations we have (such as a run of scalars, compile time character classes, or quantifications of scalars or CCs).

  2. That's a good point about availability, I'll find someone to follow up with.

I'm a little concerned about our bounds checking, as that's where we've introduced bugs in the past. I see that this checks for the start-of-string, but what about the start-of-substring for repeated searches? If we want to enumerate all the matches in a string, then lookbehind should be able to view previously-matched text, but we don't want to it to look outside of a Substring's bounds. This seems analogous to anchors, what do you think @natecook1000

@milseman
Copy link
Member

milseman commented Aug 25, 2024

Here's a couple examples of nesting (https://regex101.com/r/ZnXq00/1). We'll want to expand this testing to cover even deeper nesting, test nesting alternations, custom character classes, etc., to make sure we get full coverage of our complex constructs:

input = "abcdefg"

// firstMatch results:

/abcd(?<=c(?=d)d)/     // abcd
/abcd(?<=cd(?=d).)/    // no match (. consumed the d)
/abcd(?<=c(?=e)d)/     // no match (local position expecting a lookahead of d)
/abcd(?<=bc(?=de)d)/   // abcd
/abcd(?<=bc(?=de).)/   // abcd

I'd suggest switching to the extended syntax (https://github.com/swiftlang/swift-evolution/blob/main/proposals/0354-regex-literals.md#extended-delimiters-- ) or (https://github.com/swiftlang/swift-evolution/blob/main/proposals/0355-regex-syntax-run-time-construction.md#matching-options) for the more complex nesting at some point

@@ -125,9 +125,9 @@ fileprivate extension Compiler.ByteCodeGen {
let boundaryCheck = idx == lastIdx
let scalar = s.unicodeScalars[idx]
if options.isCaseInsensitive && scalar.properties.isCased {
builder.buildMatchScalarCaseInsensitive(scalar, boundaryCheck: boundaryCheck)
builder.buildMatchScalarCaseInsensitive(scalar, boundaryCheck: boundaryCheck, reverse: false)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add an assertion at the top of this function that we're not in reverse mode.

assert(!reverse)

That will help us catch times we're calling this in error

@@ -153,9 +153,9 @@ fileprivate extension Compiler.ByteCodeGen {
let boundaryCheck = idx == lastIdx
let scalar = s.unicodeScalars[idx]
if options.isCaseInsensitive && scalar.properties.isCased {
builder.buildReverseMatchScalarCaseInsensitive(scalar, boundaryCheck: boundaryCheck)
builder.buildMatchScalarCaseInsensitive(scalar, boundaryCheck: boundaryCheck, reverse: true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, an assertion that we are in reverse mode

@milseman
Copy link
Member

Here's what we need to ultimately ship:

  1. Expand testing. The large amount of semi-duplicated code in the engine for handling reverse match variants is acceptable if we have really good testing.
  2. Reject custom regex consumers in the builder syntax from being embedded in a lookbehind, as these are not reversible.
  3. Limit availability. We need to error out at parse time for literals that use lookbehind when targeting an OS that may not support them.

@milseman
Copy link
Member

@JacobHearst could you rebase this work on top of the current swift/main?

@natecook1000 when I try to build these tests, sometimes it works and sometimes I get the error:

Undefined symbol: static RegexBuilder.AlternationBuilder.buildExpression<A where A: _StringProcessing.RegexComponent>(A) -> A
... 31 more errors

have you seen this before?

@@ -110,7 +111,8 @@ fileprivate extension Compiler.ByteCodeGen {
mutating func emitQuotedLiteral(_ s: String) {
Copy link
Author

@JacobHearst JacobHearst Aug 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@milseman Curious to get your thoughts on combining forwards and reverse variants of functions like I've done here. I think this same approach could be applied to the duplicated functions in the processor but I'm a bit torn on whether it's better to have to completely separate functions or to add a bit more complexity to the forwards matching functions

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously this doesn't preclude the need for unit tests but I think it'll make maintenance easier in the long term if there aren't two nearly identical copies of functions all over the place

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoiding duplication in the compilation makes sense. For the processor itself, I don't want a dynamic branch at the lowest-most part of matching to choose between forwards and reverse, so there's more nuance there. We might be able to extract functionality into common shared code, or we might do some kind of static meta programming.

@JacobHearst
Copy link
Author

Just for transparency, the main reasons for the long gap in my work on this are:

  • Most of my brainpower is soaked up by my job these days leaving me with little to no energy for outside projects
  • The longer I go without working on this, the more intimidating it is to start up again

Knowing this, my current plan to bring this work to completion revolves around identifying functions I wrote reverse variants of and writing unit tests for the original and my variant. My goal right now is to unit test 1 pair of functions a week so we'll see how that goes. I would really like to get this into Swift 6.2 (I suspect I've already missed or will miss the boat for 6.1) and I think this plan will let me keep some level of momentum to achieve this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants