-
Notifications
You must be signed in to change notification settings - Fork 47
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Regex lookbehind support #760
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking really awesome! I'm going to be trying to think about ways to clean the code up a bit (and avoid as much duplication).
Below are some quick questions. If code review detects a bug then it's a great idea to add a test case that gets fixed when the bug gets fixed. Similarly, if something's unclear in the review and a test case can formalize it, that would be great to add to.
@natecook1000 can you help me review this and integrate it with me?
I also want to get much more sanity checking in place. For example, we run every test both optimized and unoptimized, which helps catch a lot of bugs. We are so close to generalized reverse matching that I wonder if we can similarly do something for reverse matching. For example, every test that matches forwards from the start of a string will also match when compiled in reverse from the end of a string. That will also give us an extra layer of nesting.
/// A regex component that allows a match to continue only if its contents | ||
/// match at the given location. | ||
/// | ||
/// A lookbehind is a zero-length assertion that its included regex matches at | ||
/// a particular position. Lookbehinds do not advance the overall matching | ||
/// position in the input string — once a lookbehind succeeds, matching continues | ||
/// in the regex from the same position. | ||
@available(SwiftStdlib 5.7, *) // TODO: How should this be gated? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to adjust the wording to denote the "behind" part of the assertion. CC @natecook1000
@@ -16,6 +16,7 @@ import Swift | |||
|
|||
extension Compiler { | |||
struct ByteCodeGen { | |||
var reverse = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@natecook1000 what do you think about modeling reverse as an option? Do you think that would clean up the implementation?
@@ -326,6 +382,9 @@ fileprivate extension Compiler.ByteCodeGen { | |||
|
|||
builder.buildSave(success) | |||
builder.buildSave(intercept) | |||
if reverse { | |||
builder.buildReverse(1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the logic behind emitting this reverse?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the match
instructions match the current position, so my logic was in order to start matching in reverse, we needed to step back once and then start the match
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we going to be emitting the reverse variant of those matching instructions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe so. Assuming the save point machinery doesn't have to change, we call emitNode
for the pattern inside the lookbehind so it should be reversed so long as reverse
is true.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So long as the reverse
flag is set, yes. Thinking about nested lookarounds though, I think my current implementation of the reverse
flag will have a bug here because of what you mentioned in another comment
Are nested lookarounds possible? If so, do we need to restore the prior reverse value instead of clearing it?
} | ||
reverse = false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are nested lookarounds possible? If so, do we need to restore the prior reverse value instead of clearing it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't tried yet but I suspect with this implementation they are not. Seems like a good thing to support though
@@ -743,22 +917,22 @@ fileprivate extension Compiler.ByteCodeGen { | |||
guard let val = c._singleScalarAsciiValue else { | |||
return false | |||
} | |||
builder.buildQuantify(asciiChar: val, kind, minTrips, maxExtraTrips, isScalarSemantics: isScalarSemantics) | |||
builder.buildReverseQuantify(asciiChar: val, kind, minTrips, maxExtraTrips, isScalarSemantics: isScalarSemantics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be guarded by an if reverse
? (With the diff I can't quite see the full logic here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, this is part of tryEmitFastReverseQuant
. The diff is showing tryEmitFastQuant
as an addition and the reverse variant as a change to the original
/// - Returns: The character at `pos`, bounded by `start`, if it exists, along | ||
/// with the upper bound of that character. The upper bound is always | ||
/// scalar-aligned. | ||
func characterAndStart(at pos: String.Index, limitedBy start: String.Index) -> (Character, String.Index)? { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's actually the lower bound of the character. Let's add a label to the return tuple value that's something like:
(Character, characterStart: String.Index)
@@ -367,4 +546,87 @@ extension String { | |||
} | |||
return next | |||
} | |||
|
|||
@inline(never) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll think over how much code duplication we can avoid here later (and otherwise skip these implementation functions in the review). If we're implementing this much duplication, we'll want to make sure we have very, very solid testing in place.
Tests/RegexTests/MatchTests.swift
Outdated
firstMatchTest( | ||
#"\d{3}(?<=USD\d{3})"#, input: "Price: USD100", match: "100", xfail: true) | ||
#"\d{3}(?<=USD\d{3})"#, input: "Price: USD100", match: "100", validateOptimizations: false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the unoptimized case we emit reverse(1)
, I wonder if that's related?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up fixing this test case by checking if we're at the start
during reverseMatch
and reverseScalarMatch
("1234abc", nil), // FIXME: Shouldn't match but does because `^` assertions are broken | ||
("z123abc", nil), // FIXME: Same as above | ||
validateOptimizations: false | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's see if we can have nested lookarounds (like a positive inside a negative inside a positive, etc). You can cross reference other engine's behavior such as Javascript and .NET (e.g., via https://regex101.com).
// engines generally enforce that lookbehinds are fixed width | ||
input: "Price: JYP100", match: "100") | ||
|
||
firstMatchTest( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect us to add lots more tests, so maybe extract into a separate function.
Thanks so much for this contribution, @JacobHearst! Two additional notes outside of the implementation:
|
I'm a little concerned about our bounds checking, as that's where we've introduced bugs in the past. I see that this checks for the start-of-string, but what about the start-of-substring for repeated searches? If we want to enumerate all the matches in a string, then lookbehind should be able to view previously-matched text, but we don't want to it to look outside of a |
Here's a couple examples of nesting (https://regex101.com/r/ZnXq00/1). We'll want to expand this testing to cover even deeper nesting, test nesting alternations, custom character classes, etc., to make sure we get full coverage of our complex constructs: input = "abcdefg"
// firstMatch results:
/abcd(?<=c(?=d)d)/ // abcd
/abcd(?<=cd(?=d).)/ // no match (. consumed the d)
/abcd(?<=c(?=e)d)/ // no match (local position expecting a lookahead of d)
/abcd(?<=bc(?=de)d)/ // abcd
/abcd(?<=bc(?=de).)/ // abcd I'd suggest switching to the extended syntax (https://github.com/swiftlang/swift-evolution/blob/main/proposals/0354-regex-literals.md#extended-delimiters-- ) or (https://github.com/swiftlang/swift-evolution/blob/main/proposals/0355-regex-syntax-run-time-construction.md#matching-options) for the more complex nesting at some point |
@@ -125,9 +125,9 @@ fileprivate extension Compiler.ByteCodeGen { | |||
let boundaryCheck = idx == lastIdx | |||
let scalar = s.unicodeScalars[idx] | |||
if options.isCaseInsensitive && scalar.properties.isCased { | |||
builder.buildMatchScalarCaseInsensitive(scalar, boundaryCheck: boundaryCheck) | |||
builder.buildMatchScalarCaseInsensitive(scalar, boundaryCheck: boundaryCheck, reverse: false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add an assertion at the top of this function that we're not in reverse mode.
assert(!reverse)
That will help us catch times we're calling this in error
@@ -153,9 +153,9 @@ fileprivate extension Compiler.ByteCodeGen { | |||
let boundaryCheck = idx == lastIdx | |||
let scalar = s.unicodeScalars[idx] | |||
if options.isCaseInsensitive && scalar.properties.isCased { | |||
builder.buildReverseMatchScalarCaseInsensitive(scalar, boundaryCheck: boundaryCheck) | |||
builder.buildMatchScalarCaseInsensitive(scalar, boundaryCheck: boundaryCheck, reverse: true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, an assertion that we are in reverse mode
Here's what we need to ultimately ship:
|
@JacobHearst could you rebase this work on top of the current swift/main? @natecook1000 when I try to build these tests, sometimes it works and sometimes I get the error:
have you seen this before? |
2d7cadc
to
c9aa07c
Compare
@@ -110,7 +111,8 @@ fileprivate extension Compiler.ByteCodeGen { | |||
mutating func emitQuotedLiteral(_ s: String) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@milseman Curious to get your thoughts on combining forwards and reverse variants of functions like I've done here. I think this same approach could be applied to the duplicated functions in the processor but I'm a bit torn on whether it's better to have to completely separate functions or to add a bit more complexity to the forwards matching functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Obviously this doesn't preclude the need for unit tests but I think it'll make maintenance easier in the long term if there aren't two nearly identical copies of functions all over the place
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoiding duplication in the compilation makes sense. For the processor itself, I don't want a dynamic branch at the lowest-most part of matching to choose between forwards and reverse, so there's more nuance there. We might be able to extract functionality into common shared code, or we might do some kind of static meta programming.
00bbc12
to
b3f706f
Compare
Just for transparency, the main reasons for the long gap in my work on this are:
Knowing this, my current plan to bring this work to completion revolves around identifying functions I wrote reverse variants of and writing unit tests for the original and my variant. My goal right now is to unit test 1 pair of functions a week so we'll see how that goes. I would really like to get this into Swift 6.2 (I suspect I've already missed or will miss the boat for 6.1) and I think this plan will let me keep some level of momentum to achieve this. |
6543f78
to
f28e9fa
Compare
A mostly complete implementation of lookbehind support for Regex via reverse matching.
Forewarning
I'm a casual user of regex, not an expert. I wanted lookbehinds in Swift and was willing to dive into a project I knew nothing about. As a result, I may have made assumptions or choices that don't seem to make sense to more knowledgeable readers.
Some notes
emitReverseXYZ
functions and theiremitXYZ
counterparts.String
'sendIndex
is still a valid index even though there's no character at that index. This means that the reverse variants of these matching functions need to take a different approach. I've tried to address this in some places (reverse quantifications), but others I haven't gotten to yet (^
assertions)