[rust] Skip unnecessary calls to find_among #202

jedav · 2024-10-30T20:45:44Z

This change ports an optimization from the C stemmer generator to the rust generator.

In order to determine whether find_among and find_among_b are worth calling, the generated C checks that the current string is long enough to potentially match the shortest string literal in the among, then does some bitwise magic to check that the byte in the current string at the length of the shortest string in the among may match the byte at that position in at least one of the among strings. If the current string is too short or that byte doesn't match any of the amongs at that position, then we can fail-fast instead of calling find_among or find_among_b.

This change allows the generated rust stemmers to similarly skip find_among and find_among_b calls where we can quickly determine that the call could not succeed. In some stemming benchmarks I've been running, this improves the performance of the rust english stemmer by between 40% and 50%.

This change ports an optimization from the C stemmer generator to the rust generator. In order to determine whether find_among and find_among_b are worth calling, the generated C checks that the current string is long enough to potentially match the shortest string literal in the among, then does some bitwise magic to check that the characters in the current string at length of the end of the shortest string in the among are the same as the character at that position in at least one of the among strings. If the current string is too short or that character doesn't match any of the amongs at that position, then we can fail-fast instead of calling find_among. This change allows the generated rust stemmers to similarly skip find_among calls where we can quickly determine that find_among would not succeed. In some stemming benchmarks I've been running, this improves the performance of the rust english stemmer by between 40 and 50%.

ojwb · 2024-10-31T00:53:27Z

Sounds good, but CI failed as this seems to result in different output for one of the arabic test vocabulary words so there's something not quite right in the implementation.

To reproduce locally, try make check_rust_arabic (or make check_rust to run all of them).

jedav · 2024-10-31T20:11:33Z

Ah, I see, I had missed make check. Will investigate, thanks for pointing this out.

jedav · 2024-11-01T23:06:29Z

OK, I think it's passing make check_rust after narrowing the numeric types in the bitwise operations in the comparison. The performance improvement is now only 27%, which makes some sense; previously it was skipping more find_amongs than it should have, resulting in words not being stemmed down as far as they should've been, while now it is performing those find_amongs and doing more work.

jedav · 2024-11-01T23:13:51Z

Ooh, I think understand it now, it was simplifying the !(...) != 0 to (...) == 0 that fixed it I think - in rust, ! on an integer is bitwise negation like ~ in C, not boolean negation like ! is in C.

ojwb · 2024-11-05T03:36:37Z

I guess those ! semantics seemed like a good idea to someone, but it does rather set a trap for those coming from a C/C++ background.

The 40-50% gain did seem a bit surprising, but the C optimisation was a long time ago.

jedav · 2024-11-07T20:47:04Z

Yeah, the ! semantics definitely got me pretty good.

I wasn't as surprised with a nominal speedup of that magnitude because we were seeing the rust version as 60-70% slower than the C version in benchmarks, and it seemed possible that this optimization might be the main cause. It's still a big piece but we've found other sources for the remaining performance delta (mostly artifacts of the way we were wrapping the rust code for benchmarking, resulting in extra allocations).

Is there anything else you'd like to see changed here before this can be merged? I'm not totally clear on the process here.

ojwb · 2024-11-19T03:07:29Z

Sorry, was just on vacation and not really able to actually look at the changes.

Looks good, thanks.

jedav · 2024-11-20T01:03:01Z

Thanks Olly!

[rust] Fix types in find_among skip checks

9506b39

jedav force-pushed the faster_rust branch from 4ea8a16 to 9506b39 Compare November 1, 2024 22:01

ojwb merged commit a10c9c4 into snowballstem:master Nov 19, 2024
18 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[rust] Skip unnecessary calls to find_among #202

[rust] Skip unnecessary calls to find_among #202

jedav commented Oct 30, 2024

ojwb commented Oct 31, 2024

jedav commented Oct 31, 2024

jedav commented Nov 1, 2024

jedav commented Nov 1, 2024 •

edited

Loading

ojwb commented Nov 5, 2024

jedav commented Nov 7, 2024

ojwb commented Nov 19, 2024

jedav commented Nov 20, 2024

[rust] Skip unnecessary calls to find_among #202

[rust] Skip unnecessary calls to find_among #202

Conversation

jedav commented Oct 30, 2024

ojwb commented Oct 31, 2024

jedav commented Oct 31, 2024

jedav commented Nov 1, 2024

jedav commented Nov 1, 2024 • edited Loading

ojwb commented Nov 5, 2024

jedav commented Nov 7, 2024

ojwb commented Nov 19, 2024

jedav commented Nov 20, 2024

jedav commented Nov 1, 2024 •

edited

Loading