-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speedup Uint::shr
#753
base: master
Are you sure you want to change the base?
Speedup Uint::shr
#753
Conversation
@tarcieri, can you think of ways to better integrate the (helper) functions I wrote? This might improve the quality of the code base and extend the performance speed up to other Also, once you're more/less on board with this PR, I'd be happy to code up the same improvement for |
35e4f29
to
a0afa4b
Compare
201733e
to
1579e18
Compare
Also, with this implementation in place, it becomes painfully apparent that We could rename the current implementation as pub const fn shr(self, shift: u32) -> Self {
assert!(shift < Self::BITS);
self.carrying_shr(shift).0
} This would not change the interface of |
@erik-3milabs that's unfortunate to hear, can you open a separate issue? It sounds like a potential security vulnerability |
src/uint/shr.rs
Outdated
let mut result = *self; | ||
let mut i = 0; | ||
|
||
// Speed up shifting upto 8 limbs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much does this part contribute to the overall speedup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good question. I'd expect this to improve the performance of shr
for large Uint
instances. However, my benchmark are failing to provide clear evidence of this effect.
I generalized the full_limb_shr
over the WINDOW_SIZE
(hardcoded to 4 in this PR); I made sure to completely remove the speed-up for WINDOWS_SIZE=0
.
Below are the (condensed) raw results for running shr
with a certain WINDOW_SIZE
. There does seem to be some merit to the idea: for U2048
, U8192
and U16384
the running time drops significantly. But the drop is not consistent (e.g., WS=2 is larger than WS=1 for U2048, U8192 and U16384, no impact on U4096).
I understand that I need more consistent performance to make this speedup actually stick 🤔
right shift/U256/
WINDOW_SIZE: 0 time: [15.822 ns 15.831 ns 15.840 ns]
WINDOW_SIZE: 1 time: [15.534 ns 15.544 ns 15.554 ns]
WINDOW_SIZE: 2 time: [15.888 ns 15.894 ns 15.900 ns]
WINDOW_SIZE: 3 time: [15.886 ns 15.893 ns 15.900 ns]
WINDOW_SIZE: 4 time: [15.890 ns 15.897 ns 15.904 ns]
WINDOW_SIZE: 5 time: [15.936 ns 15.946 ns 15.956 ns]
right shift/U512/
WINDOW_SIZE: 0 time: [32.916 ns 32.938 ns 32.961 ns]
WINDOW_SIZE: 1 time: [30.961 ns 30.974 ns 30.988 ns]
WINDOW_SIZE: 2 time: [32.949 ns 32.962 ns 32.974 ns]
WINDOW_SIZE: 3 time: [34.661 ns 35.130 ns 35.783 ns]
WINDOW_SIZE: 4 time: [34.262 ns 34.350 ns 34.443 ns]
WINDOW_SIZE: 5 time: [35.019 ns 35.121 ns 35.231 ns]
right shift/U1024/
WINDOW_SIZE: 0 time: [51.829 ns 51.847 ns 51.864 ns]
WINDOW_SIZE: 1 time: [51.862 ns 51.885 ns 51.910 ns]
WINDOW_SIZE: 2 time: [51.422 ns 51.435 ns 51.448 ns]
WINDOW_SIZE: 3 time: [50.870 ns 50.891 ns 50.914 ns]
WINDOW_SIZE: 4 time: [50.867 ns 50.883 ns 50.898 ns]
WINDOW_SIZE: 5 time: [50.820 ns 50.833 ns 50.846 ns]
right shift/U2048/
WINDOW_SIZE: 0 time: [167.11 ns 167.17 ns 167.23 ns]
WINDOW_SIZE: 1 time: [147.72 ns 147.78 ns 147.82 ns]
WINDOW_SIZE: 2 time: [173.59 ns 173.68 ns 173.77 ns]
WINDOW_SIZE: 3 time: [114.72 ns 114.77 ns 114.83 ns]
WINDOW_SIZE: 4 time: [112.92 ns 113.12 ns 113.34 ns]
WINDOW_SIZE: 5 time: [114.96 ns 115.03 ns 115.10 ns]
right shift/U4096/
WINDOW_SIZE: 0 time: [236.76 ns 237.06 ns 237.33 ns]
WINDOW_SIZE: 1 time: [237.82 ns 238.08 ns 238.32 ns]
WINDOW_SIZE: 2 time: [234.11 ns 234.41 ns 234.68 ns]
WINDOW_SIZE: 3 time: [232.93 ns 233.20 ns 233.45 ns]
WINDOW_SIZE: 4 time: [234.26 ns 234.50 ns 234.72 ns]
WINDOW_SIZE: 5 time: [230.70 ns 231.10 ns 231.51 ns]
right shift/U8192/
WINDOW_SIZE: 0 time: [831.95 ns 833.59 ns 835.47 ns]
WINDOW_SIZE: 1 time: [788.52 ns 789.06 ns 789.62 ns]
WINDOW_SIZE: 2 time: [847.16 ns 847.81 ns 848.59 ns]
WINDOW_SIZE: 3 time: [777.58 ns 778.05 ns 778.49 ns]
WINDOW_SIZE: 4 time: [698.27 ns 698.73 ns 699.21 ns]
WINDOW_SIZE: 5 time: [623.64 ns 624.28 ns 624.90 ns]
right shift/U16384/
WINDOW_SIZE: 0 time: [1.6641 µs 1.6650 µs 1.6659 µs]
WINDOW_SIZE: 1 time: [1.5565 µs 1.5608 µs 1.5657 µs]
WINDOW_SIZE: 2 time: [1.4930 µs 1.4952 µs 1.4975 µs]
WINDOW_SIZE: 3 time: [1.6327 µs 1.6339 µs 1.6350 µs]
WINDOW_SIZE: 4 time: [1.5341 µs 1.5356 µs 1.5374 µs]
WINDOW_SIZE: 5 time: [1.4262 µs 1.4278 µs 1.4296 µs]
Could you elaborate? Does Rust compiler introduce some checks for |
The only branch I'm aware of is the one for the potential panic condition, however we don't consider conditions that cause panics to be "variable time" even though there is a branch, because the only time the branch is taken will crash the program, so the non-crashing case still runs in constant time. Also, that branch can be elided by the compiler in many conditions, especially as the code is annotated Edit: also, notably it's so aggressively inlined by LLVM that I'm having trouble making Godbolt even generate code for it |
src/limb/shr.rs
Outdated
@@ -72,6 +102,60 @@ mod tests { | |||
assert_eq!(Limb(16) >> 2, Limb(4)); | |||
} | |||
|
|||
#[cfg(target_pointer_width = "64")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's more reliable to use proptest
in such situations instead of a number of hand-written tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you point me to a good source on proptest
s? I'm unfamiliar with the concept but would love to learn about it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can read about it here: https://proptest-rs.github.io/proptest/proptest/index.html
If you look under the toplevel tests/
directory there are several files named .proptest-regressions
. You can look at any of the correspondingly named tests for some proptest examples, e.g. https://github.com/RustCrypto/crypto-bigint/blob/master/tests/boxed_monty_form.rs#L37-L155
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added them. Let me know what you think!
My apologies; perhaps I was too quick to jump to a conclusion here. Taking a step back, a better statement is: Either Why I thought
|
1579e18
to
068b054
Compare
@tarcieri @fjarri The table lists
@fjarri, you asked
To answer that question: I now use this technique for all limb shifts (not just the eight-limb window we had before) in the "fast split" variant. Your answer is answered by observing the difference in the last two columns. I added a footnote to highlight one result I don't understand... 🤔
And the raw data, in case you'd like to look at the variances:
Footnotes
|
@@ -165,7 +165,7 @@ impl<const LIMBS: usize> Uint<LIMBS> { | |||
} | |||
|
|||
/// Borrow the limbs of this [`Uint`] mutably. | |||
pub fn as_limbs_mut(&mut self) -> &mut [Limb; LIMBS] { | |||
pub const fn as_limbs_mut(&mut self) -> &mut [Limb; LIMBS] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
src/uint/shr.rs
Outdated
/// | ||
/// Returns `None` if `shift >= Self::BITS`. | ||
pub const fn split_overflowing_shr(&self, shift: u32) -> ConstCtOption<Self> { | ||
let (intra_limb_shift, limb_shift) = (shift % Limb::BITS, shift / Limb::BITS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI: clippy has an integer_division_remainder_used
lint it would be nice to eventually turn on in this crate, if you can find a solution which doesn't use these operators
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Not sure about the |
That was exactly what I was thinking 😅 @fjarri do you have a better proposal? |
src/uint/shr.rs
Outdated
/// | ||
/// Panics if `shift >= Limb::BITS`. | ||
#[inline(always)] | ||
const fn intra_limb_carrying_shr_internal(&self, shift: u32) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: not being pub
makes this implicitly internal
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@erik-3milabs one option which would let us merge this PR without first bikeshedding the |
/// Split `shift` into `shift % Limb::BITS` (its intra-limb-shift component), and | ||
/// `shift / Limb::BITS` (its limb-shift component). | ||
/// | ||
/// This function achieves this without using a division/remainder operation. | ||
pub(crate) const fn decompose_shift(shift: u32) -> (u32, u32) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that this function is used in both shl.rs
and shr.rs
, is there perhaps a better spot to put this?
I still can't think of anything better than I went ahead and "mirrored" the implementation to replace the existing |
@tarcieri One thing before we merge this: On my machine, the benchmarks are fluctuating a teeny-tiny bit: it looks like Uint::<LIMBS>>::ONE.shr(Uint::<LIMBS>>::BITS/2 + 10) is consistently slower than Uint::<LIMBS>>::ONE.shr(10) The difference is almost nothing (~0.5%) but it is there. |
Hmm, the compiler optimizes out my zero shift 🤔 The zero shift is (partially) optimized out:
|
|
I think this idea just went bust. I benchmarked I'm out of ideas for now. Unless one of you has an idea on preventing this, I propose we close this PR for now. |
Refactor
Uint::shr
, achieving a >60% speed up forU2048
.Benchmarks (run on commit b2dcb20)