Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speedup Uint::shr #753

Draft
wants to merge 21 commits into
base: master
Choose a base branch
from
Draft

Conversation

erik-3milabs
Copy link
Contributor

@erik-3milabs erik-3milabs commented Jan 24, 2025

Refactor Uint::shr, achieving a >60% speed up for U2048.

Benchmarks (run on commit b2dcb20)

right shift/shr, U2048  time:   [292.04 ns 292.16 ns 292.30 ns]
                        change: [-0.2073% -0.0666% +0.0669%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe
right shift/new shr, U2048
                        time:   [114.34 ns 114.48 ns 114.62 ns]
                        change: [+0.8375% +1.0410% +1.2401%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild

@erik-3milabs
Copy link
Contributor Author

erik-3milabs commented Jan 24, 2025

@tarcieri, can you think of ways to better integrate the (helper) functions I wrote? This might improve the quality of the code base and extend the performance speed up to other shr operations.

Also, once you're more/less on board with this PR, I'd be happy to code up the same improvement for shl.

@erik-3milabs erik-3milabs marked this pull request as draft January 24, 2025 16:12
@erik-3milabs erik-3milabs force-pushed the speedup_shr branch 3 times, most recently from 201733e to 1579e18 Compare January 24, 2025 16:33
@erik-3milabs
Copy link
Contributor Author

erik-3milabs commented Jan 24, 2025

Also, with this implementation in place, it becomes painfully apparent that Limb::shr is not constant time, unlike what its name suggests.

We could rename the current implementation as shr_vartime and construct a new shr as follows:

pub const fn shr(self, shift: u32) -> Self {
    assert!(shift < Self::BITS);
    self.carrying_shr(shift).0
}

This would not change the interface of shr, meaning it could fit in 0.6.1

@erik-3milabs erik-3milabs marked this pull request as ready for review January 24, 2025 16:39
@tarcieri
Copy link
Member

it becomes painfully apparent that Limb::shr is not constant time

@erik-3milabs that's unfortunate to hear, can you open a separate issue? It sounds like a potential security vulnerability

src/uint/shr.rs Outdated
let mut result = *self;
let mut i = 0;

// Speed up shifting upto 8 limbs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much does this part contribute to the overall speedup?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good question. I'd expect this to improve the performance of shr for large Uint instances. However, my benchmark are failing to provide clear evidence of this effect.

I generalized the full_limb_shr over the WINDOW_SIZE (hardcoded to 4 in this PR); I made sure to completely remove the speed-up for WINDOWS_SIZE=0.

Below are the (condensed) raw results for running shr with a certain WINDOW_SIZE. There does seem to be some merit to the idea: for U2048, U8192 and U16384 the running time drops significantly. But the drop is not consistent (e.g., WS=2 is larger than WS=1 for U2048, U8192 and U16384, no impact on U4096).

I understand that I need more consistent performance to make this speedup actually stick 🤔

right shift/U256/
WINDOW_SIZE: 0      time:   [15.822 ns 15.831 ns 15.840 ns]
WINDOW_SIZE: 1      time:   [15.534 ns 15.544 ns 15.554 ns]
WINDOW_SIZE: 2      time:   [15.888 ns 15.894 ns 15.900 ns]
WINDOW_SIZE: 3      time:   [15.886 ns 15.893 ns 15.900 ns]
WINDOW_SIZE: 4      time:   [15.890 ns 15.897 ns 15.904 ns]
WINDOW_SIZE: 5      time:   [15.936 ns 15.946 ns 15.956 ns]

right shift/U512/
WINDOW_SIZE: 0      time:   [32.916 ns 32.938 ns 32.961 ns]
WINDOW_SIZE: 1      time:   [30.961 ns 30.974 ns 30.988 ns]
WINDOW_SIZE: 2      time:   [32.949 ns 32.962 ns 32.974 ns]
WINDOW_SIZE: 3      time:   [34.661 ns 35.130 ns 35.783 ns]
WINDOW_SIZE: 4      time:   [34.262 ns 34.350 ns 34.443 ns]
WINDOW_SIZE: 5      time:   [35.019 ns 35.121 ns 35.231 ns]

right shift/U1024/
WINDOW_SIZE: 0     time:   [51.829 ns 51.847 ns 51.864 ns]
WINDOW_SIZE: 1     time:   [51.862 ns 51.885 ns 51.910 ns]
WINDOW_SIZE: 2     time:   [51.422 ns 51.435 ns 51.448 ns]
WINDOW_SIZE: 3     time:   [50.870 ns 50.891 ns 50.914 ns]
WINDOW_SIZE: 4     time:   [50.867 ns 50.883 ns 50.898 ns]
WINDOW_SIZE: 5     time:   [50.820 ns 50.833 ns 50.846 ns]

right shift/U2048/
WINDOW_SIZE: 0     time:   [167.11 ns 167.17 ns 167.23 ns]
WINDOW_SIZE: 1     time:   [147.72 ns 147.78 ns 147.82 ns]
WINDOW_SIZE: 2     time:   [173.59 ns 173.68 ns 173.77 ns]
WINDOW_SIZE: 3     time:   [114.72 ns 114.77 ns 114.83 ns]
WINDOW_SIZE: 4     time:   [112.92 ns 113.12 ns 113.34 ns]
WINDOW_SIZE: 5     time:   [114.96 ns 115.03 ns 115.10 ns]

right shift/U4096/
WINDOW_SIZE: 0     time:   [236.76 ns 237.06 ns 237.33 ns]
WINDOW_SIZE: 1     time:   [237.82 ns 238.08 ns 238.32 ns]
WINDOW_SIZE: 2     time:   [234.11 ns 234.41 ns 234.68 ns]
WINDOW_SIZE: 3     time:   [232.93 ns 233.20 ns 233.45 ns]
WINDOW_SIZE: 4     time:   [234.26 ns 234.50 ns 234.72 ns]
WINDOW_SIZE: 5     time:   [230.70 ns 231.10 ns 231.51 ns]

right shift/U8192/
WINDOW_SIZE: 0     time:   [831.95 ns 833.59 ns 835.47 ns]
WINDOW_SIZE: 1     time:   [788.52 ns 789.06 ns 789.62 ns]
WINDOW_SIZE: 2     time:   [847.16 ns 847.81 ns 848.59 ns]
WINDOW_SIZE: 3     time:   [777.58 ns 778.05 ns 778.49 ns]
WINDOW_SIZE: 4     time:   [698.27 ns 698.73 ns 699.21 ns]
WINDOW_SIZE: 5     time:   [623.64 ns 624.28 ns 624.90 ns]

right shift/U16384/
WINDOW_SIZE: 0    time:   [1.6641 µs 1.6650 µs 1.6659 µs]
WINDOW_SIZE: 1    time:   [1.5565 µs 1.5608 µs 1.5657 µs]
WINDOW_SIZE: 2    time:   [1.4930 µs 1.4952 µs 1.4975 µs]
WINDOW_SIZE: 3    time:   [1.6327 µs 1.6339 µs 1.6350 µs]
WINDOW_SIZE: 4    time:   [1.5341 µs 1.5356 µs 1.5374 µs]
WINDOW_SIZE: 5    time:   [1.4262 µs 1.4278 µs 1.4296 µs]

@fjarri
Copy link
Contributor

fjarri commented Jan 24, 2025

Also, with this implementation in place, it becomes painfully apparent that Limb::shr is not constant time, unlike what its name suggests.

Could you elaborate? Does Rust compiler introduce some checks for >> on primitives?

@tarcieri
Copy link
Member

tarcieri commented Jan 24, 2025

The only branch I'm aware of is the one for the potential panic condition, however we don't consider conditions that cause panics to be "variable time" even though there is a branch, because the only time the branch is taken will crash the program, so the non-crashing case still runs in constant time. Also, that branch can be elided by the compiler in many conditions, especially as the code is annotated #[inline(always)].

Edit: also, notably it's so aggressively inlined by LLVM that I'm having trouble making Godbolt even generate code for it

src/limb/shr.rs Outdated
@@ -72,6 +102,60 @@ mod tests {
assert_eq!(Limb(16) >> 2, Limb(4));
}

#[cfg(target_pointer_width = "64")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's more reliable to use proptest in such situations instead of a number of hand-written tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you point me to a good source on proptests? I'm unfamiliar with the concept but would love to learn about it!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can read about it here: https://proptest-rs.github.io/proptest/proptest/index.html

If you look under the toplevel tests/ directory there are several files named .proptest-regressions. You can look at any of the correspondingly named tests for some proptest examples, e.g. https://github.com/RustCrypto/crypto-bigint/blob/master/tests/boxed_monty_form.rs#L37-L155

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added them. Let me know what you think!

@erik-3milabs
Copy link
Contributor Author

Also, with this implementation in place, it becomes painfully apparent that Limb::shr is not constant time, unlike what its name suggests.

Could you elaborate? Does the Rust compiler introduce some checks for >> on primitives?

My apologies; perhaps I was too quick to jump to a conclusion here. Taking a step back, a better statement is:

Either Limb::shr is vartime, or it is possible to speed up Uint::shr even further.

Why I thought Limb::shr is vartime...

Allow me to elaborate. I jumped to my earlier conclusion because of how Uint::shr has been implemented until now. The "old" implementation makes successive calls to Uint::shr_vartime(1<<i) for i=0, 1, 2, .... and then Uint::selects the result if the ith bit was set in shift. The fact that Uint::shr leverages it as a subroute strongly suggests that Uint::shr_vartime is a very non-constant time algorithm, and that not a single part of it is constant time.

Moreover, the idea to

fn shr(&self, shift: u32) -> Self {
     let (bit_shift, limb_shift) = (shift % Limb::BITS, shift / Limb::BITS);
     self.intra_limb_shr(bit_shift).limb_shr(limb_shift)
}

was so apparent to me that I assumed there must be a good reason this has not been implemented 🙈 After all, first ideas are usually terrible ideas, especially when working on a large repo like this with much more intelligent brains than mine working on them.

... and how we might be able to speed up Uint::shr even further.

I did some benchmarking of my own (I'm not familiar enough with compilers such as Godbolt to use that stuff) and got very stable results when using Limb::shr. This confirms that it might indeed be constant time.

When we use it to replace the Limb::carrying_shr call in Uint::intra_limb_carrying_shr, we save another 44ns when benchmarking on my machine, dropping U2048::shr to ~70ns. Not bad if you ask me! 😄

@erik-3milabs
Copy link
Contributor Author

erik-3milabs commented Jan 27, 2025

@tarcieri @fjarri
I updated the code and improved the benchmark set. Below are my local results.

The table lists

  • vartime: the vartime performance,
  • old: the old shr implementation,
  • split: splitting shift into its "intra limb" component, using the faster Limb::carrying_shr here, and
  • fast split: splitting shift as well as speeding up full limb shifting.

@fjarri, you asked

How much does this part contribute to the overall speedup?

To answer that question: I now use this technique for all limb shifts (not just the eight-limb window we had before) in the "fast split" variant. Your answer is answered by observing the difference in the last two columns.

I added a footnote to highlight one result I don't understand... 🤔

limbs vartime old split fast split
4 2.5372 ns 85.502 ns 8.8319 ns 8.8421 ns
8 4.3408 ns 116.84 ns 13.628 ns 14.009 ns
16 8.0908 ns 164.06 ns 29.176 ns 29.036 ns
32 16.188 ns 290.44 ns 195.14 ns 1 70.491 ns
64 42.217 ns 576.61 ns 141.09 ns 136.71 ns
128 72.712 ns 1.5387 µs 660.17 ns 281.96 ns
256 182.35 ns 2.8131 µs 1.2757 µs 665.62 ns

And the raw data, in case you'd like to look at the variances:

right shift/overflowing_shr_vartime/4
                        time:   [2.3705 ns 2.5372 ns 2.6880 ns]
                        change: [-2.6951% +4.9023% +13.295%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
right shift/overflowing_shr/4
                        time:   [85.384 ns 85.502 ns 85.674 ns]
                        change: [-0.0424% +0.2264% +0.6129%] (p = 0.18 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe
right shift/split_overflowing_shr/4
                        time:   [8.8258 ns 8.8319 ns 8.8377 ns]
                        change: [-1.7688% -1.5376% -1.3147%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
right shift/fast_split_overflowing_shr/4
                        time:   [8.8354 ns 8.8421 ns 8.8487 ns]
                        change: [-0.5675% -0.1341% +0.2004%] (p = 0.55 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
right shift/overflowing_shr_vartime/8
                        time:   [4.1154 ns 4.3408 ns 4.5350 ns]
                        change: [-4.0490% +2.0512% +8.4408%] (p = 0.52 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/8
                        time:   [116.79 ns 116.84 ns 116.90 ns]
                        change: [+0.0388% +0.1941% +0.3382%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  11 (11.00%) high severe
right shift/split_overflowing_shr/8
                        time:   [13.600 ns 13.628 ns 13.654 ns]
                        change: [+1.2040% +1.6944% +2.0893%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
right shift/fast_split_overflowing_shr/8
                        time:   [13.873 ns 14.009 ns 14.155 ns]
                        change: [-2.5360% -1.9567% -1.3958%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
right shift/overflowing_shr_vartime/16
                        time:   [7.6717 ns 8.0908 ns 8.4453 ns]
                        change: [-6.9666% -1.1241% +5.3446%] (p = 0.72 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/16
                        time:   [164.02 ns 164.06 ns 164.12 ns]
                        change: [-0.5762% -0.4384% -0.3061%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe
right shift/split_overflowing_shr/16
                        time:   [29.108 ns 29.176 ns 29.265 ns]
                        change: [-0.2187% +0.2073% +0.6774%] (p = 0.36 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe
right shift/fast_split_overflowing_shr/16
                        time:   [29.011 ns 29.036 ns 29.060 ns]
                        change: [-1.0360% -0.6637% -0.3492%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe
right shift/overflowing_shr_vartime/32
                        time:   [15.419 ns 16.188 ns 16.845 ns]
                        change: [-7.6622% -2.9743% +1.9597%] (p = 0.22 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/32
                        time:   [290.37 ns 290.44 ns 290.53 ns]
                        change: [-2.0446% -1.9147% -1.8026%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low severe
  1 (1.00%) high mild
  6 (6.00%) high severe
right shift/split_overflowing_shr/32
                        time:   [195.10 ns 195.14 ns 195.19 ns]
                        change: [-2.6424% -2.4363% -2.2503%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe
right shift/fast_split_overflowing_shr/32
                        time:   [70.229 ns 70.491 ns 70.755 ns]
                        change: [-9.4592% -9.2123% -8.9452%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  11 (11.00%) high mild
  5 (5.00%) high severe
right shift/overflowing_shr_vartime/64
                        time:   [40.703 ns 42.217 ns 43.531 ns]
                        change: [-5.7245% -2.3106% +1.2632%] (p = 0.22 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
right shift/overflowing_shr/64
                        time:   [576.30 ns 576.61 ns 577.01 ns]
                        change: [+0.1246% +0.2759% +0.4082%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe
right shift/split_overflowing_shr/64
                        time:   [140.89 ns 141.09 ns 141.27 ns]
                        change: [+0.0042% +0.2691% +0.4997%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
right shift/fast_split_overflowing_shr/64
                        time:   [136.50 ns 136.71 ns 136.89 ns]
                        change: [+0.0655% +0.2786% +0.4818%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
right shift/overflowing_shr_vartime/128
                        time:   [69.208 ns 72.712 ns 75.757 ns]
                        change: [-5.4828% -0.2478% +4.8264%] (p = 0.93 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/128
                        time:   [1.5383 µs 1.5387 µs 1.5391 µs]
                        change: [+5.4762% +6.1564% +6.6780%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 20 outliers among 100 measurements (20.00%)
  9 (9.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe
right shift/split_overflowing_shr/128
                        time:   [659.83 ns 660.17 ns 660.51 ns]
                        change: [+4.7803% +4.9904% +5.1604%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe
right shift/fast_split_overflowing_shr/128
                        time:   [279.45 ns 281.96 ns 284.17 ns]
                        change: [-1.7589% -0.9949% -0.1996%] (p = 0.02 < 0.05)
                        Change within noise threshold.
right shift/overflowing_shr_vartime/256
                        time:   [176.04 ns 182.35 ns 187.73 ns]
                        change: [-1.6892% +1.9530% +5.5595%] (p = 0.31 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/256
                        time:   [2.8119 µs 2.8131 µs 2.8144 µs]
                        change: [+1.1322% +1.3257% +1.4943%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe
right shift/split_overflowing_shr/256
                        time:   [1.2748 µs 1.2757 µs 1.2766 µs]
                        change: [-0.9445% -0.7247% -0.5119%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe
right shift/fast_split_overflowing_shr/256
                        time:   [662.83 ns 665.62 ns 668.21 ns]
                        change: [+0.2058% +0.5884% +0.9747%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild

Footnotes

  1. this value concerns me...

@@ -165,7 +165,7 @@ impl<const LIMBS: usize> Uint<LIMBS> {
}

/// Borrow the limbs of this [`Uint`] mutably.
pub fn as_limbs_mut(&mut self) -> &mut [Limb; LIMBS] {
pub const fn as_limbs_mut(&mut self) -> &mut [Limb; LIMBS] {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

src/uint/shr.rs Outdated
///
/// Returns `None` if `shift >= Self::BITS`.
pub const fn split_overflowing_shr(&self, shift: u32) -> ConstCtOption<Self> {
let (intra_limb_shift, limb_shift) = (shift % Limb::BITS, shift / Limb::BITS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: clippy has an integer_division_remainder_used lint it would be nice to eventually turn on in this crate, if you can find a solution which doesn't use these operators

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@tarcieri
Copy link
Member

Not sure about the intra_limb_* naming scheme but I'm not sure I have a better suggestion

@erik-3milabs
Copy link
Contributor Author

Not sure about the intra_limb_* naming scheme but I'm not sure I have a better suggestion

That was exactly what I was thinking 😅 @fjarri do you have a better proposal?

src/uint/shr.rs Outdated
///
/// Panics if `shift >= Limb::BITS`.
#[inline(always)]
const fn intra_limb_carrying_shr_internal(&self, shift: u32) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: not being pub makes this implicitly internal

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@tarcieri
Copy link
Member

@erik-3milabs one option which would let us merge this PR without first bikeshedding the intra_limb_* naming would be to remove pub from those functions for now until we're happy with the name (or give up)

Comment on lines +121 to +125
/// Split `shift` into `shift % Limb::BITS` (its intra-limb-shift component), and
/// `shift / Limb::BITS` (its limb-shift component).
///
/// This function achieves this without using a division/remainder operation.
pub(crate) const fn decompose_shift(shift: u32) -> (u32, u32) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this function is used in both shl.rs and shr.rs, is there perhaps a better spot to put this?

@erik-3milabs
Copy link
Contributor Author

@erik-3milabs one option which would let us merge this PR without first bikeshedding the intra_limb_* naming would be to remove pub from those functions for now until we're happy with the name (or give up)

I still can't think of anything better than intra_limb_*. I made it private for now.

I went ahead and "mirrored" the implementation to replace the existing Uint::shl implementation.

@erik-3milabs
Copy link
Contributor Author

erik-3milabs commented Jan 30, 2025

@tarcieri One thing before we merge this:

On my machine, the benchmarks are fluctuating a teeny-tiny bit: it looks like

Uint::<LIMBS>>::ONE.shr(Uint::<LIMBS>>::BITS/2 + 10)

is consistently slower than

Uint::<LIMBS>>::ONE.shr(10)

The difference is almost nothing (~0.5%) but it is there.
What is the tolerance this library works with?

@erik-3milabs
Copy link
Contributor Author

erik-3milabs commented Jan 30, 2025

Hmm, the compiler optimizes out my zero shift 🤔

The zero shift is (partially) optimized out:

left shift/intra_limb_carrying_shl(0)/64
                        time:   [55.765 ns 56.493 ns 57.146 ns]
                        change: [-1.1124% +0.0880% +1.3286%] (p = 0.88 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
left shift/intra_limb_carrying_shl(1)/64
                        time:   [56.976 ns 57.773 ns 58.464 ns]
                        change: [-1.0465% +0.2521% +1.5104%] (p = 0.70 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
left shift/intra_limb_carrying_shl(Limb::BITS-1)/64
                        time:   [56.803 ns 57.610 ns 58.300 ns]
                        change: [-1.2401% -0.0472% +1.2422%] (p = 0.94 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/intra_limb_carrying_shl(0)/128
                        time:   [103.58 ns 106.12 ns 108.38 ns]
                        change: [-3.7659% -0.9938% +1.7795%] (p = 0.46 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/intra_limb_carrying_shl(1)/128
                        time:   [143.28 ns 145.11 ns 146.72 ns]
                        change: [-1.4626% -0.2726% +0.9269%] (p = 0.66 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
left shift/intra_limb_carrying_shl(Limb::BITS-1)/128
                        time:   [143.67 ns 145.68 ns 147.50 ns]
                        change: [-1.1970% +0.0733% +1.4213%] (p = 0.92 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
left shift/intra_limb_carrying_shl(0)/256
                        time:   [211.32 ns 216.86 ns 221.59 ns]
                        change: [-3.1180% -0.3510% +2.4557%] (p = 0.79 > 0.05)
                        No change in performance detected.
left shift/intra_limb_carrying_shl(1)/256
                        time:   [278.80 ns 282.88 ns 286.51 ns]
                        change: [-1.9677% -0.3737% +1.2130%] (p = 0.64 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
left shift/intra_limb_carrying_shl(Limb::BITS-1)/256
                        time:   [279.81 ns 283.84 ns 287.33 ns]
                        change: [-1.3217% +0.2101% +1.8418%] (p = 0.79 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

@erik-3milabs
Copy link
Contributor Author

erik-3milabs commented Jan 30, 2025

full_limb_overflowing_shl might also not be as constant time as we were hoping. The full_limb_overflowing_shl(LIMBS-1) is always a (significant) outlier. How does that work, though? 🤔

left shift/full_limb_overflowing_shl(0)/64
                        time:   [133.76 ns 134.28 ns 134.75 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
left shift/full_limb_overflowing_shl(1)/64
                        time:   [132.82 ns 133.30 ns 133.74 ns]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
left shift/full_limb_overflowing_shl(5)/64
                        time:   [133.07 ns 133.52 ns 133.96 ns]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
left shift/full_limb_overflowing_shl(LIMBS-1)/64
                        time:   [134.69 ns 135.09 ns 135.48 ns]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
left shift/full_limb_overflowing_shl(0)/128
                        time:   [269.71 ns 271.83 ns 273.78 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/full_limb_overflowing_shl(1)/128
                        time:   [269.01 ns 270.96 ns 272.70 ns]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild
left shift/full_limb_overflowing_shl(5)/128
                        time:   [269.63 ns 271.79 ns 273.74 ns]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
left shift/full_limb_overflowing_shl(LIMBS-1)/128
                        time:   [275.54 ns 277.45 ns 279.22 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/full_limb_overflowing_shl(0)/256
                        time:   [636.37 ns 638.88 ns 641.05 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
left shift/full_limb_overflowing_shl(1)/256
                        time:   [635.93 ns 638.91 ns 641.63 ns]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild
left shift/full_limb_overflowing_shl(5)/256
                        time:   [636.87 ns 639.65 ns 642.13 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/full_limb_overflowing_shl(LIMBS-1)/256
                        time:   [647.82 ns 650.69 ns 653.35 ns]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

@erik-3milabs
Copy link
Contributor Author

I think this idea just went bust.

I benchmarked intra_limb_carrying_shl some more. It has become evident that this function does not execute in constant time. Perhaps it is the compiler, perhaps the CPU. Either way: it makes this code unusable.

I'm out of ideas for now. Unless one of you has an idea on preventing this, I propose we close this PR for now.

@erik-3milabs erik-3milabs marked this pull request as draft January 31, 2025 11:30
@erik-3milabs erik-3milabs mentioned this pull request Jan 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants