Speedup `Uint::shr` #753

erik-3milabs · 2025-01-24T15:40:33Z

Refactor Uint::shr, achieving a >60% speed up for U2048.

Benchmarks (run on commit `b2dcb20`)

right shift/shr, U2048  time:   [292.04 ns 292.16 ns 292.30 ns]
                        change: [-0.2073% -0.0666% +0.0669%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe
right shift/new shr, U2048
                        time:   [114.34 ns 114.48 ns 114.62 ns]
                        change: [+0.8375% +1.0410% +1.2401%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild

erik-3milabs · 2025-01-24T15:43:00Z

@tarcieri, can you think of ways to better integrate the (helper) functions I wrote? This might improve the quality of the code base and extend the performance speed up to other shr operations.

Also, once you're more/less on board with this PR, I'd be happy to code up the same improvement for shl.

erik-3milabs · 2025-01-24T16:36:06Z

Also, with this implementation in place, it becomes painfully apparent that Limb::shr is not constant time, unlike what its name suggests.

We could rename the current implementation as shr_vartime and construct a new shr as follows:

pub const fn shr(self, shift: u32) -> Self {
    assert!(shift < Self::BITS);
    self.carrying_shr(shift).0
}

This would not change the interface of shr, meaning it could fit in 0.6.1

tarcieri · 2025-01-24T17:10:58Z

it becomes painfully apparent that Limb::shr is not constant time

@erik-3milabs that's unfortunate to hear, can you open a separate issue? It sounds like a potential security vulnerability

fjarri · 2025-01-24T19:12:37Z

src/uint/shr.rs

        let mut result = *self;
        let mut i = 0;
+
+        // Speed up shifting upto 8 limbs.


How much does this part contribute to the overall speedup?

Hmm, good question. I'd expect this to improve the performance of shr for large Uint instances. However, my benchmark are failing to provide clear evidence of this effect.

I generalized the full_limb_shr over the WINDOW_SIZE (hardcoded to 4 in this PR); I made sure to completely remove the speed-up for WINDOWS_SIZE=0.

Below are the (condensed) raw results for running shr with a certain WINDOW_SIZE. There does seem to be some merit to the idea: for U2048, U8192 and U16384 the running time drops significantly. But the drop is not consistent (e.g., WS=2 is larger than WS=1 for U2048, U8192 and U16384, no impact on U4096).

I understand that I need more consistent performance to make this speedup actually stick 🤔

right shift/U256/ WINDOW_SIZE: 0 time: [15.822 ns 15.831 ns 15.840 ns] WINDOW_SIZE: 1 time: [15.534 ns 15.544 ns 15.554 ns] WINDOW_SIZE: 2 time: [15.888 ns 15.894 ns 15.900 ns] WINDOW_SIZE: 3 time: [15.886 ns 15.893 ns 15.900 ns] WINDOW_SIZE: 4 time: [15.890 ns 15.897 ns 15.904 ns] WINDOW_SIZE: 5 time: [15.936 ns 15.946 ns 15.956 ns] right shift/U512/ WINDOW_SIZE: 0 time: [32.916 ns 32.938 ns 32.961 ns] WINDOW_SIZE: 1 time: [30.961 ns 30.974 ns 30.988 ns] WINDOW_SIZE: 2 time: [32.949 ns 32.962 ns 32.974 ns] WINDOW_SIZE: 3 time: [34.661 ns 35.130 ns 35.783 ns] WINDOW_SIZE: 4 time: [34.262 ns 34.350 ns 34.443 ns] WINDOW_SIZE: 5 time: [35.019 ns 35.121 ns 35.231 ns] right shift/U1024/ WINDOW_SIZE: 0 time: [51.829 ns 51.847 ns 51.864 ns] WINDOW_SIZE: 1 time: [51.862 ns 51.885 ns 51.910 ns] WINDOW_SIZE: 2 time: [51.422 ns 51.435 ns 51.448 ns] WINDOW_SIZE: 3 time: [50.870 ns 50.891 ns 50.914 ns] WINDOW_SIZE: 4 time: [50.867 ns 50.883 ns 50.898 ns] WINDOW_SIZE: 5 time: [50.820 ns 50.833 ns 50.846 ns] right shift/U2048/ WINDOW_SIZE: 0 time: [167.11 ns 167.17 ns 167.23 ns] WINDOW_SIZE: 1 time: [147.72 ns 147.78 ns 147.82 ns] WINDOW_SIZE: 2 time: [173.59 ns 173.68 ns 173.77 ns] WINDOW_SIZE: 3 time: [114.72 ns 114.77 ns 114.83 ns] WINDOW_SIZE: 4 time: [112.92 ns 113.12 ns 113.34 ns] WINDOW_SIZE: 5 time: [114.96 ns 115.03 ns 115.10 ns] right shift/U4096/ WINDOW_SIZE: 0 time: [236.76 ns 237.06 ns 237.33 ns] WINDOW_SIZE: 1 time: [237.82 ns 238.08 ns 238.32 ns] WINDOW_SIZE: 2 time: [234.11 ns 234.41 ns 234.68 ns] WINDOW_SIZE: 3 time: [232.93 ns 233.20 ns 233.45 ns] WINDOW_SIZE: 4 time: [234.26 ns 234.50 ns 234.72 ns] WINDOW_SIZE: 5 time: [230.70 ns 231.10 ns 231.51 ns] right shift/U8192/ WINDOW_SIZE: 0 time: [831.95 ns 833.59 ns 835.47 ns] WINDOW_SIZE: 1 time: [788.52 ns 789.06 ns 789.62 ns] WINDOW_SIZE: 2 time: [847.16 ns 847.81 ns 848.59 ns] WINDOW_SIZE: 3 time: [777.58 ns 778.05 ns 778.49 ns] WINDOW_SIZE: 4 time: [698.27 ns 698.73 ns 699.21 ns] WINDOW_SIZE: 5 time: [623.64 ns 624.28 ns 624.90 ns] right shift/U16384/ WINDOW_SIZE: 0 time: [1.6641 µs 1.6650 µs 1.6659 µs] WINDOW_SIZE: 1 time: [1.5565 µs 1.5608 µs 1.5657 µs] WINDOW_SIZE: 2 time: [1.4930 µs 1.4952 µs 1.4975 µs] WINDOW_SIZE: 3 time: [1.6327 µs 1.6339 µs 1.6350 µs] WINDOW_SIZE: 4 time: [1.5341 µs 1.5356 µs 1.5374 µs] WINDOW_SIZE: 5 time: [1.4262 µs 1.4278 µs 1.4296 µs]

fjarri · 2025-01-24T19:13:42Z

Also, with this implementation in place, it becomes painfully apparent that Limb::shr is not constant time, unlike what its name suggests.

Could you elaborate? Does Rust compiler introduce some checks for >> on primitives?

tarcieri · 2025-01-24T19:16:19Z

The only branch I'm aware of is the one for the potential panic condition, however we don't consider conditions that cause panics to be "variable time" even though there is a branch, because the only time the branch is taken will crash the program, so the non-crashing case still runs in constant time. Also, that branch can be elided by the compiler in many conditions, especially as the code is annotated #[inline(always)].

Edit: also, notably it's so aggressively inlined by LLVM that I'm having trouble making Godbolt even generate code for it

fjarri · 2025-01-24T19:16:43Z

src/limb/shr.rs

@@ -72,6 +102,60 @@ mod tests {
        assert_eq!(Limb(16) >> 2, Limb(4));
    }

+    #[cfg(target_pointer_width = "64")]


I wonder if it's more reliable to use proptest in such situations instead of a number of hand-written tests.

Could you point me to a good source on proptests? I'm unfamiliar with the concept but would love to learn about it!

You can read about it here: https://proptest-rs.github.io/proptest/proptest/index.html

If you look under the toplevel tests/ directory there are several files named .proptest-regressions. You can look at any of the correspondingly named tests for some proptest examples, e.g. https://github.com/RustCrypto/crypto-bigint/blob/master/tests/boxed_monty_form.rs#L37-L155

I added them. Let me know what you think!

erik-3milabs · 2025-01-24T21:20:22Z

Also, with this implementation in place, it becomes painfully apparent that Limb::shr is not constant time, unlike what its name suggests.

Could you elaborate? Does the Rust compiler introduce some checks for >> on primitives?

My apologies; perhaps I was too quick to jump to a conclusion here. Taking a step back, a better statement is:

Either Limb::shr is vartime, or it is possible to speed up Uint::shr even further.

Why I thought `Limb::shr` is vartime...

Allow me to elaborate. I jumped to my earlier conclusion because of how Uint::shr has been implemented until now. The "old" implementation makes successive calls to Uint::shr_vartime(1<<i) for i=0, 1, 2, .... and then Uint::selects the result if the ith bit was set in shift. The fact that Uint::shr leverages it as a subroute strongly suggests that Uint::shr_vartime is a very non-constant time algorithm, and that not a single part of it is constant time.

Moreover, the idea to

fn shr(&self, shift: u32) -> Self {
     let (bit_shift, limb_shift) = (shift % Limb::BITS, shift / Limb::BITS);
     self.intra_limb_shr(bit_shift).limb_shr(limb_shift)
}

was so apparent to me that I assumed there must be a good reason this has not been implemented 🙈 After all, first ideas are usually terrible ideas, especially when working on a large repo like this with much more intelligent brains than mine working on them.

... and how we might be able to speed up `Uint::shr` even further.

I did some benchmarking of my own (I'm not familiar enough with compilers such as Godbolt to use that stuff) and got very stable results when using Limb::shr. This confirms that it might indeed be constant time.

When we use it to replace the Limb::carrying_shr call in Uint::intra_limb_carrying_shr, we save another 44ns when benchmarking on my machine, dropping U2048::shr to ~70ns. Not bad if you ask me! 😄

erik-3milabs · 2025-01-27T17:46:29Z

@tarcieri @fjarri
I updated the code and improved the benchmark set. Below are my local results.

The table lists

vartime: the vartime performance,
old: the old shr implementation,
split: splitting shift into its "intra limb" component, using the faster Limb::carrying_shr here, and
fast split: splitting shift as well as speeding up full limb shifting.

@fjarri, you asked

How much does this part contribute to the overall speedup?

To answer that question: I now use this technique for all limb shifts (not just the eight-limb window we had before) in the "fast split" variant. Your answer is answered by observing the difference in the last two columns.

I added a footnote to highlight one result I don't understand... 🤔

limbs	vartime	old	split	fast split
4	2.5372 ns	85.502 ns	8.8319 ns	8.8421 ns
8	4.3408 ns	116.84 ns	13.628 ns	14.009 ns
16	8.0908 ns	164.06 ns	29.176 ns	29.036 ns
32	16.188 ns	290.44 ns	195.14 ns ¹	70.491 ns
64	42.217 ns	576.61 ns	141.09 ns	136.71 ns
128	72.712 ns	1.5387 µs	660.17 ns	281.96 ns
256	182.35 ns	2.8131 µs	1.2757 µs	665.62 ns

And the raw data, in case you'd like to look at the variances:

right shift/overflowing_shr_vartime/4
                        time:   [2.3705 ns 2.5372 ns 2.6880 ns]
                        change: [-2.6951% +4.9023% +13.295%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
right shift/overflowing_shr/4
                        time:   [85.384 ns 85.502 ns 85.674 ns]
                        change: [-0.0424% +0.2264% +0.6129%] (p = 0.18 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe
right shift/split_overflowing_shr/4
                        time:   [8.8258 ns 8.8319 ns 8.8377 ns]
                        change: [-1.7688% -1.5376% -1.3147%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
right shift/fast_split_overflowing_shr/4
                        time:   [8.8354 ns 8.8421 ns 8.8487 ns]
                        change: [-0.5675% -0.1341% +0.2004%] (p = 0.55 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe
right shift/overflowing_shr_vartime/8
                        time:   [4.1154 ns 4.3408 ns 4.5350 ns]
                        change: [-4.0490% +2.0512% +8.4408%] (p = 0.52 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/8
                        time:   [116.79 ns 116.84 ns 116.90 ns]
                        change: [+0.0388% +0.1941% +0.3382%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  11 (11.00%) high severe
right shift/split_overflowing_shr/8
                        time:   [13.600 ns 13.628 ns 13.654 ns]
                        change: [+1.2040% +1.6944% +2.0893%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
right shift/fast_split_overflowing_shr/8
                        time:   [13.873 ns 14.009 ns 14.155 ns]
                        change: [-2.5360% -1.9567% -1.3958%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe
right shift/overflowing_shr_vartime/16
                        time:   [7.6717 ns 8.0908 ns 8.4453 ns]
                        change: [-6.9666% -1.1241% +5.3446%] (p = 0.72 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/16
                        time:   [164.02 ns 164.06 ns 164.12 ns]
                        change: [-0.5762% -0.4384% -0.3061%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild
  6 (6.00%) high severe
right shift/split_overflowing_shr/16
                        time:   [29.108 ns 29.176 ns 29.265 ns]
                        change: [-0.2187% +0.2073% +0.6774%] (p = 0.36 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe
right shift/fast_split_overflowing_shr/16
                        time:   [29.011 ns 29.036 ns 29.060 ns]
                        change: [-1.0360% -0.6637% -0.3492%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe
right shift/overflowing_shr_vartime/32
                        time:   [15.419 ns 16.188 ns 16.845 ns]
                        change: [-7.6622% -2.9743% +1.9597%] (p = 0.22 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/32
                        time:   [290.37 ns 290.44 ns 290.53 ns]
                        change: [-2.0446% -1.9147% -1.8026%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low severe
  1 (1.00%) high mild
  6 (6.00%) high severe
right shift/split_overflowing_shr/32
                        time:   [195.10 ns 195.14 ns 195.19 ns]
                        change: [-2.6424% -2.4363% -2.2503%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe
right shift/fast_split_overflowing_shr/32
                        time:   [70.229 ns 70.491 ns 70.755 ns]
                        change: [-9.4592% -9.2123% -8.9452%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  11 (11.00%) high mild
  5 (5.00%) high severe
right shift/overflowing_shr_vartime/64
                        time:   [40.703 ns 42.217 ns 43.531 ns]
                        change: [-5.7245% -2.3106% +1.2632%] (p = 0.22 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
right shift/overflowing_shr/64
                        time:   [576.30 ns 576.61 ns 577.01 ns]
                        change: [+0.1246% +0.2759% +0.4082%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  5 (5.00%) high severe
right shift/split_overflowing_shr/64
                        time:   [140.89 ns 141.09 ns 141.27 ns]
                        change: [+0.0042% +0.2691% +0.4997%] (p = 0.04 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
right shift/fast_split_overflowing_shr/64
                        time:   [136.50 ns 136.71 ns 136.89 ns]
                        change: [+0.0655% +0.2786% +0.4818%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
right shift/overflowing_shr_vartime/128
                        time:   [69.208 ns 72.712 ns 75.757 ns]
                        change: [-5.4828% -0.2478% +4.8264%] (p = 0.93 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/128
                        time:   [1.5383 µs 1.5387 µs 1.5391 µs]
                        change: [+5.4762% +6.1564% +6.6780%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 20 outliers among 100 measurements (20.00%)
  9 (9.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe
right shift/split_overflowing_shr/128
                        time:   [659.83 ns 660.17 ns 660.51 ns]
                        change: [+4.7803% +4.9904% +5.1604%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  2 (2.00%) high severe
right shift/fast_split_overflowing_shr/128
                        time:   [279.45 ns 281.96 ns 284.17 ns]
                        change: [-1.7589% -0.9949% -0.1996%] (p = 0.02 < 0.05)
                        Change within noise threshold.
right shift/overflowing_shr_vartime/256
                        time:   [176.04 ns 182.35 ns 187.73 ns]
                        change: [-1.6892% +1.9530% +5.5595%] (p = 0.31 > 0.05)
                        No change in performance detected.
right shift/overflowing_shr/256
                        time:   [2.8119 µs 2.8131 µs 2.8144 µs]
                        change: [+1.1322% +1.3257% +1.4943%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) low severe
  2 (2.00%) low mild
  1 (1.00%) high mild
  4 (4.00%) high severe
right shift/split_overflowing_shr/256
                        time:   [1.2748 µs 1.2757 µs 1.2766 µs]
                        change: [-0.9445% -0.7247% -0.5119%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  5 (5.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe
right shift/fast_split_overflowing_shr/256
                        time:   [662.83 ns 665.62 ns 668.21 ns]
                        change: [+0.2058% +0.5884% +0.9747%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) low mild
  2 (2.00%) high mild

this value concerns me... ↩

tarcieri · 2025-01-28T23:25:12Z

src/uint.rs

@@ -165,7 +165,7 @@ impl<const LIMBS: usize> Uint<LIMBS> {
    }

    /// Borrow the limbs of this [`Uint`] mutably.
-    pub fn as_limbs_mut(&mut self) -> &mut [Limb; LIMBS] {
+    pub const fn as_limbs_mut(&mut self) -> &mut [Limb; LIMBS] {


tarcieri · 2025-01-28T23:59:58Z

src/uint/shr.rs

+    ///
+    /// Returns `None` if `shift >= Self::BITS`.
+    pub const fn split_overflowing_shr(&self, shift: u32) -> ConstCtOption<Self> {
+        let (intra_limb_shift, limb_shift) = (shift % Limb::BITS, shift / Limb::BITS);


FYI: clippy has an integer_division_remainder_used lint it would be nice to eventually turn on in this crate, if you can find a solution which doesn't use these operators

tarcieri · 2025-01-29T00:01:10Z

Not sure about the intra_limb_* naming scheme but I'm not sure I have a better suggestion

erik-3milabs · 2025-01-29T15:51:55Z

Not sure about the intra_limb_* naming scheme but I'm not sure I have a better suggestion

That was exactly what I was thinking 😅 @fjarri do you have a better proposal?

tarcieri · 2025-01-29T15:54:04Z

src/uint/shr.rs

+    ///
+    /// Panics if `shift >= Limb::BITS`.
+    #[inline(always)]
+    const fn intra_limb_carrying_shr_internal(&self, shift: u32) -> Self {


Nit: not being pub makes this implicitly internal

tarcieri · 2025-01-29T16:03:29Z

@erik-3milabs one option which would let us merge this PR without first bikeshedding the intra_limb_* naming would be to remove pub from those functions for now until we're happy with the name (or give up)

erik-3milabs · 2025-01-30T11:13:20Z

src/uint/shl.rs

+    /// Split `shift` into `shift % Limb::BITS` (its intra-limb-shift component), and
+    /// `shift / Limb::BITS` (its limb-shift component).
+    ///
+    /// This function achieves this without using a division/remainder operation.
+    pub(crate) const fn decompose_shift(shift: u32) -> (u32, u32) {


Given that this function is used in both shl.rs and shr.rs, is there perhaps a better spot to put this?

erik-3milabs · 2025-01-30T11:16:20Z

@erik-3milabs one option which would let us merge this PR without first bikeshedding the intra_limb_* naming would be to remove pub from those functions for now until we're happy with the name (or give up)

I still can't think of anything better than intra_limb_*. I made it private for now.

I went ahead and "mirrored" the implementation to replace the existing Uint::shl implementation.

erik-3milabs · 2025-01-30T11:23:43Z

@tarcieri One thing before we merge this:

On my machine, the benchmarks are fluctuating a teeny-tiny bit: it looks like

Uint::<LIMBS>>::ONE.shr(Uint::<LIMBS>>::BITS/2 + 10)

is consistently slower than

Uint::<LIMBS>>::ONE.shr(10)

The difference is almost nothing (~0.5%) but it is there.
What is the tolerance this library works with?

erik-3milabs · 2025-01-30T11:48:21Z

Hmm, the compiler optimizes out my zero shift 🤔

The zero shift is (partially) optimized out:

left shift/intra_limb_carrying_shl(0)/64
                        time:   [55.765 ns 56.493 ns 57.146 ns]
                        change: [-1.1124% +0.0880% +1.3286%] (p = 0.88 > 0.05)
                        No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
left shift/intra_limb_carrying_shl(1)/64
                        time:   [56.976 ns 57.773 ns 58.464 ns]
                        change: [-1.0465% +0.2521% +1.5104%] (p = 0.70 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
left shift/intra_limb_carrying_shl(Limb::BITS-1)/64
                        time:   [56.803 ns 57.610 ns 58.300 ns]
                        change: [-1.2401% -0.0472% +1.2422%] (p = 0.94 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/intra_limb_carrying_shl(0)/128
                        time:   [103.58 ns 106.12 ns 108.38 ns]
                        change: [-3.7659% -0.9938% +1.7795%] (p = 0.46 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/intra_limb_carrying_shl(1)/128
                        time:   [143.28 ns 145.11 ns 146.72 ns]
                        change: [-1.4626% -0.2726% +0.9269%] (p = 0.66 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
left shift/intra_limb_carrying_shl(Limb::BITS-1)/128
                        time:   [143.67 ns 145.68 ns 147.50 ns]
                        change: [-1.1970% +0.0733% +1.4213%] (p = 0.92 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
left shift/intra_limb_carrying_shl(0)/256
                        time:   [211.32 ns 216.86 ns 221.59 ns]
                        change: [-3.1180% -0.3510% +2.4557%] (p = 0.79 > 0.05)
                        No change in performance detected.
left shift/intra_limb_carrying_shl(1)/256
                        time:   [278.80 ns 282.88 ns 286.51 ns]
                        change: [-1.9677% -0.3737% +1.2130%] (p = 0.64 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
left shift/intra_limb_carrying_shl(Limb::BITS-1)/256
                        time:   [279.81 ns 283.84 ns 287.33 ns]
                        change: [-1.3217% +0.2101% +1.8418%] (p = 0.79 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

erik-3milabs · 2025-01-30T12:09:27Z

full_limb_overflowing_shl might also not be as constant time as we were hoping. The full_limb_overflowing_shl(LIMBS-1) is always a (significant) outlier. How does that work, though? 🤔

left shift/full_limb_overflowing_shl(0)/64
                        time:   [133.76 ns 134.28 ns 134.75 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
left shift/full_limb_overflowing_shl(1)/64
                        time:   [132.82 ns 133.30 ns 133.74 ns]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
left shift/full_limb_overflowing_shl(5)/64
                        time:   [133.07 ns 133.52 ns 133.96 ns]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
left shift/full_limb_overflowing_shl(LIMBS-1)/64
                        time:   [134.69 ns 135.09 ns 135.48 ns]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
left shift/full_limb_overflowing_shl(0)/128
                        time:   [269.71 ns 271.83 ns 273.78 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/full_limb_overflowing_shl(1)/128
                        time:   [269.01 ns 270.96 ns 272.70 ns]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild
left shift/full_limb_overflowing_shl(5)/128
                        time:   [269.63 ns 271.79 ns 273.74 ns]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
left shift/full_limb_overflowing_shl(LIMBS-1)/128
                        time:   [275.54 ns 277.45 ns 279.22 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/full_limb_overflowing_shl(0)/256
                        time:   [636.37 ns 638.88 ns 641.05 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
left shift/full_limb_overflowing_shl(1)/256
                        time:   [635.93 ns 638.91 ns 641.63 ns]
Found 8 outliers among 100 measurements (8.00%)
  8 (8.00%) high mild
left shift/full_limb_overflowing_shl(5)/256
                        time:   [636.87 ns 639.65 ns 642.13 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
left shift/full_limb_overflowing_shl(LIMBS-1)/256
                        time:   [647.82 ns 650.69 ns 653.35 ns]
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

erik-3milabs · 2025-01-31T11:22:35Z

I think this idea just went bust.

I benchmarked intra_limb_carrying_shl some more. It has become evident that this function does not execute in constant time. Perhaps it is the compiler, perhaps the CPU. Either way: it makes this code unusable.

I'm out of ideas for now. Unless one of you has an idea on preventing this, I propose we close this PR for now.

erik-3milabs force-pushed the speedup_shr branch from 35e4f29 to a0afa4b Compare January 24, 2025 15:51

erik-3milabs marked this pull request as draft January 24, 2025 16:12

erik-3milabs force-pushed the speedup_shr branch 3 times, most recently from 201733e to 1579e18 Compare January 24, 2025 16:33

erik-3milabs marked this pull request as ready for review January 24, 2025 16:39

fjarri reviewed Jan 24, 2025

View reviewed changes

erik-3milabs added 4 commits January 27, 2025 17:46

Make Uint::as_limbs_mut const

509c51c

Implement Limb::carrying_shr

3b78bd5

Implement Uint::split_overflowing_shr

16f59fc

Introduce Uint::fast_split_overflowing_shr

068b054

erik-3milabs force-pushed the speedup_shr branch from 1579e18 to 068b054 Compare January 27, 2025 17:40

Fix benches

dd28ce4

tarcieri reviewed Jan 28, 2025

View reviewed changes

erik-3milabs added 2 commits January 29, 2025 16:47

Remove integer division/remainder from split_overflowing_shr

a5dbc74

Add annotations

6858e22

tarcieri reviewed Jan 29, 2025

View reviewed changes

tarcieri approved these changes Jan 29, 2025

View reviewed changes

erik-3milabs added 13 commits January 30, 2025 10:38

Replace Uint::shr with the new implementation.

fb15f24

Fix Uint::intra_limb_shr naming

88f1972

Fix Uint::full_limb_shr naming

d24b299

Update Uint::shr benchmarking

a49f889

Implement Limb::carrying_shl

271eadb

Speed up Uint::shl

3d7c87e

Remove duplicate shr/shl code

6949c33

Remove duplicate Uint::shl code

81de22d

Deprecate Limb::shl1

d438508

Remove duplicate Uint::shr code

aeb0f73

Deprecate Limb::shr1

b138396

Fix fmt

fb01e75

Expand Uint::shl benchmarking

ada14bc

erik-3milabs commented Jan 30, 2025

View reviewed changes

Fix fmt

3a1eb7a

erik-3milabs mentioned this pull request Jan 31, 2025

Make Uint::as_limbs_mut const #757

Merged

erik-3milabs marked this pull request as draft January 31, 2025 11:30

erik-3milabs mentioned this pull request Jan 31, 2025

Binary GCD #755

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speedup `Uint::shr` #753

Speedup `Uint::shr` #753

erik-3milabs commented Jan 24, 2025 •

edited

Loading

erik-3milabs commented Jan 24, 2025 •

edited

Loading

erik-3milabs commented Jan 24, 2025 •

edited

Loading

tarcieri commented Jan 24, 2025

fjarri Jan 24, 2025

erik-3milabs Jan 24, 2025

fjarri commented Jan 24, 2025

tarcieri commented Jan 24, 2025 •

edited

Loading

fjarri Jan 24, 2025

erik-3milabs Jan 24, 2025

tarcieri Jan 24, 2025

erik-3milabs Jan 27, 2025

erik-3milabs commented Jan 24, 2025

erik-3milabs commented Jan 27, 2025 •

edited

Loading

tarcieri Jan 28, 2025

tarcieri Jan 28, 2025

erik-3milabs Jan 29, 2025

tarcieri commented Jan 29, 2025

erik-3milabs commented Jan 29, 2025

tarcieri Jan 29, 2025

erik-3milabs Jan 30, 2025

tarcieri commented Jan 29, 2025

erik-3milabs Jan 30, 2025

erik-3milabs commented Jan 30, 2025

erik-3milabs commented Jan 30, 2025 •

edited

Loading

erik-3milabs commented Jan 30, 2025 •

edited

Loading

erik-3milabs commented Jan 30, 2025 •

edited

Loading

erik-3milabs commented Jan 31, 2025

Speedup Uint::shr #753

Are you sure you want to change the base?

Speedup Uint::shr #753

Conversation

erik-3milabs commented Jan 24, 2025 • edited Loading

Benchmarks (run on commit b2dcb20)

erik-3milabs commented Jan 24, 2025 • edited Loading

erik-3milabs commented Jan 24, 2025 • edited Loading

tarcieri commented Jan 24, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fjarri commented Jan 24, 2025

tarcieri commented Jan 24, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

erik-3milabs commented Jan 24, 2025

Why I thought Limb::shr is vartime...

... and how we might be able to speed up Uint::shr even further.

erik-3milabs commented Jan 27, 2025 • edited Loading

Footnotes

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tarcieri commented Jan 29, 2025

erik-3milabs commented Jan 29, 2025

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tarcieri commented Jan 29, 2025

Choose a reason for hiding this comment

erik-3milabs commented Jan 30, 2025

erik-3milabs commented Jan 30, 2025 • edited Loading

erik-3milabs commented Jan 30, 2025 • edited Loading

erik-3milabs commented Jan 30, 2025 • edited Loading

erik-3milabs commented Jan 31, 2025

Speedup `Uint::shr` #753

Speedup `Uint::shr` #753

erik-3milabs commented Jan 24, 2025 •

edited

Loading

Benchmarks (run on commit `b2dcb20`)

erik-3milabs commented Jan 24, 2025 •

edited

Loading

erik-3milabs commented Jan 24, 2025 •

edited

Loading

tarcieri commented Jan 24, 2025 •

edited

Loading

Why I thought `Limb::shr` is vartime...

... and how we might be able to speed up `Uint::shr` even further.

erik-3milabs commented Jan 27, 2025 •

edited

Loading

erik-3milabs commented Jan 30, 2025 •

edited

Loading

erik-3milabs commented Jan 30, 2025 •

edited

Loading

erik-3milabs commented Jan 30, 2025 •

edited

Loading