SIMD/MIMD enhancements discussion #400
Replies: 15 comments
-
Update: Results have been disappointing so far in terms of seeing any speedups. I believe this is in part because:
Now PlayRho has the Benchmark application, which uses Google's C++ benchmarking library and gives PlayRho a better basis for any SIMD/MIMD work.
-
Something I think is important to keep in mind regarding SIMD/MIMD enhancements is the cache locality, or locality of reference, problem. Someone could spend weeks writing the most thoroughly optimized OpenCL/CUDA code only to see that code's performance blown out of the water by local CPU code that was much better at using memory in cache-friendly ways. I've been looking into how memory is used and have found things in this regard that were surprising and very interesting to me. Conceptually speaking, getting data to and from external compute resources like the GPU seems to suffer from a version of the cache locality problem as well. Some of the changes I've made to PlayRho have been data-oriented in nature, addressing simulation speed based on memory access patterns and optimizing cache locality. This all gets increasingly hardware specific alarmingly fast to me. For example, would it make more sense to write code explicitly using Streaming SIMD Extensions, or to spend that time working on improving the optimizations available in the GNU C++ and/or Clang compilers? Anyways... stuff I've been thinking about related to this.
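As a concrete illustration of the data-oriented point above, here is a minimal sketch of the array-of-structs versus struct-of-arrays trade-off (the types and field names are hypothetical, not PlayRho's actual ones):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical body layouts for illustration; not PlayRho's actual types.
// Array-of-structs: all of a body's fields travel together, so a loop that
// only integrates positions still pulls every other field into cache.
struct BodyAoS { float posX, posY, velX, velY, invMass; };

// Struct-of-arrays: each field is contiguous in memory, so the same loop
// streams exactly the four arrays it needs and nothing else.
struct BodiesSoA { std::vector<float> posX, posY, velX, velY; };

void integrateAoS(std::vector<BodyAoS>& bodies, float dt) {
    for (auto& b : bodies) { b.posX += b.velX * dt; b.posY += b.velY * dt; }
}

void integrateSoA(BodiesSoA& b, float dt) {
    for (std::size_t i = 0; i < b.posX.size(); ++i) {
        b.posX[i] += b.velX[i] * dt;
        b.posY[i] += b.velY[i] * dt;
    }
}
```

Both loops compute the same result; the struct-of-arrays layout simply makes better use of each cache line fetched, and is also friendlier to compiler auto-vectorization.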
-
It's funny how Chipmunk not only scales quite considerably when multi-threading is enabled (using the hasty space), but also benefits from hardware acceleration on mobile platforms. According to @Genbox, though, the improvements in Box2D were marginal at best, or even negative, which is really puzzling; but then again, the two engines are quite different overall.
-
@louis-langholtz I also found that branch prediction misses were a big hit in the solver code. I'm sure some clever arithmetic could produce a less branchy version, or even make some of it completely branchless, which would optimize the CPU prefetching. I think the solver code is the best candidate for SIMD since it gets large blocks of data and uses matrix-heavy operations, which can be done in SIMD fashion. The trick is to do as many operations as possible to fill out the SIMD register, which can be up to 512 bits with AVX-512. However, you are right that it makes the code very hardware specific and very difficult to port, so I would only do it for the hot paths in the solver or maybe the narrow phase.
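To illustrate the branchless idea, here's a sketch of the accumulated-impulse clamp found in sequential-impulse solvers, written two ways (the names are illustrative, not PlayRho's actual identifiers):

```cpp
#include <algorithm>

// Branchy version: the unpredictable comparison on newTotal can cost a
// branch prediction miss on each contact iteration.
float clampImpulseBranchy(float lambda, float& total) {
    float newTotal = total + lambda;
    if (newTotal < 0.0f) { newTotal = 0.0f; }  // contacts can only push
    const float applied = newTotal - total;
    total = newTotal;
    return applied;
}

// Branchless version: std::max on floats typically compiles to a single
// maxss/vmaxss instruction on x86, leaving no branch to mispredict.
float clampImpulseBranchless(float lambda, float& total) {
    const float newTotal = std::max(total + lambda, 0.0f);
    const float applied = newTotal - total;
    total = newTotal;
    return applied;
}
```

The two functions are behaviorally identical; whether the branchless form actually wins needs measuring, since compilers sometimes emit a conditional move for the branchy form anyway.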
-
Note: Intel VTune is amazing for analyzing low-level operations on the CPU.
-
@Genbox Any chance you can take a screen capture or something similar of the branch prediction miss info? Is that from running Intel VTune on PlayRho, on Erin's Box2D code, or on some other solver code? I'd love to review that info, since I've tested some branch eliminations in what I consider solver code and saw no reliably repeatable speedups. I've also had some potential compile-time optimizations in mind that would eliminate some branching, so any branch miss info I could get/see would be very helpful. Incidentally, I'd love to get my hands on VTune, even just for the 30-day trial period, but I have some hardware resource limits that would make using it harder. OTOH, there's sin/cos calculations going on inside the world step code path which I've been considering replacing with unit vector math (using the …). In any case, I'd like to increase the benchmarks done, especially to fill in the gap between the lowest-level code (like floating point division times) and the highest-level code (like times for whole simulations coming to rest).
-
I used it on Velcro with the 30-day evaluation. It does not care about the programming language, since it focuses on the instructions running on the CPU and not much else. What kind of hardware limits are we talking about? Replace sin/cos where? In the b2Rot class? Sin and cos should be compiled down to instructions. All CPUs should have the FSIN and FCOS instructions, which nowadays are implemented like a lookup table with extrapolation for extra precision, if I remember correctly. That means they are not going to get any faster, so replacing them (also see this) with sqrt-based math could gain a speedup. On the topic of sin/cos, I did this little thingy some years ago. The CPU instructions should be much faster than those nowadays.
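One way to avoid per-step sin/cos calls entirely is to store a rotation as a unit vector (cos θ, sin θ) and compose rotations by complex-number multiplication, so stepping an angle costs a handful of multiplies and adds. A minimal sketch of that idea (illustrative types, not Box2D's or PlayRho's):

```cpp
#include <cmath>

// A rotation kept as the unit vector (cos θ, sin θ). Illustrative only.
struct Rot { float c, s; };

// Angle addition via complex multiplication:
// (a.c + i·a.s) · (b.c + i·b.s). No sin/cos calls needed per step.
Rot mul(Rot a, Rot b) {
    return {a.c * b.c - a.s * b.s, a.s * b.c + a.c * b.s};
}
```

With a fixed time step, the per-step rotation can be computed once with sin/cos and then reapplied by `mul` every step, at the cost of slow floating-point drift off the unit circle that needs occasional renormalization.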
-
@Genbox Replace the sin/cos calculations being done inside the world step. More specifically, inside the regular phase and the TOI phase processing. You can see these calculations occurring for every body that's been islanded, as part of the broad-phase synchronization. I'm not sure that the calculations can be replaced with … This would also solve the problem of normalizing the angle.
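On the angle-normalization point: a scalar angle has to be wrapped back into range after every update, whereas a (cos, sin) unit vector never wraps; its only maintenance is renormalizing away rounding drift with a single sqrt. A sketch of the two maintenance operations (hypothetical names, not PlayRho's API):

```cpp
#include <cmath>

// A direction stored as a 2D unit vector. Illustrative only.
struct UnitVec { float x, y; };

// Drift correction for the unit-vector representation: one sqrt, no
// wrapping or range checks involved.
UnitVec normalize(UnitVec v) {
    const float len = std::sqrt(v.x * v.x + v.y * v.y);
    return {v.x / len, v.y / len};
}

// The equivalent maintenance for a scalar angle: wrap into (-π, π].
float wrapAngle(float radians) {
    constexpr float kTau = 6.2831853f;  // 2π
    return std::remainder(radians, kTau);
}
```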
-
In case no one has linked it already... https://github.com/wolfviking0/webcl-box2d is an OpenCL-accelerated implementation of Box2D. No idea how much that implementation helps, but maybe it can provide helpful insight at a conceptual/API level.
-
These are timing results from the Benchmark application for using various mechanisms from the C++ Thread Support Library. By comparison, here's the results I get for 1000 single-precision divisions and 1000 double-precision …
What I make of these data points, based on the code they're testing, is that the time cost of using any blocking operation from the thread support library is in the ballpark of doing around 7,000 single-precision floating point division operations or some 500 … Now that I have the CMake build working as well as it does for Windows, I'll try getting the Benchmark application building and running on Windows to get numbers for that platform.
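For anyone wanting to reproduce that kind of comparison without the Benchmark application, here's a minimal self-contained sketch using std::chrono rather than the Google Benchmark library (the absolute numbers are machine-specific; only the ratio between the two loops is interesting, and a real benchmark harness would control for warm-up and optimizer effects):

```cpp
#include <chrono>
#include <mutex>

// Times a callable over `iters` iterations and returns nanoseconds per call.
// Crude compared to Google Benchmark, but enough to see order-of-magnitude
// differences between a float division and a mutex lock/unlock pair.
template <class F>
double nsPerOp(F&& f, int iters) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) { f(); }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / iters;
}
```

Typical usage would be `nsPerOp([&] { x = x / y; }, 1000)` against `nsPerOp([&] { m.lock(); m.unlock(); }, 1000)` with `x`, `y` declared `volatile float` so the division isn't optimized away.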
-
Just dropping this https://github.com/QuantStack/xsimd into the discussion. It's an abstraction wrapper for SIMD operations across the common instruction sets. Came across this at work and found it quite useful. Also tried other wrappers, but I found this one to have the nicest API. With all wrappers, as well as of course when using SIMD instructions directly (which I would advise against), it isn't just a matter of dropping in the headers and expecting it to just work. However, using SSE* or, even better, AVX through this wrapper offers some serious speedups. How far these are applicable to PlayRho's calculations I do not know; it might well only be worth it for simulations with a larger number of objects.
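The loop shape that wrappers like xsimd produce is worth seeing on its own: process the bulk of an array in fixed-width batches, then handle the leftover elements with a scalar tail. The sketch below shows that structure in plain standard C++ (the batch width is a stand-in constant and the inner loop is scalar code, not actual xsimd calls, since the exact xsimd API differs between versions):

```cpp
#include <cstddef>

// The canonical SIMD-wrapper loop shape: a "vector" loop over full batches
// plus a scalar remainder loop. With xsimd, the inner block would instead
// load a batch of floats, add them in one instruction, and store the result.
void addArrays(const float* a, const float* b, float* out, std::size_t n) {
    constexpr std::size_t kBatch = 4;  // stand-in for the wrapper's batch size
    std::size_t i = 0;
    for (; i + kBatch <= n; i += kBatch) {
        for (std::size_t j = 0; j < kBatch; ++j) {  // would be one SIMD add
            out[i + j] = a[i + j] + b[i + j];
        }
    }
    for (; i < n; ++i) {  // scalar tail for the leftover elements
        out[i] = a[i] + b[i];
    }
}
```

The tail loop is where naive drop-in attempts usually go wrong, which matches the comment above that it isn't just a matter of dropping in the headers.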
-
Interesting link.
-
Xsimd does offer ARM NEON support, I think, but I too am not interested in mobile. I use it on Xeon CPUs and it does make a big difference where applicable. "Where applicable" is key here. I don't know if it's the right path for PlayRho, but should SIMD be integrated, I don't think using any instruction set without an abstraction would be a good idea.
-
Just going to note, because it hasn't been mentioned, that it's possible to force a cache line into the cache ahead of use with prefetch instructions.
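For what it's worth, here's a minimal sketch of that idea using the GCC/Clang `__builtin_prefetch` builtin (MSVC spells this differently, e.g. via `_mm_prefetch`; the look-ahead distance here is a made-up tuning value, not a measured one):

```cpp
#include <cstddef>
#include <vector>

// Software prefetching on a large, predictable scan: hint the hardware to
// start fetching data a fixed distance ahead of where the loop is reading.
float sumWithPrefetch(const std::vector<float>& data) {
    constexpr std::size_t kAhead = 16;  // tune per cache line size and latency
    float sum = 0.0f;
    for (std::size_t i = 0; i < data.size(); ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + kAhead < data.size()) {
            // rw=0: prefetch for read; locality=3: keep in all cache levels.
            __builtin_prefetch(&data[i + kAhead], 0, 3);
        }
#endif
        sum += data[i];
    }
    return sum;
}
```

For a simple linear scan like this the hardware prefetcher usually does the job already; explicit prefetching tends to pay off more on pointer-chasing or strided access patterns.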
-
With pull request #346 merged now, I'm excited that the compilation firewall and switch to value semantics it introduced will facilitate more experimentation with this and other optimization ideas. Hooray!!
-
Let's use this issue for any conversation regarding SIMD/MIMD enhancements.