SIMD/MIMD enhancements discussion #400
Replies: 15 comments
-
Update: Results have been disappointing so far in terms of seeing any speedups. I believe this is in part because:
Now PlayRho has the Benchmark application, which uses Google's C++ benchmarking library and gives PlayRho a better basis for any SIMD/MIMD work.
-
Something I think is important to keep in mind regarding SIMD/MIMD enhancements is the cache locality, or locality of reference, problem. Someone could spend weeks writing the most thoroughly optimized OpenCL/CUDA code only to see that code's performance blown out of the water by local CPU code that was much better at using memory in cache-friendly ways. I've been looking into how memory is used and have found things in this regard that were surprising and very interesting to me. Conceptually speaking, getting data to and from external compute resources like the GPU seems to suffer from a version of the cache locality problem as well. Some of the changes I've made to PlayRho have been data-oriented in nature, addressing simulation speed based on memory access patterns and optimizing cache locality. This all gets increasingly hardware specific alarmingly fast to me. For example, would it make more sense to write code explicitly using Streaming SIMD Extensions, or to spend that time working on improving the optimizations available in the GNU C++ and/or Clang compilers? Anyways... stuff I've been thinking about related to this.
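As a concrete illustration of the data-oriented point above, here is a minimal sketch of the array-of-structs versus struct-of-arrays trade-off (the types and field names are hypothetical, not PlayRho's actual ones):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical body layouts for illustration; not PlayRho's actual types.
// Array-of-structs: all of a body's fields travel together, so a loop that
// only integrates positions still pulls every other field into cache.
struct BodyAoS { float posX, posY, velX, velY, invMass; };

// Struct-of-arrays: each field is contiguous in memory, so the same loop
// streams exactly the four arrays it needs and nothing else.
struct BodiesSoA { std::vector<float> posX, posY, velX, velY; };

void integrateAoS(std::vector<BodyAoS>& bodies, float dt) {
    for (auto& b : bodies) { b.posX += b.velX * dt; b.posY += b.velY * dt; }
}

void integrateSoA(BodiesSoA& b, float dt) {
    for (std::size_t i = 0; i < b.posX.size(); ++i) {
        b.posX[i] += b.velX[i] * dt;
        b.posY[i] += b.velY[i] * dt;
    }
}
```

Both loops compute the same result; the struct-of-arrays layout simply makes better use of each cache line fetched, and is also friendlier to compiler auto-vectorization.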
-
It's funny how Chipmunk not only scales quite considerably when multi-threading is enabled (using the hasty space), but also benefits from hardware acceleration on mobile platforms. According to @Genbox, though, the improvements in Box2D were marginal at best, or even negative, which is really puzzling; but then again, the two engines are quite different overall.
-
@louis-langholtz I also found that branch prediction misses were a big hit in the solver code. I'm sure some clever arithmetic could produce a less branchy version, or even make some of it completely branchless, which would optimize the CPU prefetching. I think the solver code is the best candidate for SIMD since it gets large blocks of data and uses matrix-heavy operations, which can be done in SIMD fashion. The trick is to do as many operations as possible to fill out the SIMD register, which can be up to 512 bits with AVX-512. However, you are right that it makes the code very hardware specific and very difficult to port, so I would only do it for the hot paths in the solver or maybe the narrow phase.
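To illustrate the branchless idea, here's a sketch of the accumulated-impulse clamp found in sequential-impulse solvers, written two ways (the names are illustrative, not PlayRho's actual identifiers):

```cpp
#include <algorithm>

// Branchy version: the unpredictable comparison on newTotal can cost a
// branch prediction miss on each contact iteration.
float clampImpulseBranchy(float lambda, float& total) {
    float newTotal = total + lambda;
    if (newTotal < 0.0f) { newTotal = 0.0f; }  // contacts can only push
    const float applied = newTotal - total;
    total = newTotal;
    return applied;
}

// Branchless version: std::max on floats typically compiles to a single
// maxss/vmaxss instruction on x86, leaving no branch to mispredict.
float clampImpulseBranchless(float lambda, float& total) {
    const float newTotal = std::max(total + lambda, 0.0f);
    const float applied = newTotal - total;
    total = newTotal;
    return applied;
}
```

The two functions are behaviorally identical; whether the branchless form actually wins needs measuring, since compilers sometimes emit a conditional move for the branchy form anyway.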
-
Note: Intel VTune is amazing for analyzing low-level operations on the CPU.
-
@Genbox Any chance you can take a screen capture or something similar of the branch prediction miss info? Is that from running Intel VTune on PlayRho, on Erin's Box2D code, or on some other solver code? I'd love to review that info, since I've tested some branch eliminations in what I consider solver code and saw no reliably repeatable speedups. I've also had some potential compile-time optimizations in mind that would eliminate some branching, so any branch miss info I could get/see would be very helpful. Incidentally, I'd love to get my hands on VTune, even just for the 30-day trial period, but I have some hardware resource limits that would make using it harder. OTOH, there's sin/cos calculations going on inside the world step code path which I've been considering replacing with unit vector math (using the …). In any case, I'd like to increase the benchmarks done, especially to fill in the gap between the lowest-level code (like floating point division times) and the highest-level code (like times for whole simulations coming to rest).
-
I used it on Velcro with the 30-day evaluation. It does not care about the programming language, since it focuses on the instructions running on the CPU and not much else. What kind of hardware limits are we talking about? Replace sin/cos where? In the b2Rot class? Sin and cos should be compiled down to instructions. All CPUs should have the FSIN and FCOS instructions, which nowadays are implemented like a lookup table with extrapolation for extra precision, if I remember correctly. That means they are not going to get any faster, so replacing them (also see this) with sqrt-based math could gain a speedup. On the topic of sin/cos, I did this little thingy some years ago. The CPU instructions should be much faster than those nowadays.
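One way to avoid per-step sin/cos calls entirely is to store a rotation as a unit vector (cos θ, sin θ) and compose rotations by complex-number multiplication, so stepping an angle costs a handful of multiplies and adds. A minimal sketch of that idea (illustrative types, not Box2D's or PlayRho's):

```cpp
#include <cmath>

// A rotation kept as the unit vector (cos θ, sin θ). Illustrative only.
struct Rot { float c, s; };

// Angle addition via complex multiplication:
// (a.c + i·a.s) · (b.c + i·b.s). No sin/cos calls needed per step.
Rot mul(Rot a, Rot b) {
    return {a.c * b.c - a.s * b.s, a.s * b.c + a.c * b.s};
}
```

With a fixed time step, the per-step rotation can be computed once with sin/cos and then reapplied by `mul` every step, at the cost of slow floating-point drift off the unit circle that needs occasional renormalization.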
-
@Genbox Replace the sin/cos calculations being done inside the world step. More specifically, inside the regular phase and the TOI phase processing. You can see these calculations occurring for every body that's been islanded, as part of the broad-phase synchronization. I'm not sure that the calculations can be replaced with … This would also solve the problem of normalizing the angle.
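On the angle-normalization point: a scalar angle has to be wrapped back into range after every update, whereas a (cos, sin) unit vector never wraps; its only maintenance is renormalizing away rounding drift with a single sqrt. A sketch of the two maintenance operations (hypothetical names, not PlayRho's API):

```cpp
#include <cmath>

// A direction stored as a 2D unit vector. Illustrative only.
struct UnitVec { float x, y; };

// Drift correction for the unit-vector representation: one sqrt, no
// wrapping or range checks involved.
UnitVec normalize(UnitVec v) {
    const float len = std::sqrt(v.x * v.x + v.y * v.y);
    return {v.x / len, v.y / len};
}

// The equivalent maintenance for a scalar angle: wrap into (-π, π].
float wrapAngle(float radians) {
    constexpr float kTau = 6.2831853f;  // 2π
    return std::remainder(radians, kTau);
}
```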
-
In case no one has linked it already... https://github.com/wolfviking0/webcl-box2d is an OpenCL-accelerated implementation of Box2D. No idea how much that implementation helps, but maybe it can provide helpful insight at a conceptual/API level.
-
These are timing results from the Benchmark application for using various mechanisms from the C++ Thread Support Library. By comparison, here's the results I get for 1000 single-precision divisions and 1000 double-precision …
What I make of these data points, based on the code they're testing, is that the time cost of using any blocking operation from the thread support library is in the ballpark of doing around 7,000 single-precision floating point division operations or some 500 … Now that I have the CMake build working as well as it does for Windows, I'll try getting the Benchmark application building and running on Windows to get numbers for that platform.
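For anyone wanting to reproduce that kind of comparison without the Benchmark application, here's a minimal self-contained sketch using std::chrono rather than the Google Benchmark library (the absolute numbers are machine-specific; only the ratio between the two loops is interesting, and a real benchmark harness would control for warm-up and optimizer effects):

```cpp
#include <chrono>
#include <mutex>

// Times a callable over `iters` iterations and returns nanoseconds per call.
// Crude compared to Google Benchmark, but enough to see order-of-magnitude
// differences between a float division and a mutex lock/unlock pair.
template <class F>
double nsPerOp(F&& f, int iters) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) { f(); }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / iters;
}
```

Typical usage would be `nsPerOp([&] { x = x / y; }, 1000)` against `nsPerOp([&] { m.lock(); m.unlock(); }, 1000)` with `x`, `y` declared `volatile float` so the division isn't optimized away.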
-
Just dropping this https://github.com/QuantStack/xsimd into the discussion. It's an abstraction wrapper for SIMD operations across the common instruction sets. Came across this at work and found it quite useful. Also tried other wrappers, but I found this one to have the nicest API. With all wrappers, as well as of course when using SIMD instructions directly (which I would advise against), it isn't just a matter of dropping in the headers and expecting it to just work. However, using SSE* or, even better, AVX through this wrapper offers some serious speedups. How far these are applicable to PlayRho's calculations I do not know; it might well only be worth it for simulations with a larger number of objects.
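The loop shape that wrappers like xsimd produce is worth seeing on its own: process the bulk of an array in fixed-width batches, then handle the leftover elements with a scalar tail. The sketch below shows that structure in plain standard C++ (the batch width is a stand-in constant and the inner loop is scalar code, not actual xsimd calls, since the exact xsimd API differs between versions):

```cpp
#include <cstddef>

// The canonical SIMD-wrapper loop shape: a "vector" loop over full batches
// plus a scalar remainder loop. With xsimd, the inner block would instead
// load a batch of floats, add them in one instruction, and store the result.
void addArrays(const float* a, const float* b, float* out, std::size_t n) {
    constexpr std::size_t kBatch = 4;  // stand-in for the wrapper's batch size
    std::size_t i = 0;
    for (; i + kBatch <= n; i += kBatch) {
        for (std::size_t j = 0; j < kBatch; ++j) {  // would be one SIMD add
            out[i + j] = a[i + j] + b[i + j];
        }
    }
    for (; i < n; ++i) {  // scalar tail for the leftover elements
        out[i] = a[i] + b[i];
    }
}
```

The tail loop is where naive drop-in attempts usually go wrong, which matches the comment above that it isn't just a matter of dropping in the headers.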
-
Interesting link.
-
Xsimd does offer ARM NEON support, I think, but I too am not interested in mobile. I use it on Xeon CPUs and it does make a big difference where applicable. "Where applicable" is key here. I don't know if it's the right path for PlayRho, but should SIMD be integrated, I don't think using any instruction set without an abstraction would be a good idea.
-
Just going to note, because it hasn't been mentioned, that it's possible to force a cache line into the cache ahead of use with prefetch instructions.
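For what it's worth, here's a minimal sketch of that idea using the GCC/Clang `__builtin_prefetch` builtin (MSVC spells this differently, e.g. via `_mm_prefetch`; the look-ahead distance here is a made-up tuning value, not a measured one):

```cpp
#include <cstddef>
#include <vector>

// Software prefetching on a large, predictable scan: hint the hardware to
// start fetching data a fixed distance ahead of where the loop is reading.
float sumWithPrefetch(const std::vector<float>& data) {
    constexpr std::size_t kAhead = 16;  // tune per cache line size and latency
    float sum = 0.0f;
    for (std::size_t i = 0; i < data.size(); ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + kAhead < data.size()) {
            // rw=0: prefetch for read; locality=3: keep in all cache levels.
            __builtin_prefetch(&data[i + kAhead], 0, 3);
        }
#endif
        sum += data[i];
    }
    return sum;
}
```

For a simple linear scan like this the hardware prefetcher usually does the job already; explicit prefetching tends to pay off more on pointer-chasing or strided access patterns.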
-
With pull request #346 merged now, I'm excited that the compilation firewall and switch to value semantics it introduced will facilitate more experimentation with this and other optimization ideas. Hooray!!
-
Let's use this issue for any conversation regarding SIMD/MIMD enhancements.