Rename :scale: -> :width:

FrancescAlted committed Dec 26, 2024
1 parent 425f28c commit 4bb4d45
Showing 3 changed files with 16 additions and 16 deletions.
10 changes: 5 additions & 5 deletions posts/arm-memory-walls-followup.rst
@@ -22,7 +22,7 @@ ARM Plans for Improving CPU Performance
So with ARM CPUs dominating the world of mobile and embedded, the question is whether ARM would be interested in having a stab at the client market (laptops and desktop PCs) and, by extension, at the server computing market during the 2020s, or whether it would renounce that because it is comfortable enough with the current situation. In 2018 ARM provided an important hint for answering this question: they really want to push hard for the client market with the `introduction of the Cortex A76 CPU <https://www.anandtech.com/show/13226/arm-unveils-client-cpu-performance-roadmap>`_, which aspires to redefine ARM's capability to compete with Intel at its own game:

.. image:: /images/arm-memory-walls-followup/arm-compute-plans.png
:scale: 75 %
:width: 75%
:align: center

On the other hand, ARM is not just providing licenses to use its IP cores: vendors can also buy an architectural licence to design their own CPU cores using the ARM instruction sets. This makes it possible for other players like Apple, AppliedMicro, Broadcom, Cavium (now Marvell), Nvidia, Qualcomm, and Samsung Electronics to produce ARM CPUs adapted to different scenarios. One example that is interesting for this discussion is Marvell, whose ThunderX2 CPU is already entering the computing server market --actually, a new super-computer with more than 100,000 ThunderX2 cores has recently entered the `TOP500 ranking <https://t.co/LM2wXQrXm8>`_; this is the first time that an ARM-based computer enters that list, which has been overwhelmingly dominated by Intel architectures for almost two decades now.
@@ -38,10 +38,10 @@ Here we are going to analyze `Huawei's Kirin 980 CPU <https://www.anandtech.com/
ARM is saying that they designed the `A76 to be a competitor of the Intel Skylake Core i5 <https://arstechnica.com/gadgets/2018/06/arm-promises-laptop-level-performance-in-2019/>`_, so this is what we are going to check here. For this, we are going to compare a Kirin 980 in a Huawei Mate 20 phone against a Core i5 included in a MacBook Pro (late 2016). Here is the side-by-side performance for the precipitation dataset that I used in the `previous blog <http://blosc.org/posts/breaking-memory-walls/>`_:

.. |rainfall-kirin980| image:: /images/arm-memory-walls-followup/kirin980-rainfall-lz4-9.png
:scale: 70 %
:width: 70%

.. |rainfall-i5laptop| image:: /images/arm-memory-walls-followup/i5laptop-lz4-9.png
:scale: 70 %
:width: 70%

+---------------------+---------------------+
| |rainfall-kirin980| | |rainfall-i5laptop| |
@@ -62,10 +62,10 @@ The second way in which ARM sells licenses is the so-called *architectural licen
To check how powerful a ThunderX2 can be, we are going to compare the `ThunderX2 CN9975 <https://en.wikichip.org/wiki/cavium/thunderx2/cn9975>`_ (actually a box with 2 instances of it, each containing 28 cores) against one of its natural competitors, the Intel Scalable Gold 5120 (actually a box with 2 instances of it, each containing 14 cores):

.. |rainfall-thunderx2| image:: /images/arm-memory-walls-followup/thunderx2-rainfall-lz4-9.png
:scale: 70 %
:width: 70%

.. |rainfall-scalable| image:: /images/arm-memory-walls-followup/scalable-rainfall-lz4-9.png
:scale: 70 %
:width: 70%

+----------------------+---------------------+
| |rainfall-thunderx2| | |rainfall-scalable| |
6 changes: 3 additions & 3 deletions posts/blosc2-meets-rome.rst
@@ -12,7 +12,7 @@
On August 7, 2019, AMD released a new generation of its EPYC series of processors, the EPYC 7002, also known as Rome, based on the new `Zen 2 <https://en.wikipedia.org/wiki/Zen_2>`_ micro-architecture. Zen 2 is a significant departure from the physical design paradigm of AMD's previous Zen architectures, mainly in that the I/O components of the CPU are laid out on a separate die, distinct from the computing dies; this is quite different from Naples (aka EPYC 7001), its predecessor in the EPYC series:

.. image:: /images/blosc2-meets-rome/amd-rome-arch-multi-die.png
:scale: 33 %
:width: 33%
:align: center

Such a separation of dies for I/O and computing has quite `large consequences in terms of scalability when accessing memory <https://www.anandtech.com/show/15044/the-amd-ryzen-threadripper-3960x-and-3970x-review-24-and-32-cores-on-7nm/3>`_, which is critical for Blosc operation, and here we want to check how Blosc and AMD Rome behave together. As there is no replacement for experimentation, we are going to use the same benchmark that was introduced in our previous `Breaking Down Memory Walls <https://blosc.org/posts/breaking-memory-walls/>`_ post. This essentially boils down to computing an aggregation with a simple loop like:
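The loop body itself is elided in this view of the diff; as a rough illustration only, here is a minimal Python sketch of that kind of aggregation (hypothetical names; the actual benchmark is written in C with OpenMP):

```python
# Hypothetical sketch of the aggregation the benchmark performs:
# a plain running sum over the whole dataset.
def aggregate(values):
    total = 0.0
    for v in values:
        total += v
    return total

data = [float(i) for i in range(1000)]
print(aggregate(data))  # 499500.0
```

The point of the benchmark is how fast this trivial reduction can be fed with data, not the arithmetic itself.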
@@ -36,7 +36,7 @@ The synthetic data chosen for this benchmark allows to be compressed/decompresse
After some experiments, and as usual for synthetic datasets, the codec inside Blosc2 that showed the best speed while keeping a decent compression ratio (54.6x) was BloscLZ with compression level 3. Here are the results:

.. image:: /images/blosc2-meets-rome/sum_openmp_synthetic-blosclz-3.png
:scale: 50 %
:width: 50%
:align: center

As we can see, the uncompressed dataset scales pretty well until 8 threads, where it hits the memory wall for this machine (around 74 GB/s). On the other hand, even if data compressed with Blosc2 (in combination with the BloscLZ codec) shows less performance initially, it scales quite smoothly up to 12 threads, where it surpasses its uncompressed counterpart (reaching the 90 GB/s mark).
@@ -49,7 +49,7 @@ Aggregating the Precipitation Dataset on AMD EPYC 7402 24-Core
Now it is time to check the performance of the aggregation with the 100 million values dataset coming from a `precipitation dataset from Central Europe <http://reanalysis.meteo.uni-bonn.de/>`_. Computing the aggregation of this data is representative of a catchment average of precipitation over a drainage area. This time, the best codec inside Blosc2 was determined to be LZ4 with compression level 9:

.. image:: /images/blosc2-meets-rome/sum_openmp_rainfall-lz4-9-lz4-9-ipp.png
:scale: 50 %
:width: 50%
:align: center

As expected, the uncompressed aggregation scales pretty much the same as for the synthetic dataset (in the end, the Arithmetic and Logic Unit in the CPU is completely agnostic about the kind of data it operates on). The compressed dataset, on the other hand, scales more slowly but more steadily, hitting a maximum at 48 threads, where it reaches almost the same speed as the uncompressed dataset; quite a feat, given the high memory bandwidth of this machine (~74 GB/s).
16 changes: 8 additions & 8 deletions posts/breaking-down-memory-walls.rst
@@ -65,7 +65,7 @@ If you are curious on how the super-chunk can be created and used, just check th
Regarding the computing algorithm, I will use one that follows the principles of the blocking computing technique: for every chunk, bring it to the CPU, decompress it (so that it stays in cache), run all the necessary operations on it, and then proceed to the next chunk:

.. image:: /images/breaking-down-memory-walls/blocking-technique.png
:scale: 25 %
:width: 25%
:align: center

For implementation details, have a look at the `benchmark sources <https://github.com/Blosc/c-blosc2/blob/master/bench/sum_openmp.c#L191-L209>`_.
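As a rough illustration of the chunk-at-a-time control flow described above, here is a minimal Python sketch. This is an assumption on my part for clarity only: it uses zlib from the standard library as a stand-in for the Blosc2 codecs, and a plain list of compressed chunks as a toy "super-chunk"; the real benchmark does this in C with OpenMP.

```python
import struct
import zlib

# Toy sketch of the blocking technique (zlib stands in for Blosc2;
# a list of compressed chunks stands in for the super-chunk).
CHUNK_ITEMS = 1024  # values per chunk; real chunks would be cache-sized

def make_superchunk(values):
    """Compress the dataset chunk by chunk into a toy 'super-chunk'."""
    chunks = []
    for i in range(0, len(values), CHUNK_ITEMS):
        block = values[i:i + CHUNK_ITEMS]
        raw = struct.pack("%dd" % len(block), *block)
        chunks.append(zlib.compress(raw, 1))  # fast, low compression level
    return chunks

def blocked_sum(chunks):
    """For every chunk: decompress it (so it stays in cache) and reduce it."""
    total = 0.0
    for c in chunks:
        raw = zlib.decompress(c)
        total += sum(struct.unpack("%dd" % (len(raw) // 8), raw))
    return total

data = [float(i % 10) for i in range(10_000)]
superchunk = make_superchunk(data)
print(blocked_sum(superchunk))  # 45000.0
```

Only one decompressed chunk is alive at a time, which is what lets the working set fit in cache while the compressed dataset stays small in memory.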
@@ -89,10 +89,10 @@ Choosing the Compression Codec
When determining the best codec to use inside Blosc2 (it has support for BloscLZ, LZ4, LZ4HC, Zstd, Zlib and Lizard), it turns out that they behave quite differently, both in terms of compression and speed, depending on the dataset they have to compress *and* on the CPU architecture on which they run. This is quite usual, and the reason why you should always try to find the best codec for your use case. Here is how the different codecs behave for our precipitation dataset in terms of decompression speed on our reference platform (Intel Xeon E3-1245):

.. |i7server-codecs| image:: /images/breaking-down-memory-walls/i7server-rainfall-codecs.png
:scale: 70 %
:width: 70%

.. |rainfall-cr| image:: /images/breaking-down-memory-walls/rainfall-cr.png
:scale: 70 %
:width: 70%

+-------------------+-------------------+
| |i7server-codecs| | |rainfall-cr| |
@@ -115,7 +115,7 @@ Reference CPU: Intel Xeon E3-1245 v5 4-Core processor @ 3.50GHz
This is a mainstream, somewhat 'small' processor for servers that has an excellent price/performance ratio. Its main virtue is that, due to its small core count, the CPU can run at considerably high clock speeds which, combined with a high IPC (Instructions Per Clock) count, delivers considerable computational power. These results are a good baseline for comparing other CPUs packing a larger number of cores (and hence running at lower clock speeds). Here is how it performs:

.. image:: /images/breaking-down-memory-walls/i7server-rainfall-lz4hc-9.png
:scale: 75 %
:width: 75%
:align: center

We see here that, even though the uncompressed dataset does not scale too well, the compressed dataset shows nice scalability even when using hyperthreading (> 4 threads); this is a remarkable fact for a feature (hyperthreading) that, despite marketing promises, does not always deliver 2x the performance of the physical cores. With that, the performance peak for the compressed precipitation dataset (22 GB/s, using LZ4HC) is really close to the uncompressed one (27 GB/s); quite an achievement for a CPU with just 4 physical cores.
@@ -127,7 +127,7 @@ AMD EPYC 7401P 24-Core Processor @ 2.0GHz
This CPU implements EPYC, one of the most powerful architectures ever created by AMD. It packs 24 physical cores, although internally they are split into 2 blocks with 12 cores each. Here is how it behaves:

.. image:: /images/breaking-down-memory-walls/epyc-rainfall-lz4-9.png
:scale: 75 %
:width: 75%
:align: center

Stalling at 4/8 threads, the EPYC scalability for the uncompressed dataset is definitely not good. The compressed dataset, on the other hand, behaves quite differently: it shows nice scalability through the whole range of cores in the CPU (again, even when using hyperthreading), achieving its best performance (45 GB/s, using LZ4) at precisely 48 threads, well above the maximum performance reached by the uncompressed dataset (30 GB/s).
@@ -139,7 +139,7 @@ Intel Scalable Gold 5120 2x 14-Core Processor @ 2.2GHz
Here we have one of the latest and most powerful CPU architectures developed by Intel. We are testing it here in a machine with 2 CPUs, each containing 14 cores. Here is how it performed:

.. image:: /images/breaking-down-memory-walls/scalable-rainfall-lz4-9.png
:scale: 75 %
:width: 75%
:align: center

In this case, stalling at 24/28 threads, the Intel Scalable shows quite remarkable scalability for the uncompressed dataset (apparently, Intel has finally chosen a good name for an architecture; well done, guys!). More importantly, it also shows even nicer scalability for the compressed dataset, all the way up to 56 threads (which is expected, given the 2x 14-core CPUs with hyperthreading); this is a remarkable feat for such a memory-bandwidth beast. In absolute terms, the compressed dataset achieves a performance (68 GB/s, using LZ4) that is very close to the uncompressed one (72 GB/s).
@@ -150,7 +150,7 @@ Cavium ARMv8 2x 48-Core
We are used to seeing ARM architectures powering most of our phones and tablets, but seeing them perform computational duties is far less common. This does not mean that there are no ARM implementations that can power big servers. Cavium, with 48 cores in a single CPU, is an example of a server-grade chip. In this case we are looking at a machine with two of these CPUs:

.. image:: /images/breaking-down-memory-walls/cavium-rainfall-blosclz-9.png
:scale: 75 %
:width: 75%
:align: center

Again, we see nice (if a bit bumpy) scalability for the uncompressed dataset, reaching its maximum (35 GB/s) at 40 threads. The compressed dataset scales much more smoothly, and we see how the performance peaks at 64 threads (15 GB/s, using BloscLZ) and then drops significantly after that point (even though the CPU still has enough cores to continue scaling; I am not sure why that is). Incidentally, BloscLZ being the best performer here is no coincidence, as it recently received a lot of fine-tuning for ARM.
@@ -161,7 +161,7 @@ What We Learned

We have explored how to use compression in a nearly optimal way to perform a very simple task: compute an aggregation out of a large dataset. With a basic understanding of the cache and memory subsystem, and by using appropriate compressed data structures (the super-chunk), we have seen how we can easily produce code that enables modern CPUs to perform operations on compressed data at a speed that approaches the speed of the same operations on uncompressed data (and sometimes exceeds it). More in particular:

1. Performance for the compressed dataset scales very well on the number of threads for all the CPUs (even hyperthreading seems very beneficial at that, which is a welcome surprise).
1. Performance for the compressed dataset scales very well on the number of threads for all the CPUs (even hyper-threading seems very beneficial at that, which is a welcome surprise).

2. The CPUs that benefit the most from compression are those with relatively low memory bandwidth and a large number of cores. In particular, the EPYC architecture is a good example, and we have shown how the compressed dataset can operate 50% faster than the uncompressed one.

