Vector Sum Reduction #
We can now add two SIMD vectors together element-by-element. However, I’m sure you can imagine scenarios where you might want to add all the elements of a single vector together, i.e., a sum-reduction. This page will explain how to do this with 128-bit and 256-bit vectors; the approach is different for each vector width. I would show how to do this with 512-bit vectors as well, but I don’t have a CPU with AVX512 instructions handy.
Reducing 4 floats #
Base case #
We’ll assume a 1D array in which every 8 elements are reduced into one element of another array:
void reduce_base_sgl(const int nelem, float* a, float* b) {
    for (int i = 0; i < nelem; i += 8) {
        for (int j = 0; j < 8; ++j) {
            b[i/8] += a[i+j]; // assume b is already zeroed
        }
    }
}
The loop is meant to mimic cases where many small sum-reductions are performed.
128-bit vectorization (SSE) #
In the first demonstration of vectorizing the sum-reduction loop above, we’ll make use of a few new functions:
__m128 v = _mm_set_ss(val); // creates a vector where the lowermost element is val, and the rest are 0
__m128 v = _mm_add_ss(a, b); // adds only the lowermost elements of a and b. The rest are the same as in a.
__m128 v = _mm_hadd_ps(a, b); // performs a "horizontal" add (see explanation below)
_mm_store_ss(&a, v); // stores the lower-most value in the vector
The _mm_hadd_xx functions will be the workhorse of the vectorized reduction. A horizontal add takes two vectors and adds adjacent pairs of elements from each. The first element of the result is the sum of the second pair of the first vector, a2+a3; the second element is the sum of the first pair, a0+a1. The third and fourth elements are the same, but for the second input vector. (The ordering looks reversed because _mm_set_ps lists its arguments from the highest lane down to the lowest.)
// __m128 a = _mm_set_ps(a0, a1, a2, a3);
// __m128 b = _mm_set_ps(b0, b1, b2, b3);
__m128 v = _mm_hadd_ps(a, b);
// v = [a2+a3, a0+a1, b2+b3, b0+b1];
Because hadd adds two vectors in this manner, it can be executed twice in a row to sum all the elements of an __m128 vector:
// __m128 a = _mm_set_ps(a0, a1, a2, a3);
// first horizontal add
__m128 v = _mm_hadd_ps(a, a);
// v = [a2+a3, a0+a1, a2+a3, a0+a1]
// second horizontal add
v = _mm_hadd_ps(v, v);
// v = [v2+v3, v0+v1, v2+v3, v0+v1]
// = [a2+a3+a0+a1, a2+a3+a0+a1, a2+a3+a0+a1, a2+a3+a0+a1]
The double hadd results in 3 adds for every 4 elements (the two horizontal adds, plus the add to the reduction variable), compared to 4 serial adds for the same elements, i.e., 6 adds instead of 8 per 8-element chunk. Using these new functions to vectorize our loop:
void reduce_128_sgl(const int nelem, float* a, float* b) {
    for (int i = 0; i < nelem; i += 8) {
        __m128 vb = _mm_set_ss(b[i/8]);        // initializes a vector with the sum-reduction variable
        for (int j = 0; j < 8; j += 4) {       // start the reduction loop
            __m128 va = _mm_loadu_ps(&a[i+j]); // load the data
            va = _mm_hadd_ps(va, va);          // first horizontal add
            va = _mm_hadd_ps(va, va);          // second horizontal add
            vb = _mm_add_ss(va, vb);           // add result to the sum-reduction vector variable
        }
        _mm_store_ss(&b[i/8], vb);             // store the left-most value in the sum-reduction variable
    }
}
The use of the __m128 vb vector speeds the loop up slightly by avoiding the need to update the float* b array directly in the inner loop. In this case, it reduces the number of stores by one. In the double precision version, it would reduce the number of stores by 3, since the inner loop would have 4 iterations instead of only 2. Note that with double precision, reducing a 128-bit vector to one value doesn’t reduce the number of additions, so there shouldn’t be much difference between the base and vectorized versions.
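For reference, here is a minimal sketch of what that 128-bit double precision version might look like; the function name reduce_128_dbl and the assumption that b is already zeroed mirror the single precision example, so this is an illustration rather than necessarily the exact benchmarked code.
void reduce_128_dbl(const int nelem, double* a, double* b) {
    for (int i = 0; i < nelem; i += 8) {
        __m128d vb = _mm_set_sd(b[i/8]);        // sum-reduction variable in the lowest lane
        for (int j = 0; j < 8; j += 2) {        // 2 doubles per 128-bit vector, so 4 iterations
            __m128d va = _mm_loadu_pd(&a[i+j]); // load two doubles
            va = _mm_hadd_pd(va, va);           // one horizontal add sums both lanes
            vb = _mm_add_sd(va, vb);            // accumulate into the lowest lane
        }
        _mm_store_sd(&b[i/8], vb);              // store the reduced value
    }
}
As noted above, the single hadd plus the accumulate give 2 additions per 2 elements, the same count as the scalar loop, which is why little speedup is expected here.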
Performance (128-bit) #
Compilation command (using the Google benchmark framework):
g++ -o reduce.x reduce.cpp -msse4 -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread -std=c++17 -O2
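For context, the element counts below (4,096 up to 134,217,728) are the kind of sweep produced by the framework’s Range helper with its default multiplier of 8. The registration sketch below is my own illustration of how such a benchmark could be driven, assuming reduce_base_sgl from above is in the same file; it is not necessarily the exact harness used.
#include <benchmark/benchmark.h>
#include <vector>

static void BM_reduce_base_sgl(benchmark::State& state) {
    const int nelem = static_cast<int>(state.range(0));
    std::vector<float> a(nelem, 1.0f), b(nelem / 8, 0.0f);
    for (auto _ : state) {
        reduce_base_sgl(nelem, a.data(), b.data());
        benchmark::DoNotOptimize(b.data()); // keep the result live
    }
}
// Range(4096, 134217728) with the default multiplier of 8 generates
// 4096, 32768, 262144, 2097152, 16777216, 134217728.
BENCHMARK(BM_reduce_base_sgl)->Range(4096, 134217728);

BENCHMARK_MAIN();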
Single precision #
Comparing reduce_base_sgl (from ./reduce_base_sgl.json) to reduce_128_sgl (from ./reduce_128_sgl.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
--------------------------------------------------------------------------------------------------------------
[reduce_base_sgl vs. reduce_128_sgl]/4096 -0.2329 -0.2329 1625 1247 1621 1243
[reduce_base_sgl vs. reduce_128_sgl]/32768 -0.2644 -0.2644 13565 9978 13531 9953
[reduce_base_sgl vs. reduce_128_sgl]/262144 -0.2627 -0.2625 108413 79935 108135 79748
[reduce_base_sgl vs. reduce_128_sgl]/2097152 -0.2669 -0.2669 872616 639709 870408 638091
[reduce_base_sgl vs. reduce_128_sgl]/16777216 -0.1476 -0.1478 8405297 7164522 8385642 7146262
[reduce_base_sgl vs. reduce_128_sgl]/134217728 -0.1198 -0.1197 68570641 60358462 68394712 60204987
OVERALL_GEOMEAN -0.2179 -0.2179 0 0 0 0
The tests that fit in the L1-L3 caches benefit the most, with about a 1.3x speedup. The largest test shows the worst speedup, at ~1.1x. A 1.3x speedup is roughly in line with the estimate based on the ratio of adds per 8 elements (8/6 ≈ 1.33).
Double precision #
Comparing reduce_base_dbl (from ./reduce_base_dbl.json) to reduce_128_dbl (from ./reduce_128_dbl.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------------
[reduce_base_dbl vs. reduce_128_dbl]/4096 -0.1052 -0.1050 1627 1456 1623 1452
[reduce_base_dbl vs. reduce_128_dbl]/32768 -0.1106 -0.1106 13462 11973 13431 11945
[reduce_base_dbl vs. reduce_128_dbl]/262144 -0.0369 -0.0370 112035 107906 111772 107633
[reduce_base_dbl vs. reduce_128_dbl]/2097152 -0.0386 -0.0386 934021 898008 931648 895718
[reduce_base_dbl vs. reduce_128_dbl]/16777216 +0.0122 +0.0122 12864721 13022105 12831738 12988742
[reduce_base_dbl vs. reduce_128_dbl]/134217728 +0.0102 +0.0102 107276227 108369477 107002000 108088106
OVERALL_GEOMEAN -0.0461 -0.0461 0 0 0 0
As expected, the performance difference is minimal, although there appear to be some meaningful improvements for the test sizes that fit within the L1 and L2 caches (i.e., the 4,096 and 32,768 element tests). I’m not sure about this, but I am guessing that the improvements are due to SIMD vectors being used to store the intermediate reduction data.
Another interesting observation is that the double precision base version performs faster on the smallest case than its single precision equivalent. This is probably CPU-specific, as I don’t observe this on my desktop CPU (Intel i7-10750H).
256-bit vectorization (AVX) #
256-bit vectors double the previous vector width. This means 8 elements in a single precision (__m256) vector, and 4 elements in a double precision (__m256d) vector. For single precision data, the _mm256_hadd_ps function works like:
// __m256 a = _mm256_setr_ps(a0, a1, a2, a3, a4, a5, a6, a7); // setr lists elements from the lowest lane up
// __m256 b = _mm256_setr_ps(b0, b1, b2, b3, b4, b5, b6, b7);
__m256 v = _mm256_hadd_ps(a, b);
// v = [a0+a1, a2+a3, b0+b1, b2+b3, a4+a5, a6+a7, b4+b5, b6+b7];
If we were to apply this three times, in an attempt to use the hadd strategy to perform the reduction:
// __m256 a = _mm256_setr_ps(a0, a1, a2, a3, a4, a5, a6, a7);
// first horizontal add
__m256 v = _mm256_hadd_ps(a, a);
// v = [a0+a1, a2+a3, a0+a1, a2+a3, a4+a5, a6+a7, a4+a5, a6+a7];
// second horizontal add
v = _mm256_hadd_ps(v, v);
// v = [v0+v1, v2+v3, v0+v1, v2+v3, v4+v5, v6+v7, v4+v5, v6+v7];
// v = [a0+a1+a2+a3, a0+a1+a2+a3, a0+a1+a2+a3, a0+a1+a2+a3, a4+a5+a6+a7, a4+a5+a6+a7, a4+a5+a6+a7, a4+a5+a6+a7]
// third horizontal add
v = _mm256_hadd_ps(v, v);
// v = [v0+v1, v2+v3, v0+v1, v2+v3, v4+v5, v6+v7, v4+v5, v6+v7];
// v = [a0+a1+a2+a3+a0+a1+a2+a3, ..., ..., ..., a4+a5+a6+a7+a4+a5+a6+a7, ..., ..., ...]
In the last comment, the repeated entries have been elided. Each element of the bottom 128 bits of the vector holds 2*(a0+a1+a2+a3), and each element of the top 128 bits holds 2*(a4+a5+a6+a7). Hopefully it’s clear that simply performing three hadds in a row cannot perform our reduction, at least for 8 elements of single precision data.
A similar problem occurs when performing two consecutive hadd operations on double precision data. The _mm256_hadd_pd function works like:
// __m256d a = _mm256_setr_pd(a0, a1, a2, a3);
// __m256d b = _mm256_setr_pd(b0, b1, b2, b3);
__m256d v = _mm256_hadd_pd(a, b);
// v = [a0+a1, b0+b1, a2+a3, b2+b3];
And so trying to apply this twice for the purposes of reduction:
// __m256d a = _mm256_setr_pd(a0, a1, a2, a3);
// first horizontal add
__m256d v = _mm256_hadd_pd(a, a);
// v = [a0+a1, a0+a1, a2+a3, a2+a3]
// second horizontal add
v = _mm256_hadd_pd(v, v);
// v = [v0+v1, v0+v1, v2+v3, v2+v3]
// = [a0+a1+a0+a1, ..., a2+a3+a2+a3, ...]
This should illustrate that the double _mm256_hadd_pd has the same limitation as _mm256_hadd_ps: the two halves of the resulting 256-bit vector merely contain twice the sum of the elements in the corresponding 128-bit halves of the input.
__m256 vectors can be reduced by first splitting the vector into two __m128 vectors, adding the two halves together, and then performing the 128-bit vector reduction shown previously. For example, for single precision:
void reduce_256_sgl(const int nelem, float* a, float* b) {
    for (int i = 0; i < nelem; i += 8) {
        __m128 vb = _mm_set_ss(b[i/8]);                  // initializes a vector with the sum-reduction variable
        for (int j = 0; j < 8; j += 8) {                 // start the reduction loop
            __m256 va = _mm256_loadu_ps(&a[i+j]);        // load the data
            __m128 va_low = _mm256_castps256_ps128(va);  // "truncate" the 256-bit vector to get the lower 128-bit vector
            __m128 va_hi = _mm256_extractf128_ps(va, 1); // extract the upper 128-bit vector
            va_low = _mm_add_ps(va_low, va_hi);          // add the two halves together
            va_low = _mm_hadd_ps(va_low, va_low);        // first horizontal add
            va_low = _mm_hadd_ps(va_low, va_low);        // second horizontal add
            vb = _mm_add_ss(va_low, vb);                 // add result to the sum-reduction vector variable
        }
        _mm_store_ss(&b[i/8], vb);                       // store the left-most value in the sum-reduction variable
    }
}
Note that in this version, 8 elements are loaded at a time into the 256-bit vector, which removes the need for the inner loop. The inner loop is kept anyway so that the code is more comparable to the base and 128-bit versions (a flattened sketch without it is shown after the explanation below).
To extract the lower 128-bit part of the 256-bit vector, the _mm256_castps256_ps128 function is used. This function casts an __m256 vector to an __m128 vector by truncating the upper 128 bits of the input. The upper half is extracted using _mm256_extractf128_ps; the second argument, 1, asks for the second 128-bit part of the vector. Passing 0 would give the first half, and could also be used to extract the lower 128 bits, but _mm256_castps256_ps128 is apparently faster. The next step is to add these two halves using the _mm_add_ps function, as used previously. The remainder of the reduction follows the same double hadd idiom used to reduce the 128-bit single precision vectors.
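Since each 256-bit load already covers a full 8-element chunk, the inner loop really is redundant. Here is a minimal sketch of the flattened form; the name reduce_256_sgl_flat is my own and this variant is not one of the benchmarked functions.
void reduce_256_sgl_flat(const int nelem, float* a, float* b) {
    for (int i = 0; i < nelem; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);       // load all 8 elements at once
        __m128 lo = _mm256_castps256_ps128(va);   // lower 128 bits
        __m128 hi = _mm256_extractf128_ps(va, 1); // upper 128 bits
        lo = _mm_add_ps(lo, hi);                  // combine the halves
        lo = _mm_hadd_ps(lo, lo);                 // first horizontal add
        lo = _mm_hadd_ps(lo, lo);                 // second horizontal add
        __m128 vb = _mm_set_ss(b[i/8]);           // existing partial sum
        vb = _mm_add_ss(lo, vb);                  // accumulate
        _mm_store_ss(&b[i/8], vb);                // store the result
    }
}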
The same strategy can be employed with double precision 256-bit vectors, except that only one _mm_hadd_pd is needed, as when performing the reduction with 128-bit double precision vectors.
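A minimal sketch of that double precision 256-bit version, under the same assumptions as the earlier examples (the name reduce_256_dbl mirrors the naming used so far, and this may differ from the exact benchmarked code):
void reduce_256_dbl(const int nelem, double* a, double* b) {
    for (int i = 0; i < nelem; i += 8) {
        __m128d vb = _mm_set_sd(b[i/8]);               // sum-reduction variable
        for (int j = 0; j < 8; j += 4) {               // 4 doubles per 256-bit vector
            __m256d va = _mm256_loadu_pd(&a[i+j]);     // load the data
            __m128d lo = _mm256_castpd256_pd128(va);   // lower 128 bits
            __m128d hi = _mm256_extractf128_pd(va, 1); // upper 128 bits
            lo = _mm_add_pd(lo, hi);                   // add the two halves together
            lo = _mm_hadd_pd(lo, lo);                  // single horizontal add
            vb = _mm_add_sd(lo, vb);                   // accumulate into the lowest lane
        }
        _mm_store_sd(&b[i/8], vb);                     // store the reduced value
    }
}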
Performance (256-bit) #
Compilation command (using the Google benchmark framework):
g++ -o reduce.x reduce.cpp -mavx -isystem benchmark/include -Lbenchmark/build/src -lbenchmark -lpthread -std=c++17 -O2
Single precision #
Comparing reduce_base_sgl (from ./reduce_base_sgl.json) to reduce_256_sgl (from ./reduce_256_sgl.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
--------------------------------------------------------------------------------------------------------------
[reduce_base_sgl vs. reduce_256_sgl]/4096 -0.5072 -0.5072 1625 801 1621 799
[reduce_base_sgl vs. reduce_256_sgl]/32768 -0.5265 -0.5264 13565 6423 13531 6408
[reduce_base_sgl vs. reduce_256_sgl]/262144 -0.5155 -0.5155 108413 52526 108135 52393
[reduce_base_sgl vs. reduce_256_sgl]/2097152 -0.4869 -0.4868 872616 447734 870408 446687
[reduce_base_sgl vs. reduce_256_sgl]/16777216 -0.2345 -0.2345 8405297 6434629 8385642 6419522
[reduce_base_sgl vs. reduce_256_sgl]/134217728 -0.1776 -0.1776 68570641 56392320 68394712 56245917
OVERALL_GEOMEAN -0.4240 -0.4240 0 0 0 0
The best speedup obtained is now ~2.1x and the worst is ~1.2x. Given that for every 8 elements added there are 4 addition operations (2 horizontal adds, 1 normal add, and 1 add to update the reduction variable), you would expect this implementation to achieve around 8/4 = 2x speedup over the non-vectorized base version. The better-than-expected performance on smaller problems might be because normal adds are cheaper than horizontal adds. The minimum speedup of ~1.2x, seen on the largest test, is only slightly better than the ~1.1x achieved by the equivalent 128-bit vector version.
Double precision #
Comparing reduce_base_dbl (from ./reduce_base_dbl.json) to reduce_256_dbl (from ./reduce_256_dbl.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------------
[reduce_base_dbl vs. reduce_256_dbl]/4096 -0.4083 -0.4083 1627 963 1623 960
[reduce_base_dbl vs. reduce_256_dbl]/32768 -0.3764 -0.3765 13462 8395 13431 8374
[reduce_base_dbl vs. reduce_256_dbl]/262144 -0.1858 -0.1857 112035 91225 111772 91011
[reduce_base_dbl vs. reduce_256_dbl]/2097152 -0.1667 -0.1665 934021 778325 931648 776501
[reduce_base_dbl vs. reduce_256_dbl]/16777216 -0.0118 -0.0116 12864721 12712606 12831738 12682778
[reduce_base_dbl vs. reduce_256_dbl]/134217728 -0.0017 -0.0017 107276227 107096199 107002000 106821947
OVERALL_GEOMEAN -0.2079 -0.2079 0 0 0 0
Here, the speedup is between ~1x and ~1.7x. The maximum speedup is significantly better than the 128-bit equivalent, but, as before, it shrinks as the test sizes grow and the data moves further from the CPU. The ~1.7x speedup achieved falls short of the 2x expected, probably because of the extraction of the 128-bit vectors from the 256-bit vector, as well as the relatively more expensive horizontal add operation.
As with the single precision version, the test sizes that only fit in RAM do not show any meaningful speedup over the 128-bit equivalent, although this version is at least no worse.
Conclusion #
The horizontal add intrinsics were introduced here, along with a few other intrinsics needed to perform vectorized reductions of chunks of data in an array. With single precision data, the speedups achieved were meaningful for all test sizes. For tests that fit within cache, the achieved speedups were roughly in line with the estimates from counting additions in the horizontal add strategy; those in-cache tests correspond to the upper end of the ranges below.
|         | Single    | Double  |
|---------|-----------|---------|
| 128-bit | 1.1x-1.3x | 1x-1.1x |
| 256-bit | 1.2x-2.4x | 1x-1.7x |
In contrast, the double precision vectorization attempt with 128-bit vectors only achieved 1.1x at best for problems that fit in cache, and did not obtain any speedup for tests that could only fit in RAM. The 256-bit vectorization attempt improved the upper end to 1.7x, but still failed to improve the speed for tests in RAM.
Faster sum-reduction examples are shown on this page.