Vectorizing A Simple Pair Sweep #
One way to perform the force calculation sweep in SPH is to use a double loop like:
for (int i = 0; i < nelem-1; ++i) {
    for (int j = i+1; j < nelem; ++j) {
        // compare particle i and j
        // calculate forces
    }
}
This is generally not a preferred way to do the force calculations, as the number of operations in this algorithm scales with the square of the number of particles. But it’s useful for playing around with manual vectorization.
The base case we will use is this:
void sweep_base_sgl(const int nelem, float* a, float* b) {
    for (int i = 0; i < nelem-1; i += 8) {
        for (int j = i+1; j < nelem; ++j) {
            float tmp = a[i]-a[j];
            b[i] += tmp;
            b[j] -= tmp;
        }
    }
}
i is incremented by 8 to simplify vectorization. Swap out float with double to get the double precision version.
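For instance, a sketch of the double precision base version (matching the sweep_base_dbl name that appears in the benchmark tables below) would be:
#include <cstddef>

// Double precision base version, obtained by swapping float for double
// as described above. Same simplifications as the single precision code.
void sweep_base_dbl(const int nelem, double* a, double* b) {
    for (int i = 0; i < nelem-1; i += 8) {
        for (int j = i+1; j < nelem; ++j) {
            double tmp = a[i]-a[j];
            b[i] += tmp;
            b[j] -= tmp;
        }
    }
}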
It is important to note that this loop cannot be automatically vectorized by g++. For example, if we compile our code with -O3 -fopt-info-vec-missed, -O3 will try to vectorize the loop, and the other option will report when it cannot vectorize a given loop. At compilation, I get some important info in the output:
sweep.cpp:51:27: missed: couldn't vectorize loop
sweep.cpp:51:27: missed: not vectorized: multiple nested loops.
sweep.cpp:51:27: missed: couldn't vectorize loop
sweep.cpp:51:27: missed: not vectorized: inner-loop count not invariant.
sweep.cpp:52:31: missed: couldn't vectorize loop
This explains that the compiler missed the vectorization of these loops because:
- the loops are nested, and
- the inner loop's iteration count depends on the outer loop variable, j starting at i+1, so it is not invariant.
Hence, this is a meaningful candidate for manual vectorization.
Why aligned store and load functions can't be used #
Recall what is required for the aligned store and load functions to work: the memory address being accessed must be aligned on vector-width boundaries. In the nested loop considered here, i starts at element 0, but j starts at i + 1, which would be j = 1 when i = 0. So while the address of a[0] would be aligned to a vector width, the next element, a[1], would not be. Consequently, once the inner loop starts, the aligned load/store functions would fail with a segmentation fault. More generally, the outer loop would typically be incrementing by 1, which would mean accesses to a would mostly be unaligned.
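As a minimal illustration (the _mm_malloc allocation here is an assumption for the example, not part of the sweep code), the aligned load works for a[0] but not for a[1]:
#include <immintrin.h>

int main() {
    // Allocation aligned to 16 bytes (the SSE vector width).
    float* a = static_cast<float*>(_mm_malloc(16 * sizeof(float), 16));
    for (int k = 0; k < 16; ++k) a[k] = static_cast<float>(k);

    __m128 v0 = _mm_load_ps(&a[0]);   // aligned address: fine
    // __m128 v1 = _mm_load_ps(&a[1]); // misaligned by 4 bytes: faults
    __m128 v1 = _mm_loadu_ps(&a[1]);  // unaligned load: works anywhere

    (void)v0; (void)v1;
    _mm_free(a);
    return 0;
}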
Doing the Vectorization #
Vectorizing the direct pair sweep with 128-bit vectors (SSE instructions) #
This attempt to vectorize the loop broadcasts the outer-loop element into a vector whose lanes all hold the same value, and then the inner loop loads multiple sequential elements to operate on against that broadcast vector. This looks like:
void sweep_128_sgl(const int nelem, float* a, float* b) {
    float tmp[4];
    for (int i = 0; i < nelem-1; i += 8) {
        __m128 vi = _mm_set1_ps(a[i]);
        for (int j = i+1; j < nelem; j += 4) {
            __m128 vj = _mm_loadu_ps(&a[j]);
            __m128 vdiff = _mm_sub_ps(vi, vj);
            __m128 vbj = _mm_loadu_ps(&b[j]);
            vbj = _mm_sub_ps(vbj, vdiff);
            _mm_storeu_ps(&b[j], vbj);
            _mm_storeu_ps(tmp, vdiff);
            b[i] += tmp[0]+tmp[1]+tmp[2]+tmp[3];
        }
    }
}
Now this code looks more complicated than the ABC addition example used earlier.
- __m128 vi = _mm_set1_ps(a[i]); creates a 128-bit vector holding 4 single precision elements, all initialized with the value at a[i]. This vector is used for the entire duration of the inner loop.
- __m128 vj = _mm_loadu_ps(&a[j]); loads the 4 consecutive single precision elements starting at the location of a[j].
- __m128 vdiff = _mm_sub_ps(vi, vj); subtracts the vj vector from the vi vector and stores the result in the vdiff vector. This is equivalent to the tmp = a[i]-a[j]; step in the base code.
- __m128 vbj = _mm_loadu_ps(&b[j]); loads the 4 consecutive elements starting at the location of b[j], so that the vdiff vector can be subtracted from them.
- vbj = _mm_sub_ps(vbj, vdiff); subtracts vdiff from vbj. This is equivalent to b[j] -= tmp; in the base code.
- _mm_storeu_ps(&b[j], vbj); takes the calculated vector and places it back at the location of b[j].
- _mm_storeu_ps(tmp, vdiff); stores the calculated difference vector in a temporary array of 4 elements, so that the elements of the vector can be serially added to b[i].
- b[i] += tmp[0]+tmp[1]+tmp[2]+tmp[3]; serially adds the 4 elements of the difference vector to b[i]. This is equivalent to b[i] += tmp; in the base code.
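Put differently, one pass through the vectorized inner-loop body computes the same thing as this scalar sketch (illustrative only; the helper name is hypothetical, and it assumes j+3 < nelem so all four lanes are valid):
// Scalar equivalent of one 128-bit inner-loop iteration.
static void pair_step_scalar(int i, int j, const float* a, float* b) {
    for (int lane = 0; lane < 4; ++lane) {
        float diff = a[i] - a[j + lane];  // tmp = a[i]-a[j]
        b[j + lane] -= diff;              // b[j] -= tmp
        b[i] += diff;                     // b[i] += tmp
    }
}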
The 256-bit vector version can be obtained by replacing __m128 types with __m256, and _mm function prefixes with _mm256. Double precision versions are obtained by adding d to the vector types, and replacing ps with pd in the function suffixes.
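Applying those substitutions, a sketch of the 256-bit single precision version (assuming AVX support, and with the same simplifications as the 128-bit code) would look like:
void sweep_256_sgl(const int nelem, float* a, float* b) {
    float tmp[8];
    for (int i = 0; i < nelem-1; i += 8) {
        __m256 vi = _mm256_set1_ps(a[i]);       // broadcast a[i] to 8 lanes
        for (int j = i+1; j < nelem; j += 8) {
            __m256 vj = _mm256_loadu_ps(&a[j]);
            __m256 vdiff = _mm256_sub_ps(vi, vj);
            __m256 vbj = _mm256_loadu_ps(&b[j]);
            vbj = _mm256_sub_ps(vbj, vdiff);
            _mm256_storeu_ps(&b[j], vbj);
            _mm256_storeu_ps(tmp, vdiff);
            b[i] += tmp[0]+tmp[1]+tmp[2]+tmp[3]
                  + tmp[4]+tmp[5]+tmp[6]+tmp[7];
        }
    }
}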
Performance of the manually vectorized sweep function with -O2 optimizations #
Here, compiler optimizations are used as it has been established that automatic vectorization doesn’t work for this code, and these optimizations are needed to get the best out of the manual vectorization. The tests used here are much smaller due to the long duration of this sweep algorithm. At worst, the loaded arrays are stored within L2 cache.
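The comparison tables below are in the format produced by Google Benchmark's compare tooling; the actual harness isn't shown in this section, but a minimal sketch of how such a benchmark might be registered (the benchmark name and setup here are assumptions) is:
#include <benchmark/benchmark.h>
#include <vector>

void sweep_base_sgl(const int nelem, float* a, float* b);

// Assumed registration for the base sweep; the real harness may differ.
static void bm_sweep_base_sgl(benchmark::State& state) {
    const int nelem = static_cast<int>(state.range(0));
    std::vector<float> a(nelem, 1.0f), b(nelem, 0.0f);
    for (auto _ : state) {
        sweep_base_sgl(nelem, a.data(), b.data());
        benchmark::ClobberMemory();  // keep the stores from being optimized away
    }
}
BENCHMARK(bm_sweep_base_sgl)->RangeMultiplier(2)->Range(4096, 32768);
BENCHMARK_MAIN();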
Comparing sweep_base_sgl (from results/sweep1d_base_sgl.json) to sweep_128_sgl (from results/sweep1d_128_sgl.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
----------------------------------------------------------------------------------------------------------
[sweep_base_sgl vs. sweep_128_sgl]/4096 -0.7448 -0.7448 2837605 724074 2830970 722381
[sweep_base_sgl vs. sweep_128_sgl]/8192 -0.7433 -0.7433 11412353 2929379 11385665 2922530
[sweep_base_sgl vs. sweep_128_sgl]/16384 -0.7411 -0.7411 45808257 11859336 45701186 11831605
[sweep_base_sgl vs. sweep_128_sgl]/32768 -0.7412 -0.7412 183443120 47483586 183014137 47372294
OVERALL_GEOMEAN
The manual vectorization combined with the compiler optimizations is about 3.9x faster than the code without manual vectorization for 128-bit vectors. Near-optimal speedup is also obtained for 256-bit vectors, at ~7.3x (see table below).
Comparing sweep_base_sgl (from results/sweep1d_base_sgl.json) to sweep_256_sgl (from results/sweep1d_256_sgl.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
----------------------------------------------------------------------------------------------------------
[sweep_base_sgl vs. sweep_256_sgl]/4096 -0.8603 -0.8603 2837605 396449 2830970 395517
[sweep_base_sgl vs. sweep_256_sgl]/8192 -0.8623 -0.8623 11412353 1571707 11385665 1568032
[sweep_base_sgl vs. sweep_256_sgl]/16384 -0.8589 -0.8589 45808257 6462080 45701186 6446971
[sweep_base_sgl vs. sweep_256_sgl]/32768 -0.8454 -0.8454 183443120 28365528 183014137 28299199
OVERALL_GEOMEAN -0.8569 -0.8569 0 0 0 0
Although not shown here, without the compiler optimizations the speedup of the manually vectorized code is at best ~1.5x for 128-bit vectors, and ~2.7x for 256-bit vectors.
The double precision versions of the code also obtain close to optimal speedups: ~1.95x with 128-bit vectors (out of an optimal 2x) and, like the single precision version, very close to optimal with 256-bit vectors (~3.78x out of 4x).
Comparing sweep_base_dbl (from results/sweep1d_base_dbl.json) to sweep_128_dbl (from results/sweep1d_128_dbl.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
----------------------------------------------------------------------------------------------------------
[sweep_base_dbl vs. sweep_128_dbl]/4096 -0.4898 -0.4898 2837990 1448083 2831355 1444697
[sweep_base_dbl vs. sweep_128_dbl]/8192 -0.4868 -0.4868 11418819 5860700 11392117 5846937
[sweep_base_dbl vs. sweep_128_dbl]/16384 -0.4857 -0.4857 45805303 23556194 45698184 23501027
[sweep_base_dbl vs. sweep_128_dbl]/32768 -0.4882 -0.4882 184497440 94427345 184065995 94206550
OVERALL_GEOMEAN -0.4876 -0.4876 0 0 0 0
Comparing sweep_base_dbl (from results/sweep1d_base_dbl.json) to sweep_256_dbl (from results/sweep1d_256_dbl.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
----------------------------------------------------------------------------------------------------------
[sweep_base_dbl vs. sweep_256_dbl]/4096 -0.7354 -0.7354 2837990 750796 2831355 749041
[sweep_base_dbl vs. sweep_256_dbl]/8192 -0.7337 -0.7337 11418819 3040832 11392117 3033721
[sweep_base_dbl vs. sweep_256_dbl]/16384 -0.7331 -0.7331 45805303 12227361 45698184 12198775
[sweep_base_dbl vs. sweep_256_dbl]/32768 -0.7253 -0.7253 184497440 50679024 184065995 50560536
OVERALL_GEOMEAN -0.7319 -0.7319 0 0 0 0
While not shown here, without the compiler optimizations the double precision manually vectorized code is up to 30% slower with 128-bit vectors, and at best 1.62x faster with 256-bit vectors.
You may have noticed that the incrementing of b[i] looks like it's serialized, yet near-optimal performance is obtained in the tests. This is because g++ and clang++ will automatically convert b[i] += tmp[0] + ...; to assembly using SIMD instructions. I couldn't figure out a way to prevent the compilers from doing this without also making the Intel intrinsic functions unavailable. In principle, the serialization of the update of b[i] would prevent optimal performance, and the manually vectorized reductions discussed here and here would be needed to improve performance.
2D version #
A meaningful version of this sweep function would require at least 2 dimensions, as few engineering problems occur in 1D. Updating the sweep function to work in 2D:
void sweep_base_sgl(const int nelem, float* ax, float* ay, float* bx, float* by) {
    for (int i = 0; i < nelem-1; i += 8) {
        for (int j = i+1; j < nelem; ++j) {
            float tmp = ax[i]-ax[j];
            bx[i] += tmp;
            bx[j] -= tmp;
            tmp = ay[i]-ay[j];
            by[i] += tmp;
            by[j] -= tmp;
        }
    }
}
And the corresponding manually vectorized version (128-bit vectors, single precision data):
void sweep_128_sgl(const int nelem, float* ax, float* ay, float* bx, float* by) {
    float tmp[4];
    for (int i = 0; i < nelem-1; i += 8) {
        __m128 vix = _mm_set1_ps(ax[i]);
        __m128 viy = _mm_set1_ps(ay[i]);
        for (int j = i+1; j < nelem; j += 4) {
            __m128 vj = _mm_loadu_ps(&ax[j]);
            __m128 vdiff = _mm_sub_ps(vix, vj);
            __m128 vbj = _mm_loadu_ps(&bx[j]);
            vbj = _mm_sub_ps(vbj, vdiff);
            _mm_storeu_ps(&bx[j], vbj);
            _mm_storeu_ps(tmp, vdiff);
            bx[i] += tmp[0]+tmp[1]+tmp[2]+tmp[3];
            vj = _mm_loadu_ps(&ay[j]);
            vdiff = _mm_sub_ps(viy, vj);
            vbj = _mm_loadu_ps(&by[j]);
            vbj = _mm_sub_ps(vbj, vdiff);
            _mm_storeu_ps(&by[j], vbj);
            _mm_storeu_ps(tmp, vdiff);
            by[i] += tmp[0]+tmp[1]+tmp[2]+tmp[3];
        }
    }
}
This differs from the 1D version by basically duplicating the code to work on both ax/bx and ay/by.
Performance of the 2D version #
Performance of the 128-bit, single precision version:
Comparing sweep_base_sgl (from results/sweep2d_base_sgl.json) to sweep_128_sgl (from results/sweep2d_128_sgl.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
----------------------------------------------------------------------------------------------------------
[sweep_base_sgl vs. sweep_128_sgl]/4096 -0.7374 -0.7374 2872377 754223 2865660 752460
[sweep_base_sgl vs. sweep_128_sgl]/8192 -0.7342 -0.7342 11506013 3057812 11479107 3050661
[sweep_base_sgl vs. sweep_128_sgl]/16384 -0.7311 -0.7311 46017767 12372116 45909943 12343182
[sweep_base_sgl vs. sweep_128_sgl]/32768 -0.7296 -0.7296 184173481 49800358 183742778 49683899
OVERALL_GEOMEAN -0.7331 -0.7331 0 0 0 0
and the double precision version:
Comparing sweep_base_sgl (from results/sweep2d_base_sgl.json) to sweep_128_dbl (from results/sweep2d_128_dbl.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------
[sweep_base_sgl vs. sweep_128_dbl]/4096 -0.4732 -0.4732 2872377 1513260 2865660 1509721
[sweep_base_sgl vs. sweep_128_dbl]/8192 -0.4439 -0.4439 11506013 6397967 11479107 6383008
[sweep_base_sgl vs. sweep_128_dbl]/16384 -0.4514 -0.4514 46017767 25243093 45909943 25184062
[sweep_base_sgl vs. sweep_128_dbl]/32768 -0.4497 -0.4497 184173481 101358663 183742778 101121153
OVERALL_GEOMEAN -0.4547 -0.4547 0 0 0 0
The single precision version gets a max speedup of 3.8x and the double precision version gets 1.9x, which are both pretty close to optimal! Now what about the 256-bit version?
single precision:
Comparing sweep_base_sgl (from results/sweep2d_base_sgl.json) to sweep_256_sgl (from results/sweep2d_256_sgl.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
----------------------------------------------------------------------------------------------------------
[sweep_base_sgl vs. sweep_256_sgl]/4096 -0.7104 -0.7104 2872377 831729 2865660 829784
[sweep_base_sgl vs. sweep_256_sgl]/8192 -0.7079 -0.7079 11506013 3361005 11479107 3353127
[sweep_base_sgl vs. sweep_256_sgl]/16384 -0.7016 -0.7016 46017767 13732233 45909943 13700120
[sweep_base_sgl vs. sweep_256_sgl]/32768 -0.6624 -0.6624 184173481 62176512 183742778 62031113
OVERALL_GEOMEAN -0.6962 -0.6962 0 0 0 0
and double precision:
Comparing sweep_base_sgl (from results/sweep2d_base_sgl.json) to sweep_256_dbl (from results/sweep2d_256_dbl.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------
[sweep_base_sgl vs. sweep_256_dbl]/4096 -0.4926 -0.4926 2872377 1457335 2865660 1453927
[sweep_base_sgl vs. sweep_256_dbl]/8192 -0.4868 -0.4868 11506013 5905197 11479107 5891388
[sweep_base_sgl vs. sweep_256_dbl]/16384 -0.4857 -0.4857 46017767 23664846 45909943 23609391
[sweep_base_sgl vs. sweep_256_dbl]/32768 -0.4381 -0.4381 184173481 103495512 183742778 103253479
OVERALL_GEOMEAN -0.4762 -0.4762 0 0 0 0
Now the speedup has barely changed, and has, in fact, gotten slower for the single precision version! Profiling this code reveals that a big chunk of time is spent storing the vdiff vector to the tmp array and then serially adding the 8 (single precision) or 4 (double precision) elements to bx[i]/by[i].
This store-and-add combination:
_mm256_storeu_ps(tmp, vdiff);
bx[i] += tmp[0]+tmp[1]+tmp[2]+tmp[3]+tmp[4]+tmp[5]+tmp[6]+tmp[7];
can be replaced with something like
bx[i] += reduce_256_sgl(vdiff);
where reduce_256_sgl uses the function definitions:
static inline __attribute__((always_inline))
float reduce_128_sgl(__m128 a) {
    __m128 shuf = _mm_movehdup_ps(a);   // duplicate odd lanes: [a1, a1, a3, a3]
    __m128 sums = _mm_add_ps(a, shuf);  // [a0+a1, ., a2+a3, .]
    shuf = _mm_movehl_ps(shuf, sums);   // move a2+a3 into the low lane
    sums = _mm_add_ss(sums, shuf);      // a0+a1+a2+a3 in the low lane
    return _mm_cvtss_f32(sums);
}
static inline __attribute__((always_inline))
float reduce_256_sgl(__m256 a) {
    __m128 a_lo = _mm256_castps256_ps128(a);    // lower 4 elements
    __m128 a_hi = _mm256_extractf128_ps(a, 1);  // upper 4 elements
    a_lo = _mm_add_ps(a_lo, a_hi);              // pairwise sum of the halves
    return reduce_128_sgl(a_lo);                // reduce the remaining 128 bits
}
where inline __attribute__((always_inline)) ensures that the compiler inlines the function no matter what. The double precision version would be that described in the faster sum reductions page. The performance results from using these functions for the single precision version:
Comparing sweep_base_sgl (from results/sweep2d_base_sgl.json) to sweep_256_sgl_reduce2 (from results/sweep2d_256_sgl_reduce2.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
-------------------------------------------------------------------------------------------------------------------
[sweep_base_sgl vs. sweep_256_sgl_reduce2]/4096 -0.8014 -0.8014 2872377 570401 2865660 569065
[sweep_base_sgl vs. sweep_256_sgl_reduce2]/8192 -0.7940 -0.7940 11506013 2370599 11479107 2365055
[sweep_base_sgl vs. sweep_256_sgl_reduce2]/16384 -0.7642 -0.7642 46017767 10848874 45909943 10823504
[sweep_base_sgl vs. sweep_256_sgl_reduce2]/32768 -0.7883 -0.7883 184173481 38997633 183742778 38906450
OVERALL_GEOMEAN -0.7874 -0.7874 0 0 0 0
which raises the speedup from ~3.5x to ~5x! Not optimal, but a big improvement, and potentially sufficient motivation for the 256-bit manual vectorization. The double precision results show a similar improvement, raising the speedup from ~2x to ~2.4x.
Comparing sweep_base_sgl (from results/sweep2d_base_sgl.json) to sweep_256_dbl_reduce2 (from results/sweep2d_256_dbl_reduce2.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
------------------------------------------------------------------------------------------------------------------
[sweep_base_sgl vs. sweep_256_dbl_reduce2]/4096 -0.5912 -0.5912 2872377 1174242 2865660 1171496
[sweep_base_sgl vs. sweep_256_dbl_reduce2]/8192 -0.5372 -0.5372 11506013 5324447 11479107 5311997
[sweep_base_sgl vs. sweep_256_dbl_reduce2]/16384 -0.5334 -0.5334 46017767 21473573 45909943 21423354
[sweep_base_sgl vs. sweep_256_dbl_reduce2]/32768 -0.5680 -0.5680 184173481 79555471 183742778 79369428
OVERALL_GEOMEAN -0.5581 -0.5581 0 0 0 0
Performance of the 3D version #
For the sake of completeness, the 3D performance numbers are included too. Hopefully, from the 2D version, it is clear that the 3D code just duplicates the per-dimension work a third time. The results are summarized below for brevity.
Case            | Min Speedup Over Base (g++ 10.3) | Max Speedup Over Base (g++ 10.3) | Min Speedup Over Base (clang++ 12.0.1) | Max Speedup Over Base (clang++ 12.0.1)
----------------|----------------------------------|----------------------------------|----------------------------------------|---------------------------------------
128_sgl_reduce0 | 2.24 | 2.61 | 2.66 | 3.24
128_sgl_reduce1 | 2.35 | 2.59 | 2.69 | 3.24
128_sgl_reduce2 | 2.43 | 2.68 | 2.76 | 3.35
128_dbl_reduce0 | 1.32 | 1.38 | 1.42 | 1.69
128_dbl_reduce1 | 1.33 | 1.38 | 1.48 | 1.68
128_dbl_reduce2 | 1.34 | 1.38 | 1.42 | 1.53
256_sgl_reduce0 | 1.97 | 2.18 | 2.55 | 2.80
256_sgl_reduce1 | 2.56 | 2.79 | 3.05 | 3.49
256_sgl_reduce2 | 2.59 | 2.87 | 2.94 | 3.64
256_dbl_reduce0 | 1.44 | 1.49 | 1.49 | 1.76
256_dbl_reduce1 | 1.45 | 1.50 | 1.69 | 1.85
256_dbl_reduce2 | 1.45 | 1.49 | 1.75 | 1.91
The 3D version shows significantly less speedup compared to its 2D and 1D counterparts with both the g++ and clang++ compilers. The best speedups obtained are ~3.6x and ~1.9x for the single and double precision versions, respectively. The speedup from using 256-bit vectors is comparable to using 128-bit vectors. But despite the lesser performance of the 3D version, it is still probably worth manually vectorizing this algorithm, given the inability of the Clang and GNU compilers to automatically vectorize the nested loop (at the time of writing).
Completing the Sweep #
The sweep algorithm discussed so far takes shortcuts to avoid dealing with "residuals" - the leftover iterations. Just for reference, the double precision 256-bit manually vectorized version of a full sweep, taking care of the residuals, is shown below:
void sweep_256_dbl_reduce2(const int nelem, double* ax, double* ay, double* az, double* bx, double* by, double* bz) {
    for (int i = 0; i < nelem-1; i += 8) {
        // loading ith value into vector
        __m256d vix = _mm256_set1_pd(ax[i]);
        __m256d viy = _mm256_set1_pd(ay[i]);
        __m256d viz = _mm256_set1_pd(az[i]);
        int len = nelem-(i+1); // number of iterations in inner loop
        int end = (len % 4 == 0) ? nelem : nelem - 4; // truncating inner loop
        int j; // declaring counter variable
        for (j = i+1; j < end; j += 4) {
            __m256d vj = _mm256_loadu_pd(&ax[j]);
            __m256d vdiff = _mm256_sub_pd(vix, vj);
            __m256d vbj = _mm256_loadu_pd(&bx[j]);
            vbj = _mm256_sub_pd(vbj, vdiff);
            _mm256_storeu_pd(&bx[j], vbj);
            bx[i] += reduce_256_dbl(vdiff);
            vj = _mm256_loadu_pd(&ay[j]);
            vdiff = _mm256_sub_pd(viy, vj);
            vbj = _mm256_loadu_pd(&by[j]);
            vbj = _mm256_sub_pd(vbj, vdiff);
            _mm256_storeu_pd(&by[j], vbj);
            by[i] += reduce_256_dbl(vdiff);
            vj = _mm256_loadu_pd(&az[j]);
            vdiff = _mm256_sub_pd(viz, vj);
            vbj = _mm256_loadu_pd(&bz[j]);
            vbj = _mm256_sub_pd(vbj, vdiff);
            _mm256_storeu_pd(&bz[j], vbj);
            bz[i] += reduce_256_dbl(vdiff);
        }
        // taking care of residual iterations from inner loop
        for (; j < nelem; ++j) {
            double tmp = ax[i]-ax[j];
            bx[i] += tmp;
            bx[j] -= tmp;
            tmp = ay[i]-ay[j];
            by[i] += tmp;
            by[j] -= tmp;
            tmp = az[i]-az[j];
            bz[i] += tmp;
            bz[j] -= tmp;
        }
    }
}
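The reduce_256_dbl helper used above is the horizontal sum described on the faster sum reductions page and isn't reproduced in this section; a minimal sketch consistent with the single precision reductions shown earlier (an assumption, not necessarily the exact version used for the benchmarks) would be:
static inline __attribute__((always_inline))
double reduce_256_dbl(__m256d a) {
    __m128d a_lo = _mm256_castpd256_pd128(a);     // lower 2 elements
    __m128d a_hi = _mm256_extractf128_pd(a, 1);   // upper 2 elements
    a_lo = _mm_add_pd(a_lo, a_hi);                // pairwise sum of the halves
    __m128d shuf = _mm_unpackhi_pd(a_lo, a_lo);   // broadcast the upper lane
    return _mm_cvtsd_f64(_mm_add_sd(a_lo, shuf)); // final scalar sum
}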
The approach to dealing with the residual iterations is intentionally naive, as handling them with SIMD intrinsics (e.g. blend instructions) is a little tricky to implement efficiently. And given that, at most, only a few iterations of the inner loop take the residual path, it's probably not worth the added complexity for a likely modest performance gain.
The corresponding performance (compiled with g++):
Comparing sweep_base_dbl (from results/sweep3dfull_base_dbl.json) to sweep_256_dbl_reduce2 (from results/sweep3dfull_256_dbl_reduce2.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
----------------------------------------------------------------------------------------------------------------------
[sweep_base_dbl vs. sweep_256_dbl_reduce2]/4096 -0.3532 -0.3530 23663467 15305544 23599375 15269741
[sweep_base_dbl vs. sweep_256_dbl_reduce2]/8192 -0.2939 -0.2936 102167385 72144190 101891762 71975441
[sweep_base_dbl vs. sweep_256_dbl_reduce2]/16384 -0.3422 -0.3421 411281918 270557558 410259459 269924670
[sweep_base_dbl vs. sweep_256_dbl_reduce2]/32768 -0.4680 -0.4678 1838934215 978258653 1833970828 975969961
OVERALL_GEOMEAN -0.3677 -0.3675 0 0 0 0
The speedup varies significantly, but is roughly similar to the simplified 3D version shown previously. Interestingly, the speedup of the largest test case is quite good, for reasons unknown to me. This effect is present with both g++ and clang++.