Vectorizing Cell-based Pair Search #
To calculate the acceleration of SPH particles, one must first find the interacting pairs. When incompressible or weakly-compressible assumptions are used, and the particles’ kernel radius is fixed, the cell-linked list strategy is often employed. The algorithm follows the one described on one of my other pages, written in Fortran.
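For context, the sweep below assumes the particles have already been sorted by their grid-cell hash, so that gridhash[k] is the cell index of sorted particle k and starts[h] is the index of the first particle in cell h. A minimal sketch of how starts might be built (a counting-sort prefix sum; the function name and ncells are assumptions for illustration, not taken from the original code):

#include <vector>

// Hypothetical construction of starts from a sorted gridhash, for illustration.
// starts[h] ends up as the index of the first particle whose cell hash is h.
// Note: the sweep below also reads starts[h+2] and starts[h+3], so in practice
// starts may need a few padding entries past the last cell.
void build_starts(const std::vector<int>& gridhash, int ncells,
                  std::vector<int>& starts) {
    starts.assign(ncells + 2, 0);
    for (int hash : gridhash)
        ++starts[hash + 1];          // count particles per cell
    for (int h = 0; h + 1 < (int)starts.size(); ++h)
        starts[h + 1] += starts[h];  // prefix sum -> first index per cell
}

With that structure in place, below is the C++ version of the main sweep code.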
int niac = 0; // running count of interacting pairs
// outer loop, iterating over each grid cell
for (int hashi = gridhash[0]; hashi <= gridhash[nelem-1]; ++hashi) {
    // loop over particles i in the current cell
    for (int ii = starts[hashi]; ii < starts[hashi+1]; ++ii) {
        // candidate j particles in the current cell (after i) and the next cell
        for (int jj = ii+1; jj < starts[hashi+2]; ++jj) {
            sph::dr(x[ii], x[jj], y[ii], y[jj], dx[niac], dy[niac], r[niac]);
            pair_i[niac] = ii;
            pair_j[niac] = jj;
            niac += r[niac] < kh; // keep the pair only if within the kernel radius
        }
        // candidate j particles in the three cells of the adjacent column
        int hashbotleft = hashi-ngridy-1;
        for (int jj = starts[hashbotleft]; jj < starts[hashbotleft+3]; ++jj) {
            sph::dr(x[ii], x[jj], y[ii], y[jj], dx[niac], dy[niac], r[niac]);
            pair_i[niac] = ii;
            pair_j[niac] = jj;
            niac += r[niac] < kh;
        }
    }
}
Some distinctions between this code and the Fortran code are that the positions are separated into two std::vectors, and that the relative position vectors and magnitudes are stored in std::vectors rather than being discarded immediately (the calculation is performed in the sph::dr function). This is useful because these values can be reused later to calculate kernel values, kernel gradients, and the particles’ accelerations.
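sph::dr itself is not shown here, but based on how it is called, a minimal sketch might look like the following (an assumption inferred from the call sites, not the actual implementation):

#include <cmath>

namespace sph {
// Hypothetical sketch: compute the relative position vector (dxout, dyout)
// and its magnitude rout for a single pair, writing into the output references.
inline void dr(double xi, double xj, double yi, double yj,
               double& dxout, double& dyout, double& rout) {
    dxout = xi - xj;
    dyout = yi - yj;
    rout = std::sqrt(dxout * dxout + dyout * dyout);
}
}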
Vectorizing the code with AVX512 instructions #
/* NB: cutoff is selected as being 2.4x the average particle spacing
and particles are arranged randomly in a unit square */
int niac = 0;
__m512d vdx, vdy, vdr; // 512-bit vectors for the relative position components and magnitude
// vector of lane indices 0..7, used below to build the j indices for pair_j
__m256i zero2seven = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
// outer loop, iterating over each grid cell
for (int hashi = gridhash[0]; hashi <= gridhash[nelem-1]; ++hashi) {
    // inner loop, iterating over particles in the given grid cell
    for (int ii = starts[hashi]; ii < starts[hashi+1]; ++ii) {
        // broadcast particle i's position to all lanes
        __m512d vix = _mm512_set1_pd(x[ii]);
        __m512d viy = _mm512_set1_pd(y[ii]);
        // calculate where the vectorized loop should end: back off by 8 when the trip
        // count is not a multiple of 8, so an 8-wide load never reads past starts[hashi+2]
        int len = starts[hashi+2] - (ii+1);
        int end = (len % 8 == 0) ? starts[hashi+2] : starts[hashi+2] - 8;
        int jj;
        // looping over j particles (vectorized)
        for (jj = ii+1; jj < end; jj += 8) {
            // load j particles' positions
            __m512d vjx = _mm512_loadu_pd(&x[jj]);
            __m512d vjy = _mm512_loadu_pd(&y[jj]);
            // calculate relative position vectors and magnitude
            sph::vdr_512(vix, viy, vjx, vjy, vdx, vdy, vdr);
            // mask marking which elements of the __m512d vdr vector are less than the cutoff
            __mmask8 umask = _mm512_cmp_pd_mask(vdr, _mm512_set1_pd(cutoff), _CMP_LT_OQ);
            // use the mask to "compress" the vdr/vdx/vdy values into the lower elements of the vector
            _mm512_storeu_pd(&r[niac], _mm512_mask_compress_pd(_mm512_setzero_pd(), umask, vdr));
            _mm512_storeu_pd(&dx[niac], _mm512_mask_compress_pd(_mm512_setzero_pd(), umask, vdx));
            _mm512_storeu_pd(&dy[niac], _mm512_mask_compress_pd(_mm512_setzero_pd(), umask, vdy));
            // pair_i lanes are all equal to ii, so a plain store suffices; pair_j uses the same compress
            _mm256_storeu_epi32(&pair_i[niac], _mm256_set1_epi32(ii));
            __m256i vjdx = _mm256_add_epi32(_mm256_set1_epi32(jj), zero2seven); // translate relative index vector by jj
            _mm256_storeu_epi32(&pair_j[niac], _mm256_mask_compress_epi32(_mm256_setzero_si256(), umask, vjdx));
            // increment niac by the number of accepted pairs (set bits in umask)
            niac += __builtin_popcount(umask);
        }
        // perform residual calculations
        for (; jj < starts[hashi+2]; ++jj) {
            sph::dr(x[ii], x[jj], y[ii], y[jj], dx[niac], dy[niac], r[niac]);
            pair_i[niac] = ii;
            pair_j[niac] = jj;
            niac += r[niac] < kh;
        }
        // repeat above process for adjacent cells
        int hashbotleft = hashi-ngridy-1;
        len = starts[hashbotleft+3] - starts[hashbotleft];
        end = (len % 8 == 0) ? starts[hashbotleft+3] : starts[hashbotleft+3] - 8;
        for (jj = starts[hashbotleft]; jj < end; jj += 8) {
            __m512d vjx = _mm512_loadu_pd(&x[jj]);
            __m512d vjy = _mm512_loadu_pd(&y[jj]);
            sph::vdr_512(vix, viy, vjx, vjy, vdx, vdy, vdr);
            __mmask8 umask = _mm512_cmp_pd_mask(vdr, _mm512_set1_pd(cutoff), _CMP_LT_OQ);
            _mm512_storeu_pd(&r[niac], _mm512_mask_compress_pd(_mm512_setzero_pd(), umask, vdr));
            _mm512_storeu_pd(&dx[niac], _mm512_mask_compress_pd(_mm512_setzero_pd(), umask, vdx));
            _mm512_storeu_pd(&dy[niac], _mm512_mask_compress_pd(_mm512_setzero_pd(), umask, vdy));
            _mm256_storeu_epi32(&pair_i[niac], _mm256_set1_epi32(ii));
            __m256i vjdx = _mm256_add_epi32(_mm256_set1_epi32(jj), zero2seven);
            _mm256_storeu_epi32(&pair_j[niac], _mm256_mask_compress_epi32(_mm256_setzero_si256(), umask, vjdx));
            niac += __builtin_popcount(umask);
        }
        for (; jj < starts[hashbotleft+3]; ++jj) {
            sph::dr(x[ii], x[jj], y[ii], y[jj], dx[niac], dy[niac], r[niac]);
            pair_i[niac] = ii;
            pair_j[niac] = jj;
            niac += r[niac] < kh;
        }
    }
}
Here, I make use of 512-bit vectors to find pairs of particles, which requires AVX512F instructions. In addition to the 512-bit vectors themselves, AVX512F is needed for the _mm*_cmp_*_mask and _mm*_mask_compress_* instructions (the 256-bit compress used for the pair indices additionally requires AVX512VL). The latter family compresses selected elements into the lower elements of the resultant vector, based on a mask. That mask can be determined at runtime, which is generally not the case for the SSE*/AVX2 shuffle and blend instructions.
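To make that distinction concrete, here is a small illustrative sketch (not from the benchmarked code): the AVX2 blend control is an immediate baked in at compile time, whereas the AVX-512 compress mask is an ordinary runtime value, typically produced by a comparison.

#include <immintrin.h>

// AVX2: the blend control is an 8-bit immediate, fixed at compile time.
__m256d blend_fixed(__m256d a, __m256d b) {
    return _mm256_blend_pd(a, b, 0b0101); // control cannot depend on runtime data
}

// AVX-512: the compress mask is computed at runtime from the data itself.
__m512d compress_lt(__m512d v, double limit) {
    __mmask8 m = _mm512_cmp_pd_mask(v, _mm512_set1_pd(limit), _CMP_LT_OQ);
    return _mm512_mask_compress_pd(_mm512_setzero_pd(), m, v);
}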
The calculation of vdx/vdy/vdr follows a straightforward pattern: particle i's position is broadcast into a vector, and the relative position vectors and magnitudes are calculated against eight j particles at a time. pair_i is stored using the (constant within the iteration) ii value, and pair_j is stored using the same mask-and-compress pattern.
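As with sph::dr, sph::vdr_512 is not shown; a plausible sketch, inferred from its call sites and mirroring the worked example further down (an assumption, not the actual implementation):

#include <immintrin.h>

namespace sph {
// Hypothetical sketch: relative position vectors and magnitudes for 8 pairs at once.
inline void vdr_512(__m512d vix, __m512d viy, __m512d vjx, __m512d vjy,
                    __m512d& vdx, __m512d& vdy, __m512d& vdr) {
    vdx = _mm512_sub_pd(vix, vjx);
    vdy = _mm512_sub_pd(viy, vjy);
    // dr = sqrt(dx*dx + dy*dy), computing dx*dx + dy*dy with a fused multiply-add
    vdr = _mm512_sqrt_pd(_mm512_fmadd_pd(vdx, vdx, _mm512_mul_pd(vdy, vdy)));
}
}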
_mm512_cmp_pd_mask compares each element of its first argument (in this case, the __m512d vector of relative distances) with the corresponding element of its second argument, another vector (in this case, created by _mm512_set1_pd(cutoff)). The comparison predicate is given by the third argument, _CMP_LT_OQ, which means less than (ordered, non-signalling). In other words, each element of the vdr vector is checked for whether it is less than cutoff, and the result is stored in the mask umask.
To illustrate this a bit better, consider the simplified example of an iteration below:
// define cutoff
double cutoff = 1.;
// define i particle's position (0., 0.)
__m512d vix = _mm512_set1_pd(0.);
__m512d viy = _mm512_set1_pd(0.);
// define j particles' positions
__m512d vjx = _mm512_set_pd(-1., -0.75, -0.5, -0.25, 0.25, 0.5, 0.75, 1.);
__m512d vjy = _mm512_set_pd(-1., -0.75, -0.5, -0.25, 0.25, 0.5, 0.75, 1.);
// calculate dx, dy, dr
__m512d vdx = _mm512_sub_pd(vix, vjx); // vdx = [1., 0.75, 0.5, 0.25, -0.25, -0.5, -0.75, -1.]
__m512d vdy = _mm512_sub_pd(viy, vjy); // vdy = [1., 0.75, 0.5, 0.25, -0.25, -0.5, -0.75, -1.]
__m512d vdr = _mm512_add_pd(
    _mm512_mul_pd(vdx, vdx),
    _mm512_mul_pd(vdy, vdy)
); // vdr = [2., 1.125, 0.5, 0.125, 0.125, 0.5, 1.125, 2.]
vdr = _mm512_sqrt_pd(vdr); // vdr = [1.414, 1.061, 0.707, 0.354, 0.354, 0.707, 1.061, 1.414]
/* calculate mask
   each "1" bit in the 8-bit mask means the corresponding element satisfies the comparison.
   Below results in 0b00111100 */
__mmask8 umask = _mm512_cmp_pd_mask(vdr, _mm512_set1_pd(cutoff), _CMP_LT_OQ);
// "compress" the masked elements of vdr into the lower lanes, setting the rest to 0.
vdr = _mm512_mask_compress_pd(_mm512_setzero_pd(), umask, vdr);
double r[8];
_mm512_storeu_pd(r, vdr); // results in r = [0.707, 0.354, 0.354, 0.707, 0., 0., 0., 0.]
Performance relative to base #
Comparing pair_search (from results/pairsearch2D-avx512.json) to pair_search_avx512_512 (from results/pairsearch2D-avx512.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
---------------------------------------------------------------------------------------------------------------
[pair_search vs. pair_search_avx512_512]/4096 -0.4131 -0.4131 460838 270454 459760 269821
[pair_search vs. pair_search_avx512_512]/16384 -0.3575 -0.3575 2282786 1466726 2277422 1463294
[pair_search vs. pair_search_avx512_512]/65536 -0.3265 -0.3265 9946220 6698604 9922961 6682938
[pair_search vs. pair_search_avx512_512]/131072 -0.3073 -0.3073 20956765 14516657 20907746 14482700
OVERALL_GEOMEAN -0.4177 -0.4177 0 0 0 0
From the results above, we get a meaningful speedup of about 1.4-1.7x, albeit far from an ideal 8x speedup. A possible reason for the far-from-ideal speedup is the high ratio of residual to vectorized iterations. This is relevant because the inner loops are what is vectorized here, and residual iterations are incurred every time an inner loop is executed. In this case, where the cutoff is set to 2.4x the average particle spacing, each cell contains on average 2.4*2.4 = 5.76 particles (ignoring the edges of the square that the particles are contained in). This implies that the first inner (non-vectorized) loop will run for only 2*5.76 - 1 = 10.52 iterations on average, and the second loop for 3*5.76 = 17.28 iterations on average. These are estimates, and the real averages are probably lower, since cells near the edge of the domain will not have as many particles in them.
It’s also important to note that _mm*_mask_compress_* is very expensive relative to all the other intrinsics used here.
If I use 256-bit vectors instead, while keeping the same overall algorithm, I get slightly better speedups than with 512-bit vectors (see the comparison below). This somewhat supports the view that residual iterations matter, since smaller vectors leave fewer iterations to the residual loops. Note that the code needs AVX512VL instructions to make use of the 256-bit compress instructions.
Comparing pair_search (from results/pairsearch2D-avx512.json) to pair_search_avx512_256 (from results/pairsearch2D-avx512.json)
Benchmark Time CPU Time Old Time New CPU Old CPU New
---------------------------------------------------------------------------------------------------------------
[pair_search vs. pair_search_avx512_256]/4096 -0.4218 -0.4218 460838 266450 459760 265827
[pair_search vs. pair_search_avx512_256]/16384 -0.3588 -0.3588 2282786 1463624 2277422 1460201
[pair_search vs. pair_search_avx512_256]/65536 -0.3186 -0.3186 9946220 6776961 9922961 6761108
[pair_search vs. pair_search_avx512_256]/131072 -0.3183 -0.3183 20956765 14285855 20907746 14252428
OVERALL_GEOMEAN -0.4208 -0.4208 0 0 0 0
Note that using 128-bit vectors reduces the speedup in this case to below that of the 512-bit vectors. I find that the 256-bit vectors provide the most consistent speedups, regardless of cutoff.
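For reference, here is a sketch of what the 256-bit version of the vectorized j loop might look like (an illustration, not the benchmarked code; it assumes AVX512VL and that end has been computed as before, but for a stride of 4):

// 256-bit variant of the inner vectorized loop: 4 doubles per iteration,
// with the pair indices handled in 128-bit integer vectors
__m128i zero2three = _mm_set_epi32(3, 2, 1, 0);
__m256d vix = _mm256_set1_pd(x[ii]);
__m256d viy = _mm256_set1_pd(y[ii]);
for (jj = ii+1; jj < end; jj += 4) {
    __m256d vjx = _mm256_loadu_pd(&x[jj]);
    __m256d vjy = _mm256_loadu_pd(&y[jj]);
    __m256d vdx = _mm256_sub_pd(vix, vjx);
    __m256d vdy = _mm256_sub_pd(viy, vjy);
    __m256d vdr = _mm256_sqrt_pd(_mm256_fmadd_pd(vdx, vdx, _mm256_mul_pd(vdy, vdy)));
    // the 256-bit compare and compress intrinsics require AVX512VL
    __mmask8 umask = _mm256_cmp_pd_mask(vdr, _mm256_set1_pd(cutoff), _CMP_LT_OQ);
    _mm256_storeu_pd(&r[niac], _mm256_mask_compress_pd(_mm256_setzero_pd(), umask, vdr));
    _mm256_storeu_pd(&dx[niac], _mm256_mask_compress_pd(_mm256_setzero_pd(), umask, vdx));
    _mm256_storeu_pd(&dy[niac], _mm256_mask_compress_pd(_mm256_setzero_pd(), umask, vdy));
    _mm_storeu_si128((__m128i*)&pair_i[niac], _mm_set1_epi32(ii));
    __m128i vjdx = _mm_add_epi32(_mm_set1_epi32(jj), zero2three);
    _mm_storeu_si128((__m128i*)&pair_j[niac], _mm_mask_compress_epi32(_mm_setzero_si128(), umask, vjdx));
    niac += __builtin_popcount(umask);
}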
Possible future improvements #
Ideally, there would be a way to remove the need for the compress instructions altogether. This could potentially be achieved by calculating the distances of all potential pairs and then either sorting (e.g., with Intel’s avx512 sort) or filtering out the far pairs afterwards, although I’m unsure whether there are any SIMD implementations of the latter.
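To sketch the filtering idea (scalar, and using an array-of-structs layout for brevity; purely illustrative, with hypothetical names): store every candidate pair unconditionally, then drop the distant ones in a second pass.

#include <algorithm>
#include <vector>

// Hypothetical post-filter, for illustration: all candidate pairs are stored
// without compressing, and pairs beyond the cutoff are erased afterwards.
struct Pair { int i, j; double dx, dy, r; };

void filter_pairs(std::vector<Pair>& pairs, double cutoff) {
    pairs.erase(std::remove_if(pairs.begin(), pairs.end(),
                               [cutoff](const Pair& p) { return p.r >= cutoff; }),
                pairs.end());
}

This trades the per-iteration compress instructions for extra stores and a second pass over the candidate list, so whether it wins would depend on how many candidates survive the cutoff.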