dkrotx-prg: population count (POPCNT)

I'm wondering, some people don't know about primitive bits technique. For example, trying to count number of '1' bits in int may be done as follows:

int get_nbits(int x) {
  int cnt = 0;
  while(x) {
    cnt += x & 1;
    x >>= 1;
  }
  return cnt;
}

It's rough and not efficient. Any ways to make it faster? Couple of them (and, yea - give more of 'em in comments).

boost 1)

Look here: we have a number which is a power of 2. Such numbers has a great attribute: if you perform '&' with number-1, you'll always get a zero. This is because you always have "ones" in place you get zero in power-of-two.

Look at this: number 16 and 16-1=15:

  10000
& 01111
-------
      0

So what does it means? You always eliminate the lower 'one' bit. And example above may be easily rewritten as:

int count_bits(int x) {
  int cnt = 0;
  while (x) {
    x &= (x-1);
    cnt++; 
  }
  return cnt;
}

So, now we will perform no more than "number-of-ones" loops. Ok, but still doesn't efficient.

boost 2)

Why we should count a bits using N loops; we have to precompute 'em instead! So, use precomputed table for bytes, we may rewrite our example as follows:

static unsigned char nbits[256] = {
  /*   0 -  15 */ 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 
  /*  16 -  31 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 
  /*  32 -  47 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 
  /*  48 -  63 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /*  64 -  79 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 
  /*  80 -  95 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /*  96 - 111 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /* 112 - 127 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 
  /* 128 - 143 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, 
  /* 144 - 159 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /* 160 - 175 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /* 176 - 191 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 
  /* 192 - 207 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, 
  /* 208 - 223 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 
  /* 224 - 239 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, 
  /* 240 - 255 */ 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
};

int count_bits(unsigned x)
{
  return nbits[x & 0xFFU] + nbits[(x >> 8) & 0xFFU] + nbits[(x >> 16) & 0xFFU] + nbits[x >> 24]
}

boost 3) Using POPCNT from Intel SSE4.2 command set

You may find brief description and links here.

boost 3.2

GCC has set of built-in commands which expands to CPU-instructions if present, or use software (but efficient) workarounds. As final example, I prefer to use this code:

//
// Uses SSE4 'POPCNT' instruction if present, or gcc-stub like 'popcount'
//
template int popcnt_modern(T);
template<> int popcnt_modern(unsigned x) { return __builtin_popcount(x); }
template<> int popcnt_modern(unsigned long x) { return __builtin_popcountl(x); }
template<> int popcnt_modern(unsigned long long x) { return __builtin_popcountll(x); }

Although It seems what gcc generates inefficient code unless your CPU provide SSE4.2. Hard to provide numbers here, but trust - code with byte-table is slightly faster.

dkrotx-prg

Thursday, October 20, 2011

population count (POPCNT)

boost 1)

boost 2)

boost 3) Using POPCNT from Intel SSE4.2 command set

boost 3.2

No comments:

Post a Comment

Followers

Blog Archive