int get_nbits(int x) { int cnt = 0; while(x) { cnt += x & 1; x >>= 1; } return cnt; }
It's rough and not efficient. Any ways to make it faster? Couple of them (and, yea - give more of 'em in comments).
boost 1)
Look here: we have a number which is a power of 2. Such numbers has a great attribute: if you perform '&' with number-1, you'll always get a zero. This is because you always have "ones" in place you get zero in power-of-two. Look at this: number 16 and 16-1=15:10000 & 01111 ------- 0So what does it means? You always eliminate the lower 'one' bit. And example above may be easily rewritten as:
int count_bits(int x) { int cnt = 0; while (x) { x &= (x-1); cnt++; } return cnt; }
So, now we will perform no more than "number-of-ones" loops. Ok, but still doesn't efficient.
boost 2)
Why we should count a bits using N loops; we have to precompute 'em instead! So, use precomputed table for bytes, we may rewrite our example as follows:static unsigned char nbits[256] = { /* 0 - 15 */ 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, /* 16 - 31 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, /* 32 - 47 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, /* 48 - 63 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, /* 64 - 79 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, /* 80 - 95 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, /* 96 - 111 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, /* 112 - 127 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, /* 128 - 143 */ 1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5, /* 144 - 159 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, /* 160 - 175 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, /* 176 - 191 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, /* 192 - 207 */ 2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6, /* 208 - 223 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, /* 224 - 239 */ 3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7, /* 240 - 255 */ 4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8 }; int count_bits(unsigned x) { return nbits[x & 0xFFU] + nbits[(x >> 8) & 0xFFU] + nbits[(x >> 16) & 0xFFU] + nbits[x >> 24] }
boost 3) Using POPCNT from Intel SSE4.2 command set
You may find brief description and links here.boost 3.2
GCC has set of built-in commands which expands to CPU-instructions if present, or use software (but efficient) workarounds. As final example, I prefer to use this code:// // Uses SSE4 'POPCNT' instruction if present, or gcc-stub like 'popcount' // templateint popcnt_modern(T); template<> int popcnt_modern(unsigned x) { return __builtin_popcount(x); } template<> int popcnt_modern(unsigned long x) { return __builtin_popcountl(x); } template<> int popcnt_modern(unsigned long long x) { return __builtin_popcountll(x); }
Although It seems what gcc generates inefficient code unless your CPU provide SSE4.2. Hard to provide numbers here, but trust - code with byte-table is slightly faster.
No comments:
Post a Comment