Sunday, January 29, 2012

Slow reading from std::cin

It's the third hour I've spent optimizing a program. All was OK, and I finally checked time(1) output: 9 seconds in userspace. Well done, nothing critical left to optimize. Then I ran it again, but another way: without the filename as a parameter (it's supposed to read from stdin in that case). time(1) showed me 21s this time! What?! It ran more than twice as slow as the previous run! Really, it depends on stdin?! I also tried another variant to check (maybe it's my system?):
$ cat big_file | time myprogram ... /dev/stdin
9.18user 0.11system 0:09.32elapsed 99%CPU (0avgtext+0avgdata 6608maxresident)k
0inputs+8624outputs (0major+471minor)pagefaults 0swaps
Hmm... now it's fine. But now reading goes through std::ifstream, not std::cin. I checked my concern with google-perftools, and what I saw: nearly 50% of the time was spent in calls to std::getline(). Here are the top 10 functions by google-perftools:
     908  41.6%  41.6%      908  41.6% gogo::BlacklistFilter::exists <<< WORK
     340  15.6%  57.2%      357  16.4% _IO_getc
     271  12.4%  69.6%     1095  50.2% std::getline
     221  10.1%  79.7%      239  11.0% _IO_acquire_lock_fct
     172   7.9%  87.6%      193   8.8% _IO_ungetc
      54   2.5%  90.1%      287  13.2% __gnu_cxx::stdio_sync_filebuf::uflow
      48   2.2%  92.3%       48   2.2% std::__once_callable
      43   2.0%  94.3%       43   2.0% _IO_sputbackc
      37   1.7%  96.0%       37   1.7% std::_Rb_tree_black_count    <<< WORK
      27   1.2%  97.2%      275  12.6% __gnu_cxx::stdio_sync_filebuf::underflow
Fantastic! Barely 45% is related to the actual work; the rest is caused by std::getline. A strange distribution, especially the _IO_acquire_lock_fct function. That name seems self-explanatory, so I easily found the relevant method: ios_base::sync_with_stdio. So, putting std::cin.sync_with_stdio(false); in tamed my program. CPU time returned to normal and I'm happy again.
Note, this behavior is not related to -pthread or any other compiler flag. Reading from std::cin simply becomes 30 to 40 times slower, whether you use std::getline or not.
Surely, that's not a potion that makes all programs faster, but I'll keep this std::ios weakness in mind.

Sunday, January 22, 2012

Swap words in text w/o additional memory

Yesterday I saw an interview problem from our C team:

Reverse the word order in a text without using additional memory

For example, the string
"Fedora Project promotes internet freedom" should be transformed into
"freedom internet promotes Project Fedora" [a nice semantic palindrome, isn't it? :-)].

It doesn't look hard, but the only solution I found is to reverse the whole text, and then reverse each word.

#include <stdio.h>
#include <string.h>

/* invert characters in range [beg; end) */
static inline void inv(char *beg, char *end)
{
    while(beg < end) {
        char c = *beg;
        *beg++ = *--end;
        *end = c;
    }
}

int main(int argc, char *argv[])
{
    if(argc < 2)
        return 1;

    char *s = argv[1], *beg = s, *end = strchr(s, '\0');

    /* first pass: reverse the whole string */
    inv(beg, end);

    /* second pass: reverse each word back */
    beg = s;
    while(*beg) {
        end = strchr(beg, ' ') ? : strchr(beg, '\0');
        inv(beg, end);
        beg = *end ? end + 1 : end;
    }

    printf("%s\n", s);
    return 0;
}
The problem is that we have to scan the string twice. Maybe there is a better (by algorithm, not necessarily by speed) solution?
BTW, a nice GNU construction which is used everywhere in the Linux kernel:
x = y ? : z; // equals to "if(y) x = y; else x = z;", but y is evaluated only once

Monday, January 16, 2012

Long Live ... SSD!

I can safely assume you're using Linux :-)

Some advice to make your SSD live longer:

  • Enable the TRIM command from the filesystem down to the disk firmware (ext4 has the 'discard' option, see man 8 mount)
  • Set the 'noatime' and 'nodiratime' options (again, see man 8 mount)
  • Enlarge /proc/sys/vm/dirty_writeback_centisecs up to 60000 (60 seconds) to make pdflush write back more rarely
Do you know any more?
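For example, the tips above can be made persistent like this (a sketch; the UUID and mount point are placeholders I made up, adjust them for your system):

```
# /etc/fstab -- tips 1 and 2 as mount options
# (UUID and mount point below are placeholders, not real values)
UUID=0000-0000  /  ext4  discard,noatime,nodiratime,errors=remount-ro  0  1

# /etc/sysctl.conf -- tip 3, made persistent across reboots
vm.dirty_writeback_centisecs = 60000
```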

Surely, do not apply these blindly: there are many explanations of why an OS has to treat solid-state drives differently from a usual HDD:
  1. Illustrated process of rewriting block
  2. wear-leveling, or how solid state drives (or even USB sticks) remap data blocks.
  3. How to configure TRIM in Ubuntu and other distros. With little benchmarking


Also, it's worth changing the default IO scheduler to "noop". This will boost synchronous operations in the case of scattered read requests. Many people think it saves the CPU cycles burned by the "too smart" default schedulers: CFQ or deadline. But I think most of the effect is not in CPU time. Instead, it reduces the average time an IO request spends waiting to actually be sent to the device, because the "noop" scheduler does not buffer IO requests.

I saw a big difference in iostat(1)'s "await" column; from the manpage it is "The average time (in milliseconds) for I/O requests issued to the device to be served. This includes the time spent by the requests in queue and the time spent servicing them". For my workload it decreased 8 times!

To switch your disk's scheduler to noop, run:

$ echo noop | sudo tee /sys/block/sda/queue/scheduler # my SSD is sda

And, certainly, we have to do this on each system boot. The most native way to do this in Ubuntu, as I found, is via sysfsutils (its init script applies /etc/sysfs.conf at boot):

$ sudo apt-get install sysfsutils

Then add following line to /etc/sysfs.conf:

block/sda/queue/scheduler = noop

Saturday, January 14, 2012

slow std::fill behavior

This morning I saw a commit in our group with roughly this content:

- std::fill(v1.begin(), v1.end(), 0);
- std::fill(v2.begin(), v2.end(), 0);
- std::fill(v3.begin(), v3.end(), 0);
+ for(int i = 0; i < N; ++i) {
+     v1[i] = v2[i] = v3[i] = 0;
+ }
with comment "small optimization".

I wondered a bit, "Does it really give any speedup?", but decided to check anyway. The results were as expected, all but the last one: strangely, I couldn't make this code run fast without hand optimization, even with -march=native. Here is the source code and my benchmark numbers:

#include <algorithm>
#include <vector>
#include <stdio.h>
using namespace std;

#define N 1000

inline void i32_fill(int *start, int n, int c)
{
    int d1, d2;
    asm volatile(
        "rep stosl"
        :"=&D"(d1), "=&c"(d2)
        :"0"(start), "1"(n), "a"(c)
        :"cc", "memory"
    );
}

int main()
{
    int crc = 0;
    vector<int> v1(N), v2(N), v3(N), v4(N);

    for ( int loop = 0; loop < 10000000; loop++ ) {
        /* Enable exactly one of the following methods.
         * I especially did not run 'em one after each other, keeping
         * in mind you may ask "wasn't it because of caching?" */
#if 0
        /* method 1: std::fill */
        std::fill(v1.begin(), v1.end(), 1);
        std::fill(v2.begin(), v2.end(), 2);
        std::fill(v3.begin(), v3.end(), 3);
        std::fill(v4.begin(), v4.end(), 4);
#endif
#if 0
        /* method 2: "loop x N" */
        for ( int i = 0; i < N; i++) v1[i] = 1;
        for ( int i = 0; i < N; i++) v2[i] = 2;
        for ( int i = 0; i < N; i++) v3[i] = 3;
        for ( int i = 0; i < N; i++) v4[i] = 4;
#endif
#if 0
        /* method 3: "N x loop" */
        for ( int i = 0; i < N; i++) {
            v1[i] = 1;
            v2[i] = 2;
            v3[i] = 3;
            v4[i] = 4;
        }
#endif
        /* method 4: hand-written fill */
        i32_fill(&v1[0], N, 1);
        i32_fill(&v2[0], N, 2);
        i32_fill(&v3[0], N, 3);
        i32_fill(&v4[0], N, 4);

        crc += v1[N-1] + v2[N-1] + v3[N-1] + v4[N-1];
    }

    printf("CRC=%d\n", crc);
    return 0;
}

First of all, there is a great difference (over 4 times) between -O2 and -O3 for all variants but i32_fill. Here are the test results for:

$ g++ -O3 -march=native fill.cpp && time ./a.out

N x loop    0m7.018s
loop x N    0m5.119s

So, the question is: is there a better solution than i32_fill, and why wasn't "N x loop" optimized as much by the compiler? BTW, memset(3) uses a "stosq then stosb" approach and is a gcc built-in.