Tales about Aliaksey's life

Monday, December 9, 2013

Massive power of (liblzma based) XZ archiver

I've recently restarted gathering bitcoin market data: I'm grabbing samples of market depth every 3 seconds, and I'm collecting trade events.

Market depth samples can be quite large. Every mtgox sample appears to be about 90 kilobytes. So an hour of samples is about 100 megs of data, a day is about 2.5 gigs, and a month is over 70 gigs. Which is a bit too much.

gzip is able to compress that about 5x. But that's still a bit too large.

I've found xz to really shine on this kind of data. More than a gig of data gets squeezed down to less than a meg! And what's extra cool is that xz is very quick to decompress, which is very useful for static data like a btc market archive.

So quality compression really does matter. And I just wanted to express my utmost respect to the authors of that extremely useful software.

Have a nice day and happy hacking!

Sunday, September 22, 2013

Playing with Intel TSX

I've recently got access to a box that has an Intel Haswell CPU inside. And I was quite looking forward to playing with one of its most interesting features: hardware transactional memory. My particular interest is to see how cheap it is.

My use case is per-processor data structures (e.g. malloc caches). Without explicitly binding threads to processors, the only way to do this is optimistically, which requires some synchronization to defend against the pessimistic case of a thread being rescheduled to a different cpu. That would look like: take the cpu id, lock its corresponding lock (which in most cases would be in cache and uncontended, and thus reasonably quick), and then do something with the per-cpu data. So with this approach we always pay some performance price, even though the vast majority of runs hit the fast path. The lack of a really cheap optimistic lock makes that price significant, which makes the whole approach less attractive.

So let's return to Intel's implementation of transactional memory (aka TSX). The Wikipedia article describes that thing pretty well. My understanding is that it's expected to be most useful for somewhat coarse locks where multiple threads would normally contend for the lock yet touch different memory locations. E.g. imagine different threads touching different buckets of a hash table, or different branches of a binary search tree. It can also be used as a compare-and-exchange operation that lets you process multiple memory locations at once. There's already glibc support for it that optimizes pthread mutex operations, described in a typically nice lwn.net article.

My hope was that, given its optimistic nature, this feature would end up being even faster than atomic operations on the fastest path (everything in L1 cache), and that it might be useful for the quick optimistic locking I'd like to have.

You can see my test case here. It simulates the fast path of a "lock" that guards a counter. There is no locking itself, just a check that the "lock" is free, which is what the glibc lock elision code is doing, and which shows how TSX allows us to avoid actual locking. "On the other side of the ring" is code that changes the counter via a traditional compare-exchange atomic operation (no locking either, to give me purer numbers).

On the box I have access to (with an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz processor) I'm getting about 67 cycles per loop iteration for the TSX case, and about 27 cycles for the atomic CAS code (and the same for the more traditional locking instruction "xchg mem, register"). Note that it's very likely that larger transactions will have bigger overhead. Also note that a usual synchronized region costs two atomic ops (the unlocking atomic operation being potentially significantly cheaper than the locking operation), so in this limited case TSX appears to be somewhat competitive with traditional locking, but not faster.

So TSX is not faster than a single atomic operation, which breaks my hope of using it for a quick optimistic lock. It is somewhat sad that on today's hardware there's seemingly no way to make the fast path of locks lightning fast without playing crazy tricks (e.g. the slow path stopping the lock owner thread via a signal or ptrace, like the jvm's "biased locking" appears to be doing).

Anyways, being slightly more than 2x slower than a simple atomic operation is pretty good news, IMHO, for the use cases TSX is designed for. And its "multi-word CAS" application appears to be very interesting and useful too. So I'm looking forward to using it somewhere.

And finally I have to note that, especially at the beginning, debugging transactional memory code can be quite tricky and very weird. That's because a transaction is fully isolated while it runs, so there's no way to printf something to see why it fails, or to set a breakpoint inside it and inspect things. This hit me initially because my simplistic code wasn't at all prepared to handle transaction failures; i.e. my code was only supposed to test the fast path, without any real-world synchronization contention. After a few minutes of struggling with it I realized that even otherwise conflict-free code will abort from time to time. For example, any interrupt (e.g. a timer tick) will abort an in-flight transaction, as in fact will any user- to kernel-space transition.

So lesson number one is that debugging hardware transactional memory code should be done very carefully, especially if the code path differs significantly between the successful and aborted cases. I.e. imagine some real transaction that spans several layers of code, and consider that a debugger/printf will never be able to see or expose the "guts" of aborted transactions. And lesson number two is that aborts always have to be handled, even in toy code.

Have a nice day and happy hacking.

Sunday, April 8, 2012

gpicker 2.2 is out!

Hello there! I've just made the long-overdue release of gpicker 2.2. Some notable changes are:

  • new project type -- script, that I'm using to handle multi-repository project (i.e. couchbase)
  • implemented poor man's isearch on steroids -- gpicker-isearch
  • big improvements for gpicker-imenu
  • more optimization

The Savannah project page has a link to the download area with source .tar.{gz,bz2,xz} archives and binary .deb packages (built on lenny) for i386 and amd64. If you haven't heard of gpicker before, also check out supermegadoc, which is a very convenient gpicker-using tool.

Thursday, December 15, 2011

Me and Gnome3

Hi. Quite a bit of time has passed since my last post. That was a busy time, with continued hard work on the (still forthcoming) Couchbase Server 2.0 release and, most importantly, I've found a beautiful girl and got married!

Anyway, I just got reminded that I should not forget to write something from time to time. And today's "hot" topic is Gnome 3.

About a month ago (or was it 2? Time flies so weirdly with so much happening around me now) Debian Sid got Gnome 3. Even earlier it got some components of Gnome 3. Most noticeable was the upgrade of gnome-terminal to the Gnome 3 version. And that was almost immediately reverted back to the gnome-terminal 2 from the last Debian stable. The reason is very simple: the default theme of gtk3 (which is, naturally, used by all gnome 3 apps) is ugly. Like, very very ugly. And, surprisingly, there's only one non-default theme engine for gtk3, the one that heavily uses CSS3. I don't like its look either, but the most worrisome aspect of it is its quite noticeable slowness. There are ways to adjust the look with CSS3 hackery, after all. I've found that some porting work on old gtk engines has been initiated. But the quick and minimalistic Mist engine I'm used to is not yet ported.

That's basically my whole Gnome 3 story. I cannot tolerate Gnome 3 not because of its experimental UI, but because I need a usable gtk3 theme first. I cannot even say what I think about Gnome's UI, because I haven't tried using it on a daily basis.

Whoever makes Mist work on gtk3 will become my hero. Meanwhile, I was forced to find refuge in XFCE land, which is missing a few things I had on my Gnome 2 desktop.

Monday, May 16, 2011

Unbreaking LXC on latest Debian unstable

With the recent switch to the /run directory in Debian, I was getting an error from lxc when it tried to mount /dev/shm in a container and failed, because /dev/shm is now a symlink into /run. The simplest fix I found is replacing the symlink with a bind mount. Here's what I've added to /etc/rc.local:

if [ -L /dev/shm ]; then
  mv /dev/shm /dev/shm~
  mkdir /dev/shm
  mount --bind "`readlink -f /dev/shm~`" /dev/shm
fi

Tuesday, May 3, 2011

Setting up Distel and erlang remote shell for membase

I decided to give myself a small present. I've just written some Emacs Lisp code that grabs the erlang otp node and cookie via the REST API and connects my emacs with that node. I've made integration with Distel and the erlang remote shell. The latter was the most problematic because of the quite weird TTY handling in the Erlang shell. But in the end it seems to work great! Grab the code here: https://gist.github.com/952958 And the usage is M-x alk-membase-shell and M-x alk-membase-setup-distel.

Sunday, April 24, 2011

Should we avoid exceptions in JS? (Or the problem with jsPerf micro-benchmarks)

In general my take on benchmarking is: it's too easy to screw something up. The biggest problem is that with normal programming/engineering it's easy to see if you're doing it right; with benchmarking, you just have some (plausible-looking or otherwise) numbers. I actually think that most of the benchmarks we're seeing are flawed in some way, so it's very easy to jump to conclusions, only to be found guilty of screwing up some important aspect. So first of all I'd like to make the following statement:
I understand that benchmarks suck. Please be careful and don't blindly believe any benchmarking numbers. Yes. Even mine.
Now back to the main story. Today my pull request for underscore was rejected. I'd proposed a nice call/cc-like helper to do non-local returns from JS functions. It should be especially helpful for breaking out of forEach/map loops. More ideally, the ES5 designers could have added some way to terminate loops early, like returning some special value. But apparently they're too conservative. Maybe I'll post some rants about that some other day.
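The rejected patch itself isn't reproduced here, but a minimal sketch of such a call/cc-like helper could look like this (the names breakableEach and breakOut are my own illustrations, not the actual patch):

```javascript
// Non-local return from a forEach loop, implemented with an exception.
// A unique token object ensures we never swallow unrelated exceptions.
function breakableEach(array, fn) {
  var token = {};
  function breakOut(value) {
    token.value = value;
    throw token;
  }
  try {
    array.forEach(function (item, i) {
      fn(item, i, breakOut);
    });
  } catch (e) {
    if (e === token) return token.value; // our non-local return
    throw e;                             // anything else propagates
  }
}

// usage: find the first even number, terminating the loop early
var firstEven = breakableEach([1, 3, 4, 5], function (x, i, breakOut) {
  if (x % 2 === 0) breakOut(x);
});
// firstEven === 4
```

The token-identity check in the catch block is what keeps this safe: a genuine error thrown inside the callback is re-thrown untouched.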

Anyway, my patch was rejected because the underscore author does not want to use exceptions for control flow. His point is that exceptions are very slow.

But is that true? I'm not sure. If we want all the performance we can get, then we probably shouldn't use forEach & friends in the first place. And then ES3's break statement can actually break out of several loop nestings (via break to a label). Note, though, that the standard forbids breaking across function calls; otherwise it would be a nice (and potentially faster) alternative to exceptions.
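For reference, here's what breaking out of nested plain loops with a label looks like (my own toy example):

```javascript
// ES3 labeled break: leaves several loop nestings at once,
// with no function-call boundary and no exception involved.
function findPair(matrix, target) {
  var found = null;
  outer:
  for (var i = 0; i < matrix.length; i++) {
    for (var j = 0; j < matrix[i].length; j++) {
      if (matrix[i][j] === target) {
        found = [i, j];
        break outer; // exits both loops immediately
      }
    }
  }
  return found;
}

// findPair([[1, 2], [3, 4]], 3) returns [1, 0]
```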

Part of the problem is that, apparently, V8 forces heap allocation of the frames of functions with a try/catch block.

The fact is, any code that creates closures has the potential to force heap allocation of stack frames (or parts of them), which, if done often enough, will trigger (relatively) expensive GC pauses. That's because if a closure is "leaked" out of its dynamic scope, it effectively outlives the function that created it. So any variables the closure needs cannot live on the normal stack.
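A textbook illustration of a closure outliving its creating function (my own example):

```javascript
// `count` is captured by the returned closure, so it must outlive
// makeCounter's activation and cannot live on a plain stack frame.
function makeCounter() {
  var count = 0;
  return function () { // leaks out of makeCounter's dynamic scope
    return ++count;
  };
}

var tick = makeCounter();
tick(); // 1
tick(); // 2 -- `count` survived makeCounter returning
```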

Lots of tricks in modern JS engines allow them to avoid heap allocation in many important cases. But sometimes you still have to pay that price.

My point is: there is no way to do a non-local return in JS other than by throwing an exception. And it doesn't seem to be too slow. I also expect JS-engine vendors to gradually optimize that case.

But even though I understand micro-benchmarking's drawbacks, here's my (potentially flawed) take on the for-loop-with-break vs. forEach-with-throw problem.

First, here's a jsPerf page on plain for vs. forEach performance. We see that on FF3.6 forEach is much slower than the loop, but on V8 it's about the same.

Second, here's a jsPerf page on for-with-break vs. forEach-with-throw performance. We see that throwing out of the loop is not too bad: it actually seems to be faster on Firefox and only 1.5-to-2 times slower on Chrome's V8.
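The jsPerf snippets themselves aren't reproduced here, but the two contenders have roughly this shape (my reconstruction, not the exact jsPerf code): stop summing once a threshold is passed.

```javascript
var data = [];
for (var i = 0; i < 1000; i++) data.push(i);

// contender 1: plain for loop, early exit via break
function sumWithBreak(arr, limit) {
  var sum = 0;
  for (var i = 0; i < arr.length; i++) {
    sum += arr[i];
    if (sum > limit) break;
  }
  return sum;
}

// contender 2: forEach, early exit via a thrown token
function sumWithThrow(arr, limit) {
  var sum = 0;
  var stop = {};
  try {
    arr.forEach(function (x) {
      sum += x;
      if (sum > limit) throw stop;
    });
  } catch (e) {
    if (e !== stop) throw e;
  }
  return sum;
}
```

Both functions compute the same result; only the early-exit mechanism differs, which is exactly what such a micro-benchmark tries to isolate.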

Also note that, somehow, manually inlining the add function produces much larger scores, which might indicate a major flaw in this microbenchmark. In general, of course, we'd like benchmark code to be compiled/optimized once and then run multiple times. But the lack of automatic inlining in this very trivial case hints that we might be causing the JIT to re-optimize it each time it's run. So feel free to play with those benchmarks more and find some flaws!

Note that there's some plausible evidence that throwing non-Errors is much faster. I think that's because when you throw an Error instance, the runtime collects a backtrace and fills in the 'stack' attribute, which seems to be much costlier than just throwing. In fact, you can google for Java exception-throwing performance and you'll discover that optimal throwing performance is reached by throwing a "prepared" exception instance, for exactly the same reason.
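A sketch of that "prepared" exception trick in JS (my own example; the backtrace-cost claim applies to V8-style engines that fill in `stack` when the Error is constructed):

```javascript
// Allocate the exception (and pay for its stack capture) exactly once.
var PREPARED = new Error("loop break");

function throwPrepared() {
  try { throw PREPARED; } catch (e) { return e === PREPARED; }
}

function throwFresh() {
  // every call constructs a new Error, collecting a fresh backtrace
  try { throw new Error("loop break"); } catch (e) { return e instanceof Error; }
}
```

Both behave identically from the caller's point of view; only throwFresh pays the backtrace cost on every call.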

But returning to the original question, I still feel that in some cases exceptions are a perfectly good way to do non-local returns. There are so many creative ways to use all language features, and blindly rejecting any of them just because it seems slow is not smart, IMHO. Sure, exceptions have their cost, but sometimes avoiding them is costlier. All trade-offs should be carefully analyzed and understood.