I've recently got access to a box that has Intel Haswell CPU inside. And I was quite looking forward playing with one of it's most interesting features: hardware transactional memory. My particular interest is to see how cheap it is.
My use case is per-processor data structures (e.g. malloc caches). And without explicit binding of threads to processors, there's only optimistic way of doing it. Which requires some synchronization to defend against pessimistic case of rescheduling of thread to different cpu. That would look like taking cpu id, locking it's corresponding lock which in most cases would be in cache and uncontended and thus reasonably quick, and then doing something with per-cpu data. So in this approach we always pay some performance price even if majority of actual runs will hit fast-path. Lack of really cheap optimistic locking makes that price significant which makes it less attractive.
So lets return to Intel's implementation of transactional memory (aka TSX). Wikipedia article describes that thing pretty well. My understanding is that it's expected to be most useful for somewhat coarse locks where multiple threads would normally contend for the lock yet they touch different memory locations. E.g. imagine different threads touching different buckets of hash table or different branches of binary search tree. It can also be used as compare-and-exchange operation that allows you to process multiple memory locations at once. There's already glibc support for it that optimises pthread mutex operations described in usually nice lwn.net article.
My hope was that this feature ends up being even faster than atomic operations in fastest path (everything is in L1 cache) given it's optimistic nature. And that it might be useful for quick optimistic locking I'd like to have.
You can see my test case here. It simulates fastpath of "lock" that guards a counter. There is no locking itself, just check that "lock" is free. Which is what glibc lock elision code is doing. And you can see how TSX allows to avoid actual locking. "On the other side of the ring" is code that changes counter via traditional compare-exchange atomic operation (no locking either, to give me purer numbers).
On the box I have access to (with Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz processor) I'm getting about 67 cycles per loop iteration for TSX case. And about 27 cycles for atomic CAS code (and same for more traditional locking instructions "xchg mem, register"). Note that it's very likely that larger transactions will have bigger overhead. Also note that usual synchronised region is two atomic ops (unlocking atomic operation being potentially significantly cheaper than locking operation), so in this limited case TSX appears to be somewhat competitive with traditional locking, but not faster.
So TSX is not faster than single atomic operation. Which breaks my hope of using it for quick optimistic lock It is somewhat sad that on today's hardware there's seemingly no way to have fast-path of locks to be lightning fast without playing crazy tricks (e.g. slow path stopping lock owner thread via signal or ptrace like jvm "biased locking" appears to be doing).
Anyways, being slightly more than 2x slower than simple atomic operation is pretty good news IMHO for use cases for which TSX is designed. And it's "multi-word cas" application appears to be very interesting and useful too. So I'm looking forward using it somewhere.
And finally I have to note that, especially at the beginning, debugging transactional memory code can be quite tricky and very weird. That's because transaction is fully isolated while it runs, so there's no way to printf something to see why it fails. Or set breakpoint inside it and inspect things. This hit me initially because my simplistic code wasn't at all prepared to handle transaction failures. I.e. my code is only supposed to test fast-path without any real-world synchronization contention. After few minutes of struggling with it I realized, that even otherwise conflict-less code will abort from time to time. For example, any interrupt (e.g. timer tick) will abort in-flight transaction, as well as in fact any user- to kernel-space transition will.
So lesson number one is that debugging hardware transactional memory code should be done very carefully. Especially if code path is significantly different between successful and abort-ful cases. I.e. imagine some real transaction that might span several layers of code and consider that debugger/printf will never be able to see or expose "guts" of aborted transactions. And lesson number two is that aborts have to be handled always, even in toy code.
Have a nice day and happy hacking.
My use case is per-processor data structures (e.g. malloc caches). And without explicit binding of threads to processors, there's only optimistic way of doing it. Which requires some synchronization to defend against pessimistic case of rescheduling of thread to different cpu. That would look like taking cpu id, locking it's corresponding lock which in most cases would be in cache and uncontended and thus reasonably quick, and then doing something with per-cpu data. So in this approach we always pay some performance price even if majority of actual runs will hit fast-path. Lack of really cheap optimistic locking makes that price significant which makes it less attractive.
So lets return to Intel's implementation of transactional memory (aka TSX). Wikipedia article describes that thing pretty well. My understanding is that it's expected to be most useful for somewhat coarse locks where multiple threads would normally contend for the lock yet they touch different memory locations. E.g. imagine different threads touching different buckets of hash table or different branches of binary search tree. It can also be used as compare-and-exchange operation that allows you to process multiple memory locations at once. There's already glibc support for it that optimises pthread mutex operations described in usually nice lwn.net article.
My hope was that this feature ends up being even faster than atomic operations in fastest path (everything is in L1 cache) given it's optimistic nature. And that it might be useful for quick optimistic locking I'd like to have.
You can see my test case here. It simulates fastpath of "lock" that guards a counter. There is no locking itself, just check that "lock" is free. Which is what glibc lock elision code is doing. And you can see how TSX allows to avoid actual locking. "On the other side of the ring" is code that changes counter via traditional compare-exchange atomic operation (no locking either, to give me purer numbers).
On the box I have access to (with Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz processor) I'm getting about 67 cycles per loop iteration for TSX case. And about 27 cycles for atomic CAS code (and same for more traditional locking instructions "xchg mem, register"). Note that it's very likely that larger transactions will have bigger overhead. Also note that usual synchronised region is two atomic ops (unlocking atomic operation being potentially significantly cheaper than locking operation), so in this limited case TSX appears to be somewhat competitive with traditional locking, but not faster.
So TSX is not faster than single atomic operation. Which breaks my hope of using it for quick optimistic lock It is somewhat sad that on today's hardware there's seemingly no way to have fast-path of locks to be lightning fast without playing crazy tricks (e.g. slow path stopping lock owner thread via signal or ptrace like jvm "biased locking" appears to be doing).
Anyways, being slightly more than 2x slower than simple atomic operation is pretty good news IMHO for use cases for which TSX is designed. And it's "multi-word cas" application appears to be very interesting and useful too. So I'm looking forward using it somewhere.
And finally I have to note that, especially at the beginning, debugging transactional memory code can be quite tricky and very weird. That's because transaction is fully isolated while it runs, so there's no way to printf something to see why it fails. Or set breakpoint inside it and inspect things. This hit me initially because my simplistic code wasn't at all prepared to handle transaction failures. I.e. my code is only supposed to test fast-path without any real-world synchronization contention. After few minutes of struggling with it I realized, that even otherwise conflict-less code will abort from time to time. For example, any interrupt (e.g. timer tick) will abort in-flight transaction, as well as in fact any user- to kernel-space transition will.
So lesson number one is that debugging hardware transactional memory code should be done very carefully. Especially if code path is significantly different between successful and abort-ful cases. I.e. imagine some real transaction that might span several layers of code and consider that debugger/printf will never be able to see or expose "guts" of aborted transactions. And lesson number two is that aborts have to be handled always, even in toy code.
Have a nice day and happy hacking.
No comments:
Post a Comment