Wednesday, December 30, 2015

PrintAssembly output explained!

If you are a regular reader of my blog, you may have noticed that I am (ab)using the PrintAssembly option of the JVM to examine the code generated by the JIT compiler. It helps me a lot to understand how my code is executed, and how the JIT compiler works and optimizes the code.
Even if I also use JMH from time to time, I am not a big fan of benchmarking, and especially of micro-benchmarking.

Why? Because micro-benchmarking is an idealization of how the production code is executed: tight loops, all data into the L1 cache and hot, few branch misses, best case for aggressive JIT optimizations (like monomorphic calls, etc.)

The thing is, the execution context in production is totally different from a micro-benchmark, so what's the point in exercising code that will not be executed under the same conditions? Are the conclusions I can draw from the micro-benchmark still valid or still beneficial for my production cases?
All of this pushes me away from micro-benchmarks as much as possible, towards other ways to evaluate performance, like performance counters inserted directly into the application, or reading the assembly generated by the JIT compiler. Note that this is not perfect either, as nowadays CPUs execute out of order and also perform instruction-level parallelism. So in some situations benchmarking is the only way to assess performance.

Printing assembly also helps me to back assertions about how the JIT optimizes, instead of relying on folklore and urban legends (reordering of instructions, memory barriers, ...).

With all of that, PrintAssembly is one of my favorite tools. But I can understand that its output may be difficult to read. Not all developers are familiar with assembly nowadays, unfortunately, but with some basic knowledge and the help of the inserted comments, it can be less cryptic.

For those who have never used PrintAssembly, please refer to my previous posts about it: How to print disassembly from JIT code and How to build hsdis-amd64.dll. Chris Newland, creator of the JITWatch tool, also has some useful tips for Mac OS X. Nitsan Wakart wrote an article on this as well.

Your setup is done? Perfect, let's read some assembly, yeah!

Assembly 101

First of all, I am using the Intel syntax, not the AT&T one. I am used to this syntax, and since we are talking about the x86 instruction set made by Intel, let's stick to their convention.
Reminder: to get this syntax with the disassembler plugin, use the JVM option:
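A minimal command line for this, assuming the hsdis plugin is installed next to the JVM and supports syntax selection:

```shell
# PrintAssemblyOptions passes options through to the hsdis disassembler plugin;
# "intel" switches the output from the default AT&T syntax to Intel syntax.
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly \
     -XX:PrintAssemblyOptions=intel MyApp
```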

Instruction lines are decomposed as the following:

mnemonic parameters
  • mnemonic is the instruction name (mov, call, add, jmp, cmp, ...)
  • parameters can be register, memory accesses, immediate values

mov rax, 0x2A
mov rdx, QWORD PTR [rbx+0x571c418]

The mov instruction is a data movement. The first line moves the constant value 0x2A into the register rax.
The second line moves the memory content at the address computed from the value of register rbx plus the constant 0x571c418 into the register rdx. Note that the operand order is reversed in AT&T syntax.

  • push/pop instructions move data to/from the stack
  • add/sub/imul/idiv instructions perform addition/subtraction/multiplication/division on integers
  • inc/dec instructions increment/decrement a value in a register or in memory
  • and/or/xor/not/shl/shr instructions perform bitwise operations
  • jmp instruction performs an unconditional jump to the specified address
  • jxx instructions perform a conditional jump based on the result of the last related operation
  • cmp instruction performs a comparison between 2 operands
  • call/ret instructions perform a call to/return from a subroutine

For more information see this guide for example or the official Intel documentation.

Disassembler comments

Fortunately, the disassembler plugin does not spit out raw instructions but annotates them with useful information.
Let's take an example with the method ArrayList.add(Object) and analyze it:
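The disassembly listing itself is not reproduced here, but to follow the walkthrough it helps to keep the shape of the Java code in mind. Here is a simplified paraphrase (not the actual JDK source) of what add(Object) does; the field names size and elementData match the real class:

```java
import java.util.Arrays;

// Simplified paraphrase of ArrayList.add for following the walkthrough below:
// read the size field, grow elementData if needed, store, return true.
class MiniArrayList {
    private Object[] elementData = {};  // the empty-array constant tested during growth
    private int size;                   // the size field read first by add

    boolean add(Object e) {
        // stands in for the inlined ensureCapacityInternal/ensureExplicitCapacity calls
        if (size == elementData.length) {
            elementData = Arrays.copyOf(elementData, Math.max(10, size * 2));
        }
        elementData[size++] = e;        // aastore: type check and card marking happen here
        return true;
    }

    int size() { return size; }
}
```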


In the header we can find the following information:
The first line is the name of the disassembled method: 'add', with its signature: one parameter of type Object, returning a boolean, from the class java.util.ArrayList.
But as this is an instance method, there are in fact 2 parameters, as mentioned in the rest of the header:
the parameter this, which is stored in register rdx, and the Object parameter in register r8.

Verified Entry Point

After the header, the first instructions of the method begin after the [Verified Entry Point] section. Assembly before this mark is there for alignment (padding). Starting from this section, we will look at the comments that follow the semicolon. Comments starting with a star (*) indicate the associated bytecode.

Synchronization Entry

The following comment: ; - java.util.ArrayList::add@-1 (line 458) gives us the mapping to the Java code: class name, method name and bytecode offset into the method, and finally the line number in the original Java source file. For this prologue, as there is no specific bytecode associated, we get the -1 offset. The first one, ;*synchronization entry, indicates the prologue of the function: some instructions that are necessary to prepare the execution (stack allocation or stack banging, saving some registers, ...).

Get size field

The next comment shows the retrieval of the size field from the current instance (ArrayList). It is translated to the following assembly line: mov r9d,DWORD PTR [rdx+0x10]
It moves into the r9 register the content at the address rdx (the this instance, cf. the method parameters) + the 0x10 offset, where the size field is located.

Get elementData field

The following comment is interesting because we have the same type of bytecode, getfield, but the mapping to the Java code involves 2 methods: java.util.ArrayList::ensureCapacityInternal@1 (line 223) and java.util.ArrayList::add@7 (line 458). Implicitly, it means that the JIT has inlined the first method mentioned and the bytecode comes from this method.

Empty array test

{oop(a 'java/lang/Object'[0])} indicates an instance (oop) of the type 'java/lang/Object'[0], that is, an empty object array. This is in fact the constant empty-array instance against which we are comparing inside the inlined method ensureCapacityInternal.

More inlining

Here we have an additional level of inlining for the ensureExplicitCapacity method.

Implicit null check

A new kind of comment: here we have an implicit null check, because we are dereferencing the object array elementData to get its length (Java code: elementData.length). If elementData is null, the JVM must throw a NullPointerException. But, to avoid generating a check for each object dereference, the JIT relies on OS signal handling for segfaults to handle this rare case. See my article on this technique.
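A minimal illustration of the Java-level behavior (the method name is mine, for illustration):

```java
class ImplicitNullCheck {
    // The hot path compiles down to a bare load of the length slot; the null
    // case is handled by the JVM's SIGSEGV handler, not by emitted code.
    static int length(Object[] a) {
        return a.length;
    }
}
```

Calling length(null) still throws a NullPointerException; it is simply manufactured by the JVM from the segfault rather than by an explicit compare-and-branch.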

Type Check

Let's skip some regular comments to stop on this one:
We are verifying that the class (metadata) of the elementData instance is an object array ('java/lang/Object'[]). To do this, we load the class pointer from the instance and compare it to the address of the class loaded by the JVM.

Card marking

Sometimes the comments are wrong: here this is not a synchronization entry, but a special operation called 'card marking', which is performed after a write of a reference into a field or into a reference array (elementData in our case). The assembly generated for card marking is analyzed in this article. In this case we have card marking for an element in an array; for a regular instance field, the generated assembly is different.

Safepoint poll

Finally, the comment {poll_return} indicates that the instruction performs a safepoint check. You will see this at the end of all methods. For more details about safepoints, please read my article, and a more detailed exploration of safepoints and their impact here.

Voilà! You have the basics to understand the disassembly output from the PrintAssembly option. Again, if you want to go further, I strongly recommend the wonderful JITWatch tool.



Wednesday, October 14, 2015

assignment-with-use or inline assignment

I am still astonished by some persistent folklore and "rules of thumb" that find their origin in such an ancient time that most people have forgotten why they follow the rule.

Recently, I heard one of those rules of thumb:
if you want to keep object references in registers, you need to perform inline assignment, in an if statement for example:

which is claimed to be more efficient than:
To back this statement, the JDK source code is used as a best practice to be followed, more precisely the LongAdder class:

A nice example of inline assignment written by Doug Lea (or at least approved by him). Why not do like him? After all, he knows what he is doing, right?

I searched for an explanation of this style and found this thread, where Rémi Forax finds this code "not very java-ish". Aurélien Gonnay also noticed a difference of style between the JDK code and the JSR one for the method sumThenReset.

So let's simplify this code while keeping the inline assignment style. I take a similar example from the previous thread with Rémi Forax:

A more Java-ish version would be the following:
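The original listings are elided above, but a sketch of the two styles under discussion (the names are mine, not the article's bench_c/bench_java methods) could look like this:

```java
class InlineAssign {
    static String source() { return "x"; }

    // inline-assignment style: the assignment happens inside the condition
    static int inlineStyle() {
        String s;
        if ((s = source()) != null) {
            return s.length();
        }
        return 0;
    }

    // "java-ish" style: assign first, then test
    static int plainStyle() {
        String s = source();
        if (s != null) {
            return s.length();
        }
        return 0;
    }
}
```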
Firstly, we compare the generated byte code for inline assignments:
with no inline assignments:

Here a diff as summary:
<  4: dup      
<  5: astore_1 
>  4: astore_1 
>  5: aload_1  
< 11: dup      
< 12: istore_2 
> 11: istore_2 
> 12: iload_2  
< 25: dup      
< 26: astore_3 
> 25: astore_3 
> 26: aload_3  
Differences are only spotted at lines 4, 11 and 25: instead of dup + store for the inline assignment, we get store + load, which is semantically the same. But is the first form more optimized? Let's go check the assembly:
Compared to no inline assignments:

Here are the commands to diff without the addresses:
sed -e "s/0x[0-9a-f]*/0/g" inlineAssign1.txt > inlineAssign1_sed.txt

sed -e "s/0x[0-9a-f]*/0/g" inlineAssign2.txt > inlineAssign2_sed.txt

diff -w inlineAssign1_sed.txt inlineAssign2_sed.txt
<   # {method} {0} 'bench_c' '()V' in 'main/java/TestJIT'
>   # {method} {0} 'bench_java' '()V' in 'main/java/TestJIT'
<                                          ; - (line 16)
>                                          ; - (line 24)
<                                          ; - (line 16)
>                                          ; - (line 24)
<                                          ; - (line 16)
>                                          ; - (line 26)
<                                          ; - (line 16)
>                                          ; - (line 27)
<                                          ; - (line 17)
>                                          ; - (line 28)
<                                          ; - (line 17)
>                                          ; - (line 28)
<                                          ; - (line 17)
>                                          ; - (line 29)

You can check by yourself, line by line: each instruction is exactly the same in both versions; only the addresses differ, and the comments because, in my test, the method names are not the same.
Conclusion? Inline assignments have no value in terms of performance; the JIT is sufficiently smart to handle this.

Note: I have used JDK 1.8.0_40 for this article.

Thanks to Aurélien Gonnay for the review.

Monday, September 21, 2015

Yoda conditions

This blog post is not about performance, but a rant. A rant about too many developers applying some "rules" blindly without knowing why those practices exist in the first place.
Recently, I read this article from Lukas Eder, which in example number 4 recommends inverting the natural way to write conditions, by putting the constant on the left side of the operator instead of the right side. The reason behind that is to avoid an accidental assignment if you miss one equal sign.

This practice (often called Yoda conditions) is quite old now. I first encountered it when doing C++. The language (in fact, this comes from C) allows inline assignment inside a condition because a condition is just an expression evaluation (there was no boolean type in earlier versions of C): if the result is 0 it means false, otherwise it is true. So any value (1, 2, 3, -1, 0x012345f, ...) is considered true.
Putting the constant on the left side prevents the expression from compiling if an equal sign is missing, so it helps catch the mistake quickly.
In C/C++ this practice makes sense (discussing whether it is a good or bad practice in C/C++ is out of the scope of this post) because there is a rationale behind it.

In Java or C#, the story is totally different: conditions are strongly typed to boolean. Assigning null to a reference or a value to an integer variable inside a condition leads to a compile error. Therefore it does not make sense to follow this "rule" in those languages, as the mistake is inherently caught by the compiler thanks to the type system.
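A minimal Java sketch of why the rule is moot here (the method is mine, for illustration):

```java
class YodaDemo {
    // In Java a condition must be a boolean, so a missing '=' is caught by
    // the compiler without reversing the operands:
    static boolean isFive(int x) {
        // return x = 5;   // does not compile: int cannot be converted to boolean
        return x == 5;     // the natural order is already safe
    }
}
```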

The takeaway from this: do not blindly follow "rules" without knowing their origin and what issues they are trying to address. Know your language and its type system, too. Rules like Yoda conditions fall by themselves when you leverage your language's type system correctly.

Monday, September 14, 2015

Why Bios Settings matter (and not size)!

Recently, we have received new server machines based on Xeon E5v3 (Haswell). I have heard from different people that this CPU generation is very good, and figures are really impressive.
So I was pretty excited to test those new beasts on our standard benchmark to compare to the previous E5v2 (Ivy Bridge) we have.

Let's go for the first run:

Honestly, this is pretty baaaaad compared to what we got with the E5v2. This is so bad that I emailed my Production System Manager to find out what was going wrong. Usually when we receive new machines, we apply a procedure to configure them correctly for our requirements. As we work in the low-latency space, we need to avoid many pitfalls like power management, NUMA, isolcpus, in order to get the best performance.
When I checked at the OS level, I noticed, for example, that cpuidle was active, which is not expected. On a well configured/tuned machine, cpuidle is not enabled. My suspicions went to a misconfigured BIOS. My PSM asked me to check the BIOS with a command line (which is pretty handy, no need to reboot the machine).
The usual features that we change are the following:
  • C states: disabled
  • C1E state: disabled
  • Power management profile: Maximum performance
  • Collaborative CPU performance: disabled

Bingo, BIOS was not reconfigured following our standards! I applied them and re-run our benchmark:

Latencies divided by 2 (or more), that's really better! But still slower than the E5v2. Let's recheck those BIOS settings one more time.

Why are power management features so bad for low latency? The thing is, waking up a component takes time, potentially hundreds of microseconds. For a process that we are measuring under 100 microseconds, this is huge.
For the CPU there are C-states, which are different sleep modes. On Linux, with C-states enabled at the BIOS level, you can see the associated latency:

cat /sys/devices/system/cpu/cpu0/cpuidle/state3/latency

It means here that waking up a core from C3 to C0 (running) takes 200 microseconds!

With a new server generation come new features and options; maybe one of them makes the difference.
I identified one that sounds pretty "bad" from a low-latency point of view: "uncore frequency" = dynamic.
Available options: dynamic/maximum.
Let's set it to maximum and run our benchmark:

Now we are talking! Results are better than the E5v2 (roughly +30%), which is REALLY impressive!
We tested on an E5 2697 v2 @ 2.7GHz and on an E5 2697 v3 @ 2.6GHz. There is only 100MHz less on the v3, but it is still 30% better on our benchmark.

Finally, there are some fun features we can play with in the BIOS settings: we can enable Turbo Boost and make the turbo frequency static by reducing the number of cores available in the CPU.
The E5v3 has 14 cores; let's cut this down to only 2 cores and set the frequency permanently to 3.6GHz (yeah, this is overclocking for servers!):

Compared to the default setup, we divided our latency by 4! Just by tuning the BIOS well!
My PSM asked me to email those results to make sure everybody in his team is aware of the importance to apply BIOS settings correctly on production servers.

Friday, August 28, 2015

Ordered Scheduler


The LMAX Disruptor is a great piece of software engineering and may be used for many use cases. However, as I showed in my previous article on the cost of signalling, multiplying this kind of structure can lead to scheduler contention, because we increase the number of threads that may request a CPU time slice from the OS scheduler.

At Ullink we faced this issue in a very concrete case, which requires keeping ordering while trying to parallelize tasks. At the same time, we would like to reduce the number of threads involved in our application.

The Disruptor could solve our case, but at the price of a consumer thread for each instance. As we may have hundreds of instances of this pattern in a running application, we would significantly increase the number of threads and therefore the pressure on the scheduler.

Georges Gomes, CTO of Ullink, found in an Intel article [1] an alternative approach to solving our problem. This article addresses the following issue: how to parallelize some tasks while keeping ordering?

Here is an example: we have a video stream and we want to re-encode it. We read each frame from the input stream and re-encode it before writing it into an output stream.
Clearly, frame encoding can be parallelized: each frame is an independent source of data. The problem is that we need to keep the ordering of the frames, otherwise the video stream would not make any sense!


To solve this problem, we need to use an order buffer:


This buffer keeps items until the proper order is respected, then writes into the output stream.
To ensure proper ordering, each task grabs a ticket, on which the algorithm orders the processing.

Each thread is responsible for executing the algorithm and determining whether it is its turn to write into the output stream or not. If not, it leaves the item in the order buffer, and it is up to the thread of the previous item (in order) to take care of the leftover items. This means we do not have a dedicated consumer thread, which leads to the following advantages:

  • No inter-thread communication overhead (signalling: wait/notify, await/signal)
  • No additional thread
  • No wait strategy (spinning, back-off, ...)

We can then multiply this kind of structure without the fear of significantly increasing the number of threads running in the application. Bonus: we reduce the need for the synchronization that was previously required to ensure ordering.


When tasks are created/dispatched, we need to keep track of the order in which they were created by assigning a ticket to each of them. This ticket is just a sequence that we keep incrementing.
For the order buffer, we create a ring-buffer-style structure with a tail counter.


When one of the tasks enters the structure, we check if its ticket number matches the tail counter. If it matches, we process the task with the thread currently entering, and we increment the tail.

Next, the thread carrying ticket number 3 comes in (by the game of scheduling). Here the ticket number does not match the current tail (2).

We then put the task we would like to process into the ring buffer, at the index given by the ticket number (modulo the length of the ring buffer). And we let the thread go free.

Finally, the thread with ticket number 2 comes in. It matches the tail, so we can process the task immediately and increment the tail.


We then examine whether there are still tasks at the next index after the tail. The task with ticket number 3, left by the previous thread, is there, so it is its turn: we reuse the current thread to execute it.

And we're done!
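The steps above can be sketched as follows. This is a minimal illustration of the ticket/ring-buffer idea, not the open-sourced implementation: it uses a single lock for clarity where a real version would be more careful about contention.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the ordered-scheduler idea: tasks grab a ticket at dispatch time;
// whichever thread shows up holding the tail ticket runs its task and then
// drains any tasks parked by threads that arrived out of order.
class OrderedScheduler {
    private final Runnable[] ring;
    private final AtomicLong nextTicket = new AtomicLong(); // ticket sequence
    private long tail;                                      // guarded by this

    OrderedScheduler(int capacity) {
        ring = new Runnable[capacity];
    }

    long ticket() {
        return nextTicket.getAndIncrement();
    }

    synchronized void run(long ticket, Runnable task) {
        if (ticket != tail) {
            // not our turn: park the task at its slot and let the thread go
            ring[(int) (ticket % ring.length)] = task;
            return;
        }
        task.run();
        tail++;
        // drain tasks left over by threads that arrived before their turn
        int idx;
        while (ring[idx = (int) (tail % ring.length)] != null) {
            Runnable r = ring[idx];
            ring[idx] = null;
            r.run();
            tail++;
        }
    }

    // a ticket must never be missed: on failure, burn it with a no-op task
    void trash(long ticket) {
        run(ticket, () -> { });
    }
}
```

Out-of-order arrival is handled without any wait: the early thread parks its task and leaves, and the thread holding the tail ticket executes both.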

Ordered Scheduler

If you have a code like this:

You need to synchronize to ensure your code keeps the ordering required when writing to the output.
With Ordered Scheduler you can change your code to:

The synchronized block is still required in this example to ensure the input is attached to the right ticket. You could do it differently by reading your input in a single thread, taking the ticket, and then dispatching your processing tasks into a thread pool. Your call to the ordered scheduler will then serialize your output.

What is important to keep in mind is that you cannot miss a ticket. This is why, in this example, if an exception is thrown during processing, we call the trash method to inform the ordered scheduler that the ticket is no longer valid; otherwise it would wait forever for that ticket to come into the scheduler.

The implementation is open sourced on GitHub.



Monday, July 13, 2015

Notify... oh, wait! I have a signal.


When you want to pipeline, delegate some task asynchronously or simply synchronize 2 threads, you usually end up using wait/notify couple (or even await/signal, depending on your taste).

But what is the cost or the overhead of this kind of pattern?

Under the hood

What happens when we use the wait/notify couple?
I simplify here to this couple, as the other (await/signal) calls the same set of underlying methods:


Basically, we are performing system calls. For Object.wait, we ask the OS scheduler to move the current thread to the wait queue.

For Object.notify, we ask the scheduler (via futexes [1] on Linux) to move one of the waiting threads from the wait queue to the run queue, to be scheduled when possible.

Just a quick remark about system calls: contrary to common belief, system calls do not imply context switches [2]. It depends on the kernel implementation. On Linux there is no context switch unless the system call implementation requires it, as for IO. In the case of pthread_cond_signal, there is no context switch involved.

Knowing that, what is the cost of calling notify for a producer thread ?

Measure, don't guess!

Why not build a micro-benchmark? Because I do not care about average latency; I care about outliers, spikes. How does it behave for 50, 90, 95, 99, 99.9% of the time? What is the maximum I can observe?
Let's measure it with HdrHistogram from Gil Tene and the following code:

This code basically creates n pairs of threads: one (critical) trying to notify the second (flushing) that data is available to be processed (or flushed).
I run this code with the following parameters: 16 1000. It means that we have 16 pairs of threads doing wait/notify.
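The original listing is elided here; a stdlib-only sketch of one such critical/flushing pair (the real code multiplies this by n and records the latencies with HdrHistogram) might look like:

```java
// One critical/flushing pair: the critical (calling) thread signals that data
// is ready, and we measure how long the flushing thread takes to wake up.
class NotifyCost {
    private static final Object lock = new Object();
    private static boolean dataReady;
    private static volatile long wakeupNanos;

    static long measureOnce() {
        Thread flusher = new Thread(() -> {
            synchronized (lock) {
                while (!dataReady) {
                    try {
                        lock.wait();
                    } catch (InterruptedException e) {
                        return;
                    }
                }
                wakeupNanos = System.nanoTime(); // woken up and rescheduled
            }
        });
        flusher.start();
        long start;
        synchronized (lock) {
            dataReady = true;
            start = System.nanoTime();
            lock.notify(); // the call whose cost we want to observe
        }
        try {
            flusher.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return wakeupNanos - start; // wake-up latency in nanoseconds
    }
}
```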

Results on Windows (ns):
count: 16000
min: 0
max: 55243
mean: 549.5238125
50%: 302
90%: 1208
95%: 1812
99%: 3019
99.9%: 11472

Results on Linux (ns):
count: 16000
min: 69
max: 20906
mean: 1535.5181875
50%: 1532
90%: 1790
95%: 1888
99%: 2056
99.9%: 3175

So most of the time we observe a couple of microseconds for a call to notify. But in some cases we can reach 50µs! For low-latency systems this can be an issue and a source of outliers.

Now, if we push our test program a little to use 256 pairs of threads, we end up with the following results:

Results on Windows (ns):
count: 256000
min: 0
max: 1611088
mean: 442.25016015625
50%: 302
90%: 907
95%: 1208
99%: 1811
99.9%: 2717

Results on Linux (ns):
count: 256000
min: 68
max: 1590240
mean: 1883.61266015625
50%: 1645
90%: 2367
95%: 2714
99%: 7762
99.9%: 15230

A notify call can take 1.6ms!

Even though there is no contention in this code per se, there is another kind of contention happening in the kernel: the scheduler needs to arbitrate which thread can run. Having 256 threads trying to wake up their partner threads puts a lot of pressure on the scheduler, which becomes a bottleneck here.


Signalling can be a source of outliers, not because we have contention on executing code between threads, but because the OS scheduler needs to arbitrate among those threads, responding to wake-up requests.


[1] "Futexes Are Tricky", Ulrich Drepper:

Tuesday, July 7, 2015

WhiteBox API

I had already seen this in JCStress, but it was a post from Rémi Forax on the mechanical sympathy forum that brought it to my attention, when I saw what is possible to do with it. Here is a summary:

This API has been usable since JDK 8, and there are some new additions in JDK 9.

But how do you use it?
This API is not part of the standard API but of the test library from OpenJDK. You can find it here.
Download the sources of OpenJDK, then either build it entirely and grab the wb.jar, or:

  1. go to test/testlibrary/whitebox directory
  2. javac -sourcepath . -d . sun\hotspot\**.java
  3. jar cf wb.jar .

Place your wb.jar next to your application and launch it with:

java -Xbootclasspath/a:wb.jar -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI ...

Here is an example you can run with the WhiteBox jar:

For this example you need to add -XX:MaxTenuringThreshold=1 to make it work as expected.
Now you have an API to trigger a minor GC and test whether an object resides in the old generation. Pretty awesome!

You can also trigger JIT compilation on demand for some methods and change VM flags on the fly:

Unlike Unsafe, this API seems difficult to use in a production environment, but at least you can have fun in the lab, or add it to your low-level tests as the OpenJDK does. Enjoy!

Tuesday, May 26, 2015

Volatile and memory barriers

I have already blogged about the effect of a volatile variable on the optimizations performed by the JIT. But what is really the difference compared to a regular variable? And what are the impacts in terms of performance?


Volatile has a well-defined semantic in the Java Memory Model (JMM), but to summarize, it has the following properties:
  • Accesses cannot be reordered
  • It ensures data visibility to other threads


Doug Lea refers to thread visibility as flushing caches; however, as pointed out by Martin Thompson in his post, a volatile access does not flush the cache for visibility (as in writing data to memory to make it visible to all cores).
The ccNUMA architecture means that all data in the cache subsystem is in fact coherent with main memory. So the semantics of volatile apply through the Load/Store buffers (or Memory Ordering Buffers) placed between the registers and the L1 cache.

Depending on the CPU architecture/instruction set, the instructions generated to ensure those properties can vary. Let's focus on x86, which is the most widespread.


Regarding reordering, there are 2 kinds:
  • Compiler
  • Hardware/CPU
The compiler is able to reorder instructions during instruction scheduling to match the cost of loading or storing data with the CPU specifications.
It can be interesting to trigger 2 loads without dependencies on each other, in order to optimize the time the CPU spends waiting for that data by performing other operations in the meantime.

In the following example I have put 6 non-volatile fields:

public class TestJIT {
    private static int field1;
    private static int field2;
    private static int field3;
    private static int field4;
    private static int field5;
    private static int field6;

    private static void assign(int i) {
        field1 = i << 1;
        field2 = i << 2;
        field3 = i << 3;
        field4 = i << 4;
        field5 = i << 5;
        field6 = i << 6;
    }

    public static void main(String[] args) throws Exception {
        for (int i = 0; i < 10000; i++) {
            assign(i);
        }
    }
}

Let's examine what is generated by the JIT for the method assign:

  # {method} 'assign' '(I)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = int
  #           [sp+0x10]  (sp of caller)
  0x02438800: push   ebp
  0x02438801: sub    esp,0x8            ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::assign@-1 (line 26)
  0x02438807: mov    ebx,ecx
  0x02438809: shl    ebx,1
  0x0243880b: mov    edx,ecx
  0x0243880d: shl    edx,0x2
  0x02438810: mov    eax,ecx
  0x02438812: shl    eax,0x3
  0x02438815: mov    esi,ecx
  0x02438817: shl    esi,0x4
  0x0243881a: mov    ebp,ecx
  0x0243881c: shl    ebp,0x5
  0x0243881f: shl    ecx,0x6
  0x02438822: mov    edi,0x160
  0x02438827: mov    DWORD PTR [edi+0x565c3b0],ebp
                                        ;*putstatic field5
                                        ; - com.bempel.sandbox.TestJIT::assign@27 (line 30)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x0243882d: mov    ebp,0x164
  0x02438832: mov    DWORD PTR [ebp+0x565c3b0],ecx
                                        ;*putstatic field6
                                        ; - com.bempel.sandbox.TestJIT::assign@34 (line 31)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x02438838: mov    ebp,0x150
  0x0243883d: mov    DWORD PTR [ebp+0x565c3b0],ebx
                                        ;*putstatic field1
                                        ; - com.bempel.sandbox.TestJIT::assign@3 (line 26)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x02438843: mov    ecx,0x154
  0x02438848: mov    DWORD PTR [ecx+0x565c3b0],edx
                                        ;*putstatic field2
                                        ; - com.bempel.sandbox.TestJIT::assign@9 (line 27)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x0243884e: mov    ebx,0x158
  0x02438853: mov    DWORD PTR [ebx+0x565c3b0],eax
                                        ;*putstatic field3
                                        ; - com.bempel.sandbox.TestJIT::assign@15 (line 28)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x02438859: mov    ecx,0x15c
  0x0243885e: mov    DWORD PTR [ecx+0x565c3b0],esi
                                        ;*putstatic field4
                                        ; - com.bempel.sandbox.TestJIT::assign@21 (line 29)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x02438864: add    esp,0x8
  0x02438867: pop    ebp
  0x02438868: test   DWORD PTR ds:0x190000,eax
                                        ;   {poll_return}
  0x0243886e: ret    

As you can see in the comments, the order of the field assignments is the following: field5, field6, field1, field2, field3, field4.
We now add the volatile modifier to field1 & field6:
  # {method} 'assign' '(I)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = int
  #           [sp+0x10]  (sp of caller)
  0x024c8800: push   ebp
  0x024c8801: sub    esp,0x8
  0x024c8807: mov    ebp,ecx
  0x024c8809: shl    ebp,1
  0x024c880b: mov    edx,ecx
  0x024c880d: shl    edx,0x2
  0x024c8810: mov    esi,ecx
  0x024c8812: shl    esi,0x3
  0x024c8815: mov    eax,ecx
  0x024c8817: shl    eax,0x4
  0x024c881a: mov    ebx,ecx
  0x024c881c: shl    ebx,0x5
  0x024c881f: shl    ecx,0x6
  0x024c8822: mov    edi,0x150
  0x024c8827: mov    DWORD PTR [edi+0x562c3b0],ebp
                                        ;*putstatic field1
                                        ; - com.bempel.sandbox.TestJIT::assign@3 (line 26)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c882d: mov    ebp,0x160
  0x024c8832: mov    DWORD PTR [ebp+0x562c3b0],ebx
                                        ;*putstatic field5
                                        ; - com.bempel.sandbox.TestJIT::assign@27 (line 30)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c8838: mov    ebx,0x154
  0x024c883d: mov    DWORD PTR [ebx+0x562c3b0],edx
                                        ;*putstatic field2
                                        ; - com.bempel.sandbox.TestJIT::assign@9 (line 27)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c8843: mov    ebp,0x158
  0x024c8848: mov    DWORD PTR [ebp+0x562c3b0],esi
                                        ;*putstatic field3
                                        ; - com.bempel.sandbox.TestJIT::assign@15 (line 28)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c884e: mov    ebx,0x15c
  0x024c8853: mov    DWORD PTR [ebx+0x562c3b0],eax
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c8859: mov    ebp,0x164
  0x024c885e: mov    DWORD PTR [ebp+0x562c3b0],ecx
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c8864: lock add DWORD PTR [esp],0x0  ;*putstatic field1
                                        ; - com.bempel.sandbox.TestJIT::assign@3 (line 26)
  0x024c8869: add    esp,0x8
  0x024c886c: pop    ebp
  0x024c886d: test   DWORD PTR ds:0x180000,eax
                                        ;   {poll_return}
  0x024c8873: ret    

Now, field1 is really the first field assigned and field6 is the last one. But in between, the assignments happen in the following order: field5, field2, field3, field4.

Besides that, the CPU is able to reorder the instruction stream in certain circumstances, also to optimize the efficiency of instruction execution. Those properties are well summarized in the Intel white paper on Memory Ordering.

Write access

The previous example contains only volatile writes, and you can notice a special instruction: lock add. This looks a bit weird at first: we add 0 to the memory pointed to by the SP (Stack Pointer) register. So we are not changing the data at this memory location, but with the lock prefix, memory-related instructions are processed specially and act as a memory barrier, similar to the mfence instruction. Dave Dice explains in his blog that they benchmarked the two kinds of barriers, and lock add seems the most efficient one on today's architectures.

So this barrier ensures that no reordering happens across this instruction, and it also drains all pending writes from the store buffer. After executing this instruction, all writes are visible to all other threads through the cache subsystem or main memory. This costs some latency while waiting for the drain.


Sometimes we can relax the constraint of immediate visibility while still keeping the ordering guarantee. For this reason, Doug Lea introduced the lazySet method on the Atomic* classes. Let's use AtomicInteger in place of volatile int:

import java.util.concurrent.atomic.AtomicInteger;

public class TestJIT {
    private static AtomicInteger field1 = new AtomicInteger(0);
    private static int field2;
    private static int field3;
    private static int field4;
    private static int field5;
    private static AtomicInteger field6 = new AtomicInteger(0);

    public static void assign(int i) {
        field1.lazySet(i << 1);
        field2 = i << 2;
        field3 = i << 3;
        field4 = i << 4;
        field5 = i << 5;
        field6.lazySet(i << 6);
    }

    public static void main(String[] args) throws Throwable {
        for (int i = 0; i < 10000; i++) {
            assign(i);
        }
    }
}

We have the following output for PrintAssembly:

  # {method} 'assign' '(I)V' in 'com/bempel/sandbox/TestJIT'
  # this:     ecx       = 'com/bempel/sandbox/TestJIT'
  # parm0:    edx       = int
  #           [sp+0x10]  (sp of caller)
  0x024c7f40: cmp    eax,DWORD PTR [ecx+0x4]
  0x024c7f43: jne    0x024ace40         ;   {runtime_call}
  0x024c7f49: xchg   ax,ax
[Verified Entry Point]
  0x024c7f4c: mov    DWORD PTR [esp-0x3000],eax
  0x024c7f53: push   ebp
  0x024c7f54: sub    esp,0x8            ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::assign@-1 (line 30)
  0x024c7f5a: mov    ebp,edx
  0x024c7f5c: shl    ebp,1              ;*ishl
                                        ; - com.bempel.sandbox.TestJIT::assign@5 (line 30)
  0x024c7f5e: mov    ebx,0x150
  0x024c7f63: mov    ecx,DWORD PTR [ebx+0x56ec4b8]
                                        ;*getstatic field1
                                        ; - com.bempel.sandbox.TestJIT::assign@0 (line 30)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7f69: test   ecx,ecx
  0x024c7f6b: je     0x024c7fd0
  0x024c7f6d: mov    DWORD PTR [ecx+0x8],ebp  ;*invokevirtual putOrderedInt
                                        ; - java.util.concurrent.atomic.AtomicInteger::lazySet@8 (line 80)
                                        ; - com.bempel.sandbox.TestJIT::assign@6 (line 30)
  0x024c7f70: mov    edi,edx
  0x024c7f72: shl    edi,0x2
  0x024c7f75: mov    ebp,edx
  0x024c7f77: shl    ebp,0x3
  0x024c7f7a: mov    esi,edx
  0x024c7f7c: shl    esi,0x4
  0x024c7f7f: mov    eax,edx
  0x024c7f81: shl    eax,0x5
  0x024c7f84: shl    edx,0x6            ;*ishl
                                        ; - com.bempel.sandbox.TestJIT::assign@39 (line 35)
  0x024c7f87: mov    ebx,0x154
  0x024c7f8c: mov    ebx,DWORD PTR [ebx+0x56ec4b8]
                                        ;*getstatic field6
                                        ; - com.bempel.sandbox.TestJIT::assign@33 (line 35)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7f92: mov    ecx,0x158
  0x024c7f97: mov    DWORD PTR [ecx+0x56ec4b8],edi
                                        ;*putstatic field2
                                        ; - com.bempel.sandbox.TestJIT::assign@12 (line 31)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7f9d: mov    ecx,0x164
  0x024c7fa2: mov    DWORD PTR [ecx+0x56ec4b8],eax
                                        ;*putstatic field5
                                        ; - com.bempel.sandbox.TestJIT::assign@30 (line 34)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7fa8: mov    edi,0x15c
  0x024c7fad: mov    DWORD PTR [edi+0x56ec4b8],ebp
                                        ;*putstatic field3
                                        ; - com.bempel.sandbox.TestJIT::assign@18 (line 32)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7fb3: mov    ecx,0x160
  0x024c7fb8: mov    DWORD PTR [ecx+0x56ec4b8],esi
                                        ;*putstatic field4
                                        ; - com.bempel.sandbox.TestJIT::assign@24 (line 33)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7fbe: test   ebx,ebx
  0x024c7fc0: je     0x024c7fdd
  0x024c7fc2: mov    DWORD PTR [ebx+0x8],edx  ;*invokevirtual putOrderedInt
                                        ; - java.util.concurrent.atomic.AtomicInteger::lazySet@8 (line 80)
                                        ; - com.bempel.sandbox.TestJIT::assign@40 (line 35)
  0x024c7fc5: add    esp,0x8
  0x024c7fc8: pop    ebp
  0x024c7fc9: test   DWORD PTR ds:0x180000,eax
                                        ;   {poll_return}
  0x024c7fcf: ret    

So now there is no trace of memory barriers whatsoever: no mfence, no lock add instruction. But field1 and field6 keep their positions (first and last), while in between the order is field2, field5, field3, field4.
In fact, the lazySet method calls putOrderedInt on the Unsafe object, which does not emit a memory barrier but still guarantees that the store is not reordered.
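To see what this buys you in practice, here is a minimal usage sketch (class and method names are mine, assuming only the standard AtomicLong API): a single-writer sequence counter published with lazySet, which avoids the full barrier that a plain volatile store would pay on every update.

```java
import java.util.concurrent.atomic.AtomicLong;

// Typical single-writer pattern where lazySet shines: a progress/sequence
// counter written by one thread and read by others.
class Progress {
    private final AtomicLong sequence = new AtomicLong();

    void publish(long n) {
        sequence.lazySet(n); // ordered store: no lock add / mfence emitted
    }

    long current() {
        return sequence.get(); // plain load on x86, still a volatile read
    }
}
```

The trade-off is that readers may observe the new value slightly later than with set(), but never out of order with respect to earlier writes.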

Read Access

We will now examine the cost of a volatile read with this example:

public class TestJIT {
    private static volatile int field1 = 42;

    public static void testField1(int i) {
        if (field1 < 0) {
            System.out.println("field value: " + field1);
        }
    }

    public static void main(String[] args) throws Throwable {
        for (int i = 0; i < 10000; i++) {
            testField1(i);
        }
    }
}

The PrintAssembly output looks like:

  # {method} 'testField1' '(I)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = int
  #           [sp+0x10]  (sp of caller)
  0x024f8800: mov    DWORD PTR [esp-0x3000],eax
  0x024f8807: push   ebp
  0x024f8808: sub    esp,0x8            ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::testField1@-1 (line 22)
  0x024f880e: mov    ebx,0x150
  0x024f8813: mov    ecx,DWORD PTR [ebx+0x571c418]
                                        ;*getstatic field1
                                        ; - com.bempel.sandbox.TestJIT::testField1@0 (line 22)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024f8819: test   ecx,ecx
  0x024f881b: jl     0x024f8828         ;*ifge
                                        ; - com.bempel.sandbox.TestJIT::testField1@3 (line 22)
  0x024f881d: add    esp,0x8
  0x024f8820: pop    ebp
  0x024f8821: test   DWORD PTR ds:0x190000,eax
                                        ;   {poll_return}
  0x024f8827: ret    

No memory barrier here: only a load from a memory address into a register. This is all that is required for a volatile read; the JIT cannot optimize it by caching the value in a register. But thanks to the cache subsystem, the latency of a volatile read is very close to that of a regular variable read. If you test the same Java code without the volatile modifier, the result of this test is in fact the same.
Volatile reads also put some constraints on reordering.


Volatile access prevents instruction reordering, both at the compiler level and at the hardware level.
It ensures visibility: for writes, at the price of a memory barrier; for reads, by ruling out register caching and reordering.
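These two guarantees are exactly what the classic safe-publication idiom relies on. Here is a minimal self-contained sketch (class and method names are mine): a plain field whose visibility piggybacks on a volatile flag.

```java
public class VolatileFlag {
    private static volatile boolean ready;
    private static int payload; // plain field: visibility piggybacks on 'ready'

    public static int runOnce() {
        ready = false;
        Thread writer = new Thread(() -> {
            payload = 42;  // plain store
            ready = true;  // volatile store: cannot be reordered before 'payload'
        });
        writer.start();
        while (!ready) {   // volatile load: re-read from memory every iteration
            Thread.yield();
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return payload;    // guaranteed to be 42 once 'ready' was observed true
    }

    public static void main(String[] args) {
        System.out.println(runOnce());
    }
}
```

Without volatile on ready, the JIT would be free to hoist the flag read out of the loop, and the reader could spin forever or observe a stale payload.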


Saturday, May 23, 2015

Measuring contention on locks

Locks are one of the major scalability bottlenecks in your code. Lock-free structures offer an alternative to locks. However, they are sometimes more difficult to use or to integrate into existing code. Before rushing into this specific approach, it is worth determining which parts of your locking code would really benefit from it. A lock becomes a bottleneck when more than one thread tries to acquire it and has to wait for its release. This is contention. Measuring this contention helps us pinpoint which locks need to be improved.

You see, in this Java world there are two kinds of locks: those based on synchronized blocks, and those using java.util.concurrent.Lock. The first kind is handled directly by the JVM with the help of specific bytecodes (monitorenter & monitorexit). For those, the JVM provides, via JVMTI, events that native agents can use to get information about synchronized blocks: MonitorContendedEnter & MonitorContendedEntered.
Profilers like YourKit exploit this information to provide contention profiling.

Azul Systems provides, with their Zing JVM, a tool named ZVision that can also profile synchronized blocks.


But what about j.u.c.Lock?

Here, this is trickier: j.u.c.Lock is not handled by the JVM. It is part of the JDK classes, like any regular library. No special treatment, hence no special information about it.
YourKit is not able to profile these locks. However, another profiler exists which is able to profile j.u.c.Lock: JProfiler.



I suppose it uses instrumentation to insert callbacks when the j.u.c.Lock classes are used. The profiling information seems precise, with call stacks. It helps to identify which locks are effectively contended.

I have also found an article describing jucProfiler, a tool created by an IBM team.

Before finding those two tools, I made a very light and quick j.u.c.Lock profiler on my own. The technique is simple and I will describe it below. The full code is available on GitHub.
The goal was to profile existing code using j.u.c.Locks. The plan wasn't to create any wrapper around them, or any subtype, but to intercept the actual classes. So I copied the JDK classes and kept them in the same package.
I identified the places in the code where the lock fails to be acquired immediately. In those cases there is contention with another thread which already holds the lock.
Here is an extract from ReentrantLock:
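The original code extract did not survive in this copy; as a rough standalone sketch of the counting idea (a toy lock of my own, not the actual patched ReentrantLock sources on GitHub), the modification amounts to bumping a counter whenever the fast-path CAS fails because another thread already holds the lock:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy stand-in for the patched ReentrantLock: an acquisition is counted as
// contended whenever the fast-path CAS fails.
class CountingSpinLock {
    private final AtomicInteger state = new AtomicInteger(0);
    int contention; // plain int on purpose: approximate counts are acceptable

    boolean tryLock() {
        if (state.compareAndSet(0, 1)) {
            return true;      // uncontended fast path
        }
        contention++;         // fast path failed: the lock was already held
        return false;
    }

    void unlock() {
        state.set(0);
    }
}
```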

I incremented a contention field to keep track of the failed attempts. No need for an atomic counter here: if some contended acquisitions are not precisely reported, it is not a big deal.

Each lock instance is identified at construction time by a unique ID and a call stack, stored in a global static map.
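A hypothetical sketch of that global registry (class and field names are mine, not the actual profiler's): each lock gets a unique id, and the creation call stack is captured by instantiating a Throwable in the constructor path.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Global static registry: lock id -> call stack captured at construction.
class LockRegistry {
    private static final AtomicInteger NEXT_ID = new AtomicInteger();
    static final Map<Integer, Throwable> CREATION_SITES = new ConcurrentHashMap<>();

    // Called from the patched lock constructor; returns the lock's unique id.
    static int register() {
        int id = NEXT_ID.incrementAndGet();
        CREATION_SITES.put(id, new Throwable("creating lock " + id));
        return id;
    }
}
```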

When you want to report statistics about lock contention, you traverse the map and print the information about each lock, including the number of contentions detected.
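A sketch of that reporting step (method and map names are mine): walk the contention counters and emit one "id = count" line per lock.

```java
import java.util.Map;
import java.util.TreeMap;

// Formats contention statistics, one "id = count" line per lock,
// sorted by lock id for stable output.
class ContentionReport {
    static String format(Map<Integer, Integer> contentionById) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Integer, Integer> e : new TreeMap<>(contentionById).entrySet()) {
            sb.append(e.getKey()).append(" = ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }
}
```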

To use these classes instead of the ones from the JDK, you need to prepend the jar containing them to the bootclasspath. This way, your classes are looked up before the ones contained in rt.jar.
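A sketch of the command line this requires (the jar and class names are hypothetical; -Xbootclasspath/p is the HotSpot option available on Java 8 and earlier):

```shell
# Prepend the jar with the patched j.u.c classes so they shadow rt.jar's copies
java -Xbootclasspath/p:lock-profiler.jar -cp myapp.jar com.example.Main
```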


Here is an example of the output of the profiler:

The "creating lock n" line indicates where the lock with id n was created. The lines with an equals sign report, for each lock id (left of the equals), the number of times that lock was contended (right of the equals).
This helps you focus your work on the most contended locks first.

And remember: Measure, Don't Premature!

Thanks Georges & Aurélien for the review.