
Wednesday, December 30, 2015

PrintAssembly output explained!

If you are a regular reader of my blog, you may have noticed that I am (ab)using the PrintAssembly option of the JVM to examine the code generated by the JIT compiler. It helps me a lot to understand how my code is executed, and how the JIT compiler works and optimizes the code.
Even though I also use JMH from time to time, I am not a big fan of benchmarking, and especially micro-benchmarking.

Why? Because micro-benchmarking is an idealization of how the production code is executed: tight loops, all data hot in the L1 cache, few branch misses, the best case for aggressive JIT optimizations (like monomorphic calls, etc.).

The thing is, the execution context in production is totally different from micro-benchmarks, so what is the point of exercising code that will not be executed under the same conditions? Are the conclusions I can draw from the micro-benchmark still valid or beneficial for my production cases?
All of this pushes me away from micro-benchmarks as much as possible, and toward other ways to evaluate performance, like performance counters inserted directly into the application, or reading the assembly generated by the JIT compiler. Note that this is not perfect either, as nowadays CPUs execute out of order and perform instruction-level parallelism. So in some situations benchmarking is the only way to assess performance.

Printing assembly also helps me back assertions about how the JIT optimizes, instead of relying on folklore and urban legends (reordering of instructions, memory barriers, ...).

With all of that, PrintAssembly is one of my favorite tools. But I can understand that its output may be difficult to read. Nowadays, not all developers are familiar with assembly, unfortunately, but with some basic knowledge and with the help of the inserted comments, it can be less cryptic.

For those who have never used PrintAssembly, please refer to my previous posts about it: How to print disassembly from JIT code and How to build hsdis-amd64.dll. Chris Newland, creator of the JITWatch tool, also has some useful tips for Mac OS X, and Nitsan Wakart wrote an article on this.

Your setup is done? Perfect, let's read some assembly!

Assembly 101

First of all, I am using Intel syntax, not the AT&T one. I am used to this syntax, and since we are talking about the x86 instruction set made by Intel, let's stick to their convention.
Reminder: To get this syntax with the disassembler plugin, use the JVM option:
-XX:PrintAssemblyOptions=intel

Instruction lines are decomposed as follows:

mnemonic parameters
  • mnemonic is the instruction name (mov, call, add, jmp, cmp, ...)
  • parameters can be registers, memory accesses, or immediate values
Examples:

mov rax, 0x2A
mov rdx, QWORD PTR [rbx+0x571c418]

The mov instruction is a data movement. The first line moves the constant value 0x2A into the register rax.
The second line moves the memory content at the address computed from the value of register rbx plus the constant 0x571c418 into the register rdx. Note that the order is reversed in AT&T syntax.

  • push/pop instructions move data to/from the stack
  • add/sub/imul/idiv instructions perform addition/subtraction/multiplication/division on integers
  • inc/dec instructions increment/decrement a value in a register or in memory
  • and/or/xor/not/shl/shr instructions perform bitwise operations
  • jmp instruction performs an unconditional jump to the specified address
  • jxx instructions perform a conditional jump based on the result of the last related operation (see the example after this list)
  • cmp instruction performs a comparison between 2 operands
  • call/ret instructions perform a call to / return from a subroutine
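To make the mapping concrete, here is a small Java method and the kind of compare-and-branch sequence a JIT compiler might emit for it (shown as comments; the registers, labels and exact instructions are only indicative and will differ in real output):

static int max(int a, int b) {
    if (a > b) {
        return a;
    }
    return b;
}

// Possible shape of the generated x86 code (indicative only):
//   cmp  esi, edx        ; compare a and b
//   jle  NOT_GREATER     ; conditional jump if a <= b
//   mov  eax, esi        ; return value = a
//   ret
// NOT_GREATER:
//   mov  eax, edx        ; return value = b
//   ret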

For more information see this guide for example or the official Intel documentation.

Disassembler comments

Fortunately, the disassembler plugin does not just spit out raw instructions but annotates them with useful information.
Let's take an example with the method ArrayList.add(Object) and analyze it:
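As a reference point, here is roughly what the Java source of this method looks like in JDK 8 (quoted from memory, and the line numbers mentioned below depend on the exact JDK build):

public boolean add(E e) {
    ensureCapacityInternal(size + 1);  // Increments modCount!!
    elementData[size++] = e;
    return true;
}

private void ensureCapacityInternal(int minCapacity) {
    if (elementData == DEFAULTCAPACITY_EMPTY_ELEMENTDATA) {
        minCapacity = Math.max(DEFAULT_CAPACITY, minCapacity);
    }
    ensureExplicitCapacity(minCapacity);
}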

Header

In the header we can find the following information:
The first line is the name of the disassembled method, 'add', with its signature: one parameter of type Object, returning a boolean, from the class java.util.ArrayList.
But as this is an instance method, there are in fact 2 parameters, as mentioned in the rest of the header:
the parameter this, which is stored in register rdx, and the Object parameter in register r8.


Verified Entry Point

After the header, the first instructions of the method begin after the [Verified Entry Point] section. Assembly before this mark is there for alignment (padding). Starting from this section, we will look at the comments that appear after the semicolon. Comments starting with a star (*) indicate the associated bytecode.

Synchronization Entry

The following comment: ; - java.util.ArrayList::add@-1 (line 458) gives us the mapping to the Java code: class name, method name and bytecode offset into the method, and finally the line number in the original Java source file. For this prologue, as there is no specific bytecode associated, we get the -1 offset. The first one, ;*synchronization entry, indicates the prologue of the function: some instructions that are necessary to prepare the execution (stack allocation or stack banging, saving some registers, ...).


Get size field

The next comment retrieves the field named size from the current instance (ArrayList). It is translated to the following assembly line: mov r9d,DWORD PTR [rdx+0x10]
It moves into the r9 register the content at the address rdx (the this instance, cf. method parameters) + offset 0x10, where the size field is located.

Get elementData field

The following comment is interesting because we have the same type of bytecode, getfield, but the mapping to the Java code involves 2 methods: java.util.ArrayList::ensureCapacityInternal@1 (line 223) and java.util.ArrayList::add@7 (line 458). Implicitly, it means that the JIT has inlined the first method mentioned and the bytecode comes from this method.

Empty array test

{oop(a 'java/lang/Object'[0])} indicates an instance (oop) of type 'java/lang/Object'[0], that is, an empty Object array. This is in fact the constant empty-array instance against which we are comparing inside the inlined method ensureCapacityInternal.

More inlining

Here we have an additional level of inlining for the ensureExplicitCapacity method.

Implicit null check

A new kind of comment: here we have an implicit null check, because we are dereferencing the object array elementData to get its length (Java code: elementData.length). If elementData is null, the JVM must throw a NullPointerException. But, to avoid generating code for each object dereference, the JIT relies on OS signal handling for segfaults to handle this rare case. See my article on this technique.

Type Check

Let's skip some regular comments to stop on this one:
We are verifying that the class (metadata) of the elementData instance is an object array ('java/lang/Object'[]). To do this, we load the class pointer from the instance and compare it to the address of the class loaded by the JVM.

Card marking

Sometimes the comments are wrong: here this is not a synchronization entry, but a special operation called 'card marking' that is performed after a write of a reference into a field or a reference array (elementData in our case). The assembly generated for card marking is analyzed in this article. In this case we have card marking for an element of an array; for a regular instance field, the generated assembly is different.

Safepoint poll

Finally, the comment {poll_return} indicates that the instruction performs a safepoint check. You will see this at the end of all methods. For more details about safepoints, please read my article and a more detailed exploration of safepoints and their impact here.

Voilà! You have the basics to understand the disassembly output from the PrintAssembly option. Again, if you want to go further, I strongly recommend using the wonderful JITWatch tool.



Wednesday, October 14, 2015

assignment-with-use or inline assignment

I am still astonished by some persistent folklore and "rules of thumb" that find their origin in a time so ancient that most people have forgotten why they follow the rule.

Recently, I heard one of those rules of thumb:
If you want to keep object references in registers, you need to perform an inline assignment in an if statement, for example:
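Something like this hypothetical snippet (queue and process are just placeholders):

Object ref;
if ((ref = queue.poll()) != null) {
    process(ref);
}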

which is claimed to be more efficient than:
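That is, the same logic without the inline assignment (same placeholders):

Object ref = queue.poll();
if (ref != null) {
    process(ref);
}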
To back this statement, the JDK source code is cited as a best practice to follow, more precisely the LongAdder class:
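Its add method looks roughly like this in JDK 8 (reproduced from memory, so details may differ slightly from your JDK):

public void add(long x) {
    Cell[] as; long b, v; int m; Cell a;
    if ((as = cells) != null || !casBase(b = base, b + x)) {
        boolean uncontended = true;
        if (as == null || (m = as.length - 1) < 0 ||
            (a = as[getProbe() & m]) == null ||
            !(uncontended = a.cas(v = a.value, v + x)))
            longAccumulate(x, null, uncontended);
    }
}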

A nice example of inline assignment written by Doug Lea (or at least approved by him). Why not do like him? After all, he knows what he is doing, right?

I searched for an explanation of this style and found this thread, where Rémi Forax found this code "not very java-ish". Aurélien Gonnay also noticed a difference in style between the JDK code and the JSR one in the method sumThenReset.

So let's simplify this code while keeping the inline assignment style. I took a similar example from the previous thread with Rémi Forax:
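A hypothetical reconstruction in the same spirit (the field cells and the method use are made up; only the shape of the code matters here):

Object[] as;
int n;
Object a;
if ((as = cells) != null && (n = as.length) > 0 && (a = as[n - 1]) != null) {
    use(a);
}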

A more Java-ish version would be the following:
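Again a hypothetical equivalent, with the assignments pulled out of the condition:

Object[] as = cells;
if (as != null) {
    int n = as.length;
    if (n > 0) {
        Object a = as[n - 1];
        if (a != null) {
            use(a);
        }
    }
}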
First, we compare the generated bytecode for the inline assignments with the version without inline assignments.

Here is a diff as a summary:
3,4c3,4        
<  4: dup      
<  5: astore_1 
---            
>  4: astore_1 
>  5: aload_1  
8,9c8,9        
< 11: dup      
< 12: istore_2 
---            
> 11: istore_2 
> 12: iload_2  
18,19c18,19    
< 25: dup      
< 26: astore_3 
---            
> 25: astore_3 
> 26: aload_3  
Differences are only spotted at offsets 4, 11 and 25: instead of dup + store for the inline assignment, we get store + load, which is semantically the same. But is the first form more optimized? Let's check the assembly, compared to the version without inline assignments:

Here are the commands to diff without the addresses:
sed -e "s/0x[0-9a-f]*/0/g" inlineAssign1.txt > inlineAssign1_sed.txt

sed -e "s/0x[0-9a-f]*/0/g" inlineAssign2.txt > inlineAssign2_sed.txt

diff -w inlineAssign1_sed.txt inlineAssign2_sed.txt
1c1
<   # {method} {0} 'bench_c' '()V' in 'main/java/TestJIT'
---
>   # {method} {0} 'bench_java' '()V' in 'main/java/TestJIT'
7c7
<                                          ; - main.java.TestJIT::bench_c@-1 (line 16)
---
>                                          ; - main.java.TestJIT::bench_java@-1 (line 24)
10c10
<                                          ; - main.java.TestJIT::bench_c@1 (line 16)
---
>                                          ; - main.java.TestJIT::bench_java@1 (line 24)
13c13
<                                          ; - main.java.TestJIT::bench_c@10 (line 16)
---
>                                          ; - main.java.TestJIT::bench_java@10 (line 26)
17c17
<                                          ; - main.java.TestJIT::bench_c@13 (line 16)
---
>                                          ; - main.java.TestJIT::bench_java@13 (line 27)
23c23
<                                          ; - main.java.TestJIT::bench_c@23 (line 17)
---
>                                          ; - main.java.TestJIT::bench_java@23 (line 28)
29c29
<                                          ; - main.java.TestJIT::bench_c@24 (line 17)
---
>                                          ; - main.java.TestJIT::bench_java@24 (line 28)
33c33
<                                          ; - main.java.TestJIT::bench_c@27 (line 17)
---
>                                          ; - main.java.TestJIT::bench_java@27 (line 29)


You can check for yourself, line by line: each instruction is exactly the same in both versions; only the addresses differ, as well as the comments, because in my test the method names are not the same.
Conclusion? Inline assignments have no value in terms of performance; the JIT is sufficiently smart to handle this.

Note: I have used JDK 1.8.0_40 for this article.

Thanks to Aurélien Gonnay for the review.

Monday, September 21, 2015

Yoda conditions

This blog post is not about performance, but a rant. A rant about too many developers who blindly apply some "rules" without knowing why those practices exist in the first place.
Recently, I read this article from Lukas Eder which, in example number 4, recommends inverting the natural way to write conditions by putting the constant on the left side of the operator instead of the right side. The reason behind that is to avoid an accidental assignment if you miss one equals sign.

This practice (often called Yoda conditions) is quite old now. I encountered it for the first time when doing C++. The language (in fact, this is effectively C) allows inline assignments inside a condition because conditions are just expression evaluations (there was no boolean type in earlier versions of C): if the result of the evaluation is 0 it means false, otherwise it is true. So any other value (1, 2, 3, -1, 0x012345f, ...) is considered true.
Putting the constant on the left side prevents the expression from compiling in case of a missing equals sign, so it helps catch the mistake quickly.
In C/C++ this practice makes sense (discussing whether it is a good or bad practice in C/C++ is out of the scope of this post) because there is a rationale behind it.

In Java or C#, the story is totally different: conditions are strongly typed to boolean. Assigning null to a reference, or assigning to an integer variable, inside a condition leads to a compile error. Therefore, it does not make sense to follow this "rule" in those languages, as the mistake is inherently caught by the compiler thanks to the type system.
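A minimal illustration of the difference (the C behaviour is only mentioned in a comment; the Java code below compiles and runs as-is):

public class YodaConditions {
    public static void main(String[] args) {
        String name = null;
        int count = 0;

        // In C, "if (x = 42)" compiles and silently assigns; Yoda style "if (42 = x)" does not.
        // In Java, both of the following are rejected by the compiler:
        // if (name = null) { }   // error: String cannot be converted to boolean
        // if (count = 1)  { }    // error: int cannot be converted to boolean

        if (name == null && count == 0) {   // natural order, the type system has your back
            System.out.println("no Yoda condition needed");
        }
    }
}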



Takeaway from this: do not blindly follow "rules" without knowing their origin and what issues they are trying to address. Also, know your language and its type system. Rules like Yoda conditions fall away by themselves when you correctly leverage your language's type system.

Monday, September 14, 2015

Why Bios Settings matter (and not size)!

Recently, we have received new server machines based on Xeon E5v3 (Haswell). I have heard from different people that this CPU generation is very good, and figures are really impressive.
So I was pretty excited to test those new beasts on our standard benchmark to compare to the previous E5v2 (Ivy Bridge) we have.

Let's go for the first run:





Honestly, this is pretty baaaaad compared to what we got with the E5v2. It is so bad that I emailed my Production System Manager to find out what was going wrong. Usually, when we receive new machines, we apply a procedure to configure them correctly for our requirements. As we work in the low-latency space, we need to avoid many pitfalls like power management, NUMA, isolcpus, in order to get the best performance.
When I checked at the OS level, I noticed, for example, that cpuidle was active, which is not expected. On a well configured/tuned machine, cpuidle is not enabled. My suspicions went to a misconfigured BIOS. My PSM asked me to check the BIOS with a command line (which is pretty handy, no need to reboot the machine).
The usual features that we change are the following:
  • C states: disabled
  • C1E state: disabled
  • Power management profile: Maximum performance
  • Collaborative CPU performance: disabled

Bingo, the BIOS was not reconfigured following our standards! I applied them and re-ran our benchmark:


Latencies divided by 2 (or more), that's really better! But still slower than the E5v2. Let's recheck those BIOS settings one more time.

Why are power management features so bad for low latency? The thing is, waking up a component takes time, potentially hundreds of microseconds. For a process that we are measuring under 100 microseconds, this is huge.
For the CPU there are C-states, which are different sleep modes. On Linux, with C-states enabled at the BIOS level, you can see the associated latency:

cat /sys/devices/system/cpu/cpu0/cpuidle/state3/latency
200

It means that waking up a core from C3 to C0 (running) takes 200 microseconds!

With a new server generation come new features and options; maybe there is one that makes the difference.
I identified one that sounds pretty "bad" from a low-latency point of view: "uncore frequency" = dynamic.
Available options: dynamic/maximum.
Let's set it to maximum and run our benchmark:


Now we are talking! Results are better than the E5v2 (roughly +30%), which is REALLY impressive!
We have tested on an E5 2697 v2 @ 2.7GHz and on an E5 2697 v3 @ 2.6GHz. There is only 100MHz less on the v3, yet it is still 30% better on our benchmark.

Finally, there are some fun features we can play with in the BIOS settings: we can enable Turbo Boost and make the turbo frequency static by reducing the number of cores available in the CPU.
The E5v3 has 14 cores; let's cut this down to only 2 cores and pin the frequency permanently at 3.6GHz (yes, this is overclocking for servers!):


Compared to the default setup, we divided our latency by 4, just by tuning the BIOS properly!
My PSM asked me to email those results to make sure everybody on his team is aware of the importance of applying BIOS settings correctly on production servers.






Friday, August 28, 2015

Ordered Scheduler

Background


The LMAX Disruptor is a great piece of software engineering and may be used for many use cases. However, as I showed in my previous article on the cost of signalling, multiplying this kind of structure can lead to scheduler contention, because we increase the number of threads that may request a CPU time slice from the OS scheduler.

At Ullink we faced this issue in a very concrete case. This case requires keeping ordering while trying to parallelize tasks. At the same time, we would like to reduce the number of threads involved in our application.

The Disruptor could solve our case, but at the price of a consumer thread for each instance. As we may have hundreds of instances of this pattern in a running application, we would significantly increase the number of threads and therefore the pressure on the scheduler.

Georges Gomes, CTO of Ullink, found in an Intel article [1] an alternative approach to solving our problem. The article addresses the following issue: how do you parallelize some tasks while keeping ordering at the same time?

Here is an example: we have a video stream and we want to re-encode it. We read each frame from the input stream and re-encode it before writing it into an output stream.
Clearly, frame encoding can be parallelized: each frame is an independent source of data. The problem is that we need to keep the ordering of the frames, otherwise the video stream would make no sense!


[Figure: OrderedScheduler1.png]

To solve this problem, we need to use an order buffer:

[Figure: OrderedScheduler1.png]

This buffer holds items until the proper order is respected, then writes them into the output stream.
To ensure the proper ordering, each task grabs a ticket, on which the algorithm orders the processing.

Each thread is responsible for executing the algorithm and determining whether or not it is its turn to write into the output stream. If not, it leaves the item in the order buffer, and it is up to the thread of the previous item (in order) to take care of the leftover item. This means we do not have a dedicated consumer thread, which leads to the following advantages:


  • No inter-thread communication overhead (signalling: wait/notify, await/signal)
  • No additional thread
  • No wait strategy (spinning, back-off, ...)

We can then multiply this kind of structure without fear of significantly increasing the number of threads running in the application. Bonus: we reduce the need for the synchronization that was previously required to ensure ordering.

Algorithm

When tasks are created/dispatched, we need to keep track of the order in which they were created by assigning a ticket to each of them. This ticket is just a sequence number that we keep incrementing.
For the order buffer, we create a ring-buffer-style structure with a tail counter.

[Figure: OrderedScheduler1.png]


When one of the tasks enters the structure, we check whether its ticket number matches the tail counter. If it matches, we process the task on the current (entering) thread and increment the tail.
[Figure: OrderedScheduler2.png]

Next, the thread holding ticket number 3 arrives (by the game of scheduling). Here the ticket number does not match the current tail (2).

[Figure: OrderedScheduler3.png]
We then put the task we would like to process into the ring buffer at the index of its ticket number (modulo the length of the ring buffer), and the thread is free to go.

[Figure: OrderedScheduler4.png]
Finally, the thread with ticket number 2 arrives. It matches the tail, so we can process the task immediately and increment the tail.


[Figure: OrderedScheduler6.png]

We then examine whether there are still tasks at the next index after the tail. The task with ticket number 3, left there by the previous thread, is waiting, and it is now its turn, so we reuse the current thread to execute it.

And we're done!
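Here is a minimal sketch of this algorithm, assuming a power-of-two ring buffer sized well above the maximum number of in-flight tickets and non-null items (this is an illustration of the idea, not the actual Ullink implementation, which also handles abandoned tickets via a trash operation and buffer overflow):

import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;
import java.util.function.Consumer;

public class OrderedPipe<T> {
    private final AtomicLong nextTicket = new AtomicLong(); // sequence assigned to tasks
    private final AtomicLong tail = new AtomicLong();       // ticket allowed to be output next
    private final AtomicReferenceArray<T> buffer;           // the order buffer (ring)
    private final int mask;
    private final Consumer<T> output;                       // ordered consumer (e.g. frame writer)

    public OrderedPipe(int sizePowerOfTwo, Consumer<T> output) {
        this.buffer = new AtomicReferenceArray<>(sizePowerOfTwo);
        this.mask = sizePowerOfTwo - 1;
        this.output = output;
    }

    // Called when the task is created/dispatched, to capture the input order.
    public long ticket() {
        return nextTicket.getAndIncrement();
    }

    // Called by the worker thread once its result is ready.
    public void run(long ticket, T item) {
        if (tail.get() != ticket) {
            buffer.set((int) ticket & mask, item);          // not our turn: leave it in the buffer
            if (tail.get() != ticket) {
                return;                                     // the thread owning the tail will pick it up
            }
            item = buffer.getAndSet((int) ticket & mask, null);
            if (item == null) {
                return;                                     // the tail thread already picked it up
            }
        }
        // Our ticket matches the tail: process it, then drain any consecutive leftovers.
        while (true) {
            output.accept(item);                            // strictly in ticket order
            ticket++;
            tail.set(ticket);                               // allow the next ticket to proceed
            item = buffer.getAndSet((int) ticket & mask, null);
            if (item == null) {
                return;                                     // next item not there yet: its thread will handle it
            }
        }
    }
}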

Ordered Scheduler

If you have code like this:
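For instance, something like this hypothetical fragment, where a lock serializes both the expensive work and the ordered write (Frame, encode and write are placeholders):

class VideoPipeline {
    private final Object lock = new Object();

    void onFrame(Frame frame) {                  // called concurrently by worker threads
        synchronized (lock) {
            byte[] encoded = encode(frame);      // expensive work, done under the lock
            write(encoded);                      // must happen in input order
        }
    }

    byte[] encode(Frame frame) { return new byte[0]; }   // placeholder
    void write(byte[] encoded) { }                       // placeholder
}
class Frame { }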

You need to synchronize to ensure your code keeps the required ordering when writing to the output.
With Ordered Scheduler you can change your code to:
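Using the OrderedPipe sketch from above as a stand-in for the ordered scheduler (the real open-sourced API differs), that could become:

class ParallelVideoPipeline {
    private final OrderedPipe<byte[]> pipe = new OrderedPipe<>(1024, this::write);
    private final Object lock = new Object();

    void onFrame(Frame frame) {                  // called concurrently by worker threads
        long ticket;
        synchronized (lock) {                    // only to bind the input to its ticket
            ticket = pipe.ticket();
        }
        byte[] encoded = encode(frame);          // expensive work now runs in parallel
        pipe.run(ticket, encoded);               // writes still happen in ticket order
    }

    byte[] encode(Frame frame) { return new byte[0]; }   // placeholder
    void write(byte[] encoded) { }                       // placeholder
}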

The synchronized block is still required in this example to ensure the input is attached to the right ticket. You could do it differently by reading your input in a single thread, taking the ticket there, and then dispatching your processing tasks to a thread pool. Your call to the ordered scheduler will then serialize your output.

What is important to keep in mind is that you cannot miss a ticket. This is why, in this example, if an exception is thrown during processing, we call the trash method to inform the ordered scheduler that the ticket is no longer valid; otherwise it would wait forever for that ticket to come into the scheduler.

The implementation is open sourced on GitHub.

References

[1] https://software.intel.com/en-us/articles/exploiting-data-parallelism-in-ordered-data-streams

Monday, July 13, 2015

Notify... oh, wait! I have a signal.

Introduction


When you want to pipeline, delegate some tasks asynchronously, or simply synchronize 2 threads, you usually end up using the wait/notify couple (or even await/signal, depending on your taste).

But what is the cost or the overhead of this kind of pattern?

Under the hood

What happens when we use the wait/notify couple?
I simplify here to this couple, as the other (await/signal) calls the same set of underlying methods:


                 Linux                     Windows
Object.notify    pthread_cond_signal       SetEvent
Object.wait      pthread_cond_timedwait    WaitForSingleObject

Basically, we are performing system calls. For Object.wait, we ask the OS scheduler to move the current thread to the wait queue.


For Object.notify, we ask the scheduler (via futexes[1] on Linux) to move one of the waiting threads from the wait queue to the run queue to be scheduled when possible.

Just a quick remark about system calls: contrary to common belief, system calls do not imply context switches[2]. It depends on the kernel implementation. On Linux there is no context switch unless the system call implementation requires it, as for IO. In the case of pthread_cond_signal, there is no context switch involved.

Knowing that, what is the cost of calling notify for a producer thread?

Measure, don't guess!

Why not build a micro-benchmark? Because I do not care about average latency; I care about outliers and spikes: how it behaves 50, 90, 95, 99, 99.9% of the time, and what maximum I may observe.
Let's measure it with HdrHistogram from Gil Tene and the following code:
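A minimal sketch of the measurement (the class, thread and parameter names are illustrative, not the original code), assuming HdrHistogram is on the classpath:

import java.util.concurrent.locks.LockSupport;
import org.HdrHistogram.Histogram;

public class NotifyCost {
    public static void main(String[] args) throws InterruptedException {
        int pairs = Integer.parseInt(args[0]);        // e.g. 16
        int iterations = Integer.parseInt(args[1]);   // e.g. 1000
        Histogram histogram = new Histogram(10_000_000_000L, 3);  // ns values, 3 significant digits

        Thread[] criticals = new Thread[pairs];
        for (int i = 0; i < pairs; i++) {
            final Object lock = new Object();

            Thread flusher = new Thread(() -> {       // the "flushing" side: waits for data
                while (!Thread.currentThread().isInterrupted()) {
                    synchronized (lock) {
                        try {
                            lock.wait(1);             // timed wait so the sketch cannot hang
                        } catch (InterruptedException e) {
                            return;
                        }
                    }
                }
            });
            flusher.setDaemon(true);
            flusher.start();

            criticals[i] = new Thread(() -> {         // the "critical" side: measures notify
                for (int n = 0; n < iterations; n++) {
                    long duration;
                    synchronized (lock) {
                        long start = System.nanoTime();
                        lock.notify();                // the call we are measuring
                        duration = System.nanoTime() - start;
                    }
                    synchronized (histogram) {        // Histogram is not thread-safe
                        histogram.recordValue(duration);
                    }
                    LockSupport.parkNanos(100_000);   // pacing: let the partner re-enter wait()
                }
            });
            criticals[i].start();
        }
        for (Thread t : criticals) {
            t.join();
        }

        System.out.println("count: " + histogram.getTotalCount());
        System.out.println("min: "   + histogram.getMinValue());
        System.out.println("max: "   + histogram.getMaxValue());
        System.out.println("mean: "  + histogram.getMean());
        for (double p : new double[] {50, 90, 95, 99, 99.9}) {
            System.out.println(p + "%: " + histogram.getValueAtPercentile(p));
        }
    }
}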



This code basically creates n pairs of threads: one (critical) trying to notify the second (flushing) that data is available to be processed (or flushed).
I run this code with the following parameters: 16 1000. It means that we have 16 pairs of threads doing wait/notify.

Results on Windows (ns):
count: 16000
min: 0
max: 55243
mean: 549.5238125
50%: 302
90%: 1208
95%: 1812
99%: 3019
99.9%: 11472




Results on Linux (ns):
count: 16000
min: 69
max: 20906
mean: 1535.5181875
50%: 1532
90%: 1790
95%: 1888
99%: 2056
99.9%: 3175



So most of the time we observe a couple of microseconds for a call to notify. But in some cases we can reach 50us! For low-latency systems this can be an issue and a source of outliers.

Now, if we push our test program a little to use 256 pairs of threads, we end up with the following results:

Results on Windows (ns):
count: 256000
min: 0
max: 1611088
mean: 442.25016015625
50%: 302
90%: 907
95%: 1208
99%: 1811
99.9%: 2717

Results on Linux (ns):
count: 256000
min: 68
max: 1590240
mean: 1883.61266015625
50%: 1645
90%: 2367
95%: 2714
99%: 7762
99.9%: 15230

A notify call can take 1.6ms!

Even though there is no contention in this code per se, there is another kind of contention happening in the kernel: the scheduler needs to arbitrate which thread can run. Having 256 threads trying to wake up their partner thread puts a lot of pressure on the scheduler, which becomes a bottleneck here.

Conclusion

Signalling can be a source of outliers, not because we have contention between threads executing code, but because the OS scheduler needs to arbitrate among those threads when responding to wake-up requests.

References

[1] Futexes are tricky, U. Drepper: http://www.akkadia.org/drepper/futex.pdf
[2] http://en.wikipedia.org/wiki/System_call#Processor_mode_and_context_switching

Tuesday, July 7, 2015

WhiteBox API

I had already seen this in JCStress, but it was a post from Rémi Forax on the mechanical sympathy forum that brought it to my attention, when I saw what is possible to do with it. Here is a summary:




This API is usable since JDK 8, and there are some new additions in JDK 9.

But how do you use it?
This API is not part of the standard API but of the test library from OpenJDK. You can find it here.
Download the OpenJDK sources, then either build it entirely and grab the wb.jar, or:

  1. go to test/testlibrary/whitebox directory
  2. javac -sourcepath . -d . sun\hotspot\**.java
  3. jar cf wb.jar .

Place your wb.jar next to your application and launch it with:

java -Xbootclasspath/a:wb.jar -XX:+UnlockDiagnosticVMOptions -XX:+WhiteBoxAPI ...

Here is an example you can run with the WhiteBox jar:
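Something along these lines (the WhiteBox method names are recalled from the OpenJDK test library and may differ between JDK versions):

import sun.hotspot.WhiteBox;

public class WhiteBoxGCTest {
    public static void main(String[] args) {
        WhiteBox wb = WhiteBox.getWhiteBox();

        Object obj = new Object();
        System.out.println("in old gen after allocation: " + wb.isObjectInOldGen(obj));

        wb.youngGC();   // first minor GC: obj survives and reaches the tenuring threshold
        wb.youngGC();   // second minor GC: obj should be promoted to the old generation
        System.out.println("in old gen after 2 young GCs: " + wb.isObjectInOldGen(obj));
    }
}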


For this example you need to add -XX:MaxTenuringThreshold=1 to make it work as expected.
Now you have an API to trigger minor GCs and test whether an object resides in the old generation, pretty awesome!

You can also trigger JIT compilation on demand for some methods and change VM flags on the fly:
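For example, a sketch like the following (again, the exact method names and signatures come from memory of the test library and may vary by JDK version):

import java.lang.reflect.Method;
import sun.hotspot.WhiteBox;

public class WhiteBoxJitTest {
    static int workload(int x) { return x * 31 + 7; }

    public static void main(String[] args) throws Exception {
        WhiteBox wb = WhiteBox.getWhiteBox();
        Method m = WhiteBoxJitTest.class.getDeclaredMethod("workload", int.class);

        for (int i = 0; i < 1_000; i++) {
            workload(i);                          // warm the method up a little
        }
        System.out.println("compiled before: " + wb.isMethodCompiled(m));

        wb.enqueueMethodForCompilation(m, 4);     // 4 = C2 compilation level
        while (!wb.isMethodCompiled(m)) {
            Thread.sleep(10);                     // wait for the compiler thread to finish
        }
        System.out.println("compiled after: " + wb.isMethodCompiled(m));

        // Change a VM flag on the fly
        wb.setBooleanVMFlag("PrintCompilation", true);
        System.out.println("PrintCompilation: " + wb.getBooleanVMFlag("PrintCompilation"));
    }
}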



Unlike Unsafe, this API seems difficult to use in a production environment, but at least you can have fun in the lab or use it, like the OpenJDK does, for your low-level tests. Enjoy!