Tuesday, December 17, 2013

ArrayList vs LinkedList

This post was originally posted on Java Advent Calendar

I must confess the title of this post is a little bit catchy. I recently read this blog post, which is a good summary of the discussions & debates on this subject.
But this time I would like to try a different approach to compare those 2 well-known data structures: using Hardware Performance Counters.

I will not perform a micro-benchmark, at least not directly: I will not measure with System.nanoTime(), but rather with HPCs like cache hits/misses.

No need to present those data structures: everybody knows what they are used for and how they are implemented. I am focusing my study on list iteration because, besides adding an element, this is the most common task for a list, and also because the memory access pattern of a list is a good example of CPU cache interaction.


Here is my code for measuring list iteration for LinkedList & ArrayList:

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ListIteration
{
    private static List<String> arrayList = new ArrayList<>();
    private static List<String> linkedList = new LinkedList<>();

    public static void initializeList(List<String> list, int bufferSize)
    {
        for (int i = 0; i < 50000; i++)
        {
            byte[] buffer = null;
            if (bufferSize > 0)
            {
                buffer = new byte[bufferSize];
            }
            String s = String.valueOf(i);
            list.add(s);
            // avoid buffer to be optimized away
            if (System.currentTimeMillis() == 0)
            {
                System.out.println(buffer);
            }
        }
    }

    public static void bench(List<String> list)
    {
        if (list.contains("bar"))
        {
            System.out.println("bar found");
        }
    }

    public static void main(String[] args) throws Exception
    {
        if (args.length != 2) return;
        List<String> benchList = "array".equals(args[0]) ? arrayList : linkedList;
        int bufferSize = Integer.parseInt(args[1]);
        initializeList(benchList, bufferSize);
        HWCounters.init();
        System.out.println("init done");
        // warmup
        for (int i = 0; i < 10000; i++)
        {
            bench(benchList);
        }
        Thread.sleep(1000);
        System.out.println("warmup done");

        HWCounters.start();
        for (int i = 0; i < 1000; i++)
        {
            bench(benchList);
        }
        HWCounters.stop();
        HWCounters.printResults();
        HWCounters.shutdown();
    }
}

To measure, I am using a class called HWCounters, based on the overseer library, to read Hardware Performance Counters. You can find the code of this class here.

The program takes 2 parameters: the first one chooses between the ArrayList and LinkedList implementations, the second one is a buffer size used in the initializeList method. This method fills a list implementation with 50K strings. Each string is newly created just before being added to the list. We may also allocate a buffer, sized by the second program parameter, between each addition; if 0, no buffer is allocated.
The bench method searches for a constant string which is not contained in the list, so we fully traverse the list.
Finally, the main method performs the initialization of the list, warms up the bench method and measures 1000 runs of this method. Then, we print the results from the HPCs.

Let's run our program with no buffer allocation on Linux with 2 Xeon X5680:

[root@archi-srv]# java -cp .:overseer.jar com.ullink.perf.myths.ListIteration array 0
init done
warmup done
Cycles: 428,711,720
Instructions: 776,215,597
L2 hits: 5,302,792
L2 misses: 23,702,079
LLC hits: 42,933,789
LLC misses: 73
CPU migrations: 0
Local DRAM: 0
Remote DRAM: 0

[root@archi-srv]# java -cp .:overseer.jar com.ullink.perf.myths.ListIteration linked 0
init done
warmup done
Cycles: 767,019,336
Instructions: 874,081,196
L2 hits: 61,489,499
L2 misses: 2,499,227
LLC hits: 3,788,468
LLC misses: 0
CPU migrations: 0
Local DRAM: 0
Remote DRAM: 0

First run is on the ArrayList implementation, second with LinkedList.

  • The number of cycles is the number of CPU cycles spent executing our code. Clearly, LinkedList spends many more cycles than ArrayList.
  • The instruction count is a little higher for LinkedList, but it is not significant here.
  • For L2 cache accesses we have a clear difference: ArrayList has significantly more L2 misses than LinkedList.
  • Correspondingly, LLC hits are much higher for ArrayList.

The conclusion of this comparison is that most of the data accessed during list iteration sits in L2 for LinkedList but in L3 (LLC) for ArrayList.
My explanation is that each string added to the list is created right before the add. For LinkedList this means the string is allocated next to the Node entry created when adding the element, so we get strong locality between a node and its payload.
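
To make this concrete, here is a minimal sketch of the heap layout I am assuming (TLAB bump-the-pointer allocation in the young generation; the exact ordering of a String and its backing char[] may vary):

// bufferSize == 0 (LinkedList case): each add() allocates the String and then
// its Node, so a node and its payload end up side by side:
//   ... [String i] [char[] i] [Node i] [String i+1] [char[] i+1] [Node i+1] ...
// bufferSize == 256: a dead 256-byte buffer is allocated between every element,
// spreading nodes and strings over many more cache lines:
//   ... [byte[256]] [String i] [char[] i] [Node i] [byte[256]] [String i+1] ...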

But let's re-run the comparison with an intermediary buffer allocated before each new String is added.


[root@archi-srv]# java -cp .:overseer.jar com.ullink.perf.myths.ListIteration array 256
init done
warmup done
Cycles: 584,965,201
Instructions: 774,373,285
L2 hits: 952,193
L2 misses: 62,840,804
LLC hits: 63,126,049
LLC misses: 4,416
CPU migrations: 0
Local DRAM: 824
Remote DRAM: 0

[root@archi-srv]# java -cp .:overseer.jar com.ullink.perf.myths.ListIteration linked 256
init done
warmup done
Cycles: 5,289,317,879
Instructions: 874,350,022
L2 hits: 1,487,037
L2 misses: 75,500,984
LLC hits: 81,881,688
LLC misses: 5,826,435
CPU migrations: 0
Local DRAM: 1,645,436
Remote DRAM: 1,042

Here the results are quite different:

  • Cycles for LinkedList are now almost 10 times higher than for ArrayList.
  • Instruction counts stay the same as previously.
  • For cache accesses, ArrayList has more L2 misses/LLC hits than in the previous run, but still in the same order of magnitude. LinkedList, on the contrary, has far more L2 misses/LLC hits, and moreover a significant number of LLC misses/DRAM accesses. That is where the difference lies.

With the intermediary buffers, we push the node entries and strings away from each other, which generates more cache misses and, in the end, DRAM accesses, which are much slower than cache hits.
ArrayList is more predictable here since we keep the elements local to each other.

The memory access pattern is crucial for list iteration performance. ArrayList is more stable than LinkedList in the sense that, whatever you do between adding elements, your data stays much more local than with a LinkedList.
Remember also that iterating over an array is much more efficient for the CPU, since its very predictable access pattern triggers hardware prefetching.
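
As a reminder of why the access patterns differ, here is a simplified sketch of what each iteration does under the hood (hedged: not the actual JDK code, fields abbreviated):

static Object[] elementData;   // ArrayList's backing array
static int size;

static void scanArray()
{
    // sequential scan: the address of element i+1 is computable in advance,
    // so the hardware prefetcher can stream the array through the caches
    for (int i = 0; i < size; i++)
    {
        Object e = elementData[i];
    }
}

static class Node
{
    Object item;
    Node next;
}
static Node first;

static void scanLinked()
{
    // pointer chasing: the next address is only known once the current node
    // has been loaded, so hardware prefetching cannot help
    for (Node n = first; n != null; n = n.next)
    {
        Object e = n.item;
    }
}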



Wednesday, October 30, 2013

Hardware Performance Counters: atomic vs standard incrementation

In a previous post on Hardware Performance Counters, I said that the main problem with the perf command is that only programs launched from the command line can be monitored this way.
It means that when you want to monitor HPCs for a Java program, you also monitor the JVM and the startup phase, including warmup. Obviously this negatively affects the numbers compared to the steady state of the application, when the code is JITed.

The perfect illustration of this flaw is presented in this blog post. In an attempt to understand the cost of a memory barrier, the author uses the perf command in a micro-benchmark comparing a simple increment of a regular variable to an increment of an AtomicInteger. The author is fully aware that this methodology contains flaws:

We must be careful interpreting these results, because they are polluted with data not related to our program under test, like the JVM startup sequence.
To avoid this bias in the micro-benchmark, we can use the HPCs more precisely with the help of the overseer library presented in my previous post.
I have written a similar micro-benchmark with overseer and tried to eliminate all the startup & warmup bias from the measurement.

Here is the code:

import ch.usi.overseer.OverHpc;
import java.util.concurrent.atomic.AtomicInteger;

public class Increment
{
    private static String[] EVENTS = {
        "UNHALTED_CORE_CYCLES",
        "INSTRUCTION_RETIRED",
        "L2_RQSTS:LD_HIT",
        "L2_RQSTS:LD_MISS",
        "LLC_REFERENCES",
        "MEM_LOAD_RETIRED:LLC_MISS",
        "PERF_COUNT_SW_CPU_MIGRATIONS",
        "MEM_UNCORE_RETIRED:LOCAL_DRAM_AND_REMOTE_CACHE_HIT",
        "MEM_UNCORE_RETIRED:REMOTE_DRAM"
    };

    private static String[] EVENTS_NAME = {
        "Cycles",
        "Instructions",
        "L2 hits",
        "L2 misses",
        "LLC hits",
        "LLC misses",
        "CPU migrations",
        "Local DRAM",
        "Remote DRAM"
    };

    private static long[] results = new long[EVENTS.length];
    private static OverHpc oHpc = OverHpc.getInstance();

    static int counter;
    static AtomicInteger atomicCounter = new AtomicInteger();

    static void stdIncrement()
    {
        counter++;
    }

    static void atomicIncrement()
    {
        atomicCounter.incrementAndGet();
    }

    static void benchStdIncrement(int loopCount)
    {
        for (int i = 0; i < loopCount; i++)
        {
            stdIncrement();
        }
    }

    static void benchAtomicIncrement(int loopCount)
    {
        for (int i = 0; i < loopCount; i++)
        {
            atomicIncrement();
        }
    }

    public static void main(String[] args) throws Exception
    {
        boolean std = args.length > 0 && args[0].equals("std");
        StringBuilder sb  = new StringBuilder();
        for (int i = 0; i < EVENTS.length; i++)
        {
            if (i > 0)
            {
                sb.append(",");
            }
            sb.append(EVENTS[i]);
        }
        oHpc.initEvents(sb.toString());
        // warmup
        if (std)
        {
            benchStdIncrement(10000);
            benchStdIncrement(10000);
        }
        else
        {
            benchAtomicIncrement(10000);
            benchAtomicIncrement(10000);
        }
        Thread.sleep(1000);
        System.out.println("warmup done");

        int tid = oHpc.getThreadId();
        oHpc.bindEventsToThread(tid);
        // bench
        if (std)
            benchStdIncrement(5*1000*1000);
        else
            benchAtomicIncrement(5*1000*1000);

        for (int i = 0; i < EVENTS.length; i++)
        {
            results[i] = oHpc.getEventFromThread(tid, i);
        }
        for (int i = 0; i < EVENTS.length; i++)
        {
            System.out.println(EVENTS_NAME[i] + ": " + String.format("%,d", results[i]));
        }
        OverHpc.shutdown();
    }
}

Depending on the argument, I call the loop method twice with 10,000 iterations to make sure everything is JIT-compiled before starting the counters. I verified this with -XX:+PrintCompilation.
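
For the record, the verification run looks like this (a hedged example, assuming the same classpath as the runs below):

java -XX:+PrintCompilation -cp .:overseer.jar Increment std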

So now we can run it and compare the raw results for 3 runs each:

[root@archi-srv myths]# java -cp .:overseer.jar Increment atomic
warmup done
Cycles: 150,054,450
Instructions: 59,961,219
L2 hits: 245
L2 misses: 61
LLC hits: 164
LLC misses: 0
CPU migrations: 0
Local DRAM: 10
Remote DRAM: 0
[root@archi-srv myths]# java -cp .:overseer.jar Increment atomic
warmup done
Cycles: 149,907,417
Instructions: 59,968,058
L2 hits: 54
L2 misses: 38
LLC hits: 105
LLC misses: 9
CPU migrations: 0
Local DRAM: 11
Remote DRAM: 0
[root@archi-srv myths]# java -cp .:overseer.jar Increment atomic
warmup done
Cycles: 149,890,503
Instructions: 59,964,047
L2 hits: 372
L2 misses: 30
LLC hits: 85
LLC misses: 0
CPU migrations: 0
Local DRAM: 7
Remote DRAM: 0

[root@archi-srv myths]# java -cp .:overseer.jar Increment std
warmup done
Cycles: 1,217,065
Instructions: 3,155,346
L2 hits: 288
L2 misses: 518
LLC hits: 1,452
LLC misses: 0
CPU migrations: 0
Local DRAM: 0
Remote DRAM: 0
[root@archi-srv myths]# java -cp .:overseer.jar Increment std
warmup done
Cycles: 1,367,222
Instructions: 3,155,345
L2 hits: 409
L2 misses: 387
LLC hits: 1,153
LLC misses: 0
CPU migrations: 0
Local DRAM: 0
Remote DRAM: 0
[root@archi-srv myths]# java -cp .:overseer.jar Increment std
warmup done
Cycles: 1,366,928
Instructions: 3,155,345
L2 hits: 374
L2 misses: 423
LLC hits: 1,183
LLC misses: 0
CPU migrations: 0
Local DRAM: 0
Remote DRAM: 0

As you can see, there are differences:
  • more cycles for the atomic version
  • more instructions for the atomic version
  • a little more L2 misses for the std version
  • more LLC hits for the std version
  • Local DRAM accesses for the atomic version
We can also notice that the atomic version needs roughly 2.5 cycles per instruction (≈150M cycles / 60M instructions), while the std version retires more than 2 instructions per cycle (≈3.16M instructions / 1.37M cycles)...

Still, the conclusion remains the same: we spend more instructions & more cycles with the atomic version. We also have more accesses to DRAM, which has significantly more latency than the caches.
But now we are sure that those measurements are more accurate than ones that include the warmup phase.

Stay tuned for another adventure in HPC land...

Tuesday, September 3, 2013

Null check elimination

Among all the kinds of optimizations performed by the JIT, one of the most frequently cited examples is null check elimination. Maybe because dealing with null is such a common task in Java.

Regarding performance, what is the cost of null checking anyway? Let's take a basic example:

import java.util.ArrayList;
import java.util.List;

public class NullCheck
{
    private static void bench(List list)
    {
        list.contains("foo");
    }
    
    public static void main(String[] args) throws InterruptedException
    {
        List list = new ArrayList();
        list.add("bar");
        for (int i = 0; i < 10000; i++)
        {
            bench(list);
        }
        Thread.sleep(1000);
    }
}

A classic example where, in the bench method, I call the contains method on the list instance passed as parameter. As usual, we examine the PrintAssembly output of this method (the 64-bit version now, since I have my hsdis-amd64.dll ;-)):

[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
  # {method} 'bench' '(Ljava/util/List;)V' in 'com/bempel/sandbox/NullCheck'
  # parm0:    rdx:rdx   = 'java/util/List'
  #           [sp+0x40]  (sp of caller)
  0x00000000025302a0: mov    DWORD PTR [rsp-0x6000],eax
  0x00000000025302a7: push   rbp
  0x00000000025302a8: sub    rsp,0x30           ;*synchronization entry
                                                ; - com.bempel.sandbox.NullCheck::bench@-1 (line 10)
  0x00000000025302ac: mov    r8,rdx
  0x00000000025302af: mov    r10,QWORD PTR [rdx+0x8]  ; implicit exception: dispatches to 0x0000000002530621
  0x00000000025302b3: movabs r11,0x5777e60      ;   {oop('java/util/ArrayList')}
  0x00000000025302bd: cmp    r10,r11
  0x00000000025302c0: jne    0x00000000025305fc  ;*invokeinterface contains
                                                ; - com.bempel.sandbox.NullCheck::bench@3 (line 10)
[...]

Nothing fancy here: the contains call is inlined, preceded by a test of the instance against the ArrayList class, to make sure we can safely apply the devirtualization of the call, as explained here.

If we now change our example to add a null check on the list instance passed as parameter:

import java.util.ArrayList;
import java.util.List;

public class NullCheck
{
    private static void bench(List list)
    {
        if (list != null)
        {
            list.contains("foo");
        }
    }
    
    public static void main(String[] args) throws InterruptedException
    {
        List list = new ArrayList();
        list.add("bar");
        for (int i = 0; i < 10000; i++)
        {
            bench(list);
        }
        Thread.sleep(1000);
    }
}

The output is in fact exactly the same.

[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
  # {method} 'bench' '(Ljava/util/List;)V' in 'com/bempel/sandbox/NullCheck'
  # parm0:    rdx:rdx   = 'java/util/List'
  #           [sp+0x40]  (sp of caller)
  0x00000000025402a0: mov    DWORD PTR [rsp-0x6000],eax
  0x00000000025402a7: push   rbp
  0x00000000025402a8: sub    rsp,0x30           ;*synchronization entry
                                                ; - com.bempel.sandbox.NullCheck::bench@-1 (line 10)
  0x00000000025402ac: mov    r8,rdx
  0x00000000025402af: mov    r10,QWORD PTR [rdx+0x8]  ; implicit exception: dispatches to 0x000000000254062d
  0x00000000025402b3: movabs r11,0x5787e60      ;   {oop('java/util/ArrayList')}
  0x00000000025402bd: cmp    r10,r11
  0x00000000025402c0: jne    0x00000000025405fc  ;*invokeinterface contains
                                                ; - com.bempel.sandbox.NullCheck::bench@7 (line 12)
[...]

However the semantics of our bench method are totally different: in the first version a NullPointerException can be thrown if we pass null to the bench method. Now it is impossible.
So how can the JIT handle this when there is no explicit check in the generated code?

Well, null is in fact a very special value (billion dollar mistake, blah blah blah...). When you try to access this value (I mean dereference it), whatever OS you are using, you end up with an error (ACCESS_VIOLATION, SEGMENTATION FAULT, ...). Most of the time this error leads to an application crash.
However, it is still possible to intercept this kind of error. On Unix a segmentation fault is just a signal like SIGQUIT, SIGINT, etc., so you can install a handler for this signal and execute some code when it is raised. On Windows there is a mechanism for exception handlers: see RtlAddFunctionTable.
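
As a small Java-side illustration of this trick (a hedged sketch, class and method names are mine): the exception below is materialized by the JVM's fault handler, not by an explicit test in the hot compiled code.

public class ImplicitNpe
{
    static int len(String s)
    {
        return s.length(); // the dereference: the implicit null check lives here
    }

    public static void main(String[] args)
    {
        try
        {
            len(null); // triggers SIGSEGV/ACCESS_VIOLATION inside the JVM...
        }
        catch (NullPointerException e)
        {
            System.out.println("...which is turned into a regular NPE");
        }
    }
}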

So the JVM uses those mechanisms to handle nulls. If you look carefully at the generated code, you can see a comment on the line:

mov r10,QWORD PTR [rdx+0x8] ;implicit exception: dispatches to 0x000000000254062d

If we follow the address we get:

  0x000000000254062d: mov    edx,0xffffffad
  0x0000000002540632: mov    QWORD PTR [rsp],r8
  0x0000000002540636: nop
  0x0000000002540637: call   0x0000000002516f60  ; OopMap{[0]=Oop off=924}
                                                ;*ifnull
                                                ; - com.bempel.sandbox.NullCheck::bench@1 (line 10)
                                                ;   {runtime_call}
  0x000000000254063c: int3                      ;*ifnull
                                                ; - com.bempel.sandbox.NullCheck::bench@1 (line 10)

This code is emitted alongside the generated code of our bench method. We can see that in this case there is indeed a null check (bytecode instruction ifnull).

Let's add a call to bench method with a null parameter to see what happens:

    public static void main(String[] args) throws InterruptedException
    {
        List list = new ArrayList();
        list.add("bar");
        for (int i = 0; i < 10000; i++)
        {
            bench(list);
        }
        Thread.sleep(1000);
        bench(null);
    }

Calling the test with -XX:+PrintCompilation we get the following result:

  1       java.util.ArrayList::indexOf (67 bytes)
  2       com.bempel.sandbox.NullCheck::bench (14 bytes)
  3       java.util.ArrayList::contains (14 bytes)
  2   made not entrant  com.bempel.sandbox.NullCheck::bench (14 bytes)

So the bench method was the second method compiled here, but was then "made not entrant", meaning deoptimized. The call with a null parameter does not cause any NullPointerException or application crash, but triggers the deoptimization of a method that was optimized too aggressively.
If we then call bench another 10,000 times with our previous list instance, we get the bench method compiled again:

    public static void main(String[] args) throws InterruptedException
    {
        List list = new ArrayList();
        list.add("bar");
        for (int i = 0; i < 10000; i++)
        {
            bench(list);
        }
        Thread.sleep(1000);
        bench(null);
        for (int i = 0; i < 10000; i++)
        {
            bench(list);
        }
        Thread.sleep(1000);
    }

PrintCompilation output:

  1       java.util.ArrayList::indexOf (67 bytes)
  2       com.bempel.sandbox.NullCheck::bench (14 bytes)
  3       java.util.ArrayList::contains (14 bytes)
  2   made not entrant  com.bempel.sandbox.NullCheck::bench (14 bytes)
  4       com.bempel.sandbox.NullCheck::bench (14 bytes)

PrintAssembly output:

[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
  # {method} 'bench' '(Ljava/util/List;)V' in 'com/bempel/sandbox/NullCheck'
  # parm0:    rdx:rdx   = 'java/util/List'
  #           [sp+0x40]  (sp of caller)
  0x0000000002500da0: mov    DWORD PTR [rsp-0x6000],eax
  0x0000000002500da7: push   rbp
  0x0000000002500da8: sub    rsp,0x30           ;*synchronization entry
                                                ; - com.bempel.sandbox.NullCheck::bench@-1 (line 10)
  0x0000000002500dac: mov    r8,rdx
  0x0000000002500daf: test   rdx,rdx
  0x0000000002500db2: je     0x0000000002501104  ;*ifnull
                                                ; - com.bempel.sandbox.NullCheck::bench@1 (line 10)
  0x0000000002500db8: mov    r10,QWORD PTR [rdx+0x8]
  0x0000000002500dbc: movabs r11,0x5747e60      ;   {oop('java/util/ArrayList')}
  0x0000000002500dc6: cmp    r10,r11
  0x0000000002500dc9: jne    0x0000000002501110  ;*invokeinterface contains
                                                ; - com.bempel.sandbox.NullCheck::bench@7 (line 12)

We now have an explicit null check:

  0x0000000002500daf: test   rdx,rdx
  0x0000000002500db2: je     0x0000000002501104  ;*ifnull

The JIT falls back to the classic way of handling nulls, but as long as the method is never called with a null value, the optimization is a good deal!

So do not fear null checks in your code: they protect you against unexpected values and their cost is virtually zero!

Friday, August 2, 2013

Hardware performance counters

Introduction

Hardware Performance Counters (HPC) are counters provided directly by the CPU thanks to a dedicated unit: the Performance Monitoring Unit (PMU).
Depending on the CPU architecture/vendor you have different kinds of counters available. This is highly hardware dependent; even within a same vendor, each CPU generation has its own implementation.

Those counters include cycles, number of instructions, cache hits/misses, branch mispredictions, etc.

Those counters can be valuable if correctly used.

Perf

On Linux there is a command line tool which provides access to those counters: perf.
For example, we can capture statistics about the execution of a command like ls:

perf stat -d ls -lR

 Performance counter stats for 'ls -lR':

    1725.533251 task-clock:HG             #    0.119 CPUs utilized
          3,421 context-switches:HG       #    0.002 M/sec
             39 CPU-migrations:HG         #    0.023 K/sec
            590 page-faults:HG            #    0.342 K/sec
  5,696,827,092 cycles:HG                 #    3.301 GHz                     [39.30%]
  3,395,404,627 stalled-cycles-frontend:HG #   59.60% frontend cycles idle    [39.23%]
  1,837,081,737 stalled-cycles-backend:HG #   32.25% backend  cycles idle    [39.84%]
  4,539,875,605 instructions:HG           #    0.80  insns per cycle
                                          #    0.75  stalled cycles per insn [49.95%]
    942,557,723 branches:HG               #  546.241 M/sec                   [50.11%]
     16,070,006 branch-misses:HG          #    1.70% of all branches         [50.57%]
  1,271,339,179 L1-dcache-loads:HG        #  736.780 M/sec                   [50.94%]
    145,209,839 L1-dcache-load-misses:HG  #   11.42% of all L1-dcache hits   [50.73%]
     83,147,216 LLC-loads:HG              #   48.186 M/sec                   [40.04%]
        525,580 LLC-load-misses:HG        #    0.63% of all LL-cache hits    [39.54%]

    14.489543284 seconds time elapsed


Through this command you can also list the available events:

perf list

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]

  cpu-clock                                          [Software event]
  task-clock                                         [Software event]
  page-faults OR faults                              [Software event]
  context-switches OR cs                             [Software event]
  cpu-migrations OR migrations                       [Software event]
  minor-faults                                       [Software event]
  major-faults                                       [Software event]
  alignment-faults                                   [Software event]
  emulation-faults                                   [Software event]

  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-prefetches                               [Hardware cache event]
  L1-icache-prefetch-misses                          [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-load-misses                                    [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  LLC-store-misses                                   [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-prefetch-misses                                [Hardware cache event]
  dTLB-loads                                         [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-stores                                        [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  dTLB-prefetches                                    [Hardware cache event]
  dTLB-prefetch-misses                               [Hardware cache event]

[...]

So you can, for example, specify one of those events when executing your command:

perf stat -e dTLB-load-misses ls -lR

 Performance counter stats for 'ls -lR':

         7,198,657 dTLB-misses


      13.225589146 seconds time elapsed


You can also specify specific, processor-dependent counters from the Intel Software Developer's Manual Volume 3B, chapter 19.
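
perf also accepts raw event codes in the rUUEE form (umask then event select). For example, assuming an Intel CPU where UNHALTED_CORE_CYCLES is event 0x3C with umask 0x00 (check the manual for your CPU generation):

perf stat -e r003c ls -lR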

But the major flaw of this approach is that you can only monitor a program launched from the command line, and it must exit before you get results. It is limited to small programs, namely micro-benchmarks.

So how can we use hardware counters to monitor a real-life application?

libpfm

Enter libpfm, which provides an API to access hardware counters. This gives us more flexibility to integrate those hardware counters into applications.

overseer

For Java applications, I have found the overseer library, which provides a Java API wrapping libpfm through JNI.
With the help of this library it is now possible to integrate hardware counters directly inside any application.

Here is an example of how to use it:

import ch.usi.overseer.OverHpc;

public class HardwareCounters
{
    private static String[] EVENTS = {
        "UNHALTED_CORE_CYCLES",
        "INSTRUCTION_RETIRED",
        "L2_RQSTS:LD_HIT",
        "L2_RQSTS:LD_MISS",
        "LLC_REFERENCES",
        "MEM_LOAD_RETIRED:LLC_MISS",
        "PERF_COUNT_SW_CPU_MIGRATIONS",
        "MEM_UNCORE_RETIRED:LOCAL_DRAM_AND_REMOTE_CACHE_HIT",
        "MEM_UNCORE_RETIRED:REMOTE_DRAM"
    };
    
    private static String[] EVENTS_NAME = {
        "Cycles",
        "Instructions",
        "L2 hits",
        "L2 misses",
        "LLC hits",
        "LLC misses",
        "CPU migrations",
        "Local DRAM",
        "Remote DRAM"
    };
    
    private static long[] results = new long[EVENTS.length];

    public static void main(String[] args) throws Exception
    {
        OverHpc oHpc = OverHpc.getInstance();
        StringBuilder sb  = new StringBuilder();
        for (int i = 0; i < EVENTS.length; i++)
        {
            if (i > 0)
            {
                sb.append(",");
            }
            sb.append(EVENTS[i]);
        }
        oHpc.initEvents(sb.toString());        
        int tid = oHpc.getThreadId();
        oHpc.bindEventsToThread(tid);
        
        // process to monitor
        
        for (int i = 0; i < EVENTS.length; i++)
        {
            results[i] = oHpc.getEventFromThread(tid, i);
        }
        for (int i = 0; i < EVENTS.length; i++)
        {
            System.out.println(EVENTS_NAME[i] + ": " + String.format("%,d", results[i]));
        }
        
        OverHpc.shutdown();
    }
}

Conclusion

Hardware performance counters provided by the CPU can be valuable to understand why you have spikes in your application. Thanks to the overseer library you can inject those counters deep inside your application for fine-grained monitoring of your process.

In the next posts we will use this technique for various examples.

References

Overseer library: http://www.peternier.com/projects/overseer/overseer.php
libpfm: http://perfmon2.sourceforge.net/docs_v4.html
Intel Software Developer's Manual Volume 3B: http://download.intel.com/products/processor/manual/253669.pdf

Tuesday, July 2, 2013

How to build hsdis-amd64.dll

If you have read my post on How to print disassembly from JIT code?, you will have found that one file is missing from the downloads: hsdis-amd64.dll. It is required to disassemble JIT code with the Windows 64-bit version of the JDK.
Until now, I have always shown you disassembled code from 32-bit (i386) builds because I was not able to find a pre-built version or recompile it myself.

I have finally managed to build it. But according to this page, it seems there are incompatible licenses between binutils & OpenJDK.
I am not an expert in licensing, so in doubt I will not publish my binary, but I will share my step-by-step procedure so you can reproduce it at home.

My various attempts with cygwin environments were unsuccessful (I am, by the way, not a big fan of cygwin, which may explain that...). So I tried a different approach with MinGW and MSYS.

Based on this tutorial, here is what I did to build this famous DLL:

  1. Download the MinGW-Get installer: mingw-get-inst-20120426.exe
  2. Run the installer. In the third wizard screen, choose the pre-packaged repository catalogues. In the fifth wizard screen, you can change the installation directory. The default C:\mingw is usually appropriate.
  3. When asked what components to install, select the following:
    - MSYS Basic System
    - MSYS Developer Toolkit
  4. Do NOT install the MinGW Compiler Suite, make sure you deselect the C Compiler.
  5. Once the installation has finished, start the MinGW Shell. You can find it in Start -> Programs -> MSYS -> MinGW Shell. We will use the installer's command line interface to install additional packages.
    You can also start the shell from C:\MinGW\msys\1.0\msys.bat
  6. MinGW is a port of the GCC compiler to the win32 platform. Instead of the official release, we install a build from the mingw-w64 project:
    x86_64-w64-mingw32-gcc-4.7.0-release-win64_rubenvb.7z
  7. Unzip the packages into the C:\MinGW directory. You should end up with a new subdirectory: C:\MinGW\mingw64
  8. mount 'C:\MinGW\mingw64\' /mingw64
    mkdir /c/mingw/opt
    mkdir /c/mingw/build64 /c/mingw/local64
    mount 'C:\MinGW\opt\' /opt
    mount 'C:\MinGW\local64\' /local64
    mount 'C:\MinGW\build64\' /build64
    mkdir /opt/bin /local64/{bin,etc,include,lib,share}
    mkdir /local64/lib/pkgconfig
  9. Create /local64/etc/profile.local:
    cat > /local64/etc/profile.local << "EOF"
    #
    # /local64/etc/profile.local
    #
    
    alias dir='ls -la --color=auto'
    alias ls='ls --color=auto'
    
    PKG_CONFIG_PATH="/local64/lib/pkgconfig"
    CPPFLAGS="-I/local64/include"
    CFLAGS="-I/local64/include -mms-bitfields -mthreads"
    CXXFLAGS="-I/local64/include -mms-bitfields -mthreads"
    LDFLAGS="-L/local64/lib"
    export PKG_CONFIG_PATH CPPFLAGS CFLAGS CXXFLAGS LDFLAGS
    
    PATH=".:/local64/bin:/mingw64/bin:/mingw/bin:/bin:/opt/bin"
    PS1='\[\033[32m\]\u@\h \[\033[33m\w\033[0m\]$ '
    export PATH PS1
    
    # package build directory
    LOCALBUILDDIR=/build64
    # package installation prefix
    LOCALDESTDIR=/local64
    export LOCALBUILDDIR LOCALDESTDIR
    
    EOF
  10. source /local64/etc/profile.local 
  11. I encountered an issue with the ar tool, so you need to copy & rename the following file
    C:\MinGW\mingw64\bin\ar.exe
    to
    C:\MinGW\mingw64\bin\x86_64-w64-mingw32-ar.exe
  12. Download binutils sources (works great with 2.22) here
  13. Unzip the package and go into the directory & type:
    ./configure --host=x86_64-w64-mingw32
  14. Then 
    make
  15. Go to your OpenJDK sources in hsdis directory:
    hotspot/src/share/tools/hsdis
  16. edit your hsdis.c to add the line
    #include <stdint.h>
    which is missing for this version of the compiler.
  17. finally build hsdis with:
    make BINUTILS=/c/binutils-2.22 ARCH=amd64 
You should now get the hsdis-amd64.dll in hotspot/src/share/tools/hsdis/bin/win
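
Copy the DLL next to your JVM (for example into the jre\bin\server directory) and disassembly should work with the usual flags (MyClass being any class you want to inspect):

java -server -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly MyClass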

Happy disassembling !

Sunday, June 16, 2013

Is it really how JVM optimizes the code?

This post is shorter than the previous ones. It is just to give you another example of the usage of my lovely JVM option PrintAssembly.

This example is inspired by Brian Goetz's talk at Devoxx 2009: Towards A Universal VM.

Here is the code sample used in this presentation:

    public interface Holder<T> {
        T get();
    }

    public class MyHolder<T> implements Holder<T> {
        private final T content;
        public MyHolder(T someContent) {
            content = someContent;
        }
        @Override public T get() {
            return content;
        }
    }

    public String getFrom(Holder<String> h) {
        if (h == null)
            throw new IllegalArgumentException("h cannot be null");
        else
            return h.get();
    }

    public String doSomething() {
        MyHolder<String> holder = new MyHolder<String>("Hello World");
        return getFrom(holder);
    }

Mr. Goetz explains in his talk the several steps performed by the JIT to optimize this code. I'll let you watch the talk for the details. At the end of the optimization process, we get the following equivalent Java code:

    public String doSomething() {
        return "Hello World";
    }

Yeah, right! Really?
Let's add another level and put that code into our TestJIT class:


public class TestJIT
{
    // previous code here

    public void bench()
    {
        if (doSomething().length() == 0)
        {
            throw new NullPointerException();
        }
    }

    public static void main(String[] args) throws Exception
    {
        TestJIT testJIT = new TestJIT();
        for (int i = 0; i < 10000; i++)
        {            
            testJIT.bench();
        }
        Thread.sleep(1000);
    }
}

With PrintAssembly option, here is the native code of bench method:

  # {method} 'bench' '()V' in 'com/bempel/sandbox/TestJIT'
  #           [sp+0x10]  (sp of caller)
  0x02487340: cmp    eax,DWORD PTR [ecx+0x4]
  0x02487343: jne    0x0246ce40         ;   {runtime_call}
  0x02487349: xchg   ax,ax
[Verified Entry Point]
  0x0248734c: mov    DWORD PTR [esp-0x3000],eax
  0x02487353: push   ebp
  0x02487354: sub    esp,0x8            ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::doSomething@-1 (line 40)
                                        ; - com.bempel.sandbox.TestJIT::bench@1 (line 46)
  0x0248735a: mov    ebx,0x56ad530      ;   {oop("Hello World")}
  0x0248735f: mov    ecx,DWORD PTR [ebx+0x10]
  0x02487362: test   ecx,ecx
  0x02487364: je     0x02487371         ;*ifne
                                        ; - com.bempel.sandbox.TestJIT::bench@7 (line 46)
  0x02487366: add    esp,0x8
  0x02487369: pop    ebp
  0x0248736a: test   DWORD PTR ds:0x180000,eax
                                        ;   {poll_return}
  0x02487370: ret    

It seems that the doSomething method was inlined and replaced by the "Hello World" string constant. The Java code equivalent is actually:
    
public void bench()
{
    if ("Hello World".length() == 0)
    {
        throw new NullPointerException();
    }
}
Well, not quite perfect: the JIT could have computed the constant length of "Hello World" and eliminated the whole body of the method. That happens if you change doSomething().length() == 0 to doSomething() == null.
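
For reference, here is the variant that gets folded away entirely: "Hello World" is a non-null constant, so the whole branch is dead code.

public void bench()
{
    if (doSomething() == null)
    {
        throw new NullPointerException();
    }
}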

Anyway, point taken: the JVM does perform the optimizations explained by Mr. Goetz!

Wednesday, April 3, 2013

CompileThreshold is relative!

CompileThreshold

Basically, what I have always read and understood about JIT compilation triggering in HotSpot is the following:
  1. Execution starts in interpreted mode, collecting statistics per method.
  2. When a certain number of method calls is reached (aka CompileThreshold), the JIT kicks in and compiles the method.
  3. There are special cases for loops, with On-Stack Replacement allowing partial compilation of those hot spots.
So the following code behaves as predicted:

public static void main(String[] args) throws Exception
{
    for (int i = 0; i < 10000; i++)
    {
        call();
    }
    Thread.sleep(1000);
}

private static void call()
{
    if (System.currentTimeMillis() == 0)
    {
        System.out.println("foo");
    }
}


Executing this class with java -server -XX:+PrintCompilation gives the following output:

---   n   java.lang.System::currentTimeMillis (static)
  1       com.bempel.sandbox.TestJIT::call (17 bytes)

So yes, after exactly 10,000 calls to the call method (the server compiler C2's CompileThreshold=10000), it gets compiled.
You can also tweak this behavior by changing CompileThreshold with the option -XX:CompileThreshold=n
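
For example, to compile methods after 5,000 invocations instead (class name as in this post):

java -server -XX:CompileThreshold=5000 -XX:+PrintCompilation com.bempel.sandbox.TestJIT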

Everything is fine, micro-benchmarks can be warmed up correctly and results are predictable, we are happy ! Until now...

Welcome to the real world !

Wake up, Neo! Real-life applications do not execute like our benchmarks! Throughput is not constant, messages do not arrive at a sustained high rate. There are peaks and pauses. Flows are chaotic.
Simulating this behavior is harder than putting some instructions in a loop in a "monkey" program. Some flows are so slow that they would take a day to replay in "real time".
But sometimes it really matters!
Let's introduce some slowdown into our first example:

public static void main(String[] args) throws Exception
{
    for (int i = 0; i < 10000; i++)
    {
        call();
        Thread.sleep(1);
    }
    Thread.sleep(1000);
}

private static void call()
{
    if (System.currentTimeMillis() == 0)
    {
        System.out.println("foo");
    }
}


We only pause 1ms for each loop execution. Running it shows that the call method is no longer compiled !
---   n   java.lang.System::currentTimeMillis (static)
---   n   java.lang.Thread::sleep (static)

So what happened? Is the CompileThreshold also time dependent?
Let's add the following option to monitor VM operations during this test: -XX:+PrintSafepointStatistics

---   n   java.lang.System::currentTimeMillis (static)
---   n   java.lang.Thread::sleep (static)
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
4.148: EnableBiasedLocking              [       8          0              0    ]      [     0     0     0     0     0    ]  0   
20.707: no vm operation                  [       6          0              0    ]      [     0     0     0     0     0    ]  0   

Polling page always armed
EnableBiasedLocking                1
    0 VM operations coalesced during safepoint
Maximum sync time      0 ms
Maximum vm operation time (except for Exit VM operation)      0 ms

As you can see, after 4 seconds there is a VM operation, EnableBiasedLocking, which requires a safepoint to be executed. As mentioned in one of my previous posts about locks, this operation is performed for the biased locking optimization. If we deactivate this optimization (-XX:-UseBiasedLocking), our call method is now compiled:
---   n   java.lang.System::currentTimeMillis (static)
  1       com.bempel.sandbox.TestJIT::call (17 bytes)
---   n   java.lang.Thread::sleep (static)
         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
21.780: no vm operation                  [       6          0              0    ]      [     0     0     0     0     0    ]  0   

Polling page always armed
    0 VM operations coalesced during safepoint
Maximum sync time      0 ms
Maximum vm operation time (except for Exit VM operation)      0 ms

Safepoints

So the CompileThreshold is in fact sensitive to safepoints. There is another kind of safepoint that you encounter every day: GC. Let's introduce some GCs into our loop:

public static void main(String[] args) throws Exception
{
    for (int i = 0; i < 10000; i++)
    {
        call();
        if (i % 100 == 0)
        {
            System.gc();
        }
        Thread.sleep(10);
    }
    Thread.sleep(1000);
}

Again, the call method is not getting compiled here. In fact, no methods are compiled!

During each safepoint, a special process is also invoked: CounterDecay. It halves the number of recorded method invocations for a small percentage (around 17%) of the classes.
This mechanism is here to identify "hot spots" based not only on the number of invocations but also on time (indirectly, via safepoints/GC).
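
A hedged sketch of the idea (HotSpot does this in C++ on its internal counters; the names below are made up):

class MethodCounter
{
    int invocations;
}

// At each safepoint cleanup, a rotating slice (~17% of classes) gets its
// methods' invocation counters halved, so "heat" decays over time
static void decaySlice(Iterable<MethodCounter> slice)
{
    for (MethodCounter mc : slice)
    {
        mc.invocations /= 2; // a method must keep being called to stay hot
    }
}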

Fortunately, this mechanism can be disabled with -XX:-UseCounterDecay.
Now, with the same code, we have all our methods compiled:
---   n   java.lang.System::currentTimeMillis (static)
  1       com.bempel.sandbox.TestJIT::call (17 bytes)
---   n   java.lang.Thread::sleep (static)



Monday, March 25, 2013

Lock Lock Lock: Enter!

Java locks

There are 2 major lock mechanisms in Java:
  • synchronized keyword 
  • java.util.concurrent.Lock (with ReentrantLock implementation).

In this post we will dig into internals of those locks.

Synchronized keyword

The main advantage of this lock mechanism is the language integration. Thanks to this, the JVM can optimize freely without impacting existing code.

Also, every object carries a structure allowing it to be used as a lock anywhere. However, this comes at the cost of larger objects.

Each object header has at least 2 words (2x4 bytes for 32-bit & 2x8 bytes for 64-bit, excluding CompressedOops).
The first word is known as the Mark Word. This is the header of the object and it contains different information, including locking state.
The second word is a pointer to the class metadata defining the type of this object. This also includes the VMT (Virtual Method Table, cf. Virtual Call 911).

The Mark Word can be summarized by this table (1):

The Mark Word encodes different information depending on the state of the object, indicated by the 2 lowest bits: the Tag.
If the object is not used as a lock, the Mark Word contains a cached version of the hashcode and the age of the object (for GC/survivors).
Then there are 3 other states for locking: light-weight, heavy-weight & biased.


Light-weight

If we try to acquire an uncontended lock, a light-weight mechanism is involved: a CAS (Compare-And-Swap) operation displaces the Mark Word into the current stack. Since a synchronized block is entered and exited on one and the same stack, the previous value of the Mark Word is stored in a stack slot, which also leaves room for more information to handle the contended & recursive cases.
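
To fix the idea, here is a hedged Java emulation of that light-weight path, with an AtomicLong standing in for the Mark Word (the real thing is generated assembly and uses the header's tag bits):

import java.util.concurrent.atomic.AtomicLong;

class ThinLock
{
    static final long UNLOCKED = 0x01L; // tag bits meaning "unlocked"
    final AtomicLong markWord = new AtomicLong(UNLOCKED);

    // fast path: a single CAS installs the address of a lock record
    // living on the owning thread's stack into the header
    boolean tryLock(long lockRecordAddress)
    {
        return markWord.compareAndSet(UNLOCKED, lockRecordAddress);
    }

    void unlock()
    {
        markWord.set(UNLOCKED); // HotSpot CASes the displaced header back
    }
}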

What is generated by the JIT for the following code ?


{
    synchronized (obj)
    {
          i++;
    }
}



I have highlighted in green above what is executed in the uncontended case, in the best case.
In blue, the code inside the synchronized block.

Basically, the critical part is the CAS operation: lock cmpxchg

In contended case, the lock is inflated to heavy-weight mechanism.

Heavy-weight

If the lock detects contention with another thread, the lock is inflated. It means that a special "monitor" object is allocated to store the lock information, as well as the Mark Word information, since from then on the Mark Word contains the address of this monitor.
The monitor object stores the WaitSet of threads waiting to acquire the lock.
The lock mechanism is based on OS primitives like mutexes or events. This implies a context switch for the thread, which is why this heavy-weight form should be avoided for performance.

Biased

If the lock is only ever acquired by one thread, the JVM can optimize it by biasing the lock toward this thread. It means that we keep in the Mark Word the id of the thread which acquired the lock; subsequent lock acquisitions are then just a simple check that this is the same thread id, without any CAS instruction.


As you can see, the instructions highlighted in green are basic: no lock cmpxchg.
So it seems better, but in fact it comes with a high cost: if another thread tries to acquire this lock, the JVM needs to revoke the bias. And bias revocation costs a safepoint. It means that all threads must be stopped in order to perform the revocation.
So in certain circumstances it is better to deactivate this kind of optimization: -XX:-UseBiasedLocking

Please note also that biased locking is only enabled 4 seconds after startup. You can tune this with -XX:BiasedLockingStartupDelay=4000
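
Conversely, for a micro-benchmark you can make the bias kick in immediately:

java -XX:+UseBiasedLocking -XX:BiasedLockingStartupDelay=0 com.bempel.sandbox.TestJIT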

Other optimizations

In some cases, the JVM can apply additional optimizations, mostly for fully synchronized objects like StringBuffer or Hashtable.
For those objects, since we may call several synchronized methods in a row, we can avoid a lot of locking/unlocking by grouping the method calls under a single larger lock.

{
    sb.append("foo");
    sb.append(v1);
    sb.append("bar");
}

Here the 3 calls are put under a single, larger lock to avoid 3 locking/unlocking operations.
This is lock coarsening.

Then, if we apply Escape Analysis and detect that the StringBuffer instance does not escape its local scope, we can prove that those synchronized methods will never be contended.

{
    StringBuffer sb = new StringBuffer();
    sb.append("foo");
    sb.append(v1);
    sb.append("bar");
    return sb.toString();
}

In this case the sb instance is local and there will never be contention here, so we can suppress the locks. This is lock elision.
However, those optimizations seem to be very specific to this kind of object, and such patterns are not very common in today's code.

ReentrantLock

Unlike synchronized, ReentrantLock is a regular class shipped with the JDK. But some primitives are provided by the JVM, like the ability to perform a CAS operation through the Unsafe.compareAndSwapInt() method. This method is handled specially by the JVM because it is declared as an intrinsic. It means that the JIT can generate a special set of instructions for it instead of a regular call to the JNI implementation.

Intrinsics

If we take a simple example with AtomicInteger class, compareAndSet method is implemented like this:

public final boolean compareAndSet(int expect, int update) {
      return unsafe.compareAndSwapInt(this, valueOffset, expect, update);
}

And compareAndSwapInt is declared in the Unsafe class like this (extracted from OpenJDK sources):

public final native boolean compareAndSwapInt(Object o, long offset,
                                                  int expected,
                                                  int x);

So it is declared as a JNI implementation that I have also extracted from Open JDK sources:

UNSAFE_ENTRY(jboolean, Unsafe_CompareAndSwapInt(JNIEnv *env, jobject unsafe, jobject obj, jlong offset, jint e, jint x))
  UnsafeWrapper("Unsafe_CompareAndSwapInt");
  oop p = JNIHandles::resolve(obj);
  jint* addr = (jint *) index_oop_from_field_offset_long(p, offset);
  return (jint)(Atomic::cmpxchg(x, addr, e)) == e;
UNSAFE_END

Normally, a call to a method like this is a regular runtime call to the compiled code above. Let's verify with the disassembly of the following Java code:
    static AtomicInteger  i = new AtomicInteger(42);
    static void call()
    {
         i.compareAndSet(42, 43);
    }

  0x02538c8e: mov    ebx,0x150
  0x02538c93: mov    ecx,DWORD PTR [ebx+0x569cef0]
                                        ;*getstatic i
                                        ; - com.bempel.sandbox.TestJIT::call@0 (line 31)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x02538c99: test   ecx,ecx
  0x02538c9b: je     0x02538cc5
  0x02538c9d: lea    ebx,[ecx+0x8]
  0x02538ca0: mov    eax,0x2a
  0x02538ca5: mov    ecx,0x2b
  0x02538caa: lock cmpxchg DWORD PTR [ebx],ecx
  0x02538cae: mov    ebx,0x0
  0x02538cb3: jne    0x02538cba
  0x02538cb5: mov    ebx,0x1            ;*invokevirtual compareAndSwapInt
                                        ; - java.util.concurrent.atomic.AtomicInteger::compareAndSet@9 (line 107)
                                        ; - com.bempel.sandbox.TestJIT::call@7 (line 31)

As you can see, the code contains in fact very few instructions, and the main one: lock cmpxchg

Uncontended case

For the following code:

{
    lock.lock();
    try
    {
          i++;
    }
    finally
    {
          lock.unlock();
    }
}

We have the following output for disassembly:


Highlighted in green is the path executed when we successfully acquire the lock. Notice again the main instruction: lock cmpxchg.

Conclusion

Except for the biased version, the other versions are similar, so there is no clear winner for the fastest lock here.
Biased locking seems very efficient, but at the potentially high cost of a safepoint if the bias is revoked.
ReentrantLock is more stable in terms of execution, since there is no special optimization besides the intrinsic form of the CompareAndSwap operation.

In any case, you can notice that there is overhead in using a lock, even in the best case when there is no contention.


(1) From Presentation: The Hotspot Virtual Machine by Paul Hohensee

Monday, March 4, 2013

Safety First: Safepoints

You may never have heard about them since you are programming in Java, but safepoints are a very important feature of the JVM.
Your Java (or other JVM language) application encounters safepoints every day. Let's talk about them.

What is a Safepoint ?

A safepoint is a state of your application execution where all references to objects are perfectly reachable by the VM.

Some operations of the VM require all threads to reach a safepoint before they can be performed. The most common operation that needs this is the GC.

A safepoint means that all threads need to reach a certain point of execution before they are stopped. Then the VM operation is performed. After that all threads are resumed.


VM operations

When a GC occurs, a safepoint is requested. Then, when all threads are stopped, the GC threads can perform their algorithm.
But GC is not the only operation that requires a safepoint. Here is a non-exhaustive list of operations that your application can encounter during execution:

  • Deoptimization
  • PrintThreads
  • PrintJNI
  • FindDeadlock
  • ThreadDump
  • EnableBiasedLocking
  • RevokeBias
  • HeapDumper
  • GetAllStackTrace
  • ...
You can find some VM operations here: vm_operations.hpp. But note that not all VM operations require a safepoint.

What is the impact ?


If you run your application with following JVM options:
-XX:+PrintSafepointStatistics -XX:+PrintGCApplicationStoppedTime -XX:PrintSafepointStatisticsCount=1

You will find an output similar to:

vmop [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
0.255: ParallelGCFailedAllocation       [9    0    0]      [0     0     0     0     1]  0  
Total time for which application threads were stopped: 0.0018988 seconds

vmop [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count 
0.267: ParallelGCFailedAllocation       [9    0    1]      [0     0     0     0     1]  0  
Total time for which application threads were stopped: 0.0014882 seconds
 
vmop [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count 
0.278: FindDeadlocks                    [9    0    0]      [0     0     0     0     0]  0  
Total time for which application threads were stopped: 0.0001193 seconds


This output indicates that 3 safepoints were run: 2 for GC, 1 for FindDeadlocks.
The 2 GCs (ParallelGCFailedAllocation) stopped threads for 1 to 2 ms.
FindDeadlocks stopped threads for 119 µs.

Even if FindDeadlocks is a lighter operation than a GC, it is not negligible for some kinds of latency-sensitive applications. Some other VM operations, like ThreadDump, can take a while (up to several ms) depending on the number of threads involved.

So check what the effects of such operations are on your application.