
Tuesday, May 26, 2015

Volatile and memory barriers

I have already blogged about the effect of a volatile variable on the optimizations performed by the JIT. But what is the real difference compared to a regular variable? And what is the impact in terms of performance?

Semantic


Volatile has a well-defined semantic in the Java Memory Model (JMM). To summarize, a volatile access:
  • Cannot be reordered
  • Ensures data visibility to other threads
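A classic illustration of the visibility guarantee is a stop flag read in a loop by another thread. This is a minimal sketch (class and field names are mine, not from any of the examples below):

```java
public class StopFlag
{
    // without volatile, the reader thread might never observe the update
    private static volatile boolean running = true;

    public static void main(String[] args) throws Exception
    {
        Thread reader = new Thread(() -> {
            while (running)
            {
                // busy loop: with a non-volatile flag, the JIT may hoist
                // the field read out of the loop and spin forever
            }
        });
        reader.start();
        Thread.sleep(100);
        running = false;   // volatile write: guaranteed visible to the reader
        reader.join();     // terminates thanks to the visibility guarantee
    }
}
```

With a plain boolean, this program may hang once the loop is JIT-compiled; the volatile modifier forbids that optimization.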

Visibility


Doug Lea relates thread visibility to flushing caches; however, as pointed out by Martin Thompson in his post, a volatile access does not flush the cache to ensure visibility (as in writing data back to memory to make it visible to all cores).
Cache-coherent architectures (like ccNUMA) mean that all data in the cache subsystem is in fact coherent with main memory. So the semantics of volatile apply through the Load/Store buffers (or Memory Ordering Buffers) placed between the registers and the L1 cache.

Depending on the CPU architecture/instruction set, the instructions generated to ensure those properties can vary. Let's focus on x86, which is the most widespread.

Reordering


For reordering there are two kinds:
  • Compiler
  • Hardware/CPU
The compiler is able to reorder instructions during instruction scheduling to match the cost of loading or storing data with the CPU's specifications.
For example, it can be interesting to issue two independent loads back to back, so that the CPU can perform other operations while waiting for the data instead of stalling on each load in turn.

In the following example I have put 6 non-volatile fields:

public class TestJIT
{
    private static int field1;
    private static int field2;
    private static int field3;
    private static int field4;
    private static int field5;
    private static int field6;
    
    private static void assign(int i)
    {
        field1 = i << 1;
        field2 = i << 2;
        field3 = i << 3;
        field4 = i << 4;
        field5 = i << 5;
        field6 = i << 6;
    }

    public static void main(String[] args) throws Exception
    {
        for (int i = 0; i < 10000; i++)
        {
            assign(i);
        }
        Thread.sleep(1000);
    }
}


Let's examine what is generated by the JIT for the method assign:

  # {method} 'assign' '(I)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = int
  #           [sp+0x10]  (sp of caller)
  0x02438800: push   ebp
  0x02438801: sub    esp,0x8            ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::assign@-1 (line 26)
  0x02438807: mov    ebx,ecx
  0x02438809: shl    ebx,1
  0x0243880b: mov    edx,ecx
  0x0243880d: shl    edx,0x2
  0x02438810: mov    eax,ecx
  0x02438812: shl    eax,0x3
  0x02438815: mov    esi,ecx
  0x02438817: shl    esi,0x4
  0x0243881a: mov    ebp,ecx
  0x0243881c: shl    ebp,0x5
  0x0243881f: shl    ecx,0x6
  0x02438822: mov    edi,0x160
  0x02438827: mov    DWORD PTR [edi+0x565c3b0],ebp
                                        ;*putstatic field5
                                        ; - com.bempel.sandbox.TestJIT::assign@27 (line 30)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x0243882d: mov    ebp,0x164
  0x02438832: mov    DWORD PTR [ebp+0x565c3b0],ecx
                                        ;*putstatic field6
                                        ; - com.bempel.sandbox.TestJIT::assign@34 (line 31)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x02438838: mov    ebp,0x150
  0x0243883d: mov    DWORD PTR [ebp+0x565c3b0],ebx
                                        ;*putstatic field1
                                        ; - com.bempel.sandbox.TestJIT::assign@3 (line 26)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x02438843: mov    ecx,0x154
  0x02438848: mov    DWORD PTR [ecx+0x565c3b0],edx
                                        ;*putstatic field2
                                        ; - com.bempel.sandbox.TestJIT::assign@9 (line 27)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x0243884e: mov    ebx,0x158
  0x02438853: mov    DWORD PTR [ebx+0x565c3b0],eax
                                        ;*putstatic field3
                                        ; - com.bempel.sandbox.TestJIT::assign@15 (line 28)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x02438859: mov    ecx,0x15c
  0x0243885e: mov    DWORD PTR [ecx+0x565c3b0],esi
                                        ;*putstatic field4
                                        ; - com.bempel.sandbox.TestJIT::assign@21 (line 29)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x02438864: add    esp,0x8
  0x02438867: pop    ebp
  0x02438868: test   DWORD PTR ds:0x190000,eax
                                        ;   {poll_return}
  0x0243886e: ret    


As you can see in the comments, the fields are assigned in the following order: field5, field6, field1, field2, field3, field4.
We now add the volatile modifier to field1 & field6:
  # {method} 'assign' '(I)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = int
  #           [sp+0x10]  (sp of caller)
  0x024c8800: push   ebp
  0x024c8801: sub    esp,0x8
  0x024c8807: mov    ebp,ecx
  0x024c8809: shl    ebp,1
  0x024c880b: mov    edx,ecx
  0x024c880d: shl    edx,0x2
  0x024c8810: mov    esi,ecx
  0x024c8812: shl    esi,0x3
  0x024c8815: mov    eax,ecx
  0x024c8817: shl    eax,0x4
  0x024c881a: mov    ebx,ecx
  0x024c881c: shl    ebx,0x5
  0x024c881f: shl    ecx,0x6
  0x024c8822: mov    edi,0x150
  0x024c8827: mov    DWORD PTR [edi+0x562c3b0],ebp
                                        ;*putstatic field1
                                        ; - com.bempel.sandbox.TestJIT::assign@3 (line 26)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c882d: mov    ebp,0x160
  0x024c8832: mov    DWORD PTR [ebp+0x562c3b0],ebx
                                        ;*putstatic field5
                                        ; - com.bempel.sandbox.TestJIT::assign@27 (line 30)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c8838: mov    ebx,0x154
  0x024c883d: mov    DWORD PTR [ebx+0x562c3b0],edx
                                        ;*putstatic field2
                                        ; - com.bempel.sandbox.TestJIT::assign@9 (line 27)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c8843: mov    ebp,0x158
  0x024c8848: mov    DWORD PTR [ebp+0x562c3b0],esi
                                        ;*putstatic field3
                                        ; - com.bempel.sandbox.TestJIT::assign@15 (line 28)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c884e: mov    ebx,0x15c
  0x024c8853: mov    DWORD PTR [ebx+0x562c3b0],eax
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c8859: mov    ebp,0x164
  0x024c885e: mov    DWORD PTR [ebp+0x562c3b0],ecx
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c8864: lock add DWORD PTR [esp],0x0  ;*putstatic field1
                                        ; - com.bempel.sandbox.TestJIT::assign@3 (line 26)
  0x024c8869: add    esp,0x8
  0x024c886c: pop    ebp
  0x024c886d: test   DWORD PTR ds:0x180000,eax
                                        ;   {poll_return}
  0x024c8873: ret    

Now field1 is really the first field assigned and field6 the last one. But in the middle we have the following order: field5, field2, field3, field4.

Besides that, the CPU is able to reorder the instruction flow in certain circumstances, also to optimize the efficiency of instruction execution. Those properties are well summarized in the Intel white paper on memory ordering.

Write access

In the previous example, which contains only volatile writes, you can notice a special instruction: lock add. It is a bit weird at first: we add 0 to the memory pointed to by the SP (Stack Pointer) register, so we are not changing the data at this memory location. But with the lock prefix, memory-related instructions are processed specially to act as a memory barrier, similar to the mfence instruction. Dave Dice explains in his blog that they benchmarked the two kinds of barriers, and lock add seems to be the most efficient one on today's architectures.

So this barrier ensures that no reordering happens across this instruction, and it also drains all pending stores from the Store Buffer. After it executes, all previous writes are visible to all other threads through the cache subsystem or main memory. This costs some latency while waiting for the drain.
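The need for this StoreLoad barrier shows up in Dekker-style synchronization, where each thread writes its own flag and then reads the other thread's flag. This is a sketch, not code from the post; the names are mine:

```java
public class Dekker
{
    static volatile boolean flagA, flagB;

    // thread A: announce intent, then check the other thread
    static boolean tryEnterA()
    {
        flagA = true;     // volatile write: the lock add drains the store buffer
        return !flagB;    // volatile read: must not be moved before the write
    }

    // thread B, symmetric
    static boolean tryEnterB()
    {
        flagB = true;
        return !flagA;
    }
}
```

Without the barrier, both stores could sit in each core's store buffer while both loads still read the old values, letting both threads "enter" at the same time; the volatile write's barrier rules that out.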

LazySet

Sometimes we can relax the constraint on immediate visibility but still keep the ordering guarantee. For this, Doug Lea introduced the lazySet method on the Atomic* classes. Let's use an AtomicInteger as a replacement for a volatile int:

public class TestJIT
{
    private static AtomicInteger field1 = new AtomicInteger(0);
    private static int field2;
    private static int field3;
    private static int field4;
    private static int field5;
    private static AtomicInteger field6 = new AtomicInteger(0);
    
    public static void assign(int i)
    {
        field1.lazySet(i << 1);
        field2 = i << 2;
        field3 = i << 3;        
        field4 = i << 4;
        field5 = i << 5;
        field6.lazySet(i << 6);
    }
    
    public static void main(String[] args) throws Throwable
    {
        for (int i = 0; i < 10000; i++)
        {
            assign(i);
        }
        Thread.sleep(1000);
    }
}

We have the following output for PrintAssembly:

  # {method} 'assign' '(I)V' in 'com/bempel/sandbox/TestJIT'
  # this:     ecx       = 'com/bempel/sandbox/TestJIT'
  # parm0:    edx       = int
  #           [sp+0x10]  (sp of caller)
  0x024c7f40: cmp    eax,DWORD PTR [ecx+0x4]
  0x024c7f43: jne    0x024ace40         ;   {runtime_call}
  0x024c7f49: xchg   ax,ax
[Verified Entry Point]
  0x024c7f4c: mov    DWORD PTR [esp-0x3000],eax
  0x024c7f53: push   ebp
  0x024c7f54: sub    esp,0x8            ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::assign@-1 (line 30)
  0x024c7f5a: mov    ebp,edx
  0x024c7f5c: shl    ebp,1              ;*ishl
                                        ; - com.bempel.sandbox.TestJIT::assign@5 (line 30)
  0x024c7f5e: mov    ebx,0x150
  0x024c7f63: mov    ecx,DWORD PTR [ebx+0x56ec4b8]
                                        ;*getstatic field1
                                        ; - com.bempel.sandbox.TestJIT::assign@0 (line 30)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7f69: test   ecx,ecx
  0x024c7f6b: je     0x024c7fd0
  0x024c7f6d: mov    DWORD PTR [ecx+0x8],ebp  ;*invokevirtual putOrderedInt
                                        ; - java.util.concurrent.atomic.AtomicInteger::lazySet@8 (line 80)
                                        ; - com.bempel.sandbox.TestJIT::assign@6 (line 30)
  0x024c7f70: mov    edi,edx
  0x024c7f72: shl    edi,0x2
  0x024c7f75: mov    ebp,edx
  0x024c7f77: shl    ebp,0x3
  0x024c7f7a: mov    esi,edx
  0x024c7f7c: shl    esi,0x4
  0x024c7f7f: mov    eax,edx
  0x024c7f81: shl    eax,0x5
  0x024c7f84: shl    edx,0x6            ;*ishl
                                        ; - com.bempel.sandbox.TestJIT::assign@39 (line 35)
  0x024c7f87: mov    ebx,0x154
  0x024c7f8c: mov    ebx,DWORD PTR [ebx+0x56ec4b8]
                                        ;*getstatic field6
                                        ; - com.bempel.sandbox.TestJIT::assign@33 (line 35)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7f92: mov    ecx,0x158
  0x024c7f97: mov    DWORD PTR [ecx+0x56ec4b8],edi
                                        ;*putstatic field2
                                        ; - com.bempel.sandbox.TestJIT::assign@12 (line 31)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7f9d: mov    ecx,0x164
  0x024c7fa2: mov    DWORD PTR [ecx+0x56ec4b8],eax
                                        ;*putstatic field5
                                        ; - com.bempel.sandbox.TestJIT::assign@30 (line 34)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7fa8: mov    edi,0x15c
  0x024c7fad: mov    DWORD PTR [edi+0x56ec4b8],ebp
                                        ;*putstatic field3
                                        ; - com.bempel.sandbox.TestJIT::assign@18 (line 32)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7fb3: mov    ecx,0x160
  0x024c7fb8: mov    DWORD PTR [ecx+0x56ec4b8],esi
                                        ;*putstatic field4
                                        ; - com.bempel.sandbox.TestJIT::assign@24 (line 33)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024c7fbe: test   ebx,ebx
  0x024c7fc0: je     0x024c7fdd
  0x024c7fc2: mov    DWORD PTR [ebx+0x8],edx  ;*invokevirtual putOrderedInt
                                        ; - java.util.concurrent.atomic.AtomicInteger::lazySet@8 (line 80)
                                        ; - com.bempel.sandbox.TestJIT::assign@40 (line 35)
  0x024c7fc5: add    esp,0x8
  0x024c7fc8: pop    ebp
  0x024c7fc9: test   DWORD PTR ds:0x180000,eax
                                        ;   {poll_return}
  0x024c7fcf: ret    

So now, no trace of a memory barrier whatsoever (no mfence, no lock add instruction). But the ordering of field1 and field6 remains (first and last); in between we have field2, field5, field3 and field4.
In fact, the lazySet method calls putOrderedInt on the Unsafe object, which does not emit a memory barrier but guarantees no reordering.
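A typical use of lazySet is a single-writer progress counter: the writer does not need the index to be immediately visible, only ordered after the data it publishes. A sketch of the idea, with names of my own invention:

```java
import java.util.concurrent.atomic.AtomicLong;

public class Progress
{
    private final long[] buffer = new long[1024];
    private final AtomicLong published = new AtomicLong(-1);

    // single writer: store the value, then publish the index.
    // lazySet guarantees the buffer store is not reordered after the
    // index store, without paying for a full StoreLoad barrier.
    public void publish(long index, long value)
    {
        buffer[(int) (index % buffer.length)] = value;
        published.lazySet(index);
    }

    // readers use a plain volatile read to see the latest published index
    public long lastPublished()
    {
        return published.get();
    }

    public long read(long index)
    {
        return buffer[(int) (index % buffer.length)];
    }
}
```

A reader that sees lastPublished() >= n is guaranteed to see the value stored at index n, possibly with a small delay, which is exactly the relaxed visibility trade-off described above.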

Read Access

We will now examine the cost of a volatile read with this example:

public class TestJIT
{
    private static volatile int field1 = 42;
    
    public static void testField1(int i)
    {
        if (field1 < 0)
        {
            System.out.println("field value: " + field1);
        }
    }
    
    public static void main(String[] args) throws Throwable
    {
        for (int i = 0; i < 10000; i++)
        {
            testField1(i);
        }
        Thread.sleep(1000);
    }
}

The PrintAssembly output looks like:

  # {method} 'testField1' '(I)V' in 'com/bempel/sandbox/TestJIT'
  # parm0:    ecx       = int
  #           [sp+0x10]  (sp of caller)
  0x024f8800: mov    DWORD PTR [esp-0x3000],eax
  0x024f8807: push   ebp
  0x024f8808: sub    esp,0x8            ;*synchronization entry
                                        ; - com.bempel.sandbox.TestJIT::testField1@-1 (line 22)
  0x024f880e: mov    ebx,0x150
  0x024f8813: mov    ecx,DWORD PTR [ebx+0x571c418]
                                        ;*getstatic field1
                                        ; - com.bempel.sandbox.TestJIT::testField1@0 (line 22)
                                        ;   {oop('com/bempel/sandbox/TestJIT')}
  0x024f8819: test   ecx,ecx
  0x024f881b: jl     0x024f8828         ;*ifge
                                        ; - com.bempel.sandbox.TestJIT::testField1@3 (line 22)
  0x024f881d: add    esp,0x8
  0x024f8820: pop    ebp
  0x024f8821: test   DWORD PTR ds:0x190000,eax
                                        ;   {poll_return}
  0x024f8827: ret    

No memory barrier, only a load from a memory address into a register: this is all that is required for a volatile read. The JIT cannot optimize it by caching the value into a register. But thanks to the cache subsystem, the latency of a volatile read is very similar to that of a regular variable read: if you test the same Java code without the volatile modifier, the generated code for this test is in fact the same.
Volatile reads also put some constraints on reordering.

Summary

Volatile access prevents instruction reordering, both at the compiler level and at the hardware level.
It ensures visibility: for writes, at the price of a memory barrier; for reads, by forbidding register caching and reordering.

Saturday, May 23, 2015

Measuring contention on locks

Locks are one of the major scalability bottlenecks in your code. Lock-free structures offer an alternative to lock usage, but they are sometimes more difficult to use or to integrate into existing code. Before rushing into this approach, it is worth determining which parts of your locking code would really benefit from it. Locks become a bottleneck when more than one thread tries to acquire them and has to wait for a release: this is contention. Measuring this contention will help us pinpoint which locks need to be improved.

In the Java world there are two kinds of locks: synchronized blocks, and those using java.util.concurrent.Lock. The first are handled directly by the JVM with the help of specific bytecodes (monitorenter & monitorexit). For those, the JVM provides, via JVMTI, events that native agents can use to get information about synchronized blocks: MonitorContendedEnter & MonitorContendedEntered.
Profilers like YourKit exploit this information to provide contention profiling.



Azul Systems provides, with their Zing JVM, a tool named ZVision that also profiles synchronized blocks:

AzulZvision_syncLocks.png

But what about j.u.c.Lock?

Here, this is trickier: j.u.c.Lock is not handled by the JVM. It is part of the JDK classes, like any regular library, with no special treatment, and hence no special information about it.
YourKit is not able to profile these locks. However, another profiler is: JProfiler.

JProfiler_SyncLocks.png

JProfiler_jucLocks.png

I suppose it uses instrumentation to insert callbacks when the j.u.c.Lock classes are used. The profile information seems precise, with call stacks. It helps to identify which locks are effectively contended.

I have also found an article describing jucProfiler, a tool created by an IBM team.

Before finding those two tools, I had made my own very light and quick j.u.c.Lock profiler. The technique is simple and I will describe it below. The full code is available on GitHub.
The goal was to profile existing code using j.u.c.Lock without creating any wrapper or subtype, but by intercepting the actual classes: I copied the JDK classes and kept them in the same package.
I identified the places in the code where the lock fails to be acquired: in those cases there is contention with another thread which already holds the lock.
Here is an extract from ReentrantLock:

I incremented a contention field to keep track of the failed attempts. There is no need for an atomic counter here: if some contended acquisitions are not precisely reported, it is not a big deal.

Each lock instance is identified at construction by a unique ID and a call stack, stored in a global static map.

When you want to report statistics about lock contention, you traverse the map and print the information about each lock, including the number of contentions detected.
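Putting these pieces together, the bookkeeping could look like the following sketch. This is my own illustration of the technique, not the actual code from the GitHub repository; all class and method names are mine:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class ContentionRegistry
{
    private static final AtomicInteger NEXT_ID = new AtomicInteger();
    private static final Map<Integer, LockInfo> LOCKS = new ConcurrentHashMap<>();

    static class LockInfo
    {
        final StackTraceElement[] creationSite;
        // plain int on purpose: losing a few increments under
        // contention is acceptable for profiling
        int contentionCount;

        LockInfo(StackTraceElement[] creationSite)
        {
            this.creationSite = creationSite;
        }
    }

    // called from the copied lock's constructor
    public static int register()
    {
        int id = NEXT_ID.incrementAndGet();
        LOCKS.put(id, new LockInfo(Thread.currentThread().getStackTrace()));
        return id;
    }

    // called from the copied lock code where acquisition fails
    public static void contended(int id)
    {
        LOCKS.get(id).contentionCount++;
    }

    public static int contentionCount(int id)
    {
        return LOCKS.get(id).contentionCount;
    }

    public static void printReport()
    {
        for (Map.Entry<Integer, LockInfo> e : LOCKS.entrySet())
        {
            System.out.println("lock " + e.getKey() + " = "
                + e.getValue().contentionCount);
        }
    }
}
```

The copied ReentrantLock would call register() in its constructor and contended(id) on each failed acquisition attempt.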

To use those classes instead of the ones from the JDK, you need to prepend the jar containing them to the bootclasspath. This way, your classes are looked up before those contained in rt.jar:

-Xbootclasspath/p:contention-profiling.jar  

Here is an example of the output of the profiler:

The "creating lock n" line indicates where lock id n is created. Each line with an equals sign reports, for a lock id (left of the equals sign), the number of times that lock was contended (right of the equals sign).
This helps you focus on the most contended locks first.

And remember: Measure, Don't Premature!

Thanks Georges & Aurélien for the review.

Friday, May 9, 2014

Branches: I have lost my path!

At Devoxx France 2014 I gave a talk about Hardware Performance Counters, including examples I had already blogged about previously (first & second). But I also included a third one showing how to measure branch mispredictions and their effects.

For this example I was inspired by this question on Stack Overflow. The answers contain very good explanations of the phenomenon; I encourage you to read them.

I am more interested in measuring this effect, to be able to pinpoint this kind of issue in my code in the future. I have rewritten the code of the example as follows:

import java.util.Random;
import java.util.Arrays;

public class CondLoop
{
    final static int COUNT = 64*1024;
    static Random random = new Random(System.currentTimeMillis());

    private static int[] createData(int count, boolean warmup, boolean predict)
    {
        int[] data = new int[count];
        for (int i = 0; i < count; i++)
        {
            data[i] = warmup ? random.nextInt(2) 
                             : (predict ? 1 : random.nextInt(2));
        }
        return data;
    }
    
    private static int benchCondLoop(int[] data)
    {
        long ms = System.currentTimeMillis();
        HWCounters.start();
        int sum = 0;
        for (int i = 0; i < data.length; i++)
        {
            if (data[i] == 1)
                sum += i;
        }
        HWCounters.stop();
        return sum;
    }

    public static void main(String[] args) throws Exception
    {
        boolean predictable = Boolean.parseBoolean(args[0]);
        HWCounters.init();
        int count = 0;
        for (int i = 0; i < 10000; i++)
        {
            int[] data = createData(1024, true, predictable); 
            count += benchCondLoop(data);
        }
        System.out.println("warmup done");
        Thread.sleep(1000);
        int[] data = createData(512*1024, false, predictable); 
        count += benchCondLoop(data);
        HWCounters.printResults();
        System.out.println(count);
        HWCounters.shutdown();
    }
}


I have two modes: one completely predictable, with only 1s in the array; the other unpredictable, with the array filled randomly with 0s and 1s.
When I run my code with HPCs, including the branch misprediction counter, on a machine with 2 Xeon X5680 (Westmere) CPUs, I get the following results:

[root@archi-srv condloop]# java -cp overseer.jar:. CondLoop true
warmup done
Cycles: 2,039,751
branch mispredicted: 20
-1676149632

[root@archi-srv condloop]# java -cp overseer.jar:. CondLoop false
warmup done
Cycles: 2,042,371
branch mispredicted: 20
-1558729579

We can see there is no difference between the two modes. In fact there is a caveat in my example: it is too simple, and the JIT compiler is able to perform an optimization I was not aware of at the time. To understand what's going on, I took a tour with my old friend PrintAssembly, as usual! (Note: I am using the Intel syntax with the help of -XX:PrintAssemblyOptions=intel, because I am running on an x86_64 CPU, so let's use its syntax!)

  # {method} 'benchCondLoop' '([I)I' in 'CondLoop'
  [...]
  0x00007fe45105fcc9: cmp    ebp,ecx
  0x00007fe45105fccb: jae    0x00007fe45105fe27  ;*iaload
                                         ; - CondLoop::benchCondLoop@15 (line 28)
  0x00007fe45105fcd1: mov    r8d,DWORD PTR [rbx+rbp*4+0x10]
  0x00007fe45105fcd6: mov    edx,ebp
  0x00007fe45105fcd8: add    edx,r13d
  0x00007fe45105fcdb: cmp    r8d,0x1
  0x00007fe45105fcdf: cmovne edx,r13d
  0x00007fe45105fce3: inc    ebp                ;*iinc
                                         ; - CondLoop::benchCondLoop@24 (line 26)
  0x00007fe45105fce5: cmp    ebp,r10d
  0x00007fe45105fce8: jge    0x00007fe45105fcef  ;*if_icmpge
                                         ; - CondLoop::benchCondLoop@10 (line 26)
  0x00007fe45105fcea: mov    r13d,edx
  0x00007fe45105fced: jmp    0x00007fe45105fcc9
  [...]


The output shows a special instruction I was not familiar with: cmovne. But it reminded me of a thread on the Mechanical Sympathy forum about this instruction (that's why it is important to read this forum!).
This instruction is used specifically to avoid branches, and therefore branch mispredictions.
So let's rewrite my condition as a more complex one:

    private static int benchCondLoop(int[] data)
    {
        long ms = System.currentTimeMillis();
        HWCounters.start();
        int sum = 0;
        for (int i = 0; i < data.length; i++)
        {
            if (i+ms > 0 && data[i] == 1)
                sum += i;
        }
        HWCounters.stop();
        return sum;
    }

Here are now the results:
[root@archi-srv condloop]# java -cp overseer.jar:. CondLoop true
warmup done
Cycles: 2,114,347
branch mispredicted: 21
-1677344554

[root@archi-srv condloop]# java -cp overseer.jar:. CondLoop false
warmup done
Cycles: 7,471,464
branch mispredicted: 261,988
-1541838686

See, the number of cycles goes through the roof: more than 3x! Remember that a misprediction means the CPU must flush its pipeline and decode instructions starting from the new address, stalling execution during this time. Depending on the CPU, this lasts 10 to 20 cycles.

In the Stack Overflow question, sorting the array improved the test a lot. Let's do the same:

 int[] data = createData(512*1024, false, predictable); 
 Arrays.sort(data);
 count += benchCondLoop(data);


[root@archi-srv condloop]# java -cp overseer.jar:. CondLoop false
warmup done
Cycles: 2,112,265
branch mispredicted: 34
-1659649448

This is indeed very efficient: the branches are now well predicted again.

You can find the code of this example on my GitHub.

Tuesday, December 17, 2013

ArrayList vs LinkedList

This post was originally posted on Java Advent Calendar

I must confess the title of this post is a little bit catchy. I have recently read this blog post, which is a good summary of the discussions & debates on this subject.
But this time I would like to try a different approach to compare those two well-known data structures: using Hardware Performance Counters.

I will not perform a micro-benchmark, well, not directly: I will not measure time with System.nanoTime(), but rather use HPCs like cache hits/misses.

No need to present those data structures; everybody knows what they are used for and how they are implemented. I am focusing my study on list iteration because, besides adding an element, this is the most common task for a list, and also because the memory access pattern of a list is a good example of CPU cache interaction.


Here is my code measuring list iteration for LinkedList & ArrayList:

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

import ch.usi.overseer.OverHpc;

public class ListIteration
{
    private static List<String> arrayList = new ArrayList<>();
    private static List<String> linkedList = new LinkedList<>();

    public static void initializeList(List<String> list, int bufferSize)
    {
        for (int i = 0; i < 50000; i++)
        {
            byte[] buffer = null;
            if (bufferSize > 0)
            {
                buffer = new byte[bufferSize];
            }
            String s = String.valueOf(i);
            list.add(s);
            // avoid buffer to be optimized away
            if (System.currentTimeMillis() == 0)
            {
                System.out.println(buffer);
            }
        }
    }

    public static void bench(List<String> list)
    {
        if (list.contains("bar"))
        {
            System.out.println("bar found");
        }
    }

    public static void main(String[] args) throws Exception
    {
        if (args.length != 2) return;
        List<String> benchList = "array".equals(args[0]) ? arrayList : linkedList;
        int bufferSize = Integer.parseInt(args[1]);
        initializeList(benchList, bufferSize);
        HWCounters.init();
        System.out.println("init done");
        // warmup
        for (int i = 0; i < 10000; i++)
        {
            bench(benchList);
        }
        Thread.sleep(1000);
        System.out.println("warmup done");

        HWCounters.start();
        for (int i = 0; i < 1000; i++)
        {
            bench(benchList);
        }
        HWCounters.stop();
        HWCounters.printResults();
        HWCounters.shutdown();
    }
}

To measure, I am using a class called HWCounters, based on the overseer library, to get Hardware Performance Counters. You can find the code of this class here.

The program takes two parameters: the first one chooses between the ArrayList and LinkedList implementations; the second one is a buffer size used in the initializeList method. This method fills a list implementation with 50K strings. Each string is newly created just before being added to the list. We may also allocate a buffer, based on the second parameter of the program; if it is 0, no buffer is allocated.
The bench method searches for a constant string which is not contained in the list, so we fully traverse it.
Finally, the main method initializes the list, warms up the bench method and measures 1000 runs of it. Then we print the results from the HPCs.

Let's run our program with no buffer allocation on Linux with 2 Xeon X5680:

[root@archi-srv]# java -cp .:overseer.jar com.ullink.perf.myths.ListIteration array 0
init done
warmup done
Cycles: 428,711,720
Instructions: 776,215,597
L2 hits: 5,302,792
L2 misses: 23,702,079
LLC hits: 42,933,789
LLC misses: 73
CPU migrations: 0
Local DRAM: 0
Remote DRAM: 0

[root@archi-srv]# java -cp .:overseer.jar com.ullink.perf.myths.ListIteration linked 0
init done
warmup done
Cycles: 767,019,336
Instructions: 874,081,196
L2 hits: 61,489,499
L2 misses: 2,499,227
LLC hits: 3,788,468
LLC misses: 0
CPU migrations: 0
Local DRAM: 0
Remote DRAM: 0

First run is on the ArrayList implementation, second with LinkedList.

  • The number of cycles is the number of CPU cycles spent executing our code. Clearly, LinkedList spends many more cycles than ArrayList.
  • The instruction count is a little higher for LinkedList, but not significantly so.
  • For L2 cache accesses we have a clear difference: ArrayList has significantly more L2 misses than LinkedList.
  • Mechanically, LLC hits are much more numerous for ArrayList.

The conclusion of this comparison is that most of the data accessed during list iteration is located in L2 for LinkedList, but in L3 for ArrayList.
My explanation is that the strings added to the list are created right before insertion. For LinkedList this means the Node entry created when adding the element sits right next to the string: we have more locality between the node and its element.

But let's re-run the comparison with an intermediary buffer allocated before each new String is added.


[root@archi-srv]# java -cp .:overseer.jar com.ullink.perf.myths.ListIteration array 256
init done
warmup done
Cycles: 584,965,201
Instructions: 774,373,285
L2 hits: 952,193
L2 misses: 62,840,804
LLC hits: 63,126,049
LLC misses: 4,416
CPU migrations: 0
Local DRAM: 824
Remote DRAM: 0

[root@archi-srv]# java -cp .:overseer.jar com.ullink.perf.myths.ListIteration linked 256
init done
warmup done
Cycles: 5,289,317,879
Instructions: 874,350,022
L2 hits: 1,487,037
L2 misses: 75,500,984
LLC hits: 81,881,688
LLC misses: 5,826,435
CPU migrations: 0
Local DRAM: 1,645,436
Remote DRAM: 1,042

Here the results are quite different:

  • Cycles are roughly 10 times higher.
  • Instructions stay the same as previously.
  • For cache accesses, ArrayList has more L2 misses/LLC hits than in the previous run, but still in the same order of magnitude. LinkedList, on the contrary, has many more L2 misses/LLC hits, and moreover a significant number of LLC misses/DRAM accesses. And this is where the difference lies.

With the intermediary buffer, we are pushing the entries and the strings away from each other, which generates more cache misses and, in the end, DRAM accesses, which are much slower than cache hits.
ArrayList is more predictable here, since it keeps the elements local to each other.

The memory access pattern is crucial for list iteration performance. ArrayList is more stable than LinkedList: whatever you do between each element insertion, your data stays much more local than with LinkedList.
Remember also that iterating through an array is much more efficient for the CPU, since its very predictable access pattern triggers hardware prefetching.
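The benchmark source is not shown here, but the two allocation patterns discussed above can be sketched as follows (class and method names are mine, and the 256-byte buffer size is an assumption based on the command-line argument):

```java
import java.util.LinkedList;
import java.util.List;

public class LocalityPattern
{
    // Pattern 1: each String is created right before being added, so the String
    // and the LinkedList Node allocated by add() sit next to each other on the heap.
    static List<String> buildLocal(int n)
    {
        List<String> list = new LinkedList<String>();
        for (int i = 0; i < n; i++)
        {
            list.add("item" + i);
        }
        return list;
    }

    // Pattern 2: an intermediary buffer is allocated between insertions, pushing
    // successive Nodes and Strings apart on the heap and destroying locality.
    static List<String> buildScattered(int n)
    {
        List<String> list = new LinkedList<String>();
        for (int i = 0; i < n; i++)
        {
            byte[] buffer = new byte[256]; // intermediary allocation (size assumed)
            list.add("item" + i + buffer.length); // keeps the buffer live across the add
        }
        return list;
    }
}
```

Iterating the first list mostly hits caches, while the second one is the kind of layout that triggers the DRAM accesses seen above.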



Wednesday, October 30, 2013

Hardware Performance Counters: atomic vs standard incrementation

In a previous post on Hardware Performance Counters, I said that the main problem of the perf command is that only programs launched from the command line can be monitored this way.
It means that when you want to monitor HPC for a Java program, you also monitor the JVM and its startup phase, including warmup. Obviously this degrades the measured performance compared to the steady state of the application, once the code is JITed.

The perfect illustration of this flaw is presented in this blog post. In an attempt to understand the cost of a memory barrier, the author uses the perf command in a micro-benchmark comparing the simple incrementation of a regular variable to the incrementation of an AtomicInteger. The author is fully aware that this methodology contains flaws:

We must be careful interpreting these results, because they are polluted with data not related to our program under test, like the JVM startup sequence.
To avoid this bias in the micro-benchmark, we can use the HPC more precisely with the help of the overseer library presented in my previous post.
I have written a similar micro-benchmark with overseer and tried to eliminate all the startup & warmup bias from the measurement.

Here is the code:

import ch.usi.overseer.OverHpc;
import java.util.concurrent.atomic.AtomicInteger;

public class Increment
{
    private static String[] EVENTS = {
        "UNHALTED_CORE_CYCLES",
        "INSTRUCTION_RETIRED",
        "L2_RQSTS:LD_HIT",
        "L2_RQSTS:LD_MISS",
        "LLC_REFERENCES",
        "MEM_LOAD_RETIRED:LLC_MISS",
        "PERF_COUNT_SW_CPU_MIGRATIONS",
        "MEM_UNCORE_RETIRED:LOCAL_DRAM_AND_REMOTE_CACHE_HIT",
        "MEM_UNCORE_RETIRED:REMOTE_DRAM"
    };

    private static String[] EVENTS_NAME = {
        "Cycles",
        "Instructions",
        "L2 hits",
        "L2 misses",
        "LLC hits",
        "LLC misses",
        "CPU migrations",
        "Local DRAM",
        "Remote DRAM"
    };

    private static long[] results = new long[EVENTS.length];
    private static OverHpc oHpc = OverHpc.getInstance();

    static int counter;
    static AtomicInteger atomicCounter = new AtomicInteger();

    static void stdIncrement()
    {
        counter++;
    }

    static void atomicIncrement()
    {
        atomicCounter.incrementAndGet();
    }

    static void benchStdIncrement(int loopCount)
    {
        for (int i = 0; i < loopCount; i++)
        {
            stdIncrement();
        }
    }

    static void benchAtomicIncrement(int loopCount)
    {
        for (int i = 0; i < loopCount; i++)
        {
            atomicIncrement();
        }
    }

    public static void main(String[] args) throws Exception
    {
        boolean std = args.length > 0 && args[0].equals("std");
        StringBuilder sb  = new StringBuilder();
        for (int i = 0; i < EVENTS.length; i++)
        {
            if (i > 0)
            {
                sb.append(",");
            }
            sb.append(EVENTS[i]);
        }
        oHpc.initEvents(sb.toString());
        // warmup
        if (std)
        {
            benchStdIncrement(10000);
            benchStdIncrement(10000);
        }
        else
        {
            benchAtomicIncrement(10000);
            benchAtomicIncrement(10000);
        }
        Thread.sleep(1000);
        System.out.println("warmup done");

        int tid = oHpc.getThreadId();
        oHpc.bindEventsToThread(tid);
        // bench
        if (std)
            benchStdIncrement(5*1000*1000);
        else
            benchAtomicIncrement(5*1000*1000);

        for (int i = 0; i < EVENTS.length; i++)
        {
            results[i] = oHpc.getEventFromThread(tid, i);
        }
        for (int i = 0; i < EVENTS.length; i++)
        {
            System.out.println(EVENTS_NAME[i] + ": " + String.format("%,d", results[i]));
        }
        OverHpc.shutdown();
    }
}

Depending on the argument, I call the method performing the loop twice to make sure everything is compiled before the counters are started. I verified this with -XX:+PrintCompilation.

So now we can run it and compare the raw results for 3 runs each:

[root@archi-srv myths]# java -cp .:overseer.jar Increment atomic
warmup done
Cycles: 150,054,450
Instructions: 59,961,219
L2 hits: 245
L2 misses: 61
LLC hits: 164
LLC misses: 0
CPU migrations: 0
Local DRAM: 10
Remote DRAM: 0
[root@archi-srv myths]# java -cp .:overseer.jar Increment atomic
warmup done
Cycles: 149,907,417
Instructions: 59,968,058
L2 hits: 54
L2 misses: 38
LLC hits: 105
LLC misses: 9
CPU migrations: 0
Local DRAM: 11
Remote DRAM: 0
[root@archi-srv myths]# java -cp .:overseer.jar Increment atomic
warmup done
Cycles: 149,890,503
Instructions: 59,964,047
L2 hits: 372
L2 misses: 30
LLC hits: 85
LLC misses: 0
CPU migrations: 0
Local DRAM: 7
Remote DRAM: 0

[root@archi-srv myths]# java -cp .:overseer.jar Increment std
warmup done
Cycles: 1,217,065
Instructions: 3,155,346
L2 hits: 288
L2 misses: 518
LLC hits: 1,452
LLC misses: 0
CPU migrations: 0
Local DRAM: 0
Remote DRAM: 0
[root@archi-srv myths]# java -cp .:overseer.jar Increment std
warmup done
Cycles: 1,367,222
Instructions: 3,155,345
L2 hits: 409
L2 misses: 387
LLC hits: 1,153
LLC misses: 0
CPU migrations: 0
Local DRAM: 0
Remote DRAM: 0
[root@archi-srv myths]# java -cp .:overseer.jar Increment std
warmup done
Cycles: 1,366,928
Instructions: 3,155,345
L2 hits: 374
L2 misses: 423
LLC hits: 1,183
LLC misses: 0
CPU migrations: 0
Local DRAM: 0
Remote DRAM: 0

As you can see, there are differences:
  • more cycles for the atomic version
  • more instructions for the atomic version
  • a few more L2 misses for the std version
  • more LLC hits for the std version
  • Local DRAM accesses for the atomic version
We can also notice that the atomic version runs at roughly 2.5 cycles per instruction, while the std version executes roughly 2.5 instructions per cycle...

Still, the conclusion remains the same: we spend more instructions & more cycles with the atomic version. We also have more DRAM accesses, which have significantly higher latency than caches.
But now we are sure that those measurements are more accurate than ones including the warmup phase.
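Part of the extra cost is inherent to the read-modify-write nature of incrementAndGet(): in JDK 7 it is implemented as a CAS retry loop, and on x86 the CAS compiles down to a lock-prefixed cmpxchg instruction which drains the store buffer. A sketch of the equivalent loop:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter
{
    private final AtomicInteger value = new AtomicInteger();

    // Equivalent of AtomicInteger.incrementAndGet() as found in JDK 7:
    // read the current value, compute the next one, then try to CAS it in;
    // retry if another thread modified the value in between.
    public int incrementAndGet()
    {
        for (;;)
        {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next))
            {
                return next;
            }
        }
    }
}
```

Even uncontended, every iteration pays for the lock-prefixed instruction, which is consistent with the large cycle difference measured above.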

Stay tuned for another adventure in HPC land...

Tuesday, September 3, 2013

Null check elimination

Among all the kinds of optimizations performed by the JIT, one is most often cited as an example: null check elimination. Maybe because dealing with null is such a common task in Java.

Regarding performance, what is the cost of null checking anyway? Let's take a basic example here:

import java.util.ArrayList;
import java.util.List;

public class NullCheck
{
    private static void bench(List list)
    {
        list.contains("foo");
    }
    
    public static void main(String[] args) throws InterruptedException
    {
        List list = new ArrayList();
        list.add("bar");
        for (int i = 0; i < 10000; i++)
        {
            bench(list);
        }
        Thread.sleep(1000);
    }
}

A classic example: in the bench method I call the contains method on the list instance passed as a parameter. As usual, we examine the PrintAssembly output of this method (the 64-bit version now, since I have my hsdis-amd64.dll ;-)):

[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
  # {method} 'bench' '(Ljava/util/List;)V' in 'com/bempel/sandbox/NullCheck'
  # parm0:    rdx:rdx   = 'java/util/List'
  #           [sp+0x40]  (sp of caller)
  0x00000000025302a0: mov    DWORD PTR [rsp-0x6000],eax
  0x00000000025302a7: push   rbp
  0x00000000025302a8: sub    rsp,0x30           ;*synchronization entry
                                                ; - com.bempel.sandbox.NullCheck::bench@-1 (line 10)
  0x00000000025302ac: mov    r8,rdx
  0x00000000025302af: mov    r10,QWORD PTR [rdx+0x8]  ; implicit exception: dispatches to 0x0000000002530621
  0x00000000025302b3: movabs r11,0x5777e60      ;   {oop('java/util/ArrayList')}
  0x00000000025302bd: cmp    r10,r11
  0x00000000025302c0: jne    0x00000000025305fc  ;*invokeinterface contains
                                                ; - com.bempel.sandbox.NullCheck::bench@3 (line 10)
[...]

Nothing fancy here: the contains call is inlined, and before it the instance is tested against the ArrayList class, to make sure the devirtualization of the call can be applied safely, as explained here.

If we now change our example to add a null check on the list instance passed as a parameter:

import java.util.ArrayList;
import java.util.List;

public class NullCheck
{
    private static void bench(List list)
    {
        if (list != null)
        {
            list.contains("foo");
        }
    }
    
    public static void main(String[] args) throws InterruptedException
    {
        List list = new ArrayList();
        list.add("bar");
        for (int i = 0; i < 10000; i++)
        {
            bench(list);
        }
        Thread.sleep(1000);
    }
}

The output is in fact exactly the same.

[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
  # {method} 'bench' '(Ljava/util/List;)V' in 'com/bempel/sandbox/NullCheck'
  # parm0:    rdx:rdx   = 'java/util/List'
  #           [sp+0x40]  (sp of caller)
  0x00000000025402a0: mov    DWORD PTR [rsp-0x6000],eax
  0x00000000025402a7: push   rbp
  0x00000000025402a8: sub    rsp,0x30           ;*synchronization entry
                                                ; - com.bempel.sandbox.NullCheck::bench@-1 (line 10)
  0x00000000025402ac: mov    r8,rdx
  0x00000000025402af: mov    r10,QWORD PTR [rdx+0x8]  ; implicit exception: dispatches to 0x000000000254062d
  0x00000000025402b3: movabs r11,0x5787e60      ;   {oop('java/util/ArrayList')}
  0x00000000025402bd: cmp    r10,r11
  0x00000000025402c0: jne    0x00000000025405fc  ;*invokeinterface contains
                                                ; - com.bempel.sandbox.NullCheck::bench@7 (line 12)
[...]

However the semantics of our bench method are totally different: in the first version a NullPointerException can be thrown if we pass null to the bench method. Now it is impossible.
So how can the JIT handle this, since there is no special instruction checking it in the generated code?

Well, null is in fact a very special value (the billion dollar mistake, blah blah blah...). When you try to access this value (I mean dereference it), whatever OS you are using you end up with an error (ACCESS_VIOLATION, SEGMENTATION FAULT, ...). Most of the time this error leads to an application crash.
However it is still possible to intercept this kind of error. On Unix a SEGMENTATION FAULT is just a signal like SIGQUIT, SIGINT, etc. So you can provide a handler for this signal and execute some code when it is raised. On Windows there is a mechanism of exception handlers; see RtlAddFunctionTable.
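At the Java level this machinery is completely transparent: dereferencing null simply results in a NullPointerException, synthesized by the JVM from the hardware fault, as this trivial example shows:

```java
public class ImplicitNpe
{
    // No explicit null check here: when s is null, the faulting memory access
    // is trapped by the JVM, which turns it into a NullPointerException.
    static int length(String s)
    {
        return s.length();
    }

    public static void main(String[] args)
    {
        try
        {
            length(null);
        }
        catch (NullPointerException e)
        {
            System.out.println("NPE synthesized from the trapped access");
        }
    }
}
```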

So the JVM uses those mechanisms to handle nulls. If you look carefully at the generated code, you can see a comment on the line:

mov r10,QWORD PTR [rdx+0x8] ;implicit exception: dispatches to 0x000000000254062d

If we follow the address we get:

  0x000000000254062d: mov    edx,0xffffffad
  0x0000000002540632: mov    QWORD PTR [rsp],r8
  0x0000000002540636: nop
  0x0000000002540637: call   0x0000000002516f60  ; OopMap{[0]=Oop off=924}
                                                ;*ifnull
                                                ; - com.bempel.sandbox.NullCheck::bench@1 (line 10)
                                                ;   {runtime_call}
  0x000000000254063c: int3                      ;*ifnull
                                                ; - com.bempel.sandbox.NullCheck::bench@1 (line 10)

This code is also emitted along with the generated code of our bench method. We can see that in this case there is a null check (bytecode instruction ifnull).

Let's add a call to the bench method with a null parameter to see what happens:

    public static void main(String[] args) throws InterruptedException
    {
        List list = new ArrayList();
        list.add("bar");
        for (int i = 0; i < 10000; i++)
        {
            bench(list);
        }
        Thread.sleep(1000);
        bench(null);
    }

Calling the test with -XX:+PrintCompilation we get the following result:

  1       java.util.ArrayList::indexOf (67 bytes)
  2       com.bempel.sandbox.NullCheck::bench (14 bytes)
  3       java.util.ArrayList::contains (14 bytes)
  2   made not entrant  com.bempel.sandbox.NullCheck::bench (14 bytes)

So the bench method was the second method compiled here, but it was then "made not entrant", meaning it was deoptimized. The call with a null parameter does not cause any NullPointerException or application crash, but triggers the deoptimization of a method that was optimized too aggressively.
If we then call bench again 10,000 times with our previous list instance, the bench method gets compiled again:

    public static void main(String[] args) throws InterruptedException
    {
        List list = new ArrayList();
        list.add("bar");
        for (int i = 0; i < 10000; i++)
        {
            bench(list);
        }
        Thread.sleep(1000);
        bench(null);
        for (int i = 0; i < 10000; i++)
        {
            bench(list);
        }
        Thread.sleep(1000);
    }

PrintCompilation output:

  1       java.util.ArrayList::indexOf (67 bytes)
  2       com.bempel.sandbox.NullCheck::bench (14 bytes)
  3       java.util.ArrayList::contains (14 bytes)
  2   made not entrant  com.bempel.sandbox.NullCheck::bench (14 bytes)
  4       com.bempel.sandbox.NullCheck::bench (14 bytes)

PrintAssembly output:

[Disassembling for mach='i386:x86-64']
[Entry Point]
[Verified Entry Point]
  # {method} 'bench' '(Ljava/util/List;)V' in 'com/bempel/sandbox/NullCheck'
  # parm0:    rdx:rdx   = 'java/util/List'
  #           [sp+0x40]  (sp of caller)
  0x0000000002500da0: mov    DWORD PTR [rsp-0x6000],eax
  0x0000000002500da7: push   rbp
  0x0000000002500da8: sub    rsp,0x30           ;*synchronization entry
                                                ; - com.bempel.sandbox.NullCheck::bench@-1 (line 10)
  0x0000000002500dac: mov    r8,rdx
  0x0000000002500daf: test   rdx,rdx
  0x0000000002500db2: je     0x0000000002501104  ;*ifnull
                                                ; - com.bempel.sandbox.NullCheck::bench@1 (line 10)
  0x0000000002500db8: mov    r10,QWORD PTR [rdx+0x8]
  0x0000000002500dbc: movabs r11,0x5747e60      ;   {oop('java/util/ArrayList')}
  0x0000000002500dc6: cmp    r10,r11
  0x0000000002500dc9: jne    0x0000000002501110  ;*invokeinterface contains
                                                ; - com.bempel.sandbox.NullCheck::bench@7 (line 12)

We now have an explicit null check:

  0x0000000002500daf: test   rdx,rdx
  0x0000000002500db2: je     0x0000000002501104  ;*ifnull

The JIT falls back to the classic way of handling nulls, but as long as the method is never called with a null value, the optimization is a good deal!

So do not fear null checks in your code: they help protect you against unexpected values and their cost is virtually zero!

Friday, August 2, 2013

Hardware performance counters

Introduction

Hardware Performance Counters (HPC) are counters provided directly by the CPU thanks to a dedicated unit: the Performance Monitoring Unit (PMU).
Depending on the CPU architecture/vendor, different kinds of counters are available. This is highly hardware dependent; even within a single vendor, each CPU generation has its own implementation.

Those counters include cycles, number of instructions, cache hits/misses, branch mispredictions, etc.

This can be valuable if they are correctly used.

Perf

On Linux there is a command line tool which provides access to those counters: perf.
For example, we can capture statistics about the execution of a command like ls:

perf stat -d ls -lR

 Performance counter stats for 'ls -lR':

    1725.533251 task-clock:HG             #    0.119 CPUs utilized
          3,421 context-switches:HG       #    0.002 M/sec
             39 CPU-migrations:HG         #    0.023 K/sec
            590 page-faults:HG            #    0.342 K/sec
  5,696,827,092 cycles:HG                 #    3.301 GHz                     [39.30%]
  3,395,404,627 stalled-cycles-frontend:HG #   59.60% frontend cycles idle    [39.23%]
  1,837,081,737 stalled-cycles-backend:HG #   32.25% backend  cycles idle    [39.84%]
  4,539,875,605 instructions:HG           #    0.80  insns per cycle
                                          #    0.75  stalled cycles per insn [49.95%]
    942,557,723 branches:HG               #  546.241 M/sec                   [50.11%]
     16,070,006 branch-misses:HG          #    1.70% of all branches         [50.57%]
  1,271,339,179 L1-dcache-loads:HG        #  736.780 M/sec                   [50.94%]
    145,209,839 L1-dcache-load-misses:HG  #   11.42% of all L1-dcache hits   [50.73%]
     83,147,216 LLC-loads:HG              #   48.186 M/sec                   [40.04%]
        525,580 LLC-load-misses:HG        #    0.63% of all LL-cache hits    [39.54%]

    14.489543284 seconds time elapsed


With this command you can also list the available events:

perf list

List of pre-defined events (to be used in -e):
  cpu-cycles OR cycles                               [Hardware event]
  instructions                                       [Hardware event]
  cache-references                                   [Hardware event]
  cache-misses                                       [Hardware event]
  branch-instructions OR branches                    [Hardware event]
  branch-misses                                      [Hardware event]
  bus-cycles                                         [Hardware event]
  stalled-cycles-frontend OR idle-cycles-frontend    [Hardware event]
  stalled-cycles-backend OR idle-cycles-backend      [Hardware event]

  cpu-clock                                          [Software event]
  task-clock                                         [Software event]
  page-faults OR faults                              [Software event]
  context-switches OR cs                             [Software event]
  cpu-migrations OR migrations                       [Software event]
  minor-faults                                       [Software event]
  major-faults                                       [Software event]
  alignment-faults                                   [Software event]
  emulation-faults                                   [Software event]

  L1-dcache-loads                                    [Hardware cache event]
  L1-dcache-load-misses                              [Hardware cache event]
  L1-dcache-stores                                   [Hardware cache event]
  L1-dcache-store-misses                             [Hardware cache event]
  L1-dcache-prefetches                               [Hardware cache event]
  L1-dcache-prefetch-misses                          [Hardware cache event]
  L1-icache-loads                                    [Hardware cache event]
  L1-icache-load-misses                              [Hardware cache event]
  L1-icache-prefetches                               [Hardware cache event]
  L1-icache-prefetch-misses                          [Hardware cache event]
  LLC-loads                                          [Hardware cache event]
  LLC-load-misses                                    [Hardware cache event]
  LLC-stores                                         [Hardware cache event]
  LLC-store-misses                                   [Hardware cache event]
  LLC-prefetches                                     [Hardware cache event]
  LLC-prefetch-misses                                [Hardware cache event]
  dTLB-loads                                         [Hardware cache event]
  dTLB-load-misses                                   [Hardware cache event]
  dTLB-stores                                        [Hardware cache event]
  dTLB-store-misses                                  [Hardware cache event]
  dTLB-prefetches                                    [Hardware cache event]
  dTLB-prefetch-misses                               [Hardware cache event]

[...]

So you can, for example, specify one of those events when executing your command:

perf stat -e dTLB-load-misses ls -lR

 Performance counter stats for 'ls -lR':

         7,198,657 dTLB-misses


      13.225589146 seconds time elapsed


You can also specify specific, processor-dependent counters from the Intel Software Developer's Manual Volume 3B, chapter 19.

But the major flaw of this approach is that you can only monitor a program launched from the command line, and it must exit in order to get the results. It is limited to small programs, namely micro-benchmarks.

How can we use hardware counters to monitor a real life application?

libpfm

Enter libpfm, which provides an API to access hardware counters. This enables more flexibility for integrating those hardware counters into applications.

overseer

For Java applications, I have found the overseer library, which provides a Java API through a JNI wrapper to libpfm.
With the help of this library it is now possible to integrate hardware counters directly inside any application.

Here is an example of how to use it:

import ch.usi.overseer.OverHpc;

public class HardwareCounters
{
    private static String[] EVENTS = {
        "UNHALTED_CORE_CYCLES",
        "INSTRUCTION_RETIRED",
        "L2_RQSTS:LD_HIT",
        "L2_RQSTS:LD_MISS",
        "LLC_REFERENCES",
        "MEM_LOAD_RETIRED:LLC_MISS",
        "PERF_COUNT_SW_CPU_MIGRATIONS",
        "MEM_UNCORE_RETIRED:LOCAL_DRAM_AND_REMOTE_CACHE_HIT",
        "MEM_UNCORE_RETIRED:REMOTE_DRAM"
    };
    
    private static String[] EVENTS_NAME = {
        "Cycles",
        "Instructions",
        "L2 hits",
        "L2 misses",
        "LLC hits",
        "LLC misses",
        "CPU migrations",
        "Local DRAM",
        "Remote DRAM"
    };
    
    private static long[] results = new long[EVENTS.length];

    public static void main(String[] args) throws Exception
    {
        OverHpc oHpc = OverHpc.getInstance();
        StringBuilder sb  = new StringBuilder();
        for (int i = 0; i < EVENTS.length; i++)
        {
            if (i > 0)
            {
                sb.append(",");
            }
            sb.append(EVENTS[i]);
        }
        oHpc.initEvents(sb.toString());        
        int tid = oHpc.getThreadId();
        oHpc.bindEventsToThread(tid);
        
        // process to monitor
        
        for (int i = 0; i < EVENTS.length; i++)
        {
            results[i] = oHpc.getEventFromThread(tid, i);
        }
        for (int i = 0; i < EVENTS.length; i++)
        {
            System.out.println(EVENTS_NAME[i] + ": " + String.format("%,d", results[i]));
        }
        
        OverHpc.shutdown();
    }
}

Conclusion

Hardware performance counters provided by the CPU can be valuable to understand why you have spikes in your application process. Thanks to the overseer library you can inject those counters deep inside your application for fine-grained monitoring of your process.

In the next posts we will use this technique for various examples.

References

Overseer library: http://www.peternier.com/projects/overseer/overseer.php
libpfm: http://perfmon2.sourceforge.net/docs_v4.html
Intel Software Developer's Manual Volume 3B: http://download.intel.com/products/processor/manual/253669.pdf