Monday, 27 July 2015

JMH perfasm explained: Looking at False Sharing on Conditional Inlining

There is an edge that JMH (read the jmh resources page for other posts and related nuggets) has over other frameworks. That edge is so sharp you may well cut yourself using it, but given an infinite supply of bandages you should definitely use it :-) This edge is the ultimate profiler, the perfasm (pronounced PERF-AWESOME!, the exclamation mark is silent). I've been meaning to write about it for a while and as it just saved my ass recently...

SpscGrowableArrayQueue's False Sharing Issue

JCTools includes a specialized SPSC (single-producer/consumer) bounded queue aimed at actor systems, the SpscGrowableArrayQueue. This queue is quite neat because it is combining the compactness of a linked queue with the awesome throughput of an array backed queue. The idea is quite simple, and similar in spirit to an ArrayList:
  1. Start with a small buffer, default to 16 elements
  2. As long as the queue doesn't grow beyond that just stay small
  3. If on offer you find that the queue is full double the size of the underlying buffer
The mechanics of the queue resizing, handling intermediate queue states, and detecting a new buffer from the consumer are a bit involved and will perhaps be expanded on some other day.
Because the queue is geared at actor systems per queue footprint is important, so I reworked the memory layout of SpscArrayQueue and skimped on the padding where possible. If you have no idea what I mean by padding, and why it is done you can read the following posts:
  1. False sharing and the MESI protocol details related to cached counters: This explores the motivation of padding fields in the first place
  2. Discovering and controlling object layout(Using an early version of JOL): This show cases the JOL tool and how it should be used to discover object layout
  3. SPSC Revisited - part I: An empiricist tale: This post discusses the inlining of counters into a queue use inheritance to control field order and introduce padding. In this post I show how to pad the counter from each other and also how to pre/post pad the class and buffer.
  4. A more high level summary of false sharing is given in this post
With an overhead of roughly 128b per padding this was instrumental to reducing the high memory cost per queue. The hot counters were still padded from each other, but the class pre/post padding were removed as well as the padding of the elements array.
So, I gave up on some of the padding, but reasoned that in most cases this should not make a difference because the hottest fields were still padded from each other.

Why so slow?

Now, I expect a queue that is very similar to SpscArrayQueue, but adds some features to be slower. There's just very little you can do about this, doing more will cost you something. But given that resizing is a rather exceptional event for this queue I thought this will be a minor hit, maybe 10-20% reduction in performance (for a certain definition of performance). JCTools has some benchmarks included and I ran the QueueThroughputBackoffNone which will have a producer and consumer threads chewing on the queue as hard as they can.
Since the numbers are not very important here, I'll stick to running benchmarks on my laptop (an MBP 11,1/ Ubuntu 15.04/Zulu JDK8u45 - Zulu is a supported and tested OpenJDK build) in this post. Rest assured that I have confirmed on real hardware the same results. The results I quote below are the pollsMade figure which reflects the actual delivered throughput.
To my puzzlement I found:
SpscArrayQueue          361.223 ±  7.156  ops/us
SpscGrowableArrayQueue   64.277 ± 31.803  ops/us

Crap performance and terrible variance, bugger me sideways.
Looking at the code, I thought there must be something I was doing to upset the mighty JIT spirits. Maybe my methods were too big? My branches too unpredictable? My variable names too offensive? So I tweaked the code this way and that, looked at the inlining log (-XX:+PrintInlining) and the assembly (-XX:+PrintAssembly/-XX:CompileCommand=print,*) I got some minor improvements, but it mostly still sucked. What's wrong? A quick look at "-prof perfasm" and a fair amount of head scratching lead to the answer. The code (before and after) is on github, the main focus for this post is perfasm and its usage so I won't dive into it.

Before we start: How does perfasm work?

To use perfasm you'll need a Linux(or Windows) OS, running on real hardware, the relevant perf tool installed, and a JVM setup to print assembly.
Perf is (amongst other things) an instruction level profiler, of the kind that don't usually work with Java traditionally (though things are slowly changing). One of the features offered by perf is the "perf record/annotate" workflow. Perf interrupts your process repeatedly and records the current PC (program counter) at each sample. This sampling of the program counter is recorded over a period of time to be post processed by the annotate feature which correlates samples to methods, lines of code and assembly instructions. The challenge for perf when dealing with Java code is that the binary form of each method only exists for the duration of that process lifetime. This means the PC is mostly referring to methods that are nowhere to be found when the annotation stage comes along.
To summarize: perf record works, but perf annotate is broken for Java.
To make perfasm work JMH captures the JVM compiler outputs by enabling the following flags:
  -XX:+UnlockDiagnosticVMOptions
  -XX:+LogCompilation
  -XX:LogFile=...
  -XX:+PrintAssembly
  -XX:+PrintInterpreter
  -XX:+PrintNMethods
  -XX:+PrintNativeNMethods
  -XX:+PrintSignatureHandlers
  -XX:+PrintAdapterHandlers;
  -XX:+PrintStubCode
  -XX:+PrintCompilation
  -XX:+PrintInlining
  -XX:+TraceClassLoading
  -XX:PrintAssemblyOptions=syntax
The data collected here allows perfasm to do the annotation process by itself. I'll not bore you with the details of the output processing, but hat off to Mr. Shipilev who ploughed through the gory mess and made this work. The end result is a detailed output offering you the assembly of the hottest regions in your program, along with a list of the top hottest methods including both native and Java parts of the stack.
Because perf allows the recording of any number of events the definition of hottest depends on the events you choose to profile. The default events are cycles and instructions, but you can specify any number of events (E.g. -Djmh.perfasm.events=cycles,cache-misses). The first event specified will be the 'hotness' qualifier.

What do you get?

The perfasm output is split into 4 sections:
  1. Annotated assembly output for the top HOT regions(titled "Hottest code regions (>10.00% "cycles" events):")
  2. A list of the hot regions in your benchmark (titled "[Hottest Regions]"). This is the list of compiled methods inclusive of native methods, which is why we have the next section.
  3. A list of the hottest methods after inlining, so only java methods in this section (titled "[Hottest Methods (after inlining)]").
  4. A distribution of the cycles between types of regions (titled "[Distribution by Area]"). This will inform you of how the split goes between compiled code, kernel, JVM etc.
The most interesting of the 4 sections (and the only one I'm going to explain here) is the annotated hot regions section. I tend to edit the output of assembly spouting tools to make it more readable, but for the sake of explaining what's what here's the original output with footnotes:
To focus on the main features I trimmed out allot of the assembly code, which is where you see the "<MORE OF SAME/>" comments. Note the legend at the start of the output describing the 2 left most columns: cycles and instructions sample percentages.
Here's the footnotes explained:
  1. (Line 7) This column is the instruction address. This is important as this column is both how perfasm matches the perf record output with the PrintAssembly output and how poor sods like yours truely have to look for jump destinations in the code.
  2. (Line 7) This column has the actual instructions, in AT&T syntax (src -> dst).
  3. (Line 7) This column is the instruction annotation generated by PrintAssembly. It is not always correct, but it's helpful in connecting the dots. Multi-level inlining of methods can be seen in action everywhere.
  4. (Line 12) The condition/loop edge annotation is a wonderful recent addition to perfasm. It makes life marginally more bearable for the assembly consumer by connecting the jumps with their destinations. This is the start point annotation ""
  5. (Line 16) This is the end point annotation ""
When looking at this you will be searching for suspect hot spots in the code. Some instruction which is chewing away your cycles. If you're in luck you'll be able to tweak your code just right to side step it and then you are off and away :-)

Back on Track: Show me the cycles!

Right, so I point Captain Awesome at my benchmark and what do I see (trimmed addresses and edges as there's no important conditions here, also trimmed package names and moved code annotation left):
This instruction, eating 30% of my cycles (line 7, needs HEALING!), is part of a guard generated to ensure that inlining the poll method call is still a valid decision, that the instance observed at the callsite is of the class for which we inlined the poll method. This is not really a Java code line, this is the aftermath of inlining a virtual call (through an interface, see Aleksey's method dispatch inlining article to learn more on inlining).
To be fair, the above is kind of hard to understand. Why is a comparison between a constant and a register eating 30% of the cycles? according to every instruction 'cost' manual this should take 1 cycle (and the CPU can do 3-4 of them in parallel too), this is obviously not the instruction I'm looking for.
This is a phenomena known as 'skid' where the reported instruction at a given sample is inaccurate because modern CPUs are complicated. See the following dialogue on the linux-perf-user mailing list:
> I think Andi mentioned this to me last year -- that instruction profiling was no longer reliable. 
It never was. 
> Is this due to parallel and out-of-order execution? (ie, we're sampling the instruction pointer, but that's set to the resumption instruction, not the instructions being processed in the backend?). 
Most problems are due to 'skid': It takes some time to trigger the profiling interrupt after the event fired. [...] There are also other problems, for example an event may not be tied to an instruction. Some events have inherently large skid.
This sounds grim, but at the end of the day this is as accurate as profiling can get. It's not perfect, but it's still very valuable and you get used to it (there are ways to minimize skid discussed in the link, I've not tried those).
What you end up doing is looking for instructions just before the blamed instruction, or instructions on which the blamed instruction is dependant, which may be more reasonably blamed for the bottleneck. Given the CMP is not the problem, we must ask why would the CPU spend so much time at it? A CMP or a TEST will often get blamed for the price of the load into the registers they use, in this case the CMP is most probably being blamed for the load of the queue type from the object header one instruction back:
"0x8(%r12,%r11,8),%r8d  ; implicit exception: dispatches to 0x00007f83c93f929d"
It doesn't help that the comment talks about some implicit exception (this is an implicit null check), where it could say "getkid(q) + implicit_nullchk(q)" or something similar to indicate we are loading the kid (klass id) from the object header (see object header layout details here).
Now that I had pointed a finger at this operation, I was still quite confused. This was not an issue for any of the other queues, why would loading the kid be such a bottleneck in this case? Maybe I'm wrong about this (always a good assumption)? To prove this is the issue I created a duplicate of the original benchmark where I use the SpscGrowableArrayQueue directly instead of going via the interface, for comparison I also benchmarked the SpscArrayQueue in the same fashion:
SpscArrayQueue          369.544 ±  1.535  ops/us
SpscGrowableArrayQueue  272.021 ± 12.133  ops/us

Now that's more like it! the expected 20% difference is more like 30%, but this is much closer. This still begs the question, why is the type check for SpscGrowableArrayQueue so expensive? We can see that for SpscArrayQueue this makes very little difference, how is SpscGrowableArrayQueue different?

Messing with the KID

I had a long hard look at this issue, which didn't help, then slept on it, which did help, and realized the problem here is that the object header is false-sharing with the producer fields. When I trimmed down the padding on this class in an effort to minimize allocation, I removed the array post and pre-padding as well as the class pre/post padding and reasoned that for the most part I need not worry about the object's neighbours false sharing. What I failed to realize was that the consumer and producer threads might be frequently hitting the object header, in this benchmark on every call. Once I realized this was the issue I reinstated the pre-padding such that the producer index is far enough from the object header to stop interfering with it and the problem went away (see before and after, it's a messy diff as I fixed some other niggles while I was there, you can take the last version and play with adding/removing the padding and reordering the index and cold fields to verify my claims).
Here's the (slightly trimmed for readability) perfasm output for the same conditional inlining check in the padded version:

As an interesting side note, this issue was not visible in the hand rolled throughput benchmarks. This is because in those benchmarks the queue is hoisted into a variable before the consume/produce loop which means the conditional inlining check can be done out of loop as well. This is great for getting nice numbers but hides an issue which users are likely to hit. Credit goes to JMH for herding benchmarks down a path which forces these issues into the measured scope.

Summary

The main focus of this post is perfasm, I hope it helps get you started on using it. The broader context in which you would use this tool is a less explicit background. I use perfasm regularly, but I'm also reasonably happy to look at assembly and compiler annotation which I know most people are not. I find it to be invaluable in issues, like the one described in this post, where the cost is between the Java lines rather than the lines themselves. Any profiler can give you a broad feel of where performance bottlenecks are, and a non-safepoint-biased profiler can show you the hottest line of code. What a Java profiler will not tell you is about the generated JVM code between your lines. It will attribute the cost of those instructions to the nearest Java line, and that can become a very confusing chase.
It also worth pointing out that any nano-benchmark (measuring operations in the 0-500 nanoseconds range) practically requires you to look at the assembly for it's analysis. But 500ns can still be an aweful lot of code and an assembly level profiler is very handy. At this point it worth mentioning the Oracle Solaris Studio (first prize for least known profiler every year since release). It is a great assembly level profiler, and just generally a great profiler. If your measurement needs to take place outside the cosy comforts of a JMH benchmark I would recommend you give it a spin.
Finally, this investigation came in the context of a workflow that is followed in the development of JCTools. I would loosely describe it as follows:
  1. Implement new feature/data structure
  2. Reason about expected impact/performance as part of design and implementation
  3. Test expectations using the existing set of benchmarks
  4. Expectations are far off the mark (if not... not sure, will figure it out when it happens)
  5. Dig in until either expectations or code are cured.
This has a whiff of scientific exploration to it, but I assure you it is not done quite so seriously and I often fail to follow my own advice (or worse other people's advice). The habit of testing performance assumptions/expectation has offered me many an afternoon spent banging my head on a variety of surfaces. Perfasm has been instrumental in reducing the amount of head banging, but I fear nothing short of permanent brain damage will actually solve the problem.
This post has been kindly reviewed by Aleksey Shipilev, Darach Ennis and Richard Warburton. Any remaining errors are entirely their fault ;-)

8 comments:

  1. actor systems most probably require a MPSC growable array queue ^^ ;)

    ReplyDelete
    Replies
    1. Some do, some don't. This one was for RxJava and they needed an SPSC. I think Quasar also use SPSC, but not sure.
      And thanks :)

      Delete
    2. Hm .. SPSC probably is for stream processing, can't see how it makes sense for actors .. I have peeked into your MPSC Array Queue, however efficient dynamic grow is much more complicated in case of MPSC =)

      Delete
    3. Rüdiger...one question... A little OT...why a fixed size queue is not good 4 actors?

      Delete
  2. and ofc great post (as always) ..

    ReplyDelete
  3. really appreciated on code measurement

    ReplyDelete
  4. Yes this is really good. Thanks.

    ReplyDelete
  5. Maybe if i call it "PERF-AWESOME!" my colleagues could start to use it...ahah
    Thanks Nitsan for the post,as ever ^^

    ReplyDelete