Monday 1 February 2016

Wait For It: Counted/Uncounted loops, Safepoints and OSR Compilation

In this previous post about Safepoints I claimed that this here piece of code:

Will get your JVM to hang and will not exit after 5 seconds, as one might expect from reading the code. The reason this happens, I claimed, is because the compiler considers for loops limited by an int to be counted, and will therefore not insert a safepoint poll into the generated assembly code. This was the theory, and I tested it, and it definitely hung on my machine ('It Works On My Machine'™).
To my great surprise, some people not only read the posts, they also run the code (it's like they don't trust me ;-) ). And 2 of those diligent readers got back to me in the comments saying that actually the JVM exists just fine. So I ran it again, and it hung, again. Then I ran it with different versions of the JVM, and it hung as I expected. Then I thought, maybe they are running the code I put on GitHub, so I tried that. That also hung when I ran it from Eclipse, but did not when I built it with Maven and ran it from the command line. EUREKA!
As it happens, Eclipse has it's own Java compiler (a fact well known to those who know it well, no doubt) and the bytecode it generated (the .class file) hung every time, but Maven uses javac and the .class file generated from that did not hang. When testing initially I was running my class inside Eclipse, or using the Eclipse .class file to run from the command line. Both .class files are valid, but different (same-same but different). We can use javap to print out class bytecode and compare ("javap -c -v -p <class file>"). For your pleasure you can find the original Java file, the 2 class files and their respective bytecode printouts on GitHub and come to your own conclusions before you continue reading.
So... is my understanding (and the Eclipse javac) somehow out of date? can this problem of telling counted loops from uncounted loops be solved by the bytecode compiler?

How is the bytecode different?

Well, mostly not that different:
  1. An easy to spot difference which we can tell is semantically the same is the bytecode result for "System.out.println("How Odd:" + l);" you'll notice that Eclipse is using the constructor "java/lang/StringBuilder."<init>":(Ljava/lang/String;)" while javac is opting for "java/lang/StringBuilder."<init>":()V" followed by an append. Semantically the 2 choices are the same. We create a StringBuilder and append the String "How odd:" and the variable l. The reality is that there are subtle differences, the Eclipse version will allocate a larger char[] for instance. It may also be that the Eclipse pattern will miss out on some String concatenation optimization because the OpenJDK implementers are likely to focus on patterns generated from the official javac.
  2. The loop interpretation is different. Eclipse is checking the loop conditions at the end of the loop whereas javac is building a loop with the checks upfront and the loop backedge at the end. That seems like the important difference.
Let's zoom in on the relevant bytecode for the loop:

If we were to reverse translate the loop bytecode to ugly Java (which supported goto) we'd get:

Note that there's no special Java bytecode for loop control flows. So why so different at runtime? If we dump out the assembly for the run method we will see that the Eclipse version is missing the safepoint in the loop (the method has a {poll_return} but there's no {poll}), and the javac version has that safepoint there. Is it the bytecode difference that makes this happen?

Variations on bytecode

I tinkered with the class a bit and refactored the loop out of the lambda. This results in similar bytecode for the loop and same behaviour. I then made the limit of the loop a parameter of the method, this time it behaved like the javac version, but had the same funky loop structure:

WTF? if this is a purely a bytecode subtlety then it's seems pretty weird. What sort of flaky compiler would swing like this?

The compiler did it?

But... which compiler? so we looked at the above bytecode compiler results, this is but the first of many compilers to process your code on it's way to being executed. Traditionally when discussing Java code compilation we look at the following stages:
  1. Javac: From code to bytecode
  2. Interpreter: Step through the bytecode
  3. C1/Client: Short compilation time, does not generate the best optimized code. Intended for short lived applications.
  4. C2/Server: Longer compilation time, better optimized code. Intended for long running applications.
And the compilation timeline would look something like this:
                                                   /->C1
*.java->javac->*.class->interpreter->class+profile<
                                                   \->C2
The situation with OpenJDK8 is more complex with Tiered Compilation allowing us a smooth transition from interpreter to C2 compilation via C1 compilations. 
Given the behaviour we are after was supposed to manifest after a C1/C2 compilation we can run the code above to find which compiler compiled the method in question by adding the flag: -XX:+PrintCompilation:
    159  201 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)
    159  202       3       safepoint.hang.WhenWillItExitInt2::countOdds (44 bytes)
    159  203 %     4       safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)
    162  201 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes)   made not entrant
    142  200 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes)   made not entrant
   1896  202 %     4       safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes)   made not entrant
   1896  203 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)
   1896  204 %     4       safepoint.hang.WhenWillItExitInt2::countOdds @ 13 (44 bytes)
   1899  203 %     3       safepoint.hang.WhenWillItExitInt2::countOdds @ -2 (44 bytes)   made not entrant
The 3/4 printouts on the 4th column above indicate the method is compiled by C1 (profiled) and C2. But what is that % about? This is an indicator that the compilations are not 'normal' compilations but rather On Stack Replacement(OSR) compilations. OSR compilations are 'special', from the OpenJDK glossary:
on-stack replacement
Also known as 'OSR'. The process of converting an interpreted (or less optimized) stack frame into a compiled (or more optimized) stack frame. This happens when the interpreter discovers that a method is looping, requests the compiler to generate a special nmethod with an entry point somewhere in the loop (specifically, at a backward branch), and transfers control to that nmethod. A rough inverse to deoptimization.
I like the "rough inverse to deoptimization", it's a great mental qualification helper. OSR is the opposite of deoptimization because it is a mid-method replacement of code with compiled code, whereas a deoptimization is a mid-method replacement of code with interpreted code. Normal compilation unit replacements happen out of method, so while the method is off stack. OSR is specifically done to replace long running loops, like our countOdds method. As we can see from the printout above both C1 and C2 can produce OSR compilations.
It is impossible to disable OSR directly in a product build (there's a developer option -XX:-CICompileOSR), but we can force upfront compilation by using the following flags: "-XX:-TieredCompilation -server/client -Xcomp". Running with these arguments all 3 variations (original, refactored loop, refactored parametrized loop) now hang both for the Eclipse and the javac generated bytecode.
In both cases where the Eclipse versions hang we can see OSR compilations of the relevant code.
So it seems:
  • C1/C2 non-OSR compilations of the loop treat it as a counted loop, thus eliminating the safepoint poll and demonstrating the hung behaviour due to huge TTSP.
  • C1/C2 OSR compilations of the same code result in different results depending on the generated bytecode for the loop. I'm not going to dig further into the reasons, but I would welcome any feedback from OpenJDK compiler guys.
The OSR compilation decision to treat the loop as non-counted is quite reasonable. If we have a loop which remains on stack for very long periods we should perhaps consider it uncounted.

Killer Task

I started down this path looking for valid Java code which will result in a hung JVM due to huge TTSP. Since OSR got in the way, can we build such a task? Sure we can, we just need to warmup the loop until it is C2 compiled and then run it with a large limit, like so:
This fucker now hangs in both the Eclipse and javac versions. Joy!

Side note: OSR and JMH

I look at benchmarks people write quite a bit, and it is very common to see people rolling their own benchmark harnesses put their "code under measurement" in a counted loop, run it 100,000 times and derive their results. Often that means the code they are measuring is OSR compiled code, which is different (for better or worse, mostly worse) from the code they will be running normally. If you are one of these people, I beg you: cut that fucking shit out!
JUST USE JMH!
PLEASE?
To a large extent it will cover your use case. Where it doesn't cover your use case, post a question on jmh-dev mailing list, it might be that you've missed something. If it turns out JMH cannot cover your usecase, make sure you've read the JMH samples before rolling your own harness.
I'm not going to include a before and after example, because it just makes me sad. Look at SO for some fine examples.

Counted Loops?

I keep saying 'counted loops' like it means something, so here are some examples for you to play with and contemplate:

So, as you can see the definition ain't so clear and I would imagine some of the above behaviour will vary depending on JVM version/vendor/phase of the moon etc.

Summary

So, it seems my previous example was not as tested as it should have been. I apologize to anyone who has based life changing decisions on that sample. I hope this post helps in explaining away the subtleties which led to my downfall.
The above examples help demonstrate that the line between counted and uncounted loops is a tad confusing and the classification may depend on differences in bytecode, compiler mode, and so on.

10 comments:

  1. I filed https://bugs.eclipse.org/bugs/show_bug.cgi?id=486926 with the link to this post, in case there is merit in looking at reorganising Eclipse's compiler (which is used in a number of other products such as Tomcat's JSP compiler)

    ReplyDelete
    Replies
    1. Thanks.
      I read the bug. IMO this is not an issue for Eclipse, it's just an OSR compilation quirk. The Eclipse .class file is valid it's not clear to me that it's not being handled more correctly then the javac .class. It's just wierd really.

      Delete
    2. brilliant investigative work :)

      Delete
  2. There is old Cliff Click post (http://www.cliffc.org/blog/2011/11/22/what-the-heck-is-osr-and-why-is-it-bad-or-good/) about OCR and it's interaction with JIT loop optimization heuristics. Probably you've already read it, but if not -- it may worth it.

    ReplyDelete
    Replies
    1. Very good article, I know it.
      I did a lousy job on the cross references here.
      Needs more links...

      Delete
  3. It has passed a long time but there is a very besic thing I'm not sure
    I have understood of the JIT mechanics here:
    - Thread::run contains a call to countOdds that will be executed by the interpreter, given that Thread::run isn't compiled yet.
    - countOdds is forbidden by policy to be inlined (ie the caller is not compiled)
    - given that countOdds has been already compiled C2/Level 4, invokestatic will call the most optimized version of it without dropping the {poll_return} on method exit

    Sorry for the step-by-step explanation, but it is correct? Am I missing anything?

    ReplyDelete
    Replies
    1. Why do you think "countOdds is forbidden by policy to be inlined (ie the caller is not compiled)" is a relevant piece of information? Also, how is "invokestatic will call the most optimized version of it without dropping the {poll_return} on method exit" relevant?

      We have enough profiling data to compile the countOdds method, we hope it's been compiled before we call it from Thread::run, where the parameter we give it means it will never return, so the safepoint poll at the end of the method is not a concern. We are counting on the removal of the counted loop safepoints.

      The important thing here is that countOdds

      Delete
    2. Sorry Nitsan, I forgot to remove this comment as I've moved forward and understood the mechanics here, but thanks for the answer anyway!

      Delete
  4. Hi,

    First of all, awesome post!!!

    I was trying to run your example and the result is the expected but I was not able to understand why. Based on my understanding the JVM will died when all user threads finishes so my assumption was ok after sleeping 5 seconds the main thread will finish but the jvm will not died because the counted loop does not reach a safepoint so it cannot validate if all the user threads has finished (That was my assumption) but after running the same code and adding System.out.println("Before Sleep"); Thread.sleep(5000); System.out.println("After Sleep"); The after sleep is never printed so my new assumption is that the main thread is timed_waiting but after wait for 5 seconds it can be moved to runnable queues but maybe this movement is done as vm operation? is that correct? or could you help me to understand this. Thanks in advance

    ReplyDelete
    Replies
    1. There are many VMOperations triggers, see the post on safepoints for details and instructions on how to get the JVM to print out safepoint details. The main thread is never coming back from sleeping because the JVM is waiting for a VMOperation to start, and there's no point unsuspending Java threads before it's finished.

      Delete

Note: only a member of this blog may post a comment.