Tuesday 25 February 2014

Unsafe Pointer Chasing: Running With Scissors

Love running? Love scissors? I know just the thing for you! Following on from a recent discussion on the Mechanical Sympathy mailing list, I see an anti-pattern worth correcting in the way people use Unsafe. I say correcting, as I doubt people are going to stop, so they might as well be made aware of the pitfalls. This pattern boils down to a classic concurrency bug:

Q: "But... I not be doing no concurrency or nuffin' guv"
A: Using Unsafe to gain a view of on-heap addresses is concurrent access by definition.

Unsafe address: What is it good for?

Absolutely nothing! Say it again, huh! I exaggerate; if it were good for nothing it would not be there. Let's look at the friggin' manual (quoted from the sun.misc.Unsafe Javadoc of the era, trimmed to the relevant methods):
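    /**
     * Fetches a native pointer from a given memory address. If the address is
     * zero, or does not point into a block obtained from allocateMemory, the
     * results are undefined.
     */
    public native long getAddress(long address);

    /**
     * Stores a native pointer into a given memory address. If the address is
     * zero, or does not point into a block obtained from allocateMemory, the
     * results are undefined.
     */
    public native void putAddress(long address, long x);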
As we can see, the behaviour is only defined if we use the methods together; by that I mean that get/putAddress are only useful when used with an address that is within a block of memory allocated by allocateMemory. Now, undefined is an important word here. It means it might work some of the time... or it might not... or it might crash your VM. Let's think about this.

Q: What type of addresses are produced by allocateMemory?
A: Off-Heap memory addresses -> unmanaged memory, not touched by GC or any other JVM processes

Off-heap addresses are stable from the VM's point of view: it has no intention of running around changing them. Once allocated, they are all yours to manage, and whether or not you cut your fingers in the process is completely in your control; this is why the behaviour is defined. On-heap addresses, on the other hand, are a different story.
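To make the defined usage concrete, here's a minimal, self-contained sketch (class and variable names are mine; the reflective grab of theUnsafe is the usual back-door, and the later snippets in this post assume an UNSAFE instance acquired the same way):

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class DefinedUsage {
        static final Unsafe UNSAFE = acquireUnsafe();

        static Unsafe acquireUnsafe() {
            try {
                // the usual reflective back-door to the Unsafe instance
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                return (Unsafe) f.get(null);
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public static void main(String[] args) {
            // defined behaviour: the addresses all come from allocateMemory
            long block = UNSAFE.allocateMemory(16);
            UNSAFE.putAddress(block, block + 8);      // store a native pointer
            long pointer = UNSAFE.getAddress(block);  // read it back
            System.out.println(pointer == block + 8); // true
            UNSAFE.freeMemory(block);                 // ours to manage, ours to free
        }
    }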

Playing With Fire: Converting An Object Ref to An Address

So imagine you just had to know the actual memory address of a given instance... perhaps you just can't resist a good dig under the hood, or maybe you are concerned about memory layout... Here's how you'd go about it:
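What follows is a sketch in the spirit of the original rather than the exact code; it assumes the UNSAFE instance from the snippet above, and note that in the 4-byte case a compressed-oops JVM hands you a compressed reference that needs further decoding:

    // put the object into an Object[] and read the reference slot back out
    // as a plain number - presto, an 'address'
    static long toAddress(Object o) {
        Object[] array = new Object[]{ o };
        long baseOffset = UNSAFE.arrayBaseOffset(Object[].class);
        switch (UNSAFE.arrayIndexScale(Object[].class)) {
            case 4:
                // 32-bit references, or compressed oops on a 64-bit JVM
                return UNSAFE.getInt(array, baseOffset) & 0xFFFFFFFFL;
            case 8:
                return UNSAFE.getLong(array, baseOffset);
            default:
                throw new Error("unsupported reference size");
        }
    }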
Now... you'll notice the object ref needs a bit of cuddling to turn into an address. Did I come up with such devilishly clever code myself? No... I will divulge a pro-tip here:
If you are going to scratch around the underbelly of the JVM, learn from as close to the JVM as you can -> from the JDK classes, or failing that, from an OpenJDK project like JOL (another Shipilev production)
In fact, the above code could be re-written to:
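Something along these lines, assuming JOL is on the classpath (if memory serves, the helper lived in org.openjdk.jol.util.VMSupport back then, and it also gets the compressed-oops decoding right):

    // let JOL do the cuddling
    long address = org.openjdk.jol.util.VMSupport.addressOf(object);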
Now that we have the address, what can we do with it? Could we use it to copy the object? Maybe we could read or modify the object state? NO! We can but admire its numerical beauty and muse on the temperamental values waiting at the other end of that address. The value at the other end may have already been moved by GC...

Key Point: On-Heap Addresses Are NOT Stable

Consider the fact that at any time your code may be paused while the whole heap is moved around... any address value you had which pointed into the heap is now pointing to a location holding data which may be trashed/outdated/wrong, and using that data will lead to a funky result indeed. Also consider that this applies to class metadata and any other internal accounting managed by the JVM.
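This is easy to observe for yourself; a sketch using the toAddress helper from above (whether any particular run actually moves the object depends on the collector, so treat this as illustrative):

    Object o = new Object();
    long before = toAddress(o);
    System.gc();                // a moving collector is likely to relocate o
    long after = toAddress(o);
    // before and after will typically differ, and whatever now lives at
    // the old address is anyone's guess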
If you are keen to use Unsafe on the heap, use object references, not addresses. I would also urge you not to mix the two (i.e. hold object references to off-heap memory), as that can easily lead to a very confused GC chasing references into the unknown and crashing your VM.

Case Study: SizeOf an Object (Don't do this)

This dazzling fit of hackery cropped up first (to my knowledge) here on the HighScalability blog:
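Reconstructed from memory rather than quoted, it boiled down to something like this (the header/Klass offsets are the 32-bit HotSpot ones of the era):

    // 32-bit flavoured sizeOf: read the native Klass address out of the
    // object header, then read the layout helper (instance size) from it
    public static long sizeOf(Object object) {
        // the Klass pointer sits at offset 4 in the header on a 32-bit VM
        long klassAddress = UNSAFE.getInt(object, 4L) & 0xFFFFFFFFL;
        // chase the raw native pointer to the layout helper field
        return UNSAFE.getAddress(klassAddress + 12L);
    }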
This is some sweet machete-swinging action :-). The dude who wrote this is not suggesting it is safe, and only claims it is correct on a 32-bit VM. And indeed, it can work and passes cursory examination. The author also states, correctly, that this will not work for arrays and that with some corrections it can be made to work on 64-bit JVMs as well. I'm not going to try and fix it for 64-bit JVMs, though most of the work is already done in the JOL code above. The one flaw in this code that cannot be reliably fixed is that it relies on the native Klass address (read out of the object header) remaining valid long enough to chase the pointer through and read the layout helper. Spot the similarity to the concurrency bug discussed above?
This same post demonstrates how to forge references from on-heap objects to off-heap 'objects', which in effect lets you cast a raw native pointer to an object reference. It goes on to state that this is a BAD IDEA, and indeed it can easily crash your VM when GC comes a-knocking (but it might not; I didn't try).

Case Study: Shallow Off-Heap Object Copy (Don't do this)

Consider the following method of making an off-heap copy of an object (from here, Mishadof's blog):
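The method went roughly like this (a sketch, not a verbatim quote; toAddress and sizeOf are the helpers above, and fromAddress is the reverse trick of writing a raw address into an Object[] slot and reading the reference back out):

    // shallow off-heap copy: grab the on-heap address, copy the object's
    // bytes off-heap, then forge a reference to the off-heap bytes
    static Object shallowCopy(Object obj) {
        long size = sizeOf(obj);               // see the case study above
        long start = toAddress(obj);           // on-heap address: unstable!
        long copy = UNSAFE.allocateMemory(size);
        UNSAFE.copyMemory(start, copy, size);  // address-based copy, racing the GC
        return fromAddress(copy);              // forged reference to the off-heap 'object'
    }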
We see the above is using the exact same method for computing size as demonstrated above. It gets the on-heap object address (of limited correctness, see the addresses discussion above), then copies the object off-heap and reads it back as a new object copy... Calling Unsafe.copyMemory(srcAddress, destAddress, length) invites the same concurrency bug discussed above. A similar method is demonstrated in the HighScalability post, but there the copy method used is Unsafe.copyMemory(srcRef, srcOffset, destRef, destOffset, length). This is important, as the reference-based method is not exposed to the same concurrency issue.
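For contrast, the reference-based variant looks something like this sketch: with srcBase set to the object itself the GC sees a proper reference for the duration of the copy, and a null destBase means destOffset is treated as an absolute address:

    // copy the object's bytes off-heap without ever materialising an
    // on-heap address: the source is (object, offset), not a raw pointer
    long size = sizeOf(obj);  // still only as correct as sizeOf itself...
    long copy = UNSAFE.allocateMemory(size);
    UNSAFE.copyMemory(obj, 0L, null, copy, size);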
Both are playing with fire, of course, by converting off-heap memory to objects. Imagine this scenario:
  • a copy of object A, which refers to another object B, is made; the copy is presented as object C
  • all references to object A are dropped, leading to A and B being collected in the next GC cycle
  • object C is still storing a stale reference to B, which is no longer managed by the VM
What will happen if we read that stale reference? I've seen the VM crash in similar cases, but it might just give you back some garbage values, or let you silently corrupt some other instance's state... oh, the fun you will have chasing that bugger down...

Apologies

I don't mean to present either of the above post authors as fools; they are certainly clever and have presented interesting findings for their readers to contemplate, without pretending their readers should run along and build on their samples. I have personally commented on some of the code in Mishadof's post, and admit my comments were incomplete in identifying the issues discussed above. If anything, I aim to highlight that this hidden concurrency aspect can catch out even the clever.
Finally, I would be a hypocrite if I told people not to use Unsafe; I end up using it myself for all sorts of things. But, as Mr. Maker keeps telling us, "Be careful, because scissors are sharp!"

Thursday 13 February 2014

When I say final, I mean FINAL!

{This post is part of a long-running series on lock-free queues, check out the full index to get more context here}
Having recently bitched about the lack of treatment of final fields as final, I was urged by Mr. Shipilev to demonstrate the issue in a more structured way (as opposed to a drunken slurred rant), and I have now recovered my senses enough to do just that. The benchmark being run and the queue being discussed are covered in this post, so please refresh your memory for context if you need to. The point is clear enough without a full understanding of the context, though.
It is perhaps a fact well known to those who know it well that final fields, while providing memory visibility guarantees, are not actually immutable. One can always use reflection, or Unsafe, to store new values into those fields, and in fact many people do (and Cliff Click hates them and wishes them many nasty things). This is (I believe) the reason behind some seemingly trivial optimizations not being done by the JIT compiler.

Code Under Test: FFBufferWithOfferBatch.poll()

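A sketch of the method under discussion (reconstructed, so helper names are approximate, but the shape matches the queue from the earlier post):

    @SuppressWarnings("unchecked")
    public E poll() {
        final long offset = offsetForIndex(head);
        final E e = (E) UNSAFE.getObjectVolatile(buffer, offset); // 1st read of buffer
        if (null == e) {
            return null;
        }
        UNSAFE.putOrderedObject(buffer, offset, null);            // 2nd read of buffer
        head++;
        return e;
    }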
The buffer field is a final field of FFBufferWithOfferBatch and is accessed twice in the method above. A trivial optimization on the JIT compiler's side would be to load it once into a register and reuse the value; it is 'immutable' after all. But if we look at the generated assembly (here's how to; I also took the opportunity to try out JITWatch, which is brilliant), that is not what happens.
We can see buffer getting loaded twice in the assembly. Why doesn't the JIT do the optimization? I'm not sure... it may be due to the volatile load forcing a load order that could in theory require the 'new' value in buffer to be made visible... I don't know.

Hack around it, see if it makes a difference

Is that a big deal? Let's find out. The fix is trivial:
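The same method with the final field copied into a local once (again a sketch along the lines of the actual fix):

    @SuppressWarnings("unchecked")
    public E poll() {
        final E[] lb = buffer; // load the 'final' field once, reuse below
        final long offset = offsetForIndex(head);
        final E e = (E) UNSAFE.getObjectVolatile(lb, offset);
        if (null == e) {
            return null;
        }
        UNSAFE.putOrderedObject(lb, offset, null);
        head++;
        return e;
    }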
And the generated assembly now demonstrates the right behaviour: a single load of buffer, reused for both accesses.
Now, was that so hard to do? And more importantly, does it make any difference to performance? As discussed previously, the throughput benchmark is sensitive to changes in the cost balance between offer/poll. The optimization creates an interesting change in the pattern of the results:
The benchmark was run on Ubuntu 13.10/JDK7u45/i7@2.4GHz; the X axis is the index of the benchmark run and the Y axis is the result in ops/sec. The chart displays the results from before the change (B-*) and after (A-*), with different sparse data settings. We can see the change has accelerated the consumer, leading to an increased benefit from sparse data that was not visible before. With sparse data set to 1, the optimization results in a 2% increase in performance. Not mind-blowing, but still. The same change applied to the producer thread loop (localizing the reference to the queue field), discussed in the previous post, enabled a 10% difference in performance, as the field reference stopped the loop from unrolling and was read on each iteration. I used the poll() example here because it involves a lot less assembly code to wade through.

Hopefully this illustrates the issue to Mr. Shipilev's satisfaction. Thanks go to Gil Tene for pointing out the optimization to me, and to Chris Newland for JITWatch.