Lock Free Queues Index

I had a long running series of posts about lock free queues, benchmarks run, optimizations made and all around concurrency fun. Most of the posts are stand alone, but often back link to each other and some, on occasion, assume prior knowledge of posts on the topic. This page aims to help you find your way through the posts and offer some clear trails to follow.
NOTE: The posts include code for the queues but if you are just looking for working implementations to use you should look at JCTools where the productized versions ended up.
Legend:

  • SPSC - Single Producer Single Consumer, similarly MPMC is Multi Producer Multi Consumer. These refer to the concurrency access supported on the producer/consumer sides of the queue.
  • JMH - The Java Micro-benchmarking Harness. There's a separate reference page on this topic.

I want to read in chronological order:

  • Single Producer/Consumer lock free Queue step by step - This post covers the some of the work done by Martin Thompson towards an SPSC queue and presents a step by step breakdown of the optimizations. Some of the optimizations/techniques used are discussed in more detail separately:
  • 135 Million messages a second between processes in pure Java - I went on to port the same SPSC algorithm outlined in the first post on top of off-heap memory and building a proof of concept inter-process communication channel. The post covers the data structure, some background and benchmark results. Some preliminary work done to verify/enable this method is covered separately:
  • SPSC Revisited - part I: An empiricist tale - Looking more closely at run to run variance in the benchmark results of the SPSC queue from the first post. Using class hierarchy to pad counters on either side (original implementation only padded one side), and 'inlining' them into the queue data structure. The class itself is further padded and so is the data. I then go on to run the different queue variants with several JVM configurations to observe some variance still remaining.
  • SPSC Revisited - part II: Floating vs Inlined Counters - This post took a closer look at the implications of the counters inlining step discussed in part I. The implementations with the various JVM settings were tested in particular demonstrated the utility of -XX:+UseCondCardMark in the context of false sharing. In the comments to this post a reader shared their Fast Flow Java port with me, which became the basis for the next post.
  • SPSC Revisited - part III: Fast Flow + Sparse Data - Looking at the Fast Flow algorithm for SPSC, the post walks through correcting the concurrency issues of the original port by introducing volatile/lazySet access to queue elements. The post also considers the near empty/full states of the queue and how by using sparse data we can reduce the 'density' of the contended cache line. A further false sharing aspect is touched on where pre-fetching can increase the scope of false sharing from 1 cache line to 2. The reason the original port did not have proper barriers in place was down to it's developer falling for a common misinterpretation of the JMM cookbook. I had a little rant on that separately.
  • A lock free MPSC - Exploring the transition from SPSC to MPSC and comparing throughput with JDK alternatives. Note that the implementation discussed does not conform to the Queue.poll() interface correctly as poll() can return null even if the queue is not empty. This particular aspect of the Queue interface is ranted on separately in another post. The notion of CAS failure backoff is discussed and several strategies compared.
  • SPSC Revisited - part IV: BQueue - Reading through and implementing a Fast Flow variant which introduces some new optimizations: probing ahead of the producer (which turned out to be a great idea) and consumer backoff as it approaches empty queue state (not an acceptable solution as it increases latency). The comments include a long Q&A comparing the queues with the Disruptor. The post touches again on the near empty/full queue behaviour as an important focus and refers to a new set of benchmarks based on the original which play around with the startup order.
  • Announcing JAQ - At this point I decided to put the implementations into a releasable form, this project ended up being JCTools.
  • Using JMH to benchmark SPSC Queue Latency - part I - Writing a round trip latency benchmark in JMH. The post deals mostly with JMH concurrency details and benchmarking methodology. There is a detailed breakdown of round trip timing and how the benchmark works. We then have a look at numbers and analysis. This post refers to the implementation which will end up the final SPSC queue for JCTools
  • Using JMH to benchmark SPSC Queue Latency - part II - This post takes a closer look at the results obtained from the latency benchmark in several configurations for several implementations.
  • SPSC Queue Champion: benchmarking woes - We look at the final design arrived at for SPSC queues which is building on a combination of Fast Flow with the producer probing technique from BQueue and sparse data. This post goes more into issues with the original benchmark both in it's implementation and it's concept:
    • Time measured includes threads creation and startup/shutdown
    • Not loading the queue reference into a variable results in the field getting reloaded on each iteration of the producer loop. This was further explained in a post on the lack of finality of final fields, and the surprising side effects of volatile loads.
    • The benchmark use case under test depends on the balance between the offer/poll methods. The slightest imbalance can lead to a test focusing on the near full/empty case. A perfectly balanced queue (which is hypothetical) will get streaming like behaviour in this benchmark.
  • Notes on Concurrent Ring Buffer Mechanics - This is a wider view on the concurrency  related intricacies of queue implementation. The post examines the changes to implementation and internal invariants as we move from SPSC to SPMC/MPSC to MPMC queues. There are no benchmarks, but plenty of code. This post is more high level, less battle scars.
  • Notes on False Sharing - A walk through the different instances of false sharing encountered on the queue implementation trail. This post is more high level, the healing continues.
  • MPMC vs CLQ - Comparing throughput and latency results for the JCTools MPMC and JDK CLQ. This post walks through 3 types on benchmarks and their results.
  • Porting Pitfalls: Turning D.Vyukov MPSC Wait-free queue into a j.u.Queue - This post looks at the potential complications in adapting D. Vyukov MPSC linked queue to the j.u.Queue interface with a detailed look at the poll method (some overlap with Poll me, maybe?) and the size method.

I want to implement a concurrent queue:

I want to laugh at your benchmarking pains:


No comments:

Post a Comment

Note: only a member of this blog may post a comment.