A cautionary tale on thread locals, pooling and reflection

Most non-trivial Java applications use thread pools to improve performance by avoiding thread creation and, optionally, by limiting the maximum number of threads in use.

The end goal is for each task to execute in a dedicated thread, without being aware that the thread is not a fresh one but has been used before. Starting with Java 5, the Java runtime libraries offer a wealth of options for creating thread pools in the java.util.concurrent package.
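
For example, a fixed-size pool takes only a couple of lines to set up; the pool size and the task below are made up purely for illustration:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolExample {
  public static void main(String[] args) {
    // a pool that reuses at most four threads for all submitted tasks
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (int i = 0; i < 10; i++) {
      final int taskId = i;
      pool.submit(() -> System.out.println("task " + taskId
          + " ran on " + Thread.currentThread().getName()));
    }
    pool.shutdown();
  }
}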

So far, all is well – we have a simple, robust way of adding thread pooling to our application. That is, unless we are running code which is out of our control and misbehaves. One particularly subtle kind of misbehaviour is using ThreadLocal storage and never cleaning it up. If you’re not familiar with thread-local storage in Java, it allows each thread to store its own value for a given variable. The example below comes straight from the javadoc.

import java.util.concurrent.atomic.AtomicInteger;

public class ThreadId {
  // Atomic integer containing the next thread ID to be assigned
  private static final AtomicInteger nextId = new AtomicInteger(0);

  // Thread local variable containing each thread's ID
  private static final ThreadLocal<Integer> threadId =
    new ThreadLocal<Integer>() {
      @Override protected Integer initialValue() {
        return nextId.getAndIncrement();
      }
    };

  // Returns the current thread's unique ID, assigning it if necessary
  public static int get() {
    return threadId.get();
  }
}

This kind of usage is fine when each task runs on a fresh thread, but when threads are reused as part of a thread pool, all thread-local storage must be cleaned up. Failing to do so has the potential to trigger hard-to-find memory leaks. A quick Google search for thread local memory leak brings up enough examples to make us wary of leaving dangling ThreadLocal instances behind.

For application code, it’s critical to use ThreadLocal#remove when a value is no longer needed, to make sure there are no leaks.
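
One common way to guarantee the cleanup is to pair every set with a remove in a finally block; the class and field names below are purely illustrative:

public class RequestContext {

  private static final ThreadLocal<String> CURRENT_USER = new ThreadLocal<>();

  public void handle(String user, Runnable work) {
    CURRENT_USER.set(user);
    try {
      work.run();
    } finally {
      // always clear the value, even if the task throws, so a pooled
      // thread never carries it over into the next task
      CURRENT_USER.remove();
    }
  }
}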

For infrastructure code providing the thread pool, an added safety net is the ability to warn about and recover from misbehaving applications.

Apache Tomcat does so by logging warnings when applications leave behind ThreadLocal values, and documents this extensively on its MemoryLeakProtection wiki page.

In Apache Sling, as a web application framework, we aim to be as robust as possible by preventing any memory leaks due to leftover thread-local storage. To that end, we recycled threads by throwing an exception from the thread pool’s afterExecute method when a thread was too old. You can see the details in our ThreadExpiringThreadPool, but the gist of it is

public class ExpiringThreadPool extends ThreadPoolExecutor {
  @Override
  protected void afterExecute(Runnable r, Throwable t) {
    super.afterExecute(r, t);
    // afterExecute runs on the worker thread itself; once that thread has
    // lived long enough, throwing forces the pool to replace it
    if ( isExpired(Thread.currentThread()) ) {
      throw new RuntimeException("Expired");
    }
  }
}

There are some tricks that you can use to make sure the exception is not logged, but the details are in the Sling code linked above.

However, we found this was not good enough, for two reasons:

  1. By throwing away threads we partially negated the performance advantages of thread pools
  2. It made debugging harder, as the Eclipse debugger halted whenever such an exception was thrown

So we set out to find a better way of ensuring that thread-local storage is cleaned up between task executions. We found some ideas in the JavaSpecialists article on cleaning thread locals, which presented a reflection-based solution. The article shows how you can access the references stored in a ThreadLocalMap and save/restore them whenever a task starts and ends execution. This way, each task starts with a fresh set of values, while the values the thread received when it was created are preserved. A simplified version of the code is

public class CleaningThreadPool extends ThreadPoolExecutor {

  @Override
  protected void beforeExecute(Thread t, Runnable r) {
    super.beforeExecute(t, r);

    // remember the entries the thread holds before the task runs
    Object threadLocals = threadLocalsField.get(t);
    if (threadLocals == null)
      return;
    Reference[] table = (Reference[]) tableField.get(threadLocals);
    save(t, table);
  }

  @Override
  protected void afterExecute(Runnable r, Throwable t) {

    // here 't' is the task's Throwable, so look up the current thread explicitly,
    // then put back the table saved in beforeExecute, dropping whatever the task left behind
    Thread current = Thread.currentThread();
    Reference[] table = load(current);
    tableField.set(threadLocalsField.get(current), table);

    super.afterExecute(r, t);
  }
}
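
For completeness, the two Field handles used above have to be obtained reflectively. The exact setup lives in the Sling sources linked at the end of this post; a minimal sketch (class name and error handling are ours, and it assumes the JDK internals are accessible, i.e. no SecurityManager and, on newer JDKs, an --add-opens for java.base/java.lang) could look like this:

import java.lang.reflect.Field;

// illustrative helper, not the actual Sling class
final class ThreadLocalFields {

  static final Field threadLocalsField;
  static final Field tableField;

  static {
    try {
      // package-private field Thread.threadLocals, of type ThreadLocal.ThreadLocalMap
      threadLocalsField = Thread.class.getDeclaredField("threadLocals");
      threadLocalsField.setAccessible(true);

      // private field ThreadLocalMap.table, the backing Entry[] array
      Class<?> mapClass = Class.forName("java.lang.ThreadLocal$ThreadLocalMap");
      tableField = mapClass.getDeclaredField("table");
      tableField.setAccessible(true);
    } catch (ReflectiveOperationException e) {
      throw new ExceptionInInitializerError(e);
    }
  }

  private ThreadLocalFields() {
  }
}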

We pushed the cleaning thread pool code, but discovered that under certain conditions application threads would lock up, with a very implausible stack trace:

"pool-1-thread-1" prio=10 tid=0x00007f47f4374800 nid=0xb19 runnable [0x00007f47e734e000]
 java.lang.Thread.State: RUNNABLE
   at java.lang.ThreadLocal$ThreadLocalMap.set(ThreadLocal.java:429)
   at java.lang.ThreadLocal$ThreadLocalMap.access$100(ThreadLocal.java:261)
   at java.lang.ThreadLocal.set(ThreadLocal.java:183)
   at org.apache.sling.commons.threads.impl.ThreadPoolExecutorCleaningThreadLocalsTest$RunnableImplementation.run(ThreadPoolExecutorCleaningThreadLocalsTest.java:100)
   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)

An infinite loop in the JDK’s thread-local storage is quite implausible, and besides, others would have reported it already. Everything pointed to our implementation, but we had not yet found the fault. To do so, we needed to dig into the ThreadLocal storage implementation.

In the Java standard libraries I’ve studied, thread-local storage is implemented by the ThreadLocalMap inner class. This map is the one we’re altering using reflection, and it is backed by an array of Entry objects named table. Whenever an entry is added or removed, the table array is altered. However, what we missed is that two more fields are linked to the table – size and threshold. size holds the number of entries – the non-null values in the table – and threshold is the size value that will trigger the next resize.
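
For reference, here is a heavily trimmed paraphrase of the relevant ThreadLocalMap internals as they appear in the OpenJDK 7/8 sources (comments ours):

// java.lang.ThreadLocal.ThreadLocalMap, paraphrased and trimmed
static class ThreadLocalMap {

  // each entry weakly references its ThreadLocal key and holds the value
  static class Entry extends WeakReference<ThreadLocal<?>> {
    Object value;

    Entry(ThreadLocal<?> k, Object v) {
      super(k);
      value = v;
    }
  }

  private static final int INITIAL_CAPACITY = 16;

  private Entry[] table;   // the array we swapped via reflection
  private int size = 0;    // number of live entries in the table
  private int threshold;   // the size at which the table is resized
}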

The flaw in our implementation was that we ignored these fields and only saved/restored the table array. This breaks down in the following scenario:

  1. Thread A is created
  2. Task 1 is executed, sets thread-local values but does not clean them up. It adds more values than the table’s threshold allows, forcing the map to resize; when the task finishes we restore the old, smaller table, but size and threshold keep the values belonging to the resized one
  3. Task 2 is executed on the same thread and attempts to set new thread-local values. The small table eventually fills up completely (no null entries left), but since size stays below the stale threshold no resize is ever triggered, and the setter method loops infinitely trying to find an empty slot (see the probe loop sketched below)
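
The infinite loop becomes clearer when looking at the setter itself. Below is a paraphrased, trimmed version of ThreadLocalMap.set from the OpenJDK 7/8 sources; the probe loop only exits on a matching key, a stale entry or a null slot, and the resize check only runs after a free slot has been found:

// paraphrased from java.lang.ThreadLocal.ThreadLocalMap#set (OpenJDK 7/8)
private void set(ThreadLocal<?> key, Object value) {
  Entry[] tab = table;
  int len = tab.length;
  int i = key.threadLocalHashCode & (len - 1);

  // linear probing: with a table completely full of live entries,
  // none of the exit conditions below is ever met
  for (Entry e = tab[i]; e != null; e = tab[i = nextIndex(i, len)]) {
    ThreadLocal<?> k = e.get();
    if (k == key) {        // same key: overwrite and return
      e.value = value;
      return;
    }
    if (k == null) {       // stale entry: replace it and return
      replaceStaleEntry(key, value, i);
      return;
    }
  }

  tab[i] = new Entry(key, value);
  int sz = ++size;
  // a resize can only be triggered here, after a free slot was found
  if (!cleanSomeSlots(i, sz) && sz >= threshold)
    rehash();
}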

Once we realized the issue we fixed it quickly and added a test for it (link to the commit for the curious), but discovering it was quite a trip.

With all this behind us, there are some personal takeaways for me:

  1. Keep away from thread-local storage. There is almost always a better solution, and thread-locals should be a last resort. If you do use them, make sure you clean them up when done.
  2. Accessing private data via reflection is tricky and, if not done carefully, can lead to broken invariants and seemingly impossible situations.

As final references, our thread pool implementation can be found on GitHub: ThreadPoolExecutorCleaningThreadLocals, and more information about Sling is on the Apache Sling website.