Introduction:
In this post, I am going to walk through an interesting investigation our team did into MySQL and OpenSSL. We discovered that OpenSSL 1.0.2g suffers from lock contention under highly parallel workloads. If you enjoy systems-level problems, this post is for you.
I will journal the main struggles of our investigation and share some personal findings I made along the way.
Background:
One of our performance benchmark suites fails when we turn on TLS to the MySQL database. The logs showed that there were many network timeouts.
Interestingly enough, if we switched to PostgreSQL, the issue went away. If we compiled MySQL with yaSSL (MySQL’s bundled SSL library), the issue also went away.
Our team initially suspected the TLS handshake, but switching to weaker cipher suites had no impact. Long story short, something about opening new connections with OpenSSL was causing network timeouts in the benchmarking suite.
Context:
To understand what the CPU on the MySQL host was doing while the benchmark suite was running, we used perf to generate a FlameGraph. 1
The FlameGraph with TLS on:
The FlameGraph with TLS off:
With TLS turned on, we noticed that roughly 1⁄3 of the time was spent waiting in OpenSSL. From what we can tell from the FlameGraph, one function in particular stood out: CRYPTO_add_lock. For the rest of this post, CRYPTO_add_lock will be referred to as “the locking function”.
Naturally, we began searching the PXC code for references to CRYPTO_add_lock. (PXC is a Percona distribution of MySQL that we use.) We couldn’t find any direct references, and we were confused as to how it was being called.
The following is an explanation of how PXC and OpenSSL interact with respect to the locking function. Along the way, we also learned that the locking function is non-atomic, which (spoiler alert) ends up being the root cause.
Journey of CRYPTO_add_lock:
Let’s start with what we know:
- CRYPTO_add_lock is a part of OpenSSL.
- PXC does not reference it directly.
- We compile PXC to use OpenSSL.
Let’s try to trace how CRYPTO_add_lock is invoked in OpenSSL. Here’s CRYPTO_add_lock itself (stripped of a bunch of unimportant macros), from OpenSSL 1.0.2g:
int CRYPTO_add_lock(int *pointer, int amount, int type, const char *file,
                    int line)
{
    int ret = 0;

    if (add_lock_callback != NULL) {
        ret = add_lock_callback(pointer, amount, type, file, line);
    } else {
        CRYPTO_lock(CRYPTO_LOCK | CRYPTO_WRITE, type, file, line);
        ret = *pointer + amount;
        *pointer = ret;
        CRYPTO_lock(CRYPTO_UNLOCK | CRYPTO_WRITE, type, file, line);
    }
    return (ret);
}
It calls CRYPTO_lock to do the actual work, so let’s dive into that. (Also stripped of unimportant macros.)
void CRYPTO_lock(int mode, int type, const char *file, int line)
{
    if (type < 0) {
        if (dynlock_lock_callback != NULL) {
            struct CRYPTO_dynlock_value *pointer
                = CRYPTO_get_dynlock_value(type);

            OPENSSL_assert(pointer != NULL);
            dynlock_lock_callback(mode, pointer, file, line);
            CRYPTO_destroy_dynlockid(type);
        }
    } else if (locking_callback != NULL)
        locking_callback(mode, type, file, line);
}
Ah, so depending on type, it will do one of two things. In our case, it branches into the else if and calls locking_callback. How is that set? Let’s take a look below.
locking_callback is a global variable in OpenSSL. It is set in CRYPTO_set_locking_callback. That must mean that PXC calls this function to set it.
In PXC 5.7, locking_callback is set in vio/viosslfactories.c, inside set_lock_callback_functions. OpenSSL allows applications compiled against it to set their own locking functions instead of using the provided defaults.
static void set_lock_callback_functions(my_bool init)
{
  CRYPTO_set_locking_callback(init ? openssl_lock_function : NULL);
  CRYPTO_set_id_callback(init ? openssl_id_function : NULL);
  CRYPTO_set_dynlock_create_callback(init ? openssl_dynlock_create : NULL);
  CRYPTO_set_dynlock_destroy_callback(init ? openssl_dynlock_destroy : NULL);
  CRYPTO_set_dynlock_lock_callback(init ? openssl_lock : NULL);
}
Notice that if the init variable is true, then openssl_lock_function will be set as the locking callback. The init variable is true if MySQL is compiled with OpenSSL. (I am purposefully omitting the full trace.)
OK, so we’ve learned how the locking function gets called. Let’s learn what it really does under the hood.
Journey into locks:
Here’s openssl_lock_function, the function the callback pointer is set to:
static void openssl_lock_function(int mode, int n,
                                  const char *file MY_ATTRIBUTE((unused)),
                                  int line MY_ATTRIBUTE((unused)))
{
  if (n < 0 || n > CRYPTO_num_locks())
  {
    /* Lock number out of bounds. */
    DBUG_PRINT("error",
               ("Fatal OpenSSL: %s:%d: interface problem (n = %d)",
                file, line, n));

    fprintf(stderr, "Fatal: OpenSSL interface problem (n = %d)", n);
    fflush(stderr);
    abort();
  }
  openssl_lock(mode, &openssl_stdlocks[n], file, line);
}
And openssl_lock()
looks like this:
/*
  OpenSSL callback functions for multithreading. We implement all the
  functions as we are using our own locking mechanism.
*/
static void openssl_lock(int mode, openssl_lock_t *lock,
                         const char *file MY_ATTRIBUTE((unused)),
                         int line MY_ATTRIBUTE((unused)))
{
  int err;
  char const *what;

  switch (mode) {
  case CRYPTO_LOCK|CRYPTO_READ:
    what = "read lock";
    err = mysql_rwlock_rdlock(&lock->lock);
    break;
  case CRYPTO_LOCK|CRYPTO_WRITE:
    what = "write lock";
    err = mysql_rwlock_wrlock(&lock->lock);
    break;
  case CRYPTO_UNLOCK|CRYPTO_READ:
  case CRYPTO_UNLOCK|CRYPTO_WRITE:
    what = "unlock";
    err = mysql_rwlock_unlock(&lock->lock);
    break;
  default:
    /* Unknown locking mode. */
    DBUG_PRINT("error",
               ("Fatal OpenSSL: %s:%d: interface problem (mode=0x%x)\n",
                file, line, mode));

    fprintf(stderr, "Fatal: OpenSSL interface problem (mode=0x%x)", mode);
    fflush(stderr);
    abort();
  }
  if (err)
  {
    DBUG_PRINT("error",
               ("Fatal OpenSSL: %s:%d: can't %s OpenSSL lock\n",
                file, line, what));

    fprintf(stderr, "Fatal: can't %s OpenSSL lock", what);
    fflush(stderr);
    abort();
  }
}
OK, so now we’re starting to get some interesting results. However, there is more to dig into. We need to see what mysql_rwlock_*() is doing.
For the sake of brevity, mysql_rwlock_*() is actually a macro for inline_mysql_rwlock_*(). Where does the macro take us? (Stripped of unimportant macros.)
static inline int inline_mysql_rwlock_rdlock(
  mysql_rwlock_t *that
#ifdef HAVE_PSI_RWLOCK_INTERFACE
  , const char *src_file, uint src_line
#endif
  )
{
  int result;
  /* Non instrumented code */
  result = native_rw_rdlock(&that->m_rwlock);
  return result;
}
Ah, yes, another function to dive into: native_rw_rdlock(). Luckily, this is the end of it.
static inline int native_rw_rdlock(native_rw_lock_t *rwp)
{
#ifdef _WIN32
  AcquireSRWLockShared(&rwp->srwlock);
  return 0;
#else
  return pthread_rwlock_rdlock(rwp);
#endif
}
What does this all mean?
Well, we traced the call stack of CRYPTO_add_lock and learned that it’s a platform-agnostic way to do locking and unlocking. On Linux, as we see above, it uses the pthreads library under the hood (and SRW locks on Windows). Essentially, it is an abstraction over platform-specific locking. CRYPTO_add_lock exists because the C standard library (before C11) did not provide a portable way to use mutexes.
Great. Let’s tie that back to the FlameGraph. We saw a significant amount of CPU cycles spent in CRYPTO_add_lock, which essentially boils down to taking a mutex. This means our bottleneck is lock contention.
Atomicity:
Let’s take a closer look at the critical section of the code (between the mutex acquire and release). The code suspiciously looks like a fetch-and-add. Why is a mutex being used when this could be a single atomic operation?
/* acquire mutex */
ret = *pointer + amount;
*pointer = ret;
/* release mutex */
Well, we’ve answered one question of why we were spending so much time in CRYPTO_add_lock: waiting on a mutex. The next question is why a mutex is being used to protect what should be an atomic operation.
I spent some time trying to understand how the above code was not atomic. I assumed the C compiler would optimize it into an atomic instruction.
The answer is two-fold: the compiler does not perform that optimization, and the performance difference between mutexes and atomics is staggering. 2
Now it makes sense that the benchmark suite was failing due to network timeouts. The threads handling the connections simply took too long because of needless lock contention.
How can this be fixed? Well, on x86, there is an atomic fetch-and-add instruction, which performs far better than taking a mutex.
Why did the OpenSSL developers opt for a mutex? I tried to answer that question by running git log on version 1.0.2g:
$ git log -G "ret= \*pointer\+amount;"
commit ae5c8664e587f2445c8e4eb436cffbb64af4f6b8
Author: Matt Caswell <matt@openssl.org>
Date: Thu Jan 22 03:29:12 2015 +0000
Run util/openssl-format-source -v -c .
Reviewed-by: Tim Hudson <tjh@openssl.org>
commit d02b48c63a58ea4367a0e905979f140b7d090f86
Author: Ralf S. Engelschall <rse@openssl.org>
Date: Mon Dec 21 10:52:47 1998 +0000
Import of old SSLeay release: SSLeay 0.8.1b
Unfortunately, it will be hard to find answers: the oldest commit touching this line is the import from SSLeay. Oh, well.
Putting it all together:
At the beginning, we were under the assumption that MySQL was the cause of the performance hit, but it is more nuanced than that. It is actually more like:
- MySQL (or PXC) sets OpenSSL’s locking callback to a MySQL function (openssl_lock_function, which bottoms out in native_rw_rdlock()).
- OpenSSL calls that callback from CRYPTO_lock, inside CRYPTO_add_lock.
- CRYPTO_add_lock is a wrapper function providing the locking and unlocking.
- CRYPTO_add_lock protects its critical section with an unnecessary mutex.
Resolution:
After doing some research online, we ended up bumping our version of OpenSSL to 1.1.0, which fixed the lock contention.
The fix in 1.1.0 was to deprecate and remove the locking machinery and rely on atomic operations instead. For source compatibility, the header keeps the callback-management functions as macros with the same interface, but they expand to no-ops.
The following snippet is from include/openssl/crypto.h:
/*
* The old locking functions have been removed completely without compatibility
* macros. This is because the old functions either could not properly report
* errors, or the returned error values were not clearly documented.
* Replacing the locking functions with no-ops would cause race condition
* issues in the affected applications. It is far better for them to fail at
* compile time.
* On the other hand, the locking callbacks are no longer used. Consequently,
* the callback management functions can be safely replaced with no-op macros.
*/
# define CRYPTO_num_locks() (1)
# define CRYPTO_set_locking_callback(func)
# define CRYPTO_get_locking_callback() (NULL)
# define CRYPTO_set_add_lock_callback(func)
# define CRYPTO_get_add_lock_callback() (NULL)
At this point, the investigation is wrapped up, but I’ll continue with some random musings to satisfy my own curiosity. If you’re interested, read on.
Musings:
I decided to dig further. I wanted to see what a solution would look like without a major version bump.
What would a correct (hypothetical) implementation look like?
int CRYPTO_add_lock(int *pointer, int amount, int type, const char *file,
                    int line)
{
    /* atomic_fetch_add returns the old value, so add `amount` to
       preserve the original return convention (the new value). */
    return atomic_fetch_add((_Atomic int *)pointer, amount) + amount;
}
That’s roughly how one could implement it if OpenSSL used C11’s <stdatomic.h>. (Casting a plain int * to _Atomic int * is a shortcut for illustration; a real fix would change the type at its declaration.)
This is the classic fetch-and-add mentioned earlier. Because the hardware performs the read-modify-write as a single atomic operation, no other thread can touch that memory location mid-update. No need for explicit locking.
Musings part 2:
I wanted to see if I could reproduce the issue on my local machine. I ran the same benchmarks on my local machine inside a two-core VM. They failed of course because they did not meet the time constraints. But I was still surprised to see the FlameGraph suggest that the CPU was spending almost all its time doing what we expect. It did not spend any measurable time waiting for locks. I thought “how could this be”, until I realized that in reality, there could only be two threads serving connections on my VM.
If the machine does not have many cores, this issue cannot manifest itself, because the CPU literally has no physical means of doing work in parallel (in this case, handling simultaneous connections).
Something that was nagging at me this whole time was how no one had discovered this before. How did this piece of code from the 90’s survive this long?
It’s possible that OpenSSL developers at the time also assumed that the compiler would optimize the code, but never bothered to verify.
But not until I tried to reproduce this locally did I realize a more likely reason: this code was probably never performance-tested with as many parallel connections as our benchmarking suite uses. The amount of load and parallelism required is quite expensive to replicate. Our benchmarking environment is a very beefy machine, and the load it handles simulates a large amount of simultaneous traffic.
It was an eye-opening moment when I realized that this performance degradation manifests only at scale.
In the world of software, it can be easy to forget about the hardware. It is always fun to peel back the abstraction and truly come to an understanding.
- https://github.com/brendangregg/FlameGraph [return]
- See this repo for a performance comparison between mutexes and atomic operations. [return]