Introduction:
In this post, I am going to walk through an interesting investigation our team did into MySQL and OpenSSL. We discovered that OpenSSL 1.0.2g suffers from lock contention under highly parallel workloads. If you enjoy systems-level problems, this post is for you.
I will journal the main struggles of our investigation and share some personal findings I made along the way.
Background:
One of our performance benchmark suites fails when we turn on TLS to the MySQL database. The logs showed that there were many network timeouts.
Interestingly enough, if we switched to PostgreSQL, the issue went away. If we compiled MySQL with yaSSL (MySQL’s bundled SSL library), the issue also went away.
Our team initially suspected the TLS handshake, but switching to weaker cipher suites had no impact. Long story short, something about opening new connections with OpenSSL was causing network timeouts in the benchmarking suite.
Context:
To understand what the CPU on the MySQL host was doing while the benchmark suite was running, we used perf to generate a FlameGraph. 1
The FlameGraph with TLS on:
The FlameGraph with TLS off:
With TLS turned on, we noticed that roughly 1⁄3 of the time was spent waiting in OpenSSL. From what we can tell from the FlameGraph, one function in particular stood out: CRYPTO_add_lock. For the rest of this post, CRYPTO_add_lock will be referred to as “the locking function”.
Naturally, we began searching the PXC code for references to CRYPTO_add_lock. (PXC is a Percona distribution of MySQL that we use.) We couldn’t find any direct references, and we were confused as to how it was being called.
The following is an explanation of how PXC and OpenSSL interact with respect to the locking function. Along the way, we also learned that the locking function is non-atomic, which (spoiler alert) ends up being the root cause.
Journey of CRYPTO_add_lock:
Let’s start with what we know:
- CRYPTO_add_lock is a part of OpenSSL.
- PXC does not reference it directly.
- We compile PXC to use OpenSSL.
Let’s try to trace how CRYPTO_add_lock is invoked in OpenSSL. Here’s CRYPTO_add_lock itself (stripped of a bunch of unimportant macros), from OpenSSL 1.0.2g:
int CRYPTO_add_lock(int *pointer, int amount, int type, const char *file,
                    int line)
{
    int ret = 0;

    if (add_lock_callback != NULL) {
        ret = add_lock_callback(pointer, amount, type, file, line);
    } else {
        CRYPTO_lock(CRYPTO_LOCK | CRYPTO_WRITE, type, file, line);
        ret = *pointer + amount;
        *pointer = ret;
        CRYPTO_lock(CRYPTO_UNLOCK | CRYPTO_WRITE, type, file, line);
    }
    return (ret);
}
It calls CRYPTO_lock to do the actual work, so let’s dive into that. (Also stripped of unimportant macros.)
void CRYPTO_lock(int mode, int type, const char *file, int line)
{
    if (type < 0) {
        if (dynlock_lock_callback != NULL) {
            struct CRYPTO_dynlock_value *pointer
                = CRYPTO_get_dynlock_value(type);

            OPENSSL_assert(pointer != NULL);
            dynlock_lock_callback(mode, pointer, file, line);
            CRYPTO_destroy_dynlockid(type);
        }
    } else if (locking_callback != NULL)
        locking_callback(mode, type, file, line);
}
Ah, so depending on type, it will do one of two things. In our case, it branches into the else if and calls locking_callback. How is that set? Let’s take a look below.
locking_callback is a global variable in OpenSSL. It is set in CRYPTO_set_locking_callback. That must mean that PXC calls this function to set it.
In PXC 5.7, locking_callback is set in vio/viosslfactories.c, inside set_lock_callback_functions. OpenSSL allows applications compiled against it to set their own locking functions instead of using the provided defaults.
static void set_lock_callback_functions(my_bool init)
{
  CRYPTO_set_locking_callback(init ? openssl_lock_function : NULL);
  CRYPTO_set_id_callback(init ? openssl_id_function : NULL);
  CRYPTO_set_dynlock_create_callback(init ? openssl_dynlock_create : NULL);
  CRYPTO_set_dynlock_destroy_callback(init ? openssl_dynlock_destroy : NULL);
  CRYPTO_set_dynlock_lock_callback(init ? openssl_lock : NULL);
}
Notice that if the init variable is true, then openssl_lock_function will be set as the locking callback. The init variable is true if MySQL is compiled with OpenSSL. (I am purposefully omitting the full trace.)
OK, so we’ve learned how the locking function gets called. Let’s learn what it really does under the hood.
Journey into locks:
Here’s openssl_lock_function, the function the callback pointer is set to:
static void openssl_lock_function(int mode, int n,
                                  const char *file MY_ATTRIBUTE((unused)),
                                  int line MY_ATTRIBUTE((unused)))
{
  if (n < 0 || n > CRYPTO_num_locks())
  {
    /* Lock number out of bounds. */
    DBUG_PRINT("error",
               ("Fatal OpenSSL: %s:%d: interface problem (n = %d)",
                file, line, n));

    fprintf(stderr, "Fatal: OpenSSL interface problem (n = %d)", n);
    fflush(stderr);
    abort();
  }
  openssl_lock(mode, &openssl_stdlocks[n], file, line);
}
And openssl_lock()
looks like this:
/*
  OpenSSL callback functions for multithreading. We implement all the
  functions as we are using our own locking mechanism.
*/
static void openssl_lock(int mode, openssl_lock_t *lock,
                         const char *file MY_ATTRIBUTE((unused)),
                         int line MY_ATTRIBUTE((unused)))
{
  int err;
  char const *what;

  switch (mode) {
  case CRYPTO_LOCK|CRYPTO_READ:
    what = "read lock";
    err = mysql_rwlock_rdlock(&lock->lock);
    break;
  case CRYPTO_LOCK|CRYPTO_WRITE:
    what = "write lock";
    err = mysql_rwlock_wrlock(&lock->lock);
    break;
  case CRYPTO_UNLOCK|CRYPTO_READ:
  case CRYPTO_UNLOCK|CRYPTO_WRITE:
    what = "unlock";
    err = mysql_rwlock_unlock(&lock->lock);
    break;
  default:
    /* Unknown locking mode. */
    DBUG_PRINT("error",
               ("Fatal OpenSSL: %s:%d: interface problem (mode=0x%x)\n",
                file, line, mode));

    fprintf(stderr, "Fatal: OpenSSL interface problem (mode=0x%x)", mode);
    fflush(stderr);
    abort();
  }
  if (err)
  {
    DBUG_PRINT("error",
               ("Fatal OpenSSL: %s:%d: can't %s OpenSSL lock\n",
                file, line, what));

    fprintf(stderr, "Fatal: can't %s OpenSSL lock", what);
    fflush(stderr);
    abort();
  }
}
OK, so now we’re starting to get some interesting results. However, there is more to dig into. We need to see what mysql_rwlock_*() is doing.
For the sake of brevity, mysql_rwlock_*() is actually a macro for inline_mysql_rwlock_*(). Where does the macro take us? (Stripped of unimportant macros.)
static inline int inline_mysql_rwlock_rdlock(
  mysql_rwlock_t *that
#ifdef HAVE_PSI_RWLOCK_INTERFACE
  , const char *src_file, uint src_line
#endif
  )
{
  int result;
  /* Non instrumented code */
  result = native_rw_rdlock(&that->m_rwlock);
  return result;
}
Ah, yes, another function to dive into: native_rw_rdlock(). Luckily, this is the end of it.
static inline int native_rw_rdlock(native_rw_lock_t *rwp)
{
#ifdef _WIN32
  AcquireSRWLockShared(&rwp->srwlock);
  return 0;
#else
  return pthread_rwlock_rdlock(rwp);
#endif
}
What does this all mean?
Well, we traced the call stack of CRYPTO_add_lock and learned that it’s a platform-agnostic way to do locking and unlocking. On Linux, as we see above, it uses the pthreads library under the hood (and SRW locks on Windows). Essentially, it is an abstraction over platform-specific locking. CRYPTO_add_lock exists because the C standard library (before C11) did not provide a portable way to use mutexes.
Great. Let’s tie that back to the FlameGraph. We saw a significant amount of CPU cycles spent in CRYPTO_add_lock, which essentially boils down to taking a mutex. This means our bottleneck is lock contention.
Atomicity:
Let’s take a closer look at the critical section of the code (between the mutex acquire and release). The code suspiciously looks like a fetch-and-add. Why is a mutex being used when this could be a single atomic operation?
/* acquire mutex */
ret = *pointer + amount;
*pointer = ret;
/* release mutex */
Well, we’ve answered one question of why we were spending so much time in CRYPTO_add_lock: waiting on a mutex. The next question is why a mutex is being used to protect what should be an atomic operation.
I spent some time trying to understand how the above code was not atomic. I assumed the C compiler would optimize it into an atomic instruction.
The answer is two-fold: the compiler does not perform that optimization, and the performance difference between mutexes and atomics is staggering. 2
Now it makes sense that the benchmark suite was failing due to network timeouts. The threads handling the connections simply took too long because of needless lock contention.
How can this be fixed? Well, on x86, there is an atomic fetch-and-add instruction, which performs far better than taking a mutex.
Why did the OpenSSL developers opt for a mutex? I tried to answer that question by running git log on version 1.0.2g:
$ git log -G "ret= \*pointer\+amount;"
commit ae5c8664e587f2445c8e4eb436cffbb64af4f6b8
Author: Matt Caswell <matt@openssl.org>
Date: Thu Jan 22 03:29:12 2015 +0000
Run util/openssl-format-source -v -c .
Reviewed-by: Tim Hudson <tjh@openssl.org>
commit d02b48c63a58ea4367a0e905979f140b7d090f86
Author: Ralf S. Engelschall <rse@openssl.org>
Date: Mon Dec 21 10:52:47 1998 +0000
Import of old SSLeay release: SSLeay 0.8.1b
Unfortunately, it will be hard to find answers: the oldest commit touching this line is the import from SSLeay. Oh, well.
Putting it all together:
At the beginning, we were under the assumption that MySQL was the cause of the performance hit, but it is more nuanced than that. It is actually more like:
- MySQL (or PXC) sets OpenSSL’s locking callback to a MySQL function (openssl_lock_function, which bottoms out in native_rw_rdlock()).
- OpenSSL calls that callback from CRYPTO_lock, inside CRYPTO_add_lock.
- CRYPTO_add_lock is a wrapper function providing the locking and unlocking.
- CRYPTO_add_lock protects its critical section with an unnecessary mutex.
Resolution:
After doing some research online, we ended up bumping our version of OpenSSL to 1.1.0, which fixed the lock contention.
The fix in 1.1.0 was to deprecate and remove the locking machinery and rely on atomic operations instead. For source compatibility, the header keeps the callback-management functions as macros with the same interface, but they expand to no-ops.
The following snippet is from include/openssl/crypto.h:
/*
* The old locking functions have been removed completely without compatibility
* macros. This is because the old functions either could not properly report
* errors, or the returned error values were not clearly documented.
* Replacing the locking functions with no-ops would cause race condition
* issues in the affected applications. It is far better for them to fail at
* compile time.
* On the other hand, the locking callbacks are no longer used. Consequently,
* the callback management functions can be safely replaced with no-op macros.
*/
# define CRYPTO_num_locks() (1)
# define CRYPTO_set_locking_callback(func)
# define CRYPTO_get_locking_callback() (NULL)
# define CRYPTO_set_add_lock_callback(func)
# define CRYPTO_get_add_lock_callback() (NULL)
At this point, the investigation is wrapped up, but I’ll continue with some random musings to satisfy my own curiosity. If you’re interested, read on.
Musings:
I decided to dig further. I wanted to see what a solution would look like without a major version bump.
What would a correct (hypothetical) implementation look like?
int CRYPTO_add_lock(int *pointer, int amount, int type, const char *file,
                    int line)
{
    /* atomic_fetch_add returns the old value, so add `amount` to
       preserve the original return convention (the new value). */
    return atomic_fetch_add((_Atomic int *)pointer, amount) + amount;
}
That’s roughly how one could implement it if OpenSSL used C11’s <stdatomic.h>. (Casting a plain int * to _Atomic int * is a shortcut for illustration; a real fix would change the type at its declaration.)
This is the classic fetch-and-add mentioned earlier. Because the hardware performs the read-modify-write as a single atomic operation, no other thread can touch that memory location mid-update. No need for explicit locking.
Musings part 2:
I wanted to see if I could reproduce the issue on my local machine. I ran the same benchmarks on my local machine inside a two-core VM. They failed of course because they did not meet the time constraints. But I was still surprised to see the FlameGraph suggest that the CPU was spending almost all its time doing what we expect. It did not spend any measurable time waiting for locks. I thought “how could this be”, until I realized that in reality, there could only be two threads serving connections on my VM.
If the machine does not have many cores, this issue cannot manifest itself, because the CPU literally has no physical means of doing work in parallel (in this case, handling simultaneous connections).
Something that was nagging at me this whole time was how no one had discovered this before. How did this piece of code from the 90’s survive this long?
It’s possible that OpenSSL developers at the time also assumed that the compiler would optimize the code, but never bothered to verify.
But not until I tried to reproduce this locally did I realize a more likely reason: this code was probably never performance-tested with as many parallel connections as our benchmarking suite uses. The amount of load and parallelism required is quite expensive to replicate. Our benchmarking environment is a very beefy machine, and the load it handles simulates a large amount of simultaneous traffic.
It was an eye-opening moment when I realized that this performance degradation manifests only at scale.
In the world of software, it can be easy to forget about the hardware. It is always fun to peel back the abstraction and truly come to an understanding.
- https://github.com/brendangregg/FlameGraph [return]
- See this repo for a performance comparison between mutexes and atomic operations. [return]