Tracking down a seven-year-old segfault
Back in August 2014 a user reported they couldn’t run Elasticsearch: it would immediately crash with a segmentation fault. Elasticsearch is almost entirely written in Java, which as a managed language is supposed to protect us from low-level issues like segmentation faults. The “almost” in the previous sentence is the problem: Elasticsearch calls out to native code in a few places, and it was one of these places that was triggering the crash.
Elasticsearch makes its native calls using JNA which is a deeply magical library that makes it easy for Java code to call into libraries written in C. There was definitely a suspicion that this was something to do with JNA, but the investigation fizzled out before it got very far.
Then in May 2016 GitHub user @fxh reported the same thing on GitHub.
The investigation got further this time: the issue seemed to manifest only when SELinux was enabled, and seemed to be related to whether temporary files were permitted to contain executable code. Forbidding temporary executables is a security measure: anyone can write to /tmp so it’s possible to attack a system by writing a nefarious executable to /tmp and then tricking someone more privileged into running it.
In this context “executable” means more than just programs that you can run from the command line. Modern processors distinguish code from data at a very low level, with a flag on each page of memory that determines whether the data it contains can ever be interpreted as instructions that the CPU will execute. If /tmp is mounted with the noexec option then every page that’s associated with a file under /tmp will have the no-execute bit set. This forbids fully-fledged programs and dynamically-linked libraries and also, crucially, any other kind of memory-mapped executable page that is backed by a file in /tmp.
Some of JNA’s deep magic works by dynamically generating a temporary library (wrapping around the C library) into which the Java code can call. It’s sometimes possible to generate code dynamically in pages that aren’t backed by a file, but for security reasons the system might also prohibit pages from being both writeable and executable, and obviously we need write access to the memory in which we’re generating the code. The usual solution seems to be to write the code into a file and then use something like mmap() to load it again into read-only-but-executable pages. In order to do this we need some temporary space that isn’t mounted noexec.
However the crash wasn’t just caused by having /tmp mounted with the noexec option: if you do that then you get a different error and not a segmentation fault. And anyway you can tell JNA to create its temporary library in a different location by setting the java.io.tmpdir or jna.tmpdir system properties, which is a sensible workaround for when executables are forbidden in the default temporary directory. Unfortunately this workaround doesn’t always fix the problem.
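To make the workaround concrete, here is a minimal sketch of how the two system properties relate. It assumes that jna.tmpdir, when set, takes precedence over java.io.tmpdir, and it ignores any per-user subdirectory JNA may create underneath; the class and method names are made up for illustration and are not part of JNA.

```java
import java.util.Properties;

// Hypothetical helper, not part of JNA: reports which base directory the
// two properties mentioned above would select.
final class JnaTmpDirSketch {

    // ASSUMPTION: jna.tmpdir wins when it is set and non-empty; otherwise
    // the JVM-wide java.io.tmpdir is used.
    static String configuredTmpDir(Properties props) {
        String jnaTmp = props.getProperty("jna.tmpdir");
        if (jnaTmp != null && !jnaTmp.isEmpty()) {
            return jnaTmp;
        }
        return props.getProperty("java.io.tmpdir");
    }

    public static void main(String[] args) {
        // Prints the directory selected for the current JVM.
        System.out.println("JNA temporary directory base: "
                + configuredTmpDir(System.getProperties()));
    }
}
```

Either property can be passed on the JVM command line, for example -Djna.tmpdir=/some/exec-allowed/dir, where the path is only a placeholder for a directory that is not mounted noexec.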
At this point it became hard to make further progress: there was no sufficiently well-locked-down SELinux system on which to analyse things further and it’s not a configuration that gets a lot of testing. We verified that the problem really wasn’t in Elasticsearch code itself, concluded that it must be a problematic and untested interaction between SELinux and JNA, and left it there in the hope that a future version of SELinux and/or JNA would fix it.
A few months later user @vineet01 reported that they were having the same problem and that they fixed it by creating a home directory for the user that Elasticsearch was running as. They hypothesised that this was because the JVM wanted to create a usage-tracking file in the home directory. More recently user @cyamal1b4 reported the same fix and blamed the same usage-tracking file, although neither user gave an explanation of how a failure to write this file might lead to a segmentation fault in JNA-related code.
Over the years many users have reported this same crash. It still definitely exists on very locked-down systems. It always appears to be related to temporary executable files and is generally fixed by fiddling with environment variables/system properties/permissions until the segfault goes away. The trouble is that the users that have these very locked-down systems are also the users that can provide the least amount of debugging context, and that struggle to make changes to permissions or other environmental settings. It’s often quite a long process to find the right combination of settings that fix it, and all these cases consume much time from many engineers. At least the crash reliably happens at every startup. It’d be much worse if it were nondeterministic or took a long time to manifest, but Elasticsearch really should not be failing with a segmentation fault like this, and we really shouldn’t be spending this much time helping users solve the same issue over and over again.
Recently I decided it was worth taking another look to see if I could work out what was at the bottom of it.
JVM error log analysis
@fxh included the JVM error log file in the original GitHub issue, and it contains a fantastic amount of detail about the state of the JVM at the time of the crash. Here’s an excerpt of the important bits:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f424226f40a, pid=28216, tid=139922878629632
#
# JRE version: Java(TM) SE Runtime Environment (7.0_75-b13) (build 1.7.0_75-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.75-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [jna4948368637624641726.tmp+0x1240a] ffi_prep_closure_loc+0x1a
#
Registers:
RAX=0x00007f424226f8c2, RBX=0x00007f425579ed48, RCX=0x00007f424c44a7b0, RDX=0x00007f4242264590
RSP=0x00007f425579eae0, RBP=0x00007f425579eae0, RSI=0x00007f424c44a7d0, RDI=0x0000000000000000
R8 =0x00007f425007ae43, R9 =0x0000000000000002, R10=0x00007f425579e870, R11=0x00007f424226f3f0
R12=0x0000000000000000, R13=0x0000000000000008, R14=0x00007f424c44a7b0, R15=0x0000000000000004
RIP=0x00007f424226f40a, EFLAGS=0x0000000000010246, CSGSFS=0x0000000000000033, ERR=0x0000000000000006
TRAPNO=0x000000000000000e
Instructions: (pc=0x00007f424226f40a)
0x00007f424226f3ea: 66 90 66 66 66 90 8b 06 55 41 b9 02 00 00 00 48
0x00007f424226f3fa: 89 e5 ff c8 83 f8 01 77 44 48 8b 05 c6 49 10 00
0x00007f424226f40a: 66 c7 07 49 bb 4c 89 47 0c 66 c7 47 0a 49 ba 48
0x00007f424226f41a: 89 47 02 8b 46 1c 48 89 77 18 48 89 57 20 48 89
Stack: [0x00007f42556a0000,0x00007f42557a1000], sp=0x00007f425579eae0, free space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [jna4948368637624641726.tmp+0x1240a] ffi_prep_closure_loc+0x1a
C [jna4948368637624641726.tmp+0xd4dd] Java_com_sun_jna_Native_registerMethod+0x45d
j com.sun.jna.Native.registerMethod(Ljava/lang/Class;Ljava/lang/String;Ljava/lang/String;[I[J[JIJJLjava/lang/Class;JIZ[Lcom/sun/jna/ToNativeConverter;Lcom/sun/jna/FromNativeConverter;Ljava/lang/String;)J+0
...
j org.elasticsearch.bootstrap.JNACLibrary.<clinit>()V+45
...
j org.elasticsearch.bootstrap.JNANatives.definitelyRunningAsRoot()Z+8
The header tells us that Elasticsearch received a fatal SIGSEGV signal while executing the instruction at ffi_prep_closure_loc+0x1a, i.e. the one which starts 0x1a bytes into the function ffi_prep_closure_loc. This signal usually means the program attempted to dereference a pointer that doesn’t point to a valid memory location. The stack trace shows that Elasticsearch was in the process of executing JNANatives#definitelyRunningAsRoot() which looks like this:
static boolean definitelyRunningAsRoot() {
    if (Constants.WINDOWS) {
        return false; // don't know
    }
    try {
        return JNACLibrary.geteuid() == 0;
    } catch (UnsatisfiedLinkError e) {
        // this will have already been logged by Kernel32Library, no need to repeat it
        return false;
    }
}
This method is ultimately trying to call the C library’s geteuid() function, and it’s the first time we’ve touched the JNACLibrary class so we’re running the static initializer (<clinit>) which is setting up all the JNA magic.
The dump of instruction memory is useful too: the instruction pointer is at 0x00007f424226f40a and as mentioned above this is only 0x1a bytes into executing ffi_prep_closure_loc, which means the function starts at address 0x00007f424226f3f0 and hence we can disassemble all the instructions in this function leading up to the one that caused the crash:
0: 8b 06 mov eax,DWORD PTR [rsi]
2: 55 push rbp
3: 41 b9 02 00 00 00 mov r9d,0x2
9: 48 89 e5 mov rbp,rsp
c: ff c8 dec eax
e: 83 f8 01 cmp eax,0x1
11: 77 44 ja 0x57
13: 48 8b 05 c6 49 10 00 mov rax,QWORD PTR [rip+0x1049c6] # 0x1049e0
1a: 66 c7 07 49 bb mov WORD PTR [rdi],0xbb49
The SIGSEGV happened on the last line which is trying to write something to the address to which register RDI points, and the crash dump also tells us that RDI is currently 0x0000000000000000 which is the null pointer and definitely not a valid address. This function hasn’t tried to write to RDI before it gets to the faulting instruction, which means it must have been expecting the caller to set RDI to a valid address.
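The address arithmetic used in this analysis is easy to double-check mechanically. The following is a small sketch, with a class name and method names invented for illustration, that recomputes the function's start address from the faulting pc and the target of the RIP-relative load at offset 0x13 of the disassembly.

```java
// Hypothetical helper for checking the numbers taken from the crash log.
final class CrashAddressMath {

    // "symbol+offset" in the log means pc = functionStart + offset.
    static long functionStart(long pc, long offsetIntoFunction) {
        return pc - offsetIntoFunction;
    }

    // A RIP-relative operand is resolved against the address of the NEXT
    // instruction, so the target is that address plus the displacement.
    static long ripRelativeTarget(long nextInstructionOffset, long displacement) {
        return nextInstructionOffset + displacement;
    }

    public static void main(String[] args) {
        long pc = 0x00007f424226f40aL; // from the SIGSEGV header line
        System.out.printf("function start: 0x%016x%n", functionStart(pc, 0x1a));
        // The load at offset 0x13 is 7 bytes long, so the next instruction
        // is at offset 0x1a; its displacement is 0x1049c6.
        System.out.printf("rip-relative target: 0x%x%n",
                ripRelativeTarget(0x1a, 0x1049c6));
    }
}
```

The first value should match the function start quoted above and also the value of R11 in the register dump; the second should match the 0x1049e0 annotation in the disassembly.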
If a function takes arguments then the caller is responsible for putting them in appropriate places so that the callee can find them. A calling convention is an agreement between caller and callee which defines (amongst other things) where the arguments to a function are when the function is called. Even on a particular processor architecture there are many possible calling conventions but in practice on a 64-bit x86 system running Linux RDI will contain the first argument to the function. We can see from the stack dump that the caller is JNA’s Java_com_sun_jna_Native_registerMethod function, and here’s how it calls ffi_prep_closure_loc:
closure = ffi_closure_alloc(sizeof(ffi_closure), &code);
status = ffi_prep_closure_loc(closure, closure_cif, dispatch_direct, data, code);
The first argument is the closure which is returned from ffi_closure_alloc. The docs for this function say it allocates and returns a chunk of memory and don’t suggest that it might return NULL, but if we look at its source it’s clear that it does return NULL to indicate various kinds of failure.
Hurrah, we worked it out: the segmentation fault is because JNA isn’t checking for a failure to allocate this closure, which turns out to be a known issue which should be a small thing to fix.
But wait, there’s more
Although it’s definitely an improvement to throw a Java exception on this failure instead of a segmentation fault, this doesn’t actually solve anything. Elasticsearch will still fail to start up even with this fix: it’ll report a more descriptive message and shut down more gracefully but still users will need to fiddle around with permissions and ask for help to get Elasticsearch up and running. Ideally we need to make ffi_closure_alloc succeed or at least to understand better why it’s failing.
Note that from here on this investigation takes a couple of leaps of faith: I’ve only read the code, I don’t have a locked-down system on which to run experiments to verify any of this.
There are a number of different implementations of ffi_closure_alloc depending on operating system and selected by #ifdef directives but they all ultimately need to allocate some memory into which some machine code can be written: like JNA, libffi does some of its magic with dynamically-generated executable code.
The allocation mechanism is kinda complicated: there’s actually a whole separate implementation of malloc() and friends which relies on mmap() to actually acquire memory from the operating system, but then mmap() is redefined to call a custom implementation which does its best to allocate executable pages using different techniques until it finds one which succeeds. Fortunately there are some helpful comments about how this works on Linux:
#if !FFI_MMAP_EXEC_WRIT && !FFI_EXEC_TRAMPOLINE_TABLE
# if __linux__ && !defined(__ANDROID__)
/* This macro indicates it may be forbidden to map anonymous memory
with both write and execute permission. Code compiled when this
option is defined will attempt to map such pages once, but if it
fails, it falls back to creating a temporary file in a writable and
executable filesystem and mapping pages from it into separate
locations in the virtual memory space, one location writable and
another executable. */
# define FFI_MMAP_EXEC_WRIT 1
# define HAVE_MNTENT 1
# endif
...
#if FFI_MMAP_EXEC_WRIT && !defined FFI_MMAP_EXEC_SELINUX
# if defined(__linux__) && !defined(__ANDROID__)
/* When defined to 1 check for SELinux and if SELinux is active,
don't attempt PROT_EXEC|PROT_WRITE mapping at all, as that
might cause audit messages. */
# define FFI_MMAP_EXEC_SELINUX 1
# endif
#endif
That tells us that libffi will sometimes create temporary executable files, and it will always create them when running under SELinux. It’s important to note that this is completely independent of the fact that JNA creates temporary executable files: libffi is a language-independent library for calling foreign functions so it doesn’t know anything about Java and therefore doesn’t have access to the Java system properties java.io.tmpdir and jna.tmpdir which control where JNA does its work. Instead, on Linux libffi tries to create its temporary executable files in various places in the following order of preference:
$LIBFFI_TMPDIR
$TMPDIR
/tmp
/var/tmp
/dev/shm
$HOME
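The search order above can be sketched in Java to see which candidates apply in a given environment. This is only an illustration of the list as written here, not a port of libffi's code: the class and method names are invented, and the writability check in main says nothing about whether a directory is mounted noexec or restricted by SELinux.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper that mirrors the preference order listed above.
final class LibffiTmpCandidates {

    // Builds the candidate list for the given environment, skipping
    // environment variables that are unset or empty.
    static List<String> candidates(Map<String, String> env) {
        List<String> result = new ArrayList<>();
        addIfSet(result, env.get("LIBFFI_TMPDIR"));
        addIfSet(result, env.get("TMPDIR"));
        result.add("/tmp");
        result.add("/var/tmp");
        result.add("/dev/shm");
        addIfSet(result, env.get("HOME"));
        return result;
    }

    private static void addIfSet(List<String> list, String value) {
        if (value != null && !value.isEmpty()) {
            list.add(value);
        }
    }

    public static void main(String[] args) {
        // Reports, for the current process, which candidates exist and are
        // writable. Executability of mapped pages is NOT checked here.
        for (String dir : candidates(System.getenv())) {
            Path p = Paths.get(dir);
            System.out.println(dir + " directory=" + Files.isDirectory(p)
                    + " writable=" + Files.isWritable(p));
        }
    }
}
```

Running this as the same user and with the same environment as Elasticsearch gives a quick view of which locations are even available before permissions and mount options come into play.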
Elasticsearch doesn’t set any of these environment variables specially, so even if JNA is creating its temporary files somewhere that permits executables it’s entirely possible that libffi does not. This also helpfully resolves the mystery of why giving the elasticsearch user a writeable home directory seems to make the problem go away: when nothing else works, libffi will try writing to $HOME which typically does permit executable code as long as it exists.
Finally, this leads us to a proper fix: we don’t need to give the elasticsearch user a whole home directory, instead we should be able to set $LIBFFI_TMPDIR to point to the same directory that JNA uses.
Addendum 2021-08-31: a colleague pointed out that JNA contains a vendored version of libffi, and the version of libffi used by the latest release of JNA predates support for the LIBFFI_TMPDIR environment variable. Setting LIBFFI_TMPDIR will be the right fix eventually, but until then the best we can do is to set TMPDIR or HOME.
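For anyone starting a JNA-using JVM from another Java process, the environment variables can be set on the child before it launches. The sketch below is an illustration only: the class name, method name and directory are hypothetical, and it sets both variables so that it works whether or not the bundled libffi understands LIBFFI_TMPDIR.

```java
import java.util.Map;

// Hypothetical launcher helper: prepares a child process environment so
// that libffi looks for its temporary files in a chosen directory.
final class LaunchWithTmpDir {

    static ProcessBuilder withExecutableTmpDir(ProcessBuilder builder, String dir) {
        Map<String, String> env = builder.environment();
        env.put("LIBFFI_TMPDIR", dir); // ignored by libffi versions that predate it
        env.put("TMPDIR", dir);        // fallback that older versions also consult
        return builder;
    }

    public static void main(String[] args) throws Exception {
        // Demonstration: run `env` in the adjusted environment and show the
        // result. The directory defaults to this JVM's java.io.tmpdir.
        String dir = args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir");
        ProcessBuilder pb = withExecutableTmpDir(new ProcessBuilder("env"), dir).inheritIO();
        System.exit(pb.start().waitFor());
    }
}
```

When Elasticsearch is started from a shell script or service manager instead, exporting the same variables in that environment has the same effect.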