A few days ago, some researchers from UCSB approached me regarding a few questions on my EEVBNN code. I decided to run the code to give them a definite answer. Since my original environment has been lost, I setup a new virtualenv with the current versions of dependence libraries.

I thought it should be quick. Oops, I got segfault instead of seeing the result. Fine. Let’s get a backtrace from gdb:

(gdb) bt
#0  Minisat::vec<Minisat::Option*>::push (
    this=0x7ffef57fdbd0 <Minisat::Option::getOptionList()::options>,
    elem=@0x7fffffff97b0: 0x7fff94b4c3a0 <opt_var_decay>)
    at /.../eevbnn/eevbnn/minisatcs/minisat/mtl/Vec.h:107
#1  0x00007fff94b2c1ae in Minisat::Option::Option (this=0x7fff94b4c3a0 <opt_var_decay>,
    name_=0x7fff94b3cf8b "var-decay",
    desc_=0x7fff94b3cf68 "The variable activity decay factor", cate_=0x7fff94b3b4b4 "CORE",
    type_=0x7fff94b3b388 "<double>")
    at /.../eevbnn/eevbnn/minisatcs/minisat/utils/Options.h:76
#2  0x00007fff94b2c2c2 in Minisat::DoubleOption::DoubleOption (
    this=0x7fff94b4c3a0 <opt_var_decay>, c=0x7fff94b3b4b4 "CORE",
    n=0x7fff94b3cf8b "var-decay", d=0x7fff94b3cf68 "The variable activity decay factor",
    def=0.94999999999999996, r=...)
    at /.../eevbnn/eevbnn/minisatcs/minisat/utils/Options.h:129
#3  0x00007fff94b2a02b in __static_initialization_and_destruction_0 (__initialize_p=1,
    __priority=65535)
    at /.../eevbnn/eevbnn/minisatcs/minisat/core/Solver.cc:51
#4  0x00007fff94b2a595 in _GLOBAL__sub_I_Solver.cc(void) ()
    at /.../eevbnn/eevbnn/minisatcs/minisat/core/Solver.cc:1922
#5  0x00007ffff7fdce2e in call_init () from /lib64/ld-linux-x86-64.so.2

The code in question comes from the MinisatCS solver, derived from Minisat. The segfault occurs in the constructor Option::Option(), which is called during initialization of global objects that represent command line options. The constructor pushes the current class instance to a list of options:

static vec<Option*>& getOptionList () { static vec<Option*> options; return options; }
// ...
Option(...) { getOptionList().push(this); }

Now the vec<Option*> data structure seems to contain garbage that ultimately causes the segfault:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
Minisat::vec<Minisat::Option*>::push (this=0x7ffef57fdbd0 <Minisat::Option::getOptionList()::options>, elem=@0x7fffffff97b0: 0x7fff10e893a0 <opt_var_decay>) at /.../eevbnn/eevbnn/minisatcs/minisat/mtl/Vec.h:107
107             m_data[m_sz++] = elem;
(gdb) p *this
$1 = {m_sz = 1487297953, m_cap = 21845, m_data = 0x3400000033}

Variable initialization

Well, variable initialization in C++ does have some catches. For example, global variables that need dynamic initialization are initialized before main() in a standalone executable or upon library loading as in our case. The order of initialization of variables in different TUs (translation units) is indeterminately sequenced. Thus function calls during global variable initialization call for special attention, as you need to ensure that the callee does not access global variables.

Still, I am quite confident that the usage in MinisatCS/Minisat is well-defined, since the local static variable in getOptionList() is initialized at the first function call that can safely occur during global initialization, which is a standard C++ practice. We can figure out more details by looking at the disassemble:

(gdb) disassemble $pc,$pc+64
Dump of assembler code from 0x7fff10e690bb to 0x7fff10e690fb:
=> 0x00007fff10e690bb <_ZN7Minisat6Option13getOptionListEv+4>:  mov    0x1eb5e(%rip),%rax        # 0x7fff10e87c20
   0x00007fff10e690c2 <_ZN7Minisat6Option13getOptionListEv+11>: movzbl (%rax),%eax
   0x00007fff10e690c5 <_ZN7Minisat6Option13getOptionListEv+14>: test   %al,%al
   0x00007fff10e690c7 <_ZN7Minisat6Option13getOptionListEv+16>: sete   %al
   0x00007fff10e690ca <_ZN7Minisat6Option13getOptionListEv+19>: test   %al,%al
   0x00007fff10e690cc <_ZN7Minisat6Option13getOptionListEv+21>: je     0x7fff10e69118 <_ZN7Minisat6Option13getOptionListEv+97>
   0x00007fff10e690ce <_ZN7Minisat6Option13getOptionListEv+23>: mov    0x1eb4b(%rip),%rax        # 0x7fff10e87c20
   0x00007fff10e690d5 <_ZN7Minisat6Option13getOptionListEv+30>: mov    %rax,%rdi
   0x00007fff10e690d8 <_ZN7Minisat6Option13getOptionListEv+33>: call   0x7fff10e56f80 <__cxa_guard_acquire@plt>
   0x00007fff10e690dd <_ZN7Minisat6Option13getOptionListEv+38>: test   %eax,%eax
   0x00007fff10e690df <_ZN7Minisat6Option13getOptionListEv+40>: setne  %al
   0x00007fff10e690e2 <_ZN7Minisat6Option13getOptionListEv+43>: test   %al,%al
   0x00007fff10e690e4 <_ZN7Minisat6Option13getOptionListEv+45>: je     0x7fff10e69118 <_ZN7Minisat6Option13getOptionListEv+97>
   0x00007fff10e690e6 <_ZN7Minisat6Option13getOptionListEv+47>: lea    0x1ef13(%rip),%rax        # 0x7fff10e88000
   0x00007fff10e690ed <_ZN7Minisat6Option13getOptionListEv+54>: mov    %rax,%rdx
   0x00007fff10e690f0 <_ZN7Minisat6Option13getOptionListEv+57>: mov    0x1e9f1(%rip),%rax        # 0x7fff10e87ae8
   0x00007fff10e690f7 <_ZN7Minisat6Option13getOptionListEv+64>: mov    %rax,%rsi
   0x00007fff10e690fa <_ZN7Minisat6Option13getOptionListEv+67>: mov    0x1eae7(%rip),%rax        # 0x7fff10e87be8
End of assembler dump.

Note that _ZN7Minisat6Option13getOptionListEv is the mangled name:

$ c++filt _ZN7Minisat6Option13getOptionListEv
Minisat::Option::getOptionList()

The assembly above shows that __cxa_guard_acquire is first called to check a guard variable that records whether the static object has been initialized. It roughly corresponds to the following code snippet (taken from here):

if (obj_guard.first_byte == 0) {
    if ( __cxa_guard_acquire (&obj_guard) ) {
        try {
            ... initialize the object ...;
        } catch (...) {
            __cxa_guard_abort (&obj_guard);
            throw;
        }
        ... queue object destructor with __cxa_atexit() ...;
        __cxa_guard_release (&obj_guard);
    }
}

We can also check that the guard variables do exist, as indicated by the last line below:

$ nm -nCD /home/jiakai/.pyxbld/lib.linux-x86_64-3.9/eevbnn/_minisatcs.cpython-39-x86_64-linux-gnu.so | grep getOptionList
000000000003a0b7 W Minisat::Option::getOptionList()
000000000005a730 u Minisat::Option::getOptionList()::options
000000000005a740 u guard variable for Minisat::Option::getOptionList()::options

Enough for the background.

I’ve confirmed via gdb that the initialization part is skipped. As we now have garbage in the options object at the fist function call, the reason should be that the guard variable and the object itself somehow get corrupted, perhaps during initialization of other objects. Or is it?

Symbol resolution in dynamic linking

While debugging, I noticed something wrong at the first hit of the breakpoint Minisat::Option::getOptionList():

(gdb) f
#0  0x00007ffef538b6c0 in Minisat::Option::getOptionList() ()
   from /.../eevbnn/venv/lib/python3.9/site-packages/pysolvers.cpython-39-x86_64-linux-gnu.so

Aha! Another Minisat exists in the process space, which was brought in by pysolvers.so that belongs to pysat. When dlopen is used to load a shared library, “unique global” symbols (as indicated by the lowercase u in the above nm output) are resolved to one definition globally. GCC identifies static local variables as such unique global symbols. I did not find a document to describe the underlying reasoning of this design. It makes sense to share the same variables in different libraries, but GCC uses D for global variables in the meanwhile, which seems not very consistent. For comparison, inline functions such as Option::getOptionList() are weak symbols as indicated by the letter W in nm output, and the dynamic linker does not try to resolve duplicated weak symbols to one definition by default (unless LD_DYNAMIC_WEAK is set or RTLD_GLOBAL is used with dlopen; more details can be found in man ld.so and man dlopen). Therefore, the variable getOptionList()::options in MinisatCS is resolved to the definition used by Minisat in pysat while the functions getOptionList() or Option:: Option() are still the MinisatCS version. They are incompatible since I changed the data layout of vec in MinisatCS. This is a classical ABI issue.

The following example demonstrates this behavior:

// header.h
#pragma once
#include <cstdio>
int gint;
class Init {
   public:
    Init() { printf("Init %p\n", this); }
};
class GlobarVar {
    static Init& get() {
        static Init init;
        return init;
    }

   public:
    GlobarVar() {
        printf("GlobarVar in %s: self=%p: init=%p\n", NAME, this, &get());
    }
};
static GlobarVar gv;
void f() { printf("f() in %s\n", NAME); }
extern "C" void run() {
    f();
    printf("run() in %s, gint=%p\n", NAME, &gint);
}

// a.cpp
#define NAME "a"
#include "header.h"

// b.cpp
#define NAME "b"
#include "header.h"

// main.cpp
#include <dlfcn.h>
int main(int argc, char **) {
    int flags = argc >= 2 ? RTLD_NOW | RTLD_GLOBAL : RTLD_NOW;
    auto h0 = dlopen("./a.so", flags), h1 = dlopen("./b.so", flags);
    auto run0 = reinterpret_cast<void (*)()>(dlsym(h0, "run")),
         run1 = reinterpret_cast<void (*)()>(dlsym(h1, "run"));
    run0();
    run1();
}
# Makefile
all: main

%.so: %.cpp header.h
	$(CXX) $< -o $@ -fPIC -shared -g

main: main.cpp a.so b.so
	$(CXX) $< -o $@ -ldl -g

clean:
	rm -f *.so main

.PHONY: clean all

When we run ./main, the two libraries are loaded with RTLD_NOW. As can be seen from the following output, they use their private function implementations but share the same GlobarVar::get()::init variable. Note that they have private versions of the global variable gint, which is different from the static local variable:

$ ./main
Init 0x7faba5637079
GlobarVar in a: self=0x7faba5637078: init=0x7faba5637079
GlobarVar in b: self=0x7faba5632078: init=0x7faba5637079
f() in a
run() in a, gint=0x7faba5637074
f() in b
run() in b, gint=0x7faba5632074

If we run ./main 0, the flags to dlopen are RTLD_NOW | RTLD_GLOBAL, which cause ld.so to reuse functions exported by a.so when resolving symbols in b.so. Note that we can still get run() in b.so by directly querying dlsym, but its internal call to f() is resolved to another implementation:

$ ./main 0
Init 0x7fe8f4916079
GlobarVar in a: self=0x7fe8f4916078: init=0x7fe8f4916079
GlobarVar in a: self=0x7fe8f4911078: init=0x7fe8f4916079
f() in a
run() in a, gint=0x7fe8f4916074
f() in a
run() in b, gint=0x7fe8f4916074

The root cause

The code did work one year ago. Lots of things have been changed in one year because I use Arch Linux, a rolling release system. As we know there are symbol name conflicts, a natural question is, is the MinisatCS solver integration functioning as expected? Are the experimental results reported in the paper still valid? Oh no these are terrible thoughts. I need to figure out what’s going on exactly.

It could be the compiler/linker. For example, the old linker might resolve weak symbols to a unique definition so that MinisatCS actually invoked the constructors defined in Minisat, or the old compiler might use something other than unique global symbols for static local variables so that getOptionList():: options was not shared. But this is extremely unlikely, since such a modification in this fundamental layer of software stack is surely to break lots of things. To be sure, I still confirmed that the above test code behaves similarly on an old ubuntu 10.0 system.

It could be that cpython modified its library loading behavior. It might have been using RTLD_GLOBAL for a long time and only removed this flag recently. However, this is not the case, as the code was actually written years ago and remains unchanged.

I know that pysat has always included Minisat and thus I did not suspect it at first. I assumed the symbol name conflict always existed and spent a few hours trying to figure out why it worked one year ago. After ruling out all other possibilities, I took a closer look at pysat, and noticed something interesting: the Minisat solver shipped with pysat actually uses the namespace Minisat22! Moreover, pysat uses the namespace Minisat for a recently introduced solver MergeSat3. So there were no name conflicts one year ago. Well, fine, I totally did not expect this.

The solution and the lessons

The fix is simple: use a dedicated namespace for MinisatCS. To make it more robust, I also added -fvisibility=hidden to the compiler command line, which marks all the internal symbols private so that the linker can not resolve them to other definitions.

A few lessons learnt:

  1. Use a dedicated namespace early on. I initially did some experimental modifications to Minisat and did not change the namespace. These modifications accumulated as a separate project MinisatCS, but I totally forgot about the namespace thing.
  2. Pinning dependencies is a good idea. I thought EEVBNN is simple and self-contained enough with only a few dependencies and should not be broken by future updates of third-party libraries. This turns out to be a naive thought of a young person.

Updates on Jun 15, 2023

Someone told me that the above observation is a result of inline functions. The current symbol types are:

$ nm -C a.so  | grep ::get
0000000000004068 u guard variable for GlobarVar::get()::init
000000000000122e W GlobarVar::get()
0000000000004061 u GlobarVar::get()::init

By moving the definition of GlobarVar::get() out of the class (thus making it no longer an inline function), the symbols become normal code and data symbols:

--- header_old.h	2023-06-15 13:38:09.729780723 -0400
+++ header_new.h	2023-06-15 13:47:01.526508670 -0400
@@ -6,10 +6,7 @@
     Init() { printf("Init %p\n", this); }
 };
 class GlobarVar {
-    static Init& get() {
-        static Init init;
-        return init;
-    }
+    static Init& get();

    public:
     GlobarVar() {
@@ -22,3 +19,7 @@
     f();
     printf("run() in %s, gint=%p\n", NAME, &gint);
 }
+Init& GlobarVar::get() {
+    static Init init;
+    return init;
+}
$ nm -C a.so  | grep ::get
0000000000004068 b guard variable for GlobarVar::get()::init
00000000000011e4 T GlobarVar::get()
0000000000004065 b GlobarVar::get()::init