Python 3.12 C-API segfaults with OpenMP

Question

Here is a small C++ program that embeds Python.

It works with Python 3.11.6, but segfaults with Python 3.12.0:

#include <iostream>
#include "omp.h"
#include "Python.h"

int main()
{
    Py_Initialize();
    
    #pragma omp parallel
    {
        #pragma omp single
        {
            std::cout << "One character:"<<std::endl;
            PyObject *nameobj1 = PyUnicode_FromString("a");
            std::cout << nameobj1 << std::endl;
            Py_DECREF(nameobj1);
            
            std::cout << "Two characters:"<<std::endl;
            PyObject *nameobj2 = PyUnicode_FromString("aa");
            std::cout << nameobj2 << std::endl;
            Py_DECREF(nameobj2);
        }
    }
    
    Py_Finalize();
}

Compiling and running with 3.11:

$ g++ pytest.cpp `python3.11-config --ldflags --cflags` -lpython3.11 -fopenmp
$ ./a.out 
One character:
0x730a12d466e0
Two characters:
0x730a121f33f0

Compiling and running with 3.12:

$ g++ pytest.cpp `python3.12-config --ldflags --cflags` -lpython3.12 -fopenmp
$ ./a.out 
One character:
0x734752e48a08
Two characters:
Segmentation fault (core dumped)

Has something changed in Python 3.12 that prevents using PyUnicode_FromString with more than one character under OpenMP? Is there a workaround?

Remarks:

g++ 13.2.0
2 OpenMP threads
The program works when compiled without -fopenmp

Here is a backtrace using gdb:

#0  0x00007ffff77f3f80 in _PyInterpreterState_GET () at ../Include/internal/pycore_pystate.h:118
#1  get_state () at ../Objects/obmalloc.c:866
#2  _PyObject_Malloc (ctx=<optimized out>, nbytes=43) at ../Objects/obmalloc.c:1563
#3  0x00007ffff782b509 in PyUnicode_New (maxchar=<optimized out>, size=2) at ../Objects/unicodeobject.c:1208
#4  PyUnicode_New (size=2, maxchar=<optimized out>) at ../Objects/unicodeobject.c:1154
#5  0x00007ffff7837081 in unicode_decode_utf8 (s=<optimized out>, size=2, error_handler=_Py_ERROR_UNKNOWN, errors=0x0, consumed=0x0)
    at ../Objects/unicodeobject.c:4647
#6  0x0000555555555422 in main._omp_fn.0(void) () at pytest.cpp:19
#7  0x00007ffff7f6b48e in gomp_thread_start (xdata=<optimized out>) at ../../../src/libgomp/team.c:129
#8  0x00007ffff6e97b5a in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
#9  0x00007ffff6f285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

I just found out that replacing #pragma omp single with #pragma omp master works! I guess this is a workaround, but it does not really make sense to me.
– fffred, Commented Mar 21, 2024 at 13:36
Based on the answer below, this behavior makes sense. The master thread is the one that called Py_Initialize, so it holds the GIL and can call the C-API without error. With single, the block runs on whichever thread reaches it first, which might not hold the GIL at that time.
– Joachim, Commented Mar 22, 2024 at 9:42

Ahmed AEK · Accepted Answer · 2024-03-21 16:42:08Z


Your code has a bug: it never acquires the GIL inside the child threads. You must hold the GIL when creating, deleting, or modifying any Python object (with a few exceptions for modification). The code was just as wrong under Python 3.11; it simply didn't crash there, whereas it does under Python 3.12.

Some of the interpreter state is thread-local, and acquiring the GIL properly initializes this state for the calling thread.

To acquire and release the GIL, use PyGILState_Ensure and PyGILState_Release, respectively.

You also need to release the GIL from the main thread before the parallel section (for example with PyEval_SaveThread, paired with PyEval_RestoreThread afterwards) to avoid deadlocks.

I think the biggest change is the per-interpreter GIL added in Python 3.12, which moved more state into thread-local storage and made your code crash. Before this change, your code was wrong but did not crash.

edited Mar 21, 2024 at 16:42

answered Mar 21, 2024 at 13:41


4 Comments

fffred Over a year ago

Thanks for this. Does it mean it is impossible to have calls to Py functions inside the parallel section? Can't I lock the GIL before the parallel section, then drop it after the parallel section?

Ahmed AEK Over a year ago

@fffred You can make calls inside the parallel section; whichever thread makes the call just has to acquire the GIL before the call and release it when the call ends. Basically, only one thread can be using Python objects at a time. You can create multiple interpreters in a single process using the subinterpreters API, but you can't pass Python objects between these interpreters.

Ahmed AEK Over a year ago

@fffred I think your code specifically breaks in Python 3.12 because of the per-interpreter GIL added in that release, which brought the biggest change to the interpreter's thread-local data.

fffred Over a year ago

Thank you for these explanations again. It is unfortunately very hard to test, as this seems very random.
