
Here is a small C++ program that embeds Python.

It works with Python 3.11.6 but segfaults with Python 3.12.0:

#include <iostream>
#include "omp.h"
#include "Python.h"

int main()
{
    Py_Initialize();
    
    #pragma omp parallel
    {
        #pragma omp single
        {
            std::cout << "One character:"<<std::endl;
            PyObject *nameobj1 = PyUnicode_FromString("a");
            std::cout << nameobj1 << std::endl;
            Py_DECREF(nameobj1);
            
            std::cout << "Two characters:"<<std::endl;
            PyObject *nameobj2 = PyUnicode_FromString("aa");
            std::cout << nameobj2 << std::endl;
            Py_DECREF(nameobj2);
        }
    }
    
    Py_Finalize();
}

Compiling and running with 3.11:

$ g++ pytest.cpp `python3.11-config --ldflags --cflags` -lpython3.11 -fopenmp
$ ./a.out 
One character:
0x730a12d466e0
Two characters:
0x730a121f33f0

Compiling and running with 3.12:

$ g++ pytest.cpp `python3.12-config --ldflags --cflags` -lpython3.12 -fopenmp
$ ./a.out 
One character:
0x734752e48a08
Two characters:
Segmentation fault (core dumped)

Has something changed in Python 3.12 that prevents using PyUnicode_FromString with more than one character under OpenMP? Is there a workaround?

Remarks:

  • g++ 13.2.0
  • 2 OpenMP threads
  • It works when compiled without -fopenmp
  • Here is a backtrace from gdb:
#0  0x00007ffff77f3f80 in _PyInterpreterState_GET () at ../Include/internal/pycore_pystate.h:118
#1  get_state () at ../Objects/obmalloc.c:866
#2  _PyObject_Malloc (ctx=<optimized out>, nbytes=43) at ../Objects/obmalloc.c:1563
#3  0x00007ffff782b509 in PyUnicode_New (maxchar=<optimized out>, size=2) at ../Objects/unicodeobject.c:1208
#4  PyUnicode_New (size=2, maxchar=<optimized out>) at ../Objects/unicodeobject.c:1154
#5  0x00007ffff7837081 in unicode_decode_utf8 (s=<optimized out>, size=2, error_handler=_Py_ERROR_UNKNOWN, errors=0x0, consumed=0x0)
    at ../Objects/unicodeobject.c:4647
#6  0x0000555555555422 in main._omp_fn.0(void) () at pytest.cpp:19
#7  0x00007ffff7f6b48e in gomp_thread_start (xdata=<optimized out>) at ../../../src/libgomp/team.c:129
#8  0x00007ffff6e97b5a in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
#9  0x00007ffff6f285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
  • I just found out that replacing #pragma omp single by #pragma omp master works! I guess this is a workaround, but it does not really make sense to me. Commented Mar 21, 2024 at 13:36
  • Based on the answer below this behavior makes sense. The master thread owns the lock and therefore can access the GIL without error. Using single selects a random thread which might not own the lock at that time. Commented Mar 22, 2024 at 9:42

1 Answer

Your code has a bug: it never acquires the GIL inside the child threads. You must hold the GIL whenever you create, delete, or modify any Python object (with a few exceptions for modification). Your code happened not to crash in Python 3.11, but it crashes in Python 3.12.

Some of the interpreter state is thread-local, and acquiring the GIL properly initializes this state for the calling thread.

To acquire and release the GIL from a thread, use PyGILState_Ensure and PyGILState_Release respectively.

You also need to release the GIL in the main thread before the parallel section (for example with PyEval_SaveThread, restoring it afterwards with PyEval_RestoreThread) to avoid deadlocks.

I think the biggest change is the per-interpreter GIL, which was added in Python 3.12 and pushed more state into thread-local storage. That is what makes your code crash now; before this change your code was still wrong, but it did not crash.


4 Comments

Thanks for this. Does it mean it is impossible to have calls to Py functions inside the parallel section? Can't I lock the GIL before the parallel section, then drop it after the parallel section?
@fffred You can have calls inside the parallel section. Whichever thread makes the call just has to acquire the GIL before the call and release it when the call ends; basically, only one thread can be using Python objects at a time. You can create multiple interpreters in a single process using the subinterpreters API, but you can't pass Python objects between these interpreters.
@fffred I think your code specifically breaks in Python 3.12 because of the per-interpreter GIL that was added in that release; it was the biggest change to the thread-local data in the interpreter.
Thank you for these explanations again. It is unfortunately very hard to test, as this seems very random.
