I'm having problems with Python code that uses PyTorch. The details are a bit complicated (the code is part of a quantum mechanical calculation), but the code structure is very straightforward and looks more or less like this:
# p is a batch containing 100000 sets of momenta.
# Each set contains four vectors in 3 dimensions.
p = momenta[startbatch:endbatch]
# p.shape: (100000, 4, 3)
# It should be easy to parallelize the following
# with respect to the first index of `p`:
result = 1.0  # * <complicated expression involving p>
# result.shape: (100000, 16, 16)
The same calculation <complicated expression involving p> is performed for, say, 100000 sets of momenta. Parallelizing this in Fortran would involve adding a simple !$omp parallel do.
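For concreteness, here is a runnable toy stand-in with the same shapes. The function batched_expression below is entirely made up (the real expression is much more involved), but the batched structure is the same:

import torch

# Hypothetical stand-in for <complicated expression involving p>:
# flatten each (4, 3) set of momenta to 12 numbers, append four
# constants, and take an outer product so each batch element
# yields a (16, 16) matrix.
def batched_expression(p):
    flat = p.reshape(p.shape[0], -1)                            # (B, 12)
    feat = torch.cat([flat, torch.ones(p.shape[0], 4)], dim=1)  # (B, 16)
    return torch.einsum("bi,bj->bij", feat, feat)               # (B, 16, 16)

p = torch.randn(100000, 4, 3)
result = batched_expression(p)  # result.shape: (100000, 16, 16)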
I'm using Python partially as a learning opportunity and partially because I would later like to automatically calculate gradients with respect to some parameters. Unfortunately, when measuring the performance of the code, I get the following relationship between execution time and number of cores used:
[plot: execution time (seconds) vs. number of threads]
Since memory is plentiful and the calculation can be easily parallelized along the first index of p, I would expect the execution time to decrease much more steeply. For instance, at 20 threads I would expect roughly 7 seconds: 1/20 of the single-thread time.
I'm guessing the automatic parallelization of <complicated expression involving p> is not optimal. Is it possible to specify explicitly that the calculation should be performed in parallel for each i in p[i, :, :]? I could run 20 (or more) Python workers and in each one evaluate <complicated expression involving p> with torch.set_num_threads(1) and torch.set_num_interop_threads(1), but I'm hoping there is a simpler / more elegant solution.
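In case it helps, here is a minimal sketch of that fallback, assuming torch.multiprocessing with 20 worker processes (rather than threads) and the toy batched_expression from above; torch.set_num_threads(1) pins each worker to a single intra-op thread:

import torch
import torch.multiprocessing as mp

def worker(chunk):
    torch.set_num_threads(1)  # one intra-op thread per worker process
    # same toy stand-in for <complicated expression involving p> as above
    flat = chunk.reshape(chunk.shape[0], -1)
    feat = torch.cat([flat, torch.ones(chunk.shape[0], 4)], dim=1)
    return torch.einsum("bi,bj->bij", feat, feat)

if __name__ == "__main__":
    p = torch.randn(100000, 4, 3)
    chunks = torch.chunk(p, 20, dim=0)  # split along the batch dimension
    with mp.Pool(processes=20) as pool:
        parts = pool.map(worker, chunks)
    result = torch.cat(parts, dim=0)    # (100000, 16, 16)

This works in principle, but it feels like working around PyTorch rather than with it, hence the question.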
Any tips or comments would be greatly appreciated.
PS: For a single thread, the performance is similar to that of the Fortran version.