Optimize this function with numpy (or other vectorization methods)

Question

I am computing with Python a classic calculation in the field of population genetics. I am well aware that there exists many algorithm that do the job but I wanted to build my own for some reason.

The below paragraph is a picture because MathJax is not supported on StackOverflow

enter image description here

I would like to have an efficient algorithm to calculate those Fst. For the moment I only manage to make for loops and no calculations are vectorized How can I make this calculation using numpy (or other vectorization methods)?

Here is a code that I think should do the job:

def Fst(W, p):
    I = len(p[0])
    K = len(p)
    H_T = 0
    H_S = 0
    for i in xrange(I):
        bar_p_i = 0
        for k in xrange(K):
            bar_p_i += W[k] * p[k][i]
            H_S += W[k] * p[k][i] * p[k][i]
        H_T += bar_p_i*bar_p_i
    H_T = 1 - H_T
    H_S = 1 - H_S
    return (H_T - H_S) / H_T

def main():
    W = [0.2, 0.1, 0.2, 0.5]
    p = [[0.1,0.3,0.6],[0,0,1],[0.4,0.5,0.1],[0,0.1,0.9]]
    F = Fst(W,p)
    print("Fst = " + str(F))
    return

main()

I don't think MathJax works on StackOverflow. Consider posting your equations as images or re-writing them in a more readable way in plaintext. — aganders3
– aganders3, Commented Jul 1, 2015 at 17:45
Have you considered using cython, it would not be too much work to transfer. — kezzos
– kezzos, Commented Jul 1, 2015 at 17:46
@aganders3 Thanks. I made an image out of my paragraph that contains equations. (I'd wish MathJax to be supported on StackOverflow). — Remi.b
– Remi.b, Commented Jul 1, 2015 at 17:48
@kezzos I don't know cython. Is it more or less the equivalent to Rcpp in R? — Remi.b
– Remi.b, Commented Jul 1, 2015 at 17:50
The thing is, if you are aware that this function should be vectorized in numpy, why don't you check out one the multuple numpy tutorals online and try doing it yourself (and post here if you have issues with it)? SO is not a code writing service either. Possible duplicate of stackoverflow.com/questions/20108817/… etc. I agree with @kezzos that converting your existing code to Cython would require only a minimal effort. — rth
– rth, Commented Jul 1, 2015 at 17:55

Patrick Mineault · Accepted Answer · 2015-07-01 18:17:37Z

There's no reason to use loops here. And you really shouldn't use Numba or Cython for this stuff - linear algebra expressions like the one you have are the whole reason behind vectorized operations in Numpy.

Since this type of problem is going to pop up again and again if you keep using Numpy, I would recommend getting a basic handle on linear algebra in Numpy. You might find this book chapter helpful:

https://www.safaribooksonline.com/library/view/python-for-data/9781449323592/ch04.html

As for your specific situation: start by creating numpy arrays from your variables:

import numpy as np
W = np.array(W)
p = np.array(p)

Now, your \bar p_i^2 are defined by a dot product. That's easy:

bar_p_i = p.T.dot(W)

Note the T, for the transpose, because the dot product takes the sum of the elements indexed by the last index of the first matrix and the first index of the second matrix. The transpose inverts the indices so the first index becomes the last.

You H_t is defined by a sum. That's also easy:

H_T = 1 - bar_p_i.sum()

Similarly for your H_S:

H_S = 1 - ((bar_p_i**2).T.dot(W)).sum()

Collectives™ on Stack Overflow

Optimize this function with numpy (or other vectorization methods)

1 Answer 1

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related