1

I need to calculate the sum of a list of matrices, however, I can't use np.sum, even with axis=0, I don't know why. The current solution is a loop, but is there a better way for that?

import numpy as np
SAMPLE_SIZES = [10, 100, 1000, 10000]
ITERATIONS = 1
MEAN = np.array([1, 1])
COVARIANCE = np.array([[1, 0.5], [0.5, 1]])
for sample_size in SAMPLE_SIZES:
    max = -1
    for i in range(ITERATIONS):
        xs = np.random.multivariate_normal(MEAN, COVARIANCE, size=sample_size)
        sigma = [[0, 0], [0, 0]]
        for x in xs:
            sigma += np.outer((x-MEAN), (x-MEAN)) / (sample_size-1)

In the code above, can I replace the last loop using some numpy function? I guess using a loop would be not efficient if the data is very large.

8
  • 2
    SAMPLE_SIZES is a Python list, not a Numpy array. Commented Sep 14, 2016 at 7:00
  • 3
    Tell us about xs. Array? List? Shape ,dtype?. Is MEAN a scalar? Commented Sep 14, 2016 at 7:03
  • 2
    Please show us your erroneous implementation with np.sum, and the error message. Commented Sep 14, 2016 at 7:14
  • @ÉbeIsaac no, it doesn't, list(range(1)) is [0] Commented Sep 14, 2016 at 7:46
  • @ewcz: Sorry, clear now Commented Sep 14, 2016 at 7:48

2 Answers 2

2

Read up about numpy broadcasting.

xs = np.random.multivariate_normal(MEAN, COVARIANCE, size=sample_size)

xs now has shape (sample_size, 2), which means you can just subtract MEAN directly. You now need to take the outer product between xs - MEAN and xs - MEAN while adding over the sample_size axis. This is best done using np.einsum:

>>> sigma = np.einsum('ij,ik->jk', xs - MEAN, xs - MEAN) / sample_size
>>> sigma    
array([[ 1.00216043,  0.49549231],
       [ 0.49549231,  1.00004423]])

An alternative is to use broadcasting:

>>> sigma = np.sum((xs - MEAN)[:, :, np.newaxis]
                   * (xs - MEAN)[:, np.newaxis, :], axis=0) / sample_size

Though the broadcasting solution seems easier to understand, np.einsum will usually be more efficient than broadcasting.

Additional note: Note that I've divided by sample_size, and not by sample_size - 1. This is because for estimating the covariance matrix of a random variable with known mean, you need to divide by sample_size. Use sample_size - 1 when you are estimating the mean from the same dataset as well, and using it in your covariance estimate. Otherwise your covariance estimate will be biased.

Sign up to request clarification or add additional context in comments.

2 Comments

Thank you for your help, could you explain a little bit more the additional note? Actually this is part of my homework assignment, I just followed the equation given by professor. But I would like to hear it if this is an error
@1a1a11a I'd have to give you a whole course on estimation to explain why... Read about unbiased estimators, and then try to derive the unbiased estimator of variance for a gaussian random variable with known mean, and for one with unknown mean.
0

If you just want to compute the empirical covariance then I would suggest using numpy.cov(xs.T).

Otherwise, the last 3 lines can be replaced by:

xm = xs - np.mean(xs, axis=0)
sigma = np.inner(xm.T. xm.T) / (sample_size-1)

1 Comment

Thank you for your help, but the answer above is more relevant.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.