
I have a 2d numpy array called my_data. Each row represents one data point and each column represents a different attribute of that data point.

I have a function called processRow. It takes in a row, does some processing on the info, and returns the modified row. The row returned by the function is longer than the row taken in, because the function basically expands some categorical data into one-hot vectors.

How can I have a numpy array where every row has been processed by this function?

I tried

answer = np.array([])
for row in my_data:
    answer = np.append(answer,processRow(row))

but at the end, the answer is just a single really long row rather than a 2d grid
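
Edit: here is a complete snippet that reproduces the issue. The processRow below is just a simplified stand-in for my real function (for this demo I'm assuming the category code sits in column 0 and takes values 0-2, so it gets expanded into a one-hot vector):

import numpy as np

def processRow(row):
    # stand-in: expand the category code in column 0 into a one-hot vector
    one_hot = np.zeros(3, dtype=row.dtype)
    one_hot[int(row[0])] = 1
    return np.concatenate([one_hot, row[1:]])

my_data = np.array([[0, 10], [2, 20]])

answer = np.array([])
for row in my_data:
    answer = np.append(answer, processRow(row))

print(answer)        # [ 1.  0.  0. 10.  0.  0.  1. 20.]  <- one flat row
print(answer.shape)  # (8,) instead of (2, 4)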

  • This code can't be correct; it gives AttributeError: 'numpy.ndarray' object has no attribute 'append'. Please include an entire snippet that we can run to demonstrate the issue. Commented May 6, 2018 at 21:33
  • See the edited snippet in the post Commented May 6, 2018 at 21:37
  • List append is useful. np.append has too many booby traps. Reread its docs. What does it say about the axis parameter? Commented May 6, 2018 at 22:21
  • np.append, like vstack, is slow. Without axis it ravels the inputs. Commented May 6, 2018 at 23:27
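
A quick sketch illustrating the point in the last comment about the axis parameter (a minimal example, not from the original post):

import numpy as np

a = np.array([[1, 2], [3, 4]])
print(np.append(a, [5, 6]))            # no axis: inputs are raveled -> [1 2 3 4 5 6]
print(np.append(a, [[5, 6]], axis=0))  # axis=0: result stays 2-D, shapes must be compatible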

2 Answers


You can use vstack instead, since row has a different shape to answer. You also need to be explicit about the shape of answer:

In [11]: my_data = np.array([[1, 2], [3, 4]])
    ...: process_row = lambda x: x  # do nothing

In [12]: answer = np.empty((0, 2), dtype='int64')
    ...: for row in my_data:
    ...:     answer = np.vstack([answer, process_row(row)])
    ...:

In [13]: answer
Out[13]:
array([[ 1,  2],
       [ 3,  4]])

However, you're probably better off doing a list comprehension and then passing the result to numpy afterwards:

In [21]: np.array([process_row(row) for row in my_data])
Out[21]:
array([[1, 2],
       [3, 4]])
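
The same list-comprehension approach works when process_row actually makes the rows longer, as long as every processed row comes out the same length. A rough sketch with a hypothetical one-hot expansion (assuming the category code is in column 0 and takes values 0-2):

import numpy as np

def process_row(row):
    # hypothetical expansion: one-hot encode the category in column 0
    one_hot = np.zeros(3, dtype=row.dtype)
    one_hot[int(row[0])] = 1
    return np.concatenate([one_hot, row[1:]])

my_data = np.array([[0, 10], [2, 20]])

answer = np.array([process_row(row) for row in my_data])
# answer.shape is (2, 4):
# array([[ 1,  0,  0, 10],
#        [ 0,  0,  1, 20]])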

4 Comments

You may want to do this in Cython or pandas to be more performant, but it's unclear what the best strategy is without more information on process_row.
I tried this but it was way too slow. It ran for 10 minutes before I quit it. I think answer = np.vstack([answer, process_row(row)]) takes linear time because it copies answer each time, making it take O(n^2) time overall. @Aklys's solution runs much faster, so I'm accepting his solution.
@quantumbutterfly Yes, his answer is the same as the one at the end of mine ("you're better off doing a list comprehension") ... only more verbose.
Oh whoops, overlooked that part. Sorry about that

I'm not sure I entirely got what you were after without seeing a sample of the data, but hopefully this helps you get to the result you want. I simplified the concept: the function just adds one to each value in the row passed to it and then appends the sum of the results as a total (just to expand the size of the returned row). Of course, you could adjust the processing to whatever you want.

import numpy as np

def funky(x):
    temp = []
    for value in x:
        value += 1              # add one to each value in the row
        temp.append(value)
    temp.append(temp[0] + temp[1])  # append a total, so the returned row is longer than the input
    return np.array(temp)

my_data = np.array([[1, 1], [2, 2]])

answer = np.apply_along_axis(funky, 1, my_data)
print("This is the original data:\n{}".format(my_data))
print("This is the adjusted data:\n{}".format(answer))

Below is the before and after of the array modification:

This is the original data:
[[1 1]
 [2 2]]
This is the adjusted data:
[[2 2 4]
 [3 3 6]]

