
I have a numpy array and depending on the value from another array, I would like to either update the value of the row, or delete it, or add one.

Example:

arr holds all the values and must be kept up to date with the values from new_arr. If a value in the first column of new_arr exists in the first column of arr, then the second column of arr is updated. If the value does not exist, a new row is added. If the second column in new_arr == 0, the row in arr with the matching first column is deleted.

arr = np.array([[1, 10],
                [2, 15],
                [3,  5],
                [4, 10]])

new_arr = np.array([[2, 20], # 2 exists in arr and 20 > 0 --> update in arr
                    [5, 20], # 5 does not exist in arr --> add row in arr
                    [1, 0]]) # 1 exists in arr but col 2 == 0 --> delete row in arr

Then I would like to obtain:

arr = np.array([[2, 20],
                [3,  5],
                [4, 10],
                [5, 20]])

Observe that arr is ordered by the first column. Also, arr has a maximum length of 1000 rows.

Any simple and fast method please?

Initially arr and new_arr are lists; I've turned them into numpy arrays. However, as I do not do any heavy computation with arr, it would most likely be faster to keep it as a list.

  • Can you clarify your rules for manipulating arrays: When you say if value in NewArr exists in Arr, do you mean anywhere in Arr, just column 1 or just column 2? When you say the value of col 2 in arr is updated, how is it updated? When you say if value doesn't exist, do you mean the value of col 1 in newArr? When you say add a row, do you mean to arr? When you say if col 2 of NewArr = 0, then delete row in arr, do you mean the row in arr where col 1 == value of NewArr col 1? Commented Dec 24, 2020 at 13:46
  • Only col 1 is relevant for adding or updating. So if a value in col 1 of newArr exists in arr and col 2 > 0, then i update col 2. But if a value in col 1 of newArr exists in arr and col 2 = 0, then I delete the row where the value is in col 1 of arr. If a value in col 1 of newArr does not exist in arr then add row (in order of col 1). Commented Dec 24, 2020 at 15:46
  • I have added some comments to clarify a bit. Once again, both arr and newArr might be lists instead if that would be faster. Commented Dec 24, 2020 at 15:52
  • The description is pretty clear. I am convinced that updating the value to zero and later on masking these entries would be the better approach, but this does not solve your problem with adding entries to the numpy array. Are you sure that a dictionary is not the better approach for you? Commented Dec 24, 2020 at 15:55
  • I am by no means a numpy expert. But as far as I understand it deleting/adding rows/columns forces numpy to reassign a new array space and copying the data there. This could become a problem with larger arrays, hence my suggestion to mask empty entries. Numpy arrays are fast for vectorized operations on them but I understand that it is better to collect data in lists/dictionaries, transfer the final list/dictionary into an array, and perform the desired operations on the array then. But let's see what the gurus say. P.S.: What vectorized operation do you want to use with the numpy array? Commented Dec 24, 2020 at 16:21

2 Answers


Keeping the input arrays as numpy constructs, here's how I would do it.

def process_arrays(np1, np2):
    np1d = dict((np1[x][0], np1[x][1]) for x in range(len(np1)))
    np2d = dict((np2[x][0], np2[x][1]) for x in range(len(np2)))
    for ky2 in np2d.keys():
        if ky2 in np1d.keys():
            if np2d[ky2] == 0:
                del np1d[ky2]
            else:
                np1d[ky2] = np2d[ky2]
        else:
            np1d[ky2] = np2d[ky2]
    return np.array(np1d)   

Given your input, executing:

process_arrays(arr, new_arr)

Yields:

array({2: 20, 3: 5, 4: 10, 5: 20}, dtype=object)

3 Comments

There's no point in using numpy like this. You made an array with a single dictionary element in it. That's just convoluted and unnecessary overhead.
So is the dictionary solution proposed by itprorh66 the best one or a list-based approach more relevant in terms of speed of execution?
@NicolasRey, you could certainly replace the second array with a list and iterate through the list as opposed to iterating through the keys of the second dict. I don't know how much of a difference in performance that would make, but not having to create the second dict should provide some improvement. I think converting the first array to a dict is a good idea, since it allows for quick access to the elements.
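Following up on the comments above, here is a minimal sketch of a dict-based merge that hands back a regular two-column array sorted by key instead of wrapping the dict itself. The name merge_with_dict is hypothetical, and the sketch assumes integer (key, value) rows with unique keys. Unlike the answer's code, it ignores a zero-valued row whose key is not in arr, which is my reading of the rules in the question.

```python
import numpy as np

def merge_with_dict(arr, new_arr):
    """Hypothetical helper: apply update/add/delete rules through a dict."""
    # Build a key -> value lookup from the existing rows.
    merged = {int(k): int(v) for k, v in arr}
    for k, v in new_arr:
        k, v = int(k), int(v)
        if v == 0:
            # Delete the key if present; silently skip it otherwise (assumption).
            merged.pop(k, None)
        else:
            # Same statement covers both "update" and "add".
            merged[k] = v
    # Sort by key so the result stays ordered by the first column.
    return np.array(sorted(merged.items()))

arr = np.array([[1, 10], [2, 15], [3, 5], [4, 10]])
new_arr = np.array([[2, 20], [5, 20], [1, 0]])
result = merge_with_dict(arr, new_arr)
print(result)
```

With at most 1000 rows the conversion cost is small; if arr is never used for vectorized math, keeping the dict (or a sorted list of pairs) and skipping the final np.array call is probably simpler still.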

You really want three different operations, each of which is easy to implement. Adding and deleting allocates new arrays, so you want to do those in bulk one time. Your goal is therefore mostly to split the new data into three portions.

First identify and split the delete portion:

mask = new_arr[:, -1] == 0
to_del = new_arr[mask, :]
to_add_update = new_arr[~mask, :]
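
As a quick check, here is what that split produces on the new_arr from the question (a small self-contained sketch, nothing beyond the three lines above plus the sample data):

```python
import numpy as np

new_arr = np.array([[2, 20],
                    [5, 20],
                    [1, 0]])

mask = new_arr[:, -1] == 0          # True where the second column is 0
to_del = new_arr[mask, :]           # rows that request a deletion
to_add_update = new_arr[~mask, :]   # rows that add or update

print(to_del.tolist())          # [[1, 0]]
print(to_add_update.tolist())   # [[2, 20], [5, 20]]
```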

Now you can find the insertion indices of the add and update portions:

insert_index = np.searchsorted(arr[:, 0], to_add_update[:, 0])

The elements whose insertion indices match between the arrays are places where you want to update vs the ones you want to insert. Let's define a function for this since we can use it twice:

def get_insert_update_index(a, v):
    """
    Get new and existing insertion indices.

    Parameters
    ----------
    a :
        The sorted 1-D array to insert into.
    v :
        The values to insert.

    Returns
    -------
    mask :
        True indicates new elements of `v`.
    insert :
        Insertion indices of new elements in `a`.
    update :
        Indices of existing elements in `a`.
    """
    index = np.searchsorted(a, v)
    # Anything that would land past the end of `a` is certainly new.
    mask = index >= a.size
    # For the rest, the value is new only if `a` holds something else there.
    mask2 = a[index[~mask]] != v[~mask]
    mask[~mask] = mask2
    return mask, index[mask], index[~mask]

mask, insert_index, update_index = get_insert_update_index(arr[:, 0], to_add_update[:, 0])

If you process the updates first, your indices won't change:

arr[update_index, -1] = to_add_update[~mask, -1]

Processing an insertion involves making a new array:

arr = np.insert(arr, insert_index, to_add_update[mask, :], axis=0)

Now that you've done the insertion, you will need to recompute the indices of the deletions you found in the beginning. You probably don't want to remove the non-existent stuff, so get_insert_update_index is going to come in handy again:

_, _, delete_index = get_insert_update_index(arr[:, 0], to_del[:, 0])
arr = np.delete(arr, delete_index, axis=0)
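
Putting the steps together, here is a sketch of the whole procedure run on the data from the question. The wrapper name apply_updates is hypothetical. It assumes arr is sorted by its first column with unique keys and that the keys in new_arr are unique. It also sorts the incoming rows by key before inserting, which is an extra step I added so that several new rows landing at the same position come out in order.

```python
import numpy as np

def get_insert_update_index(a, v):
    # Same logic as the function above, repeated so this block runs on its own.
    index = np.searchsorted(a, v)
    mask = index >= a.size
    mask[~mask] = a[index[~mask]] != v[~mask]
    return mask, index[mask], index[~mask]

def apply_updates(arr, new_arr):
    """Hypothetical wrapper: update, then insert, then delete."""
    zero = new_arr[:, -1] == 0
    to_del = new_arr[zero]
    to_add_update = new_arr[~zero]
    # Added assumption: sort incoming rows by key to keep the result ordered.
    to_add_update = to_add_update[np.argsort(to_add_update[:, 0])]

    is_new, insert_index, update_index = get_insert_update_index(
        arr[:, 0], to_add_update[:, 0])
    out = arr.copy()                                   # leave the input untouched
    out[update_index, -1] = to_add_update[~is_new, -1]  # updates first
    out = np.insert(out, insert_index, to_add_update[is_new], axis=0)

    # Indices shifted after the insert, so look the deletions up again.
    _, _, delete_index = get_insert_update_index(out[:, 0], to_del[:, 0])
    return np.delete(out, delete_index, axis=0)

arr = np.array([[1, 10], [2, 15], [3, 5], [4, 10]])
new_arr = np.array([[2, 20], [5, 20], [1, 0]])
result = apply_updates(arr, new_arr)
print(result.tolist())   # [[2, 20], [3, 5], [4, 10], [5, 20]]
```

The function allocates a new array once for the insert and once for the delete, however many rows are involved, which is the point of batching the three operations.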
