Most efficient way to forward-fill NaN values in numpy array

Question

Example Problem

As a simple example, consider the numpy array arr as defined below:

import numpy as np
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

where arr looks like this in console output:

array([[  5.,  nan,  nan,   7.,   2.],
       [  3.,  nan,   1.,   8.,  nan],
       [  4.,   9.,   6.,  nan,  nan]])

I would now like to row-wise 'forward-fill' the nan values in array arr. By that I mean replacing each nan value with the nearest valid value from the left. The desired result would look like this:

array([[  5.,   5.,   5.,  7.,  2.],
       [  3.,   3.,   1.,  8.,  8.],
       [  4.,   9.,   6.,  6.,  6.]])

Tried thus far

I've tried using for-loops:

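for row_idx in range(arr.shape[0]):
    # Start at column 1: column 0 has no left neighbour, and col_idx - 1 == -1
    # would silently wrap around to the last column.
    for col_idx in range(1, arr.shape[1]):
        if np.isnan(arr[row_idx, col_idx]):
            arr[row_idx, col_idx] = arr[row_idx, col_idx - 1]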

I've also tried using a pandas dataframe as an intermediate step (since pandas dataframes have a very neat built-in method for forward-filling):

import pandas as pd
df = pd.DataFrame(arr)
df.ffill(axis=1, inplace=True)  # replaces the deprecated fillna(method='ffill')
arr = df.to_numpy()  # as_matrix() was removed in pandas 1.0

Both of the above strategies produce the desired result, but I keep on wondering: wouldn't a strategy that uses only numpy vectorized operations be the most efficient one?

Summary

Is there another more efficient way to 'forward-fill' nan values in numpy arrays? (e.g. by using numpy vectorized operations)

Update: Solutions Comparison

I've tried to time all solutions thus far. This was my setup script:

import numba as nb
import numpy as np
import pandas as pd

def random_array():
    choices = [1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan]
    out = np.random.choice(choices, size=(1000, 10))
    return out

def loops_fill(arr):
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

@nb.jit
def numba_loops_fill(arr):
    '''Numba decorator solution provided by shx2.'''
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out

def pandas_fill(arr):
    df = pd.DataFrame(arr)
    df.ffill(axis=1, inplace=True)  # replaces the deprecated fillna(method='ffill')
    out = df.to_numpy()  # as_matrix() was removed in pandas 1.0
    return out

def numpy_fill(arr):
    '''Solution provided by Divakar.'''
    mask = np.isnan(arr)
    idx = np.where(~mask,np.arange(mask.shape[1]),0)
    np.maximum.accumulate(idx,axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    return out

followed by this console input:

%timeit -n 1000 loops_fill(random_array())
%timeit -n 1000 numba_loops_fill(random_array())
%timeit -n 1000 pandas_fill(random_array())
%timeit -n 1000 numpy_fill(random_array())

resulting in this console output:

1000 loops, best of 3: 9.64 ms per loop
1000 loops, best of 3: 377 µs per loop
1000 loops, best of 3: 455 µs per loop
1000 loops, best of 3: 351 µs per loop
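Note that the %timeit lines above also time random_array() on every call. Here is a minimal sketch of a variant that builds the input once and times only the fill, using the standard-library timeit module. The seed, the repeat count and the name data are assumptions for illustration, and no timings are claimed here:

```python
import timeit

import numpy as np


def numpy_fill(arr):
    """Vectorized row-wise forward-fill (same logic as numpy_fill above)."""
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]


# Build the test input once so that array generation is not part of the timing.
rng = np.random.default_rng(0)  # fixed seed: an arbitrary choice for repeatability
choices = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, np.nan])
data = rng.choice(choices, size=(1000, 10))

# Total time in seconds for 100 calls on the same pre-built array.
elapsed = timeit.timeit(lambda: numpy_fill(data), number=100)
print(f"{elapsed / 100 * 1e6:.1f} µs per call")
```

Timing each candidate on the same pre-built array should make the comparison less sensitive to the cost of np.random.choice.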

@TadhgMcDonald-Jensen In this case, pandas leaves the NaN untouched. I would assume the OP wants the same behavior for consistency.
– DYZ, Commented Dec 16, 2016 at 19:14
You might find this related question useful: "Fill zero values of 1d numpy array with last nonzero values".
– boot-scootin, Commented Dec 16, 2016 at 19:15
Ah, good question. In my use cases the first column of the input array is never supposed to contain any nan values. So it's okay for me if the code, on encountering a nan in the first column, either raises an exception or leaves that nan in place.
– Xukrao, Commented Dec 16, 2016 at 19:19
BTW, there is not even a need to call as_matrix(): the original arr is changed.
– DYZ, Commented Dec 16, 2016 at 19:32

Divakar · Accepted Answer · 2016-12-16 20:04:32Z

83

Here's one approach -

mask = np.isnan(arr)
idx = np.where(~mask,np.arange(mask.shape[1]),0)
np.maximum.accumulate(idx,axis=1, out=idx)
out = arr[np.arange(idx.shape[0])[:,None], idx]

If you don't want to create another array and just fill the NaNs in arr itself, replace the last step with this -

arr[mask] = arr[np.nonzero(mask)[0], idx[mask]]

Sample input, output -

In [179]: arr
Out[179]: 
array([[  5.,  nan,  nan,   7.,   2.,   6.,   5.],
       [  3.,  nan,   1.,   8.,  nan,   5.,  nan],
       [  4.,   9.,   6.,  nan,  nan,  nan,   7.]])

In [180]: out
Out[180]: 
array([[ 5.,  5.,  5.,  7.,  2.,  6.,  5.],
       [ 3.,  3.,  1.,  8.,  8.,  5.,  5.],
       [ 4.,  9.,  6.,  6.,  6.,  6.,  7.]])
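To see why this works, it can help to print the intermediate index arrays. The following is a small walkthrough sketch on the array from the question; the name filled_idx is introduced here only for illustration:

```python
import numpy as np

arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])

mask = np.isnan(arr)

# Each valid entry keeps its own column index; each NaN gets 0 as a placeholder.
idx = np.where(~mask, np.arange(mask.shape[1]), 0)
print(idx)
# [[0 0 0 3 4]
#  [0 0 2 3 0]
#  [0 1 2 0 0]]

# A running maximum along each row turns every placeholder into the column
# index of the most recent valid entry to its left.
filled_idx = np.maximum.accumulate(idx, axis=1)
print(filled_idx)
# [[0 0 0 3 4]
#  [0 0 2 3 3]
#  [0 1 2 2 2]]

# Fancy indexing: pick arr[row, filled_idx[row, col]] for every position.
out = arr[np.arange(arr.shape[0])[:, None], filled_idx]
print(out)
```

Because a NaN in column 0 keeps index 0, it is looked up from itself and therefore stays NaN.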

edited Dec 16, 2016 at 20:04

answered Dec 16, 2016 at 19:20


7 Comments

Xukrao Over a year ago

A vectorized numpy-only solution, nice. Thanks! This solution indeed appears to be faster than the loop-based and pandas-based solutions (see timings in updated question).

Divakar Over a year ago

@Xukrao Yeah I just saw those, thanks for adding in those timing results! Good to see some speedups there!

user189035 Over a year ago

How do you adapt this solution to the case arr is a one dimensional numpy array? Like numpy.array([0.83, 0.83, 0.83, 0.83, nan, nan, nan])?

C8H10N4O2 Over a year ago

@user189035 replace mask.shape[1] with mask.size and remove axis=1 and replace the last line with out = arr[idx]

financial_physician Over a year ago

I had a case where I built a second matrix for what I wanted to forward fill with. On the last line I just replaced arr with fillMatrix. My case was reducing resolution on time-series data, so I forward filled with the most recent entry
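The one-dimensional adaptation described in the comments above can be sketched as follows. The function name ffill_1d is hypothetical; the body just applies the three substitutions from C8H10N4O2's comment:

```python
import numpy as np


def ffill_1d(arr):
    """Forward-fill NaNs in a 1-D float array (sketch of the 1-D adaptation)."""
    mask = np.isnan(arr)
    # mask.size replaces mask.shape[1]; no axis argument is needed in 1-D.
    idx = np.where(~mask, np.arange(mask.size), 0)
    np.maximum.accumulate(idx, out=idx)
    return arr[idx]


values = np.array([0.83, 0.83, np.nan, 0.5, np.nan, np.nan])
print(ffill_1d(values))
```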


Scott B · Answer · 2020-07-15 16:22:58Z

I liked Divakar's answer on pure numpy. Here's a generalized function for n-dimensional arrays:

def np_ffill(arr, axis):
    idx_shape = tuple([slice(None)] + [np.newaxis] * (len(arr.shape) - axis - 1))
    idx = np.where(~np.isnan(arr), np.arange(arr.shape[axis])[idx_shape], 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)
    slc = [np.arange(k)[tuple([slice(None) if dim==i else np.newaxis
        for dim in range(len(arr.shape))])]
        for i, k in enumerate(arr.shape)]
    slc[axis] = idx
    return arr[tuple(slc)]

AFAIK, pandas can only work with two dimensions, even though a MultiIndex can make up for that. The only way to accomplish this in pandas would be to flatten the data into a DataFrame, unstack the desired level, restack, and finally reshape to the original shape. This unstacking/restacking/reshaping, with the sorting pandas does along the way, is unnecessary overhead for achieving the same result.

Testing:

def random_array(shape):
    choices = [1, 2, 3, 4, np.nan]
    out = np.random.choice(choices, size=shape)
    return out

ra = random_array((2, 4, 8))
print('arr')
print(ra)
print('\nffill')
print(np_ffill(ra, 1))
raise SystemExit

Output:

arr
[[[ 3. nan  4.  1.  4.  2.  2.  3.]
  [ 2. nan  1.  3. nan  4.  4.  3.]
  [ 3.  2. nan  4. nan nan  3.  4.]
  [ 2.  2.  2. nan  1.  1. nan  2.]]

 [[ 2.  3.  2. nan  3.  3.  3.  3.]
  [ 3.  3.  1.  4.  1.  4.  1. nan]
  [ 4.  2. nan  4.  4.  3. nan  4.]
  [ 2.  4.  2.  1.  4.  1.  3. nan]]]

ffill
[[[ 3. nan  4.  1.  4.  2.  2.  3.]
  [ 2. nan  1.  3.  4.  4.  4.  3.]
  [ 3.  2.  1.  4.  4.  4.  3.  4.]
  [ 2.  2.  2.  4.  1.  1.  3.  2.]]

 [[ 2.  3.  2. nan  3.  3.  3.  3.]
  [ 3.  3.  1.  4.  1.  4.  1.  3.]
  [ 4.  2.  1.  4.  4.  3.  1.  4.]
  [ 2.  4.  2.  1.  4.  1.  3.  4.]]]
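As an alternative sketch of the same n-dimensional idea, the final gather step can be written with np.take_along_axis, which avoids building the per-axis index list by hand and also accepts negative axes. The function name ffill_along_axis is hypothetical, and this is not the code from the answer above:

```python
import numpy as np


def ffill_along_axis(arr, axis=-1):
    """Forward-fill NaNs along one axis of an n-dimensional float array."""
    arr = np.asarray(arr, dtype=float)
    n = arr.shape[axis]

    # Positions 0..n-1 laid out along `axis` so they broadcast against arr.
    shape = [1] * arr.ndim
    shape[axis] = n
    positions = np.arange(n).reshape(shape)

    # Valid entries keep their position, NaNs get 0; the running maximum then
    # yields the position of the most recent valid entry along `axis`.
    idx = np.where(~np.isnan(arr), positions, 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)

    return np.take_along_axis(arr, idx, axis=axis)


cube = np.array([[[1.0, np.nan],
                  [np.nan, 2.0],
                  [3.0, np.nan]]])
print(ffill_along_axis(cube, axis=1))
```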

cchwala · Answer · 2021-11-18 20:29:05Z

7

Update: As pointed out by financial_physician in the comments, my initially proposed solution can simply be replaced by applying ffill to the reversed array and then reversing the result. There is no relevant performance loss: my initial solution seems to be only 2% or 3% faster according to %timeit. I updated the code example below but left my initial text as it was.

For those who came here looking for a backward-fill of NaN values, I modified the solution provided by Divakar above to do exactly that. The trick is to do the accumulation on the reversed array, using the minimum instead of the maximum.

Here is the code:


# ffill along axis 1, as provided in the answer by Divakar
def ffill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    return out

# Simple solution for bfill provided by financial_physician in comment below
def bfill(arr): 
    return ffill(arr[:, ::-1])[:, ::-1]

# My outdated modification of Divakar's answer to do a backward-fill
def bfill_old(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), mask.shape[1] - 1)
    idx = np.minimum.accumulate(idx[:, ::-1], axis=1)[:, ::-1]
    out = arr[np.arange(idx.shape[0])[:,None], idx]
    return out


# Test both functions
arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
print('Array:')
print(arr)

print('\nffill')
print(ffill(arr))

print('\nbfill')
print(bfill(arr))

Output:

Array:
[[ 5. nan nan  7.  2.]
 [ 3. nan  1.  8. nan]
 [ 4.  9.  6. nan nan]]

ffill
[[5. 5. 5. 7. 2.]
 [3. 3. 1. 8. 8.]
 [4. 9. 6. 6. 6.]]

bfill
[[ 5.  7.  7.  7.  2.]
 [ 3.  1.  1.  8. nan]
 [ 4.  9.  6. nan nan]]
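A quick way to check that the two backward-fill variants agree is to compare them on random input. This is only a sanity-check sketch; the seed and the array shape are arbitrary assumptions:

```python
import numpy as np


def ffill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]


def bfill(arr):
    # Reverse each row, forward-fill, then reverse back.
    return ffill(arr[:, ::-1])[:, ::-1]


def bfill_old(arr):
    mask = np.isnan(arr)
    # NaNs point at the last column until a valid entry to their right is found.
    idx = np.where(~mask, np.arange(mask.shape[1]), mask.shape[1] - 1)
    idx = np.minimum.accumulate(idx[:, ::-1], axis=1)[:, ::-1]
    return arr[np.arange(idx.shape[0])[:, None], idx]


rng = np.random.default_rng(42)  # arbitrary seed
sample = rng.choice(np.array([1.0, 2.0, 3.0, np.nan]), size=(50, 12))

# assert_array_equal treats NaNs in matching positions as equal.
np.testing.assert_array_equal(bfill(sample), bfill_old(sample))
print("bfill and bfill_old agree on the sample")
```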

Edit: Update according to comment of MS_

edited Nov 18, 2021 at 20:29

answered Feb 3, 2019 at 22:00


3 Comments

MS_ Over a year ago

idx = np.where(~mask, np.arange(mask.shape[1]), mask.shape[0] + 1) in bfill should be idx = np.where(~mask, np.arange(mask.shape[1]), mask.shape[1] - 1)

financial_physician Over a year ago

Isn't flipping O(n) and you're doing it twice so wouldn't flipping, using forward fill, and then unflipping, be just as fast as your bfill method with the original array?

cchwala Over a year ago

Thanks! This is indeed a very good point. I did time your solution and mine using %%timeit and there is only a negligible but consistent difference, 10.3 µs (your solution) vs 9.95 µs (my solution). I will update my response accordingly.

Joseph Gonzalez · Answer · 2021-11-24 22:44:41Z

5

The push function from bottleneck is a good option for forward-filling. It is used internally by packages such as xarray, it should be faster than the other alternatives, and the package also ships with its own set of benchmarks.

Example:

import numpy as np

from bottleneck import push

a = np.array(
    [
        [1, np.nan, 3],
        [np.nan, 3, 2],
        [2, np.nan, np.nan]
    ]
)
push(a, axis=0)
array([[ 1., nan,  3.],
       [ 1.,  3.,  2.],
       [ 2.,  3.,  2.]])

edited Nov 24, 2021 at 22:44

answered Nov 22, 2021 at 15:24


shx2 · Answer · 2016-12-16 20:25:16Z

4

Use Numba. This should give a significant speedup:

import numba
@numba.jit
def loops_fill(arr):
    ...
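The snippet above elides the loop body. As a minimal runnable sketch, assuming the same double loop as loops_fill from the question, it could look like this. The name numba_ffill and the import fallback are hypothetical additions for illustration:

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback so the sketch still runs (uncompiled) when Numba is not installed.
    def njit(func):
        return func


@njit
def numba_ffill(arr):
    # Plain double loop; Numba compiles it to machine code on the first call.
    out = arr.copy()
    for row_idx in range(out.shape[0]):
        for col_idx in range(1, out.shape[1]):
            if np.isnan(out[row_idx, col_idx]):
                out[row_idx, col_idx] = out[row_idx, col_idx - 1]
    return out


arr = np.array([[5, np.nan, np.nan, 7, 2],
                [3, np.nan, 1, 8, np.nan],
                [4, 9, 6, np.nan, np.nan]])
print(numba_ffill(arr))
```

The first call includes compilation time, so it is worth calling the function once before timing it.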

answered Dec 16, 2016 at 20:25


3 Comments

Xukrao Over a year ago

Would Numba only speed up the loops-based solution? Or would it speed up the other solutions as well?

shx2 Over a year ago

It is good for loops. It would not speed up functions implemented in numpy/pandas.

Xukrao Over a year ago

Thanks! I've included this solution in the timing comparison (see updated question). It looks like the addition of the numba decorator to the loop-based solution reduces its runtime by one order of magnitude.

Charles Woo · Answer · 2021-07-08 08:18:47Z

I like Divakar's answer, but it doesn't work for an edge case where a row starts with np.nan, like the arr below

arr = np.array([[9, np.nan, 4, np.nan, 6, 6, 7, 2, 3, np.nan],
                [np.nan, 5, 5, 6, 5, 3, 2, 1, np.nan, 10]])

The output using Divakar's code would be:

[[ 9.  9.  4.  4.  6.  6.  7.  2.  3.  3.]
 [nan  5.  5.  6.  5.  3.  2.  1.  1. 10.]]

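Divakar's code can be simplified a bit. The simplified version also fills a leading np.nan, although it does so by wrapping around: for a nan in the first column, the column index -1 refers to the last element of that row.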

arr[np.isnan(arr)] = arr[np.nonzero(np.isnan(arr))[0], np.nonzero(np.isnan(arr))[1]-1]

In case of several np.nans in a row (either in the beginning or in the middle), just repeat this operation several times. For instance, if the array has 5 consecutive np.nans, the following code will "forward fill" all of them with the number before these np.nans:

for i in range(0, 5):
    arr[np.isnan(arr)] = arr[np.nonzero(np.isnan(arr))[0], np.nonzero(np.isnan(arr))[1]-1]
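One thing to keep in mind with this shifted-index approach: for a NaN in the first column, the column index becomes -1, which NumPy interprets as the last column of the same row. A small sketch on a made-up array shows the effect:

```python
import numpy as np

arr = np.array([[np.nan, 5.0, 6.0],
                [1.0, np.nan, 3.0]])

rows, cols = np.nonzero(np.isnan(arr))
# For the NaN at column 0, cols - 1 is -1, i.e. the last column of that row.
arr[rows, cols] = arr[rows, cols - 1]
print(arr)
# [[6. 5. 6.]
#  [1. 1. 3.]]
```

So the leading NaN in the first row is replaced by 6.0, the value at the end of the row, rather than by a value to its left.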

Vikrant Gupta · Answer · 2022-01-13 09:02:43Z

3

Use the bottleneck module. It is an optional dependency of pandas, so it may already be present in your environment; if it is not, it has to be installed separately.

The code below should give you the desired result.

import bottleneck as bn
bn.push(arr,axis=1)

edited Jan 13, 2022 at 9:02

answered Jan 11, 2022 at 12:17


christian_bock · Answer · 2018-10-09 11:31:13Z

2

For those who are interested in the problem of having leading np.nan after forward-filling, the following works:

mask = np.isnan(arr)
first_valid_idx = (~mask).argmax(axis=1)  # index of the first non-NaN value in each row
# Replace the leading NaNs of each row with that row's first valid value
arr = np.array([np.hstack([np.full(first_valid, arr[i, first_valid]),
                           arr[i, first_valid:]])
                for i, first_valid in enumerate(first_valid_idx)])
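A vectorized way to get a similar effect, sketched here under the assumption that the ffill/bfill index trick from the other answers is acceptable, is to forward-fill first and then backward-fill whatever leading NaNs remain. The name fill_both is hypothetical:

```python
import numpy as np


def ffill(arr):
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]


def bfill(arr):
    # Backward-fill as a forward-fill on the column-reversed array.
    return ffill(arr[:, ::-1])[:, ::-1]


def fill_both(arr):
    """Forward-fill, then backward-fill the leading NaNs that are left over."""
    return bfill(ffill(arr))


data = np.array([[np.nan, np.nan, 3.0, np.nan, 5.0],
                 [1.0, np.nan, np.nan, 4.0, np.nan]])
print(fill_both(data))
# [[3. 3. 3. 3. 5.]
#  [1. 1. 1. 4. 4.]]
```

A row that consists only of NaNs stays unchanged, since there is no valid value to copy from.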

answered Oct 9, 2018 at 11:31


2 Comments

Xukrao Over a year ago

I'm not sure I understand the purpose of this code. What exactly do you mean by 'problem of having leading np.nan after forward-filling'?

christian_bock Over a year ago

In the example array at the beginning of the thread, each row begins with a non-NaN value. Some people might find themselves dealing with a data set that requires backward filling, because forward filling leaves leading NaN entries untouched. So I thought it might be useful to present a solution in this thread.

RobertHannah89 · Answer · 2021-11-18 00:41:47Z

1

If you're willing to use pandas or xarray, let axis be the axis you wish to ffill/bfill along, as shown below:

import xarray as xr

xr.DataArray(arr).ffill(f'dim_{axis}').values
xr.DataArray(arr).bfill(f'dim_{axis}').values

More information: http://xarray.pydata.org/en/stable/generated/xarray.DataArray.ffill.html https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.ffill.html

edited Nov 18, 2021 at 0:41

answered Nov 16, 2021 at 2:45


LearnToGrow · Answer · 2021-10-06 21:43:39Z

0

Unless I'm missing something, the solutions do not work on this example:

arr  = np.array([[ 3.],
 [ 8.],
 [np.nan],
 [ 7.],
 [np.nan],
 [ 1.],
 [np.nan],
 [ 3.],
 [ 8.],
 [ 8.]])
print("A:::: \n", arr)

print("numpy_fill::: \n ",  numpy_fill(arr))
print("loop_fill",  loops_fill(arr))

A:::: 
 [[ 3.]
 [ 8.]
 [nan]
 [ 7.]
 [nan]
 [ 1.]
 [nan]
 [ 3.]
 [ 8.]
 [ 8.]]
numpy_fill::: 
  [[ 3.]
 [ 8.]
 [nan]
 [ 7.]
 [nan]
 [ 1.]
 [nan]
 [ 3.]
 [ 8.]
 [ 8.]]
loop_fill [[ 3.]
 [ 8.]
 [nan]
 [ 7.]
 [nan]
 [ 1.]
 [nan]
 [ 3.]
 [ 8.]
 [ 8.]]
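A likely explanation, offered here as a sketch rather than a definitive diagnosis: this array has shape (10, 1), so every row has a single column and a row-wise fill has no left neighbour to copy from. To fill down the column instead, one option is to transpose before and after the call. numpy_fill below repeats the function from the question:

```python
import numpy as np


def numpy_fill(arr):
    """Row-wise forward-fill (same logic as numpy_fill in the question)."""
    mask = np.isnan(arr)
    idx = np.where(~mask, np.arange(mask.shape[1]), 0)
    np.maximum.accumulate(idx, axis=1, out=idx)
    return arr[np.arange(idx.shape[0])[:, None], idx]


arr = np.array([[3.], [8.], [np.nan], [7.], [np.nan],
                [1.], [np.nan], [3.], [8.], [8.]])

# Row-wise fill on a single-column array leaves it unchanged.
filled_rows = numpy_fill(arr)

# Transposing turns the column into one row, so the fill runs down the column.
filled_cols = numpy_fill(arr.T).T
print(filled_cols.ravel())
# [3. 8. 8. 7. 7. 1. 1. 3. 8. 8.]
```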

Any comments?

answered Oct 6, 2021 at 21:43


kho · Answer · 2022-07-01 14:23:19Z

A minor improvement of RichieV's generalized pure-numpy solution, with axis selection and 'backward' support:

def _np_fill_(arr, axis=-1, fill_dir='f'):
    """Base function for np_fill, np_ffill, np_bfill."""
    if axis < 0:
        axis = len(arr.shape) + axis
    
    if fill_dir.lower() in ['b', 'backward']:
        dir_change = tuple([*[slice(None)]*axis, slice(None, None, -1)])
        return np_ffill(arr[dir_change], axis=axis)[dir_change]
    elif fill_dir.lower() not in ['f', 'forward']:
        raise KeyError(f"fill_dir must be one of: 'b', 'backward', 'f', 'forward'. Got: {fill_dir}")
    
    idx_shape = tuple([slice(None)] + [np.newaxis] * (len(arr.shape) - axis - 1))
    idx = np.where(~np.isnan(arr), np.arange(arr.shape[axis])[idx_shape], 0)
    np.maximum.accumulate(idx, axis=axis, out=idx)
    slc = [np.arange(k)[tuple([slice(None) if dim==i else np.newaxis
        for dim in range(len(arr.shape))])]
        for i, k in enumerate(arr.shape)]
    slc[axis] = idx
    return arr[tuple(slc)]

def np_fill(arr, axis=-1, fill_dir='f'):
    """General fill function which supports multiple filling steps. I.e.: 
    fill_dir=['f', 'b'] or fill_dir=['b', 'f']"""
    if isinstance(fill_dir, (tuple, list, np.ndarray)):
        for i in fill_dir:
            arr = _np_fill_(arr, axis=axis, fill_dir=i)
    else:
        arr = _np_fill_(arr, axis=axis, fill_dir=fill_dir)
    return arr

def np_ffill(arr, axis=-1):
    return np_fill(arr, axis=axis, fill_dir='forward')

def np_bfill(arr, axis=-1):
    return np_fill(arr, axis=axis, fill_dir='backward')

Tan Phan · Answer · 2021-06-18 00:31:46Z

-1

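I used np.nan_to_num. Note that this replaces every NaN with a single constant (here the mean of the non-NaN values) rather than forward-filling. Example:

data = np.nan_to_num(data, nan=np.nanmean(data))

Reference: NumPy documentation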

answered Jun 18, 2021 at 0:31

