Apply an operation to specific columns in numpy array

Question

I would like to apply feature normalisation to a NumPy array. Normally this would be trivial with NumPy broadcasting; for example, one would do something like this:

train_mean = train.mean(axis=0)
train_std = train.std(axis=0)

train = (train - train_mean) / train_std
val = (val - train_mean) / train_std
test = (test - train_mean) / train_std

However, my NumPy array has 9 columns, so train_mean and train_std both have shape (9,), and I only want to normalise specific columns of the array, whose indexes I keep in a dictionary:

column_indices
{'blind angle': 0,
 'fully open': 1,
 'ibn': 2,
 'idh': 3,
 'altitude': 4,
 'azimuth_sin': 5,
 'azimuth_cos': 6,
 'dgp': 7,
 'ill': 8}

I have made a list of the columns I would like to normalise:

FEATURE_NORM_COLS = ['blind angle', 'ibn', 'idh', 'altitude']

I only want to normalise these columns, using their indexes to select both the data columns and the matching entries of the train_mean and train_std arrays (which share the same indexes as my data).

What is the best way to achieve this operation?

I have done the following, which seems to get the desired result, but it seems very cumbersome. Is there a better way of doing this?

for name in FEATURE_NORM_COLS:
    train[:, column_indices[name]] = (train[:, column_indices[name]] - train_mean[column_indices[name]]) / train_std[column_indices[name]]

UPDATE

I have followed an approach similar to the one suggested in the comments, which I think is more elegant and avoids an explicit Python loop over the columns.

def normalise(dataset, col_indices=COLUMN_INDICES, norm_cols=NORM_COLS,
              train_mean=TRAIN_MEAN, train_std=TRAIN_STD):
    """
    Normalises selected columns to a mean of zero and a std of 1.
    The formula is (dataset - train_mean) / train_std, but we index by
    column since we don't want to normalise all columns.
    Note: dataset is modified in place and also returned.
    Args:
        dataset: 2-D numpy array to normalise
        col_indices -> dict: maps column names to column indices in dataset
        norm_cols -> list: names of the columns to be normalised
        train_mean -> numpy array: means of train set columns
        train_std -> numpy array: std's of train set columns
    """
    # Integer positions of the columns to normalise
    indices = [col_indices[col] for col in norm_cols]
    dataset[:, indices] = (dataset[:, indices] - train_mean[indices]) / train_std[indices]
    return dataset
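For anyone wanting to try the idea in isolation, here is a minimal self-contained sketch. The toy arrays and the helper name normalise_columns are made up for illustration and are not part of my real code. Unlike the function above, it returns a copy, so the caller's array is left untouched.

```python
import numpy as np

def normalise_columns(dataset, indices, mean, std):
    """Return a float copy of dataset with only the given columns standardised."""
    out = np.array(dataset, dtype=float)  # copy, so the input is not modified
    out[:, indices] = (out[:, indices] - mean[indices]) / std[indices]
    return out

# Hypothetical toy data: 4 rows, 3 columns; only columns 0 and 2 are normalised.
train = np.array([[1.0, 10.0, 100.0],
                  [2.0, 20.0, 200.0],
                  [3.0, 30.0, 300.0],
                  [4.0, 40.0, 400.0]])
val = np.array([[2.5, 99.0, 250.0]])

indices = [0, 2]
train_mean = train.mean(axis=0)  # shape (3,)
train_std = train.std(axis=0)    # shape (3,)

# The statistics always come from the train set, also when normalising val.
train_norm = normalise_columns(train, indices, train_mean, train_std)
val_norm = normalise_columns(val, indices, train_mean, train_std)
```

Column 1 passes through unchanged in both outputs, and the val row is scaled with the train statistics rather than its own.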

David M. · Accepted Answer · 2020-12-18 23:31:43Z

Would that be a better way?

import numpy as np
from sklearn import preprocessing

np.set_printoptions(suppress=True, linewidth=1000, precision=3)
np.random.seed(5)

train = np.array([np.random.uniform(low=0, high=100, size=10),
        np.random.uniform(low=0, high=30, size=10),
        np.random.uniform(low=0, high=70, size=10),
        np.random.uniform(low=0, high=20, size=10),
        np.random.uniform(low=0, high=90, size=10),
        np.random.uniform(low=0, high=50, size=10),
        np.random.uniform(low=0, high=30, size=10),
        np.random.uniform(low=0, high=80, size=10),
        np.random.uniform(low=0, high=90, size=10)]).T

column_indices = {'blind angle': 0,
                  'fully open': 1,
                  'ibn': 2,
                  'idh': 3,
                  'altitude': 4,
                  'azimuth_sin': 5,
                  'azimuth_cos': 6,
                  'dgp': 7,
                  'ill': 8}

FEATURE_NORM_COLS = ['blind angle', 'ibn', 'idh', 'altitude']
indices = [column_indices[c] for c in FEATURE_NORM_COLS]

print('TRAIN\n', train, '\n')
scaler = preprocessing.StandardScaler().fit(train)
train[:,indices] = scaler.transform(train)[:, indices]
print('PARTIALLY SCALED\n', train)

You can reuse the scaler for your validation set and test set if need be (see documentation).
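As a rough sketch of that reuse, one option is to fit the scaler on just the selected train columns and then apply it to each split. The random arrays, the split sizes and the hard-coded index list below are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the real train / validation / test arrays.
rng = np.random.default_rng(5)
train = rng.uniform(0, 100, size=(10, 9))
val = rng.uniform(0, 100, size=(4, 9))
test = rng.uniform(0, 100, size=(4, 9))

# Same positions as 'blind angle', 'ibn', 'idh', 'altitude' in column_indices.
indices = [0, 2, 3, 4]

# Fit on the selected train columns only, so the scaler stores 4 means/stds.
scaler = StandardScaler().fit(train[:, indices])

# Apply the train statistics to every split; other columns are left as they are.
for split in (train, val, test):
    split[:, indices] = scaler.transform(split[:, indices])
```

Fitting on the column subset avoids computing statistics for columns that are never scaled, and the scaler's mean_ and scale_ attributes then line up with indices rather than with all nine columns.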

Output:

TRAIN
 [[22.199  2.422 41.995  0.486 23.319 38.543 19.061  4.091 84.919]
 [87.073 22.153 18.607  4.091 72.225 24.247 24.357 15.093 10.052]
 [20.672 13.239 19.928 13.997 78.343  1.456 27.8   29.238 75.92 ]
 [91.861  4.749 17.751 15.59  83.047  4.326 27.379 19.543 31.143]
 [48.841 26.398 22.929  0.459  0.199  5.573 24.744 63.607  9.074]
 [61.174  8.223 10.092 11.553 42.254 12.562  2.826 28.168 34.507]
 [76.591 12.427 11.593  0.033 88.332 48.246 10.831 51.11  45.932]
 [51.842  8.882 67.475 10.309 35.905 31.588  1.065 39.473 86.499]
 [29.68  18.864 67.216 12.796 73.236 40.833 16.391 46.68  33.436]
 [18.772 17.395 13.189 19.712 49.181 28.304 23.884 75.144  1.113]] 

PARTIALLY SCALED
 [[-1.085  2.422  0.618 -1.247 -1.132 38.543 19.061  4.091 84.919]
 [ 1.371 22.153 -0.501 -0.713  0.638 24.247 24.357 15.093 10.052]
 [-1.143 13.239 -0.437  0.755  0.859  1.456 27.8   29.238 75.92 ]
 [ 1.552  4.749 -0.542  0.991  1.029  4.326 27.379 19.543 31.143]
 [-0.077 26.398 -0.294 -1.251 -1.969  5.573 24.744 63.607  9.074]
 [ 0.39   8.223 -0.908  0.393 -0.447 12.562  2.826 28.168 34.507]
 [ 0.974 12.427 -0.836 -1.314  1.22  48.246 10.831 51.11  45.932]
 [ 0.037  8.882  1.836  0.208 -0.677 31.588  1.065 39.473 86.499]
 [-0.802 18.864  1.824  0.577  0.674 40.833 16.391 46.68  33.436]
 [-1.215 17.395 -0.76   1.601 -0.196 28.304 23.884 75.144  1.113]]

Thanks, I have implemented something similar and added to my original post. Will accept your answer.
