I would like to apply feature normalisation to a NumPy array. Normally this is trivial with NumPy broadcasting; for example, one would do something like this:
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)
train = (train - train_mean) / train_std
val = (val - train_mean) / train_std
test = (test - train_mean) / train_std
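For illustration, here is a minimal self-contained sketch of the broadcasting above. The random data, the shapes, and the names train_norm and val_norm are assumptions made up for the example, not my real dataset:

```python
import numpy as np

# Made-up data purely for illustration: 9 feature columns
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 9))
val = rng.normal(loc=5.0, scale=2.0, size=(20, 9))

# Per-column statistics computed on the training set only
train_mean = train.mean(axis=0)  # shape (9,)
train_std = train.std(axis=0)    # shape (9,)

# The (9,) vectors broadcast across every row of the (n, 9) arrays
train_norm = (train - train_mean) / train_std
val_norm = (val - train_mean) / train_std
```

Each column of train_norm then has approximately zero mean and unit standard deviation, while val_norm is scaled with the train statistics.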
However, my NumPy array has 9 columns, so train_mean and train_std both have shape (9,), and I only want to apply normalisation to specific columns in my array. I have the column indexes in a dictionary:
column_indices
{'blind angle': 0,
'fully open': 1,
'ibn': 2,
'idh': 3,
'altitude': 4,
'azimuth_sin': 5,
'azimuth_cos': 6,
'dgp': 7,
'ill': 8}
I have made a list of the columns I would like to normalise:
FEATURE_NORM_COLS = ['blind angle', 'ibn', 'idh', 'altitude']
I only want to normalise these columns, using their indexes to select the matching entries of my train_mean and train_std arrays (which are indexed the same way as the columns of my data).
What is the best way to achieve this operation?
I have done the following, which seems to get the desired result, but it seems very cumbersome. Is there a better way of doing this?
for name in FEATURE_NORM_COLS:
    train[:, column_indices[name]] = (train[:, column_indices[name]] - train_mean[column_indices[name]]) / train_std[column_indices[name]]
UPDATE
I have followed an approach similar to the one suggested in the comments, which I think is more elegant and avoids an explicit loop over each column in the dataset.
def normalise(dataset, col_indices=COLUMN_INDICES, norm_cols=NORM_COLS,
              train_mean=TRAIN_MEAN, train_std=TRAIN_STD):
    """
    Returns the dataset with the selected features normalised using the
    train set statistics (mean of zero and std of 1 on the train set).
    The formula is (dataset - train_mean) / train_std, but we index by
    column since we don't want to normalise all columns.
    Note: dataset is modified in place.
    Args:
        dataset: numpy array to normalise
        col_indices -> dict: maps column name to column index in dataset
        norm_cols -> list: names of the columns to be normalised
        train_mean -> numpy array: means of the train set columns
        train_std -> numpy array: stds of the train set columns
    """
    # Translate the column names into integer column positions
    indices = [col_indices[col] for col in norm_cols]
    dataset[:, indices] = (dataset[:, indices] - train_mean[indices]) / train_std[indices]
    return dataset
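For completeness, here is a self-contained sketch of how this could be exercised end to end. The random data and the helper name normalise_columns are hypothetical, made up for the example. Unlike the function above, this variant works on a float copy instead of modifying the input in place, which is a design choice I am assuming here rather than a requirement:

```python
import numpy as np

column_indices = {'blind angle': 0, 'fully open': 1, 'ibn': 2, 'idh': 3,
                  'altitude': 4, 'azimuth_sin': 5, 'azimuth_cos': 6,
                  'dgp': 7, 'ill': 8}
FEATURE_NORM_COLS = ['blind angle', 'ibn', 'idh', 'altitude']


def normalise_columns(dataset, indices, mean, std):
    """Hypothetical helper: return a float copy with only the selected columns normalised."""
    out = np.array(dataset, dtype=float)  # copy, so the caller's array is untouched
    out[:, indices] = (out[:, indices] - mean[indices]) / std[indices]
    return out


# Made-up data purely for illustration
rng = np.random.default_rng(0)
train = rng.normal(loc=10.0, scale=3.0, size=(50, 9))
val = rng.normal(loc=10.0, scale=3.0, size=(10, 9))

# Column names -> integer positions, here [0, 2, 3, 4]
indices = [column_indices[name] for name in FEATURE_NORM_COLS]

# Statistics come from the training set only (NumPy arrays, not lists,
# so that they can be indexed with a list of positions)
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)

train_norm = normalise_columns(train, indices, train_mean, train_std)
val_norm = normalise_columns(val, indices, train_mean, train_std)
```

The columns that are not listed in FEATURE_NORM_COLS pass through unchanged, and the same train_mean and train_std are reused for the validation and test sets.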