I would like to apply feature normalisation to a NumPy array. Normally this would be trivial with NumPy broadcasting; for example, one would do something like this:

train_mean = train.mean(axis=0)
train_std = train.std(axis=0)

train = (train - train_mean) / train_std
val = (val - train_mean) / train_std
test = (test - train_mean) / train_std
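
To make the broadcasting explicit, here is a minimal sketch on made-up data (the random array and its shape are assumptions for illustration, not my real dataset). The per-column statistics have shape (9,) and are stretched across every row of the (10, 9) array:

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.uniform(0, 100, size=(10, 9))  # made-up data: 10 rows, 9 features

train_mean = train.mean(axis=0)  # shape (9,), one mean per column
train_std = train.std(axis=0)    # shape (9,), one std per column

# (10, 9) - (9,) broadcasts the column statistics across all rows
train_norm = (train - train_mean) / train_std

print(train_mean.shape, train_norm.shape)
```

After this, every column of train_norm has (approximately) zero mean and unit standard deviation.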

However, my numpy array has 9 columns, hence the shape of train_mean and train_std is (9,), and I only want to apply normalisation to specific columns in my array, for which I have the indexes in a dictionary:

column_indices
{'blind angle': 0,
 'fully open': 1,
 'ibn': 2,
 'idh': 3,
 'altitude': 4,
 'azimuth_sin': 5,
 'azimuth_cos': 6,
 'dgp': 7,
 'ill': 8}

I have made a list of the columns I would like to normalise:

FEATURE_NORM_COLS = ['blind angle', 'ibn', 'idh', 'altitude']

I only want to normalise these columns, using their indexes in the data and the matching indexes in my train_mean and train_std arrays (which are the same as the column indexes of my data).

What is the best way to achieve this operation?

I have done the following, which seems to get the desired result, but it seems very cumbersome. Is there a better way of doing this?

for name in FEATURE_NORM_COLS:
    train[:, column_indices[name]] = (train[:, column_indices[name]] - train_mean[column_indices[name]]) / train_std[column_indices[name]]

UPDATE

I have followed an approach similar to the one suggested in the comments, which I think is more elegant and avoids looping over each column in the dataset.

def normalise(dataset, col_indices=COLUMN_INDICES, norm_cols=NORM_COLS,
              train_mean=TRAIN_MEAN, train_std=TRAIN_STD):
    """
    Standardises the selected columns to a mean of zero and std of 1, in place.

    The formula is (dataset - train_mean) / train_std, but we index by indices
    since we don't want to normalise all columns.

    Args:
        dataset: 2-D float numpy array to normalise (modified in place)
        col_indices -> dict: maps column names to column indices in dataset
        norm_cols -> list: names of the columns to be normalised
        train_mean -> np.ndarray: means of train set columns
        train_std -> np.ndarray: std's of train set columns
    """
    # Translate column names into integer positions, then index all at once.
    indices = [col_indices[col] for col in norm_cols]
    dataset[:, indices] = (dataset[:, indices] - train_mean[indices]) / train_std[indices]
    return dataset
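
One thing to be aware of with the function above: assigning to dataset[:, indices] writes into the caller's array, and if that array had an integer dtype the normalised values would be truncated. Below is a sketch of a non-mutating variant, assuming made-up data and a hypothetical helper name normalise_columns (neither comes from my real code). It copies to float first and then reuses the train statistics for a validation set:

```python
import numpy as np

def normalise_columns(dataset, indices, mean, std):
    """Return a float copy of dataset with only the given columns standardised."""
    out = np.array(dataset, dtype=float)  # copy, so the caller's array is untouched
    out[:, indices] = (out[:, indices] - mean[indices]) / std[indices]
    return out

# Shortened, illustrative column map (5 columns instead of 9)
column_indices = {'blind angle': 0, 'fully open': 1, 'ibn': 2, 'idh': 3, 'altitude': 4}
norm_cols = ['blind angle', 'ibn', 'idh', 'altitude']
indices = [column_indices[c] for c in norm_cols]

rng = np.random.default_rng(0)
train = rng.uniform(0, 100, size=(20, 5))  # made-up train set
val = rng.uniform(0, 100, size=(5, 5))     # made-up validation set

# Statistics come from the train set only
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)

train_norm = normalise_columns(train, indices, train_mean, train_std)
val_norm = normalise_columns(val, indices, train_mean, train_std)
```

The 'fully open' column (index 1) is left as it was in both outputs, and train and val themselves are not modified.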

1 Answer

Would using scikit-learn's StandardScaler be a better way?

import numpy as np
from sklearn import preprocessing

np.set_printoptions(suppress=True, linewidth=1000, precision=3)
np.random.seed(5)

train = np.array([np.random.uniform(low=0, high=100, size=10),
        np.random.uniform(low=0, high=30, size=10),
        np.random.uniform(low=0, high=70, size=10),
        np.random.uniform(low=0, high=20, size=10),
        np.random.uniform(low=0, high=90, size=10),
        np.random.uniform(low=0, high=50, size=10),
        np.random.uniform(low=0, high=30, size=10),
        np.random.uniform(low=0, high=80, size=10),
        np.random.uniform(low=0, high=90, size=10)]).T

column_indices = {'blind angle': 0,
                  'fully open': 1,
                  'ibn': 2,
                  'idh': 3,
                  'altitude': 4,
                  'azimuth_sin': 5,
                  'azimuth_cos': 6,
                  'dgp': 7,
                  'ill': 8}

FEATURE_NORM_COLS = ['blind angle', 'ibn', 'idh', 'altitude']
indices = [column_indices[c] for c in FEATURE_NORM_COLS]

print('TRAIN\n', train, '\n')
scaler = preprocessing.StandardScaler().fit(train)
train[:,indices] = scaler.transform(train)[:, indices]
print('PARTIALLY SCALED\n', train)

You can reuse the fitted scaler for your validation and test sets by calling scaler.transform on them, if need be (see the StandardScaler documentation).
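
As a rough sketch of that reuse, the variation below fits the scaler on the selected columns only and applies it to a validation set. The data, shapes, and the val array are made up for illustration, and the hard-coded indices list stands in for the lookup through column_indices:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
train = rng.uniform(0, 100, size=(10, 9))  # made-up train set
val = rng.uniform(0, 100, size=(4, 9))     # made-up validation set
indices = [0, 2, 3, 4]                     # columns to normalise

# Fit on the selected train columns only, so the scaler holds 4 means/stds
scaler = StandardScaler().fit(train[:, indices])

# Work on copies so the original arrays stay available
train_scaled = train.copy()
val_scaled = val.copy()
train_scaled[:, indices] = scaler.transform(train[:, indices])
val_scaled[:, indices] = scaler.transform(val[:, indices])  # uses train statistics
```

Fitting on the subset avoids computing statistics for columns that are never scaled; the result for the selected columns should match the fit-on-everything version above.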

Output:

TRAIN
 [[22.199  2.422 41.995  0.486 23.319 38.543 19.061  4.091 84.919]
 [87.073 22.153 18.607  4.091 72.225 24.247 24.357 15.093 10.052]
 [20.672 13.239 19.928 13.997 78.343  1.456 27.8   29.238 75.92 ]
 [91.861  4.749 17.751 15.59  83.047  4.326 27.379 19.543 31.143]
 [48.841 26.398 22.929  0.459  0.199  5.573 24.744 63.607  9.074]
 [61.174  8.223 10.092 11.553 42.254 12.562  2.826 28.168 34.507]
 [76.591 12.427 11.593  0.033 88.332 48.246 10.831 51.11  45.932]
 [51.842  8.882 67.475 10.309 35.905 31.588  1.065 39.473 86.499]
 [29.68  18.864 67.216 12.796 73.236 40.833 16.391 46.68  33.436]
 [18.772 17.395 13.189 19.712 49.181 28.304 23.884 75.144  1.113]] 

PARTIALLY SCALED
 [[-1.085  2.422  0.618 -1.247 -1.132 38.543 19.061  4.091 84.919]
 [ 1.371 22.153 -0.501 -0.713  0.638 24.247 24.357 15.093 10.052]
 [-1.143 13.239 -0.437  0.755  0.859  1.456 27.8   29.238 75.92 ]
 [ 1.552  4.749 -0.542  0.991  1.029  4.326 27.379 19.543 31.143]
 [-0.077 26.398 -0.294 -1.251 -1.969  5.573 24.744 63.607  9.074]
 [ 0.39   8.223 -0.908  0.393 -0.447 12.562  2.826 28.168 34.507]
 [ 0.974 12.427 -0.836 -1.314  1.22  48.246 10.831 51.11  45.932]
 [ 0.037  8.882  1.836  0.208 -0.677 31.588  1.065 39.473 86.499]
 [-0.802 18.864  1.824  0.577  0.674 40.833 16.391 46.68  33.436]
 [-1.215 17.395 -0.76   1.601 -0.196 28.304 23.884 75.144  1.113]]

1 Comment

Thanks, I have implemented something similar and added to my original post. Will accept your answer.
