Standardizing Data
Explore how to standardize data using scikit-learn and NumPy by converting various feature values to a standard format with zero mean and unit variance. Understand why standardization is critical when working with datasets containing features with different units or scales. Learn to apply the scale function to prepare data effectively for machine learning models.
We'll cover the following...
A. Standard data format
Data can contain all sorts of different values. For example, Olympic 100m sprint times will range from 9.5 to 10.5 seconds, while calorie counts in large pepperoni pizzas can range from 1500 to 3000 calories. Even data measuring the exact same quantities can range in value (e.g. weight in kilograms vs. weight in pounds).
When data can take on any range of values, it makes it difficult to interpret. Therefore, data scientists will convert the data into a standard format to make it easier to understand. The standard format refers to data that has 0 mean and unit variance (i.e. standard deviation = 1), and the process of converting data into this format is called data standardization.
Data standardization is a relatively simple process. For each data value, x, we subtract the overall mean of the data, μ, then divide by the overall standard deviation, σ. The new value, z, represents ...