Data Preprocessing & Feature Engineering — Lesson Content
Learn how to prepare raw data for machine learning by handling missing values, scaling features, and encoding categorical variables.
Learning Objectives
- Understand why data quality affects model performance
- Learn different normalization and scaling techniques
- Master categorical variable encoding methods
- See the impact of preprocessing on learning
Step 1: The Problem with Raw Data
# Raw student data
students = [
    {"age": 22, "gpa": 3.8, "hours": 25, "income": "$45,000"},
    {"age": 19, "gpa": 2.1, "hours": 8, "income": "$0"},
    # Different scales, text values, inconsistent formats!
]

Step 2: Handle Categorical Data
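A minimal sketch of ordinal encoding for this step, assuming the records live in a pandas DataFrame. The name `data`, the sample rows, and the `"$30,000"` value are illustrative assumptions; the mapping is the one used in this step. `Series.map` returns NaN for any value missing from the mapping, so it can be worth checking for unmapped entries.

```python
import pandas as pd

# Hypothetical sample rows; "$30,000" is deliberately absent from the mapping.
data = pd.DataFrame({"income": ["$45,000", "$0", "$30,000"]})

# Ordinal mapping from this step: larger income -> larger code.
income_mapping = {"$0": 0, "$25,000": 1, "$45,000": 2, "$55,000": 3, "$65,000": 4}

data["income_encoded"] = data["income"].map(income_mapping)

# Values not present in the mapping become NaN, so flag them explicitly.
unmapped = data.loc[data["income_encoded"].isna(), "income"].tolist()
print(unmapped)
```

An ordinal code only makes sense when the categories have a natural order, as income brackets do here.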
Ordinal Encoding: \text{"Low"} \rightarrow 0, \text{"Medium"} \rightarrow 1, \text{"High"} \rightarrow 2

# Convert income to ordinal encoding
income_mapping = {"$0": 0, "$25,000": 1, "$45,000": 2, "$55,000": 3, "$65,000": 4}
data["income_encoded"] = data["income"].map(income_mapping)

Step 3: The Scaling Problem
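To make the scale problem concrete, the following sketch measures how much each feature contributes to the Euclidean distance between the two students from Step 1. The variable names and the choice of Euclidean distance are assumptions made for illustration.

```python
import numpy as np

# The two students from Step 1 as [study_hours, gpa].
a = np.array([25.0, 3.8])
b = np.array([8.0, 2.1])

diff = a - b
distance = np.linalg.norm(diff)  # Euclidean distance

# Share of the squared distance contributed by each feature.
share = diff**2 / np.sum(diff**2)
print(f"distance = {distance:.2f}")
print(f"study_hours share = {share[0]:.1%}, gpa share = {share[1]:.1%}")
```

Study hours account for almost all of the distance, so any distance-based or gradient-based learner would effectively ignore GPA until the features are rescaled.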
\text{Problem: } |\text{study\_hours}| \gg |\text{gpa}| \text{ in magnitude}

# Feature scales are very different!
age_range = [18, 24]  # Small range
study_hours_range = [5, 30]  # Medium range
gpa_range = [1.9, 3.9]  # Small range
income_range = [0, 3]  # Very small range

Step 4: Min-Max Normalization
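The `normalize` helper in this step takes the bounds as arguments. A column-wise variant that derives the bounds from the data might look like the following sketch. The function name `min_max_scale`, the third sample row, and the choice to map constant columns to 0 are assumptions.

```python
import numpy as np

def min_max_scale(X):
    """Scale each column of X to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    # Avoid division by zero when a column holds a single repeated value.
    safe_range = np.where(col_range == 0, 1.0, col_range)
    return (X - col_min) / safe_range

# Columns: age, gpa, study hours (two students from Step 1 plus one made-up row).
X = [[22, 3.8, 25],
     [19, 2.1, 8],
     [24, 3.0, 8]]
X_scaled = min_max_scale(X)
print(X_scaled)
```

Deriving the bounds from the data keeps the output in [0, 1] for that data, but values seen later may fall outside the range.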
\text{normalized} = \frac{x - \text{min}}{\text{max} - \text{min}}

# Min-Max normalization to [0, 1]
def normalize(value, min_val, max_val):
    return (value - min_val) / (max_val - min_val)

# Example values: the first student from Step 1
age, gpa = 22, 3.8
age_norm = normalize(age, 18, 24)
gpa_norm = normalize(gpa, 1.9, 3.9)

Step 5: Z-Score Standardization
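As a quick check of what standardization does, this sketch standardizes each column of a small table and confirms that the result has mean 0 and standard deviation 1. The function name `standardize_columns` and the sample rows are assumptions; the population standard deviation (NumPy's default, `ddof=0`) is used.

```python
import numpy as np

def standardize_columns(X):
    """Return (X - mean) / std per column, using the population std (ddof=0)."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    if np.any(std == 0):
        raise ValueError("a column has zero variance and cannot be standardized")
    return (X - mean) / std

# Hypothetical rows: [age, study_hours]
X = [[22, 25], [19, 8], [24, 30], [18, 5]]
Z = standardize_columns(X)

print(Z.mean(axis=0))  # approximately 0 for each column
print(Z.std(axis=0))   # approximately 1 for each column
```

Unlike min-max normalization, the output is not bounded to a fixed interval, which makes it less sensitive to a single extreme value.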
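z = \frac{x - \mu}{\sigma}, \text{ where } \mu \text{ is the mean and } \sigma \text{ is the standard deviation}

# Z-score standardization
import numpy as np

def standardize(values):
    values = np.asarray(values, dtype=float)  # accept plain lists as well as arrays
    mean = np.mean(values)
    std = np.std(values)  # population standard deviation (ddof=0)
    return (values - mean) / std

ages = [22, 19]  # example: ages of the two students from Step 1
age_standardized = standardize(ages)

Step 6: Feature Engineering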
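The feature definitions in this step can be tried on a small table. The sketch assumes a pandas DataFrame with `age`, `gpa`, and `study_hours` columns; the rows are hypothetical, and treating zero study hours as missing is an assumption added to avoid dividing by zero.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the third student has zero study hours.
data = pd.DataFrame({
    "age": [22, 19, 20],
    "gpa": [3.8, 2.1, 3.0],
    "study_hours": [25, 8, 0],
})

# Ratio feature: treat zero hours as missing instead of producing inf.
hours = data["study_hours"].where(data["study_hours"] > 0)
data["study_efficiency"] = data["gpa"] / hours * 100

# Binned feature: same threshold (21) as the lambda in this step.
data["age_group"] = np.where(data["age"] < 21, "young", "older")

# Polynomial feature.
data["gpa_squared"] = data["gpa"] ** 2
print(data)
```

`np.where` is a vectorized alternative to `apply` with a lambda and gives the same labels.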
\text{Study Efficiency} = \frac{\text{GPA}}{\text{Study Hours}} \times 100

# Create new features from existing ones
data["study_efficiency"] = data["gpa"] / data["study_hours"] * 100
data["age_group"] = data["age"].apply(lambda x: "young" if x < 21 else "older")
# Polynomial features
data["gpa_squared"] = data["gpa"] ** 2

Step 7: Impact on Model Performance
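One way to see preprocessing change a model's behavior is a deliberately contrived example. The sketch uses a 1-nearest-neighbor classifier with leave-one-out evaluation on six made-up points where the label depends only on GPA while income is expressed in dollars. The data, the helper names, and the choice of classifier are all assumptions, and the result does not reproduce the accuracy figures quoted in this step.

```python
import numpy as np

def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances between all rows.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point may not be its own neighbor
    nearest = dists.argmin(axis=1)
    return float(np.mean(y[nearest] == y))

def min_max_scale(X):
    """Scale each column to [0, 1] (assumes no constant columns)."""
    X = np.asarray(X, dtype=float)
    return (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Contrived toy data: columns are [gpa, income in dollars].
# The label depends only on gpa, but income differences are numerically huge.
X_raw = [[2.0, 10_000], [2.1, 30_000], [2.2, 50_000],
         [3.6, 10_500], [3.7, 30_500], [3.8, 50_500]]
y = [0, 0, 0, 1, 1, 1]

raw_accuracy = loo_1nn_accuracy(X_raw, y)
scaled_accuracy = loo_1nn_accuracy(min_max_scale(X_raw), y)
print(f"raw: {raw_accuracy:.0%}, scaled: {scaled_accuracy:.0%}")
```

On this constructed data the raw features misclassify every point, because each point's nearest neighbor is the student with a similar income, while the scaled features classify every point correctly. Real datasets will usually show smaller and less clear-cut differences.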
\text{Accuracy}_{\text{preprocessed}} > \text{Accuracy}_{\text{raw}}

# Compare model performance
# train_model is a placeholder for your own training routine;
# it is assumed to return accuracy as a fraction (e.g. 0.65).
raw_accuracy = train_model(raw_data)  # 65% accuracy
clean_accuracy = train_model(clean_data)  # 89% accuracy
print(f"Improvement: {clean_accuracy - raw_accuracy:.1%}")

Step 8: Best Practices Summary
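The four helpers named in this step's pipeline are not defined in the lesson. The sketch fills them in with simple stand-ins so the pipeline can be run end to end: median imputation, the income mapping from Step 2, min-max scaling, and a squared-GPA feature. All four implementations, the column names, and the sample rows are assumptions, not the lesson's own definitions.

```python
import pandas as pd

INCOME_MAPPING = {"$0": 0, "$25,000": 1, "$45,000": 2, "$55,000": 3, "$65,000": 4}
NUMERIC_COLS = ["age", "gpa", "study_hours"]

def handle_missing(df):
    # Assumption: fill numeric gaps with the column median.
    df = df.copy()
    df[NUMERIC_COLS] = df[NUMERIC_COLS].fillna(df[NUMERIC_COLS].median())
    return df

def encode_categories(df):
    # Ordinal-encode income and drop the original text column.
    df = df.copy()
    df["income_encoded"] = df["income"].map(INCOME_MAPPING)
    return df.drop(columns=["income"])

def scale_features(df):
    # Assumption: min-max scaling; columns must not be constant.
    df = df.copy()
    cols = df[NUMERIC_COLS]
    df[NUMERIC_COLS] = (cols - cols.min()) / (cols.max() - cols.min())
    return df

def create_features(df):
    # Computed from the already scaled gpa, following the pipeline order.
    df = df.copy()
    df["gpa_squared"] = df["gpa"] ** 2
    return df

def preprocess_data(raw_data):
    data = handle_missing(raw_data)
    data = encode_categories(data)
    data = scale_features(data)
    data = create_features(data)
    return data

# Hypothetical input with one missing GPA.
raw = pd.DataFrame({
    "age": [22, 19, 24],
    "gpa": [3.8, None, 3.0],
    "study_hours": [25, 8, 30],
    "income": ["$45,000", "$0", "$65,000"],
})
clean = preprocess_data(raw)
print(clean)
```

Because scaling runs before feature creation in this order, ratio features such as study efficiency would divide by scaled values that can be zero, which is one reason to consider engineering such features before scaling.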
# Data preprocessing pipeline
def preprocess_data(raw_data):
    # 1. Handle missing values
    data = handle_missing(raw_data)
    # 2. Encode categorical variables
    data = encode_categories(data)
    # 3. Scale numerical features
    data = scale_features(data)
    # 4. Engineer new features
    data = create_features(data)
    return data

Step 9: Test Your Understanding
Prerequisites
- Perceptron
Key Concepts
- Data Quality
- Feature Scaling
- Categorical Encoding
- Feature Engineering