Data Preprocessing & Feature Engineering

Beginner • 12-15 minutes

Discover why “garbage in, garbage out” is the most important rule in machine learning. Watch how proper data preprocessing transforms messy real-world data into clean, model-ready features that dramatically improve learning performance.

Data Quality

Why clean data is essential for model success

Feature Scaling

Making all features comparable through normalization

Categorical Encoding

Converting text categories into meaningful numbers

Feature Engineering

Creating new features from existing data

What You'll Learn

Understand why data quality affects model performance

Learn different normalization and scaling techniques

Master categorical variable encoding methods

See the impact of preprocessing on learning

Real-World Applications

Data preprocessing is used everywhere in machine learning

Recommendation Systems

Netflix and Spotify preprocess user behavior data, normalizing viewing times and encoding genre preferences to recommend content you'll love.

Medical Diagnosis

Hospital systems normalize patient vital signs and encode symptoms to help AI detect diseases early and accurately.

Financial Trading

Trading algorithms preprocess market data, scaling prices and volumes to identify profitable patterns across different assets.

Key Takeaways

Data Quality Matters Most

Clean, well-prepared data is more important than complex algorithms. “Garbage in, garbage out” is the fundamental rule of ML.

Scale Features Consistently

Use Min-Max normalization or Z-score standardization to put all features on comparable scales, so that no feature dominates learning simply because its numbers are larger.

Engineer Meaningful Features

Combine existing features to create new ones that capture important relationships in your data.


Student Dataset Transformation
6 samples

Name   Age  GPA  Study hours  Income    Passed
Alice  22   3.8  25           $45,000   Yes
Bob    19   2.1  8            $0        No
Carol  24   3.9  30           $65,000   Yes
David  20   2.8  15           $25,000   No
Eve    23   3.5  22           $55,000   Yes
Frank  18   1.9  5            $0        No

Column notes: age, GPA, and study hours are on different scales; income is stored as text.

⚠️ Problems Detected:

  • Different scales: Age (18-24) vs Study Hours (5-30) vs GPA (1.9-3.9)
  • Text values: Income column contains strings that neural networks can't process
  • Imbalanced influence: Larger numbers will dominate the model unfairly

Data Preprocessing & Feature Engineering — Lesson Content

Learn how to prepare raw data for machine learning by handling missing values, scaling features, and encoding categorical variables.


Step 1: The Problem with Raw Data

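Real-world data is messy! Our student dataset mixes features on different scales with numbers stored as text (income is a string such as "$45,000"), and real datasets often contain missing values as well. Neural networks cannot consume text directly and tend to train poorly on inconsistently scaled inputs.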
# Raw student data
students = [
  {"age": 22, "gpa": 3.8, "hours": 25, "income": "$45,000"},
  {"age": 19, "gpa": 2.1, "hours": 8, "income": "$0"},
  # Different scales, text values, inconsistent formats!
]
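To make these problems concrete, the following minimal sketch loads the lesson's six students as plain Python dictionaries and reports each column's type and range. The helper name `describe_column` is an assumption made for illustration; it is not part of the lesson's code.

```python
# The six students from the lesson's dataset, as raw (unprocessed) records.
students = [
    {"name": "Alice", "age": 22, "gpa": 3.8, "hours": 25, "income": "$45,000", "passed": "Yes"},
    {"name": "Bob", "age": 19, "gpa": 2.1, "hours": 8, "income": "$0", "passed": "No"},
    {"name": "Carol", "age": 24, "gpa": 3.9, "hours": 30, "income": "$65,000", "passed": "Yes"},
    {"name": "David", "age": 20, "gpa": 2.8, "hours": 15, "income": "$25,000", "passed": "No"},
    {"name": "Eve", "age": 23, "gpa": 3.5, "hours": 22, "income": "$55,000", "passed": "Yes"},
    {"name": "Frank", "age": 18, "gpa": 1.9, "hours": 5, "income": "$0", "passed": "No"},
]


def describe_column(rows, key):
    """Return (type name, min, max) for one column; min and max are None for text."""
    values = [row[key] for row in rows]
    type_name = type(values[0]).__name__
    if isinstance(values[0], (int, float)):
        return type_name, min(values), max(values)
    return type_name, None, None


for key in ("age", "gpa", "hours", "income"):
    print(key, describe_column(students, key))
```

The numeric columns report three different ranges, and income comes back as `str`, which is why it needs encoding before any arithmetic can be done on it.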

Step 2: Handle Categorical Data

First, we convert text data into numbers. Ordinal encoding assigns each category an integer that preserves the ordering relationship, for example:
Ordinal Encoding: \text{"Low"} \rightarrow 0, \text{"Medium"} \rightarrow 1, \text{"High"} \rightarrow 2
In our dataset, the five distinct income strings are ranked the same way, from lowest (0) to highest (4).
# Convert income to ordinal encoding
# (assumes `data` is a pandas DataFrame with an "income" column of strings)
income_mapping = {"$0": 0, "$25,000": 1, "$45,000": 2, "$55,000": 3, "$65,000": 4}
data["income_encoded"] = data["income"].map(income_mapping)
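A hand-written mapping stops working as soon as an income value appears that is not in the dictionary. As a hedged alternative, the sketch below parses the dollar strings into integers and derives the ranks from the sorted unique values. The function names `parse_income` and `ordinal_encode` are assumptions for illustration.

```python
def parse_income(text):
    """Turn a string like "$45,000" into the integer 45000."""
    return int(text.replace("$", "").replace(",", ""))


def ordinal_encode(values):
    """Map each value to its rank among the sorted unique values (0 = smallest)."""
    ranks = {value: rank for rank, value in enumerate(sorted(set(values)))}
    return [ranks[value] for value in values]


# Income column for Alice, Bob, Carol, David, Eve, Frank.
incomes = ["$45,000", "$0", "$65,000", "$25,000", "$55,000", "$0"]
amounts = [parse_income(s) for s in incomes]
encoded = ordinal_encode(amounts)

print(amounts)  # [45000, 0, 65000, 25000, 55000, 0]
print(encoded)  # [2, 0, 4, 1, 3, 0]
```

The ranks match the lesson's `income_mapping`. Note that ordinal encoding keeps the order but discards the distance between values; the parsed dollar amount itself could also be used as a numeric feature and scaled like the others.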

Step 3: The Scaling Problem

The features live on different scales: Age (18-24) vs Study Hours (5-30) vs GPA (1.9-3.9). Without scaling, features with larger values and wider ranges can dominate the others unfairly!
\text{Problem: } |\text{study\_hours}| \gg |\text{gpa}| \text{ in magnitude}
# Feature scales are very different!
age_range = [18, 24]        # Width 6, but large absolute values
study_hours_range = [5, 30] # Width 25, the widest range
gpa_range = [1.9, 3.9]      # Width 2, the narrowest range
income_range = [0, 4]       # Ordinal codes 0-4 from the income encoding
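A rough way to see the imbalance is to measure how much each feature contributes to the squared Euclidean distance between two students. This is only an illustration of magnitude, under the assumption that a larger raw gap translates into a larger influence on the model; `squared_gap_shares` is a hypothetical helper name.

```python
alice = {"age": 22, "gpa": 3.8, "hours": 25}
bob = {"age": 19, "gpa": 2.1, "hours": 8}


def squared_gap_shares(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    gaps = {key: (a[key] - b[key]) ** 2 for key in a}
    total = sum(gaps.values())
    return {key: gap / total for key, gap in gaps.items()}


shares = squared_gap_shares(alice, bob)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

On these two rows, study hours account for about 96% of the squared distance and GPA for about 1%, even though GPA differs by 1.7 points on a scale that only spans 2.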

Step 4: Min-Max Normalization

Min-Max scaling transforms all features to the same 0-1 range. Now no feature outweighs the others just because of its units or magnitude!
\text{normalized} = \frac{x - \text{min}}{\text{max} - \text{min}}
# Min-Max normalization to [0, 1]
def normalize(value, min_val, max_val):
    return (value - min_val) / (max_val - min_val)

age, gpa = 22, 3.8  # Alice's values
age_norm = normalize(age, 18, 24)    # (22 - 18) / 6 ≈ 0.67
gpa_norm = normalize(gpa, 1.9, 3.9)  # (3.8 - 1.9) / 2 ≈ 0.95
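The lesson's function takes the minimum and maximum as arguments. The sketch below reads them from the column itself and guards against a constant column, where `max - min` would be zero. The name `min_max_scale` and the choice to map a constant column to 0.0 are assumptions.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using the list's own min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; return zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [22, 19, 24, 20, 23, 18]
hours = [25, 8, 30, 15, 22, 5]

print([round(v, 2) for v in min_max_scale(ages)])
print([round(v, 2) for v in min_max_scale(hours)])
```

When a model is evaluated on held-out data, the minimum and maximum are normally computed on the training set only and then reused for new rows, so the scaled values stay comparable.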

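Step 5: Z-Score Standardization

Another approach: Z-score standardization centers data around 0 with a standard deviation of 1. It is often preferred over Min-Max when data has outliers, because it does not force every value into a fixed range set by the extremes. The mean and standard deviation are still affected by outliers, though.
z = \frac{x - \mu}{\sigma} \text{ where } \mu \text{ is mean, } \sigma \text{ is std dev}
# Z-score standardization
import numpy as np

def standardize(values):
    values = np.asarray(values, dtype=float)  # accept plain lists as well as arrays
    mean = np.mean(values)
    std = np.std(values)  # population standard deviation (ddof=0)
    return (values - mean) / std

ages = [22, 19, 24, 20, 23, 18]  # ages of the six students
age_standardized = standardize(ages)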
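To see how the two scalers react to an extreme value, the sketch below appends a hypothetical 100-hour entry to the study-hours column and applies both. The outlier and the helper names `z_scores` and `min_max` are assumptions for illustration; the code uses only the standard library so it runs without NumPy.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize with the population standard deviation (the default of np.std)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Scale values to [0, 1] using their own min and max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


hours = [25, 8, 30, 15, 22, 5]
standardized = z_scores(hours)
print([round(z, 2) for z in standardized])

# Hypothetical outlier: one student reporting 100 study hours.
with_outlier = hours + [100]
mm = min_max(with_outlier)
zs = z_scores(with_outlier)
print([round(v, 2) for v in mm])
print([round(z, 2) for z in zs])
```

With the outlier present, Min-Max pins 100 to 1.0 and pushes the six original values below 0.3. The Z-scores are not bounded, but they shift as well, since the outlier moves both the mean and the standard deviation.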

Step 6: Feature Engineering

Create new meaningful features from existing ones! We can combine GPA and study hours to create a "study efficiency" metric: GPA points earned per hour of study, scaled by 100.
\text{Study Efficiency} = \frac{\text{GPA}}{\text{Study Hours}} \times 100
# Create new features from existing ones
# (assumes `data` is a pandas DataFrame with "gpa", "study_hours", and "age" columns)
data["study_efficiency"] = data["gpa"] / data["study_hours"] * 100
data["age_group"] = data["age"].apply(lambda x: "young" if x < 21 else "older")

# Polynomial features
data["gpa_squared"] = data["gpa"] ** 2
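The same three features can be sketched on plain dictionaries, which also makes the division-by-zero case explicit. The function name `add_features`, the use of the `"hours"` key from the raw records, and returning `None` when a student has zero study hours are assumptions.

```python
def add_features(row):
    """Return a copy of one student record with three derived features added."""
    out = dict(row)
    # GPA per study hour, times 100; None when hours is 0 to avoid dividing by zero.
    out["study_efficiency"] = row["gpa"] / row["hours"] * 100 if row["hours"] else None
    out["age_group"] = "young" if row["age"] < 21 else "older"
    out["gpa_squared"] = row["gpa"] ** 2
    return out


students = [
    {"name": "Alice", "age": 22, "gpa": 3.8, "hours": 25},
    {"name": "Frank", "age": 18, "gpa": 1.9, "hours": 5},
]

for student in students:
    print(add_features(student))
```

Alice's efficiency comes out near 15.2 and Frank's near 38.0. The ratio rewards few study hours as much as a high GPA, so it is worth checking an engineered feature against the target before trusting it.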

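Step 7: Impact on Model Performance

See how preprocessing can improve model performance! Clean, scaled data typically helps neural networks learn faster and often more accurately. The accuracy figures below are illustrative.
\text{Accuracy}_{\text{preprocessed}} > \text{Accuracy}_{\text{raw}}
# Compare model performance
# (train_model is a placeholder assumed to return accuracy as a fraction)
raw_accuracy = train_model(raw_data)      # e.g. 0.65 (65% accuracy)
clean_accuracy = train_model(clean_data)  # e.g. 0.89 (89% accuracy)

# The gap between two accuracies is measured in percentage points
print(f"Improvement: {(clean_accuracy - raw_accuracy) * 100:.0f} percentage points")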
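Since `train_model` is only a placeholder, here is one possible stand-in: a small logistic regression trained by batch gradient descent on the six students, once on raw features and once on Min-Max-scaled features. The model choice, learning rate, epoch count, and the name `train_accuracy` are all assumptions, and the result is training accuracy on six rows, so treat it as a demonstration of the workflow, not a benchmark.

```python
import math

# (age, gpa, study hours) and pass/fail labels for the six students.
X_raw = [[22, 3.8, 25], [19, 2.1, 8], [24, 3.9, 30],
         [20, 2.8, 15], [23, 3.5, 22], [18, 1.9, 5]]
y = [1, 0, 1, 0, 1, 0]


def min_max_columns(rows):
    """Min-Max scale every column of a row-major table to [0, 1]."""
    cols = list(zip(*rows))
    lows, highs = [min(c) for c in cols], [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)]
            for row in rows]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_accuracy(X, y, lr=0.1, epochs=50):
    """Fit logistic regression by batch gradient descent; return training accuracy."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for row, target in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - target
            grad_w = [g + err * xi for g, xi in zip(grad_w, row)]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    preds = [1 if sum(wi * xi for wi, xi in zip(w, row)) + b >= 0 else 0 for row in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


raw_accuracy = train_accuracy(X_raw, y)
scaled_accuracy = train_accuracy(min_max_columns(X_raw), y)
print(f"raw: {raw_accuracy:.0%}, scaled: {scaled_accuracy:.0%}")
```

The outcome depends on the learning rate and the number of epochs, so it is worth varying both before drawing conclusions about the effect of scaling.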

Step 8: Best Practices Summary

Key takeaways for data preprocessing: handle missing values, encode categories properly, scale numerical features, and create meaningful new features when possible.
# Data preprocessing pipeline
def preprocess_data(raw_data):
    # 1. Handle missing values
    data = handle_missing(raw_data)
    
    # 2. Encode categorical variables
    data = encode_categories(data)
    
    # 3. Scale numerical features
    data = scale_features(data)
    
    # 4. Engineer new features
    data = create_features(data)
    
    return data
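The four helpers in the pipeline above are not defined in the lesson. The sketch below fills them in under stated assumptions: rows are plain dictionaries, missing numeric values are `None` and are filled with the column median, income is parsed to an integer, and scaling is Min-Max. One deliberate difference from the outline: the ratio feature is computed before scaling, because a Min-Max-scaled study-hours value of 0 would cause a division by zero. The missing GPA in the sample data is hypothetical.

```python
from statistics import median


def handle_missing(rows, keys=("age", "gpa", "hours")):
    """Replace None in the given numeric columns with that column's median."""
    fills = {k: median(r[k] for r in rows if r[k] is not None) for k in keys}
    return [{**r, **{k: fills[k] for k in keys if r[k] is None}} for r in rows]


def encode_categories(rows):
    """Parse income strings such as "$45,000" into integers."""
    return [{**r, "income": int(r["income"].replace("$", "").replace(",", ""))}
            for r in rows]


def create_features(rows):
    """Add GPA per study hour (times 100), computed on unscaled values."""
    return [{**r, "study_efficiency": r["gpa"] / r["hours"] * 100} for r in rows]


def scale_features(rows):
    """Min-Max scale every numeric column to [0, 1]; constant columns become 0.0."""
    keys = [k for k, v in rows[0].items() if isinstance(v, (int, float))]
    lo = {k: min(r[k] for r in rows) for k in keys}
    hi = {k: max(r[k] for r in rows) for k in keys}
    return [{**r, **{k: (r[k] - lo[k]) / (hi[k] - lo[k]) if hi[k] > lo[k] else 0.0
                     for k in keys}}
            for r in rows]


def preprocess_data(raw_rows):
    rows = handle_missing(raw_rows)
    rows = encode_categories(rows)
    rows = create_features(rows)  # before scaling, to keep the ratio well defined
    return scale_features(rows)


raw = [
    {"age": 22, "gpa": 3.8, "hours": 25, "income": "$45,000"},
    {"age": 19, "gpa": None, "hours": 8, "income": "$0"},  # hypothetical missing GPA
    {"age": 24, "gpa": 3.9, "hours": 30, "income": "$65,000"},
]
clean = preprocess_data(raw)
for row in clean:
    print({k: round(v, 2) for k, v in row.items()})
```

Each helper returns new dictionaries and leaves its input untouched, which makes the stages easy to test and reorder independently.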

Step 9: Test Your Understanding

Great job learning about data preprocessing! Let's check your understanding of the key concepts covered in this lesson.

Prerequisites

  • perceptron

Key Concepts

  • Data Quality
  • Feature Scaling
  • Categorical Encoding
  • Feature Engineering