Feature Engineering for Machine Learning: A Practical Guide Using the Ames Housing Dataset



1. Introduction

Feature engineering is a pivotal step in the machine learning pipeline. It transforms raw data into meaningful features that improve model performance, making it easier for algorithms to uncover patterns and make predictions. In this blog, we’ll dive into feature engineering using the Ames Housing Dataset, focusing on techniques applicable to linear regression and other machine learning models.

The Ames Housing Dataset is a rich dataset containing various characteristics of residential properties. It provides an excellent playground for applying and understanding feature engineering concepts. In this guide, we’ll walk through the process, from exploratory data analysis to transforming features for modeling. While the emphasis is on linear regression, these techniques can be adapted for other machine learning tasks.


2. Exploratory Data Analysis (EDA)

Before jumping into feature engineering, it’s crucial to understand the data. Let’s start by loading the Ames Housing Dataset and performing some basic exploratory data analysis.

Loading the Dataset

The dataset is stored in a .tsv (tab-separated values) file. Using Python’s Pandas library, we can load it with the following code:

import pandas as pd

# Load the dataset
df = pd.read_csv('data/Ames_Housing_Data1.tsv', sep='\t')

# Display the first few rows
print(df.head())
   Order        PID  MS SubClass MS Zoning  Lot Frontage  Lot Area Street  \
0      1  526301100           20        RL         141.0     31770   Pave   
1      1  526301100           20        RL         141.0     31770   Pave   
2      2  526350040           20        RH          80.0     11622   Pave   
3      3  526351010           20        RL          81.0     14267   Pave   
4      4  526353030           20        RL          93.0     11160   Pave   

  Alley Lot Shape Land Contour  ... Pool Area Pool QC  Fence Misc Feature  \
0   NaN       IR1          Lvl  ...         0     NaN    NaN          NaN   
1   NaN       IR1          Lvl  ...         0     NaN    NaN          NaN   
2   NaN       Reg          Lvl  ...         0     NaN  MnPrv          NaN   
3   NaN       IR1          Lvl  ...         0     NaN    NaN         Gar2   
4   NaN       Reg          Lvl  ...         0     NaN    NaN          NaN   

  Misc Val Mo Sold Yr Sold Sale Type  Sale Condition  SalePrice  
0        0       5    2010       WD           Normal     215000  
1        0       5    2010       WD           Normal     215000  
2        0       6    2010       WD           Normal     105000  
3    12500       6    2010       WD           Normal     172000  
4        0       4    2010       WD           Normal     244000  

[5 rows x 82 columns]

Here, sep='\t' specifies the tab separator, making it compatible with .tsv files.

Data Summary

Next, we use df.info() to get an overview of the dataset:

# Get an overview of the data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2931 entries, 0 to 2930
Data columns (total 82 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Order            2931 non-null   int64  
 1   PID              2931 non-null   int64  
 2   MS SubClass      2931 non-null   int64  
 3   MS Zoning        2931 non-null   object 
 4   Lot Frontage     2441 non-null   float64
 5   Lot Area         2931 non-null   int64  
 6   Street           2931 non-null   object 
 7   Alley            198 non-null    object 
 8   Lot Shape        2931 non-null   object 
 9   Land Contour     2931 non-null   object 
 10  Utilities        2931 non-null   object 
 11  Lot Config       2931 non-null   object 
 12  Land Slope       2931 non-null   object 
 13  Neighborhood     2931 non-null   object 
 14  Condition 1      2931 non-null   object 
 15  Condition 2      2931 non-null   object 
 16  Bldg Type        2931 non-null   object 
 17  House Style      2931 non-null   object 
 18  Overall Qual     2931 non-null   int64  
 19  Overall Cond     2931 non-null   int64  
 20  Year Built       2931 non-null   int64  
 21  Year Remod/Add   2931 non-null   int64  
 22  Roof Style       2931 non-null   object 
 23  Roof Matl        2931 non-null   object 
 24  Exterior 1st     2931 non-null   object 
 25  Exterior 2nd     2931 non-null   object 
 26  Mas Vnr Type     1156 non-null   object 
 27  Mas Vnr Area     2908 non-null   float64
 28  Exter Qual       2931 non-null   object 
 29  Exter Cond       2931 non-null   object 
 30  Foundation       2931 non-null   object 
 31  Bsmt Qual        2851 non-null   object 
 32  Bsmt Cond        2851 non-null   object 
 33  Bsmt Exposure    2848 non-null   object 
 34  BsmtFin Type 1   2851 non-null   object 
 35  BsmtFin SF 1     2930 non-null   float64
 36  BsmtFin Type 2   2850 non-null   object 
 37  BsmtFin SF 2     2930 non-null   float64
 38  Bsmt Unf SF      2930 non-null   float64
 39  Total Bsmt SF    2930 non-null   float64
 40  Heating          2931 non-null   object 
 41  Heating QC       2931 non-null   object 
 42  Central Air      2931 non-null   object 
 43  Electrical       2930 non-null   object 
 44  1st Flr SF       2931 non-null   int64  
 45  2nd Flr SF       2931 non-null   int64  
 46  Low Qual Fin SF  2931 non-null   int64  
 47  Gr Liv Area      2931 non-null   int64  
 48  Bsmt Full Bath   2929 non-null   float64
 49  Bsmt Half Bath   2929 non-null   float64
 50  Full Bath        2931 non-null   int64  
 51  Half Bath        2931 non-null   int64  
 52  Bedroom AbvGr    2931 non-null   int64  
 53  Kitchen AbvGr    2931 non-null   int64  
 54  Kitchen Qual     2931 non-null   object 
 55  TotRms AbvGrd    2931 non-null   int64  
 56  Functional       2931 non-null   object 
 57  Fireplaces       2931 non-null   int64  
 58  Fireplace Qu     1509 non-null   object 
 59  Garage Type      2774 non-null   object 
 60  Garage Yr Blt    2772 non-null   float64
 61  Garage Finish    2772 non-null   object 
 62  Garage Cars      2930 non-null   float64
 63  Garage Area      2930 non-null   float64
 64  Garage Qual      2772 non-null   object 
 65  Garage Cond      2772 non-null   object 
 66  Paved Drive      2931 non-null   object 
 67  Wood Deck SF     2931 non-null   int64  
 68  Open Porch SF    2931 non-null   int64  
 69  Enclosed Porch   2931 non-null   int64  
 70  3Ssn Porch       2931 non-null   int64  
 71  Screen Porch     2931 non-null   int64  
 72  Pool Area        2931 non-null   int64  
 73  Pool QC          13 non-null     object 
 74  Fence            572 non-null    object 
 75  Misc Feature     106 non-null    object 
 76  Misc Val         2931 non-null   int64  
 77  Mo Sold          2931 non-null   int64  
 78  Yr Sold          2931 non-null   int64  
 79  Sale Type        2931 non-null   object 
 80  Sale Condition   2931 non-null   object 
 81  SalePrice        2931 non-null   int64  
dtypes: float64(11), int64(28), object(43)
memory usage: 1.8+ MB

Visualizing Distributions

Exploring the distribution of variables is vital to detect patterns and outliers. For example, let’s examine the distribution of the Gr Liv Area (above-ground living area) column, which is known to contain outliers:

import matplotlib.pyplot as plt

# Plot a histogram of the Gr Liv Area column
df['Gr Liv Area'].hist(bins=50)
plt.title('Distribution of Gr Liv Area')
plt.xlabel('Gr Liv Area')
plt.ylabel('Frequency')
plt.show()

This visualization shows that most values lie below 4,000, with a few outliers beyond this range. These extreme values can skew our analysis, so we filter them out:

# Filter out rows where Gr Liv Area > 4000
df = df[df['Gr Liv Area'] <= 4000]

# Check the new shape of the dataset
print(df.shape)
(2926, 82)

3. Feature Selection

Feature selection is an essential step in feature engineering, helping to eliminate irrelevant or redundant features that add noise to the dataset. In the Ames Housing Dataset, we identified two columns, Order and PID, that do not provide predictive value. Both columns contain unique identifiers, which do not contribute meaningful patterns for machine learning models. Retaining such features may even degrade model performance.

To remove these columns, we use the drop method:

# Dropping irrelevant columns
df = df.drop(['Order', 'PID'], axis=1)

Here, axis=1 specifies that we are dropping columns (not rows). By reducing unnecessary data, we streamline the dataset and ensure that the model focuses on meaningful features.

4. Feature Transformation

Feature transformation involves reshaping data to improve model performance. For the Ames Housing Dataset, we identified several numerical columns with skewed distributions, which can negatively impact models like linear regression that assume normality. Applying log transformations can make these distributions closer to normal.

Identifying Skewed Features

We first calculate the skewness of all numerical columns:

# Selecting only numeric columns
numeric_cols = df.select_dtypes(include=['number'])

# Checking skewness for numeric columns
skew_vals = numeric_cols.skew()
print(skew_vals)
MS SubClass         1.356915
Lot Frontage        1.113571
Lot Area           13.154455
Overall Qual        0.171750
Overall Cond        0.573267
Year Built         -0.602169
Year Remod/Add     -0.448677
Mas Vnr Area        2.565828
BsmtFin SF 1        0.821599
BsmtFin SF 2        4.136716
Bsmt Unf SF         0.925413
Total Bsmt SF       0.399066
1st Flr SF          0.941362
2nd Flr SF          0.848035
Low Qual Fin SF    12.109736
Gr Liv Area         0.878650
Bsmt Full Bath      0.614609
Bsmt Half Bath      3.966772
Full Bath           0.165577
Half Bath           0.703508
Bedroom AbvGr       0.306780
Kitchen AbvGr       4.310424
TotRms AbvGrd       0.704708
Fireplaces          0.731945
Garage Yr Blt      -0.381364
Garage Cars        -0.220068
Garage Area         0.213447
Wood Deck SF        1.847112
Open Porch SF       2.495300
Enclosed Porch      4.011358
3Ssn Porch         11.395843
Screen Porch        3.954290
Pool Area          18.746993
Misc Val           22.228804
Mo Sold             0.196215
Yr Sold             0.132566
SalePrice           1.590763
dtype: float64

The skewness values indicate that several numeric features in the dataset have significant skewness, with notable examples being Pool Area (18.75), Misc Val (22.23), and Lot Area (13.15), which exhibit extreme right-skew. Features like Mas Vnr Area, Wood Deck SF, and Open Porch SF also have high skewness, suggesting the need for log or similar transformations. In contrast, features such as Overall Qual and Total Bsmt SF have low skewness, and some like Year Built and Year Remod/Add show mild left-skew. Addressing these skewed features will help improve data distribution and model performance.
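To make the effect of a log transformation concrete, here is a quick, illustrative check on the most skewed feature, using NumPy’s log1p (the same function the next step applies in bulk):

import numpy as np

# Skewness of Lot Area before and after log1p (illustrative check)
print(df['Lot Area'].skew())            # strongly right-skewed (about 13, as listed above)
print(np.log1p(df['Lot Area']).skew())  # substantially closer to zero after the transform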

Transforming Multiple Columns

We focus on transforming numeric features with skewness above 0.75, leaving the target SalePrice on its original scale.

skew_limit = 0.75
skewed_features = skew_vals[skew_vals > skew_limit].index

import numpy as np

# Applying log transformation to skewed features
# (the target, SalePrice, is kept on its original scale and considered separately at modeling time)
for col in skewed_features:
    if col == 'SalePrice':
        continue
    df[col] = np.log1p(df[col])

This transformation effectively reduces skewness, improving data distribution for modelling. By focusing only on features with skewness above 0.75, we avoid unnecessary transformations and preserve the original scale of well-behaved features.

Verifying the Transformation

To confirm the changes, you can recheck skewness or visualize the distributions of transformed columns. For example, if you keep a copy of the DataFrame before transforming (df_original = df.copy()), you can plot histograms before and after to observe the improvement:

import matplotlib.pyplot as plt

# Visualizing the distribution of a transformed feature
# (assumes df_original = df.copy() was saved before the log transformation)
feature = 'Lot Area'  # Replace with any skewed feature
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
df_original[feature].hist()
plt.title(f'Original {feature} Distribution')

plt.subplot(1, 2, 2)
df[feature].hist()
plt.title(f'Transformed {feature} Distribution')

plt.show()

The plot shows the effect of log transformation on the Lot Area feature. The original distribution (left) is highly right-skewed, with most values concentrated near zero and a few extreme outliers. After applying the log transformation (right), the distribution becomes approximately normal, reducing skewness and making the feature more suitable for machine learning models that benefit from normally distributed inputs, such as linear regression.

5. Handling Missing Values

Missing values are a common issue in datasets and need to be addressed carefully to avoid impacting model performance. In the Ames Housing Dataset, we start by identifying missing values using isnull() and sum():

# Identifying missing values
missing_values = df.isnull().sum().sort_values(ascending=False)
print(missing_values[missing_values > 0])
Pool QC           2915
Misc Feature      2821
Alley             2728
Fence             2355
Mas Vnr Type      1774
Fireplace Qu      1422
Lot Frontage       490
Garage Cond        159
Garage Yr Blt      159
Garage Finish      159
Garage Qual        159
Garage Type        157
Bsmt Exposure       83
BsmtFin Type 2      81
BsmtFin Type 1      80
Bsmt Cond           80
Bsmt Qual           80
Mas Vnr Area        23
Bsmt Full Bath       2
Bsmt Half Bath       2
Bsmt Unf SF          1
Garage Cars          1
Electrical           1
Total Bsmt SF        1
BsmtFin SF 1         1
Garage Area          1
BsmtFin SF 2         1
dtype: int64

Replacing Missing Values for Features Indicating Absence

Some features have missing values that signify the absence of that feature in the property, not a lack of data during collection. For these features, we replace the missing values with "None":

# Replace missing values with 'None' for categorical features indicating absence
columns_to_fill_with_none = ['Pool QC', 'Misc Feature', 'Alley', 'Fence', 'Fireplace Qu']
for col in columns_to_fill_with_none:
    df[col] = df[col].fillna('None')

This ensures the absence of features like pools, fences, and fireplaces is explicitly encoded and retained for modeling.

For basement-related features, missing values likely indicate the property has no basement. We handle these by replacing:

  • Categorical features with "None".

  • Numerical features with 0.

# Handling basement-related features
basement_categorical = ['Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin Type 2']
for col in basement_categorical:
    df[col] = df[col].fillna('None')

basement_numerical = ['BsmtFin SF 1', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', 'Bsmt Full Bath', 'Bsmt Half Bath']
for col in basement_numerical:
    df[col] = df[col].fillna(0)

This approach ensures that properties without basements are accurately represented without introducing bias.

Similarly, missing values in garage-related features indicate properties without a garage. These are replaced as follows:

  • Categorical features are filled with "None".

  • Numerical features are filled with 0.

# Handling garage-related features
garage_categorical = ['Garage Type', 'Garage Finish', 'Garage Qual', 'Garage Cond']
for col in garage_categorical:
    df[col] = df[col].fillna('None')

garage_numerical = ['Garage Yr Blt', 'Garage Area', 'Garage Cars']
for col in garage_numerical:
    df[col] = df[col].fillna(0)

Replacing Missing Values in Other Features

Features like Lot Frontage and Mas Vnr Area require imputation based on the dataset’s context:

  • Lot Frontage: Missing values are imputed with the median, as it balances the influence of outliers.

  • Mas Vnr Area and Mas Vnr Type: Missing values are replaced with 0 and "None", respectively.

# Filling remaining features
df['Lot Frontage'] = df['Lot Frontage'].fillna(df['Lot Frontage'].median())
df['Mas Vnr Area'] = df['Mas Vnr Area'].fillna(0)
df['Mas Vnr Type'] = df['Mas Vnr Type'].fillna('None')

Replacing Minimal Missing Values

For features with very few missing values, we impute reasonable defaults:

  • Electrical: Replaced with the mode (most common value).

  • Other numerical features with one or two missing values (e.g., Garage Cars, Bsmt Unf SF): already filled with 0 in the basement and garage steps above.

# Filling minimal missing values
df['Electrical'] = df['Electrical'].fillna(df['Electrical'].mode()[0])

Verifying All Missing Values Are Handled

Finally, we ensure that no missing values remain in the dataset:

# Check for remaining missing values
print(df.isnull().sum().max())  # Should return 0
0

By addressing missing values based on their context, we retain critical information while ensuring the dataset is clean and complete. This thoughtful approach enhances data quality and prepares the dataset for further preprocessing and modeling.

6. Creating New Features

Creating new features can significantly improve model performance by capturing non-linear relationships, interactions, and relative deviations within the data. In the Ames Housing Dataset, we explore three key techniques: polynomial features, interaction terms, and relative deviations.

Polynomial Features

Polynomial features allow us to model non-linear relationships by expanding numerical features into higher-order terms and their interactions. For example, squaring a feature or including its interaction with another feature can provide the model with more expressive input data.

Here’s how we generate polynomial features using Scikit-learn’s PolynomialFeatures:

from sklearn.preprocessing import PolynomialFeatures

# Select numerical columns for polynomial transformation
numerical_cols = ['Lot Area', 'Overall Qual']  # Example numerical columns

# Create polynomial features
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[numerical_cols])

# Convert to a DataFrame, aligning with df's index (rows were filtered earlier, so the index has gaps)
poly_feature_names = poly.get_feature_names_out(numerical_cols)
poly_df = pd.DataFrame(poly_features, columns=poly_feature_names, index=df.index)

# Keep only the new higher-order terms; the original columns are already in df
poly_df = poly_df.drop(columns=numerical_cols)

# Add polynomial features to the dataset
df = pd.concat([df, poly_df], axis=1)

Visualizing Polynomial Expansion

If Lot Area and Overall Qual are the input features, the polynomial expansion includes:

  • Lot Area (original feature),

  • Overall Qual (original feature),

  • Lot Area² (squared term),

  • Lot Area × Overall Qual (interaction term),

  • Overall Qual² (squared term).

This expansion transforms two features into five, providing the model with richer information to capture non-linear patterns.
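To see exactly which columns were generated, you can print the feature names returned by the transformer (a quick check; the exact name formatting comes from Scikit-learn):

# Inspect the generated polynomial feature names
print(poly_feature_names)
# e.g. ['Lot Area' 'Overall Qual' 'Lot Area^2' 'Lot Area Overall Qual' 'Overall Qual^2']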

Interaction Features

Interaction terms capture the combined effect of two features. For example, the interaction between Overall Qual and Year Built can represent the compounded impact of quality and property age on the target variable.

Here’s how we create interaction terms manually:

# Creating interaction features
df['OverallQual_YearBuilt'] = df['Overall Qual'] * df['Year Built']
df['QualPerLotArea'] = df['Overall Qual'] / df['Lot Area']

These features encode relationships between variables that the model might otherwise struggle to learn.

Relative Deviations

Relative deviations provide context by comparing a feature’s value to the average within its category. For example, we calculate how much a house’s Overall Qual deviates from the average Overall Qual in its neighborhood.

Using groupby() with agg() and merging the statistics back into the DataFrame, we calculate deviations as follows:

# Calculate mean and standard deviation of Overall Qual within each Neighborhood
neighborhood_stats = df.groupby('Neighborhood')['Overall Qual'].agg(
    mean_neigh='mean',
    std_neigh='std'
).reset_index()

# Merge the stats back into the original dataset
df = df.merge(neighborhood_stats, on='Neighborhood', how='left')

# Calculate the deviation of Overall Qual within each neighborhood
df['Qual_Deviation'] = (df['Overall Qual'] - df['mean_neigh']) / df['std_neigh']

This feature provides insight into whether a house is above or below average quality for its neighborhood, offering valuable context for models.
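The same deviation can be computed more compactly with groupby().transform(), which keeps the original index and avoids the merge. This is an equivalent sketch; note that a neighborhood with a single sale has an undefined standard deviation, which leaves NaN for that row in either approach:

# Equivalent computation using groupby().transform()
neigh_mean = df.groupby('Neighborhood')['Overall Qual'].transform('mean')
neigh_std = df.groupby('Neighborhood')['Overall Qual'].transform('std')
df['Qual_Deviation'] = (df['Overall Qual'] - neigh_mean) / neigh_std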

By generating polynomial features, creating interaction terms, and calculating relative deviations, we enrich the dataset with meaningful information, enabling models to better capture complex relationships in the data. These feature engineering techniques are instrumental in improving predictive performance.

7. Handling Categorical Data

Handling categorical data involves transforming non-numeric features into a format that machine learning models can process effectively. In the Ames Housing Dataset, we reduce the dimensionality of categorical features before applying encoding techniques to ensure a clean and efficient dataset.

Reducing Dimensionality in Categorical Features

Reducing dimensionality in categorical features is crucial for simplifying the dataset and avoiding sparsity when applying one-hot encoding. In the Ames Housing Dataset, the Neighborhood and House Style features have several unique categories, including some with very low frequencies. To address this, we group low-frequency categories into an "Other" category before applying encoding.

The Neighborhood variable contains 28 unique values, representing different areas of the city. Here are their counts:

df['Neighborhood'].value_counts()
Neighborhood
NAmes      444
CollgCr    267
OldTown    239
Edwards    191
Somerst    182
NridgHt    166
Gilbert    165
Sawyer     151
NWAmes     131
SawyerW    125
Mitchel    114
BrkSide    108
Crawfor    103
IDOTRR      93
Timber      72
NoRidge     69
StoneBr     51
SWISU       48
ClearCr     44
MeadowV     37
BrDale      30
Blmngtn     28
Veenker     24
NPkVill     23
Blueste     10
Greens       8
GrnHill      2
Landmrk      1
Name: count, dtype: int64

Some categories, such as Landmrk (1 occurrence) and GrnHill (2 occurrences), are extremely rare. Including these as separate one-hot encoded columns would add unnecessary complexity and sparsity. To address this, we group categories with fewer than 10 occurrences into an "Other" category:

# Group low-frequency neighborhoods into "Other"
low_count_neighborhoods = df['Neighborhood'].value_counts()[df['Neighborhood'].value_counts() < 10].index
df['Neighborhood'] = df['Neighborhood'].replace(low_count_neighborhoods, 'Other')
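A quick check confirms the reduction in categories; given the counts above, 26 values should remain (25 frequent neighborhoods plus "Other"):

# Number of neighborhood categories after grouping rare ones into 'Other'
print(df['Neighborhood'].nunique())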

The House Style variable contains 8 unique values, each representing a different architectural style. Their frequencies are as follows:

df['House Style'].value_counts()
House Style
1Story    1481
2Story     869
1.5Fin     314
SLvl       128
SFoyer      83
2.5Unf      24
1.5Unf      19
2.5Fin       8
Name: count, dtype: int64

Categories like 2.5Fin (8 occurrences) and 1.5Unf (19 occurrences) are rare and can add noise when one-hot encoded. To simplify the dataset, we group categories with fewer than 50 occurrences into an "Other" category:

# Group low-frequency house styles into "Other"
low_count_styles = df['House Style'].value_counts()[df['House Style'].value_counts() < 50].index
df['House Style'] = df['House Style'].replace(low_count_styles, 'Other')

After this transformation, the unique values in House Style are reduced to:

  • 1Story

  • 2Story

  • 1.5Fin

  • SLvl

  • SFoyer

  • Other

One-Hot Encoding

After reducing dimensionality, we apply one-hot encoding to transform categorical features into binary columns. For example, the Neighborhood feature is converted into multiple binary columns representing each category while avoiding multicollinearity by dropping the first category:

# Apply one-hot encoding to categorical features
df = pd.get_dummies(df, columns=['Neighborhood', 'House Style'], drop_first=True)

  • columns: Specifies the categorical features to encode.

  • drop_first=True: Removes one category per feature to avoid multicollinearity in linear models.
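A quick look at the resulting column names confirms the encoding (an illustrative check; get_dummies prefixes each binary column with the original feature name):

# List a few of the new binary columns created by one-hot encoding
encoded_cols = [c for c in df.columns if c.startswith(('Neighborhood_', 'House Style_'))]
print(len(encoded_cols), encoded_cols[:5])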

Encoding Ordinal Features

Some categorical features, like Exter Qual (Exterior Quality), have an intrinsic order. In the Ames data these ratings are recorded as abbreviated codes (Po < Fa < TA < Gd < Ex, i.e., Poor to Excellent). For such features, ordinal encoding preserves the order:

# Ordinal encoding for Exter Qual (Po=Poor, Fa=Fair, TA=Typical/Average, Gd=Good, Ex=Excellent)
exter_qual_map = {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
df['Exter Qual'] = df['Exter Qual'].map(exter_qual_map)

This replaces the categories with their corresponding numerical values, maintaining the ordinal relationship.
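The same scale can be reused for other quality ratings recorded with the Po–Ex codes (a sketch; which columns to encode ordinally is a modeling choice):

# Apply the same ordinal scale to other quality-style columns (illustrative selection)
for col in ['Exter Cond', 'Heating QC', 'Kitchen Qual']:
    df[col] = df[col].map(exter_qual_map)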

8. Correlation Analysis

Given the large number of features in the dataset, a full heatmap may be overwhelming and difficult to interpret. Instead, we focus on identifying features most strongly correlated with the target variable, SalePrice, using a sorted list of correlation values. This approach simplifies the analysis and highlights the most relevant features.

Correlation with SalePrice

We extract and sort the correlation values of all features with SalePrice to identify the strongest relationships:

# Calculate the correlation matrix
correlation_matrix = df.corr(numeric_only=True)

# Extract and sort correlations with SalePrice
target_corr = correlation_matrix['SalePrice'].sort_values(ascending=False)
print(target_corr)
SalePrice                1.000000
OverallQual_YearBuilt    0.834162
Overall Qual             0.827489
QualPerLotArea           0.733817
mean_neigh               0.730479
                           ...   
Neighborhood_Edwards    -0.194731
Enclosed Porch          -0.220051
Neighborhood_IDOTRR     -0.243680
Neighborhood_OldTown    -0.255239
Exter Qual                    NaN
Name: SalePrice, Length: 76, dtype: float64

Selecting the Top Correlated Features

Based on the sorted correlations, we select the top features with high absolute correlation values (e.g., above 0.5):

# Select top correlated features with SalePrice
top_features = target_corr[target_corr.abs() > 0.5]
print("Top features correlated with SalePrice:")
print(top_features)
Top features correlated with SalePrice:
SalePrice                1.000000
OverallQual_YearBuilt    0.834162
Overall Qual             0.827489
QualPerLotArea           0.733817
mean_neigh               0.730479
Gr Liv Area              0.726131
Lot Area Overall Qual    0.692244
Garage Cars              0.674913
Overall Qual^2           0.668838
Garage Area              0.653704
Total Bsmt SF            0.649027
Year Built               0.616685
1st Flr SF               0.616530
Year Remod/Add           0.587420
Full Bath                0.574949
Name: SalePrice, dtype: float64

This gives a concise list of features most strongly associated with SalePrice.

Analyze Relationships Between Features

To evaluate the relationships between features (including multicollinearity), we can create a heatmap of the correlation matrix for the top features:

import seaborn as sns

# Calculate the correlation matrix for the top features
correlation_matrix = df[top_features.index].corr()

# Plot the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)
plt.title("Correlation Heatmap of Top Features")
plt.show()

This heatmap highlights the correlations between the top features and SalePrice, as well as between the features themselves. Key observations include:

  1. Strong Correlation with SalePrice: Features like Overall Qual, Gr Liv Area, OverallQual_YearBuilt, and Garage Cars show strong positive correlations with SalePrice (above 0.7), indicating their importance for predicting house prices.

  2. Multicollinearity:

  • Garage Cars and Garage Area are highly correlated (0.89), suggesting redundancy. Retaining both could lead to multicollinearity issues.

  • Overall Qual and its engineered features (OverallQual_YearBuilt, Overall Qual^2) are also highly correlated, reflecting overlapping information.

  3. Moderate Correlation: Features like Year Built and Year Remod/Add exhibit moderate correlations with SalePrice (~0.6) and overlap to some extent.

This analysis suggests focusing on highly correlated features while addressing multicollinearity through regularization or feature reduction.
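As one concrete way to act on this, you could drop one member of a highly correlated pair before modeling. This is a sketch only: df_reduced is an illustrative name, and regularized models such as Ridge or Lasso are an alternative to dropping features outright.

# Drop one feature from a redundant pair (Garage Cars and Garage Area are highly correlated)
df_reduced = df.drop(columns=['Garage Area'])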

Pair Plot for Top Features

To visualize the relationships between the top correlated features and SalePrice, we use a pair plot:

# Extract the list of top correlated features
top_corr_features = top_features.index.tolist()

# Define the maximum number of subplots per row
max_subplots_per_row = 5

# Exclude the target SalePrice itself (it is the first entry of the sorted list)
features = [f for f in top_corr_features if f != 'SalePrice']

# Split features into groups of 5
feature_groups = [features[i:i + max_subplots_per_row] for i in range(0, len(features), max_subplots_per_row)]

# Plot each group in separate rows
for group_idx, feature_group in enumerate(feature_groups):
    plt.figure(figsize=(20, 4))  # Adjust figure size based on the number of subplots
    for i, feature in enumerate(feature_group):
        plt.subplot(1, len(feature_group), i + 1)
        sns.scatterplot(data=df, x=feature, y='SalePrice', alpha=0.1, edgecolor='none')
        plt.title(feature)
    plt.suptitle(f"Pair Plot of Top Features with SalePrice (Row {group_idx + 1})", y=1.02)
    plt.tight_layout()
    plt.show()

1. What do these plots tell us about the distribution of the target?

The target variable, SalePrice, appears right-skewed when plotted against some features, such as Year Built and Garage Area, where most data points cluster at lower SalePrice values. However, features like Overall Qual and Gr Liv Area exhibit a broader, more even spread of SalePrice, indicating strong linear relationships across the range. This mixed behavior suggests that while linear regression may be effective, the skewness in specific features could still bias the model residuals, making a log transformation of SalePrice or of those specific features a useful step.

2. What do these plots tell us about the relationship between the features and the target? Do you think that linear regression is well-suited to this problem? Do any feature transformations come to mind?

Most features, such as Overall Qual, Gr Liv Area, and QualPerLotArea, exhibit positive linear relationships with SalePrice. These relationships suggest that linear regression is well-suited to model these features. However, some features, such as Year Built and Year Remod/Add, exhibit more subtle trends, with some noise at higher SalePrice values. For these, interaction terms (e.g., OverallQual_YearBuilt) or polynomial features (e.g., Overall Qual^2) could improve model fit. The log transformation already applied to some features (e.g., Lot Area) effectively normalizes skewed relationships, aligning them better with SalePrice.

9. Conclusion

In this blog, we explored the Ames Housing dataset, focusing on feature engineering and its importance in preparing data for predictive modeling. We performed extensive Exploratory Data Analysis (EDA), handled missing values, applied transformations to address skewness, engineered new features, and analyzed correlations to understand relationships between variables. These steps ensured the dataset was ready for building robust machine learning models. The insights gained, such as identifying critical features (Overall Qual, Gr Liv Area, etc.) and handling multicollinearity, provide a strong foundation for accurate predictions.

10. Next Steps

The next steps involve leveraging the preprocessed and feature-engineered dataset to train machine learning models. Key actions include:

  1. Splitting the dataset into training and testing sets.

  2. Applying machine learning algorithms, starting with linear regression, to assess the model’s performance (a minimal sketch follows after this list).

  3. Exploring advanced techniques like Ridge, Lasso, or Gradient Boosting to address multicollinearity and improve predictive accuracy.

  4. Validating the model using cross-validation techniques.

  5. Deploying the final model for real-world applications, such as house price prediction.
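To make steps 1 and 2 concrete, here is a minimal baseline sketch. It assumes the preprocessing above has been applied; the extra get_dummies call, the fillna(0) safeguard, and the R² metric are illustrative choices rather than part of the original pipeline:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Log-transform the target to reduce skew (invert predictions with np.expm1 if needed)
y = np.log1p(df['SalePrice'])
X = df.drop(columns=['SalePrice'])

# Encode any remaining categorical columns and patch the few residual NaNs
# (e.g., the per-neighborhood std is undefined for single-sale neighborhoods)
X = pd.get_dummies(X, drop_first=True)
X = X.fillna(0)

# Hold out a test set, fit a baseline linear regression, and evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print('Test R^2:', r2_score(y_test, model.predict(X_test)))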


Appendix

Code: https://github.com/Minhhoang2606/Exploratory-Data-Analysis-for-Machine-Learning/blob/master/module%203/Feature%20Engineering/01d_DEMO_Feature_Engineering.ipynb

Data: https://github.com/Minhhoang2606/Exploratory-Data-Analysis-for-Machine-Learning/blob/master/data/Ames_Housing_Data1.tsv