Predicting Flight Delays with Machine Learning: A Comprehensive Guide
1. Introduction
Flight delays are a major inconvenience, affecting millions of passengers annually and causing financial strain on airlines. Delays disrupt travel schedules, increase operational costs, and often result in dissatisfied customers. For airlines, the ability to predict flight delays in advance can significantly improve scheduling efficiency, resource allocation, and customer satisfaction.
In this project, we explore how machine learning can be used to predict flight delays using historical data. By analyzing patterns and trends in delay data, we aim to build robust predictive models that not only forecast delay durations but also identify key factors contributing to these delays. Our goal is to provide actionable insights that airlines can leverage to minimize delays and enhance the passenger experience.
2. Understanding the Dataset
To begin the project, we first import the dataset and explore its general structure and content. This step helps us understand the data we're working with and identify any potential issues that need to be addressed.
Importing the Dataset
We start by loading the dataset using Python's pandas library:
import pandas as pd
# Load the dataset
data = pd.read_csv('final_data.csv')
# Display the first few rows
print(data.head())
<ipython-input-2-21709313848c>:9: DtypeWarning: Columns (0,1,3,4,10,11,13,19,20,21,22,30,36,41,48) have mixed types. Specify dtype option on import or set low_memory=False.
data = pd.read_csv('final_data.csv')
YEAR QUARTER MONTH DAY_OF_MONTH DAY_OF_WEEK FL_DATE UNIQUE_CARRIER \
0 2016 1 1 6 3 2016-01-06 AA
1 2016 1 1 7 4 2016-01-07 AA
2 2016 1 1 8 5 2016-01-08 AA
3 2016 1 1 9 6 2016-01-09 AA
4 2016 1 1 10 7 2016-01-10 AA
AIRLINE_ID CARRIER TAIL_NUM ... DISTANCE_GROUP CARRIER_DELAY \
0 19805 AA N4YBAA ... 4.0 NaN
1 19805 AA N434AA ... 4.0 NaN
2 19805 AA N541AA ... 4.0 NaN
3 19805 AA N489AA ... 4.0 NaN
4 19805 AA N439AA ... 4.0 0.0
WEATHER_DELAY NAS_DELAY SECURITY_DELAY LATE_AIRCRAFT_DELAY FIRST_DEP_TIME \
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 0.0 47.0 0.0 66.0 NaN
TOTAL_ADD_GTIME LONGEST_ADD_GTIME Unnamed: 64
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
[5 rows x 65 columns]
# Check the structure of the dataset
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5635978 entries, 0 to 5635977
Data columns (total 65 columns):
# Column Dtype
--- ------ -----
0 YEAR object
1 QUARTER object
2 MONTH int64
3 DAY_OF_MONTH object
4 DAY_OF_WEEK object
5 FL_DATE object
6 UNIQUE_CARRIER object
7 AIRLINE_ID int64
8 CARRIER object
9 TAIL_NUM object
10 FL_NUM object
11 ORIGIN_AIRPORT_ID object
12 ORIGIN_AIRPORT_SEQ_ID int64
13 ORIGIN_CITY_MARKET_ID object
14 ORIGIN object
15 ORIGIN_CITY_NAME object
16 ORIGIN_STATE_ABR object
17 ORIGIN_STATE_FIPS float64
18 ORIGIN_STATE_NM object
19 ORIGIN_WAC object
20 DEST_AIRPORT_ID object
21 DEST_AIRPORT_SEQ_ID object
22 DEST_CITY_MARKET_ID object
23 DEST object
24 DEST_CITY_NAME object
25 DEST_STATE_ABR object
26 DEST_STATE_FIPS float64
27 DEST_STATE_NM object
28 DEST_WAC float64
29 CRS_DEP_TIME float64
30 DEP_TIME object
31 DEP_DELAY float64
32 DEP_DELAY_NEW float64
33 DEP_DEL15 float64
34 DEP_DELAY_GROUP float64
35 DEP_TIME_BLK object
36 TAXI_OUT object
37 WHEELS_OFF float64
38 WHEELS_ON float64
39 TAXI_IN float64
40 CRS_ARR_TIME float64
41 ARR_TIME object
42 ARR_DELAY float64
43 ARR_DELAY_NEW float64
44 ARR_DEL15 float64
45 ARR_DELAY_GROUP float64
46 ARR_TIME_BLK object
47 CANCELLED float64
48 CANCELLATION_CODE object
49 DIVERTED float64
50 CRS_ELAPSED_TIME float64
51 ACTUAL_ELAPSED_TIME float64
52 AIR_TIME float64
53 FLIGHTS float64
54 DISTANCE float64
55 DISTANCE_GROUP float64
56 CARRIER_DELAY float64
57 WEATHER_DELAY float64
58 NAS_DELAY float64
59 SECURITY_DELAY float64
60 LATE_AIRCRAFT_DELAY float64
61 FIRST_DEP_TIME float64
62 TOTAL_ADD_GTIME float64
63 LONGEST_ADD_GTIME float64
64 Unnamed: 64 float64
dtypes: float64(33), int64(3), object(29)
memory usage: 2.7+ GB
None
# Check for missing values
print(data.isnull().sum())
YEAR 0
QUARTER 0
MONTH 0
DAY_OF_MONTH 0
DAY_OF_WEEK 0
...
LATE_AIRCRAFT_DELAY 4667538
FIRST_DEP_TIME 5601445
TOTAL_ADD_GTIME 5601445
LONGEST_ADD_GTIME 5601445
Unnamed: 64 5635978
Length: 65, dtype: int64
This dataset is composed of the following variables:
Year 2016
Month 1-12
DayofMonth 1-31
DayOfWeek 1 (Monday) - 7 (Sunday)
DepTime actual departure time (local, hhmm)
CRSDepTime scheduled departure time (local, hhmm)
ArrTime actual arrival time (local, hhmm)
CRSArrTime scheduled arrival time (local, hhmm)
UniqueCarrier unique carrier code
FlightNum flight number
TailNum plane tail number: aircraft registration, unique aircraft identifier
ActualElapsedTime in minutes
CRSElapsedTime in minutes
AirTime in minutes
ArrDelay arrival delay, in minutes: A flight is counted as "on time" if it operated less than 15 minutes later than the scheduled time shown in the carriers' Computerized Reservations Systems (CRS).
DepDelay departure delay, in minutes
Origin IATA airport code
Dest destination IATA airport code
Distance in miles
TaxiIn taxi in time, in minutes
TaxiOut taxi out time in minutes
Cancelled was the flight canceled? 1 = yes, 0 = no
CancellationCode reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
Diverted 1 = yes, 0 = no
CarrierDelay in minutes: Carrier delay is within the control of the air carrier. Examples of occurrences that may determine carrier delay are: aircraft cleaning, aircraft damage, awaiting the arrival of connecting passengers or crew, baggage, bird strike, cargo loading, catering, computer, outage-carrier equipment, crew legality (pilot or attendant rest), damage by hazardous goods, engineering inspection, fueling, handling disabled passengers, late crew, lavatory servicing, maintenance, oversales, potable water servicing, removal of unruly passenger, slow boarding or seating, stowing carry-on baggage, weight and balance delays.
WeatherDelay in minutes: Weather delay is caused by extreme or hazardous weather conditions that are forecasted or manifest themselves on point of departure, enroute, or on point of arrival.
NASDelay in minutes: Delay that is within the control of the National Airspace System (NAS) may include: non-extreme weather conditions, airport operations, heavy traffic volume, air traffic control, etc.
SecurityDelay in minutes: Security delay is caused by evacuation of a terminal or concourse, re-boarding of aircraft because of security breach, inoperative screening equipment and/or long lines in excess of 29 minutes at screening areas.
LateAircraftDelay in minutes: Arrival delay at an airport due to the late arrival of the same aircraft at a previous airport. The ripple effect of an earlier delay at downstream airports is referred to as delay propagation.
Analysis of Dataset Structure and Missing Values
1. Dataset Structure
The dataset contains 5,635,978 rows and 65 columns. Here's a concise summary of the structure:
Data Types:
33 columns are of float64 type, likely representing numerical data such as delays, distances, and time values.
29 columns are of object type, including categorical features like CARRIER, ORIGIN, and FL_DATE.
3 columns are of int64 type, likely representing IDs or counts (e.g., AIRLINE_ID and MONTH).
Memory Usage: The dataset occupies 2.7+ GB in memory, indicating that this is a large dataset and might require optimization techniques (e.g., downcasting data types or using chunk processing) for efficient handling.
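The optimization techniques mentioned above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the loading code used for the project: the column subset, the chosen dtypes, and the chunk size of 250 rows are all assumptions made for the example.

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for the flight CSV (values are made up for illustration)
rng = np.random.default_rng(0)
n = 1_000
raw = pd.DataFrame({
    "MONTH": rng.integers(1, 13, n),
    "CARRIER": rng.choice(["AA", "DL", "WN"], n),
    "DISTANCE": rng.integers(100, 3000, n).astype(float),
    "DEP_DELAY": rng.normal(10, 30, n).round(),
})
csv_text = raw.to_csv(index=False)

# Baseline: let pandas infer the dtypes (int64, object, float64)
default = pd.read_csv(io.StringIO(csv_text))

# Declaring compact dtypes on import avoids type guessing and saves memory
dtypes = {
    "MONTH": "int8",
    "CARRIER": "category",
    "DISTANCE": "float32",
    "DEP_DELAY": "float32",
}
compact = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

print(default.memory_usage(deep=True).sum(), "bytes with inferred dtypes")
print(compact.memory_usage(deep=True).sum(), "bytes with explicit dtypes")

# Chunked processing: aggregate without holding the whole file in memory
total, count = 0.0, 0
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["DEP_DELAY"], chunksize=250):
    total += chunk["DEP_DELAY"].sum()
    count += chunk["DEP_DELAY"].count()
mean_delay = total / count
print("Mean departure delay:", mean_delay)
```

Passing explicit dtypes to read_csv is also one way to address the DtypeWarning raised on import, provided the malformed rows that cause the mixed types are cleaned first.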
Observations on Specific Columns:
Key time-related columns such as DEP_TIME and ARR_TIME are stored as object types, which might require conversion to appropriate datetime formats.
Some columns, such as Unnamed: 64, appear to be placeholders or irrelevant features, as they are completely null.
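A rough sketch of both observations, using a few made-up rows that mimic the mixed-type columns. The helper column DEP_MINUTES is a hypothetical name introduced only for this example, and expressing hhmm values as minutes after midnight is one possible representation among several.

```python
import numpy as np
import pandas as pd

# Made-up rows mimicking the mixed-type columns described above
df = pd.DataFrame({
    "FL_DATE": ["2016-01-06", "2016-01-07", "2016-01-08"],
    "DEP_TIME": ["1057", 5.0, "N707EV"],  # strings, floats, and a stray value
    "Unnamed: 64": [np.nan, np.nan, np.nan],
})

# Drop columns that contain no data at all
df = df.dropna(axis="columns", how="all")

# Parse the flight date into a real datetime column
df["FL_DATE"] = pd.to_datetime(df["FL_DATE"], format="%Y-%m-%d")

# Coerce hhmm values to numbers; anything unparsable becomes NaN
hhmm = pd.to_numeric(df["DEP_TIME"], errors="coerce")

# Split hhmm into hours and minutes, then express as minutes after midnight
df["DEP_MINUTES"] = (hhmm // 100) * 60 + (hhmm % 100)

print(df)
```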
2. Missing Values
The output of data.isnull().sum() reveals the following about missing values in the dataset:
Columns with Significant Missing Values:
Features related to delay causes, such as CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, and LATE_AIRCRAFT_DELAY, have over 4.66 million missing entries, suggesting that a cause breakdown is recorded only for a subset of flights (most plausibly those that were actually delayed).
Operational details like FIRST_DEP_TIME, TOTAL_ADD_GTIME, and LONGEST_ADD_GTIME have over 5.6 million missing values, likely because they apply only to a subset of flights (e.g., delayed or rescheduled flights).
The column Unnamed: 64 is entirely null and should be dropped as it adds no value.
Columns with No Missing Values:
Essential features like YEAR, QUARTER, MONTH, DAY_OF_MONTH, DAY_OF_WEEK, and DISTANCE are complete, which is crucial for basic analysis and modeling.
Impact of Missing Values:
Missing data in delay-related columns can limit the scope of analysis for specific delay causes. Imputation or targeted removal of rows/columns might be necessary, depending on the focus of the study.
Features with excessive missing values (e.g., Unnamed: 64) or those not relevant to the objective should be removed during preprocessing.
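One way the removal and imputation options above might look in practice is sketched below on a handful of made-up rows. The 90% cutoff (MAX_MISSING) is a hypothetical threshold, and treating a missing delay cause as 0 minutes for flights that arrived less than 15 minutes late is an assumption for illustration, not a rule confirmed by the dataset documentation.

```python
import numpy as np
import pandas as pd

# Made-up rows: delay causes are filled in only for the late flights
df = pd.DataFrame({
    "ARR_DELAY": [-5.0, 113.0, 2.0, 40.0],
    "CARRIER_DELAY": [np.nan, 0.0, np.nan, 40.0],
    "LATE_AIRCRAFT_DELAY": [np.nan, 66.0, np.nan, 0.0],
    "FIRST_DEP_TIME": [np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values per column, largest first
missing_share = df.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical cutoff: drop columns missing in more than 90% of rows
MAX_MISSING = 0.9
df = df.drop(columns=missing_share[missing_share > MAX_MISSING].index)

# Assumption: a missing cause on a flight under 15 minutes late means 0 minutes
cause_cols = ["CARRIER_DELAY", "LATE_AIRCRAFT_DELAY"]
on_time = df["ARR_DELAY"] < 15
df.loc[on_time, cause_cols] = df.loc[on_time, cause_cols].fillna(0)

print(df)
```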
Key Insights:
The dataset is rich in features, but large portions of certain columns contain missing values, which need to be handled appropriately.
Conversion of object columns to numeric or datetime formats is required to facilitate analysis and modeling.
Dropping irrelevant or entirely null columns (e.g., Unnamed: 64) is an essential preprocessing step to optimize dataset handling.
By addressing these issues in the preprocessing stage, the dataset can be better prepared for Exploratory Data Analysis (EDA) and machine learning tasks.
3. Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) helps uncover patterns, trends, and relationships in the data. This section includes a focused analysis of delays, cancellations, and other key variables.
Separating Canceled Flights
To analyze cancellations separately, we create a subset of the dataset containing only canceled flights. We also handle missing values in the TAXI_OUT column by filling them with 0, as canceled flights never proceed to taxi-out.
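# Fill missing values in TAXI_OUT with 0
# (assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas)
data['TAXI_OUT'] = data['TAXI_OUT'].fillna(0)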
# Create a subset of canceled flights
canceled_flights = data[data['CANCELLED'] == 1]
# Display the last few rows of the canceled flights data
print(canceled_flights.tail())
YEAR QUARTER MONTH DAY_OF_MONTH DAY_OF_WEEK FL_DATE \
5633120 2016 4 12 16 5 2016-12-16
5633121 2016 4 12 16 5 2016-12-16
5633165 2016 4 12 16 5 2016-12-16
5633168 2016 4 12 16 5 2016-12-16
5633710 2016 4 12 30 5 2016-12-30
UNIQUE_CARRIER AIRLINE_ID CARRIER TAIL_NUM ... DISTANCE_GROUP \
5633120 WN 19393 WN N7846A ... 3.0
5633121 WN 19393 WN N560WN ... 3.0
5633165 WN 19393 WN N713SW ... 3.0
5633168 WN 19393 WN N422WN ... 3.0
5633710 WN 19393 WN N8669B ... 4.0
CARRIER_DELAY WEATHER_DELAY NAS_DELAY SECURITY_DELAY \
5633120 NaN NaN NaN NaN
5633121 NaN NaN NaN NaN
5633165 NaN NaN NaN NaN
5633168 NaN NaN NaN NaN
5633710 NaN NaN NaN NaN
LATE_AIRCRAFT_DELAY FIRST_DEP_TIME TOTAL_ADD_GTIME LONGEST_ADD_GTIME \
5633120 NaN NaN NaN NaN
5633121 NaN NaN NaN NaN
5633165 NaN NaN NaN NaN
5633168 NaN NaN NaN NaN
5633710 NaN NaN NaN NaN
Unnamed: 64
5633120 NaN
5633121 NaN
5633165 NaN
5633168 NaN
5633710 NaN
[5 rows x 65 columns]
The canceled_flights DataFrame includes all rows where CANCELLED = 1. This separation allows us to analyze cancellation trends and reasons independently.
Average Departure Delay by Day of the Week
Analyzing delays by the day of the week helps uncover patterns in departure delays.
import matplotlib.pyplot as plt
import seaborn as sns
# Group data by Day of the Week and calculate mean delay
day_delay = data.groupby('DAY_OF_WEEK')['DEP_DELAY'].mean()
# Plot the results
plt.figure(figsize=(8, 5))
sns.barplot(x=day_delay.index, y=day_delay.values, palette="viridis")
plt.title('Average Departure Delay by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Delay (minutes)')
plt.show()
The bar chart illustrates the average departure delay by the day of the week, with noticeable variations. Delays peak on certain days, such as day 4 (Thursday) and day 7 (Sunday), possibly due to higher traffic volumes or operational inefficiencies. Conversely, delays are relatively lower on day 6 (Saturday), indicating less congestion or smoother operations. The stray category labeled "N707EV" looks like an aircraft tail number rather than a day of the week, which points to malformed rows in the DAY_OF_WEEK column (consistent with the mixed-type warning raised on import) and should be cleaned before drawing firm conclusions. This analysis highlights temporal patterns that could inform scheduling optimizations.
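A minimal sketch of how such a stray category could be filtered out before grouping, shown on a few made-up rows. It assumes that valid DAY_OF_WEEK codes are the integers 1 to 7 and that any other value marks a malformed row that can be discarded.

```python
import pandas as pd

# Made-up rows: mixed types in DAY_OF_WEEK, including a stray tail number
df = pd.DataFrame({
    "DAY_OF_WEEK": [3, "4", "N707EV", 7.0, "4"],
    "DEP_DELAY": [10.0, 20.0, 999.0, 5.0, 40.0],
})

# Convert to numbers; non-numeric entries such as "N707EV" become NaN
dow = pd.to_numeric(df["DAY_OF_WEEK"], errors="coerce")

# Keep only rows whose day code is a valid 1-7 value
valid = dow.between(1, 7)
clean = df.loc[valid].assign(DAY_OF_WEEK=dow[valid].astype("int8"))

# The grouping now produces one bar per real weekday
day_delay = clean.groupby("DAY_OF_WEEK")["DEP_DELAY"].mean()
print(day_delay)
```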
Percentage of Canceled Flights by Day of the Week
Here, we calculate the percentage of canceled flights for each day of the week and display the results in a bar chart.
import numpy as np
import matplotlib.pyplot as plt
# Count canceled flights per day of the week
days_canceled = canceled_flights.groupby('DAY_OF_WEEK')['CANCELLED'].count()
# Count all flights per day of the week
days_total = data.groupby('DAY_OF_WEEK')['CANCELLED'].count()
# Fraction of flights canceled per day (converted to a percentage when plotting)
days_frac = np.divide(days_canceled, days_total)
# Define labels for the x-axis (assumes DAY_OF_WEEK holds only the codes 1-7)
week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Plot the percentage of cancellations
fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(days_frac.index, days_frac * 100, align='center')
ax.set_ylabel('Percentage of Flights Canceled')
ax.set_xticks(days_frac.index)
ax.set_xticklabels(week, rotation=45)
plt.title('Percentage of Canceled Flights by Day of the Week')
plt.show()
The bar chart shows the percentage of canceled flights by day of the week. Cancellation rates are highest on Friday and Saturday, likely due to increased air traffic and operational challenges during peak travel times. Tuesday has the lowest percentage, indicating smoother operations on less congested days. The pattern highlights how weekday schedules and traffic volumes impact cancellation rates.
Comparing All Flights vs. Canceled Flights
This step compares the total number of flights (All) and canceled flights (Canceled) for key features like DISTANCE, DAY_OF_WEEK, or ORIGIN.
import matplotlib.pyplot as plt
# Specify the feature to compare
feature_name = 'DISTANCE'  # Replace with the desired column name, e.g., 'DAY_OF_WEEK' or 'ORIGIN'
# Ensure the feature exists in both DataFrames
if feature_name in data.columns and feature_name in canceled_flights.columns:
    # Count how often each value occurs among all flights and among canceled flights
    all_counts = data[feature_name].value_counts()
    canceled_counts = canceled_flights[feature_name].value_counts()
    plt.figure(figsize=(12, 6))
    plt.bar(all_counts.index, all_counts.values,
            label='All Flights', alpha=0.7, color='blue')
    plt.bar(canceled_counts.index, canceled_counts.values,
            label='Canceled Flights', alpha=0.7, color='orange')
    plt.xlabel(feature_name)
    plt.ylabel('Count')
    plt.title(f'Comparison of All Flights vs. Canceled Flights by {feature_name}')
    plt.legend()
    plt.xticks(rotation=45)
    plt.show()
else:
    print(f"The feature '{feature_name}' is not present in the dataset.")
4. Data Preprocessing
Handling missing values.
Encoding categorical features.
Creating additional features, such as time of day or cumulative delays, to enhance the model.
Scaling and normalizing numerical features (if required).
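The four preprocessing steps listed above could be sketched along these lines. The example runs on a few made-up rows; the median imputation, the four time-of-day buckets, and the DEP_PERIOD column name are assumptions chosen for illustration rather than the choices made in this project.

```python
import numpy as np
import pandas as pd

# Made-up rows with one categorical, one hhmm time, and one numeric column
df = pd.DataFrame({
    "CARRIER": ["AA", "WN", "AA", "DL"],
    "CRS_DEP_TIME": [600.0, 1345.0, 1930.0, 2315.0],
    "DISTANCE": [2475.0, 239.0, 733.0, np.nan],
})

# 1. Missing values: fill numeric gaps with the column median
df["DISTANCE"] = df["DISTANCE"].fillna(df["DISTANCE"].median())

# 2. New feature: bucket the scheduled departure hour into a time of day
dep_hour = (df["CRS_DEP_TIME"] // 100).astype(int)
df["DEP_PERIOD"] = pd.cut(
    dep_hour,
    bins=[0, 6, 12, 18, 24],
    right=False,  # intervals are [0, 6), [6, 12), [12, 18), [18, 24)
    labels=["night", "morning", "afternoon", "evening"],
)

# 3. Encoding: one-hot encode the categorical columns
df = pd.get_dummies(df, columns=["CARRIER", "DEP_PERIOD"])

# 4. Scaling: standardize distance to zero mean and unit variance
df["DISTANCE"] = (df["DISTANCE"] - df["DISTANCE"].mean()) / df["DISTANCE"].std()

print(df.head())
```

When a model is trained on a train/test split, the median and the scaling statistics would normally be computed on the training portion only and then applied to the test portion, so that no information leaks from the held-out data.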
5. Machine Learning Models Training and Evaluation
Random Forest Classifier
Description of the model and its role in predicting cancellations or delays.
Key performance metrics (e.g., accuracy, F1-score).
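As a rough sketch of what this classification step might look like, the example below trains scikit-learn's RandomForestClassifier on synthetic data and reports accuracy and F1-score. Everything about the data is an assumption: the three features, the made-up rule that evening departures are delayed more often, and the hyperparameters are placeholders, not results or settings from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic features standing in for real flight attributes
rng = np.random.default_rng(42)
n = 2_000
dep_hour = rng.integers(0, 24, n)
distance = rng.uniform(100, 3000, n)
day_of_week = rng.integers(1, 8, n)

# Made-up rule: evening departures are more likely to be delayed
p_delay = 0.1 + 0.5 * (dep_hour >= 17)
delayed = (rng.random(n) < p_delay).astype(int)

feature_names = ["dep_hour", "distance", "day_of_week"]
X = np.column_stack([dep_hour, distance, day_of_week])

# Stratify so both splits keep the same share of delayed flights
X_train, X_test, y_train, y_test = train_test_split(
    X, delayed, test_size=0.25, random_state=42, stratify=delayed
)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
importances = dict(zip(feature_names, model.feature_importances_))

print(f"Accuracy: {accuracy:.3f}  F1-score: {f1:.3f}")
print("Feature importances:", importances)
```

Because delayed flights are the minority class, accuracy alone can look good even for a model that rarely predicts a delay, which is why the F1-score is reported alongside it.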
Delay Time Regression Analysis:
Overview of regression techniques for predicting delay durations.
Results and evaluation metrics (e.g., RMSE, R²).
Visualizing predicted vs actual delay times.
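The regression metrics named above can be computed as in the following sketch. Since the article does not specify a regression technique, plain least squares (via NumPy) is swapped in as a stand-in, and the data is synthetic with a made-up relationship between departure delay, taxi-out time, and arrival delay.

```python
import numpy as np

# Synthetic data with a made-up linear relationship plus noise
rng = np.random.default_rng(0)
n = 500
dep_delay = rng.normal(10, 30, n)
taxi_out = rng.uniform(5, 40, n)
arr_delay = 0.9 * dep_delay + 0.5 * taxi_out - 8 + rng.normal(0, 5, n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), dep_delay, taxi_out])

# Fit on the first 80% of rows, evaluate on the remaining 20%
split = int(0.8 * n)
coef, *_ = np.linalg.lstsq(X[:split], arr_delay[:split], rcond=None)
y_true = arr_delay[split:]
y_pred = X[split:] @ coef

# RMSE: typical size of the prediction error, in minutes
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R²: share of the variance in arrival delay explained by the model
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"RMSE: {rmse:.2f} minutes  R²: {r2:.3f}")
```

For the predicted-vs-actual visualization, y_true and y_pred can be passed directly to a scatter plot, with points near the diagonal indicating accurate predictions.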
Key Results and Insights
Summary of the most influential features identified during modeling.
Insights from both classification and regression models.
Discussion on how these models can help airlines improve scheduling and communication with passengers.
Limitations and Future Scope
Challenges encountered during the project, such as lack of weather data or other external factors.
Recommendations for improving the model:
Incorporating additional datasets (e.g., weather, air traffic data).
Testing ensemble models and hyperparameter tuning.
Deploying the model using Streamlit or Flask for real-world applications.
Conclusion
Recap of the objectives, methods, and findings.
Practical implications for airlines and passengers.
Encouragement to explore similar datasets and predictive modeling approaches.