1. Introduction

Flight delays are a major inconvenience, affecting millions of passengers annually and causing financial strain on airlines. Delays disrupt travel schedules, increase operational costs, and often result in dissatisfied customers. For airlines, the ability to predict flight delays in advance can significantly improve scheduling efficiency, resource allocation, and customer satisfaction.

In this project, we explore how machine learning can be used to predict flight delays using historical data. By analyzing patterns and trends in delay data, we aim to build robust predictive models that not only forecast delay durations but also identify key factors contributing to these delays. Our goal is to provide actionable insights that airlines can leverage to minimize delays and enhance the passenger experience.

2. Understanding the Dataset

To begin the project, we first import the dataset and explore its general structure and content. This step helps us understand the data we're working with and identify any potential issues that need to be addressed.

Importing the Dataset

We start by loading the dataset using Python's pandas library:

import pandas as pd

# Load the dataset
data = pd.read_csv('flight_data.csv')

# Display the first few rows
print(data.head())

<ipython-input-2-21709313848c>:9: DtypeWarning: Columns (0,1,3,4,10,11,13,19,20,21,22,30,36,41,48) have mixed types. Specify dtype option on import or set low_memory=False.
  data = pd.read_csv('final_data.csv')
   YEAR QUARTER  MONTH DAY_OF_MONTH DAY_OF_WEEK     FL_DATE UNIQUE_CARRIER  \
0  2016       1      1            6           3  2016-01-06             AA   
1  2016       1      1            7           4  2016-01-07             AA   
2  2016       1      1            8           5  2016-01-08             AA   
3  2016       1      1            9           6  2016-01-09             AA   
4  2016       1      1           10           7  2016-01-10             AA   

   AIRLINE_ID CARRIER TAIL_NUM  ... DISTANCE_GROUP CARRIER_DELAY  \
0       19805      AA   N4YBAA  ...            4.0           NaN   
1       19805      AA   N434AA  ...            4.0           NaN   
2       19805      AA   N541AA  ...            4.0           NaN   
3       19805      AA   N489AA  ...            4.0           NaN   
4       19805      AA   N439AA  ...            4.0           0.0   

   WEATHER_DELAY NAS_DELAY SECURITY_DELAY LATE_AIRCRAFT_DELAY FIRST_DEP_TIME  \
0            NaN       NaN            NaN                 NaN            NaN   
1            NaN       NaN            NaN                 NaN            NaN   
2            NaN       NaN            NaN                 NaN            NaN   
3            NaN       NaN            NaN                 NaN            NaN   
4            0.0      47.0            0.0                66.0            NaN   

   TOTAL_ADD_GTIME LONGEST_ADD_GTIME Unnamed: 64  
0              NaN               NaN         NaN  
1              NaN               NaN         NaN  
2              NaN               NaN         NaN  
3              NaN               NaN         NaN  
4              NaN               NaN         NaN  

[5 rows x 65 columns]
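
The DtypeWarning above means that pandas found more than one value type in several columns while reading the file in chunks. One possible way to handle this, sketched below on a tiny in-memory CSV, is to declare the expected type of the affected columns at load time. The sample text and the chosen types are assumptions for illustration; the right types for the real file depend on what the columns actually contain.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real CSV file (illustrative values only)
csv_text = "YEAR,FL_NUM,DEP_DELAY\n2016,43,5.0\n2016,0043,\n"

# Declare types explicitly so every row of a column is parsed the same way.
# low_memory=False makes pandas infer types from the whole file at once.
toy = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"YEAR": "int64", "FL_NUM": str},
    low_memory=False,
)

print(toy.dtypes)
print(toy)
```

Reading FL_NUM as text keeps values such as "0043" intact, which would otherwise be parsed as the number 43.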

# Check the structure of the dataset
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5635978 entries, 0 to 5635977
Data columns (total 65 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   YEAR                   object 
 1   QUARTER                object 
 2   MONTH                  int64  
 3   DAY_OF_MONTH           object 
 4   DAY_OF_WEEK            object 
 5   FL_DATE                object 
 6   UNIQUE_CARRIER         object 
 7   AIRLINE_ID             int64  
 8   CARRIER                object 
 9   TAIL_NUM               object 
 10  FL_NUM                 object 
 11  ORIGIN_AIRPORT_ID      object 
 12  ORIGIN_AIRPORT_SEQ_ID  int64  
 13  ORIGIN_CITY_MARKET_ID  object 
 14  ORIGIN                 object 
 15  ORIGIN_CITY_NAME       object 
 16  ORIGIN_STATE_ABR       object 
 17  ORIGIN_STATE_FIPS      float64
 18  ORIGIN_STATE_NM        object 
 19  ORIGIN_WAC             object 
 20  DEST_AIRPORT_ID        object 
 21  DEST_AIRPORT_SEQ_ID    object 
 22  DEST_CITY_MARKET_ID    object 
 23  DEST                   object 
 24  DEST_CITY_NAME         object 
 25  DEST_STATE_ABR         object 
 26  DEST_STATE_FIPS        float64
 27  DEST_STATE_NM          object 
 28  DEST_WAC               float64
 29  CRS_DEP_TIME           float64
 30  DEP_TIME               object 
 31  DEP_DELAY              float64
 32  DEP_DELAY_NEW          float64
 33  DEP_DEL15              float64
 34  DEP_DELAY_GROUP        float64
 35  DEP_TIME_BLK           object 
 36  TAXI_OUT               object 
 37  WHEELS_OFF             float64
 38  WHEELS_ON              float64
 39  TAXI_IN                float64
 40  CRS_ARR_TIME           float64
 41  ARR_TIME               object 
 42  ARR_DELAY              float64
 43  ARR_DELAY_NEW          float64
 44  ARR_DEL15              float64
 45  ARR_DELAY_GROUP        float64
 46  ARR_TIME_BLK           object 
 47  CANCELLED              float64
 48  CANCELLATION_CODE      object 
 49  DIVERTED               float64
 50  CRS_ELAPSED_TIME       float64
 51  ACTUAL_ELAPSED_TIME    float64
 52  AIR_TIME               float64
 53  FLIGHTS                float64
 54  DISTANCE               float64
 55  DISTANCE_GROUP         float64
 56  CARRIER_DELAY          float64
 57  WEATHER_DELAY          float64
 58  NAS_DELAY              float64
 59  SECURITY_DELAY         float64
 60  LATE_AIRCRAFT_DELAY    float64
 61  FIRST_DEP_TIME         float64
 62  TOTAL_ADD_GTIME        float64
 63  LONGEST_ADD_GTIME      float64
 64  Unnamed: 64            float64
dtypes: float64(33), int64(3), object(29)
memory usage: 2.7+ GB
None

# Check for missing values
print(data.isnull().sum())

YEAR                         0
QUARTER                      0
MONTH                        0
DAY_OF_MONTH                 0
DAY_OF_WEEK                  0
                        ...   
LATE_AIRCRAFT_DELAY    4667538
FIRST_DEP_TIME         5601445
TOTAL_ADD_GTIME        5601445
LONGEST_ADD_GTIME      5601445
Unnamed: 64            5635978
Length: 65, dtype: int64

This dataset is composed of the following variables. The names below are written in CamelCase, while the DataFrame columns use upper-case names with underscores (e.g., DayofMonth corresponds to DAY_OF_MONTH and FlightNum to FL_NUM):

- Year: 2016
- Month: 1-12
- DayofMonth: 1-31
- DayOfWeek: 1 (Monday) - 7 (Sunday)
- DepTime: actual departure time (local, hhmm)
- CRSDepTime: scheduled departure time (local, hhmm)
- ArrTime: actual arrival time (local, hhmm)
- CRSArrTime: scheduled arrival time (local, hhmm)
- UniqueCarrier: unique carrier code
- FlightNum: flight number
- TailNum: plane tail number (aircraft registration, a unique aircraft identifier)
- ActualElapsedTime: in minutes
- CRSElapsedTime: in minutes
- AirTime: in minutes
- ArrDelay: arrival delay, in minutes. A flight is counted as "on time" if it operated less than 15 minutes later than the scheduled time shown in the carriers' Computerized Reservations Systems (CRS).
- DepDelay: departure delay, in minutes
- Origin: origin IATA airport code
- Dest: destination IATA airport code
- Distance: in miles
- TaxiIn: taxi-in time, in minutes
- TaxiOut: taxi-out time, in minutes
- Cancelled: whether the flight was canceled (1 = yes, 0 = no)
- CancellationCode: reason for cancellation (A = carrier, B = weather, C = NAS, D = security)
- Diverted: 1 = yes, 0 = no
- CarrierDelay: in minutes. Carrier delay is within the control of the air carrier. Examples of occurrences that may determine carrier delay are: aircraft cleaning, aircraft damage, awaiting the arrival of connecting passengers or crew, baggage, bird strike, cargo loading, catering, computer outage (carrier equipment), crew legality (pilot or attendant rest), damage by hazardous goods, engineering inspection, fueling, handling disabled passengers, late crew, lavatory servicing, maintenance, oversales, potable water servicing, removal of unruly passenger, slow boarding or seating, stowing carry-on baggage, weight and balance delays.
- WeatherDelay: in minutes. Weather delay is caused by extreme or hazardous weather conditions that are forecasted or manifest themselves at the point of departure, en route, or at the point of arrival.
- NASDelay: in minutes. Delay that is within the control of the National Airspace System (NAS) may include: non-extreme weather conditions, airport operations, heavy traffic volume, air traffic control, etc.
- SecurityDelay: in minutes. Security delay is caused by evacuation of a terminal or concourse, re-boarding of aircraft because of a security breach, inoperative screening equipment, and/or long lines in excess of 29 minutes at screening areas.
- LateAircraftDelay: in minutes. Arrival delay at an airport due to the late arrival of the same aircraft at a previous airport. The ripple effect of an earlier delay at downstream airports is referred to as delay propagation.
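
As a small illustration of the on-time rule described for ArrDelay, the sketch below flags a flight as delayed when its arrival delay is 15 minutes or more. The toy values and the column name IS_DELAYED are assumptions made for this example; in the real dataset, the ARR_DEL15 column appears to carry the same kind of indicator.

```python
import pandas as pd

# Toy arrival delays in minutes (negative values mean early arrivals)
toy = pd.DataFrame({"ARR_DELAY": [-5.0, 0.0, 14.0, 15.0, 113.0]})

# A flight counts as on time if it arrives less than 15 minutes late,
# so anything at 15 minutes or more is flagged as delayed
toy["IS_DELAYED"] = (toy["ARR_DELAY"] >= 15).astype(int)

print(toy)
```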

Analysis of Dataset Structure and Missing Values

1. Dataset Structure

The dataset contains 5,635,978 rows and 65 columns. Here's a concise summary of the structure:

Data Types:
- 33 columns are of float64 type, likely representing numerical data such as delays, distances, and time values.
- 29 columns are of object type, including categorical features like CARRIER, ORIGIN, and FL_DATE.
- 3 columns are of int64 type, likely representing IDs or counts (e.g., AIRLINE_ID and MONTH).
Memory Usage: The dataset occupies 2.7+ GB in memory, indicating that this is a large dataset and might require optimization techniques (e.g., downcasting data types or using chunk processing) for efficient handling.
Observations on Specific Columns:
- Key time-related columns such as DEP_TIME and ARR_TIME are stored as object types, which might require conversion to appropriate datetime formats.
- Some columns, such as Unnamed: 64, appear to be placeholders or irrelevant features, as they are completely null.
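
The downcasting idea mentioned under Memory Usage can be sketched as follows. The toy DataFrame only stands in for a few columns of the flight data, and which columns are safe to downcast in the real dataset is an assumption that should be checked against their value ranges.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a few columns of the flight data
toy = pd.DataFrame({
    "DISTANCE": np.array([337.0, 1235.0, 733.0, 2475.0], dtype="float64"),
    "MONTH": np.array([1, 1, 2, 12], dtype="int64"),
    "CARRIER": ["AA", "AA", "WN", "WN"],
})

before = toy.memory_usage(deep=True).sum()

# Downcast numeric columns to the smallest type that can hold their values
toy["DISTANCE"] = pd.to_numeric(toy["DISTANCE"], downcast="float")
toy["MONTH"] = pd.to_numeric(toy["MONTH"], downcast="integer")

# Text columns with few distinct values can be stored as categoricals
toy["CARRIER"] = toy["CARRIER"].astype("category")

after = toy.memory_usage(deep=True).sum()
print(toy.dtypes)
print(f"bytes before: {before}, bytes after: {after}")
```

On a frame this small the savings are negligible, and the categorical conversion mainly pays off when a few values repeat across millions of rows. For chunk processing, pd.read_csv also accepts a chunksize argument that returns the file in pieces instead of loading it all at once.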

2. Missing Values

The output of data.isnull().sum() reveals the following about missing values in the dataset:

Columns with Significant Missing Values:
- Features related to delay causes, such as CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, and LATE_AIRCRAFT_DELAY, have over 4.66 million missing entries. A likely explanation is that a cause is only recorded when a flight is actually delayed, so a missing value here probably means "no delay attributed" rather than lost data.
- Operational details like FIRST_DEP_TIME, TOTAL_ADD_GTIME, and LONGEST_ADD_GTIME have over 5.6 million missing values, likely because they apply only to a small subset of flights (these fields appear to describe additional ground time, which most flights never incur).
- The column Unnamed: 64 is entirely null and should be dropped as it adds no value.
Columns with No Missing Values:
- Essential features like YEAR, QUARTER, MONTH, DAY_OF_MONTH, DAY_OF_WEEK, and DISTANCE are complete, which is crucial for basic analysis and modeling.
Impact of Missing Values:
- Missing data in delay-related columns can limit the scope of analysis for specific delay causes. Imputation or targeted removal of rows/columns might be necessary, depending on the focus of the study.
- Features with excessive missing values (e.g., Unnamed: 64) or those not relevant to the objective should be removed during preprocessing.
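
A minimal sketch of how these checks could be carried out is shown below. It uses a toy DataFrame with a few of the column names from the dataset; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: one complete column, one mostly missing, one entirely missing
toy = pd.DataFrame({
    "DISTANCE": [337.0, 1235.0, 733.0, 2475.0],
    "CARRIER_DELAY": [np.nan, np.nan, np.nan, 0.0],
    "Unnamed: 64": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column, largest first
missing_frac = toy.isnull().mean().sort_values(ascending=False)
print(missing_frac)

# Drop columns in which every value is missing
cleaned = toy.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```

Looking at fractions rather than raw counts makes it easier to compare columns and to pick a removal threshold.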

Key Insights:

- The dataset is rich in features, but large portions of certain columns contain missing values, which need to be handled appropriately.
- Conversion of object columns to numeric or datetime formats is required to facilitate analysis and modeling.
- Dropping irrelevant or entirely null columns (e.g., Unnamed: 64) is an essential preprocessing step to optimize dataset handling.

By addressing these issues in the preprocessing stage, the dataset can be better prepared for Exploratory Data Analysis (EDA) and machine learning tasks.
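
As one possible approach to the type conversions, the sketch below coerces object columns to numbers and turns an hhmm clock value into minutes after midnight. The toy values and the new column name DEP_MINUTES are assumptions for illustration.

```python
import pandas as pd

# Toy frame with the kind of mixed content an object column can hold
toy = pd.DataFrame({
    "TAXI_OUT": ["12", 15.0, "bad", None],
    "DEP_TIME": ["1455", 905.0, None, "2359"],  # local time as hhmm
})

# Unparseable entries become NaN instead of raising an error
toy["TAXI_OUT"] = pd.to_numeric(toy["TAXI_OUT"], errors="coerce")
dep = pd.to_numeric(toy["DEP_TIME"], errors="coerce")

# Split hhmm into hours and minutes, then express as minutes after midnight
toy["DEP_MINUTES"] = (dep // 100) * 60 + (dep % 100)

print(toy)
```

Minutes after midnight is only one convenient representation; combining FL_DATE with the hhmm value into a full timestamp would be an alternative.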

3. Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) helps uncover patterns, trends, and relationships in the data. This section includes a focused analysis of delays, cancellations, and other key variables.

Separating Canceled Flights

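To analyze cancellations separately, we create a subset of the dataset containing only canceled flights. We also fill missing values in the TAXI_OUT column with 0, on the reasoning that canceled flights never proceed to taxi-out. Note that this fills every missing TAXI_OUT value, not only those belonging to canceled flights.

# Fill missing values in TAXI_OUT with 0. The result is assigned back because
# inplace=True on a column selection may not modify the DataFrame in recent pandas.
data['TAXI_OUT'] = data['TAXI_OUT'].fillna(0)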

# Create a subset of canceled flights
canceled_flights = data[data['CANCELLED'] == 1]

# Display the last few rows of the canceled flights data
print(canceled_flights.tail())

         YEAR QUARTER  MONTH DAY_OF_MONTH DAY_OF_WEEK     FL_DATE  \
5633120  2016       4     12           16           5  2016-12-16   
5633121  2016       4     12           16           5  2016-12-16   
5633165  2016       4     12           16           5  2016-12-16   
5633168  2016       4     12           16           5  2016-12-16   
5633710  2016       4     12           30           5  2016-12-30   

        UNIQUE_CARRIER  AIRLINE_ID CARRIER TAIL_NUM  ... DISTANCE_GROUP  \
5633120             WN       19393      WN   N7846A  ...            3.0   
5633121             WN       19393      WN   N560WN  ...            3.0   
5633165             WN       19393      WN   N713SW  ...            3.0   
5633168             WN       19393      WN   N422WN  ...            3.0   
5633710             WN       19393      WN   N8669B  ...            4.0   

        CARRIER_DELAY  WEATHER_DELAY NAS_DELAY SECURITY_DELAY  \
5633120           NaN            NaN       NaN            NaN   
5633121           NaN            NaN       NaN            NaN   
5633165           NaN            NaN       NaN            NaN   
5633168           NaN            NaN       NaN            NaN   
5633710           NaN            NaN       NaN            NaN   

        LATE_AIRCRAFT_DELAY FIRST_DEP_TIME  TOTAL_ADD_GTIME LONGEST_ADD_GTIME  \
5633120                 NaN            NaN              NaN               NaN   
5633121                 NaN            NaN              NaN               NaN   
5633165                 NaN            NaN              NaN               NaN   
5633168                 NaN            NaN              NaN               NaN   
5633710                 NaN            NaN              NaN               NaN   

        Unnamed: 64  
5633120         NaN  
5633121         NaN  
5633165         NaN  
5633168         NaN  
5633710         NaN  

[5 rows x 65 columns]

The canceled_flights DataFrame includes all rows where CANCELLED = 1. This separation allows us to analyze cancellation trends and reasons independently.
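
With the canceled flights isolated, the cancellation reasons could be summarized along these lines. The sketch uses a toy stand-in for canceled_flights and the code meanings listed in the variable descriptions (A = carrier, B = weather, C = NAS, D = security); the counts are made up.

```python
import pandas as pd

# Code meanings as listed in the variable descriptions
reason_labels = {"A": "Carrier", "B": "Weather", "C": "NAS", "D": "Security"}

# Toy stand-in for the canceled_flights DataFrame
toy_canceled = pd.DataFrame({
    "CANCELLED": [1.0] * 6,
    "CANCELLATION_CODE": ["B", "A", "B", "C", "B", "A"],
})

# Translate codes into readable labels and count each reason
reason_counts = (
    toy_canceled["CANCELLATION_CODE"]
    .map(reason_labels)
    .value_counts()
)
print(reason_counts)
```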

Average Departure Delay by Day of the Week

Analyzing delays by the day of the week helps uncover patterns in departure delays.

import matplotlib.pyplot as plt
import seaborn as sns

# Group data by day of the week and calculate the mean departure delay
day_delay = data.groupby('DAY_OF_WEEK')['DEP_DELAY'].mean()

# Plot the results
plt.figure(figsize=(8, 5))
sns.barplot(x=day_delay.index, y=day_delay.values, palette="viridis")
plt.title('Average Departure Delay by Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Delay (minutes)')
plt.show()

The bar chart illustrates the average departure delay by day of the week, with noticeable variations. Delays peak on certain days, such as day 4 (Thursday) and day 7 (Sunday), possibly due to higher traffic volumes or operational inefficiencies. Conversely, delays are relatively lower on day 6 (Saturday), indicating less congestion or smoother operations. The chart also contains a category labeled "N707EV". This looks like an aircraft tail number rather than a day of the week, which suggests that some rows in the source file are misaligned or malformed, consistent with the mixed-type warning raised when the data was loaded. Such rows should be cleaned before drawing firm conclusions. With that caveat, the analysis highlights temporal patterns that could inform scheduling optimizations.
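
One way to keep stray values such as "N707EV" out of the grouping is to coerce DAY_OF_WEEK to numbers and keep only the values 1 through 7. The sketch below shows the idea on toy data; it is an assumption about how the cleaning could be done, not the procedure used to produce the chart.

```python
import pandas as pd

# Toy data mixing integers, numeric strings, and one malformed entry
toy = pd.DataFrame({
    "DAY_OF_WEEK": [1, "1", "4", 4, "N707EV", 7],
    "DEP_DELAY": [10.0, 20.0, 5.0, 15.0, 300.0, 0.0],
})

# Coerce to numbers; anything that is not a number becomes NaN
day = pd.to_numeric(toy["DAY_OF_WEEK"], errors="coerce")

# Keep only rows whose day is between 1 and 7 (NaN fails this test)
mask = day.between(1, 7)
valid = toy.loc[mask].copy()
valid["DAY_OF_WEEK"] = day[mask].astype(int)

day_delay_clean = valid.groupby("DAY_OF_WEEK")["DEP_DELAY"].mean()
print(day_delay_clean)
```

After this step, the integer 1 and the string "1" fall into the same group instead of being counted as separate categories.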

Percentage of Canceled Flights by Day of the Week

Here, we calculate the percentage of canceled flights for each day of the week and display the results in a bar chart.

import numpy as np

# Count canceled flights by day of the week
days_canceled = canceled_flights.groupby('DAY_OF_WEEK')['CANCELLED'].count()

# Count all flights by day of the week
days_total = data.groupby('DAY_OF_WEEK')['CANCELLED'].count()

# Calculate the fraction of canceled flights (converted to a percentage when plotting)
days_frac = np.divide(days_canceled, days_total)

# Define labels for the x-axis (assumes DAY_OF_WEEK holds exactly the values 1-7)
week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# Plot the percentage of cancellations
fig, ax = plt.subplots(figsize=(12, 6))
ax.bar(days_frac.index, days_frac * 100, align='center')
ax.set_ylabel('Percentage of Flights Canceled')
ax.set_xticks(days_frac.index)
ax.set_xticklabels(week, rotation=45)
plt.title('Percentage of Canceled Flights by Day of the Week')
plt.show()


The bar chart shows the percentage of canceled flights by day of the week. Cancellation rates are highest on Friday and Saturday, likely due to increased air traffic and operational challenges during peak travel times. Tuesday has the lowest percentage, indicating smoother operations on less congested days. The pattern highlights how weekday schedules and traffic volumes impact cancellation rates.
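
The same percentages can be obtained more compactly, because the mean of a 0/1 indicator equals the share of rows that are 1. The sketch below demonstrates this on toy data and assumes CANCELLED contains only 0 and 1.

```python
import pandas as pd

# Toy data: day 1 has 1 of 4 flights canceled, day 2 none, day 5 has 2 of 4
toy = pd.DataFrame({
    "DAY_OF_WEEK": [1, 1, 1, 1, 2, 2, 5, 5, 5, 5],
    "CANCELLED":   [0, 0, 0, 1, 0, 0, 1, 1, 0, 0],
})

# The mean of a 0/1 indicator is the share of rows equal to 1
cancel_pct = toy.groupby("DAY_OF_WEEK")["CANCELLED"].mean() * 100
print(cancel_pct)
```

A side benefit of this form is that a day with no cancellations shows up as 0 rather than as a missing value.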

Comparing All Flights vs. Canceled Flights

This step compares the total number of flights (All) and canceled flights (Canceled) for key features like DISTANCE, DAY_OF_WEEK, or ORIGIN.

# Specify the feature to compare
feature_name = 'DISTANCE'  # Replace with the desired column name, e.g., 'DAY_OF_WEEK' or 'ORIGIN'

# Ensure the feature exists in both DataFrames
if feature_name in data.columns and feature_name in canceled_flights.columns:
    # Count occurrences once per DataFrame
    all_counts = data[feature_name].value_counts()
    canceled_counts = canceled_flights[feature_name].value_counts()

    plt.figure(figsize=(12, 6))
    plt.bar(all_counts.index, all_counts.values,
            label='All Flights', alpha=0.7, color='blue')
    plt.bar(canceled_counts.index, canceled_counts.values,
            label='Canceled Flights', alpha=0.7, color='orange')
    plt.xlabel(feature_name)
    plt.ylabel('Count')
    plt.title(f'Comparison of All Flights vs. Canceled Flights by {feature_name}')
    plt.legend()
    plt.xticks(rotation=45)
    plt.show()
else:
    print(f"The feature '{feature_name}' is not present in the dataset.")
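
Because canceled flights are a small share of all flights, raw counts can be hard to compare on one axis. A complementary view, sketched below on toy data, is the cancellation rate within bins of the feature. The bin edges are arbitrary choices for this illustration, and the column name DISTANCE_BIN is hypothetical.

```python
import pandas as pd

# Toy distances in miles with a 0/1 cancellation indicator
toy = pd.DataFrame({
    "DISTANCE":  [120, 300, 450, 800, 950, 1200, 1800, 2400],
    "CANCELLED": [1, 0, 1, 0, 0, 0, 1, 0],
})

# Bin edges in miles are arbitrary choices for this illustration
bins = [0, 500, 1000, 3000]
labels = ["0-500", "501-1000", "1001-3000"]
toy["DISTANCE_BIN"] = pd.cut(toy["DISTANCE"], bins=bins, labels=labels)

# Share of canceled flights within each distance bin
rate_by_bin = toy.groupby("DISTANCE_BIN", observed=False)["CANCELLED"].mean()
print(rate_by_bin)
```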

4. Data Preprocessing

- Handling missing values.
- Encoding categorical features.
- Creating additional features, such as time of day or cumulative delays, to enhance the model.
- Scaling and normalizing numerical features (if required).
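
Two of these steps, creating a time-of-day feature and encoding categorical features, could look roughly like the sketch below. The toy values, the period boundaries, and the new column names DEP_HOUR and TIME_OF_DAY are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "CRS_DEP_TIME": [615.0, 1130.0, 1745.0, 2250.0],  # scheduled departure, hhmm
    "CARRIER": ["AA", "WN", "AA", "DL"],
})

# Extract the scheduled departure hour from the hhmm value
toy["DEP_HOUR"] = (toy["CRS_DEP_TIME"] // 100).astype(int)

# Bucket hours into coarse periods; the boundaries are illustrative choices
toy["TIME_OF_DAY"] = pd.cut(
    toy["DEP_HOUR"],
    bins=[-1, 5, 11, 17, 23],
    labels=["night", "morning", "afternoon", "evening"],
)

# One-hot encode the categorical columns
encoded = pd.get_dummies(toy, columns=["CARRIER", "TIME_OF_DAY"])
print(encoded.columns.tolist())
```

One-hot encoding works for a handful of carriers, but a column with hundreds of distinct values, such as ORIGIN, may call for a different encoding.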

5. Machine Learning Models Training and Evaluation

Random Forest Classifier

- Description of the model and its role in predicting cancellations or delays.
- Key performance metrics (e.g., accuracy, F1-score).
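
To make the classification metrics concrete, the sketch below computes accuracy and F1-score by hand for a toy set of predictions. The labels are made up and are not results from any model in this project.

```python
import numpy as np

# Toy labels: 1 = delayed, 0 = on time
y_true = np.array([1, 0, 0, 1, 1, 0, 0, 0])
y_pred = np.array([1, 0, 1, 0, 1, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # correctly predicted delays
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false alarms
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # missed delays

accuracy = float(np.mean(y_true == y_pred))
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}, "
      f"recall={recall:.3f}, f1={f1:.3f}")
```

F1 is worth reporting alongside accuracy when one class is much rarer than the other, since a model that always predicts the majority class can still score a high accuracy.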

Delay Time Regression Analysis

- Overview of regression techniques for predicting delay durations.
- Results and evaluation metrics (e.g., RMSE, R²).
- Visualizing predicted vs actual delay times.
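
For the regression metrics, the sketch below computes RMSE and R² directly from their definitions on toy values. The numbers are illustrative and do not come from a fitted model.

```python
import numpy as np

# Toy actual and predicted delays in minutes
y_true = np.array([10.0, 0.0, 45.0, 5.0])
y_pred = np.array([12.0, 3.0, 40.0, 5.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual, in the same unit as the target
rmse = float(np.sqrt(np.mean(residuals ** 2)))

# R^2: 1 minus the ratio of residual variation to total variation around the mean
ss_res = float(np.sum(residuals ** 2))
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot

print(f"RMSE={rmse:.3f} minutes, R^2={r2:.4f}")
```

RMSE reads directly as a typical error in minutes, while R² expresses how much of the variation in delays the predictions account for.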

Key Results and Insights

- Summary of the most influential features identified during modeling.
- Insights from both classification and regression models.
- Discussion on how these models can help airlines improve scheduling and communication with passengers.

Limitations and Future Scope

- Challenges encountered during the project, such as lack of weather data or other external factors.
- Recommendations for improving the model:
  - Incorporating additional datasets (e.g., weather, air traffic data).
  - Testing ensemble models and hyperparameter tuning.
  - Deploying the model using Streamlit or Flask for real-world applications.

Conclusion

- Recap of the objectives, methods, and findings.
- Practical implications for airlines and passengers.
- Encouragement to explore similar datasets and predictive modeling approaches.

Predicting Flight Delays with Machine Learning: A Comprehensive Guide