Predicting Restaurant Tips Using Regression Analysis in Python


1. Introduction

Tipping behavior at restaurants is a fascinating area of study, often influenced by various factors such as bill amount, group size, and even the time of day. Regression analysis offers a powerful method to explore these relationships and predict tip amounts based on multiple features.

This project focuses on building and evaluating regression models using Python to predict the tips given at a restaurant. By analyzing the relationships between predictors like total_bill, sex, smoker, day, time, and size, we aim to uncover the underlying factors affecting tipping behavior and use this understanding to make accurate predictions.

2. Dataset Overview

The dataset used in this project is the popular "Tips" dataset, which contains information about customer bills and tipping behavior at a restaurant. The dataset includes 244 records and 7 features, described as follows:

Features of the Dataset:

  • total_bill: The total bill amount (in dollars).

  • tip: The tip amount (in dollars) — this is the target variable.

  • sex: The gender of the customer (Male or Female).

  • smoker: Indicates whether the customer was a smoker (Yes or No).

  • day: The day of the week (e.g., Thur, Fri, Sat, Sun).

  • time: The time of the meal (Lunch or Dinner).

  • size: The size of the dining party.

This dataset provides an excellent opportunity to explore how factors like meal size, time, and other attributes influence tipping behavior. We aim to develop regression models that effectively predict the tip amount based on these features.

3. Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) helps us uncover patterns, trends, and relationships in the dataset, enabling informed decisions during preprocessing and modeling. Here's how we approach EDA for the Tips dataset.

Importing Required Libraries and Dataset

We start by importing the necessary libraries for analysis and visualization, along with loading the dataset from the seaborn library.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the Tips dataset
tips_data = sns.load_dataset('tips')

# General information about the dataset
print(tips_data.head())
total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

Checking the General Information

To understand the structure and types of data in the dataset, we use the .info() method. This step highlights the number of records, feature types, and missing values, if any.

# Display dataset information
print(tips_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB
None

The dataset contains 244 entries with 7 columns. None of the features have missing values. The features include:

  • total_bill (float64): Numeric data indicating the total bill amount.

  • tip (float64): Numeric data for the tip amount.

  • sex, smoker, day, time (category): Categorical features.

  • size (int64): Numeric data for the party size.

Summary Statistics

We examine the summary statistics of the dataset to understand the distribution and central tendencies of numerical features.

# Display summary statistics
print(tips_data.describe())
total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000

Key Observations:

  • The total_bill ranges from approximately $3 to $50, with a mean of $19.79.

  • The tip amounts range from $1 to $10, with an average of $2.99.

  • Party size (size) ranges from 1 to 6, with an average of 2.57 diners per party.

Feature Distributions

Visualizing the distributions of numerical features helps identify skewness or outliers.

# Plot distributions of numerical features as subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.histplot(tips_data['total_bill'], kde=True, bins=20, color='blue', ax=axes[0])
axes[0].set_title('Distribution of Total Bill')
axes[0].set_xlabel('Total Bill ($)')

sns.histplot(tips_data['tip'], kde=True, bins=20, color='green', ax=axes[1])
axes[1].set_title('Distribution of Tips')
axes[1].set_xlabel('Tip Amount ($)')

plt.tight_layout()
plt.show()

Insights:

  • total_bill is slightly right-skewed, with most bills concentrated between $10 and $30.

  • tip also exhibits right-skewness, indicating larger tips are less frequent.

Categorical Feature Analysis

Analyzing the distributions of categorical features provides insights into customer demographics and behaviors.

# Plot counts of categorical features as subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

sns.countplot(x='sex', data=tips_data, palette='pastel', ax=axes[0, 0])
axes[0, 0].set_title('Distribution by Gender')

sns.countplot(x='smoker', data=tips_data, palette='pastel', ax=axes[0, 1])
axes[0, 1].set_title('Smoker vs. Non-Smoker')

sns.countplot(x='day', data=tips_data, palette='pastel', order=['Thur', 'Fri', 'Sat', 'Sun'], ax=axes[1, 0])
axes[1, 0].set_title('Distribution by Day of the Week')

sns.countplot(x='time', data=tips_data, palette='pastel', ax=axes[1, 1])
axes[1, 1].set_title('Lunch vs. Dinner')

plt.tight_layout()
plt.show()

  • Gender Distribution: Male customers significantly outnumber female customers, indicating a potential demographic bias in the dataset.

  • Smoking Status: Non-smokers are almost double the number of smokers, suggesting the restaurant may attract more non-smoking patrons.

  • Day of the Week: The majority of visits occur on weekends (Saturday and Sunday), likely due to higher dining-out activities during those days.

  • Meal Timing: Dinner dominates over lunch as the preferred meal time, possibly reflecting the restaurant’s peak business hours.

Correlation Analysis

Correlation analysis helps identify relationships between numerical features.

# Plot a heatmap of correlations between the numerical features
# (numeric_only=True excludes the categorical columns, which would otherwise raise an error)
sns.heatmap(tips_data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

  • Strong Positive Correlation: total_bill and tip show a strong positive correlation (0.68), indicating that higher bills tend to result in higher tips.

  • Moderate Correlation: size has a moderate positive correlation with both total_bill (0.6) and tip (0.49), suggesting that larger dining parties generally spend more and leave bigger tips.

  • No Perfect Correlation: None of the features are perfectly correlated, indicating each contributes unique information to the dataset.

Interaction Between Features

Exploring relationships between features provides deeper insights into tipping behavior.

# Present these plots as 2 subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

sns.scatterplot(x='total_bill', y='tip', data=tips_data, hue='time', palette='coolwarm', ax=axes[0])
axes[0].set_title('Total Bill vs Tip')

sns.boxplot(x='day', y='tip', data=tips_data, palette='pastel', order=['Thur', 'Fri', 'Sat', 'Sun'], ax=axes[1])
axes[1].set_title('Tips by Day')

plt.tight_layout()
plt.show()

  • Total Bill vs Tip: There is a positive linear trend between total_bill and tip, with dinner generally associated with higher bills and tips compared to lunch.

  • Tips by Day: Tips are relatively consistent across days, with slightly higher medians on weekends (Saturday and Sunday), reflecting potentially larger parties or higher spending. Outliers are more frequent on weekends, indicating occasional large tips.

4. Data Preprocessing

Data preparation involves cleaning and preprocessing the dataset to ensure it is ready for building regression models. Based on insights gained from the Exploratory Data Analysis (EDA), the following steps are taken:

Encoding Categorical Features

The dataset contains categorical features (sex, smoker, day, time) that need to be converted into a numerical format for regression analysis. One-hot encoding is used to avoid introducing artificial ordinal relationships; the .info() check above already confirmed there are no missing values to handle first.

# One-hot encode categorical features
tips_data = pd.get_dummies(tips_data, columns=['sex', 'smoker', 'day', 'time'], drop_first=True)
print(tips_data.head())
total_bill   tip  size  sex_Female  smoker_No  day_Fri  day_Sat  day_Sun  \
0       16.99  1.01     2           1          1        0        0        1   
1       10.34  1.66     3           0          1        0        0        1   
2       21.01  3.50     3           0          1        0        0        1   
3       23.68  3.31     2           0          1        0        0        1   
4       24.59  3.61     4           1          1        0        0        1   

   time_Dinner  
0            1  
1            1  
2            1  
3            1  
4            1

The dataset now includes binary columns for each retained category, such as sex_Female, smoker_No, and time_Dinner (the first level of each feature is dropped to avoid redundancy).

Feature Scaling

Regression models can benefit from scaling numerical features to ensure they are on comparable scales. Min-Max scaling is applied to normalize total_bill, tip, and size.

from sklearn.preprocessing import MinMaxScaler

# Apply Min-Max scaling
scaler = MinMaxScaler()
tips_data[['total_bill', 'tip', 'size']] = scaler.fit_transform(tips_data[['total_bill', 'tip', 'size']])

Numerical features (and the target, tip) are scaled to a range of 0 to 1, which keeps all features on comparable scales. Because tip is scaled, the error metrics reported later are on this 0-1 scale rather than in dollars.

Splitting the Data

To train and evaluate the model, the dataset is split into training and testing sets. The tip column serves as the target variable (y), while all other columns are features (X).

from sklearn.model_selection import train_test_split

# Define features and target variable
X = tips_data.drop(columns=['tip'])
y = tips_data['tip']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The data is divided into 80% training and 20% testing sets to evaluate model performance effectively.

5. Regression Model Implementation

In this step, we build regression models to predict the tip amount (tip) based on the features in the dataset. We start with a simple Linear Regression model and then explore Polynomial Regression to capture potential nonlinear relationships.

Linear Regression Model

Linear Regression is a fundamental regression technique that assumes a linear relationship between the features and the target variable.

Training the Linear Regression Model:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Initialize the Linear Regression model
linear_model = LinearRegression()

# Fit the model on the training data
linear_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred_linear = linear_model.predict(X_test)

# Evaluate the model
linear_mse = mean_squared_error(y_test, y_pred_linear)
linear_r2 = r2_score(y_test, y_pred_linear)

print(f"Linear Regression Mean Squared Error: {linear_mse}")
print(f"Linear Regression R-squared: {linear_r2}")
Linear Regression Mean Squared Error: 0.008683414836340865
Linear Regression R-squared: 0.4373018194348254
  • Mean Squared Error (MSE): The MSE of 0.0087 is measured on the 0-1 scaled tip values, so it is not directly interpretable in dollars; it serves here as a baseline for comparing models.

  • R-squared: The score of 0.437 suggests that the model explains about 43.7% of the variance in tip, indicating a moderate fit. There may be room for improvement by using more complex models or feature engineering.

Polynomial Regression Model

Polynomial Regression allows us to model nonlinear relationships by including polynomial terms for the features.

Training the Polynomial Regression Model:

from sklearn.preprocessing import PolynomialFeatures

# Create polynomial features
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Initialize and train the Polynomial Regression model
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_train)

# Make predictions on the test data
y_pred_poly = poly_model.predict(X_test_poly)

# Evaluate the model
poly_mse = mean_squared_error(y_test, y_pred_poly)
poly_r2 = r2_score(y_test, y_pred_poly)

print(f"Polynomial Regression Mean Squared Error: {poly_mse}")
print(f"Polynomial Regression R-squared: {poly_r2}")
Polynomial Regression Mean Squared Error: 0.012555568269818424
Polynomial Regression R-squared: 0.18638052488048473
  • Mean Squared Error (MSE): 0.0126, higher than Linear Regression's 0.0087, indicating worse prediction accuracy.

  • R-squared: 0.186, significantly lower than Linear Regression's 0.437, suggesting Polynomial Regression explains less variance in the data.

6. Model Evaluation

Model evaluation is crucial to understanding how well the regression models perform on unseen data. In this section, we evaluate the performance of both the Linear Regression and Polynomial Regression models using key metrics and visualizations.

Evaluation Metrics

We evaluated the performance of both models using the following metrics:

  1. Mean Squared Error (MSE): Measures the average squared difference between predicted and actual values. Lower MSE indicates better predictions.

  2. R-squared (R²): Represents the proportion of variance in the target variable explained by the model. Higher R² indicates a better fit.
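For clarity, both metrics can also be computed directly from their definitions; a quick sketch using NumPy with the linear model's test-set predictions from Section 5:

import numpy as np

# MSE: average squared difference between predictions and actual values
mse_manual = np.mean((y_test - y_pred_linear) ** 2)

# R²: one minus the ratio of residual to total sum of squares
ss_res = np.sum((y_test - y_pred_linear) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE: {mse_manual:.4f}, R-squared: {r2_manual:.4f}")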

The results are summarized in the table below:

Model                    MSE      R²
Linear Regression        0.0087   0.437
Polynomial Regression    0.0126   0.186

Comparison:

  • The Linear Regression model outperforms Polynomial Regression in both metrics, effectively capturing the largely linear relationship between features and tip.

  • Polynomial Regression added unnecessary complexity without improving accuracy, indicating overfitting.

This table provides a clear comparison of model performance, highlighting Linear Regression as the more suitable choice for this dataset.

Visualizing Model Performance

  1. Residual Plot for Linear Regression: Residual plots help check the adequacy of the model fit. For Linear Regression, the residuals form a systematic pattern, suggesting some underfitting and possible non-linear effects.
# Residual Plot for Linear Regression
residuals = y_test - y_pred_linear
plt.scatter(y_pred_linear, residuals)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residual Plot for Linear Regression')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()

The residual plot for Linear Regression shows a clear pattern, with residuals forming a curve rather than being randomly scattered around the zero line. This indicates that the model is underfitting: there are non-linear relationships in the data that it cannot account for. As the previous section showed, a full degree-2 polynomial over all features did not help either, which motivates the more targeted feature engineering in Section 7.

  2. Residual Plot for Polynomial Regression: Residuals for Polynomial Regression are more widely scattered than those of Linear Regression, consistent with its poorer metrics.
# Residual Plot for Polynomial Regression
residuals_poly = y_test - y_pred_poly
plt.scatter(y_pred_poly, residuals_poly)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residual Plot for Polynomial Regression')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()

The residual plot for Polynomial Regression shows a slightly more scattered distribution compared to Linear Regression, but it still exhibits some structured patterns, indicating that the model is not capturing all relationships effectively. While Polynomial Regression introduces flexibility to capture non-linear relationships, the lack of random residual dispersion around the zero line suggests overfitting, as the model may have incorporated noise from the data without significantly improving predictive performance.

7. Addressing Underfitting in the Linear Regression Model

After evaluating the model performance and observing underfitting in the residual plot, we focus on addressing this limitation through targeted improvements. This section details the steps taken to enhance the Linear Regression model's ability to capture the relationship between features and the target variable (tip).

Adding Interaction Terms

Interaction terms help capture combined effects between features that a simple linear model may miss. For instance, the relationship between total_bill and size may influence tipping behavior.

Implementation:

# Add interaction term between total_bill and size
tips_data['total_bill_size'] = tips_data['total_bill'] * tips_data['size']

This term captures how the size of a party and the total bill together affect tipping behavior.

Introducing Non-Linear Transformations

Non-linear transformations, such as squared terms, allow the model to capture the diminishing or accelerating effects of certain features.

Implementation:

# Add squared terms for total_bill and size
tips_data['total_bill_squared'] = tips_data['total_bill'] ** 2
tips_data['size_squared'] = tips_data['size'] ** 2
  • total_bill_squared: Helps capture non-linear effects of the total bill on tips, such as diminishing tipping percentages for very high bills.

  • size_squared: Models non-linear effects of party size on tipping behavior.

Testing Logarithmic Scaling

Skewed distributions in features like total_bill can adversely affect model performance. Applying a logarithmic transformation helps normalize these distributions and improve the model's robustness.

Implementation:

import numpy as np

# Apply log transformation to total_bill and tip
tips_data['log_total_bill'] = np.log(tips_data['total_bill'] + 1)  # Avoid log(0)
tips_data['log_tip'] = np.log(tips_data['tip'] + 1)

Log transformations reduce the impact of extreme values (outliers) and normalize feature distributions. (These log columns are created here for exploration; the retrained model below uses only the interaction and squared terms.)

Re-evaluating the Model

After introducing these changes, the model is retrained, and its performance metrics are reevaluated. The impact of these improvements is assessed using metrics like MSE and R², as well as updated residual plots.

Retrain and Evaluate:

# Define the new feature set (engineered numeric terms; categorical dummies are omitted here)
X = tips_data[['total_bill', 'size', 'total_bill_size', 'total_bill_squared', 'size_squared']]
y = tips_data['tip']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Linear Regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Make predictions and evaluate
y_pred = linear_model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Improved Linear Regression - MSE: {mse}, R-squared: {r2}")
Improved Linear Regression - MSE: 0.009381495691324071, R-squared: 0.3920651430362164

The improved Linear Regression model achieves an MSE of 0.0094, slightly higher than the original model's 0.0087, and an R-squared of 0.392, slightly lower than the original 0.437. These results indicate that the added interaction and non-linear features did not significantly enhance the model's ability to explain the variance in tip or improve prediction accuracy. The marginal decrease in performance suggests that the relationship between the features and tip is predominantly linear and that the additional complexity introduced by the new features may not have been necessary for this dataset.
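The updated residual plot mentioned above can be generated exactly as before; a brief sketch reusing the earlier plotting code:

# Residual plot for the improved Linear Regression model
residuals_improved = y_test - y_pred
plt.scatter(y_pred, residuals_improved)
plt.axhline(y=0, color='r', linestyle='--')
plt.title('Residual Plot for Improved Linear Regression')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.show()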

8. Deploying the Model with Streamlit

Deploying the model using Streamlit allows us to create a simple and interactive web application where users can input feature values (e.g., total_bill, size) and receive predicted tips instantly. This section walks through the process of building and deploying the Streamlit app.

Setting Up the Streamlit Environment

First, ensure that Streamlit is installed. If not, you can install it using the following command:

pip install streamlit

Creating the Streamlit Application

Create a new Python file (e.g., app.py) for the Streamlit app. The app will:

  1. Accept user inputs for features like total_bill, size, and other necessary parameters.

  2. Use the trained Linear Regression model to predict the tip.

  3. Display the predicted tip dynamically.
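Note that app.py below loads the model from linear_model.pkl, a file the training code in this article never writes. A minimal sketch of that export step, run after training (here assumed to be the improved model from Section 7):

import pickle

# Serialize the trained model so the Streamlit app can load it
with open('linear_model.pkl', 'wb') as file:
    pickle.dump(linear_model, file)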

Code for app.py:

import streamlit as st
import numpy as np
import pickle

# Load the trained model
with open('linear_model.pkl', 'rb') as file:
    model = pickle.load(file)

# Streamlit app title
st.markdown("<h1 style='text-align: center; font-size: 48px;'>Restaurant Tip Predictor</h1>", unsafe_allow_html=True)

# User inputs
total_bill = st.number_input("Total Bill ($)", min_value=0.0, format="%.2f", step=1.0)
size = st.number_input("Party Size", min_value=1, format="%d", step=1)

# Optional interaction term and squared features
total_bill_size = total_bill * size
total_bill_squared = total_bill ** 2
size_squared = size ** 2

# Prepare the feature vector
features = np.array([[total_bill, size, total_bill_size, total_bill_squared, size_squared]])

# Predict the tip
if st.button("Predict Tip"):
    predicted_tip = model.predict(features)[0]
    st.markdown(f"<h2 style='text-align: center; font-size: 48px;'>Predicted Tip: ${predicted_tip:.2f}</h2>", unsafe_allow_html=True)
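One caveat: the model from Section 7 was trained on Min-Max-scaled features and a scaled target, while the app above passes raw dollar amounts straight to model.predict. A hedged sketch of how the fitted scaler could be reused at prediction time, assuming it was pickled alongside the model as scaler.pkl (a file this article does not create):

# Load the MinMaxScaler fitted during preprocessing (hypothetical scaler.pkl)
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)

# The scaler was fit on [total_bill, tip, size]; tip is unknown at
# prediction time, so scale total_bill and size manually from its ranges.
bill_min, tip_min, size_min = scaler.data_min_
bill_max, tip_max, size_max = scaler.data_max_
tb = (total_bill - bill_min) / (bill_max - bill_min)
sz = (size - size_min) / (size_max - size_min)

# Rebuild the engineered features on the scaled inputs
features_scaled = np.array([[tb, sz, tb * sz, tb ** 2, sz ** 2]])
pred_scaled = model.predict(features_scaled)[0]

# Map the scaled prediction back to dollars
predicted_tip = pred_scaled * (tip_max - tip_min) + tip_min

Skipping this consistency step is one likely source of the unrealistic predictions discussed in the conclusion.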

Running the Streamlit Application

Run the Streamlit app locally using the following command:

streamlit run app.py

Once the app is running, you'll see a web interface where users can enter the total bill and party size and instantly receive the predicted tip.

9. Conclusion

In this project, we explored the relationship between restaurant bill characteristics and tipping behavior using regression analysis. Starting with a comprehensive Exploratory Data Analysis (EDA), we identified key patterns and relationships in the data. We implemented both Linear and Polynomial Regression models to predict tips and evaluated their performance using metrics such as Mean Squared Error (MSE) and R-squared (R²).

Our findings revealed that the Linear Regression model performed better, effectively capturing the predominantly linear relationship between features like total_bill and size and the target variable tip. While Polynomial Regression introduced flexibility to model non-linear relationships, it added unnecessary complexity without improving accuracy, suggesting that the additional terms were not justified for this dataset.

However, during deployment using Streamlit, we observed unrealistic predictions in certain cases, such as tips exceeding 60% of the total bill. This highlighted the importance of addressing issues like outliers, scaling inconsistencies, and feature engineering to improve the model’s reliability. Steps such as handling outliers, refining interaction terms, and capping predictions can further enhance the model’s real-world applicability.

Through this project, we demonstrated how regression models can be applied to practical problems, the importance of validating predictions against domain knowledge, and the value of making models accessible through deployment. This work lays a foundation for more advanced predictive modeling techniques and data-driven decision-making in the hospitality industry. Let me know if you'd like to explore further refinements!


Appendix

Code: https://github.com/Minhhoang2606/Data-Analytics-Foundations-for-Accountancy-II/tree/master/module%201

Data source: the Tips dataset bundled with the Seaborn package (loaded via seaborn.load_dataset('tips')).