Exploratory Data Analysis¶

The purpose of this section of the notebook is to provide some key highlights of the baseline data being used. This showcases the various attributes, any specific transformations, and key relationships.

import pandas as pd
import matplotlib.pyplot as plt

# Load the primary dataset
df = pd.read_csv('./../datasets/original/WA_Fn-UseC_-Telco-Customer-Churn.csv')

As can be seen below, the dataset contains 21 columns and 7,043 rows

df.shape

(7043, 21)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 customerID        7043 non-null   object 
 gender            7043 non-null   object 
 SeniorCitizen     7043 non-null   int64  
 Partner           7043 non-null   object 
 Dependents        7043 non-null   object 
 tenure            7043 non-null   int64  
 PhoneService      7043 non-null   object 
 MultipleLines     7043 non-null   object 
 InternetService   7043 non-null   object 
 OnlineSecurity    7043 non-null   object 
OnlineBackup      7043 non-null   object 
DeviceProtection  7043 non-null   object 
TechSupport       7043 non-null   object 
StreamingTV       7043 non-null   object 
StreamingMovies   7043 non-null   object 
Contract          7043 non-null   object 
PaperlessBilling  7043 non-null   object 
PaymentMethod     7043 non-null   object 
MonthlyCharges    7043 non-null   float64
TotalCharges      7043 non-null   object 
Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

Attribute Definitions¶

customerID is the unique identifier per customer.
MonthlyCharges and TotalCharges identify the monthly and total spending to date for the population. Both should be float attributes.
tenure is how long the customer has been with the service, measured in months.
All other 16 attributes are categorical and highlight customer attributes (e.g. Senior Citizen) or usage of various features or (e.g. Phone Service).

df.sample(n=5)

	customerID	gender	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	...	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
2178	1657-DYMBM	Male	Yes	No	53	Yes	No	No	No internet service	...	No internet service	No internet service	No internet service	No internet service	Two year	No	Credit card (automatic)	19.90	1110.05	No
1217	0784-GTUUK	Male	Yes	No	62	Yes	Yes	No	No internet service	...	No internet service	No internet service	No internet service	No internet service	One year	Yes	Credit card (automatic)	23.75	1446.8	No
491	6778-YSNIH	Female	No	No	2	Yes	No	DSL	No	...	No	No	Yes	No	Month-to-month	Yes	Electronic check	59.00	114.15	No
4522	7305-ZWMAJ	Male	Yes	No	30	Yes	No	No	No internet service	...	No internet service	No internet service	No internet service	No internet service	One year	No	Credit card (automatic)	20.70	587.1	No
5734	6408-WHTEF	Male	Yes	Yes	72	Yes	Yes	DSL	Yes	...	Yes	Yes	Yes	Yes	Two year	Yes	Mailed check	89.40	6376.55	No

5 rows × 21 columns

Basic Transformations¶

Converting Total Charges to a float attribute.

df['TotalCharges'] = df['TotalCharges'].str.replace(r' ','0').astype(float)
df['TotalCharges'].dtypes

dtype('float64')

Changing Churn to better represent a binary attribute with 1s and 0s vs. “No” or “Yes”. This aids further computation.
Changing Senior Citizen to a categorical attribute (No/Yes) from 1s and 0s.
Changing a number of other columns to showcase visualizations better.

df['Churn'] = df['Churn'].apply(lambda x: 0 if x == "No" else 1)
df['SeniorCitizen'] = df['SeniorCitizen'].apply(lambda x: "No" if x == 0 else "Yes")

# Cosmetic edits to showcase visualizations better
df['OnlineSecurity'] = df['OnlineSecurity'].apply(lambda x: "Other" if x == "No internet service" else x)
df['OnlineBackup'] = df['OnlineBackup'].apply(lambda x: "Other" if x == "No internet service" else x)
df['DeviceProtection'] = df['DeviceProtection'].apply(lambda x: "Other" if x == "No internet service" else x)
df['TechSupport'] = df['TechSupport'].apply(lambda x: "Other" if x == "No internet service" else x)
df['StreamingTV'] = df['StreamingTV'].apply(lambda x: "Other" if x == "No internet service" else x)
df['StreamingMovies'] = df['StreamingMovies'].apply(lambda x: "Other" if x == "No internet service" else x)

df['MultipleLines'] = df['MultipleLines'].apply(lambda x: "Other" if x == "No phone service" else x)

df['PaymentMethod'] = df['PaymentMethod'].map({'Bank transfer (automatic)':'Bank','Credit card (automatic)':'Credit','Mailed check':'MCheck','Electronic check':'ECheck',})

df.head()

	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	...	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	No	Yes	No	1	No	Other	DSL	No	...	No	No	No	No	Month-to-month	Yes	ECheck	29.85	29.85	0
1	5575-GNVDE	Male	No	No	No	34	Yes	No	DSL	Yes	...	Yes	No	No	No	One year	No	MCheck	56.95	1889.50	0
2	3668-QPYBK	Male	No	No	No	2	Yes	No	DSL	Yes	...	No	No	No	No	Month-to-month	Yes	MCheck	53.85	108.15	1
3	7795-CFOCW	Male	No	No	No	45	No	Other	DSL	Yes	...	Yes	Yes	No	No	One year	No	Bank	42.30	1840.75	0
4	9237-HQITU	Female	No	No	No	2	Yes	No	Fiber optic	No	...	No	No	No	No	Month-to-month	Yes	ECheck	70.70	151.65	1

5 rows × 21 columns

Bin the tenure attribute into 10 buckets.

df['tenure_bins'] = pd.cut(df['tenure'], bins=10, include_lowest=True)

Bin the MonthlyCharges attribute into 10 buckets.

df['monthlyCharges_bins'] = pd.cut(df['MonthlyCharges'], bins=10, include_lowest=True)

Basic Analyses¶

Customer counts by categorical attributes¶

fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(16, 15))
column_list=['OnlineSecurity', 'PaymentMethod', 'Partner', 'Dependents','OnlineBackup',
        'gender', 'SeniorCitizen', 'PhoneService', 'MultipleLines','InternetService',
         'DeviceProtection', 'TechSupport','StreamingTV','StreamingMovies',
         'Contract','PaperlessBilling']

# Iterate through all combinations
for i, var in enumerate(column_list):
    # Adjusting for the multi-line assignments
    if i <=3:
        pos1=0
        pos2=i
    elif i <=7:
        pos1=1
        pos2=i-4
    elif i <=11:
        pos1=2
        pos2=i-8
    elif i <=15:
        pos1=3
        pos2=i-12        
    tt = df.groupby(var).size().to_frame()
    tt.index.rename('', inplace=True) # To remove extra index labels for visualizations
    tdf = tt.plot(kind='bar', ax=axes[pos1,pos2], title=str(var), legend=None)

    for container in tdf.containers:
        tdf.bar_label(container)
    
plt.tight_layout()

Histogram of Customer Tenure (in months)¶

ax = df['tenure'].plot(
    kind='hist',
    title="'tenure' histogram for all customers",
    figsize=(12,6), 
    xticks=[1,5,12,20,24,30,36,40,48,50,60,70,72]
)

This is an interesting histogram where a large percentage of customers are fairly new (less than a year old) while a number of older customers (greater than 5 years) make up another large percentage.

churned = df[ df['Churn']==1]
ax = churned['tenure'].plot(
    kind='hist',
    title="'tenure' histogram for customers that churned",
    figsize=(12,6), 
    xticks=[1,5,12,20,24,30,36,40,48,50,60,70,72])

As one would expect, most of the churned customers are less than a year old. Many organizations find this to be the biggest segment of churn, i.e. customers fairly new to the service or product.

(Note: A good lesson on survivorship bias.)

Churned customer population¶

(Note: 0 = Not churned, 1 = Churned)

churn_pp = df.groupby('Churn').size().to_frame()
ax = churn_pp.plot(kind='bar', legend=None)
# ax.bar_label(ax.containers[0])
for container in ax.containers:
    ax.bar_label(container)

Churn ratio by categorical attributes¶

fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(15, 15))
column_list=['OnlineSecurity', 'PaymentMethod', 'Partner', 'Dependents','OnlineBackup',
        'gender', 'SeniorCitizen', 'PhoneService', 'MultipleLines','InternetService',
         'DeviceProtection', 'TechSupport','StreamingTV','StreamingMovies',
         'Contract','PaperlessBilling']

# Iterate through all combinations
for i, var in enumerate(column_list):
    # Adjusting for the multi-line assignments
    if i <=3:
        pos1=0
        pos2=i
    elif i <=7:
        pos1=1
        pos2=i-4
    elif i <=11:
        pos1=2
        pos2=i-8
    elif i <=15:
        pos1=3
        pos2=i-12
    c_df = df.groupby(var).agg({'Churn': ['sum','count']})
    c_df.columns = ['sum', 'count']
    c_df['percent'] = (c_df['sum']/c_df['count']).round(2)
    c_df = c_df[['percent']]
    c_df.index.rename('', inplace=True) # To remove extra index labels for visualizations
    tt = c_df.plot(kind='bar', ax=axes[pos1,pos2], title=str(var), legend=None)

    for container in tt.containers:
        tt.bar_label(container)
    
plt.tight_layout()

Exploring the relationship between tenure and monthly and total charges¶

ax = df.plot(x='tenure', y='MonthlyCharges', kind='scatter', title="Comparing 'tenure' and monthly charges")

No particular relationship seems to be apparent.

ax = df.plot(x='tenure', y='TotalCharges', kind='scatter', title="Comparing 'tenure' and total charges")

As one would expect, there is a linear relationship between tenure and total charges, i.e. with time, one’s total charges grow.

Monthly Charges by Uniform Spending Cohorts¶

mc_bin = df.groupby(by='monthlyCharges_bins').size().to_frame()
mc_bin.columns = ['Customer Count']
ax = mc_bin.plot(kind='barh')

By dividing the customer base into 10 uniform cohorts, one can see that a large proportion spend less than $28 dollars per month. A non-trivial proportion also spends above $100 per month.

Feature Importance¶

By focusing on most of the categorical attributes, and grouping tenure into specific bins, we can use a classifier to identify the attributes that have a stronger bearing on churn. The fact that being on a month to month contract, being early in the use of the service (tenure bin < 7 months), and paying by electronic check corroborates with the learning above around churn rates.

feature_df = df.copy()

# Drop un-needed columns, and convert categorical column to object
feature_df = feature_df.drop(['customerID','TotalCharges', 'MonthlyCharges', 'monthlyCharges_bins', 'tenure'], axis=1)
feature_df['tenure_bins'] = feature_df['tenure_bins'].astype(object)

# One hot encode categorical columns
feature_df = pd.get_dummies(feature_df)
# feature_df.info()

# Train the dataset with a Random Forest Classifier
X,y = feature_df.drop('Churn', axis=1), feature_df.Churn

from sklearn.ensemble import RandomForestClassifier
params={'random_state':42, 'n_estimators':500}
clf=RandomForestClassifier(**params)
clf.fit(X,y)

fv = pd.Series(data=clf.feature_importances_,index=X.columns).sort_values()
ax = fv.plot(kind='barh', figsize=(12,12))
plt.tight_layout()

MLOPs Primer

Exploratory Data Analysis

Contents