Exploratory Data Analysis

The purpose of this section of the notebook is to provide some key highlights of the baseline data being used. This showcases the various attributes, any specific transformations, and key relationships.

import pandas as pd
import matplotlib.pyplot as plt

# Load the primary dataset
df = pd.read_csv('./../datasets/original/WA_Fn-UseC_-Telco-Customer-Churn.csv')

As can be seen below, the dataset contains 21 columns and 7,043 rows

df.shape
(7043, 21)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB

Attribute Definitions

  • customerID is the unique identifier per customer.

  • MonthlyCharges and TotalCharges identify the monthly and total spending to date for the population. Both should be float attributes.

  • tenure is how long the customer has been with the service, measured in months.

  • All other 16 attributes are categorical and highlight customer attributes (e.g. Senior Citizen) or usage of various features or (e.g. Phone Service).

df.sample(n=5)
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
2178 1657-DYMBM Male 0 Yes No 53 Yes No No No internet service ... No internet service No internet service No internet service No internet service Two year No Credit card (automatic) 19.90 1110.05 No
1217 0784-GTUUK Male 0 Yes No 62 Yes Yes No No internet service ... No internet service No internet service No internet service No internet service One year Yes Credit card (automatic) 23.75 1446.8 No
491 6778-YSNIH Female 0 No No 2 Yes No DSL No ... No No Yes No Month-to-month Yes Electronic check 59.00 114.15 No
4522 7305-ZWMAJ Male 0 Yes No 30 Yes No No No internet service ... No internet service No internet service No internet service No internet service One year No Credit card (automatic) 20.70 587.1 No
5734 6408-WHTEF Male 0 Yes Yes 72 Yes Yes DSL Yes ... Yes Yes Yes Yes Two year Yes Mailed check 89.40 6376.55 No

5 rows × 21 columns

Basic Transformations

Converting Total Charges to a float attribute.

df['TotalCharges'] = df['TotalCharges'].str.replace(r' ','0').astype(float)
df['TotalCharges'].dtypes
dtype('float64')
  • Changing Churn to better represent a binary attribute with 1s and 0s vs. “No” or “Yes”. This aids further computation.

  • Changing Senior Citizen to a categorical attribute (No/Yes) from 1s and 0s.

  • Changing a number of other columns to showcase visualizations better.

df['Churn'] = df['Churn'].apply(lambda x: 0 if x == "No" else 1)
df['SeniorCitizen'] = df['SeniorCitizen'].apply(lambda x: "No" if x == 0 else "Yes")

# Cosmetic edits to showcase visualizations better
df['OnlineSecurity'] = df['OnlineSecurity'].apply(lambda x: "Other" if x == "No internet service" else x)
df['OnlineBackup'] = df['OnlineBackup'].apply(lambda x: "Other" if x == "No internet service" else x)
df['DeviceProtection'] = df['DeviceProtection'].apply(lambda x: "Other" if x == "No internet service" else x)
df['TechSupport'] = df['TechSupport'].apply(lambda x: "Other" if x == "No internet service" else x)
df['StreamingTV'] = df['StreamingTV'].apply(lambda x: "Other" if x == "No internet service" else x)
df['StreamingMovies'] = df['StreamingMovies'].apply(lambda x: "Other" if x == "No internet service" else x)

df['MultipleLines'] = df['MultipleLines'].apply(lambda x: "Other" if x == "No phone service" else x)

df['PaymentMethod'] = df['PaymentMethod'].map({'Bank transfer (automatic)':'Bank','Credit card (automatic)':'Credit','Mailed check':'MCheck','Electronic check':'ECheck',})

df.head()
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female No Yes No 1 No Other DSL No ... No No No No Month-to-month Yes ECheck 29.85 29.85 0
1 5575-GNVDE Male No No No 34 Yes No DSL Yes ... Yes No No No One year No MCheck 56.95 1889.50 0
2 3668-QPYBK Male No No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes MCheck 53.85 108.15 1
3 7795-CFOCW Male No No No 45 No Other DSL Yes ... Yes Yes No No One year No Bank 42.30 1840.75 0
4 9237-HQITU Female No No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes ECheck 70.70 151.65 1

5 rows × 21 columns

Bin the tenure attribute into 10 buckets.

df['tenure_bins'] = pd.cut(df['tenure'], bins=10, include_lowest=True)

Bin the MonthlyCharges attribute into 10 buckets.

df['monthlyCharges_bins'] = pd.cut(df['MonthlyCharges'], bins=10, include_lowest=True)

Basic Analyses

Customer counts by categorical attributes

fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(16, 15))
column_list=['OnlineSecurity', 'PaymentMethod', 'Partner', 'Dependents','OnlineBackup',
        'gender', 'SeniorCitizen', 'PhoneService', 'MultipleLines','InternetService',
         'DeviceProtection', 'TechSupport','StreamingTV','StreamingMovies',
         'Contract','PaperlessBilling']

# Iterate through all combinations
for i, var in enumerate(column_list):
    # Adjusting for the multi-line assignments
    if i <=3:
        pos1=0
        pos2=i
    elif i <=7:
        pos1=1
        pos2=i-4
    elif i <=11:
        pos1=2
        pos2=i-8
    elif i <=15:
        pos1=3
        pos2=i-12        
    tt = df.groupby(var).size().to_frame()
    tt.index.rename('', inplace=True) # To remove extra index labels for visualizations
    tdf = tt.plot(kind='bar', ax=axes[pos1,pos2], title=str(var), legend=None)

    for container in tdf.containers:
        tdf.bar_label(container)
    
plt.tight_layout()
_images/exploratory_analysis_19_0.png

Histogram of Customer Tenure (in months)

ax = df['tenure'].plot(
    kind='hist',
    title="'tenure' histogram for all customers",
    figsize=(12,6), 
    xticks=[1,5,12,20,24,30,36,40,48,50,60,70,72]
)
_images/exploratory_analysis_21_0.png

This is an interesting histogram where a large percentage of customers are fairly new (less than a year old) while a number of older customers (greater than 5 years) make up another large percentage.

churned = df[ df['Churn']==1]
ax = churned['tenure'].plot(
    kind='hist',
    title="'tenure' histogram for customers that churned",
    figsize=(12,6), 
    xticks=[1,5,12,20,24,30,36,40,48,50,60,70,72])
_images/exploratory_analysis_23_0.png

As one would expect, most of the churned customers are less than a year old. Many organizations find this to be the biggest segment of churn, i.e. customers fairly new to the service or product.

(Note: A good lesson on survivorship bias.)

Churned customer population

(Note: 0 = Not churned, 1 = Churned)

churn_pp = df.groupby('Churn').size().to_frame()
ax = churn_pp.plot(kind='bar', legend=None)
# ax.bar_label(ax.containers[0])
for container in ax.containers:
    ax.bar_label(container)
_images/exploratory_analysis_26_0.png

Churn ratio by categorical attributes

fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(15, 15))
column_list=['OnlineSecurity', 'PaymentMethod', 'Partner', 'Dependents','OnlineBackup',
        'gender', 'SeniorCitizen', 'PhoneService', 'MultipleLines','InternetService',
         'DeviceProtection', 'TechSupport','StreamingTV','StreamingMovies',
         'Contract','PaperlessBilling']

# Iterate through all combinations
for i, var in enumerate(column_list):
    # Adjusting for the multi-line assignments
    if i <=3:
        pos1=0
        pos2=i
    elif i <=7:
        pos1=1
        pos2=i-4
    elif i <=11:
        pos1=2
        pos2=i-8
    elif i <=15:
        pos1=3
        pos2=i-12
    c_df = df.groupby(var).agg({'Churn': ['sum','count']})
    c_df.columns = ['sum', 'count']
    c_df['percent'] = (c_df['sum']/c_df['count']).round(2)
    c_df = c_df[['percent']]
    c_df.index.rename('', inplace=True) # To remove extra index labels for visualizations
    tt = c_df.plot(kind='bar', ax=axes[pos1,pos2], title=str(var), legend=None)

    for container in tt.containers:
        tt.bar_label(container)
    
plt.tight_layout()
_images/exploratory_analysis_28_0.png

Exploring the relationship between tenure and monthly and total charges

ax = df.plot(x='tenure', y='MonthlyCharges', kind='scatter', title="Comparing 'tenure' and monthly charges")
_images/exploratory_analysis_30_0.png

No particular relationship seems to be apparent.

ax = df.plot(x='tenure', y='TotalCharges', kind='scatter', title="Comparing 'tenure' and total charges")
_images/exploratory_analysis_32_0.png

As one would expect, there is a linear relationship between tenure and total charges, i.e. with time, one’s total charges grow.

Monthly Charges by Uniform Spending Cohorts

mc_bin = df.groupby(by='monthlyCharges_bins').size().to_frame()
mc_bin.columns = ['Customer Count']
ax = mc_bin.plot(kind='barh')
_images/exploratory_analysis_35_0.png

By dividing the customer base into 10 uniform cohorts, one can see that a large proportion spend less than $28 dollars per month. A non-trivial proportion also spends above $100 per month.

Feature Importance

By focusing on most of the categorical attributes, and grouping tenure into specific bins, we can use a classifier to identify the attributes that have a stronger bearing on churn. The fact that being on a month to month contract, being early in the use of the service (tenure bin < 7 months), and paying by electronic check corroborates with the learning above around churn rates.

feature_df = df.copy()

# Drop un-needed columns, and convert categorical column to object
feature_df = feature_df.drop(['customerID','TotalCharges', 'MonthlyCharges', 'monthlyCharges_bins', 'tenure'], axis=1)
feature_df['tenure_bins'] = feature_df['tenure_bins'].astype(object)

# One hot encode categorical columns
feature_df = pd.get_dummies(feature_df)
# feature_df.info()

# Train the dataset with a Random Forest Classifier
X,y = feature_df.drop('Churn', axis=1), feature_df.Churn

from sklearn.ensemble import RandomForestClassifier
params={'random_state':42, 'n_estimators':500}
clf=RandomForestClassifier(**params)
clf.fit(X,y)

fv = pd.Series(data=clf.feature_importances_,index=X.columns).sort_values()
ax = fv.plot(kind='barh', figsize=(12,12))
plt.tight_layout()
_images/exploratory_analysis_39_0.png