An In-Depth Project with Scikit-Learn

For this project I was tasked with taking a CSV all the way through to a finalised model ready to make predictions. The task was a regression problem: predicting insurance claim values from a mixture of numerical and categorical features. The data was deliberately messy, requiring plenty of imputation and other pre-processing. This article covers the pre-processing and then the use of gradient boosted trees to perform the prediction. There is a GitHub link at the bottom to a notebook with more models, including random forests and a dual model that first classifies whether a claim should be zero or non-zero and then uses regression to find the value. The notebook is interactive and details how and why each step was completed, so I thoroughly recommend looking at it alongside this article for a more in-depth view.

The process followed was: take the data from a CSV into a Pandas dataframe, impute the missing data, pre-process it, balance the dataset, select features, and finally train and evaluate a classifier.

Loading the data

import pandas as pd 
import numpy as np
#Will convert '?' to NaN as well
claim_data = pd.read_csv('./data/train.csv', encoding= 'unicode_escape', na_values='?')

This placed the data inside a dataframe with missing entries marked as NaN. From there I counted the nulls and found which columns they appeared in, which happened to be categorical columns only. For any column with less than 0.5% of its values missing, I simply threw away the affected rows. This cost just 252 rows of a roughly 30K-row dataset.
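As a rough sketch of how that step could look (the exact check lives in the notebook; the 0.5% cut-off here is the one described above):

#Count the missing values per column to see where the NaNs live
null_counts = claim_data.isna().sum()
print(null_counts[null_counts > 0])

#Illustrative only: drop rows whose NaNs fall in columns that are only sparsely missing (< 0.5% of rows)
sparse_na_cols = null_counts[(null_counts > 0) & (null_counts < 0.005 * len(claim_data))].index
claim_data = claim_data.dropna(subset=sparse_na_cols)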

Imputation

The remaining categorical columns required a method of imputation. I chose a K-Nearest Neighbours approach to keep these features as accurate as possible. This was preferred over modal imputation because the modal value was already highly dominant in those columns, so filling with it would only have made the dataset more imbalanced.

#Confirm we have NA data
print(claim_data.isna().sum().sum())
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

#Generate our one hot encoding in the dataframe
cat_variables = claim_data[["Cat1","Cat2","Cat3","Cat4","Cat5","Cat6","Cat7","Cat8","Cat9","Cat10","Cat11","Cat12","OrdCat","NVCat"]]
cat_dummies = pd.get_dummies(cat_variables)

claim_data_ohe = pd.concat([claim_data.drop(cat_variables.columns, axis=1), cat_dummies], axis=1)

scaler = MinMaxScaler()
#Too much data to onehotencode and package issues stopped me from hashing these - Not ideal but has to be done
#Also remove the target variable
claim_data_ohe = claim_data_ohe.drop(["Blind_Make","Blind_Model", "Blind_Submodel", "Claim_Amount"], axis=1)

#Normalise the data to between 0 and 1
claim_data_ohe = pd.DataFrame(scaler.fit_transform(claim_data_ohe), columns= claim_data_ohe.columns)

#Find with KNN
imputer = KNNImputer(n_neighbors=7)

claim_data_imputed = pd.DataFrame(imputer.fit_transform(claim_data_ohe), columns= claim_data_ohe.columns)

#Reverse the one hot encoding back into the original categorical columns
columns_to_inv = cat_variables.columns
for cat in columns_to_inv:
    #For each row, the dummy column with the highest (imputed) value wins
    new_column = pd.DataFrame(claim_data_imputed.filter(regex="^" + cat + "_").idxmax(axis=1), columns=[cat])
    claim_data[cat] = new_column[cat].str.replace(cat + "_", '', regex=False)
#Confirm we have no nulls left in our dataset
claim_data.isnull().sum().sum()

The code above imputed each missing value using the nearest neighbours found across the other features, then reversed the one hot encoding and placed the recovered categories back inside the dataset.

Pre-Processing

Next I built a column transformer with Scikit-learn to apply normalisation to the numerical values and One Hot Encoding to the categorical values so they could be used by the models later. A second transformer was made just for the tree-based models, without the normalisation, as trees do not need it.

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import Normalizer
from sklearn.compose import ColumnTransformer

attributes_num = ["Var1", "Var2","Var3","Var4","Var5","Var6","Var7","Var8","NVVar1","NVVar2","NVVar3","NVVar4"]
attributes_cat = ["Blind_Make", "Cat1","Cat2","Cat3","Cat4","Cat5","Cat6","Cat7","Cat8","Cat9","Cat10","Cat11","Cat12","NVCat"]
attributes_other = ["Household_ID", "Vehicle", "Calendar_Year", "Model_Year", "OrdCat"]

#Below I drop some columns, these are used in the actual transformer as opposed to all the columns which are above
attributes_num_used = ["Var1", "Var2","Var3","Var4","Var5","Var6","Var7","Var8","NVVar1","NVVar2","NVVar3","NVVar4"]
attributes_cat_used = ["Cat1","Cat3","Cat5","Cat6","Cat7","Cat8","Cat9","NVCat"]

full_transform = ColumnTransformer([
    ("num", Normalizer(), attributes_num_used),
    ("cat", OneHotEncoder(), attributes_cat_used),
])

#Transformer without numerical for the tree models
tree_transformer = ColumnTransformer([
    ("cat", OneHotEncoder(), attributes_cat_used)
],remainder='passthrough')

Data balancing

I then explored balancing the dataset better, as it contained far more zero claim values than non-zero ones.

To address this I elected to use random under-sampling with a sampling strategy of 0.8, i.e. the majority (zero) class is reduced until the minority class is 80% of its size, so as not to lose too much data. This was achieved as follows.

from imblearn.under_sampling import RandomUnderSampler
#Take in the features (including the "Binary" target column) and return an under-sampled dataframe
def sampler(features):
    #Under-sample the majority (zero-claim) class down to a 0.8 minority/majority ratio
    undersample = RandomUnderSampler(sampling_strategy=0.8)
    features, y = undersample.fit_resample(features, features["Binary"])
    #Shuffle the rows so the classes are mixed
    features = features.sample(frac=1).reset_index(drop=True)
    return features
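For context, here is a minimal usage sketch of this helper (my own illustration; the notebook builds the binary flag as part of its train/test splitting). The "Binary" column simply marks whether a claim is non-zero, and the sampler is then applied to the dataframe carrying it.

#Illustrative only: flag non-zero claims, then under-sample the zero-claim majority
claim_data["Binary"] = (claim_data["Claim_Amount"] > 0).astype(int)
claim_data_balanced = sampler(claim_data)
print(claim_data_balanced["Binary"].value_counts())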

Feature Selection

Finally, so as not to have too many features in the dataset, I decided to keep only the significant features with the highest correlation. Using Spearman's rank correlation I compared each column with the claim amount. The data turned out to be extremely uncorrelated, with the highest correlation being just 0.08. I dropped the most weakly correlated columns and moved on to building the model.

from scipy import stats
from operator import itemgetter

#Spearman rank correlation of every feature against the claim amount
corr_matrix = []
for feature in attributes_cat + attributes_num + attributes_other:
    corr_matrix.append((feature, stats.spearmanr(claim_data[feature], claim_data["Claim_Amount"])[0]))

#Sort from the most negative to the most positive correlation
print(sorted(corr_matrix, key=itemgetter(1)))
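The drop itself is straightforward. A hedged sketch, assuming a small hypothetical cut-off on the absolute correlation (the exact set of columns kept is shown in the notebook):

#Illustrative only: discard features whose absolute Spearman correlation falls below a small threshold
threshold = 0.01
low_corr_features = [feature for feature, corr in corr_matrix if abs(corr) < threshold]
claim_data = claim_data.drop(columns=low_corr_features)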

Classification

The classifier is detailed below. The data is taken from where it was sampled above in the notebook. I split it into features and labels, transform the features, and hand them to the model to be trained. Finally I use the model to predict on the validation set and compare against the true labels, achieving a final F-score of 0.34, which was the best performance from a single model.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

#Reminder of the dataframes prepared earlier in the notebook:
#claim_data_train_under_bin, claim_data_train2_under_bin,
#claim_data_test_bin, claim_data_valid_bin

#Use these dataframes to set the correct target classes etc.
#Now that the data is sampled, split it into features and labels - first the full train and test sets
claim_data_train_under_bin_features = claim_data_train_under_bin.drop(["Binary"], axis=1)
claim_data_train_under_bin_labels = pd.DataFrame(claim_data_train_under_bin["Binary"])
claim_data_test_bin_features = claim_data_test_bin.drop(["Binary"], axis=1)
claim_data_test_bin_labels = pd.DataFrame(claim_data_test_bin["Binary"])

#Now do my train2 and validation split
claim_data_train2_under_bin_features = claim_data_train2_under_bin.drop(["Binary"], axis=1)
claim_data_train2_under_bin_labels = pd.DataFrame(claim_data_train2_under_bin["Binary"])
claim_data_valid_bin_features = claim_data_valid_bin.drop(["Binary"], axis=1)
claim_data_valid_bin_labels = pd.DataFrame(claim_data_valid_bin["Binary"])

#We now need to apply the tree transformer, as the numerical data does not need scaling for the tree models
claim_data_train2_under_bin_features_tree_transformed = tree_transformer.fit_transform(claim_data_train2_under_bin_features)
claim_data_valid_bin_features_tree_transformed = tree_transformer.transform(claim_data_valid_bin_features)

#Try out the GBC
gbc = GradientBoostingClassifier()

#Grid search (uncomment to re-run; requires GridSearchCV from sklearn.model_selection)
#parameters = dict(n_estimators = [20, 50, 100, 200], max_features = ['auto', 'sqrt','log2'] ,max_depth=[3,5,7,9], learning_rate = [0.1,0.01,0.25,0.5])
#grid = GridSearchCV(gbc, param_grid=parameters, cv=5, n_jobs=-1, scoring='neg_mean_squared_error')
#grid.fit(claim_data_train2_under_bin_features_tree_transformed, claim_data_train2_under_bin_labels)

#Best parameters found by the grid search above
gbc_best = {'learning_rate': 0.1, 'max_depth': 5, 'max_features': 'sqrt', 'n_estimators': 50}

#Set the best params
gbc.set_params(**gbc_best)

#fit the model
gbc.fit(claim_data_train2_under_bin_features_tree_transformed, claim_data_train2_under_bin_labels)

#Predict on the validation set
gbc_pred = gbc.predict(claim_data_valid_bin_features_tree_transformed)

print("The GBC score is: ", f1_score(claim_data_valid_bin_labels,gbc_pred))

Conclusion

This project allowed me to work through a full machine learning pipeline and get to better grips with Scikit-learn and all it has to offer. I found the hardest parts were 'juggling' the data and making sure good practice was followed in the order in which things take place, to avoid data leakage. This sample of work comes from a full notebook which contains many more classifiers and greater descriptive detail; it can be found at:

https://github.com/naathanbrown/insurance-clasifier
