Background of the Dataset

This dataset was collected from Kaggle at this link. It comes from an insurance company and contains information on the company's customers. The data was originally used in a hackathon and is split into a training set and a test set; only the training set includes the correct classification. The goal was to build a model that predicts whether customers who currently hold health insurance with the company are interested in expanding their coverage to include car insurance. Hackathon models were assessed by AUC, with the highest value performing best. I am adding the constraint that the model should also be interpretable, so that it can best inform business decisions. The dataset is very clean, with no missing values, so data pre-processing is limited.
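Since AUC is the evaluation metric, it may help to see what it measures. The sketch below is a minimal pure-Python illustration, not the hackathon's scoring code: it computes AUC as the fraction of (positive, negative) pairs in which the positive case receives the higher score, counting ties as half. The function name `auc_score` and the toy labels and scores are hypothetical and are not taken from the dataset.

```python
def auc_score(labels, scores):
    """AUC as the probability that a randomly chosen positive case
    is scored above a randomly chosen negative case (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Toy values for illustration only (not from the insurance data).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_score(toy_labels, toy_scores)
print(toy_auc)
```

This pairwise version is quadratic in the number of rows, so it is only meant to make the definition concrete. On a dataset of realistic size, a library routine such as scikit-learn's `roc_auc_score` would be the practical choice.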

Outcome Variable

The outcome variable for this dataset is Response. It is a factor variable with two levels indicating whether a customer is interested in acquiring car insurance (1 = Yes) or not (0 = No).

Predictor Variables

The table below lists the predictors in the dataset with a short description of each. Both Region_Code and PolicySalesChannel hold anonymized codes, so there is no way to determine which actual regions or sales channels the values correspond to.

Predictor Variable    Description
------------------    -----------
id                    Identifier for each customer
Gender                Gender of the customer (Male or Female)
Age                   Age of the customer
Driving_License       Whether the customer has a driver's license (1) or not (0)
Region_Code           Region identifier for the customer
Previously_Insured    Whether the customer already has car insurance (1) or not (0)
Vehicle_Age           Age of the vehicle
Vehicle_Damage        Whether the customer's vehicle was damaged in the past (1) or not (0)
Annual_Premium        Annual premium the customer pays
PolicySalesChannel    Anonymized code for how the customer was contacted to purchase car insurance
Vintage               Number of days the customer has been associated with the company
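
The coding described in the table can be checked before modeling. The sketch below is one possible sanity check, assuming the three flag columns are stored as 0/1 exactly as the table describes. The helper name `find_invalid_rows` and the two-row sample are made up for illustration; the sample is not real data from the dataset.

```python
import csv
import io

# Columns the table describes as binary flags, with 1 meaning "yes".
# Assumption: the file stores them as the characters "0" and "1".
BINARY_COLUMNS = ["Driving_License", "Previously_Insured", "Vehicle_Damage"]


def find_invalid_rows(rows):
    """Return (row_index, column, value) for every blank or non-0/1 flag value."""
    problems = []
    for i, row in enumerate(rows):
        for col in BINARY_COLUMNS:
            value = (row.get(col) or "").strip()
            if value not in {"0", "1"}:
                problems.append((i, col, value))
    return problems


# Made-up two-row sample; the second row has a blank Previously_Insured.
sample_csv = """id,Gender,Age,Driving_License,Previously_Insured,Vehicle_Damage
1,Male,44,1,0,1
2,Female,30,1,,0
"""
sample_rows = list(csv.DictReader(io.StringIO(sample_csv)))
problems = find_invalid_rows(sample_rows)
print(problems)
```

On the actual training file, an empty result from a check like this would be consistent with the statement that the data has no missing values, at least for these three columns.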