This dataset was collected from kaggle at this link. It consists of data from an insurance company with information on the customer. The data was originally used in a hackathon and contains a training and test data set. The test data set does not include the correct classification while the training data set does. The intention behind the data set was to generate a model that can predict whether customers who currently have health insurance with the company are interested in expanding their coverage to include car insurance. The model performance for the hackathon was to be assessed via the largest AUC value. I am also adding the constraint that the model should also be interpretable so as to best inform business decisions. The data set is very clean with no missing values, so data pre-processing is limited.
The outcome variable for this data set is the Response variable. This is a factor variable with two levels that correspond to whether a customer is interested in acquiring car insurance (Yes-1) or not (No-0).
This table contains the list of predictors in the dataset with a short description. Both Region_Code and PolicySalesChannel have values that are anonymized with no way to determine what the values correspond to.
| Predictor Variable | Description |
|---|---|
| id | Identification for each customer |
| Gender | Gender of the Customer (Male or Female) |
| Age | Age of the customer |
| Driving_License | Customer does not have DL (0) customer has DL (1) |
| Region_Code | Region identifier for the customer |
| Previously_Insured | The customer has car insurance (1) or does not (0) |
| Vehicle_Age | Age of the Vehicle |
| Vehicle_Damage | The customer got their vehicle damaged in the past (1) or not (0) |
| Annual_Premium | Annual premium the customer pays |
| PolicySalesChannel | Anonymized code for how the customer was contacted to purchase car insurance |
| Vintage | Number of Days the customer has been associated with the company |