Score card development
For fraud management applications, we want to predict a fraud score (the probability of fraud) so that we can process credit card and insurance applications and claims. We define a one-year observation window and a performance window of six months to one year. Based on past investigations of customers, fraud cases are marked as 1 and non-fraud cases as 0.
The fraud profile is discussed with stakeholders, and their input is collected. You can adjust the modelling runs to include variables that different stakeholders consider necessary. The algorithm has two components: modelling fraud versus non-fraud using logistic regression, and then predicting the fraud score with the developed model. You need two datasets: one with a mix of known fraud and non-fraud cases, and a second with the cases to be predicted. There are two sets of variables: the independent variables and the dependent variable. The dependent variable depends on the values of the independent variables; it takes the value 1 for fraud and 0 for non-fraud.
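As a minimal sketch of the second component, the function below shows how a fitted logistic regression turns independent variables into a fraud score between 0 and 1. The variable names and coefficient values are hypothetical and only illustrate the calculation; a real model would take them from the fit on the modelling dataset.

```python
import math

def fraud_score(intercept, coefficients, features):
    """Logistic regression score: 1 / (1 + exp(-(b0 + sum(b_i * x_i))))."""
    z = intercept + sum(coefficients[name] * value for name, value in features.items())
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients and one hypothetical case to be scored
coefficients = {"claims_last_year": 0.8, "address_changed": 1.2}
case = {"claims_last_year": 2, "address_changed": 1}
score = fraud_score(-3.0, coefficients, case)
print(round(score, 4))
```

A score close to 1 means the case looks like past fraud cases; a score close to 0 means it looks like past non-fraud cases.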
Part A- Modelling of fraud and non-fraud
First, the modelling dataset is analysed for missing values; if any exist, a missing-value indicator is inserted. The dataset is then divided into two stratified samples, train and validation, in a 2:1 ratio.
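The missing-value indicator and the stratified 2:1 split can be sketched with the standard library alone. The column names ("fraud", "income") and the toy data are assumptions made for illustration; stratifying on the fraud flag keeps the fraud rate similar in both samples.

```python
import random
from collections import defaultdict

def add_missing_indicator(rows, column):
    """Add a 0/1 flag column marking rows where `column` is missing (None)."""
    for row in rows:
        row[column + "_missing"] = 1 if row.get(column) is None else 0
    return rows

def stratified_split(rows, label_key="fraud", train_share=2 / 3, seed=42):
    """Split rows into train and validation samples, separately within each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    train, valid = [], []
    for label_rows in by_label.values():
        shuffled = label_rows[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * train_share)
        train.extend(shuffled[:cut])
        valid.extend(shuffled[cut:])
    return train, valid

# Hypothetical toy data: 30 fraud and 270 non-fraud cases, some incomes missing
rows = [
    {"id": i, "fraud": 1 if i < 30 else 0, "income": None if i % 10 == 0 else 1000 + i}
    for i in range(300)
]
rows = add_missing_indicator(rows, "income")
train, valid = stratified_split(rows)
print(len(train), len(valid))
```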
Next, we prepare the data for modelling. Missing values are imputed with the median, mean or zero, depending on the type of variable. Outliers are treated by percentile capping, for example at P1 (1st percentile) and P99 (99th percentile). Character variables are dummy coded as binary 1 and 0: if a character variable has n categories, n-1 dummy variables are created. If a variable has too many categories, it is either removed from the dataset or its categories are combined, depending on the importance of the variable.
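The three preparation steps can be sketched as small helper functions. This is one possible implementation under stated assumptions: missing values are represented as None, percentiles use linear interpolation between ranks, and the first category in sorted order is taken as the reference level for dummy coding. The sample values are hypothetical.

```python
import statistics

def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

def percentile(sorted_values, p):
    """Percentile (p in 0..100) by linear interpolation between closest ranks."""
    k = (len(sorted_values) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (k - lo)

def cap_outliers(values, lower_p=1, upper_p=99):
    """Cap values at the lower and upper percentiles (P1 and P99 by default)."""
    ordered = sorted(values)
    low, high = percentile(ordered, lower_p), percentile(ordered, upper_p)
    return [min(max(v, low), high) for v in values]

def dummy_code(values):
    """Create n-1 binary columns for n categories; the first sorted category is the reference."""
    categories = sorted(set(values))
    return {f"is_{c}": [1 if v == c else 0 for v in values] for c in categories[1:]}

# Hypothetical sample values
amounts = impute_median([10, None, 30, 20, 1000])
capped = cap_outliers(amounts)
dummies = dummy_code(["web", "branch", "phone", "web"])
print(amounts, capped, dummies)
```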
The next step is to deal with multicollinearity. To do so, we create clusters of redundant variables based on the variance inflation factor (VIF) or the correlation matrix of the independent variables.
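A simplified sketch of the correlation-based route is shown below. It uses a greedy grouping rule (a variable joins the first cluster whose first member it correlates with at or above a threshold), which is an assumption for illustration and not the only way to form clusters; the 0.8 threshold and the variable names are also hypothetical. It assumes no variable is constant, since the correlation would then be undefined.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_clusters(columns, threshold=0.8):
    """Greedily group variables whose absolute correlation with a cluster's first member is high."""
    clusters = []
    for name, values in columns.items():
        for cluster in clusters:
            lead = cluster[0]
            if abs(pearson(columns[lead], values)) >= threshold:
                cluster.append(name)
                break
        else:
            # No existing cluster matched, so this variable starts a new one
            clusters.append([name])
    return clusters

# Hypothetical independent variables
columns = {
    "claim_amount": [1, 2, 3, 4, 5],
    "claim_amount_usd": [2, 4, 6, 8, 10.5],
    "customer_age": [40, 22, 35, 51, 29],
}
clusters = correlated_clusters(columns)
print(clusters)
```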
Once the above steps are complete, we calculate the importance of the variables using a random forest. It is also possible to calculate importance with classical variable selection techniques such as weight of evidence (WOE) and information value (IV). We interpret IV values using the following criteria:
Less than 0.02: unpredictive
0.02 to 0.1: weak
0.1 to 0.3: medium
0.3 +: strong
Above 0.5: needs to be investigated
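The IV bands above can be applied with a short calculation. The sketch below assumes the commonly used definitions WOE = ln(% non-fraud / % fraud) per category and IV = sum of ((% non-fraud - % fraud) × WOE); the text does not spell these out, so treat the sign convention as an assumption. Categories with no fraud or no non-fraud cases are skipped because WOE is undefined there. The data is hypothetical.

```python
import math
from collections import Counter

def woe_iv(categories, labels):
    """Return per-category WOE and the total IV for a binned or categorical variable."""
    fraud = Counter(c for c, y in zip(categories, labels) if y == 1)
    clean = Counter(c for c, y in zip(categories, labels) if y == 0)
    total_fraud, total_clean = sum(fraud.values()), sum(clean.values())
    woe, iv = {}, 0.0
    for c in set(categories):
        if fraud[c] == 0 or clean[c] == 0:
            continue  # WOE is undefined for empty cells; combine such categories first
        pct_fraud = fraud[c] / total_fraud
        pct_clean = clean[c] / total_clean
        woe[c] = math.log(pct_clean / pct_fraud)
        iv += (pct_clean - pct_fraud) * woe[c]
    return woe, iv

def iv_band(iv):
    """Map an IV value to the bands listed in the text."""
    if iv < 0.02:
        return "unpredictive"
    if iv < 0.1:
        return "weak"
    if iv < 0.3:
        return "medium"
    if iv <= 0.5:
        return "strong"
    return "needs investigation"

# Hypothetical variable: 50 web cases (10 fraud) and 50 branch cases (2 fraud)
categories = ["web"] * 50 + ["branch"] * 50
labels = [1] * 10 + [0] * 40 + [1] * 2 + [0] * 48
woe, iv = woe_iv(categories, labels)
print(woe, round(iv, 3), iv_band(iv))
```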
Once variable selection is complete, we run different techniques such as logistic regression, random forest, neural network, decision tree and gradient boosting. We select the method with the highest accuracy and area under the ROC curve (AUC), since it best differentiates between fraud and non-fraud. We try to minimize both false positives and false negatives; it is critical to keep both low.
Additionally, we run the model on the validation and out-of-time samples to check that the coefficients keep the same signs. If they do not, we rebuild the model until the coefficients have the same sign in both samples.
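The sign check can be automated with a small helper. The sketch assumes the coefficients from each fit are available as dictionaries keyed by variable name; the names and values below are hypothetical.

```python
def sign_mismatches(dev_coefs, other_coefs):
    """Names of variables whose coefficient sign differs between two fits."""
    return sorted(
        name
        for name in dev_coefs
        if name in other_coefs and dev_coefs[name] * other_coefs[name] < 0
    )

# Hypothetical coefficients from the development and out-of-time samples
dev = {"claims_last_year": 0.8, "address_changed": 1.2, "tenure_years": -0.3}
out_of_time = {"claims_last_year": 0.6, "address_changed": -0.1, "tenure_years": -0.4}
mismatches = sign_mismatches(dev, out_of_time)
print(mismatches)
```

An empty list means the signs agree; any variable listed is a candidate for removal or re-binning before the model is rebuilt.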
Next, we validate model performance on the validation dataset (a mix of fraud and non-fraud). You can use ROC, accuracy (ACC) and the Gini coefficient to judge model performance. The false negative and false positive rates should each be less than 10% for an effective model.
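These validation measures can be computed directly from the actual labels and the predicted fraud scores. The sketch below computes the area under the ROC curve as the share of fraud/non-fraud pairs that the score ranks correctly, derives the Gini coefficient as 2 × AUC - 1, and reports false positive and false negative rates at a cut-off. The labels, scores and the 0.5 cut-off are hypothetical.

```python
def auc(labels, scores):
    """Probability that a random fraud case scores higher than a random non-fraud case (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gini(labels, scores):
    """Gini coefficient derived from the AUC."""
    return 2 * auc(labels, scores) - 1

def error_rates(labels, scores, cutoff):
    """False positive rate and false negative rate when scores >= cutoff are flagged as fraud."""
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < cutoff)
    return fp / labels.count(0), fn / labels.count(1)

# Hypothetical validation labels and predicted fraud scores
labels = [0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.4, 0.35]
model_auc = auc(labels, scores)
model_gini = gini(labels, scores)
fpr, fnr = error_rates(labels, scores, cutoff=0.5)
print(model_auc, model_gini, fpr, fnr)
```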
Part B- Monitoring the score card:
We check the system stability and characteristic stability on month by month data and compare it with development sample.
System stability/population stability: a report comparing recent applicants (actual) with the expected distribution from the development sample:
index = sum of ((% Actual - % Expected) × ln(% Actual / % Expected))
An index of less than 0.10 shows no significant change, 0.10–0.25 denotes a small change that needs to be investigated, and an index greater than 0.25 points to a significant shift in the applicant population.
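The stability index and its interpretation can be sketched as follows. The function takes the expected and actual shares per score band as fractions and assumes every band has a non-zero share in both samples, since the logarithm is undefined otherwise. The band shares below are hypothetical.

```python
import math

def stability_index(expected_pct, actual_pct):
    """Sum over score bands of (actual - expected) * ln(actual / expected)."""
    return sum((a - e) * math.log(a / e) for e, a in zip(expected_pct, actual_pct))

def interpret_stability(index):
    """Apply the thresholds given in the text."""
    if index < 0.10:
        return "no significant change"
    if index <= 0.25:
        return "small change, investigate"
    return "significant shift"

# Hypothetical share of applicants per score band
expected = [0.10, 0.20, 0.40, 0.20, 0.10]  # development sample
actual = [0.08, 0.18, 0.38, 0.22, 0.14]    # recent month
psi = stability_index(expected, actual)
print(round(psi, 4), interpret_stability(psi))
```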
Characteristic Analysis Report:
“Expected %” and “Actual %” again refer to the distributions of the development and recent samples, respectively. The index here is calculated simply by:
index = sum of ((% Actual - % Expected) × Points)
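For a single characteristic, the calculation looks like the sketch below. Because the index is the difference between the actual and expected average points, its value can be read as the shift in average score caused by that characteristic. The attribute shares and points are hypothetical.

```python
def characteristic_index(expected_pct, actual_pct, points):
    """Sum over attributes of (actual - expected) * points."""
    return sum((a - e) * p for e, a, p in zip(expected_pct, actual_pct, points))

# Hypothetical characteristic with three attributes (for example, three age bands)
expected = [0.2, 0.5, 0.3]  # development sample
actual = [0.3, 0.5, 0.2]    # recent sample
points = [10, 20, 35]       # scorecard points per attribute
shift = characteristic_index(expected, actual, points)
print(round(shift, 2))
```

Here the recent sample scores about 2.5 points lower on this characteristic than the development sample did.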
Predicting the fraud and non-fraud
Once a scorecard is built, it can be used to predict fraud and non-fraud cases. If a technique such as logistic regression was used to build the final scorecard, it makes sense to set a cut-off on the fraud score to differentiate between fraud and non-fraud cases. To set the cut-off, find the fraud score at which you get the accepted fraud rate across different samples of data. Once the cut-off is set, it is used to predict fraud, and the fraud rate is monitored monthly to ensure that there is no significant difference between the predicted and actual fraud rates. The cut-off can be translated into a simple rule for flagging fraud.
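One possible reading of the cut-off search is sketched below: cases scoring below the cut-off are passed as non-fraud, and we pick the highest candidate cut-off at which the fraud rate among passed cases stays within the accepted fraud rate. This selection rule, the candidate list and the data are assumptions for illustration; in practice the check would be repeated on each sample.

```python
def fraud_rate_below(labels, scores, cutoff):
    """Fraud rate among cases that score below the cut-off (treated as non-fraud)."""
    passed = [y for y, s in zip(labels, scores) if s < cutoff]
    return sum(passed) / len(passed) if passed else 0.0

def choose_cutoff(labels, scores, accepted_fraud_rate, candidates):
    """Highest candidate cut-off whose passed cases stay within the accepted fraud rate."""
    ok = [c for c in candidates if fraud_rate_below(labels, scores, c) <= accepted_fraud_rate]
    return max(ok) if ok else None

# Hypothetical labelled sample with predicted fraud scores
labels = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.60, 0.75, 0.90]
candidates = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
cutoff = choose_cutoff(labels, scores, accepted_fraud_rate=0.05, candidates=candidates)
print(cutoff)
```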
On the other hand, if a decision tree or gradient boosting is used, fraud rules are derived from the model. If a technique such as random forest or neural network is used to build the scorecard, the prediction has to be run in software such as R or Python.