Credit Scorecard Modeling with Missing Values- MATLAB & Simulink Example (2024)

Open Live Script

This example shows how to handle missing values when you work with creditscorecard objects. First, the example shows how to use the creditscorecard functionality to create an explicit bin for missing data with corresponding points. Then, this example describes four different ways to "treat" the missing data to get a final credit scorecard with no explicit bins for missing values.

Develop a Credit Scorecard with Explicit Bins for Missing Values

When you create a creditscorecard object, the data can contain missing values. When using creditscorecard to create a creditscorecard object, you can set the name-value pair argument for 'BinMissingData' set to true. In this case, the missing data for numeric predictors (NaN values) and for categorical predictors (<undefined> values) is binned in a separate bin labeled <missing> that appears at the end of the bins. Predictors with no missing values in the training data have no <missing> bin. If you do not specify the 'BinMissingData' argument or if you set 'BinMissingData' to false, the creditscorecard function discards missing observations when computing frequencies of Good and Bad, and neither the bininfo nor plotbins functions reports such observations.

The <missing> bin remains in place throughout the scorecard modeling process. The final scorecard explicitly indicates the points to be assigned to missing values for predictors that have a <missing> bin. These points are determined from the weight-of-evidence (WOE) value of the <missing> bin and the predictor's coefficient in the logistic model. For predictors without an explicit <missing> bin, you can assign points to missing values using the name-value pair argument 'Missing' in formatpoints, as described in this example, or by using one of the four different ways to "treat" the missing data.

The dataMissing table in the CreditCardData.mat file has two predictors with missing values — CustAge and ResStatus.

load CreditCardData.mathead(dataMissing,5)

 CustID CustAge TmAtAddress ResStatus EmpStatus CustIncome TmWBank OtherCC AMBalance UtilRate status ______ _______ ___________ ___________ _________ __________ _______ _______ _________ ________ ______ 1 53 62 <undefined> Unknown 50000 55 Yes 1055.9 0.22 0 2 61 22 Home Owner Employed 52000 25 Yes 1161.6 0.24 0 3 47 30 Tenant Employed 37000 61 No 877.23 0.29 0 4 NaN 75 Home Owner Employed 53000 20 Yes 157.37 0.08 0 5 68 56 Home Owner Employed 53000 14 Yes 561.84 0.11 0

Create a creditscorecard object using the CreditCardData.mat file to load the dataMissing table with missing values. Set the 'BinMissingData' argument to true. Apply automatic binning.

sc = creditscorecard(dataMissing,'IDVar','CustID','BinMissingData',true);sc = autobinning(sc);

The bin information and bin plots for the predictors that have missing data both show a <missing> bin at the end.

bi = bininfo(sc,'CustAge');disp(bi)

 Bin Good Bad Odds WOE InfoValue _____________ ____ ___ ______ ________ __________ {'[-Inf,33)'} 69 52 1.3269 -0.42156 0.018993 {'[33,37)' } 63 45 1.4 -0.36795 0.012839 {'[37,40)' } 72 47 1.5319 -0.2779 0.0079824 {'[40,46)' } 172 89 1.9326 -0.04556 0.0004549 {'[46,48)' } 59 25 2.36 0.15424 0.0016199 {'[48,51)' } 99 41 2.4146 0.17713 0.0035449 {'[51,58)' } 157 62 2.5323 0.22469 0.0088407 {'[58,Inf]' } 93 25 3.72 0.60931 0.032198 {'<missing>'} 19 11 1.7273 -0.15787 0.00063885 {'Totals' } 803 397 2.0227 NaN 0.087112

plotbins(sc,'CustAge')

bi = bininfo(sc,'ResStatus');disp(bi)

 Bin Good Bad Odds WOE InfoValue ______________ ____ ___ ______ _________ __________ {'Tenant' } 296 161 1.8385 -0.095463 0.0035249 {'Home Owner'} 352 171 2.0585 0.017549 0.00013382 {'Other' } 128 52 2.4615 0.19637 0.0055808 {'<missing>' } 27 13 2.0769 0.026469 2.3248e-05 {'Totals' } 803 397 2.0227 NaN 0.0092627

plotbins(sc,'ResStatus')

The training data for the 'CustAge' and 'ResStatus' predictors has missing data (NaNs and <undefined>). The binning process estimates WOE values of -0.15787 and 0.026469, respectively, for the missing data in these predictors.

The training data for EmpStatus and CustIncome has no explicit bin for <missing> values because there are no missing values for these predictors.

bi = bininfo(sc,'EmpStatus');disp(bi)

 Bin Good Bad Odds WOE InfoValue ____________ ____ ___ ______ ________ _________ {'Unknown' } 396 239 1.6569 -0.19947 0.021715 {'Employed'} 407 158 2.5759 0.2418 0.026323 {'Totals' } 803 397 2.0227 NaN 0.048038

bi = bininfo(sc,'CustIncome');disp(bi)

 Bin Good Bad Odds WOE InfoValue _________________ ____ ___ _______ _________ __________ {'[-Inf,29000)' } 53 58 0.91379 -0.79457 0.06364 {'[29000,33000)'} 74 49 1.5102 -0.29217 0.0091366 {'[33000,35000)'} 68 36 1.8889 -0.06843 0.00041042 {'[35000,40000)'} 193 98 1.9694 -0.026696 0.00017359 {'[40000,42000)'} 68 34 2 -0.011271 1.0819e-05 {'[42000,47000)'} 164 66 2.4848 0.20579 0.0078175 {'[47000,Inf]' } 183 56 3.2679 0.47972 0.041657 {'Totals' } 803 397 2.0227 NaN 0.12285

Use fitmodel to fit a logistic regression model using WOE values. fitmodel internally transforms all the predictor variables into WOE values, using the bins found during the automatic binning process. By default, fitmodel then fits a logistic regression model using a stepwise method. For predictors that have missing data, there is an explicit <missing> bin with a corresponding WOE value computed from the data. When you use fitmodel, the corresponding WOE value for the <missing> bin is applied when the function performs the WOE transformation.

[sc,mdl] = fitmodel(sc,'display','off');

Scale the scorecard points by the points-to-double-the-odds (PDO) method using the 'PointsOddsAndPDO' argument of formatpoints. Suppose that you want a score of 500 points to have odds of 2 (twice as likely to be good than to be bad) and that the odds double every 50 points (so that 550 points would have odds of 4).

Display the scorecard showing the scaled points for predictors retained in the fitting model.

sc = formatpoints(sc,'PointsOddsAndPDO',[500 2 50]);PointsInfo = displaypoints(sc)

PointsInfo=38×3 table Predictors Bin Points _____________ ______________ ______ {'CustAge' } {'[-Inf,33)' } 54.062 {'CustAge' } {'[33,37)' } 56.282 {'CustAge' } {'[37,40)' } 60.012 {'CustAge' } {'[40,46)' } 69.636 {'CustAge' } {'[46,48)' } 77.912 {'CustAge' } {'[48,51)' } 78.86 {'CustAge' } {'[51,58)' } 80.83 {'CustAge' } {'[58,Inf]' } 96.76 {'CustAge' } {'<missing>' } 64.984 {'ResStatus'} {'Tenant' } 62.138 {'ResStatus'} {'Home Owner'} 73.248 {'ResStatus'} {'Other' } 90.828 {'ResStatus'} {'<missing>' } 74.125 {'EmpStatus'} {'Unknown' } 58.807 {'EmpStatus'} {'Employed' } 86.937 {'EmpStatus'} {'<missing>' } NaN ⋮

Notice that points for the <missing> bins for CustAge and ResStatus are explicitly shown (as 64.9836 and 74.1250, respectively). These points are computed from the WOE value for the <missing> bin and the logistic model coefficients.

Points for predictors that have no missing data in the training set, by default, are set to NaN and they lead to a score of NaN when you run score. This can be changed by updating the name-value pair argument 'Missing' in formatpoints to indicate how to treat missing data for scoring purposes.

The scorecard is ready for scoring new data sets. You can also use the scorecard to compute probabilities of default or perform model validation. For details, see score, probdefault, and validatemodel. To further explore the handling of missing data, take a few rows from the original data as test data and introduce some missing data.

tdata = dataMissing(11:14,mdl.PredictorNames); % Keep only the predictors retained in the model% Set some missing valuestdata.CustAge(1) = NaN;tdata.ResStatus(2) = '<undefined>';tdata.EmpStatus(3) = '<undefined>';tdata.CustIncome(4) = NaN;disp(tdata)

 CustAge ResStatus EmpStatus CustIncome TmWBank OtherCC AMBalance _______ ___________ ___________ __________ _______ _______ _________ NaN Tenant Unknown 34000 44 Yes 119.8 48 <undefined> Unknown 44000 14 Yes 403.62 65 Home Owner <undefined> 48000 6 No 111.88 44 Other Unknown NaN 35 No 436.41

Score the new data and see how points for missing data are differently assigned for CustAge and ResStatus and for EmpStatus and CustIncome. CustAge and ResStatus have an explicit <missing> bin for missing data. However, for EmpStatus and CustIncome, the score function sets the points to NaN.

[Scores,Points] = score(sc,tdata);disp(Scores)

 481.2231 520.8353 NaN NaN

disp(Points)

 CustAge ResStatus EmpStatus CustIncome TmWBank OtherCC AMBalance _______ _________ _________ __________ _______ _______ _________ 64.984 62.138 58.807 67.893 61.858 75.622 89.922 78.86 74.125 58.807 82.439 61.061 75.622 89.922 96.76 73.248 NaN 96.969 51.132 50.914 89.922 69.636 90.828 58.807 NaN 61.858 50.914 89.922

Use the name-value pair argument 'Missing' in formatpoints to choose how to assign points to missing values for predictors that do not have an explicit <missing> bin. For this example, use the 'MinPoints' option for the 'Missing' argument. For EmpStatus and CustIncome, the minimum numbers of points in the scorecard are 58.8072 and 29.3753, respectively. You can also treat missing values using one of the four different ways to "treat" the missing data.

sc = formatpoints(sc,'Missing','MinPoints');[Scores,Points] = score(sc,tdata);disp(Scores)

 481.2231 520.8353 517.7532 451.3405

disp(Points)

 CustAge ResStatus EmpStatus CustIncome TmWBank OtherCC AMBalance _______ _________ _________ __________ _______ _______ _________ 64.984 62.138 58.807 67.893 61.858 75.622 89.922 78.86 74.125 58.807 82.439 61.061 75.622 89.922 96.76 73.248 58.807 96.969 51.132 50.914 89.922 69.636 90.828 58.807 29.375 61.858 50.914 89.922

Four Approaches for Treating Missing Data and Developing a New Credit Scorecard

There are four different approaches for treating missing data.

Approach 1: Fill missing data using the fillmissing function of the creditscorecard object

The creditscorecard object supports a fillmissing function. When you call the function on a predictor or group of predictors, the fillmissing function fills the missing data with the user-specified statistic. fillmissing supports the fill values 'mean', 'median', 'mode', and 'constant', as well as the option to switch back to the original data.

The advantage of using fillmissing is that the creditscorecard object keeps track of the fill value and also applies it to the validation data. The limitation of this approach is that only basic statistics are used to fill missing data.

For more information on Approach 1, see fillmissing.

Approach 2: Fill missing data using the MATLAB® fillmissing function

MATLAB® supports a fillmissing function that you can use before creating a creditscorecard object to treat missing values in numeric and categorical data. The advantage of this method is that you can use all the options available in fillmissing to fill missing data, as well as other MATLAB functionality, such as standardizeMissing and features for the treatment of outliers. However, the downside is that you are responsible for the same transformations to the validation data before scoring as the fillmissing function is outside of the creditscorecard object.

For more information on Approach 2, see Treat Missing Data in a Credit Scorecard Workflow Using MATLAB fillmissing.

Approach 3: Impute missing data using the k-nearest neighbors (KNN) algorithm

This KNN approach considers multiple predictors as compared to Approach 1 and Approach 2. Like Approach 2, the KNN approach is done outside the creditscoreacrd workflow, and consequently, you need to perform imputation for both the training and validation data.

For more information on Approach 3, see Impute Missing Data in the Credit Scorecard Workflow Using the k-Nearest Neighbors Algorithm.

Approach 4: Impute missing data using the random forest algorithm

This random forest approach is similar to Approach 3 and uses multiple predictors to impute missing values. Because the approach is outside the creditscorecard workflow, you need to perform imputation for both the training and validation data.

For more information on Approach 4, see Impute Missing Data in the Credit Scorecard Workflow Using the Random Forest Algorithm.

Credit Scorecard Modeling with Missing Values - MATLAB & Simulink Example (2024)

Develop a Credit Scorecard with Explicit Bins for Missing Values

Four Approaches for Treating Missing Data and Developing a New Credit Scorecard

See Also

Related Topics