Financial Data Analysis - Data Processing 1: Loan Eligibility Prediction


By Sabber Ahamed, Computational Geophysicist and Machine Learning Enthusiast


Introduction


Financial institutions and companies have been using predictive analytics for quite a long time. More recently, the availability of computational resources and tremendous research in machine learning have made it possible to analyze data better and hence make better predictions. In this series of articles, I explain how to create a predictive loan model that identifies bad applicants who are more likely to be charged off. In a step-by-step process, I show how to process raw data, clean the unnecessary parts of it, select relevant features, perform exploratory data analysis, and finally build a model.

As an example, I use the Lending Club loan dataset. Lending Club is the world’s largest online marketplace connecting borrowers and investors. An inevitable outcome of lending is default by borrowers. The idea of this tutorial is to create a predictive model that identifies applicants who are relatively risky for a loan. To accomplish this, I have organized the whole series into four parts as follows:

  • Data processing-1: In this first part, I show how to clean the data and remove unnecessary features. Data processing is very time-consuming, but better data produces a better model. Therefore, a careful and very detailed examination is required to prepare better data. I show how to identify constant features, duplicate features, duplicate rows, and features with a high number of missing values.
  • Data processing-2: In this part, I manually go through each feature selected in part 1. This is the most time-consuming part, but worth it for a better model.
  • EDA: In this part, I do some exploratory data analysis (EDA) on the features selected in parts 1 and 2. A good EDA is required to get better knowledge of the domain. We need to spend some quality time finding out the relationships between the features.
  • Create a model: Finally, in this last but not least part, I create models. Creating a model is not an easy task either; it is an iterative process. I show how to start with a simple model and then slowly add complexity for better performance.

Alright, let’s get started with part 1: data processing, cleaning, and feature selection.

Data processing-1

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

In this project, I used three years of data (2014, 2015, and the first three quarters of 2017), stored in five separate CSV files. Let’s read the files first:

df1 = pd.read_csv('./data/2017Q1.csv', skiprows=[0])
df2 = pd.read_csv('./data/2017Q2.csv', skiprows=[0])
df3 = pd.read_csv('./data/2017Q3.csv', skiprows=[0])
df4 = pd.read_csv('./data/2014.csv', skiprows=[0])
df5 = pd.read_csv('./data/2015.csv', skiprows=[0])

Since data are stored in separate files, we have to make sure that we have the same number of features in each file. We can check using the following code snippet:

columns = np.dstack((list(df1.columns), list(df2.columns), list(df3.columns), list(df4.columns), list(df5.columns)))
coldf = pd.DataFrame(columns[0])

The above code is self-explanatory: we first extract the column names, then stack them together using NumPy’s ‘dstack’. If you look at the Jupyter notebook on GitHub, you will see that the column names are the same in every file, which is good for us, so we can move on to the next step. It’s time to check the shape of the data:

df = pd.concat([df1, df2, df3, df4, df5])
df.shape
(981665, 151)

We see that there are approximately one million examples, and each example has 151 features, including the target variable. Let’s look at the feature names to get familiar with the data. It’s imperative to get to know the domain, especially the details of the features’ relationships with the target variable. This is not easy to learn overnight, which is why we need to spend some days, or maybe a week, getting familiar with the data before jumping into further detailed analysis. Let’s see the feature names:

print(list(df.columns))
['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'fico_range_low', 'fico_range_high', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d', 'last_fico_range_high', 'last_fico_range_low', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'acc_now_delinq', 'tot_coll_amt', 'tot_cur_bal', 'open_acc_6m', 'open_act_il', 'open_il_12m', 'open_il_24m', 'mths_since_rcnt_il', 'total_bal_il', 'il_util', 'open_rv_12m', 'open_rv_24m', 'max_bal_bc', 'all_util', 'total_rev_hi_lim', 'inq_fi', 'total_cu_tl', 'inq_last_12m', 'acc_open_past_24mths', 'avg_cur_bal', 'bc_open_to_buy', 'bc_util', 'chargeoff_within_12_mths', 'delinq_amnt', 'mo_sin_old_il_acct', 'mo_sin_old_rev_tl_op', 'mo_sin_rcnt_rev_tl_op', 'mo_sin_rcnt_tl', 'mort_acc', 'mths_since_recent_bc', 'mths_since_recent_bc_dlq', 'mths_since_recent_inq', 'mths_since_recent_revol_delinq', 'num_accts_ever_120_pd', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_bc_sats', 'num_bc_tl', 'num_il_tl', 'num_op_rev_tl', 'num_rev_accts', 'num_rev_tl_bal_gt_0', 'num_sats', 'num_tl_120dpd_2m', 'num_tl_30dpd', 'num_tl_90g_dpd_24m', 'num_tl_op_past_12m', 'pct_tl_nvr_dlq', 'percent_bc_gt_75', 'pub_rec_bankruptcies', 'tax_liens', 'tot_hi_cred_lim', 'total_bal_ex_mort', 'total_bc_limit', 'total_il_high_credit_limit', 'revol_bal_joint', 'sec_app_fico_range_low', 'sec_app_fico_range_high', 'sec_app_earliest_cr_line', 'sec_app_inq_last_6mths', 'sec_app_mort_acc', 'sec_app_open_acc', 'sec_app_revol_util', 'sec_app_open_act_il', 'sec_app_num_rev_accts', 'sec_app_chargeoff_within_12_mths', 'sec_app_collections_12_mths_ex_med', 'sec_app_mths_since_last_major_derog', 'hardship_flag', 'hardship_type', 'hardship_reason', 'hardship_status', 'deferral_term', 'hardship_amount', 'hardship_start_date', 'hardship_end_date', 'payment_plan_start_date', 'hardship_length', 'hardship_dpd', 'hardship_loan_status', 'orig_projected_additional_accrued_interest', 'hardship_payoff_balance_amount', 'hardship_last_payment_amount', 'disbursement_method', 'debt_settlement_flag', 'debt_settlement_flag_date', 'settlement_status', 'settlement_date', 'settlement_amount', 'settlement_percentage', 'settlement_term']

Looking at the above features, they may seem scary at first, but we will go through every feature and then select the relevant ones. Let's start with the target feature, “loan_status”:

df.loan_status.value_counts()
Current               500937
Fully Paid            358629
Charged Off            99099
Late (31-120 days)     13203
In Grace Period         6337
Late (16-30 days)       3414
Default                   36
Name: loan_status, dtype: int64

We see that there are seven types of loan status. However, in this tutorial we are interested in two classes: 1) Fully Paid: those who paid back the loan with interest, and 2) Charged Off: those who could not pay and were finally charged off. Therefore, we select the rows for these two classes:

df = df.loc[(df['loan_status'].isin(['Fully Paid', 'Charged Off']))]
df.shape

(457728, 151)

Looking at the shape, we see that we now have roughly half as many data points as the original data, and the same number of features. Before processing and cleaning manually, let’s do some general data processing steps first:

  • Remove features associated with >85% missing values
  • Remove constant features
  • Remove duplicate features
  • Remove duplicate rows
  • Remove highly collinear features (In part 3 EDA)

Alright, let’s get started with the typical data processing:

1. Remove features associated with 90% missing values: In the code below, I first use pandas’ built-in method ‘isnull()’ to flag the entries with missing values. Then I sum them up to get the count for each feature. Finally, I sort the features according to the number of missing values and create a data frame for further analysis.
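The original snippet is not reproduced in this copy of the article; a minimal sketch of the step just described, using hypothetical variable names such as missing_counts and missing_df, could look like this:

# Count missing values per feature, sort in descending order, and store in a data frame
missing_counts = df.isnull().sum().sort_values(ascending=False)
missing_df = pd.DataFrame({'feature': missing_counts.index,
                           'num_missing': missing_counts.values})
missing_df.head(60)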

In the above result, we see that there are 53 features which have 400,000 missing values. I use pandas’ drop method to remove these 53 features. Notice that I set the “inplace” option to True, which removes these features from the original data frame df without returning anything.
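A sketch of that drop step, again with the hypothetical missing_df from above and assuming the 400,000-count threshold mentioned in the text:

# Drop the features whose missing-value count exceeds the threshold, in place
features_to_drop = missing_df.loc[missing_df['num_missing'] >= 400000, 'feature'].tolist()
df.drop(features_to_drop, axis=1, inplace=True)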

2. Remove constant features: At this step, we remove features that have a single unique value. A feature with only one unique value does not help the model generalize, since its variance is zero. A tree-based model cannot take advantage of this type of feature, since it cannot split on it. Identifying features with a single unique value is relatively straightforward:
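The original snippet is not included in this copy; a minimal reconstruction of the function described below might be:

def find_constant_features(dataframe):
    # Collect the columns that contain fewer than two unique values
    constant_features = []
    for column in dataframe.columns:
        if dataframe[column].nunique() < 2:
            constant_features.append(column)
    return constant_features

const_features = find_constant_features(df)
print(const_features)  # the article reports five such features
df.drop(const_features, axis=1, inplace=True)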

In the above code, I create a function “find_constant_features” to identify constant features. The function goes through each feature and checks whether it has fewer than two unique values. If so, the feature is added to the constant-feature list. We can also find constant features by looking at the variance or standard deviation: if a feature has zero variance or standard deviation, we are sure that it has a single unique value. The print statement shows that five features have a single unique value, so we remove them with the “inplace” option set to True.

3. Remove duplicate features: Duplicate features are those that have the same values in multiple columns, with the same or a different name. To find the duplicate features, I borrowed the following code from this Stack Overflow link:
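The borrowed snippet itself is missing from this copy; a pairwise comparison in the spirit of that approach could look like the sketch below (exact NaN handling may differ from the original, since np.array_equal treats NaN values as unequal):

def find_duplicate_features(dataframe):
    # Compare every pair of columns and collect those whose values are identical
    duplicates = []
    columns = dataframe.columns
    for i in range(len(columns)):
        col_i = dataframe[columns[i]].values
        for j in range(i + 1, len(columns)):
            if np.array_equal(col_i, dataframe[columns[j]].values):
                duplicates.append(columns[j])
    return duplicates

print(find_duplicate_features(df))  # the article finds a single duplicated feature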

We see only one feature that seems to be duplicated. I am not going to remove the feature yet; rather, I will wait until we do EDA in the next part.

4. Remove duplicate rows: In this step, we remove all the duplicate rows. I use pandas’ built-in “drop_duplicates(inplace=True)” method to perform this action:

df.drop_duplicates(inplace= True)

The above four processing steps are basic ones that we need to do for any data science project. Let's see the shape of the data after all of these steps:

df.shape

(457728, 93)

We see that we have 93 features after performing the above steps.

In the next part of this tutorial, I will go through each feature, perform cleaning, and remove features if necessary. In the meantime, if you have any questions regarding this part, please feel free to write your comment below. You can reach out to me:


Bio: Sabber Ahamed is the Founder of xoolooloo.com. Computational Geophysicist and Machine Learning Enthusiast.

Original. Reposted with permission.

Related:

  • Text Mining on the Command Line
  • Three techniques to improve machine learning model performance with imbalanced datasets
