Find stocks worth buying with Machine Learning

Is it possible to have a machine learning model learn the differences between stocks that perform well and those that don’t, and then leverage this knowledge to predict which stocks are worth buying? Moreover, is it possible to achieve this simply by looking at the financial indicators found in 10-K filings?

The algorithmic trading space is buzzing with new strategies. Companies have already spent billions (and keep investing) in infrastructure and R&D to jump ahead of the competition and beat the market. Still, it is well acknowledged that the buy & hold strategy outperforms many algorithmic strategies, especially in the long run. However, finding value in stocks is an art that very few have mastered. Can an algorithm be trained to do it?

This article contains all the steps required to build a dataset of financial data for a large number of stocks and to analyze it with different machine learning models. You will see throughout the article that this approach does not leverage historical price data, but rather the financial indicators found in the 10-K filings that each publicly traded company releases yearly. In particular, for the sake of this article, we will use the financial data from 2018 to make some fictitious predictions regarding the performance of stocks during 2019 (meaning from the first trading day of Jan ’19 to the last trading day of Dec ’19).

Before jumping into the code, it must be crystal clear that this article does not present an actual trading strategy that you can use:

  1. I am not a financial advisor and you should definitely not take financial advice from the internet.
  2. You can’t implement it to predict the stock market, because setting the labels in the training phase requires future information that you don’t have.

What this article wants to explore is whether ML algorithms can learn to recognise stocks that grew in value when trained on the numbers reported in the financial statements. To check whether this is the case, the predictions made by the ML algorithms in the testing phase (which can be considered fictitious trades) will be compared to the S&P 500 and Dow Jones benchmarks.

Finally, we will not buy and sell the stocks multiple times during the year: we will use a lazy strategy, buying stocks at the beginning of the year (2019 in this case) and then selling them at the end of the year, hopefully for a profit.

With that being said, let’s jump right into the fun part.

1.1 PRELIMINARY IMPORTS

If you are familiar with machine learning in Python, you should already know all the packages and libraries that we will be using. If you are new to the field, don’t be scared: you can find plenty of information and tutorials for each package and library. All packages used are easily retrievable and can be installed with either pip or conda, depending on your Python setup (Python 3.7.5 has been used here).

An internet connection is required in order to scrape the data from the web (we will use the excellent free API https://financialmodelingprep.com/developer/docs/ and the well-known pandas_datareader).

from sys import stdout
import numpy as np
import pandas as pd
from pandas_datareader import data
import json

# Reading data from external sources
import urllib as u
from urllib.request import urlopen

# Machine learning (preprocessing, models, evaluation)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
import xgboost as xgb
from sklearn.metrics import classification_report

# Progress bar
from tqdm import tqdm

We will also need a few helper functions in order to streamline the code:

  1. get_json_data: used to query the financialmodelingprep API and pull the financial data (in json format).
  2. get_price_var: used to compute the price variation during 2019; leverages pandas_datareader and Yahoo Finance.
  3. find_in_json: used to scan a complex json file for a key and return its values.

def get_json_data(url):
    '''
    Scrape data (which must be in json format) from the given url.
    Input: url to financialmodelingprep API
    Output: parsed json data
    '''
    response = urlopen(url)
    dat = response.read().decode('utf-8')
    return json.loads(dat)

def get_price_var(symbol):
    '''
    Get historical price data for a given symbol leveraging pandas_datareader and Yahoo Finance.
    Compute the percent variation between the first and last available time-steps in terms of Adjusted Close price.
    Input: ticker symbol
    Output: price variation [%]
    '''
    # read data
    prices = data.DataReader(symbol, 'yahoo', '2019-01-01', '2019-12-31')['Adj Close']

    # get first and last timestamps
    today = prices.index[-1]
    start = prices.index[0]

    # calculate percentage price variation
    price_var = ((prices[today] - prices[start]) / prices[start]) * 100
    return price_var

def find_in_json(obj, key):
    '''
    Scan a (possibly nested) json object to find all values of the required key.
    Input: json object, required key
    Output: list of values corresponding to the required key
    '''
    # Initialize output as empty
    arr = []

    def extract(obj, arr, key):
        '''
        Recursively search for values of key in the json object.
        '''
        if isinstance(obj, dict):
            for k, v in obj.items():
                if isinstance(v, (dict, list)):
                    extract(v, arr, key)
                elif k == key:
                    arr.append(v)
        elif isinstance(obj, list):
            for item in obj:
                extract(item, arr, key)
        return arr

    results = extract(obj, arr, key)
    return results
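
A quick usage sketch of these helpers (AAPL is just an illustrative ticker; an internet connection and a valid API response are assumed):

# Usage sketch of the helper functions (illustrative; assumes the API responds)
profile = get_json_data('https://financialmodelingprep.com/api/v3/company/profile/AAPL')
print(find_in_json(profile, 'sector'))  # e.g. ['Technology']
print(get_price_var('AAPL'))            # AAPL percent price variation during 2019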

1.2 LIST OF STOCKS

First, we need to get a list of stocks that will be used to build the dataset. Since there are thousands of stocks whose information can be scraped online, I decided to simply pull the whole list of available stocks on Financial Modeling Prep API.

The list comprises a total of more than 7k stocks, which clearly span more than one sector. Indeed, each company belongs to a sector (Technology, Healthcare, Energy, …), which in turn may be characterized by certain seasonalities, macro-economic trends and so on. For now, I decided to focus on the Technology sector: this means that from the complete list of available stocks available_tickers I only keep those whose sector is equal to Technology. This operation is quite straightforward thanks to the power of the pandas library.

So, the list tickers_tech will contain all the available stocks, on Financial Modeling Prep API, belonging to the Technology sector.

url = 'https://financialmodelingprep.com/api/v3/company/stock/list'
ticks_json = get_json_data(url)
available_tickers = find_in_json(ticks_json, 'symbol')

tickers_sector = []
for tick in tqdm(available_tickers):
    url = 'https://financialmodelingprep.com/api/v3/company/profile/' + tick # get sector from here
    a = get_json_data(url)
    tickers_sector.append(find_in_json(a, 'sector'))

S = pd.DataFrame(tickers_sector, index=available_tickers, columns=['Sector'])

# Get list of tickers from TECHNOLOGY sector
tickers_tech = S[S['Sector'] == 'Technology'].index.values.tolist()

1.3 GET PRICE VARIATION THROUGHOUT 2019

The price variation of each stock listed in tickers_tech during 2019 will be used as the metric to distinguish between stocks worth buying and those that are not (because they decrease in value, for reasons we don’t really care about). So, we need to:

  • pull all the daily Adjusted Close prices for each stock and compute the price variation (this is done thanks to the helper function get_price_var)
  • if no data is found, skip the stock
  • limit the number of stocks to be scanned to 1000, for time purposes (when we pull the financial data, the time required is proportional to the number of stocks, so we cap it at a threshold to keep it reasonable; you can drop this check and let the computer do its thing overnight)
  • store the stocks and their 2019 price variations in the dataframe D

pvar_list, tickers_found = [], []
num_tickers_desired = 1000
count = 0
tot = 0
TICKERS = tickers_tech

for ticker in TICKERS:
    tot += 1
    try:
        pvar = get_price_var(ticker)
        pvar_list.append(pvar)
        tickers_found.append(ticker)
        count += 1
    except:
        pass

    stdout.write(f'\rScanned {tot} tickers. Found {count}/{len(TICKERS)} usable tickers (max tickers = {num_tickers_desired}).')
    stdout.flush()

    if count == num_tickers_desired: # stop once the desired number of tickers has been found
        break

# Store everything in a dataframe
D = pd.DataFrame(pvar_list, index=tickers_found, columns=['2019 PRICE VAR [%]'])

For the stocks in D, we now need to find the values of the indicators that will become the input data to the classification models. We leverage once again the FinancialModelingPrep API.

First, we load the indicators.txt file (available in the repository). As explained in the README document, a plethora of financial indicators are scraped. I decided to brute-force all the available indicators from the FinancialModelingPrep API, and to worry about cleaning and preparing the dataset for the models later. The table below summarizes the number of financial indicators available for each category.

[Table: number of available financial indicators per category]

In total, 224 indicators are available. However, since there are some duplicates, the actual number of indicators in indicators.txt is 221 (not counting the date). You can find the indicators.txt file here: https://github.com/CNIC92/beat-the-stock-market/tree/master/All%20tickers%20and%20Indicators
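
The loading step itself is a one-liner; here is a minimal sketch, assuming indicators.txt holds one indicator name per line (as in the linked repository):

# Minimal loading sketch (assumption: indicators.txt has one indicator name per line)
with open('indicators.txt', 'r') as f:
    indicators = [line.strip() for line in f if line.strip()]

print(f'Number of indicators: {len(indicators)}')  # expected: 221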

1.4 SCRAPE FINANCIAL INDICATORS AND BUILD RAW DATASET

So far, we have listed the stocks that belong to the Technology sector, as well as their 2019 price variation. It is time to scrape the financial indicators that will later be used as input features to our classification models.

The scraping will once again be performed thanks to the Financial Modeling Prep API. This process is quite time-consuming since it is required to pull a lot of data iteratively.

Furthermore, it is important to keep in mind that:

  • it is required to pull data within a specific time frame. Since the objective is the classification of a stock according to its price variation during 2019, the financial indicators must belong to the end of 2018.
  • it is possible, albeit uncommon, that a company filed two 10-K documents in the same year. In this case only the most recent entry must be kept.
  • it is possible that the API does not return any data at all for a given stock. In this case the stock must be discarded.
  • not all indicators will return a value. It is fair to expect that a percentage of the indicators will be missing for one reason or another. In this case, np.nan will be assigned to the missing entries, and we will take care of them in the cleaning stage.

In the end, what we want to obtain is a dataframe DATA where the rows correspond to the stocks for which the data has been found (actual_tickers) and the columns correspond to the financial indicators (indicators).

# Initialize lists and the 2D numpy array (filled with 0s) that will hold the indicator values
missing_tickers, missing_index = [], []
d = np.zeros((len(tickers_found), len(indicators)))

for t, _ in enumerate(tqdm(tickers_found)):
    # Scrape indicators from financialmodelingprep API
    url0 = 'https://financialmodelingprep.com/api/v3/financials/income-statement/' + tickers_found[t]
    url1 = 'https://financialmodelingprep.com/api/v3/financials/balance-sheet-statement/' + tickers_found[t]
    url2 = 'https://financialmodelingprep.com/api/v3/financials/cash-flow-statement/' + tickers_found[t]
    url3 = 'https://financialmodelingprep.com/api/v3/financial-ratios/' + tickers_found[t]
    url4 = 'https://financialmodelingprep.com/api/v3/company-key-metrics/' + tickers_found[t]
    url5 = 'https://financialmodelingprep.com/api/v3/financial-statement-growth/' + tickers_found[t]
    a0 = get_json_data(url0)
    a1 = get_json_data(url1)
    a2 = get_json_data(url2)
    a3 = get_json_data(url3)
    a4 = get_json_data(url4)
    a5 = get_json_data(url5)

    # Combine all json files in a list, so that it can be scanned quickly
    A = [a0, a1, a2, a3, a4, a5]
    all_dates = find_in_json(A, 'date')

    check = [s for s in all_dates if '2018' in s] # find all 2018 entries in dates
    if len(check) > 0:
        date_index = all_dates.index(check[0]) # get the most recent 2018 entry, if more are present

        for i, _ in enumerate(indicators):
            ind_list = find_in_json(A, indicators[i])
            try:
                d[t][i] = ind_list[date_index]
            except:
                d[t][i] = np.nan # in case no value was filed for the given indicator

    else:
        missing_tickers.append(tickers_found[t])
        missing_index.append(t)

actual_tickers = [x for x in tickers_found if x not in missing_tickers]
d = np.delete(d, missing_index, 0)

# raw dataset
DATA = pd.DataFrame(d, index=actual_tickers, columns=indicators)

1.5 DATASET CLEANING & PREPARATION

The preparation of the dataset is something of an art form. I limited my actions to the application of common practices, such as:

  • removing columns that have a lot of nan values.
  • removing columns that have a lot of 0 values.
  • filling the remaining nan values with the average value of the column.

For instance, in this specific case there is an average of 84 0-values per column, with a standard deviation of 140. So I decided to remove from the dataframe all those columns where the number of 0-values is higher than 20 (20 being about 3.1% of the total number of rows of the dataset).

At the same time, there is an average of about 37 nan entries per column, with a standard deviation of about 86. So I decided to remove from the dataframe all those columns where the number of nan entries is higher than 15 (15 being about 2.4% of the total number of rows of the dataset). Then, the remaining nan entries have been filled with the average value of the column.
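
For reference, these per-column statistics can be quickly verified (a sanity-check sketch, not part of the original pipeline):

# Sanity check: per-column counts of 0-values and nan entries behind the thresholds above
zero_counts = DATA.isin([0]).sum()
nan_counts = DATA.isna().sum()
print(f'0-values per column: mean {zero_counts.mean():.0f}, std {zero_counts.std():.0f}')
print(f'nan entries per column: mean {nan_counts.mean():.0f}, std {nan_counts.std():.0f}')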

At the end of the cleaning process, the number of columns of DATA has decreased from 221 to 108, a 50% reduction. While certainly some of the discarded indicators were useless due to the lack of data, it is possible that useful data has been lost in this process too. However, it must be considered that we need useful data across all stocks in the dataset, so I think it is acceptable to discard those indicators (columns) that may be relevant only to a small portion of the dataset.

Finally, it is required to classify each sample. For each stock, we have already computed the difference in trading price between the first trading day of January 2019 and the last trading day of December 2019 (dataframe D). If this difference is positive, the stock will belong to class 1, which is a BUY signal. On the contrary, if the difference in price is negative, the stock will be classified as 0, which is an IGNORE signal (do not buy). A quick recap is found in the table below.

[Table: class recap — price variation ≥ 0 → class 1 (BUY); price variation < 0 → class 0 (IGNORE)]

So, this array of 1 and 0 values will be appended as the last column of the dataframe DATA.

# Remove columns that have more than 20 0-values
DATA = DATA.loc[:, DATA.isin([0]).sum() <= 20]

# Remove columns that have more than 15 nan-values
DATA = DATA.loc[:, DATA.isna().sum() <= 15]

# Fill remaining nan-values with column mean value
DATA = DATA.apply(lambda x: x.fillna(x.mean()))

# Get price variation data only for tickers to be used
D2 = D.loc[DATA.index.values, :]

# Generate classification array
y = []
for i, _ in enumerate(D2.index.values):
    if D2.values[i] >= 0:
        y.append(1)
    else:
        y.append(0)

# Add array to dataframe
DATA['class'] = y
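
As a side note, the labeling loop above can be replaced with a vectorized one-liner (an equivalent alternative, not the author's original code):

# Equivalent vectorized labeling (alternative to the loop above)
DATA['class'] = (D2['2019 PRICE VAR [%]'] >= 0).astype(int).values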

This concludes the section 1 of this article. We have built a dataset containing both relevant financial indicators from 2018 and the binary classes which come from the price action of the stocks during 2019. In section 2, we will focus on the implementation of a few machine learning algorithms in order to make predictions about the stocks, and try to beat the market!

2.1 PREPARE THE DATASET

Before having fun running experiments with different machine learning algorithms, we must apply the finishing touches to the dataset, which means we need to:

  1. split the dataset in training and testing datasets;
  2. standardize the dataset so that each indicator has mean 0 and standard deviation equal to 1.

As for the train/test split, 80% of the available data in DATA will be used to train the algorithms, while the remaining 20% will be used to test them. Note the parameter stratify, used to keep the same class ratio between training and testing datasets. From train_split and test_split we extract both the input data X_train, X_test and the target data y_train, y_test. A sanity check is performed afterwards.

# Divide data in train and testing
train_split, test_split = train_test_split(DATA, test_size=0.2, random_state=1, stratify=DATA['class'])
X_train = train_split.iloc[:, :-1].values
y_train = train_split.iloc[:, -1].values
X_test = test_split.iloc[:, :-1].values
y_test = test_split.iloc[:, -1].values

print()
print(f'Number of training samples: {X_train.shape[0]}')
print()
print(f'Number of testing samples: {X_test.shape[0]}')
print()
print(f'Number of features: {X_train.shape[1]}')

The results are:

Number of training samples: 510

Number of testing samples: 128

Number of features: 107

As for the standardization of the data, we leverage the StandardScaler() available from scikit-learn. It is important to use the same coefficients when standardizing both training and testing data: for this reason we first fit the scaler to X_train, and then apply it to both X_train and X_test via the method .transform().

# Standardize input data
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

2.2 SUPPORT VECTOR MACHINE

The first classification algorithm we will run is the support vector machine. A GridSearchCV is performed in order to tune some hyper-parameters (kernel, gamma, C). The required number of cross-validations is set to 5. We want to achieve maximum weighted precision, in order to minimize the number of false positives.

# Parameter grid to be tuned
tuned_parameters = [{'kernel': ['rbf', 'linear'],
                     'gamma': [1e-3, 1e-4],
                     'C': [0.01, 0.1, 1, 10, 100]}]

clf1 = GridSearchCV(SVC(random_state=1),
                    tuned_parameters,
                    n_jobs=6,
                    scoring='precision_weighted',
                    cv=5)
clf1.fit(X_train, y_train)

print('Best score and parameters found on development set:')
print()
print('%0.3f for %r' % (clf1.best_score_, clf1.best_params_))
print()

The results are not too bad: as shown below, we have a weighted precision of 71.3%. If you read carefully, you may have noticed that the scoring parameter is set equal to the metric precision_weighted. This has been done in order to optimize the algorithms with respect to their precision (not to be confused with accuracy!), weighted because we don’t have the same number of samples for both classes (buy-worthy stocks and not buy-worthy stocks). For more info about this and other scoring parameters you can check the docs here: https://scikit-learn.org/stable/modules/model_evaluation.html.

Best score and parameters found on development set: 0.713 for {'C': 0.01, 'gamma': 0.001, 'kernel': 'linear'}
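
To make the metric concrete, here is a toy sketch of what precision_weighted computes (hypothetical labels, not taken from this dataset): the precision of each class is computed separately and then averaged, weighting each class by its number of true samples.

# Toy sketch of the 'precision_weighted' metric (hypothetical labels)
from sklearn.metrics import precision_score

y_true_toy = [0, 1, 1, 1]
y_pred_toy = [0, 1, 0, 1]
# precision(class 0) = 1/2 = 0.50 (support 1), precision(class 1) = 2/2 = 1.00 (support 3)
# weighted average: (1*0.50 + 3*1.00) / 4 = 0.875
print(precision_score(y_true_toy, y_pred_toy, average='weighted'))  # 0.875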

2.3 RANDOM FOREST

The second classification algorithm we will run is the random forest. A GridSearchCV is performed in order to tune some hyper-parameters (n_estimators, max_features, max_depth, criterion). The required number of cross-validations is set to 5. We want to achieve maximum weighted precision, in order to minimize the number of false positives.

# Parameter grid to be tuned
tuned_parameters = {'n_estimators': [32, 256, 512, 1024],
                    'max_features': ['auto', 'sqrt'],
                    'max_depth': [4, 5, 6, 7, 8],
                    'criterion': ['gini', 'entropy']}

clf2 = GridSearchCV(RandomForestClassifier(random_state=1),
                    tuned_parameters,
                    n_jobs=6,
                    scoring='precision_weighted',
                    cv=5)
clf2.fit(X_train, y_train)

print('Best score and parameters found on development set:')
print()
print('%0.3f for %r' % (clf2.best_score_, clf2.best_params_))
print()

We can see, as shown below, that this algorithm outperforms the support vector machine by about one percentage point, reaching a weighted precision of 72.4%.

Best score and parameters found on development set: 0.724 for {'criterion': 'gini', 'max_depth': 5, 'max_features': 'auto', 'n_estimators': 32}

2.4 EXTREME GRADIENT BOOSTING

The third classification algorithm we will run is the extreme gradient boosting. A GridSearchCV is performed in order to tune some hyper-parameters (learning_rate, max_depth, n_estimators). The required number of cross-validations is set to 5. We want to achieve maximum weighted precision, in order to minimize the number of false positives.

# Parameter grid to be tuned
tuned_parameters = {'learning_rate': [0.01, 0.001],
                    'max_depth': [4, 5, 6, 7, 8],
                    'n_estimators': [32, 128, 256]}

clf3 = GridSearchCV(xgb.XGBClassifier(random_state=1),
                    tuned_parameters,
                    n_jobs=6,
                    scoring='precision_weighted',
                    cv=5)
clf3.fit(X_train, y_train)

print('Best score and parameters found on development set:')
print()
print('%0.3f for %r' % (clf3.best_score_, clf3.best_params_))
print()

This algorithm underperforms both the support vector machine and the random forest classifier by a couple of percentage points, with a weighted precision of 69.7%.

Best score and parameters found on development set: 0.697 for {'learning_rate': 0.001, 'max_depth': 4, 'n_estimators': 256}

2.5 MULTI-LAYER PERCEPTRON

The fourth classification algorithm we will run is the multi-layer perceptron (feed-forward neural network). A GridSearchCV is performed in order to tune some hyper-parameters (hidden_layer_sizes, activation, solver). The required number of cross-validations is set to 5. We want to achieve maximum weighted precision, in order to minimize the number of false positives.

# Parameter grid to be tuned
tuned_parameters = {'hidden_layer_sizes': [(32,), (64,), (32, 64, 32)],
                    'activation': ['tanh', 'relu'],
                    'solver': ['lbfgs', 'adam']}

clf4 = GridSearchCV(MLPClassifier(random_state=1, batch_size=4, early_stopping=True),
                    tuned_parameters,
                    n_jobs=6,
                    scoring='precision_weighted',
                    cv=5)
clf4.fit(X_train, y_train)

print('Best score, and parameters, found on development set:')
print()
print('%0.3f for %r' % (clf4.best_score_, clf4.best_params_))
print()

This algorithm is the best yet, since it outperforms all the previously tested algorithms: the MLP yields a weighted precision of 73.0%.

Best score, and parameters, found on development set: 0.730 for {'activation': 'relu', 'hidden_layer_sizes': (32, 64, 32), 'solver': 'adam'}

2.6 EVALUATE THE MODELS

Now that 4 classification algorithms have been trained, we must test them and compare their performance with each other and with what are considered the benchmarks in this field (S&P 500, DOW JONES). Indeed, we don’t limit ourselves to comparing their testing accuracies: we want to understand which algorithm leads to the best return on investment (ROI). To do this, we must first get the 2019 price variations, contained in the dataframe D, of only those stocks that belong to the testing dataset (which we have not used yet!).

# Get 2019 price variations ONLY for the stocks in testing split
pvar_test = D.loc[test_split.index.values, :]

Now, we build a new dataframe df1 in which, for each tested stock, we collect the predicted classes from each model (recall that the two classes are 0 = IGNORE, 1 = BUY).

If the model predicts class 1, we proceed to buy 100 USD worth of that stock; otherwise, we ignore the stock.

# Initial investment can be $100 for each stock whose predicted class = 1
buy_amount = 100
# In new dataframe df1, store all the information regarding each model's predicted class and relative gain/loss in $USD
df1 = pd.DataFrame(y_test, index=test_split.index.values, columns=['ACTUAL']) # first column is the true class (BUY/IGNORE)
df1['SVM'] = clf1.predict(X_test) # predict class for testing dataset
df1['VALUE START SVM [$]'] = df1['SVM'] * buy_amount # if class = 1 --> buy $100 of that stock
df1['VAR SVM [$]'] = (pvar_test['2019 PRICE VAR [%]'].values / 100) * df1['VALUE START SVM [$]'] # compute price variation in $
df1['VALUE END SVM [$]'] = df1['VALUE START SVM [$]'] + df1['VAR SVM [$]'] # compute final value
df1['RF'] = clf2.predict(X_test)
df1['VALUE START RF [$]'] = df1['RF'] * buy_amount
df1['VAR RF [$]'] = (pvar_test['2019 PRICE VAR [%]'].values / 100) * df1['VALUE START RF [$]']
df1['VALUE END RF [$]'] = df1['VALUE START RF [$]'] + df1['VAR RF [$]']
df1['XGB'] = clf3.predict(X_test)
df1['VALUE START XGB [$]'] = df1['XGB'] * buy_amount
df1['VAR XGB [$]'] = (pvar_test['2019 PRICE VAR [%]'].values / 100) * df1['VALUE START XGB [$]']
df1['VALUE END XGB [$]'] = df1['VALUE START XGB [$]'] + df1['VAR XGB [$]']
df1['MLP'] = clf4.predict(X_test)
df1['VALUE START MLP [$]'] = df1['MLP'] * buy_amount
df1['VAR MLP [$]'] = (pvar_test['2019 PRICE VAR [%]'].values / 100) * df1['VALUE START MLP [$]']
df1['VALUE END MLP [$]'] = df1['VALUE START MLP [$]'] + df1['VAR MLP [$]']
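
As a side note, the repeated block above can be written more compactly with a loop over the fitted models (an equivalent alternative, not the author's original code):

# Equivalent compact construction of df1 (same values, different column order)
models = {'SVM': clf1, 'RF': clf2, 'XGB': clf3, 'MLP': clf4}
for name, clf in models.items():
    df1[name] = clf.predict(X_test)
    df1[f'VALUE START {name} [$]'] = df1[name] * buy_amount
    df1[f'VAR {name} [$]'] = (pvar_test['2019 PRICE VAR [%]'].values / 100) * df1[f'VALUE START {name} [$]']
    df1[f'VALUE END {name} [$]'] = df1[f'VALUE START {name} [$]'] + df1[f'VAR {name} [$]']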

Finally, we build a compact dataframe MODELS_COMPARISON in which we collect the main information required to perform the comparison between the classification models and the benchmarks (S&P 500, DOW JONES).

Leveraging the dataframe df1, we can easily compute gains and losses for each model (net_gain_, percent_gain_).

Since we are still missing the benchmark data, we quickly exploit the custom function get_price_var to get the 2019 percent price variation of both the S&P 500 (^GSPC) and the DOW JONES (^DJI).

# Create a new, compact dataframe in order to show gain/loss for each model
start_value_svm = df1['VALUE START SVM [$]'].sum()
final_value_svm = df1['VALUE END SVM [$]'].sum()
net_gain_svm = final_value_svm - start_value_svm
percent_gain_svm = (net_gain_svm / start_value_svm) * 100

start_value_rf = df1['VALUE START RF [$]'].sum()
final_value_rf = df1['VALUE END RF [$]'].sum()
net_gain_rf = final_value_rf - start_value_rf
percent_gain_rf = (net_gain_rf / start_value_rf) * 100

start_value_xgb = df1['VALUE START XGB [$]'].sum()
final_value_xgb = df1['VALUE END XGB [$]'].sum()
net_gain_xgb = final_value_xgb - start_value_xgb
percent_gain_xgb = (net_gain_xgb / start_value_xgb) * 100

start_value_mlp = df1['VALUE START MLP [$]'].sum()
final_value_mlp = df1['VALUE END MLP [$]'].sum()
net_gain_mlp = final_value_mlp - start_value_mlp
percent_gain_mlp = (net_gain_mlp / start_value_mlp) * 100

percent_gain_sp500 = get_price_var('^GSPC') # get 2019 percent gain of S&P 500 index
percent_gain_dj = get_price_var('^DJI')     # get 2019 percent gain of DOW JONES index

MODELS_COMPARISON = pd.DataFrame([start_value_svm, final_value_svm, net_gain_svm, percent_gain_svm],
                                 index=['INITIAL COST [USD]', 'FINAL VALUE [USD]', '[USD] GAIN/LOSS', 'ROI [%]'],
                                 columns=['SVM'])
MODELS_COMPARISON['RF'] = [start_value_rf, final_value_rf, net_gain_rf, percent_gain_rf]
MODELS_COMPARISON['XGB'] = [start_value_xgb, final_value_xgb, net_gain_xgb, percent_gain_xgb]
MODELS_COMPARISON['MLP'] = [start_value_mlp, final_value_mlp, net_gain_mlp, percent_gain_mlp]
MODELS_COMPARISON['S&P 500'] = ['', '', '', percent_gain_sp500]
MODELS_COMPARISON['DOW JONES'] = ['', '', '', percent_gain_dj]

[Table: MODELS_COMPARISON — initial cost, final value, gain/loss and ROI for each model, plus the ROI of the S&P 500 and DOW JONES benchmarks]

From the dataframe MODELS_COMPARISON, it is possible to see that:

  • XGB and RF are the ML models that yield the highest ROI, 31.3% and 40.9% respectively
  • RF outperforms the S&P 500 by more than 12 p.p., while it outperforms the DOW JONES by almost 20 p.p.
  • XGB outperforms the S&P 500 by a few p.p., while it outperforms the DOW JONES by almost 10 p.p.
  • MLP and SVM are closely matched, and yield an ROI of 28.3% and 27.2% respectively
  • MLP and SVM perform similarly to the S&P 500, while they both outperform the DOW JONES
  • the SVM leads to the highest net gains, at about 3290 USD; however, it also has the highest initial investment cost, at 12100 USD
  • the RF leads to the lowest net gains, at about 1920 USD; however, it also has the lowest initial investment cost, at 4700 USD

So, this example shows, at least as a proof-of-concept, that it is possible to find useful information in the 10-K filings that publicly traded companies release, and that this financial information can be used to train machine learning models that learn to recognize buy-worthy stocks.

For a more traditional comparison of the performance of the implemented ML models, it is possible to analyze the classification_report.

from sklearn.metrics import classification_report

print()
print(53 * '=')
print(15 * ' ' + 'SUPPORT VECTOR MACHINE')
print(53 * '-')
print(classification_report(y_test, clf1.predict(X_test), target_names=['IGNORE', 'BUY']))
print(53 * '-')
print(53 * '=')
print(20 * ' ' + 'RANDOM FOREST')
print(53 * '-')
print(classification_report(y_test, clf2.predict(X_test), target_names=['IGNORE', 'BUY']))
print(53 * '-')
print(53 * '=')
print(14 * ' ' + 'EXTREME GRADIENT BOOSTING')
print(53 * '-')
print(classification_report(y_test, clf3.predict(X_test), target_names=['IGNORE', 'BUY']))
print(53 * '-')
print(53 * '=')
print(15 * ' ' + 'MULTI-LAYER PERCEPTRON')
print(53 * '-')
print(classification_report(y_test, clf4.predict(X_test), target_names=['IGNORE', 'BUY']))
print(53 * '-')

which yields:

=====================================================
               SUPPORT VECTOR MACHINE
-----------------------------------------------------
              precision    recall  f1-score   support

      IGNORE       0.40      0.05      0.09        38
         BUY       0.71      0.97      0.82        90

    accuracy                           0.70       128
   macro avg       0.55      0.51      0.45       128
weighted avg       0.62      0.70      0.60       128

-----------------------------------------------------
=====================================================
                    RANDOM FOREST
-----------------------------------------------------
              precision    recall  f1-score   support

      IGNORE       0.37      0.79      0.50        38
         BUY       0.83      0.43      0.57        90

    accuracy                           0.54       128
   macro avg       0.60      0.61      0.54       128
weighted avg       0.69      0.54      0.55       128

-----------------------------------------------------
=====================================================
              EXTREME GRADIENT BOOSTING
-----------------------------------------------------
              precision    recall  f1-score   support

      IGNORE       0.48      0.34      0.40        38
         BUY       0.75      0.84      0.80        90

    accuracy                           0.70       128
   macro avg       0.62      0.59      0.60       128
weighted avg       0.67      0.70      0.68       128

-----------------------------------------------------
=====================================================
               MULTI-LAYER PERCEPTRON
-----------------------------------------------------
              precision    recall  f1-score   support

      IGNORE       0.39      0.29      0.33        38
         BUY       0.73      0.81      0.77        90

    accuracy                           0.66       128
   macro avg       0.56      0.55      0.55       128
weighted avg       0.63      0.66      0.64       128

-----------------------------------------------------

Looking carefully, it is fair to ask: why does the RF return the highest ROI if it is the method with the lowest accuracy? This happens because:

  • RF has the highest precision for the BUY class (83%). Indeed, 83% of its BUY predictions are true positives, and the remaining 17% are false positives
  • minimizing the number of false positives minimizes the amount of money spent on stocks that will decrease in value during 2019
  • RF has the highest recall for the IGNORE class (79%), meaning that it correctly identified 79% of the stocks that should not be bought

However, all this also means that we miss a lot of stocks that should have been bought, since RF produces a high number of false negatives. Indeed, RF has the lowest recall for the BUY class (43%), meaning that it only finds 43% of the stocks that should have been classified as BUY-worthy.
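
This trade-off can be verified numerically with a short sketch (not part of the original code) that derives the BUY precision and recall of the RF model from its confusion matrix:

# Sanity check: BUY precision/recall of the RF model from its confusion matrix
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, clf2.predict(X_test)).ravel()
print(f'BUY precision: {tp / (tp + fp):.2f}')  # high: few false positives, little money wasted
print(f'BUY recall:    {tp / (tp + fn):.2f}')  # low: many false negatives, missed opportunities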

This concludes section 2 of this article. Hopefully you found it useful and inspiring. For more about the Financial Modeling Prep API, the code and the historical data, check out the API docs and the GitHub repository linked above.
