Bayesian Optimization in Trading (2024)

Algorithmic trading faces many of the same optimization problems as machine learning. Today, I’m going to show how to apply Bayesian optimization to tune the hyperparameters of a trading strategy.

Let’s suppose you created a trading strategy with a few hyperparameters. The strategy is profitable in backtesting. You want to deploy it to production, but one question remains: “Is this set of parameters optimal?”

Typically, a grid-search approach is used to find optimal hyperparameters. This approach is also used in machine learning, but it requires a lot of computation, often spent in the wrong region of the parameter space.

Another approach is random search, which can perform slightly better than grid search.

I found a great explanation on Quora:

The problem: The goal is to find an approximate minimum of some ‘expensive’ function. Such a function accepts a real-valued vector x ∈ ℝⁿ, returns a scalar, and takes a damn long time doing it. Also, no gradient calculation is available. So let’s imagine a simple one-dimensional case:

[Figure 1: the hidden one-dimensional objective function]

The dotted lines imply we can’t see this function — we may only choose to evaluate it at certain points.

Since this function is costly, let’s say we have a fixed budget of evaluations — we may only evaluate it, say, 10 times.

With that, how would you find the best possible minimum?

This is the idea: Sample some input-outputs (less than 10) and use them to guess the true function with something called a ‘Gaussian Process’ (or ‘GP’ here). Then use that guessed function to determine where to evaluate next. Evaluate that point, add it to our set of input-outputs and infer the guessed function once again. Repeat this until you’ve exhausted your budget of evaluations (or some other stopping criteria). If the GP is any good at guessing the true function, we’ll do better than random sampling.
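To make that loop concrete, here is a minimal sketch in Python: a hand-rolled RBF-kernel GP, an expected-improvement acquisition, and a fixed evaluation budget. The toy objective, kernel length scale, and budget are all illustrative choices of mine, not from the original article.

```python
import math
import numpy as np

def objective(x):
    # a toy "expensive" 1-D function we pretend we cannot see
    return np.sin(3 * x) + x ** 2 - 0.7 * x

def rbf_kernel(a, b, length=0.2):
    # squared-exponential kernel with unit prior variance
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(xs, ys, grid, noise=1e-6):
    # GP posterior mean and std on `grid` given observations (xs, ys)
    K = rbf_kernel(xs, xs) + noise * np.eye(len(xs))
    Ks = rbf_kernel(xs, grid)
    mu = Ks.T @ np.linalg.solve(K, ys)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mu, np.sqrt(np.clip(var, 0.0, None))

def expected_improvement(mu, sigma, y_best):
    # EI for minimization: (y_best - mu) * Phi(z) + sigma * N(z)
    ei = np.zeros_like(mu)
    m = sigma > 1e-12
    z = (y_best - mu[m]) / sigma[m]
    cdf = np.array([0.5 * (1 + math.erf(v / math.sqrt(2))) for v in z])
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    ei[m] = (y_best - mu[m]) * cdf + sigma[m] * pdf
    return ei

grid = np.linspace(-1, 2, 301)
xs = np.array([-0.8, 0.0, 1.0, 1.8])    # four initial samples
ys = objective(xs)

for _ in range(6):                       # spend the remaining budget
    mu, sigma = gp_posterior(xs, ys, grid)
    ei = expected_improvement(mu, sigma, ys.min())
    x_next = grid[np.argmax(ei)]         # cheap inner optimization of EI
    xs = np.append(xs, x_next)
    ys = np.append(ys, objective(x_next))

print('best x:', round(float(xs[np.argmin(ys)]), 2),
      'best y:', round(float(ys.min()), 3))
```

Each pass refits the GP on everything observed so far and evaluates the true function only once, exactly the sample-infer-repeat cycle described above.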

To get a feel for the GP, let’s sample four points from our expensive function, hand these over to the GP and have it infer the rest of the function. That might look like this:

[Figure 2: the GP posterior inferred from four sampled points]

The solid green line is our guess of the true function. Each additional band of green is another half standard deviation on the output distribution.

So now the question becomes: given all this useful guessed information, what point should we check next? In answering this, there are two things we care about:

We should evaluate points we think will yield a low output value. That is, we should evaluate on points where the solid green line is low.

We should check areas we know less about. So in the graph above, it would be wiser to check somewhere between .65 and .75 than to check between .15 and .25 since we have a pretty good idea as to what’s going on in the latter zone. Said differently, we should check regions which will reduce the variance in our guess the most.

Balancing these two is the exploration-exploitation trade-off. Do you look in new spots or exploit the goldmines you’ve discovered thus far? We make our preference explicit with an acquisition function. This is a function of x that yields a number telling us how well we are accomplishing these two goals. This function is cheap, so we can optimize it and use its maximizer as the next point to evaluate.

So which acquisition function? Well, there are a handful of options, but I’ll go with expectation of improvement. That is, evaluate the next point where the expected improvement is highest.

So if:

μ(x) is the guessed value of the function at x (the green line).

σ(x) is the standard deviation of the output at x (proportional to the green bands).

Then our acquisition/expected-improvement function (call it A(x)) is:

A(x) = (y_best − μ(x)) · Φ(z) + σ(x) · N(z),  where z = (y_best − μ(x)) / σ(x) and y_best is the lowest output observed so far,

and Φ(⋅) and N(⋅) refer to the CDF and PDF of a standard normal distribution, respectively. It’s not important to understand this formula exactly; just know that it’s some balance of a low μ(x) and a high σ(x).
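As a rough sketch, the expected-improvement acquisition can be computed directly in Python with the standard normal CDF and PDF; the numbers in the example calls are made up for illustration.

```python
import math

def ei(mu, sigma, best):
    # A(x) = (best - mu) * Phi(z) + sigma * N(z),  z = (best - mu) / sigma
    # Phi / N are the CDF / PDF of a standard normal; `best` is the lowest
    # observation so far, since we are minimizing
    if sigma <= 0:
        return 0.0
    z = (best - mu) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    return (best - mu) * cdf + sigma * pdf

# a point predicted below the current best, with some uncertainty
print(round(ei(mu=0.2, sigma=0.3, best=0.5), 3))  # → 0.325
# no uncertainty left: nothing to gain from re-checking this point
print(ei(mu=0.2, sigma=0.0, best=0.5))            # → 0.0
```

Note how a point predicted above the current best still gets a positive (if smaller) score when σ is large, which is exactly the exploration term.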

To see how this operates, check it out next to our previous graph:

[Figure 3: the acquisition function plotted beneath the GP posterior]

So this would tell us to check at x = 1, since that point has the highest acquisition value.

And then repeat. That’s it! Like I said earlier, if the GP is any good at guessing the true function, it’ll beat random sampling.

This article compares different approaches to hyperparameter tuning in machine learning and suggests Bayesian optimization as a better approach than random search, grid search, or manual tuning.

This video by Siraj Raval also has a good explanation:

If you want to dive deeper into this topic you should read this article.

We will consider a strategy that I discussed earlier. This strategy has four hyperparameters. Let’s define the required libraries:

Bayesian optimization is implemented in the hyperopt package. You can find an introduction to the package in the great article by Will Koehrsen.

[Code screenshot: importing the required libraries]

Usually, researchers use backtesting to find optimal hyperparameters and an out-of-sample test for evaluation. I like this approach, but I think we can get a more significant result if we use several backtesting periods (folds).

The closer a fold is to the out-of-sample period, the greater its weight in the final weighted score.

Also, this function should penalize the final value if any fold has negative performance. As a score function in trading, you can use the Sortino ratio or any other function that estimates the quality of the strategy. For our purpose, a lower value is better, which means we should invert the Sortino ratio.
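A minimal sketch of such an inverted score, assuming a series of simple returns and a zero target return; the exact Sortino variant and return frequency used in the article may differ.

```python
import numpy as np

def sortino(returns, target=0.0):
    # Sortino ratio: mean excess return over downside deviation
    excess = np.asarray(returns, dtype=float) - target
    downside = excess[excess < 0]
    if len(downside) == 0:
        return float('inf')          # no losing periods at all
    downside_dev = np.sqrt(np.mean(downside ** 2))
    return float(excess.mean() / downside_dev)

def fold_score(returns):
    # the optimizer minimizes, so a *good* strategy must get a *low*
    # score: invert the Sortino ratio
    return -sortino(returns)

returns = [0.010, -0.005, 0.020, -0.010, 0.015]  # hypothetical fold returns
print(round(fold_score(returns), 3))  # → -0.759
```

A profitable fold yields a negative score, which is what the penalty logic later keys on.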

In this example, we will use:

  • Five one-month folds for the training (optimization) step
  • One month for the testing period and three weeks for validation (the out-of-sample period)

The testing period is used to select the pairs that will be tested in the validation period.

  • Training period: 2018–06–01 to 2018–10–31
  • Testing period: 2018–11–01 to 2018–11–30
  • Validation period: 2018–12–01 to 2018–12–21

Let’s define the periods, the hyperparameter space, and a weighted_mean function. This code also lets you assign different weights to folds of different lengths via period_weights.
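Since the original code is shown only as an image, here is a sketch of what weighted_mean and period_weights might look like; the fold scores and weights below are hypothetical.

```python
import numpy as np

# hypothetical fold weights: later folds (closer to the test period)
# get larger weights
period_weights = [1, 2, 3, 4, 5]                 # June ... October 2018
fold_scores = [-8.1, -11.3, -6.4, -9.0, -12.2]   # inverted Sortino per fold

def weighted_mean(scores, weights):
    # weighted average of fold scores; weights need not sum to one
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(scores * weights) / np.sum(weights))

print(round(weighted_mean(fold_scores, period_weights), 2))  # → -9.79
```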

The next step is to define the score function that runs the algorithm and measures its performance (in this case, the Sortino ratio).

This function is long because it also contains nested functions of the algorithm from the previous article. An alternative way of passing parameters from the hyperopt package to the Catalyst framework is the os.environ module. Let me know in the comments if you know a better way; maybe the folks from the Enigma Project can suggest the right answer.

At its core is the objective function, which combines the performance of the folds using the weighted_mean function; if any fold is unprofitable, it instead returns the maximum (worst) of the fold scores as a penalty.
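That combine-or-penalize logic can be sketched as follows, assuming the fold scores are inverted Sortino ratios (so a positive score means an unprofitable fold); the numbers in the example calls are made up.

```python
def weighted_mean(scores, weights):
    # weighted average of fold scores
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)

def objective(fold_scores, weights):
    # scores are inverted Sortino ratios, so a positive value means an
    # unprofitable fold; in that case return the worst fold score as a
    # penalty instead of the weighted mean
    worst = max(fold_scores)
    if worst > 0:
        return worst
    return weighted_mean(fold_scores, weights)

print(objective([-8.0, -12.0, -10.0], [1, 2, 3]))  # all folds profitable
print(objective([-8.0, 3.5, -10.0], [1, 2, 3]))    # one unprofitable fold → 3.5
```

Returning the worst fold makes a single losing month dominate the score, so the optimizer avoids parameter sets that only work in some periods.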

The next code runs the optimization with 300 iterations.

The output looks like this:

This table contains the top trials by score. The last line is the set of best hyperparameters. I carried out this experiment for ETH/BTC, LTC/BTC, XMR/BTC, and ZEC/BTC using Bitfinex data.

Theoretically, you can visualize the objective value in 3D space (dimension 1 × dimension 2 × objective value), where the best solution is the global minimum:

[Figure: the objective-value surface over two hyperparameter dimensions]

This hyperparameters table shows the best values of parameters for each asset:

[Table: the best hyperparameter values for each asset]

At this step, we run the strategy with the fixed hyperparameters for each asset on the testing period. Assets with positive performance will be selected for the next step (the out-of-sample test).

The following code does this; you only need to change the hyperparameters and set the cryptocurrency.

As you can see, two assets (ETH/BTC and XMR/BTC) achieved positive performance.

Having found the optimal hyperparameters and selected the assets, we can run the trading algorithm on a period that was not used for optimization (the validation period).

Let’s build a portfolio based on these two assets. At t = 0, allocate capital to each asset in proportion to its objective value (-15.45 and -9.72).

Theoretically, the weight of each asset could be chosen differently, for example with a uniform allocation. The final equity of the portfolio combines the performance of both assets.
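For the two objective values mentioned above (-15.45 and -9.72), the proportional allocation works out roughly like this:

```python
# capital split proportional to the objective values; both scores are
# negative (lower is better), so use their absolute magnitudes
scores = {'ETH/BTC': -15.45, 'XMR/BTC': -9.72}
total = sum(abs(s) for s in scores.values())
weights = {pair: abs(s) / total for pair, s in scores.items()}
print({pair: round(w, 3) for pair, w in weights.items()})
# → ETH/BTC ≈ 0.614, XMR/BTC ≈ 0.386
```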

[Figures: the portfolio equity curve and per-asset performance on the validation period]

On the whole, the strategy has positive performance on the out-of-sample period. The basic metrics are shown in the table.

As the graph above shows, the drawdown period is quite long. The algorithm probably couldn’t adapt to the decline after the preceding rapid growth. You might take this into consideration when formulating the objective function.

We dropped two assets (LTC/BTC and ZEC/BTC) from the portfolio at the testing step. Let’s see whether this was the right decision:

Yes, the decision was right: if these assets had been included, the portfolio would have performed worse. The confusion matrix looks like this:

[Figure: confusion matrix of asset selection vs. validation performance]

As we can see, there are no type I errors. We wrongly dropped only ZEC/BTC from the portfolio; that asset performed well on the validation step (a type II error). In this case, that is less harmful than including an asset that turns out to be unprofitable.
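The confusion matrix above can be reproduced from the selection decisions and validation outcomes described in the text:

```python
# selection decision at the testing step vs. actual validation outcome
selected   = {'ETH/BTC': True,  'XMR/BTC': True,
              'LTC/BTC': False, 'ZEC/BTC': False}
profitable = {'ETH/BTC': True,  'XMR/BTC': True,
              'LTC/BTC': False, 'ZEC/BTC': True}

tp = sum(selected[a] and profitable[a] for a in selected)          # kept, profitable
fp = sum(selected[a] and not profitable[a] for a in selected)      # type I error
fn = sum(not selected[a] and profitable[a] for a in selected)      # type II error (ZEC/BTC)
tn = sum(not selected[a] and not profitable[a] for a in selected)  # rightly dropped

print(tp, fp, fn, tn)  # → 2 0 1 1
```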

In this article, we covered the following:

  1. Suggested an approach for optimizing a trading strategy, including training, testing, and validation steps.
  2. Formulated an objective function that combines score values from different folds.
  3. Developed scripts that you can explore on GitHub.

This example is not the ultimate trading strategy; the given result could be random. A huge amount of experimentation is needed to draw conclusions about its suitability for live trading.

How to improve this approach:

  1. Define your own objective and score functions based on your goal.
  2. Customize the train/test/validation periods.
  3. Build a portfolio from a larger number of assets to test the hypothesis that this approach works for you.
  4. Use a moving walk-forward test to get a more significant result. It simulates a situation closer to real trading and gives you more data for statistical inference.

Let me know what optimization approach you use on your own trading strategies.


FAQs

What is the Bayesian approach to trading? ›

Unlike simpler frequentist cointegration tests, the Bayesian approach allows one to monitor the relationship between a pair of equities over time, which makes it possible to follow pairs whose cointegration parameters change steadily or abruptly. Bayesian statistics also accounts for uncertainty in making predictions.

When should I use Bayesian optimization? ›

It's particularly effective for scenarios where sampling is costly, and the objective function is unknown but can be sampled. Bayesian optimization typically employs a probabilistic model, like a Gaussian Process, to estimate the objective function and then uses an acquisition function to decide where to sample next.

What is the Bayesian optimization strategy? ›

Bayesian optimization is a sequential design strategy for global optimization of black-box functions that does not assume any functional forms. It is usually employed to optimize expensive-to-evaluate functions.

What is Bayesian optimization in finance? ›

Bayesian optimization is an approach for performing derivative-free global optimization in a small dimension, and uses Gaussian processes to locate the global maximum of a black-box function.

What is the Bayesian method of financial forecasting? ›

In Bayesian forecasting, before we ever attempt to produce a forecast, we must first carefully enumerate all possible sources of uncertainty – including, where possible, the set of alternative forecasting models – and construct reasonable prior beliefs for these quantities, which often include (possibly several layers ...

What is Bayesian approach to demand forecasting? ›

A Bayesian approach utilizes available information and updates prior information (such as judgmental input) as observed demand occurs. The Bayesian paradigm has been used to overcome difficulties with limited demand data [10].

What is Bayesian optimization in simple words? ›

The Bayesian optimization algorithm attempts to minimize a scalar objective function f(x) for x in a bounded domain. The function can be deterministic or stochastic, meaning it can return different results when evaluated at the same point x.

Is Bayesian optimization better than random search? ›

We can see that the Bayesian search outperforms the other methods by a little. This effect is much more noticeable on larger datasets and more complex models.

What are the benefits of Bayesian optimization? ›

By iteratively estimating posterior distribution of the objective function and adaptively updating the subspace of hyper-parameters, Bayesian optimization can leverage the information from prior evaluations and only focus on promising subspaces, which not only simplifies the hyper-parameter searching process, but also ...

Does Google Optimize use Bayesian? ›

Google Optimize generates reports with the use of Bayesian inference.

What are the limits of Bayesian optimization? ›

However, BO (Bayesian optimization) is practically limited to optimizing 10–20 parameters. To scale BO to high dimensions, we usually make structural assumptions about the decomposition of the objective and/or exploit the intrinsic lower dimensionality of the problem, e.g. by using linear projections.

Is Bayesian better than maximum likelihood? ›

Using simulated datasets, the Bayesian method generally fares better than the ML approach in accuracy and coverage, although for some values the two approaches are equal in performance.

What are Bayesian models good for? ›

Finally, Bayesian modeling is a powerful tool for decision-making. It can be used to inform policy decisions by providing a quantitative assessment of a variety of complex risks associated with exposure to pollutants.

Where is Bayesian optimization used? ›

Bayesian Optimization has been widely used for the hyperparameter tuning purpose in the Machine Learning world.

What is Bayesian analysis and where why to use it? ›

Bayesian analysis is a statistical paradigm that answers research questions about unknown parameters using probability statements.

Why do we use Bayesian analysis? ›

Specifically, in this case, Bayes' theorem states that our prior knowledge is updated by the current data to yield updated knowledge in the form of the posterior distribution. That is, we can use our prior information in estimating the population mean, variance, and other aspects of the distribution for this sample.
