In one of my previous posts, I introduced decision trees and their role in solving classification problems. Today, I want to go a step further and look at a practical application of tree-based models, with some insight into their implementation in Python.
Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean prediction of the individual trees is returned. Random decision forests correct for decision trees' habit of overfitting to their training set. Random forests generally outperform single decision trees, but their accuracy is often lower than that of gradient-boosted trees.
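To make the regression case concrete, here is a minimal sketch using scikit-learn on synthetic data (the numbers are randomly generated and have nothing to do with the stock data discussed later). It averages the predictions of the individual trees by hand and checks that the result matches what the forest returns.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=60)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Average the predictions of the individual trees by hand...
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)

# ...and compare with the prediction returned by the forest itself.
forest_pred = forest.predict(X)
print(np.allclose(manual_mean, forest_pred))
```

Each fitted tree is available through the `estimators_` attribute, which is also what makes it possible to plot a single tree later on.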
My goal was to use a random forest regressor to predict the returns of Apple stock. The prediction was based on the percentage changes observed in interest rates, crude oil prices, and Google search trends.
Using Yahoo Finance and the yfinance library in Python, I downloaded monthly historical data for Apple for the past 5 years. The monthly historical interest rate and crude oil price data were downloaded from FRED (Federal Reserve Economic Data), and Google search trends from Google Analytics. With the data in hand, my initial focus was on data cleansing, which entailed removing any null or missing values so that the dataset was ready for analysis.
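As a rough illustration of the cleansing step, the sketch below builds a small monthly frame with made-up placeholder values (not the actual downloads) and drops every month in which any series is missing. The column names are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly frame standing in for the merged downloads.
# All values are placeholders, not real market data.
dates = pd.date_range("2019-01-01", periods=6, freq="MS")
raw = pd.DataFrame(
    {
        "aapl_close": [39.0, 41.5, np.nan, 48.0, 43.0, 47.5],
        "interest_rate": [2.40, 2.40, 2.41, np.nan, 2.39, 2.38],
        "crude_oil": [51.4, 55.0, 58.2, 63.9, 60.8, 54.7],
        "search_trend": [40, 38, 45, 50, 42, 44],
    },
    index=dates,
)

# Count missing values per column before cleaning.
missing_before = raw.isna().sum()

# Drop any month in which at least one series is missing.
clean = raw.dropna(how="any")

print(missing_before)
print(clean)
```

Dropping rows is the simplest option. With only about 60 monthly observations, every dropped row matters, so it is worth checking `missing_before` first.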
Subsequently, I divided the dataset into two distinct data frames: ‘y’ and ‘x’. In ‘y’, I stored the dependent variable, the percentage change in Apple stock prices. In ‘x’, I stored the independent variables: interest rates, crude oil prices, and Google search trends. To align the two data frames, I discarded the first row of values (a percentage change has no value for the first observation, since there is no earlier month to compare against) and set their indexes equal to one another.
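The split described above can be sketched as follows. This is a minimal example under stated assumptions: the frame `clean` and its column names are hypothetical, and the values are placeholders picked so that the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical cleaned monthly levels (placeholder values).
dates = pd.date_range("2019-01-01", periods=5, freq="MS")
clean = pd.DataFrame(
    {
        "aapl_close": [100.0, 110.0, 99.0, 108.9, 120.0],
        "interest_rate": [2.0, 2.1, 2.1, 2.2, 2.0],
        "crude_oil": [50.0, 55.0, 60.0, 57.0, 54.0],
        "search_trend": [40.0, 44.0, 42.0, 50.0, 45.0],
    },
    index=dates,
)

# Month-over-month percentage change. The first row is NaN because
# it has no previous month to compare against.
returns = clean.pct_change()

# Discard that first row.
returns = returns.iloc[1:]

# Dependent variable: percentage change in the Apple price.
y = returns[["aapl_close"]]

# Independent variables: percentage changes in the other series.
x = returns[["interest_rate", "crude_oil", "search_trend"]]

# Make the alignment explicit by selecting x on y's index.
x = x.loc[y.index]

print(y.head())
print(x.head())
```

Selecting with `x.loc[y.index]` keeps each row attached to its own date, whereas assigning one index onto the other would silently mislabel rows if the two frames were ever out of order.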
Now, let’s explore the implementation of the random forest regressor model. This step involved selecting two key parameters: the number of estimators (how many trees the forest contains) and the random state (a seed that makes the results reproducible). Following this, I fitted the model to the data.
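A minimal sketch of this step with scikit-learn's `RandomForestRegressor` is shown below. The data are synthetic, and the specific values `n_estimators=100` and `random_state=42` are assumptions made for the example, not necessarily the settings used for the results in this post.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic monthly percentage changes (placeholders, not real data).
rng = np.random.default_rng(42)
n_months = 59
x = pd.DataFrame(
    rng.normal(scale=0.05, size=(n_months, 3)),
    columns=["interest_rate", "crude_oil", "search_trend"],
)
y = pd.Series(
    0.3 * x["crude_oil"] + rng.normal(scale=0.05, size=n_months),
    name="aapl_return",
)

# n_estimators: number of trees; random_state: seed for reproducibility.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(x, y)

# For a regressor, score() returns R^2. Computed on the training data
# it is optimistic and says little about out-of-sample performance.
in_sample_r2 = model.score(x, y)

# Relative contribution of each feature to the fitted forest.
importances = pd.Series(model.feature_importances_, index=x.columns)

print(round(in_sample_r2, 3))
print(importances)
```

Fixing `random_state` means the bootstrap samples and feature choices are the same on every run, so the fitted forest can be reproduced exactly.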
To visualize the data, I used Matplotlib to create a figure representing the decision tree results. Here is a snippet of the code, along with the output of the model:
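For readers who want to produce a similar figure, here is a rough sketch of one way to draw a single tree from a fitted forest with Matplotlib and scikit-learn's `plot_tree`. It runs on synthetic data, and `max_depth=3` is an assumption added only to keep the drawing small; it is not the exact code behind the figure in this post.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import plot_tree

# Synthetic features and target (placeholders, not real data).
rng = np.random.default_rng(1)
feature_names = ["interest_rate", "crude_oil", "search_trend"]
X = rng.normal(scale=0.05, size=(59, 3))
y = 0.3 * X[:, 1] + rng.normal(scale=0.05, size=59)

# A shallow forest, so that each tree stays readable when drawn.
model = RandomForestRegressor(
    n_estimators=10, max_depth=3, random_state=1
).fit(X, y)

# Draw the first tree of the forest on a Matplotlib axis.
fig, ax = plt.subplots(figsize=(12, 6))
annotations = plot_tree(
    model.estimators_[0],
    feature_names=feature_names,
    filled=True,
    ax=ax,
)
ax.set_title("First tree of the random forest")
```

Calling `fig.savefig(...)` afterwards writes the figure to disk. Without a depth limit the trees grow until the leaves are nearly pure, which is a large part of why such a plot becomes hard to read.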
As you can see, the tree is quite difficult to follow, and the accuracy score is substantially lower than that of the Autoregressive Distributed Lag (ARDL) model. This was to be expected, since the model had only three features. Many factors affect stock prices, so a broader set of features is important for accuracy. At the same time, it is critical to find the right balance, so that the model is accurate without being overwhelmed by inputs.
Disclaimer:
Remember that predicting stock prices involves inherent risks, and past performance is not indicative of future results. Always seek professional financial advice and conduct extensive research before making any investment choices.
Source: Wikipedia