Course website: http://quantsoftware.gatech.edu/CS7646_Fall_2019
In Experiment 1, the probability of winning $80 within 1000 sequential bets is effectively 1. Based on Figure \ref{fig: 3}, the median approaches $80 and the standard deviation drops to zero within about 300 bets, so the probability of winning is at least one half. Based on Figure \ref{fig: 2}, the mean also approaches $80 and the standard deviation is zero within about 300 bets, so statistically every simulated episode is won and the empirical probability is 1. Strictly speaking, the mathematical probability is never exactly 1, but it is so close to 1 that winning can be treated as guaranteed.
The estimated expected value is $80. In all of the simulations, the result reaches $80. That is because the probability of not reaching $80 within 1000 bets is so small that it contributes almost nothing to the expectation.
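The claim above can be checked empirically. Below is a minimal sketch of the simulator, assuming the standard martingale doubling strategy and an American-roulette win probability of 18/38 per bet (both are assumptions, since the exact assignment parameters are not restated here):

```python
import numpy as np

def martingale_episode(win_prob=18/38, target=80, max_bets=1000):
    """One episode: bet $1, double the bet after each loss, reset after a win."""
    winnings, bet = 0, 1
    for _ in range(max_bets):
        if winnings >= target:
            break
        if np.random.random() < win_prob:
            winnings += bet  # a win recovers all streak losses plus $1
            bet = 1
        else:
            winnings -= bet  # a loss doubles the next bet
            bet *= 2
    return winnings

np.random.seed(0)
results = [martingale_episode() for _ in range(1000)]
print(sum(r >= 80 for r in results) / len(results))  # empirically 1.0
```

With an unlimited bankroll, reaching $80 only requires 80 completed win cycles, which is all but certain within 1000 bets.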
Based on Figure \ref{fig: 3}, the standard deviation grows at the start and reaches zero as the number of bets increases. We cannot say that the standard deviation reaches a maximum value and then stabilizes or converges as the number of sequential bets increases, because it is not simply an increasing-then-decreasing function; it fluctuates a lot. We can, however, say that the standard deviation converges to zero after a sufficient number of bets.

\section{Experiment 2}
With a finite bankroll, it takes about 8 consecutive losses to run out of money. Moreover, after every winning bet it again takes only about 8 consecutive losses to go broke, so the chance of losing the episode is substantial. For simplicity, take the probability of winning each bet to be one half. Then the probability of losing the episode is roughly
\[
\frac{1}{2^{8}}+\binom{9}{8}\frac{1}{2^{9}}+\binom{10}{8}\frac{1}{2^{10}}+\cdots+\binom{80}{8}\frac{1}{2^{80}}\ge\frac{1}{4}.
\]
So the probability of winning is less than $3/4$ under the approximation that each bet wins with probability $\frac{1}{2}$. In fact, the per-bet winning rate is smaller than $\frac{1}{2}$, so the probability of winning the episode could be as low as $\frac{1}{2}$.
Presume the probability of winning $80 in the end is $\frac12$; then the estimated expected value would be
\[
E=\frac12\cdot 80+\frac12\cdot(-256)=-88.
\]
Since the winning rate could well be less than $\frac12$, the estimated expected value would be even lower than $-88$.
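A quick simulation of the finite-bankroll case supports the sign of this estimate. This is a sketch under assumed parameters (a $256 bankroll and the 18/38 win probability of an American roulette black bet); the exact numbers will differ from the hand estimate:

```python
import numpy as np

def bounded_episode(win_prob=18/38, target=80, bankroll=256, max_bets=1000):
    """Martingale episode that stops at +target or when the bankroll is gone."""
    winnings, bet = 0, 1
    for _ in range(max_bets):
        if winnings >= target or winnings <= -bankroll:
            break
        bet = min(bet, bankroll + winnings)  # cannot bet more than remaining cash
        if np.random.random() < win_prob:
            winnings += bet
            bet = 1
        else:
            winnings -= bet
            bet *= 2
    return winnings

np.random.seed(0)
outcomes = np.array([bounded_episode() for _ in range(5000)])
print(outcomes.mean())          # empirical expected value: negative
print((outcomes >= 80).mean())  # empirical episode win rate: well below 1
```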
The standard deviation reaches a maximum value and then stabilizes, based on Figure \ref{fig: 5}. The reason is that the situation settles around 200 bets: the players who have gone broke stay broke, and the players who have reached $80 stop betting, so the game is almost settled.
Overfitting occurs when the leaf size is small; here it occurs when leaf_size is below 5. From the first figure, we can see that as the leaf size decreases, the in-sample RMSE decreases, i.e.\ the in-sample accuracy increases, which is a sign of overfitting. We can also see from Figure 4 that the exact-match rate of the in-sample predictions is almost $100\%$ at small leaf sizes, and this accuracy decreases as the leaf size increases.
Bagging can reduce overfitting with respect to leaf_size. As we can see from Figure 2, compared with Figure 1, with 20 bags the in-sample and out-of-sample RMSE are both smaller, which means the bagged learner performs better and does not overfit.
Using running time as the metric, the random tree learner clearly has better performance, as shown in Figure 3. Another metric is the exact-match percentage of the data. When the leaf size is small, the decision tree learner tends to overfit, which gives a higher in-sample exact-match percentage; the random tree learner may generalize better because of the randomness introduced during training. As the leaf size increases, the impact of that randomness shrinks, and the two methods tend to converge.
According to the definition of SMA, we have the following code, which computes the rolling mean and the prices-to-SMA ratio. SMA would work as an indicator because it is based on the recent past.
SMA = normalized_prices.rolling(window = window, center = False).mean()
prices_to_sma_ratio = normalized_prices[window-1 : ] / SMA[window-1 : ]
According to the definition of EMA, we have the following code, where we can directly use the \verb|.ewm| function provided by the \verb|pandas| package. EMA would work because it is based on the recent past and retains a portion of the past EMA.
EMA = normalized_prices.ewm(alpha = 2.0 / (window + 1)).mean()
\begin{figure}[H]\centering\includegraphics[scale=0.7]{ema.png}\caption{EMA}\end{figure}
According to the definition of Bollinger Bands, we have the following code, where we first get the SMA then do the band computation where $C$ is usually 2. Bollinger Bands would work because we assume that there is an average and the stock price won’t drift too far from the average. Once it reaches that boundary, it will come back eventually.
SMA = sma(normalized_prices = normalized_prices, window = window, plot = False)[0]
rolling_std = normalized_prices.rolling(window, center = False).std(ddof = 0)
upper_band = SMA + C * rolling_std
lower_band = SMA - C * rolling_std
The Commodity Channel Index (CCI) measures the current price level relative to an average price level over a given period of time. CCI is relatively high when prices are far above their average. CCI is relatively low when prices are far below their average. Using this method, CCI can be used to identify overbought and oversold levels.
CCI = (normalized_prices - normalized_prices.rolling(window = window, center = False).mean()) / (2.5 * normalized_prices.std())
The basic idea of the theoretically optimal strategy is that whenever the stock is about to drop, you short, and whenever the stock is about to rise, you long. Under this condition, you obtain the optimal return.
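Under the usual project constraints (perfect foresight of the next day's price and a position capped at ±1000 shares, which I assume here), the optimal trade list can be sketched as:

```python
import pandas as pd

def optimal_trades(prices, max_shares=1000):
    # Hold +max_shares before every up day and -max_shares before every down day.
    next_day_change = prices.diff().shift(-1)  # price[t+1] - price[t]
    target = next_day_change.apply(
        lambda d: max_shares if d > 0 else (-max_shares if d < 0 else 0))
    trades = target.diff()           # trade = change in target position
    trades.iloc[0] = target.iloc[0]  # first trade opens the initial position
    return trades

prices = pd.Series([100.0, 102.0, 101.0, 105.0, 103.0])
print(optimal_trades(prices).tolist())  # [1000.0, -2000.0, 2000.0, -2000.0, 1000.0]
```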
The cumulative return of the theoretically optimal strategy is 5.7861, its average daily return is 0.0038167861508578197, and its standard deviation of daily return is 0.004547823197908003. The cumulative return of the benchmark is 0.012299999999999978, its average daily return is 0.00016808697819094035, and its standard deviation of daily return is 0.017004366271213763.
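These statistics follow the standard definitions from the earlier projects; a sketch of how they can be computed from a portfolio value series (the example series is made up):

```python
import pandas as pd

def summarize(port_val):
    daily = port_val / port_val.shift(1) - 1.0       # daily returns
    daily = daily.iloc[1:]                           # drop the leading NaN
    cr = port_val.iloc[-1] / port_val.iloc[0] - 1.0  # cumulative return
    return cr, daily.mean(), daily.std()             # cr, adr, sddr

pv = pd.Series([1.0, 1.01, 0.99, 1.02])
cr, adr, sddr = summarize(pv)
print(round(cr, 4))  # 0.02
```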
\section{Manual Rule-Based Trader}
I used SMA, EMA, Bollinger Bands and CCI in analyzing the stock of JPM. Since I have four indicators, I can have four ways to long and short. The way I choose is to use voting. If there are more indicators showing that we should long or short, then we will follow the majority.
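A minimal sketch of this majority vote, assuming each indicator has already been turned into a signal in {+1, 0, -1} (+1 = long, -1 = short; the column names are illustrative):

```python
import pandas as pd

def vote(signals):
    # signals: one column per indicator, each value in {+1, 0, -1}
    total = signals.sum(axis=1)  # net vote per day
    return total.apply(lambda s: 1 if s > 0 else (-1 if s < 0 else 0))

sig = pd.DataFrame({"sma": [1, -1, 1], "ema": [1, 1, -1],
                    "bb":  [-1, -1, 0], "cci": [1, -1, 0]})
print(vote(sig).tolist())  # [1, -1, 0]
```

A tie (net vote of 0) means the indicators disagree, so the strategy does nothing that day.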
The cumulative return of the manual strategy is 0.42525449999999987, its average daily return is 0.0007861827938844847, and its standard deviation of daily return is 0.01290166157229807. The cumulative return of the benchmark is 0.012348734279835405, its average daily return is 0.00016940225817359267, and its standard deviation of daily return is 0.017076464107706853.
With only SMA, the strategy is better than benchmark.
With only EMA, the strategy is better than benchmark.
With only Bollinger Bands, the strategy is better than benchmark.
With only CCI, the strategy is better than benchmark.
Thus, based on the voting rule, we follow the majority, which will likely give us a model that beats the benchmark as well.
The cumulative return of manual strategy is -0.00040099999999976266. The average daily return of manual strategy is 3.0001047184868185e-05. The standard deviation of daily return of manual strategy is 0.007855890383816394. The cumulative return of benchmark is -0.0837506219789147. The average daily return of benchmark is -0.00013764540221914082. The standard deviation of daily return of benchmark is 0.008518500961407126.
If we only use one indicator, we have the following tables, which report the cumulative return (cr), average daily return (adr), and standard deviation of daily return (sddr) for the manual strategy and the benchmark:
SMA and EMA alone | SMA(In) | SMA(Out) | EMA(In) | EMA(Out) |
---|---|---|---|---|
cr manual | 0.237825500000 | 0 | 0.101193000000 | 0 |
adr manual | 0.00044785538630 | 0 | 0.000219204440363 | 0 |
sddr manual | 0.0070244551783 | 0 | 0.007480085024 | 0 |
cr benchmark | 0.0123487342798 | -0.08375062197 | 0.0123487342798 | -0.08375062197 |
adr benchmark | 0.000169402258173 | -0.000137645402219 | 0.000169402258173 | -0.000137645402219 |
sddr benchmark | 0.0170764641077 | 0.0085185009614 | 0.0170764641077 | 0.0085185009614 |
BB and CCI alone | BB(In) | BB(Out) | CCI(In) | CCI(Out) |
---|---|---|---|---|
cr manual | 0.199506500000 | 0.051515500000 | 0.34460099999 | 0.136424000000 |
adr manual | 0.000451037401201 | 0.00013027484872 | 0.00065849536336 | 0.000279401450883 |
sddr manual | 0.0134295062645 | 0.0078081124918 | 0.0119302194151 | 0.0070959021720 |
cr benchmark | 0.0123487342798 | -0.08375062197 | 0.0123487342798 | -0.08375062197 |
adr benchmark | 0.000169402258173 | -0.000137645402219 | 0.000169402258173 | -0.000137645402219 |
sddr benchmark | 0.0170764641077 | 0.0085185009614 | 0.0170764641077 | 0.0085185009614 |
We want to use a random forest to predict future stock movements. The basic label is whether we should go long or not. For each day in the in-sample data we compute the four indicators, which form a feature vector of dimension 4; this gives the training X. The label y comes from the stock's return: if $cr[t]$ is positive we buy the stock (long), if $cr[t]$ is negative we sell the stock (short), and otherwise we do nothing. With training X and training y in hand, we can use bootstrap aggregating to form a random forest and train it on the data. When new data arrives, we first compute its indicators and then use the random forest to predict on them.
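A sketch of the label construction, assuming a 10-day return horizon and illustrative buy/sell thresholds (the threshold values here are assumptions, not graded project settings):

```python
import numpy as np
import pandas as pd

def make_labels(prices, horizon=10, ybuy=0.02, ysell=-0.02):
    ret = prices.shift(-horizon) / prices - 1.0  # forward N-day return
    y = pd.Series(0, index=prices.index)
    y[ret > ybuy] = 1    # long
    y[ret < ysell] = -1  # short
    return y             # last `horizon` days stay 0 (no forward return exists)

prices = pd.Series(np.linspace(100, 120, 30))  # steadily rising toy series
labels = make_labels(prices)
print(labels.iloc[0], labels.iloc[-1])  # 1 0
```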
In Project 6, I used four indicators: SMA, EMA, Bollinger Bands, and CCI. In the implementation, I built an SMA dataframe, an EMA dataframe, a standard-deviation dataframe, and a CCI dataframe and combined them into a single dataframe. I then used this as X and the 10-day return as the label y to perform learning.
For the standardization of the data, all I did was divide the dataframe by the first day's price, i.e.\ normalize the data so every series starts at 1. This is the same method I used in Project 6, and given the indicators I chose, normalization to 1 is sufficient.
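The normalization itself is one line; a sketch with made-up prices:

```python
import pandas as pd

prices = pd.DataFrame({"JPM": [100.0, 101.5, 99.0]})  # illustrative prices
normalized = prices / prices.iloc[0]  # every series now starts at 1.0
print(normalized["JPM"].tolist())  # [1.0, 1.015, 0.99]
```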
Comparing the random forest learner and the manual strategy, we can see that the random forest learner performs much better than the manual strategy on the in-sample data. This shows that the random forest learning strategy is much stronger than the manual strategy on data it has already seen.
The assumptions used in the experiment are a commission of $9.95 and an impact of 0.005.
I would expect this with the in-sample data every time. When we evaluate the random forest on the very data it was trained on, it has effectively already seen the outcomes it is asked to predict (we are using the future to predict the future). Thus we will always see such good results on the in-sample data.
Hypothesis: When the impact increases, the accumulated return should be significantly smaller. And there will be a lot of times the trader will do nothing, holding nothing.
The first metric for checking the performance of the strategy learner is the number of zeros in the trades. This metric shows the efficiency of the learner: the bigger the number, the less efficient the learner is. Using this metric, we can see that when the impact is 0, the number of zeros is 324, whereas when the impact reaches 0.0025, 0.01, 0.015, and 0.02, the number of zeros is 349, 374, 378, and 398 respectively. And we can see from the graph that when the impact is large, our learner cannot find a good way to profit, which produces a sinuous curve.
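Counting the do-nothing days is straightforward; a sketch, assuming the trades are held as a pandas Series of share amounts:

```python
import pandas as pd

def count_do_nothing(trades):
    # number of days on which the learner trades 0 shares
    return int((trades == 0).sum())

trades = pd.Series([0, 1000, 0, -2000, 0, 0])  # illustrative trade list
print(count_do_nothing(trades))  # 4
```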
Another naive metric is the cumulative return. As we can see from Experiment 1, the cumulative return can reach 1.8 or higher, while in Experiment 2, using impacts of 0.01, 0.015, and 0.02, the cumulative return can barely beat 1.0, although there are times the return reaches more than 1.2.
We know that the impact affects the price in trading; more specifically, it makes each trade cost more. Thus, intuitively, as the impact increases, the profit decreases.