A note on stochastic volatility model estimation

The paper assesses the method proposed by Shumway and Stoffer (2006, Chapter 6, Section 10) to estimate the parameters and volatility of stochastic volatility models. First, the paper presents a Monte Carlo evaluation of the parameter estimates considering several distributions for the perturbations in the observation equation. Second, the method is assessed empirically, through backtesting evaluation of VaR forecasts of the S&P 500 time series returns. In both analyses, the paper also evaluates the convenience of using the Fuller transformation.


Introduction
Stochastic volatility (SV) models, initially proposed by Taylor (1982Taylor ( , 1986, compose a well-known class of models to estimate volatility. These models have gained attention in the financial econometrics literature because of their flexibility to capture the nonlinear behavior observed in financial time series returns. In this sense, they are an attractive alternative to many ARCH-type models which are based on the seminal work of Engle (1982).
Let r 1 , . . . ,r n be a time series of financial returns. In the SV model proposed by Taylor (1982Taylor ( , 1986, where {ε t } is a sequence of independent identically distributed (IID) random variables with zero mean and unit variance, {ω t } is an IID sequence with ω t ∼ N(0,σ 2 ω ), and {ε t } and {ω t } are assumed to be independent for all t. In this model the volatility at time t, given by σ t = β exp(h t /2), is unobserved because h t is a latent variable.
To estimate model (1)-(2), several methods have been proposed in the literature, many of them based on the strategy of taking the log of the squared returns in (1) to obtain where y t = ln(r 2 t ), α = ln(β 2 ) and η t = ln(ε 2 t ). Then (3)-(2) is a state space model with non-Gaussian perturbations η t . Therefore the challenge is how to perform inference in a non-Gaussian framework. Harvey, Ruiz & Shephard (1994) considered a quasi-maximum likelihood approach where perturbations η t are assumed Gaussian and estimates are obtained by maximizing the likelihood constructed via the Kalman filter. However, this method can produce estimates with high bias and variance when the variability of the latent process is small relative to the variability of the log-squared returns. To improve the efficiency of the latter proposal, Kim, Shephard & Chib (1998) proposed a Markov Chain Monte Carlo (MCMC) method where the distribution of η t is approximated by a mixture of seven normals with probabilities, means and variances previously specified. For an account of other estimation methods including MCMC alternatives, see the reviews of Broto & Ruiz (2004) and Shephard & Andersen (2009).
An interesting method for maximum likelihood estimation was proposed by Shumway & Stoffer (2006, Chapter 6, Section 10). They also approximated η t as a mixture of normals: where {V t,i } is an IID sequence with V t,i ∼ N(µ i , s 2 i ), µ 1 = 0, and {I t,i } is an IID Bernoulli sequence with P[I t,i = 1] = π i and π i = 1/m. Then, adopting the approximation given by Equation (4), the state space model (3)-(2) is conditionally Gaussian. Therefore it is possible to obtain the innovations via a Kalman filter algorithm and write the likelihood function based on these innovations, as follows. Let h t|t−1 = E(h t |y 1 ,...,y t−1 ) be the predicted values of h t given the past information; the Kalman recursions are given by: where the probabilities π jt = P(I jt = 1|y 1 ,...,y t ) are calculated by: with the density p j (t|t − 1) being approximated by a N(h t|t−1 + µ j ,Σ jt ). Then, the log-likelihood is evaluated by: In order to compute expressions in (5)-(11), one must first specify the number of terms in the mixture, m, as well as some initial values. Shumway and Stoffer considered m = 2, h 1|0 = 0 and P 1|0 = φ 2 + σ 2 ω . The estimation of parameters Θ = (α, φ , σ ω , µ 1 , . . . µ m , σ 1 , . . . ,σ m ) is obtained by maximizing the innovationbased likelihood function given by Equation (11). Note that unlike Kim at al. (1998), the parameters of the mixture given by µ i and σ i , for i ∈ {1,...,m}, have to be estimated.
As far as we know, the method proposed by Shumway and Stoffer has not been evaluated in terms of the quality of the point estimates. Thus, here we intend to fill this gap in the literature by assessing the performance of the estimation method through Monte Carlo simulations. Specifically, we evaluate the performance when ε t in (1) follows different distribution functions, ranging from symmetric to asymmetric ones. There is empirical evidence of the convenience of using non-Gaussian distributions for ε t , as discussed by Bai, Russell & Tiao (2003) and Liesenfeld & Jung (2000). We also study the behavior of the estimates when the mixture in (4) is defined with m = 2 and m = 3 terms. As mentioned above, Shumway and Stoffer considered m = 2 and here our objective is to assess if the inclusion of a third term in the mixture improves the estimation even at the cost of loss of precision (because the number of parameters increases). 1 In addition, we evaluate the convenience 1 A referee expressed concerns about the use of only m = 3 in the mixture approximation given the results of Kim et al. (1998) who used m = 7 to approximate the distribution of the logarithm of a log chi-square distribution with one degree of freedom. However, the quality of the estimates of the parameters (as well as the volatility estimates) also and crucially depend on factors such as the filtering algorithm, the convexity of the log-likelihood and the optimization routine. In the proposal of Shumway and Stoffer, one has to estimate two parameters for each extra term in the mixture, implying less precision and possibly non-convergence of parameters (especially in small samples like that used in the Monte Carlo experiments of the paper). In addition, the approximation proposed by Kim et al. (1998) only holds for Gaussian perturbations and, as far as we know, the case of mixture approximations for non-Gaussian perturbations (very important in applications) has not been studied. of using the Fuller transformation of the data. This transformation, suggested by Fuller (1996, pp.494-497), is denoted by: where s 2 is the sample variance of r t and c is a small constant, used to avoid zero returns (which leads to nonfinite values of log squared returns) or mitigate the effects of inliers. In short, our main goal of the paper is to provide practitioners with valuable clues to apply the method of Shumway and Stoffer. On the other hand, the quality of volatility estimates is assessed empirically through a backtesting exercise of one-step-ahead Valueat-Risk (VaR) forecasts of the S&P 500 time series returns. The VaR calculations were performed by a novel procedure and this topic was not addressed in the proposal of Shumway and Stoffer. The remainder of the paper is organized as follows. In Section 2 we present the results of the Monte Carlo experiments; Section 3 presents the results of the backtesting exercise of VaR forecasts of S&P 500 returns; and Section 4 summarizes the main conclusions.

Setup
For each combination of parameters, we simulated 2,000 replications of time series of size n = 1,250 from model (1)-(2), assuming that ε t disturbances follow the standard Gaussian distribution, two heavy-tailed distributions and three skew distributions. The heavy-tailed distributions are the t-Student distribution and the generalized error distribution (GED), with probability density functions (pdf) defined as: respectively, where λ = 2 −2/ν Γ 1 ν Γ 3 ν . The GED distribution whose pdf is given by Equation (14) has zero mean and unit variance and the t-Student distribution in (13) has to be standardized in order to have variance equal to 1. For those distributions, we considered parameters equal to ν = 5 for the t-Student and ν = 1.5 for the GED to reproduce heavy-tailed distributions. These values of ν imply kurtosis values around 4 and 9 for GED and t-Student distributions, respectively. Similar values of ν have been obtained in empirical applications; see Zevallos, Gasco & Ehlers (2017) when using SV models, and Nelson (1991) when using EGARCH models.
In addition, to generate ε t with asymmetric distribution we used the method proposed by Fernandez & Steel (1998). Here, let f (.) be a univariate unimodal probability density function symmetric at zero. Then, a skewed pdf, say p(.), can defined through where the parameter γ ∈ (0,∞) controls the skewness. Thus, if γ = 1 then p(ε t |γ) is symmetric, if γ > 1 then p(ε t |γ) is positively skewed, and if γ < 1 then p(ε t |γ) is negatively skewed. We considered three cases: i) the skew normal, when f (.) is the standardized normal density, ii) the skew t-Student, when f (.) is a t-Student with ν = 5, and finally iii) the skew GED, when f (.) is a GED with ν = 1.5. In all cases the γ parameter was equal to 0.8 to obtain slightly asymmetric distributions for the returns. The simulated distributions are standardized as well to impose mean zero and unit variance.
In the simulation, for Gaussian disturbances, the values of parameters φ , σ ω and β are chosen to be the same as those considered by Jacquier, Polson & Rossi (1994). In total, we have nine parameter combinations which represent three levels of signal-to-total (ST ) ratio for the log-squared returns. From (3), the ST is defined . Thus, in the setup, the values of ST are approximately equal to 0.32, 0.12 and 0.019. For heavy-tailed and asymmetric distributions, we choose parameters considered by Breidt & Carriquiry (1996) as those which describe the more realistic scenarios of Jacquier et al. (1994). Thus, the ST values are around 0.11 for heavy-tailed distributions, and around 0.134 for asymmetric distributions.
Despite the availability of R codes for estimation provided by the package astsa, see https://www.stat. pitt.edu/stoffer/tsa4/ and Shumway & Stoffer (2006, Chapter 6, Section 10), we developed MATLAB codes because the estimation convergence is faster compared with R. In the optimization procedure, we considered as initial parameter values: φ = 0.95, σ ω = 0.5, and β equal to the sample mean of returns transformed by Equation (12).
The performance of the method is assessed in terms of the bias and the squared root of the mean squared error (RMSE). Each simulated time series was estimated by the method of Shumway and Stoffer using two and three terms in the mixture (4), and with and without the Fuller transformation (12), considering c = 0.005.

Results
In Table 1 we present the results when ε t follows a standardized normal distribution. Cases 1-3, Cases 4-6 and Cases 7-9 correspond to ST values of 0.32, 0.12 and 0.019, respectively. The table shows that we obtained very good results in terms of bias forβ andφ for all cases. For instance, in terms of the estimates of φ , the highest bias was 9.3% of the true value (case 7, m = 2 with Fuller transformation), so the biases were small for all combinations of parameters. With respect toσ ω , we obtained mixed results in terms of bias, with good ones in Cases 1-6 when using m = 2 with Fuller transformation, as well as in Cases 7-8 using m = 3 with Fuller transformation. In turn, the RMSE values ofφ andσ ω for Cases 7-9 were bigger than of those of Cases 1-6. In addition, for Cases 7-9, the RMSE values ofσ ω were the lowest when m = 3 and using Fuller transformation. So, inclusion of a third term in the mixture expression and the Fuller transformation improved the results where they were difficult to estimate (cases with the lower ST values, around 0.01).
Tables 2 and 3 present the results for heavy-tailed and asymmetric distributions, respectively. For the cases considered, we also obtained very good results in terms of bias forβ andφ , and overall good results in terms of bias forσ ω using m = 2 with Fuller transformation.
Finally, regarding the best combination of number of terms in the mixture (m = 2 or m = 3) and the use or not of the Fuller transformation, the results in Tables 2 and 3 indicate that using the Fuller transformation and m = 2 is the best choice, except in terms of the bias ofβ in four cases of Table 2. In addition, in terms of the bias ofφ it is better to use the Fuller transformation and m = 3 in almost all of the considered cases. For Gaussian perturbations (see Table 1), we obtained mixed results. For Cases 4, 5 and 6 (with ST values similar to those of Tables 2 and 3), and to a lesser extent for Cases 2 and 3, using the Fuller transformation and m = 2 was also the best option. However, for cases presenting the lowest ST values, the best combination was to use the Fuller transformation and m = 3 for Cases 7 and 9, and for Case 8 the best results were obtained using m = 3 (with or without the Fuller transformation). These results indicate that when the signal-to-total ratio is very small, more terms are needed to approximate the distribution of ε t . Table 1 Bias and root mean squared error (RMSE) of the estimates for SV models with perturbations ε t following the standard Gaussian distribution. For each case, the lowest values of bias and RMSE are marked in bold. F refers to Fuller transformation and m indicates the number of terms in the mixture approximation.  Table 2 Bias and root mean squared error (RMSE) of the estimates of SV models with heavy-tailed perturbations: (a) ε t ∼ t 5 and (b) ε t ∼ GED (ν = 1.5). In both cases, β = 0.0252. For each parameter combination, the lowest values of bias and RMSE are marked in bold. F refers to Fuller transformation and m indicates the number of terms in the mixture approximation.  Table 3 Bias and root mean squared error (RMSE) of the estimates of SV models with skew perturbations:

VaR forecasting
In this section, we assess the method of Shumway and Stoffer empirically, through the estimation of onestep-ahead VaR forecasts of the S&P 500 returns. Our sample spans from January 6, 2003 to July 8, 2019, a total of 4,152 days. The data were obtained from Bloomberg, and all holidays were excluded to avoid zero returns.
To obtain the one-step-ahead VaR predictions we proceeded as follows. Once the parameters are estimated, based on the Kalman Filter presented by Shumway & Stoffer (2006, Chapter 6, Section 10) and reproduced in the Introduction, it is straightforward to obtain the one-step-ahead forecast of h t , denoted byĥ T +1|T , where T is the end of the sample. Then, the one-step-ahead volatility forecast is given byσ T +1|T =β exp(0.5ĥ T +1|T ). Therefore, from Equation (1) an estimator of the 100(1 − α)% VaR is given byq ασT +1|T , whereq α is an estimator of the α-quantile of ε t .
Findingq α from the data poses a challenging problem because the method applied does not specify a parametric model for η t = log(ε 2 t ); instead a mixture model is proposed for η t . To overcome this difficulty, we proceed as follows. First, for t = 1, . . . ,T we calculated the smoothed volatilitiesσ t|T =β exp(0.5ĥ t|T ) wherê h t|T are obtained via a Kalman smoother. Second, we estimated ε t by e t = r t /σ t|T . Third,q α was calculated as the α-quantile of the standardized empirical distribution of e 1 , . . . ,e T .
The returns of the period from February 23, 2012 to July 8, 2019 were used in a backtesting exercise to evaluate the 100(1 − α)%-VaR forecasts using a rolling-window sample of 2,500 returns. Thus, for each of these dates, a VaR forecast was obtained using the previous 2,500 returns. For each window, both parameters and volatilities were estimated again. Like the Monte Carlo experiment, our objective in this section is to compare the VaR forecast using m = 2 and m = 3, and evaluate if the Fuller transformation is convenient. However, we stress that the Fuller transformation was applied only to estimate the parameters of the SV model and the forecasts were obtained using the original data.
As an illustration, in Table 4 we present the parameter estimates using m = 2 and m = 3, as well as compare them with and without the use of the Fuller transformation, over the period from January 6, 2003 to February 22, 2012 (2,500 observations). This is an important period because it includes the financial crisis of October 2008 and some periods of high volatility, such as the beginning of August 2011 (the period of the downgrade of US bond rating). Table 4 shows that the estimates of φ are around 0.98 (strong persistence), and the values ofσ ω are around 0.2. Therefore, the estimates of the S&P 500 are close of the parameters of Case 6 in Table 1.

Table 4
Results of SV estimation of S&P 500 returns of the period from 06-Jan-2003 to 22-Feb-2012 (sample size of 2,500 observations). The errors, obtained from the Hessian matrix, are in parentheses. The quality of VaR forecasts was assessed by analyzing the proportion of violations, where a violation occurs when the loss is bigger than the forecast VaR. We applied the Kupiec test (Kupiec, 1995) to check if the proportion of violations was close to the theoretical value, the Christoffersen (1998) test to check conditional coverage, and the test proposed by Christoffersen & Pelletier (2004) to check if the violations were independent. The results are presented in Table 5. Taking into account both the percentage of violations and the p-values of the tests, the following conclusions emerge from Table 5. For 95% VaRs, the best results are obtained using m = 3 without the Fuller transformation (good results) and the second-best results, in terms of nominal value, are obtained using m = 2 and the Fuller transformation. For 97.5% and 99% VaRs, the best results are obtained using m = 3 and Fuller transformation. These results are very good in terms of both percentage of violations and performance of the tests. For 97.5% VaRs the second-best results are obtained using m = 3 without Fuller transformation, considered good results even though there is overestimation. For 99% VaRs, except when using m = 3 and Fuller transformation, the results are not good, indicating the importance of both the transformation and the number of terms in the mixture.
In Figure 1 we present the graph of the returns in the backtesting period and the 99%-VaR forecasts calculated using m = 3 and the Fuller transformation. These forecasts mimic very well the movements of the volatility and they are not far below the losses. Therefore, the behavior of the VaR forecasts is very good.

Figure 1
Backtesting results for S&P returns in black dots and predicted 99%-VaR, with Fuller transformation and m = 3 in red line.

Conclusions
Monte Carlo experiments revealed that the method for estimating SV models proposed by Shumway & Stoffer (2006, Chapter 6, Section 10) presents good performance overall in terms of point estimates, considering several distributions for the perturbations (ε t ) in the observation equation. However, the estimation ofσ ω in terms of bias is not good in all cases, because of the existence of high bias values. Overall, the best results were obtained using m = 2 terms in the mixture and using the Fuller transformation, except for the cases of very low signal-to-total ratio, where m = 3 terms are needed. On the other hand, the backtesting analysis of the S&P 500 returns revealed that in terms of one-step-ahead 97.5% and 99% VaR forecasts, using m = 3 and Fuller transformation provides very good results. Therefore, prudent advice to practitioners who want to use the method discussed in this paper is to use Fuller transformation with both m = 2 and m = 3 terms in the mixture, and decide between the two values of m based on backtesting considerations.

Acknowledgments
The first author acknowledges financial support from CAPES Ph.D. scholarship. The second author acknowledges financial support from the São Paulo Research Foundation (FAPESP) grant 2018/04654-9 and the support of the Centre for Applied Research on Econometrics, Finance and Statistics (CAREFS). The authors thank an anonymous referee for comments and valuable suggestions that led to an improvement in the paper.