Andrey KostenkoJekyll2016-11-05T13:12:42+00:00/Andrey Kostenko/akoste01@gmail.comForecasting Analyst Down Underandreykostenkohttps://feedburner.google.com<![CDATA[Time series models: VAR model definition]]>/time-series-models-var-model-definition2015-09-30T00:00:00+00:002015-09-30T00:00:00+00:00Andrey Kostenkoakoste01@gmail.com
<p>The Vector AutoRegression (VAR) family of models has been widely used for modelling and forecasting since the early 1980s. A VAR model is a conceptually simple system of equations in which each variable is explained by its own past values and the past values of all other variables in the system.</p>
<h1 id="var-model-definition">VAR model definition</h1>
<p>The Vector AutoRegression (VAR) family of models has been widely used for modelling and forecasting since the work of <span class="citation" data-cites="Sims80">Sims (1980)</span><a href="#fn1" class="footnoteRef" id="fnref1"><sup>1</sup></a>. A VAR model is a system of equations in which each variable is explained by its own past values and the past values of all other variables in the system. For example, with two variables <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>, and two lagged values of each variable, the system is given by</p>
<p><span class="math display">\[x_t=\alpha_0+\alpha_1 x_{t-1}+\alpha_2 x_{t-2} + \alpha_3y_{t-1} + \alpha_4y_{t-2} + \epsilon_{1t}\]</span> <span class="math display">\[y_t=\beta_0+\beta_1 x_{t-1}+\beta_2 x_{t-2} + \beta_3y_{t-1} + \beta_4y_{t-2} + \epsilon_{2t}\]</span></p>
<p>where <span class="math inline">\(\epsilon_1\)</span> and <span class="math inline">\(\epsilon_2\)</span> are generally correlated random errors. There are ten unknown parameters in the above system. More generally, a system of <span class="math inline">\(k\)</span> variables with <span class="math inline">\(l\)</span> lags on each variable plus a constant in each equation will have <span class="math inline">\((k^2l)+k\)</span> parameters to estimate. As the right-hand side variables are the same in each equation, ordinary least squares (OLS) can be used to estimate the system. Using matrix notation, the general VAR model can be written</p>
<p><span class="math display">\[\mathbf{Y}_t=\mathbf{A}\mathbf{Y}_{t-1}+ \boldsymbol{\epsilon}_t\]</span></p>
<p>where <span class="math inline">\(\mathbf{A}\)</span> is a matrix of polynomials in the lag operator, <span class="math inline">\(\mathbf{Y}\)</span> is the vector of variables, and <span class="math inline">\(\boldsymbol{\epsilon}\)</span> is a vector of random errors. Further details, including the statistical and econometric research preceding the development of VAR models, problems of parameter estimation, cointegration and forecasting, are available in <span class="citation" data-cites="Holden95">Holden (1995)</span>.</p>
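<p>The definitions above can be made concrete with a short simulation. The sketch below (illustrative only, not part of the original post; all parameter values are invented) simulates a stable bivariate VAR(1) in Python and recovers the coefficient matrix <span class="math inline">\(\mathbf{A}\)</span> by equation-by-equation OLS, which is valid here precisely because every equation shares the same right-hand side regressors:</p>

```python
import numpy as np

# Simulate a stable bivariate VAR(1): Y_t = A @ Y_{t-1} + eps_t
rng = np.random.default_rng(0)
A = np.array([[0.5, 0.1],
              [0.2, 0.4]])   # eigenvalues 0.6 and 0.3, inside the unit circle
n = 5000
Y = np.zeros((n, 2))
for t in range(1, n):
    Y[t] = A @ Y[t - 1] + rng.standard_normal(2)

# Same regressors in every equation, so per-equation OLS estimates the system.
X, Z = Y[:-1], Y[1:]
A_hat = np.linalg.lstsq(X, Z, rcond=None)[0].T
print(np.round(A_hat, 2))
```

<p>With a few thousand observations the OLS estimate lands close to the true <span class="math inline">\(\mathbf{A}\)</span>, mirroring the estimation argument above.</p>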
<h1 id="references" class="references unnumbered">References</h1>
<div id="ref-Holden95">
<p>Holden, K. (1995). Vector autoregression modeling and forecasting. <em>Journal of Forecasting</em>, <em>14</em>(3), 159–166. <a href="http://doi.org/10.1002/for.3980140302" class="uri">http://doi.org/10.1002/for.3980140302</a></p>
</div>
<div id="ref-Sims80">
<p>Sims, C. (1980). Macroeconomics and reality. <em>Econometrica</em>, <em>48</em>(1), 1–48. <a href="http://doi.org/10.2307/1912017" class="uri">http://doi.org/10.2307/1912017</a></p>
</div>
<section class="footnotes">
<hr />
<ol>
<li id="fn1"><p>In October 2011, Christopher Sims of Princeton University shared the Nobel Prize for economics with Thomas Sargent of New York University “for their empirical research on cause and effect in the macroeconomy”. <a href="http://www.voxeu.org/article/why-christopher-sims-won-nobel-prize" target="_blank">Read more…</a><a href="#fnref1">↩</a></p></li>
</ol>
</section>
<p><a href="/time-series-models-var-model-definition/">Time series models: VAR model definition</a> was originally published by Andrey Kostenko at <a href="">Andrey Kostenko Analytics</a> on September 30, 2015.</p>
<![CDATA[Time series models: ARIMA model definition]]>/time-series-models-arima-model-definition2015-09-30T00:00:00+00:002015-09-30T00:00:00+00:00Andrey Kostenkoakoste01@gmail.com
<p>The Autoregressive Integrated Moving Average (ARIMA) family of models has now been a workhorse of time series forecasting for almost 50 years. The acronym indicates that the model is composed of three terms: an autoregressive (AR) term, a moving-average (MA) term, and an integration (I) term that accounts for the non-stationarity of the time series. The three terms combined constitute one of the most widely used discrete-time dynamic models.</p>
<h1 id="arima-model-definition">ARIMA model definition</h1>
<p>Consider the three linear dynamical system models:</p>
<p>AutoRegressive (AR) model of order <span class="math inline">\(p\)</span>:</p>
<ol type="1">
<li><span class="math display">\[x_t=\phi_1 x_{t-1}+ \dots + \phi_p x_{t-p}+\omega_t\]</span></li>
</ol>
<p>Moving Average (MA) model of order <span class="math inline">\(q\)</span>:</p>
<ol start="2" type="1">
<li><span class="math display">\[x_t=\theta_1 \omega_{t-1}+ \dots + \theta_q \omega_{t-q}+\omega_t\]</span></li>
</ol>
<p>AutoRegressive Moving Average (ARMA) model of order <span class="math inline">\(p\)</span> and <span class="math inline">\(q\)</span>:</p>
<ol start="3" type="1">
<li><span class="math display">\[x_t=\phi_1 x_{t-1}+ \dots + \phi_p x_{t-p} + \theta_1 \omega_{t-1}+ \dots + \theta_q \omega_{t-q}+\omega_t\]</span></li>
</ol>
<p>where <span class="math inline">\(\omega_t\)</span> is Gaussian white noise with zero mean, constant variance <span class="math inline">\(\sigma^2_\omega\)</span> and zero autocovariances. In the AR model, the current value of the process, <span class="math inline">\(x_t\)</span>, is expressed as a finite linear aggregate of previous values of the process and white noise <span class="math inline">\(\omega_t\)</span>. In the MA model, the current value of the process, <span class="math inline">\(x_t\)</span>, is a finite linear aggregate of previous values of the white noise, <span class="math inline">\(\omega_{t-1}\)</span>, <span class="math inline">\(\omega_{t-2}\)</span>, …, <span class="math inline">\(\omega_{t-q}\)</span>, plus the current value of the noise <span class="math inline">\(\omega_t\)</span>. In the ARMA model, <span class="math inline">\(x_t\)</span> is expressed as the sum of a finite linear aggregate of previous values of the process and the past and current white noise inputs. The ARMA process is the most flexible of the three, in the sense that it will typically require fewer parameters than a pure AR or MA model of comparable fit when used to forecast time series data.</p>
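<p>To make these definitions concrete, here is a minimal Python sketch (illustrative only, with arbitrary parameter values; not part of the original post) that generates an ARMA(1,1) series, the <span class="math inline">\(p=q=1\)</span> case of model (3), from Gaussian white noise:</p>

```python
import numpy as np

# ARMA(1,1): x_t = phi * x_{t-1} + theta * w_{t-1} + w_t
rng = np.random.default_rng(42)
phi, theta, n = 0.6, 0.3, 500
w = rng.standard_normal(n)       # Gaussian white noise input
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + theta * w[t - 1] + w[t]

# Past values feed into the present, so neighbouring observations correlate.
r1 = np.corrcoef(x[:-1], x[1:])[0, 1]
print(round(r1, 2))
```

<p>The sample lag-1 autocorrelation comes out clearly positive, unlike that of the white noise input itself.</p>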
<p>As detailed in <span class="citation" data-cites="Holden95">Holden (1995)</span>, that any stationary process can be uniquely represented by a possibly infinite MA process is due to <span class="citation" data-cites="bWold1938">Wold (1938, Theorem 7)</span>. That is, ignoring any deterministic component that can be removed by subtraction, (1) can be written as</p>
<p><span class="math display">\[x_t=(1+\theta_1L+\theta_2L^2 + \dots)\omega_t=\theta(L)\omega_t,\]</span></p>
<p>where <span class="math inline">\(L\)</span> is the lag operator, often called the backshift operator <span class="math inline">\(B\)</span>, such that <span class="math inline">\((1-L^k)x_t\equiv x_t-x_{t-k}\)</span>. That any stationary process can be uniquely represented by a possibly infinite AR process is due to <span class="citation" data-cites="bWhittle1963">Whittle (1963, p. 21)</span>. Again, ignoring any deterministic component, (2) can be written as</p>
<p><span class="math display">\[\omega_t=(1-\phi_1L-\phi_2L^2 - \phi_3L^3 - \dots)x_t=\phi(L)x_t.\]</span></p>
<p>Equation (3), which can be written as</p>
<p><span class="math display">\[\phi(L)x_t=\theta(L)\omega_t \to \frac{\phi(L)}{\theta(L)}x_t=\omega_t,\]</span></p>
<p>provides the groundwork for the ARMA models of <span class="citation" data-cites="bBJ1976">Box & Jenkins (1976)</span> (first published in 1970, after a series of working papers).</p>
<p>For time series that are not stationary, <span class="citation" data-cites="bBJ1976">Box & Jenkins (1976)</span> popularised the idea that differencing a non-stationary process <span class="math inline">\(d\)</span> times can reduce it to a (nearly) stationary process so that the ARMA modelling framework above applies. That is, if <span class="math inline">\(y_t\)</span> is non-stationary, stationarity can often be achieved by taking differences <span class="math inline">\(x_t=(1-L)^dy_t\)</span>. Note that in practice <span class="math inline">\(d\)</span> is rarely greater than 2.</p>
<p>The general (non-seasonal) ARIMA process is commonly referred to as ARIMA(<span class="math inline">\(p,d,q\)</span>), where <span class="math inline">\(d\)</span> is the order of integration of the process, <span class="math inline">\(p\)</span> is the number of autoregressive parameters and <span class="math inline">\(q\)</span> is the number of moving average parameters. Special cases follow naturally: ARIMA(<span class="math inline">\(p,0,q\)</span>) <span class="math inline">\(\equiv\)</span> ARMA(<span class="math inline">\(p,q\)</span>), ARIMA(<span class="math inline">\(p,0,0\)</span>) <span class="math inline">\(\equiv\)</span> AR(<span class="math inline">\(p\)</span>), ARIMA(<span class="math inline">\(0,0,q\)</span>) <span class="math inline">\(\equiv\)</span> MA(<span class="math inline">\(q\)</span>), ARIMA(<span class="math inline">\(0,1,0\)</span>) is a random walk process and ARIMA(<span class="math inline">\(0,0,0\)</span>) is a white noise. Further details are available in almost any book on time series forecasting and zillions of papers.</p>
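<p>Two of the special cases above fit in a few lines of Python (an illustrative sketch, not from the original post): an ARIMA(0,1,0) random walk is just integrated ARIMA(0,0,0) white noise, so differencing once recovers the white-noise input exactly.</p>

```python
import numpy as np

# ARIMA(0,1,0): y_t = y_{t-1} + w_t, i.e. a random walk
rng = np.random.default_rng(1)
w = rng.standard_normal(500)     # ARIMA(0,0,0): white noise
y = np.cumsum(w)                 # integrate once (d = 1)
x = np.diff(y, prepend=0.0)      # apply (1 - L): first differences
print(bool(np.allclose(x, w)))   # -> True
```

<p>In practice <span class="math inline">\(d\)</span> is chosen so that the differenced series looks stationary, after which the ARMA machinery applies to <span class="math inline">\(x_t\)</span>.</p>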
<h1 id="references" class="references unnumbered">References</h1>
<div id="ref-bBJ1976">
<p>Box, G. E. P., & Jenkins, G. M. (1976). <em>Time series analysis, forecasting and control</em>. San Francisco: Holden-Day.</p>
</div>
<div id="ref-Holden95">
<p>Holden, K. (1995). Vector autoregression modeling and forecasting. <em>Journal of Forecasting</em>, <em>14</em>(3), 159–166. <a href="http://doi.org/10.1002/for.3980140302" class="uri">http://doi.org/10.1002/for.3980140302</a></p>
</div>
<div id="ref-bWhittle1963">
<p>Whittle, P. (1963). <em>Prediction and regulation</em>. London: English University Press.</p>
</div>
<div id="ref-bWold1938">
<p>Wold, H. (1938). <em>A study in the analysis of stationary time series</em>. Uppsala: Almqvist & Wiksell.</p>
</div>
<p><a href="/time-series-models-arima-model-definition/">Time series models: ARIMA model definition</a> was originally published by Andrey Kostenko at <a href="">Andrey Kostenko Analytics</a> on September 30, 2015.</p>
<![CDATA[Types of forecast: ex ante vs ex post]]>/types-of-forecasts-ex-ante-vs-ex-post2015-09-06T00:00:00+00:002015-09-06T00:00:00+00:00Andrey Kostenkoakoste01@gmail.com
<p>An ex ante forecast is based solely on information available at the time of the forecast, whereas an ex post forecast uses information beyond the time at which the forecast is made. Let’s discuss the two in more detail, as in different contexts the terms may mean slightly different things.</p>
<style type="text/css">
blockquote {font-family: Comic Sans MS; font-style: italic;font-size: 15px;}
</style>
<h1 id="introduction">Introduction</h1>
<p>Only one thing is true about forecasts: they are always wrong. Any forecast <span class="math inline">\(\hat{Y}_t\)</span> of the actual value <span class="math inline">\(Y_t\)</span> is associated with the forecast error <span class="math inline">\(e_t = Y_t - \hat{Y}_t\)</span>. The art and science of forecasting is all about bringing these forecast errors as close to zero as possible for all future values of <span class="math inline">\(t\)</span>. Doing this first for some of the past values of <span class="math inline">\(t\)</span> is often a useful exercise for developing a better forecasting model. It is this distinction between future and past values of <span class="math inline">\(t\)</span> that underlies the distinction between two types of forecast: ex post forecasts and ex ante forecasts.</p>
<p>The terms <em>ex ante</em> and <em>ex post</em> are among those few words in the forecasting literature that a general reader is unlikely to understand without first learning what they in fact mean. To see how the term <em>ex ante</em> has been used in the forecasting literature, check out the following excerpts:</p>
<blockquote style="font-family: Comic Sans MS; font-size: 14px;">
<ol type="1">
<li>Empirical results show that this procedure selects models that give reasonable <strong>ex ante</strong> forecast accuracy.</li>
<li>The models allowing for serial correlation are shown to have the best <strong>ex ante</strong> forecasting performance.</li>
<li>I looked primarily for studies that used real data to compare the <strong>ex ante</strong> forecasting accuracy of alternative methods.</li>
<li>They examined the potential benefits of <strong>ex ante</strong> rules when contrasted with model selection based on within-sample model fitting.</li>
<li>It is not clear to me from reading the paper if the resulting forecasts are true <strong>ex ante</strong> forecasts.</li>
<li>The forecasts were tested on a holdout <strong>ex ante</strong> sample that was known only to the administrator of the competition.</li>
<li>It may be, for example, that forecasters are using future information, perhaps inadvertently, so that forecasts are not <strong>ex ante</strong>, and I have known several cases where further study showed that a method was not as good as first suggested.</li>
<li>The NN3 competition evaluates the <strong>ex ante</strong> accuracy of forecasting the next 18 observations on two homogeneous sets of 111 or 11 time series of varying length and time series patterns on multiple established error metrics.</li>
<li>No matter how honest our efforts, and no matter how generous our colleagues, as long as comparisons were not genuinely <strong>ex ante</strong> a measure of doubt about our results was inevitable.</li>
</ol>
</blockquote>
<p>In general, ex ante is often used to mean ‘before-the-fact’ and ex post as ‘after-the-fact’. In the forecasting literature, an ex ante forecast is said to be any forecast that uses only information available at the time of the forecast. When a forecast prepared at a certain time uses information available after that time, it is said to be an ex post forecast. A typical example of the latter is when known (actual rather than projected) values of external variables (regressors, predictors or drivers) are used in producing forecasts for the hold-out part of a time series. While ex post forecasts may be useful for exploring the properties of the forecasting model, it is ex ante forecasts (and the associated accuracy) that are ultimately important.</p>
<h1 id="forecasting-univariate-time-series">Forecasting univariate time series</h1>
<p>In the context of univariate time series forecasting, the difference between ex ante and ex post forecasts is well explained by <span class="citation" data-cites="bChan2010">Chan (2010)</span>. The remainder of this paragraph, including the quote below, follows him almost verbatim.</p>
<blockquote>
<p>The ex post forecasts are made when the “future” observations are known during the forecasting period. It is used as a means to check against known data so that the forecasting model can be evaluated. The ex ante forecasts are made beyond the present, when the future observations are not available for checking.</p>
</blockquote>
<p>Suppose that we observe <span class="math inline">\(\{Y_1,\dots,Y_n\}\)</span>, so we may use <span class="math inline">\(\{Y_1,\dots,Y_t\}\)</span> for <span class="math inline">\(t < n\)</span> to estimate a model and use the estimated model to forecast <span class="math inline">\(\{Y_{t+1},\dots,Y_n\}\)</span>. These are ex post forecasts since we can use them to compare against the observed <span class="math inline">\(\{Y_{t+1},\dots,Y_n\}\)</span>. The estimation period in this case is <span class="math inline">\(t\)</span>. On the other hand, when we forecast <span class="math inline">\(\{Y_{n+1},\dots,Y_{n+h}\}\)</span> for <span class="math inline">\(h > 0\)</span>, we are doing ex ante forecasts. After fitting a model, we estimate a future value <span class="math inline">\(Y_{n+h}\)</span> at time <span class="math inline">\(n\)</span> by <span class="math inline">\(\hat{Y}_n(h)\)</span> based on the fitted model, while the actual value of <span class="math inline">\(Y_{n+h}\)</span> is unknown.</p>
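<p>The split just described can be sketched in a few lines of Python (a toy example with an invented series and a simple historical-mean forecaster; not from Chan’s book):</p>

```python
import numpy as np

# Observe Y_1, ..., Y_n; estimate on the first t values and make
# ex post forecasts of Y_{t+1}, ..., Y_n, scored against known actuals.
rng = np.random.default_rng(7)
n, t = 100, 80
Y = 10 + rng.standard_normal(n)            # toy series with true mean 10
train, holdout = Y[:t], Y[t:]
forecast = np.full(n - t, train.mean())    # \hat{Y} for the hold-out period
ex_post_errors = holdout - forecast        # e = Y - \hat{Y}
print(round(float(np.mean(ex_post_errors ** 2)), 2))
```

<p>Forecasting <span class="math inline">\(Y_{n+1},\dots,Y_{n+h}\)</span> from all <span class="math inline">\(n\)</span> observations would be the ex ante counterpart: the same mechanics, but with no actuals to subtract.</p>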
<p>Somewhat differently from the above, in the context of univariate time series the term <em>ex post</em> forecast is sometimes used synonymously with ‘model fit’ (i.e. small ex post forecast errors are associated with models that fit the data well). For example, according to <span class="citation" data-cites="GM88a">Gardner & McKenzie (1988)</span>, in the 1960s and 1970s, the period of early theoretical development in quantitative forecasting, it was expected that</p>
<blockquote>
<p>The better the fit (ex post) of the forecasting model, the better the accuracy (ex ante) should be.</p>
</blockquote>
<p>It is now well known that the above statement is in general false. <span class="citation" data-cites="Ledolter10">Ledolter (2010)</span> also uses the term <em>ex post</em> when referring to the in-sample forecast performance:</p>
<blockquote>
<p>If sufficient observations are available, one should divide the series into two parts, derive estimates of the parameters that are necessary to construct the forecasts from the first part, and evaluate the accuracy of the ex ante forecasts from the observations in the holdout period. This is important, since models that fit the data well (i.e., have small ex post forecast errors) sometimes perform badly in forecasting, and vice versa.</p>
</blockquote>
<h1 id="forecasting-with-external-variables">Forecasting with external variables</h1>
<p>In the context of time series with external variables, ex post forecasts are most commonly forecasts made with knowledge of the external variables. This type of forecast can be useful in quantifying the sources of forecast uncertainty, when answering questions such as what part of a forecast error is due to a poor forecast of temperature and what part should be attributed to the forecasting model itself being imperfect. Quantifying the effect of errors due to the need to use estimated future values of external variables, rather than their actual or observed values, can often help improve the forecasting model. For example, <span class="citation" data-cites="HF10">Hyndman & Fan (2010)</span> write:</p>
<blockquote>
<p>Specifically, ex ante forecasts are the forecasts made in advance using whatever information is available at the time. On the other hand, ex post forecasts are those that are made using information on the “driver variables” that is only known after the event being forecast. The difference between the ex ante forecasts and ex post forecasts will provide a measure of the effectiveness of the model for forecasting (taking out the effect of the forecast errors in the input variables).</p>
</blockquote>
<p>It is in this latter context that the distinction between <em>ex post</em> and <em>ex ante</em> forecasts is most useful. The reader of the forecasting literature, however, should be aware that some well-established names in the field may use the terms simply to distinguish between in-sample forecasts (made as part of the model fitting process) and out-of-sample forecasts (made as part of the forecast evaluation process).</p>
<h1 id="references" class="references unnumbered">References</h1>
<div id="ref-bChan2010">
<p>Chan, N. H. (2010). <em>Time series applications to finance with R and S-Plus</em> (2nd ed.). John Wiley & Sons. <a href="http://www.sta.cuhk.edu.hk/nhchan" class="uri">http://www.sta.cuhk.edu.hk/nhchan</a></p>
</div>
<div id="ref-GM88a">
<p>Gardner, E. S., & McKenzie, E. (1988). Model identification in exponential smoothing. <em>Journal of the Operational Research Society</em>, <em>39</em>(9), 863–867. <a href="http://doi.org/10.2307/2583529" class="uri">http://doi.org/10.2307/2583529</a></p>
</div>
<div id="ref-HF10">
<p>Hyndman, R. J., & Fan, S. (2010). Density forecasting for long-term peak electricity demand. <em>IEEE Transactions On Power Systems</em>, <em>25</em>(2), 1142–1153. <a href="http://doi.org/10.1109/TPWRS.2009.2036017" class="uri">http://doi.org/10.1109/TPWRS.2009.2036017</a></p>
</div>
<div id="ref-Ledolter10">
<p>Ledolter, J. (2010). Prediction and forecasting. <em>Encyclopedia of Statistical Sciences</em>. <a href="http://doi.org/10.1002/0471667196.ess2046.pub3" class="uri">http://doi.org/10.1002/0471667196.ess2046.pub3</a></p>
</div>
<p><a href="/types-of-forecasts-ex-ante-vs-ex-post/">Types of forecast: ex ante vs ex post</a> was originally published by Andrey Kostenko at <a href="">Andrey Kostenko Analytics</a> on September 06, 2015.</p>
<![CDATA[Outliers and the correlation coefficient]]>/outliers-and-the-correlation-coefficient2015-09-05T00:00:00+00:002015-09-05T00:00:00+00:00Andrey Kostenkoakoste01@gmail.com
<p>I was asked to illustrate how outliers can affect the standard sample correlation coefficient and show how the use of robust measures of correlation (association) could help when there is a need to automate the analysis. The post may be of interest to people with little background in statistics or data analysis.</p>
<h1 id="outliers-and-the-correlation-coefficient">Outliers and the correlation coefficient</h1>
<p>Let’s begin with a dataset showing highway death rates (per 100 million vehicle miles of travel) and maximum speed limits in 10 countries, as previously published by <span class="citation" data-cites="Rivkin86">Rivkin (1986)</span>. The dataset is likely to be from the time when the United States had a maximum speed limit of 55 miles per hour.</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">cs</span><span class="o"><-</span><span class="n">c</span><span class="p">(</span><span class="s2">"Norway"</span><span class="p">,</span><span class="s2">"United States"</span><span class="p">,</span><span class="s2">"Finland"</span><span class="p">,</span><span class="s2">"Britain"</span><span class="p">,</span><span class="s2">"Denmark"</span><span class="p">,</span>
<span class="s2">"Canada"</span><span class="p">,</span><span class="s2">"Japan"</span><span class="p">,</span><span class="s2">"Australia"</span><span class="p">,</span><span class="s2">"Netherlands"</span><span class="p">,</span><span class="s2">"Italy"</span><span class="p">)</span>
<span class="n">dr</span><span class="o"><-</span><span class="n">c</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="m">3.3</span><span class="p">,</span><span class="m">3.4</span><span class="p">,</span><span class="m">3.5</span><span class="p">,</span><span class="m">4.1</span><span class="p">,</span><span class="m">4.3</span><span class="p">,</span><span class="m">4.7</span><span class="p">,</span><span class="m">4.9</span><span class="p">,</span><span class="m">5.1</span><span class="p">,</span><span class="m">6.1</span><span class="p">)</span>
<span class="n">sl</span><span class="o"><-</span><span class="n">c</span><span class="p">(</span><span class="m">55</span><span class="p">,</span><span class="m">55</span><span class="p">,</span><span class="m">55</span><span class="p">,</span><span class="m">70</span><span class="p">,</span><span class="m">55</span><span class="p">,</span><span class="m">60</span><span class="p">,</span><span class="m">55</span><span class="p">,</span><span class="m">60</span><span class="p">,</span><span class="m">60</span><span class="p">,</span><span class="m">75</span><span class="p">)</span>
<span class="n">df</span><span class="o"><-</span><span class="n">data.frame</span><span class="p">(</span><span class="n">death_rate</span><span class="o">=</span><span class="n">dr</span><span class="p">,</span><span class="n">speed_limit</span><span class="o">=</span><span class="n">sl</span><span class="p">)</span>
<span class="n">rownames</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="o"><-</span><span class="n">cs</span>
<span class="n">df</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">## death_rate speed_limit
## Norway 3.0 55
## United States 3.3 55
## Finland 3.4 55
## Britain 3.5 70
## Denmark 4.1 55
## Canada 4.3 60
## Japan 4.7 55
## Australia 4.9 60
## Netherlands 5.1 60
## Italy 6.1 75</code></pre>
</figure>
<p>Among the questions that can be asked here are whether there is a correlation between speed limit and death rate and whether lower speed limits reduce the highway death rate.</p>
<p>It can be shown that the sample correlation coefficient for death rate and speed limit is 0.55</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">with</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">cor.test</span><span class="p">(</span><span class="n">death_rate</span><span class="p">,</span><span class="n">speed_limit</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">##
## Pearson's product-moment correlation
##
## data: death_rate and speed_limit
## t = 1.8546, df = 8, p-value = 0.1008
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1241625 0.8756458
## sample estimates:
## cor
## 0.54833</code></pre>
</figure>
<p>but if Italy alone is removed, it drops to 0.098</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">with</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="m">-10</span><span class="p">,],</span><span class="n">cor.test</span><span class="p">(</span><span class="n">death_rate</span><span class="p">,</span><span class="n">speed_limit</span><span class="p">))</span> <span class="err">#</span><span class="n">or</span> <span class="n">df</span><span class="p">[</span><span class="n">rownames</span><span class="p">(</span><span class="n">df</span><span class="p">)</span><span class="o">!=</span><span class="s2">"Italy"</span><span class="p">,]</span> </code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">##
## Pearson's product-moment correlation
##
## data: death_rate and speed_limit
## t = 0.26013, df = 7, p-value = 0.8022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6056285 0.7154765
## sample estimates:
## cor
## 0.09784921</code></pre>
</figure>
<p>and if Britain is then removed as well, it jumps to 0.70</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">with</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="o">-</span><span class="n">c</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="m">10</span><span class="p">),],</span><span class="n">cor.test</span><span class="p">(</span><span class="n">death_rate</span><span class="p">,</span><span class="n">speed_limit</span><span class="p">))</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">##
## Pearson's product-moment correlation
##
## data: death_rate and speed_limit
## t = 2.3869, df = 6, p-value = 0.05425
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.01332986 0.94019351
## sample estimates:
## cor
## 0.6978986</code></pre>
</figure>
<p>By removing only Britain, one can get a sample correlation as high as 0.81, and unlike in all the above cases, the coefficient is statistically significant:</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">with</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="m">-4</span><span class="p">,],</span><span class="n">cor.test</span><span class="p">(</span><span class="n">death_rate</span><span class="p">,</span><span class="n">speed_limit</span><span class="p">))</span> </code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">##
## Pearson's product-moment correlation
##
## data: death_rate and speed_limit
## t = 3.7102, df = 7, p-value = 0.007553
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3267378 0.9594924
## sample estimates:
## cor
## 0.8141862</code></pre>
</figure>
<p>The above shows that outliers can easily deflate or inflate the sample correlation coefficient. Usually, an outlier that is consistent with the trend of the vast majority of the data will inflate the correlation (see the bottom-right quadrant of Fig. 1 below), and an outlier that is not consistent with the rest of the data can substantially decrease the correlation (see the top-right quadrant).</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">par</span><span class="p">(</span><span class="n">mfrow</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">2</span><span class="p">))</span>
<span class="n">plot</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">main</span><span class="o">=</span><span class="s2">"all ten countries present: r=0.55"</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="m">-10</span><span class="p">,],</span><span class="n">main</span><span class="o">=</span><span class="s2">"only Italy removed: r=0.098"</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="o">-</span><span class="n">c</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="m">10</span><span class="p">),],</span><span class="n">main</span><span class="o">=</span><span class="s2">"Italy and Britain removed: r=0.70"</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">df</span><span class="p">[</span><span class="m">-4</span><span class="p">,],</span><span class="n">main</span><span class="o">=</span><span class="s2">"only Britain removed: r=0.81"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<figcaption>
<div style="text-align: center !important;
font-style: italic;
text-indent: 0;
">
Figure 1. The effect of outliers on the sample correlation
</div>
</figcaption>
<img src="/figures/Outliers_and_the_correlation_coefficient_plot1-1.svg">
</figure>
<p>This high sensitivity of the sample correlation coefficient to outliers is well known and may complicate an analysis that could otherwise easily be automated. There are generally two strategies here: 1) remove the outliers prior to using the standard correlation coefficient, or 2) use measures of correlation and association that are robust to outliers. Of course, the two strategies can always be combined, but when you need to automate the analysis, using robust measures alone is often the preferred approach.</p>
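<p>To illustrate the second strategy on the same death-rate data, here is a Python sketch (the post’s own analysis is in R); Spearman’s rank correlation is computed as Pearson’s coefficient applied to midranks, with the helper <em>midranks</em> being our own illustrative implementation:</p>

```python
import numpy as np

def midranks(v):
    # ranks 1..n, with tied values sharing the average of their ranks
    v = np.asarray(v, dtype=float)
    order = np.argsort(v, kind="stable")
    ranks = np.empty(len(v))
    i = 0
    while i < len(v):
        j = i
        while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

dr = np.array([3, 3.3, 3.4, 3.5, 4.1, 4.3, 4.7, 4.9, 5.1, 6.1])  # death rates
sl = np.array([55, 55, 55, 70, 55, 60, 55, 60, 60, 75])          # speed limits
pearson = np.corrcoef(dr, sl)[0, 1]                  # the post's 0.55
spearman = np.corrcoef(midranks(dr), midranks(sl))[0, 1]
print(round(pearson, 2), round(spearman, 2))
```

<p>Because Spearman’s coefficient depends only on ranks, moving Italy’s death rate far higher would not change it at all, whereas the Pearson coefficient would keep climbing.</p>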
<h1 id="robust-alternatives">Robust alternatives</h1>
<p>It will be convenient to introduce the robust versions of Pearson’s product-moment correlation coefficient using another dataset, one that records average brain and body weights for 28 species of land animals. The R package MASS has a number of datasets useful for demonstrating various aspects of statistical methods; to see them all, just type <em>data(package="MASS")</em>.</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">require</span><span class="p">(</span><span class="n">MASS</span><span class="p">)</span>
<span class="n">par</span><span class="p">(</span><span class="n">mfrow</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="m">2</span><span class="p">),</span><span class="n">mar</span><span class="o">=</span><span class="n">c</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="m">4</span><span class="p">,</span><span class="m">2</span><span class="p">,</span><span class="m">2</span><span class="p">))</span>
<span class="n">plot</span><span class="p">(</span><span class="n">Animals</span><span class="o">$</span><span class="n">body</span><span class="p">,</span><span class="n">Animals</span><span class="o">$</span><span class="n">brain</span><span class="p">)</span>
<span class="n">plot</span><span class="p">(</span><span class="n">Animals</span><span class="o">$</span><span class="n">body</span><span class="p">,</span><span class="n">Animals</span><span class="o">$</span><span class="n">brain</span><span class="p">,</span><span class="n">log</span><span class="o">=</span><span class="s2">"xy"</span><span class="p">)</span></code></pre>
</figure>
<figure>
<figcaption>
<div style="text-align: center !important;
font-style: italic;
text-indent: 0;
">
Figure 2. Linear and log-log plots of the data
</div>
</figcaption>
<img src="/figures/Outliers_and_the_correlation_coefficient_plot2-1.svg">
</figure>
<p>The linear plot on the left is not very informative because of a few extreme outliers (the dinosaurs and the elephants), but the log-log plot shows a rather strong correlation between the two weights, which can be readily quantified with R’s <em>cor.test()</em> function. Without resorting to attach(), there are three common ways to call the function on a dataset:</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">cor.test</span><span class="p">(</span><span class="n">Animals</span><span class="o">$</span><span class="n">body</span><span class="p">,</span><span class="n">Animals</span><span class="o">$</span><span class="n">brain</span><span class="p">)</span>
<span class="n">with</span><span class="p">(</span><span class="n">Animals</span><span class="p">,</span> <span class="n">cor.test</span><span class="p">(</span><span class="n">body</span><span class="p">,</span> <span class="n">brain</span><span class="p">))</span>
<span class="n">cor.test</span><span class="p">(</span><span class="o">~</span><span class="n">body</span><span class="o">+</span><span class="n">brain</span><span class="p">,</span><span class="n">data</span><span class="o">=</span><span class="n">Animals</span><span class="p">)</span></code></pre>
</figure>
<p>The last approach uses the so-called formula interface, for which the function has a registered S3 method. All three give</p>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">##
## Pearson's product-moment correlation
##
## data: body and brain
## t = -0.027235, df = 26, p-value = 0.9785
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3776655 0.3684700
## sample estimates:
## cor
## -0.005341163</code></pre>
</figure>
<p>showing practically zero correlation, thereby contradicting the relationship clearly visible on the log-log plot. The two can be reconciled by using measures of correlation (association) other than Pearson’s. The <em>cor.test()</em> function can also compute Spearman’s rank correlation rho and Kendall’s rank correlation tau; all that is required is overriding the default value of the <em>method</em> parameter:</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">cor.test</span><span class="p">(</span><span class="o">~</span><span class="n">body</span><span class="o">+</span><span class="n">brain</span><span class="p">,</span><span class="n">data</span><span class="o">=</span><span class="n">Animals</span><span class="p">,</span> <span class="n">method</span> <span class="o">=</span><span class="s2">"spearman"</span><span class="p">,</span><span class="n">exact</span><span class="o">=</span><span class="n">FALSE</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">##
## Spearman's rank correlation rho
##
## data: body and brain
## S = 1036.6, p-value = 1.813e-05
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.7162994</code></pre>
</figure>
<p>and</p>
<figure class="highlight">
<pre><code class="language-r" data-lang="r"><span class="n">cor.test</span><span class="p">(</span><span class="o">~</span><span class="n">body</span><span class="o">+</span><span class="n">brain</span><span class="p">,</span><span class="n">data</span><span class="o">=</span><span class="n">Animals</span><span class="p">,</span> <span class="n">method</span> <span class="o">=</span><span class="s2">"kendall"</span><span class="p">,</span> <span class="n">exact</span><span class="o">=</span><span class="n">FALSE</span><span class="p">)</span></code></pre>
</figure>
<figure class="highlight">
<pre><code class="language-text" data-lang="text">##
## Kendall's rank correlation tau
##
## data: body and brain
## z = 4.6042, p-value = 4.141e-06
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.6172191</code></pre>
</figure>
<p>The two results are now consistent with what the log-log plot suggests. The effect of <em>exact=FALSE</em> could also be achieved with the <em>suppressWarnings()</em> function, which silences the warnings about ties in the observations; ties prevent the calculation of exact p-values (so only approximations are returned), an issue of little interest to us here.</p>
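<p>For instance, the Spearman version of the test above can be run without <em>exact=FALSE</em> by wrapping the call (this reproduces the same rho estimate, with the ties warning silenced):</p>

```r
# Same Spearman test as above, with the ties warning suppressed
# instead of requesting the approximate p-value explicitly.
library(MASS)  # for the Animals data
suppressWarnings(cor.test(~ body + brain, data = Animals, method = "spearman"))
```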
<h1 id="conclusion">Conclusion</h1>
<p>To conclude, outliers can distort the output of classical statistical methods to the extent that wrong conclusions may be drawn. There are generally two remedies: detect and exclude the outliers before an analysis with standard methods, or use methods that are robust to outliers and other influential observations. A number of robust methods exist, for example, for regression modelling, of which the calculation of sample correlation coefficients can be considered a special case. The second approach may often be preferred for automated solutions, as correctly detecting and removing outliers is in most cases difficult.</p>
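<p>As one concrete example of the second approach (one option among several), the robust covariance estimator <em>cov.rob()</em> from the MASS package can return a correlation matrix that resists influential observations; the sketch below applies it to the log-transformed animal weights:</p>

```r
# A sketch of a robust correlation estimate using MASS::cov.rob(),
# which by default fits a minimum volume ellipsoid to the data.
library(MASS)
set.seed(123)   # cov.rob() uses random subsampling internally
rob <- cov.rob(log(Animals), cor = TRUE)
rob$cor         # robust correlation of log body vs log brain
```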
<p><a href="/outliers-and-the-correlation-coefficient/">Outliers and the correlation coefficient</a> was originally published by Andrey Kostenko at <a href="">Andrey Kostenko Analytics</a> on September 05, 2015.</p>
<img src="http://feeds.feedburner.com/~r/andreykostenko/~4/eNtmiSx745U" height="1" width="1" alt=""/><![CDATA[Useful SAS books short-listed]]>/Useful-SAS-books-shortlisted2015-02-13T00:00:00-00:002015-01-05T00:00:00+00:00Andrey Kostenkoakoste01@gmail.com
<p>Some time ago, Western Power’s System Forecasting team, to which I now belong, initiated an upgrade of the software used for data analysis and forecasting. Several options were considered, including SPSS Modeler and a few industry-specific products, but the final decision was to use SAS.</p>
<p>One reason for this desirable shift (R was not considered a suitable option) was to make use of advanced data techniques and models to quickly build reliable and accurate forecasts to improve business decisions. <a href="http://www.sas.com/en_us/industry/utilities.html#energy-forecasting" target="_blank"> SAS Institute claims </a> their current technologies can convert “<em>transactional meter and operational systems data into a forecast-ready format</em>”, which is exactly what the team required. However, for a corporation as big as Western Power, it may take another few years before all SAS components required to achieve this goal can be properly integrated and optimised. This time can be used to learn how to use SAS most efficiently and effectively.</p>
<p>There are many on-line resources with plenty of code available to help one learn SAS for free, e.g. <a href="http://www.ats.ucla.edu/stat/sas/modules/" title="UCLA's SAS learning modules">UCLA’s SAS learning modules</a> and <a href="http://robslink.com/SAS/Home.htm" title="Robert Allison's SAS/Graph examples">Robert Allison’s SAS/Graph examples</a>, but books published by SAS Press may remain the best learning option for beginners. It is the purpose of this post to list the books that are likely to be useful for the type of tasks the team performs now and will be expected to perform in the near future.</p>
<p>The list is not meant to be exhaustive; in particular, it does not include some older books on time series forecasting. More recent books on time series analysis, forecasting and predictive modelling are included as most relevant. The general programming books (functions, macros) are useful for developing a general understanding of, and intuition about, the particularities of the SAS programming language. The book on statistical programming with SAS/IML (Interactive Matrix Language) contains many examples of using R from within the SAS environment. The two books on SQL will help learn how to interact with the company’s databases. There is also a book devoted solely to reports, which I think should not be ignored, and a book on techniques for cleaning data.</p>
<p>The list includes 12 items, and the first few have already been purchased:</p>
<ol type="1">
<li><p><i class="fa fa-check-square-o fa-lg"></i> <a href="http://www.amazon.com/Predictive-Modeling-SAS-Enterprise-Miner/dp/1607647672" target="_blank"> </a> <strong>Predictive Modeling with SAS Enterprise Miner: Practical Solutions for Business Applications</strong> by <em>Kattamuri S. Sarma</em>, 2nd ed., 2013.</p></li>
<li><p><i class="fa fa-check-square-o fa-lg"></i> <a href="http://www.amazon.com/Applied-Data-Mining-Forecasting-Using/dp/1607646625" target="_blank"> </a> <strong>Applied Data Mining for Forecasting Using SAS</strong> by <em>Tim Rey, Arthur Kordon, and Chip Wells</em>, 2012</p></li>
<li><p><i class="fa fa-check-square-o fa-lg"></i> <a href="http://www.amazon.com/SAS-Functions-Example-Second-Edition/dp/1607643405" target="_blank"> </a> <strong>SAS Functions by Example</strong> by <em>Ron Cody</em>, 2nd ed., 2010</p></li>
<li><p><i class="fa fa-check-square-o fa-lg"></i> <a href="http://www.amazon.com/Forecasting-Time-Series-Second-Edition/dp/1590471822" target="_blank"> </a> <strong>SAS for Forecasting Time Series</strong> by <em>John C. Brocklebank and David A. Dickey</em>, 2nd ed., 2003</p></li>
<li><p><i class="fa fa-book fa-lg"></i> <a href="http://www.amazon.com/gp/product/1612901700/" target="_blank"> </a> <strong>Practical Time Series Analysis Using SAS</strong> by <em>Anders Milhoj</em>, 2013</p></li>
<li><p><i class="fa fa-book fa-lg"></i> <a href="http://www.amazon.com/PROC-REPORT-Example-Techniques-Professional/dp/1612907849" target="_blank"> </a> <strong>PROC REPORT by Example: Techniques for Building Professional Reports Using SAS</strong>, by <em>Lisa Fine</em>, 2013</p></li>
<li><p><i class="fa fa-book fa-lg"></i> <a href="http://www.amazon.com/PROC-SQL-Beyond-Basics-Edition/dp/1612900275" target="_blank"> </a> <strong>PROC SQL: Beyond the Basics Using SAS</strong> by <em>Kirk Paul Lafler</em>, 2nd ed., 2013</p></li>
<li><p><i class="fa fa-book fa-lg"></i> <a href="http://www.amazon.com/Programming-Enterprise-Guide-Second-Edition/dp/1607645289" target="_blank"> </a> <strong>SAS Programming for Enterprise Guide Users</strong> by <em>Neil Constable</em>, 2nd ed., 2010</p></li>
<li><p><i class="fa fa-book fa-lg"></i> <a href="http://www.amazon.com/Statistical-Programming-SAS-IML-Software/dp/1607646633" target="_blank"> </a> <strong>Statistical Programming with SAS/IML Software</strong> by <em>Rick Wicklin</em>, 2010</p></li>
<li><p><i class="fa fa-book fa-lg"></i> <a href="http://www.amazon.com/Codys-Cleaning-Techniques-Second-Edition/dp/1599946599" target="_blank"> </a> <strong>Cody’s Data Cleaning Techniques Using SAS</strong> by <em>Ron Cody</em>, 2nd ed., 2008</p></li>
<li><p><i class="fa fa-book fa-lg"></i> <a href="http://www.amazon.com/PROC-SQL-Example-Using-within/dp/1599942976" target="_blank"> </a> <strong>PROC SQL by Example: Using SQL within SAS</strong> by <em>Howard Schreier</em>, 2008</p></li>
<li><p><i class="fa fa-book fa-lg"></i> <a href="http://www.amazon.com/Macro-Programming-Made-Second-Edition/dp/1590478827" target="_blank"> </a> <strong>SAS Macro Programming Made Easy</strong> by <em>Michele Burlew</em>, 2nd ed., 2007</p></li>
</ol>
<div class="references">
</div>
<p><a href="/Useful-SAS-books-shortlisted/">Useful SAS books short-listed</a> was originally published by Andrey Kostenko at <a href="">Andrey Kostenko Analytics</a> on January 05, 2015.</p>
<img src="http://feeds.feedburner.com/~r/andreykostenko/~4/WLBdKKWtFNE" height="1" width="1" alt=""/><![CDATA[Motivational quotes from my PhD thesis]]>/Motivational-quotes-from-my-thesis2015-01-02T00:00:00+00:002015-01-02T00:00:00+00:00Andrey Kostenkoakoste01@gmail.com
<p>This quick post is largely for testing purposes. It shares four motivational quotes that I carefully selected for the introductory page of the front matter of my PhD thesis.</p>
<!--more-->
<p>The thesis, entitled <em>Statistical issues in modelling and forecasting sequential count data</em>, was examined by Prof James Taylor of Oxford University and Prof William Dunsmuir of the University of New South Wales. I plan to use this blog to revisit some of the main points made there, but for now here are the promised quotations in the original order.</p>
<figure class="half">
<div id="container2">
<div id="container1">
<div id="col1">
<figure>
<img src="http://upload.wikimedia.org/wikipedia/commons/9/9b/Carl_Friedrich_Gauss.jpg" alt="" />
</figure>
</div>
<div id="col2">
<p><blockquote2> It is not knowledge, but the act of learning, not possession but the act of getting there, which grants the greatest enjoyment. When I have clarified and exhausted a subject, then I turn away from it, in order to go into darkness again; the never-satisfied man is so strange; if he has completed a structure, then it is not in order to dwell in it peacefully, but in order to begin another. I imagine the world conqueror must feel thus, who, after one kingdom is scarcely conquered, stretches out his arms for others. <cite><a href="http://en.wikipedia.org/wiki/Carl_Friedrich_Gauss" target="_blank">Carl Friedrich Gauss </a></cite> </blockquote2></p>
</div>
</div>
</div>
</figure>
<figure class="half">
<div id="container2">
<div id="container1">
<div id="col1">
<figure>
<img src="http://upload.wikimedia.org/wikipedia/commons/9/9c/Sir_Winston_S_Churchill.jpg" alt="" />
</figure>
</div>
<div id="col2">
<p><blockquote2> Every day you may make progress. Every step may be fruitful. Yet there will stretch out before you an ever-lengthening, ever-ascending, ever-improving path. You know you will never get to the end of the journey. But this, so far from discouraging, only adds to the joy and glory of the climb. <cite><a href="http://en.wikipedia.org/wiki/Winston_Churchill" target="_blank">Winston Churchill </a></cite> </blockquote2></p>
</div>
</div>
</div>
</figure>
<figure class="half">
<div id="container2">
<div id="container1">
<div id="col1">
<figure>
<img src="http://upload.wikimedia.org/wikipedia/commons/4/48/Ralph_Waldo_Emerson_cph.3b20760.jpg" alt="" />
</figure>
</div>
<div id="col2">
<p><blockquote2> Finish each day and be done with it. You have done what you could. Some blunders and absurdities no doubt crept in; forget them as soon as you can. Tomorrow is a new day; you shall begin it well and serenely… <cite><a href="http://en.wikipedia.org/wiki/Ralph_Waldo_Emerson" target="_blank">Ralph Waldo Emerson</a></cite> </blockquote2></p>
</div>
</div>
</div>
</figure>
<figure class="half">
<div id="container2">
<div id="container1">
<div id="col1">
<figure>
<img src="http://upload.wikimedia.org/wikipedia/commons/a/ae/Paul_Halmos.jpeg" alt="" />
</figure>
</div>
<div id="col2">
<p><blockquote2> You can’t be perfect, but if you don’t try, you won’t be good enough. <cite><a href="http://en.wikipedia.org/wiki/Paul_Halmos" target="_blank">Paul Halmos</a></cite> </blockquote2></p>
</div>
</div>
</div>
</figure>
<p>
We all procrastinate from time to time, and my favourite way to do that while working on my thesis was to read much of what Paul Halmos wrote. The quickest way to learn what he wrote about, and how, is probably by reading <a href="http://www.ams.org/notices/200709/tx070901136p.pdf" target="_blank"> this article </a> by John Ewing.
</p>
<style>
blockquote2 {
font-family: Georgia, serif;
font-size: 18px;
font-style: italic;
margin: 0.25em 0;
padding: 0.15em 20px;
line-height: 1.45;
position: relative;
color: #383838;
}
blockquote2:before {
color: #7a7a7a;
content: "\201C";
font-size: 3em;
line-height: 0.1em;
margin-right: 0.25em;
vertical-align: -0.4em;
}
blockquote2 cite {
color: #7a7a7a;
font-size: 1em;
display: block;
margin-top: 5px;
}
blockquote2 cite:before {
content: "\2014 \2009";
}
#container2 {
clear:left;
float:left;
width:100%;
overflow:hidden;
background:#FFFFF0; /* column 2 background colour */
}
#container1 {
float:left;
width:100%;
position:relative;
right:50%;
background:#FADE3E; /* column 1 background colour */
}
#col1 {
float:left;
width:46%;
position:relative;
left:52%;
overflow:hidden;
}
#col2 {
text-align: justify;
float:left;
width:46%;
position:relative;
left:56%;
overflow:hidden;
}
</style>
<div class="references">
</div>
<p><a href="/Motivational-quotes-from-my-thesis/">Motivational quotes from my PhD thesis</a> was originally published by Andrey Kostenko at <a href="">Andrey Kostenko Analytics</a> on January 02, 2015.</p>
<img src="http://feeds.feedburner.com/~r/andreykostenko/~4/VYcaJ8xN1eA" height="1" width="1" alt=""/>