Fake climate sceptics love the hiatus, the period since the strong El Niño in 1998 where global mean temperature has not increased according to their simplistic notions of global warming. The longer the “hiatus”, the more they can deny that climate change will be a problem this century. This gives an incentive for developing methods that report the longest possible hiatus, ideally without obviously cherry-picking the start date.

Professor Ross McKitrick has a new paper in the ever so prestigious Open Journal of Statistics in which he reports that the hiatus in the HadCRUT4 global temperature record started in 1995, to the delight of several climate sceptic blogs.

McKitrick uses a regression technique that is supposed to be robust to heteroscedasticity (unequal variance) and autocorrelation to find the trend in the temperature time series. He starts with the last five years of data and tests whether the trend is statistically different from zero, i.e. whether the 95% confidence interval around the trend estimate includes zero. He then repeats this analysis with six years of data, and so on, until the 95% confidence interval does not include zero. This is declared the start of the hiatus.
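The procedure can be sketched in a few lines of R. This is only an illustration under stated assumptions: `window_trend_ci` is a hypothetical helper, the data are simulated white noise around a constant trend with made-up parameter values, and the intervals are ordinary OLS ones rather than the HAC-robust intervals the paper uses.

```r
# Hypothetical helper (not from the paper): fit an OLS trend to ever longer
# windows ending at end_year and record the confidence interval of the slope.
# ASSUMPTION: plain OLS intervals, which are NOT robust to autocorrelation;
# the paper uses Vogelsang-Franses HAC-robust intervals instead.
window_trend_ci <- function(temp, time, end_year, min_years = 5, level = 0.95) {
  starts <- seq(end_year - min_years, floor(min(time)), by = -1)
  rows <- lapply(starts, function(s) {
    keep <- time >= s & time < end_year
    fit <- lm(temp[keep] ~ time[keep])
    ci <- confint(fit, level = level)[2, ]  # row 2 is the slope
    data.frame(start = s, trend = unname(coef(fit)[2]),
               lower = unname(ci[1]), upper = unname(ci[2]))
  })
  do.call(rbind, rows)
}

# Simulated monthly series, 1970-2013, with a constant trend (assumed values).
set.seed(1)
time <- 1970 + (seq_len(44 * 12) - 0.5) / 12
temp <- 0.017 * (time - 1970) + rnorm(length(time), sd = 0.1)

res <- window_trend_ci(temp, time, end_year = 2014)
includes_zero <- res$lower < 0 & res$upper > 0

# Walk back from the shortest window until an interval first excludes zero.
first_excluding <- which(!includes_zero)[1]
print(res[seq_len(min(first_excluding, nrow(res), na.rm = TRUE)), ])
```

The rows printed are the windows examined before the interval first excludes zero; in the method described above, the start year of the last window that still includes zero would be reported as the start of the hiatus.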

But McKitrick has missed an obvious trick. If he had used the 99% confidence interval, he would have obtained a much longer hiatus and impressed the credulous even more. And if he had used the 99.9% confidence interval … This is beginning to show the problems with the method.

Typically, when testing hypotheses we are interested in rejecting the null hypothesis that there is no effect. McKitrick is interested in the converse, in accepting the null hypothesis as much as he can to make the hiatus as long as possible. So whereas normally we need to be certain that the statistical methods we are using don’t report false positives (Type I errors) more often than they are supposed to (i.e. 5% of the time at p=0.05), McKitrick needs to be certain that his test has sufficient power to reject the null hypothesis when the null hypothesis is false. He doesn’t report a power test. Instead he assumes that because his method is robust to heteroscedasticity and autocorrelation it will give good answers.

The easiest way to run a power test is to provide some simulated data with realistic properties where we know that there is an effect, in this case, that there is a constant trend in the data. McKitrick has provided code on his website. The code is written strangely, as if he is not familiar with the language (hint: `matplot`), but it is well commented and easy to run.

I’ve simulated data that has the same trend, autocorrelation (an AR(2) model) and residual variance as the HadCRUT4 data since 1970 and applied McKitrick’s method to them. I did this 100 times. Ninety-five percent of these trials show an apparent hiatus lasting at least five years even though the trend is constant. In over 70% of trials, the hiatus lasts over 10 years. In 10% of trials the apparent hiatus started in or before 1995 – the year McKitrick reports. With this method, a hiatus lasting since 1995 is not exceptional even if the true trend in the data is constant. McKitrick’s method is not a tool for measuring the length of a hiatus, it is a recipe for making one.

Note my simulations do not include heteroscedasticity, as I’m not sure how to estimate or simulate it in an autocorrelated variable. I think heteroscedasticity would tend to make the apparent hiatus seem longer.

    #load McKitrick's code into R then run this to repeat my analysis
    times<-I(year+(month-.5)/12)
    mod<-lm(HadCRUT~times, data=as.data.frame(hadley), subset=times>1970)
    arm<-ar(resid(mod))
    hadley2<-hadley
    res<-replicate(100,{
      #autocorrelation
      dummy<-predict(mod, newdata=data.frame(year=hadley[,"year"], month=hadley[,"month"]))+
        arima.sim(list(ar=arm$ar),n=length(times),sd=arm$var.pred^.5)
      hadley2[,"HadCRUT"]<-dummy
      pause_vfh = vfpause(hadley2, 1900) # Compute Pause stats using VF method
      min(pause_vfh[-(1:4),2][pause_vfh[-(1:4),3]<0])#find start of pause
    })
    res[is.infinite(res)]<-2014#assign trials with hiatus <5yr to 2014
    x11(4.5,4.5);par(mar=c(3,3,1,1), mgp=c(1.5,.5,0))
    plot(table(res), xlab=" Year hiatus starts", ylab="Number of trials", xaxt="n")
    axis(1)

McKitrick, R.R. (2014) HAC-Robust Measurement of the Duration of a Trendless Subsample in a Global Climate Time Series. Open Journal of Statistics, 4, 527-535. http://dx.doi.org/10.4236/ojs.2014.47050

Hello Richard

Thank you for your interest in my paper. Let me make a couple of observations.

I use OLS to find the trend. The HAC method is used to compute the robust confidence intervals. I can’t tell if by your phrase “supposed to be” you are dubious about the robustness of the VF method but if you look at the article cited (V&F 2005), it contains all the power curves, null rejection rates and size estimates you are seeking.

What you are referring to in this post is a null distribution around Jmax. In 100 simulations assuming AR(2) around a positive trend you show that a 1995 or earlier start date occurs 10% of the time. It would be helpful if you also verified in each of those simulations that all the conditions of the definition were met (that the trend CI includes zero across the entire time subsample and applied in both the NH and SH.) Assuming that those things are the case, and you were to get roughly the same answer in 1000 or 10,000 simulations, what you are saying is that under the assumptions of your null, a pause of 19 years is now in the lower 10% tail of the null distribution. And by the looks of it in your Figure, in another 3 years it will be in the lower 5% tail. That’s an interesting additional bit of information on the topic and I encourage you to publish it, especially if you also add in the UAH and RSS computations as well.

However the problem with this kind of estimation–and what I expect a stats journal would point out– is that if what we really want to know is whether Jmax is significantly different from zero, you need a null that assumes it is zero and works out the corresponding distribution. And the difficulty with that is the well-known ‘Davies problem’ in which the parameter to be estimated is not identified under the null. There are simulation methods for handling this problem, which Tim Vogelsang and I briefly review in our new paper comparing models and observations in the tropical troposphere, again using HAC-robust methods (http://onlinelibrary.wiley.com/doi/10.1002/env.2294/abstract). We also outline a simple bootstrap method that gets around the simulation problem, but you’d need to verify whether you need to use a block bootstrap since you have assumed an AR2 error structure. You might get a wider or narrower CI around Jmax than the one you drew above, it’s hard to tell, especially since it will likely be a non-standard distribution.

Ross,

Have you considered applying your method to different final times? In other words, repeat the entire analysis for an end year of 2013, then 2012, then 2011, then …… It would be interesting to know the years for which the trend for the preceding 15 years was statistically significant, by your definition (i.e., the 95% confidence interval around the trend does not include zero).

ATTP: No, but it would be an easy modification of the code. Presumably as the end date goes back into the early years of the current century the measured pause duration would get shorter.

I’m hoping Richard will verify his distribution diagram with code that actually implements the definition of the pause in the paper, i.e. that satisfies conditions (a) to (c). It looks to me like he just picked the longest interval with a non-significant global trend, but that doesn’t correspond to the definition in the paper. The definition I propose requires HAC-robustness, insignificance at every annual step within the subsample, and separate pauses in both the NH and SH of equal or greater length.

currently too sick to answer. back soon.

Ross,

I’ve done a sample of such tests, although not as extensively as you could do. If you were to do so, I suspect that you would find that there would not be a single year for which the past decade did not qualify as a “pause” as per your definition. In addition, there would be many years for which the past 15 years would qualify as a “pause” and, in some cases, even periods in excess of 15 years.

So, here’s the question for you. Using your method, how could you determine if we were warming? I don’t think you can. As Richard’s post illustrates, what your method does is find pauses. Given that the intrinsic variability in the data is such that the uncertainty (2 sigma) for a period of less than a decade would be well in excess of 0.2 degrees per decade, we would need to be warming at well over 0.2 degrees per decade before your method concluded that the past decade wasn’t a pause. So, given that your method can’t tell if we’re warming or not, I don’t quite see the relevance.

Of course, I’m not suggesting there’s an error in your calculations, simply that it doesn’t mean much. We’re very clearly in a period of slower warming relative to past periods. You’ve shown that to be true. Now, if you were to put it into context and consider all the other information we have, it could be interesting, but if you don’t do that, it’s really just a statistical test that tells us something we already know. Additionally, if you’re actually suggesting that we haven’t warmed for the last 16 years or more, then you’re almost certainly making a Type I error.

To put it into other words, McKitrick’s trick is excellent for detecting these:

The real problem is that most of the discussion of the statistical significance of trends on climate blogs fails to understand the real purpose of null hypothesis significance testing (NHST). The basic idea is to enforce a degree of self-skepticism on the part of the researcher by setting a (usually rather low) hurdle: they need to show that the observations are unlikely if their research hypothesis is incorrect (the null hypothesis is normally the thing the researcher does not want to be true). Those claiming that there has been a hiatus on the basis of a NHST where H0 is a flat trend are basically arguing for the null hypothesis, which entirely circumvents the self-skepticism that the NHST is intended to provide, UNLESS they also provide an analysis of the statistical power of the test.

To see why this is true, consider a two-headed coin. If we observe four flips of this coin, it will come down heads four times in a row (as there is a head on both sides). If we then perform the usual NHST for the unbiasedness of a coin, we will be unable to reject the null hypothesis H0: p(head) = p(tail) = 0.5 at the 95% level of significance, because the p-value = 0.5^4 = 0.0625. If we only flip the coin four times, we can never reject the null hypothesis, whatever the sequence of heads and tails, and the test is obviously pointless. In statistical terms, we would say that the test has zero statistical power (power = probability that H0 is rejected when H0 is false). As we observe more flips, the p-value can become smaller, and it becomes more likely that the null hypothesis can be rejected.
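The arithmetic of the coin example can be checked with a short R sketch. The function names are made up for illustration; `binom.test` is the standard exact binomial test in base R, and its default two-sided p-value also counts the all-tails outcome, so it is twice the one-sided figure quoted above.

```r
# Smallest p-value that n flips can produce: all n flips land the same way.
p_one_sided <- function(n) 0.5^n                              # P(n heads | fair coin)
p_two_sided <- function(n) binom.test(n, n, p = 0.5)$p.value  # all heads or all tails

# Power against a two-headed coin, which always gives n heads: the test
# rejects with certainty if that p-value is below alpha, and never otherwise.
power_two_headed <- function(n, alpha = 0.05) as.numeric(p_one_sided(n) < alpha)

flips <- 1:8
coin <- data.frame(
  flips = flips,
  p_one_sided = p_one_sided(flips),
  p_two_sided = sapply(flips, p_two_sided),
  power = power_two_headed(flips)
)
print(coin)
```

With four flips both p-values stay above 0.05, so the power is zero. The one-sided p-value first drops below 0.05 at five flips (0.03125), and the two-sided one at six.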

There are two reasons why we could fail to reject the null hypothesis: (i) H0 is actually true, (ii) H0 is false, but we don’t have enough data to provide the evidence suggesting that it is wrong. In this case, it is a fundamental error to call a period “trendless” because the magnitude of the trend is not statistically significant, using a NHST where the H0 is a flat trend. This is because we have not established that the reason for the failure to reject H0 is (i) rather than (ii). Without performing a test of statistical power to show that there is enough data to be confident of rejecting the null when it is false, there is little statistical basis for claiming that there is no trend (because you are arguing for the H0).
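A rough simulation shows how case (ii) arises for short records. This is a sketch under loud assumptions: `trend_power` is a hypothetical function, the data are annual values with independent (white) noise, and the trend and noise level are illustrative rather than estimated from any temperature record. Real temperature series are autocorrelated, which would reduce power further.

```r
# Hypothetical power simulation: how often does an OLS trend test reject
# H0: trend = 0 when the true trend is positive and constant?
# ASSUMPTIONS: annual data, white noise, made-up trend and noise values.
set.seed(42)
trend_power <- function(n_years, trend = 0.017, noise_sd = 0.1,
                        n_sim = 1000, alpha = 0.05) {
  years <- seq_len(n_years)
  rejected <- replicate(n_sim, {
    temp <- trend * years + rnorm(n_years, sd = noise_sd)
    p <- summary(lm(temp ~ years))$coefficients["years", "Pr(>|t|)"]
    p < alpha
  })
  mean(rejected)  # fraction of simulations in which the true trend is detected
}

record_lengths <- c(5, 10, 15, 20, 30)
power <- sapply(record_lengths, trend_power)
print(data.frame(record_lengths, power))
```

Under these assumptions the short windows fail to reject H0 in most simulations even though the trend is real by construction, so a non-significant result on its own says little about whether the period is trendless.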

However, performing an analysis of statistical power is not all that straightforward, which I suspect is why climate skeptics never attempt it. There is an easier solution, though: if you want to assert that there has been a hiatus (which is a change in the underlying rate of warming), simply use an H0 that assumes the underlying rate of warming is unchanged. This then becomes essentially a problem of breakpoint detection. A breakpoint analysis that was robust to heteroskedasticity and autocorrelation (which rules out the Chow test) would be much more satisfactory evidence for the existence of a hiatus.
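To make the idea of an "unchanged rate" null concrete, here is a minimal R sketch that tests for a change in slope at a breakpoint fixed in advance. It is deliberately simplistic and every choice in it is an assumption: the data are simulated with a constant trend, the break year is imposed rather than estimated, and the inference is plain OLS, so it is not robust to heteroskedasticity or autocorrelation in the way asked for above.

```r
# Simulated annual series with a constant trend by construction (assumed values).
set.seed(7)
year <- 1970:2013
temp <- 0.017 * (year - 1970) + rnorm(length(year), sd = 0.1)

# Hinge term: zero up to the break year, then years elapsed since the break.
# Its coefficient is the CHANGE in slope after the break, so H0 is "no change".
break_year <- 1998  # imposed for illustration, not estimated from the data
hinge <- pmax(year - break_year, 0)

fit <- lm(temp ~ year + hinge)
slope_change <- summary(fit)$coefficients["hinge", ]
print(slope_change)  # estimate, standard error, t value and p-value
```

Here failing to reject H0 means there is no evidence that the warming rate changed, which is the sceptical default for someone who wants to claim a hiatus.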

In short, your H0 should represent the hypothesis you don’t want to be true in order for your research hypothesis to be true. If you are arguing that the climate has warmed over some period, your H0 should be that there has been no warming. If you are arguing that there has been a change in the underlying rate of warming, your H0 should be that there has been no change. The NHST (for all its failings) at least provides a useful sanity check if you do this, but not if you take H0 as the hypothesis you are arguing for.

or of course, you could try and understand the observations, for instance by seeing if the apparent hiatus is potentially explainable by known sources of variation such as ENSO…

Yes, this is the real issue. Playing games with whether it’s 6 years or 13 or 15 years depending on how we do the statistical analysis, choose our end points, set up our null, etc, is far less interesting. If we observe a planet around some other star not moving in its orbit according to how we calculate using Kepler’s law, an auxiliary hypothesis is that an undetected planet might be modifying its orbit, and once we include that effect, we have an explanation. Thus far, the “skeptical” reaction to the hiatus would be akin to saying that we should throw out our models of planetary motion.

In this case, mis-specified forcings in the cmip5 ensemble or lack of appropriate ENSO sampling (relative to the true realization that nature decided to take) is an auxiliary hypothesis that once accounted for mostly solves the “issue,” with some technical caveats still being sorted out.

Chris,

Indeed, if we hadn’t done this, we’d currently be living in a world where Newton’s Law of Gravity would be regarded as having been falsified (well, it has – by Einstein – but that’s an extra issue) and in which we would not know of the existence of Neptune.

As I mentioned in my reply to Richard, the power properties of the VF confidence intervals are well-documented. Due to low power I don’t bother testing intervals less than 5 years because the wideness of the CI wouldn’t mean anything. The start date of the pause is not simply due to power loss. If you look at the trumpet diagram in my paper you’ll see that the CI’s don’t uniformly widen from that point on, which would be the case if the only thing going on was power loss. They also shrink over some intervals. Also, you still get a pause of 14-20 years using an AR1 model, which is biased to over-reject the null.

There are any number of null hypotheses one can test, but I think the one of most interest is trend=0. At least that’s the one I was interested in.

I agree that a breakpoint analysis would be interesting. Nothing in my paper argues otherwise, and I hope someone undertakes it, or maybe I’ll get around to it. Here I proposed a specific definition and did the measurement, as an alternative to people just guessing based on eyeball analysis. Breakpoint detection is getting more robust. Up to now the methods for breakpoint detection presupposed either IID or AR1 errors. My recent paper with Tim Vogelsang introduced a HAC-robust method that allows for a data-dependent breakpoint detection. It’s not a simple calculation, though in the case where you impose a break at a known point, a simple bootstrap procedure is available for computing p values.

“There are any number of null hypotheses one can test, but I think the one of most interest is trend=0. At least that’s the one I was interested in”

You are missing the point, the null hypothesis should be the one you *don’t want to be true*. If you are looking for a hiatus, your starting point should be that there isn’t one, and see if the observations rule out that possibility. Otherwise the NHST does not provide any element of scientific skepticism. Taking the null hypothesis to be “no effect” without justification is a practice criticised by the “null ritual” (http://library.mpib-berlin.mpg.de/ft/gg/GG_Null_2004.pdf).

This webpage (https://statistics.laerd.com/statistical-guides/hypothesis-testing-3.php) puts it nicely “The null hypothesis is essentially the “devil’s advocate” position. That is, it assumes that whatever you are trying to prove did not happen (hint: it usually states that something equals zero).” Note that it *usually* states that something equals zero, an occasion where it doesn’t is where you are trying to establish that that particular something equals zero.

“Here I proposed a specific definition and did the measurement, as an alternative to people just guessing based on eyeball analysis. ”

The standard OLS regression analysis already gives a well defined, objective alternative to eyeball analysis, and it is also meaningless without an explicit analysis of statistical power.

Not being able to reject the null hypothesis does not mean that it is true, and never has done. Being unable to reject the null hypothesis that the trend is zero does not mean the period is trendless.

I don’t think there are many that would disagree that there is an apparent hiatus, but the statistical analysis needs to be sound. As I pointed out, if you take the effects of ENSO into account (e.g. by regression) the hiatus essentially disappears, which suggests that while there is an apparent hiatus, we cannot assert that there has been a change in the underlying rate of warming (just a redistribution of heat between the surface and oceans). Statistical analysis that doesn’t take into account known physics is liable to produce a misleading answer.

Indeed. Ross could carry out a similar analysis, but instead of assuming that it’s trendless, could assume that the null hypothesis is that we’ve been warming at 0.2 degrees per decade. My quick check suggests that you cannot reject this null for either of the satellite records and that you have to go back to a start date in the 1960s before rejecting this null for the surface temperature records.

In the abstract of the paper it says: “The use of a simple AR1 trend model suggests a shorter hiatus of 14-20 years but is likely unreliable”. I should just point out that using the SkS trend calculator (which IIUC uses an AR1 model), you can get the following longest trends where zero is within the 95% CI:

UAH: 1994-present (19 years) 0.141 ±0.155 °C/decade (2σ)

RSS: 1990-present (23 years) 0.114 ±0.124 °C/decade

GISTEMP: 1996-present (17 years) 0.107 ±0.110 °C/decade (2σ)

NOAA: 1995-present (18 years) 0.088 ±0.097 °C/decade (2σ)

HadCRUT4: 1995-present (18 years) 0.093 ±0.100 °C/decade (2σ)

Note the UAH data is hardly “trendless” given that the trend for the whole dataset is actually *less* than that (0.138 ±0.069 °C/decade)!
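The intervals implied by the figures listed above can be tabulated with a few lines of R. The numbers are copied from the list; the column names and the `ci_contains` helper are made up for illustration.

```r
# Trends and 2-sigma half-widths as quoted in the list above (degrees C per decade).
trends <- data.frame(
  dataset    = c("UAH", "RSS", "GISTEMP", "NOAA", "HadCRUT4"),
  trend      = c(0.141, 0.114, 0.107, 0.088, 0.093),
  half_width = c(0.155, 0.124, 0.110, 0.097, 0.100)
)
trends$lower <- trends$trend - trends$half_width
trends$upper <- trends$trend + trends$half_width

# Hypothetical helper: does each interval contain a given trend value?
ci_contains <- function(value) trends$lower <= value & trends$upper >= value

trends$includes_zero <- ci_contains(0)
print(trends)

# The UAH interval also contains the whole-record UAH trend of 0.138.
uah_includes_full_record <- ci_contains(0.138)[trends$dataset == "UAH"]
print(uah_includes_full_record)
```

Every interval contains zero, but each is also wide enough to contain substantial warming, which is why a non-significant trend is a poor basis for calling a period trendless.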

The documentation on the SkS trend calculator states that it uses an ARMA(1,1) trend model, as per Foster and Rahmstorf 2011 – http://iopscience.iop.org/1748-9326/6/4/044022.

Incidentally, the proposed method gives a hiatus period of 16 years for the UAH dataset and 26 years for the RSS dataset. Both of these datasets are derived from essentially the same raw satellite observations. This rather suggests that the analysis of the maximum trend length is not very stable, as small changes in the interpretation of the satellite measurements make a very large difference in the outcome. Essentially, given the other uncertainties, the method for deriving the maximum “trendless” [sic] period is not particularly important.

It is ironic that the last line of the paper ends “Overall this analysis confirms the point raised by the IPCC report [1] regarding the existence of the hiatus and adds more *precision* to the understanding of its length” [*emphasis* mine] given that the maximum durations for the two most similar datasets differ by fully 10 years!

Just to show how misleading it is for a period with a non-statistically significant trend to be described as trendless, let’s look at the RSS dataset, using woodfortrees.org, with the trend from 1988 (corresponding to the period identified by the proposed method) shown in green, the trend for the entire dataset shown in blue and the trend from the start of the dataset to 1988 in magenta.

http://woodfortrees.org/graph/rss/from/plot/rss/from:1988/trend/plot/rss/from/trend/plot/rss/to:1988/trend

Now I suspect that climate skeptic blogs will interpret “trendless” to mean that there has been no warming during the “hiatus” period identified by the test. In that case, if the 26 year hiatus period has seen no warming, that means all of the warming has to have occurred from 1979 to 1988? No, that period actually has a negative trend. I suspect the proposed method would say that the 1979 to 1988 period was also “trendless”, which would imply that either there had been no warming from 1979-2014, or there was a step change in 1988; however, just looking at the data shows that neither hypothesis is plausible. Note the 1988-2014 trend is almost identical to the 1979-2014 trend, which suggests that actually 1988 wasn’t in any way an unusual year, and clearly wasn’t the start of any “hiatus”.

Hopefully this gives an illustration of why it is not reasonable to interpret a failure to reject the null hypothesis as meaning that the period is “trendless”, and the misunderstandings this is likely to create on climate blogs. I can understand how someone could suggest there had been a hiatus since, say 1998 (as there is a suggestion of a breakpoint in 1998, largely due to the super El Niño event), but can Prof. McKitrick explain why we should think a hiatus started in 1988, when there is no real evidence of any breakpoint at that time?

It turns out, I was spot on about how climate skeptic blogs would interpret “trendless”, from WUWT:

“Professor Ross McKitrick, however, has upped the ante with a new statistical paper to say there has been no global warming for 19 years.”

http://wattsupwiththat.com/2014/09/04/global-temperature-update-no-global-warming-for-17-years-11-months/

Thank you all for your comments. I’ll reply when I can. I’m currently suffering from iritis which is making reading difficult.

I’ve now had the time to test the HAC method and it seems to behave well. I was cautious because 1) Vogelsang and Franses (2005) has received only a limited number of citations in the nine years since it was published and 2) no one has bothered to make an R function for the method, both suggesting that the statistical community has not found the method very useful.

I note that McKitrick (2014) does not demonstrate that the global temperature record is heteroskedastic and hence that a method robust to heteroskedasticity is needed.

Some of my simulations have a continuous hiatus, others have an interrupted hiatus. So the definition I have used is not identical to that in McKitrick (2014). However, the definition in McKitrick (2014) was not generated prior to seeing the data. I have little doubt that if there had been an early warm period followed by cooling and then significant warming, that the definition would have been changed. See for example the discussion of Arctic temperatures in http://www.rossmckitrick.com/uploads/4/8/0/8/4808045/letter.to.policymaker.pdf.

I want to reiterate some of the points made by dikranmarsupial above. It is doubtful that null hypothesis significance tests are a useful tool here (indeed anywhere). The null hypothesis in McKitrick (2014) is that the trend is exactly zero. However we know a priori that it is impossible for the trend to be exactly zero and therefore we know that the null hypothesis must be false and failure to reject it is a Type II error. If we cannot reject the null hypothesis, it is because we don’t have enough power. That short noisy temperature records don’t have much power does not seem surprising to me.

Failure to reject the null hypothesis is exactly that. It is not an indication that the null hypothesis should be accepted.

Trying to explain short term variability in global temperature trends is a useful task, but it requires physics. Reliance on statistics alone will generate futile arguments.

Ahh, yes, Type II error. I said Type I error in my earlier comment. Reject the null; accept the null; reject the hypothesis; accept the hypothesis; I get easily confused :-) Hope you’ve recovered from your bout of iritis.

Thanks – the ophthalmologist says the inflammation has gone, but the effects of the atropine haven’t. I’ve got yet another week of blurry vision to look forward to. I’m really amazed how persistent the effects of this drug are.

Richard – OK so you have admitted that you generated a distribution based on a careless misreading of my hiatus definition that not only fails to correspond to the one I applied, but is inaccurate in a way that exaggerates the evidence for your conclusion. Because you permit interrupted hiatuses to count, your distribution will be too wide, and the lower tail is too large. Faced with this realization you should have immediately re-done your analysis and posted a correction. For all we know the observed hiatus may be in the bottom 1% tail of a distribution accurately generated–and since you haven’t corrected your inaccurate post we have no way of knowing.

But instead you leave your inaccurate post unchanged and resort to a ridiculous smear

In other words, you are saying that if the results had looked different than they did I would have cheated by weakening the definition. Well sir, you are the one who cheated by changing the definition without telling your readers. Real classy joint you run here.