
SUPPLEMENTARY INFORMATION DOI: 10.1038/NGEO1430

Supplementary Information: Broad range of 2050 warming from an observationally constrained large ensemble

Daniel J. Rowlands1,2,3, David J. Frame1,2,4,5, Duncan Ackerley6,7, Tolu Aina1,8, Ben B. B. Booth9, Carl Christensen1, Matthew Collins10, Nicholas Faull1, Chris E. Forest11, Benjamin S. Grandey1, Edward Gryspeerdt1, Eleanor J. Highwood7, William J. Ingram1,9, Sylvia Knight12, Ana Lopez2,3, Neil Massey1,4, Frances McNamara13, Nicolai Meinshausen14, Claudio Piani15,16, Suzanne M. Rosier1,17, Benjamin M. Sanderson18, Leonard A. Smith3,19, Dáithí A. Stone20, Milo Thurston8, Kuniko Yamazaki1, Y. Hiro Yamazaki1,21, and Myles R. Allen1,2,4

1Atmospheric, Oceanic & Planetary Physics, Department of Physics, University of Oxford, Parks Road, Oxford OX1 3PU, UK. 2School of Geography and the Environment, University of Oxford, South Parks Road, Oxford OX1 3QY, UK. 3Centre for the Analysis of Time Series, London School of Economics, London WC2A 2AE, UK. 4Smith School of Enterprise and the Environment, Hayes House, 75 George St, Oxford OX1 2BQ, UK. 5Climate Change Research Institute, School of Geography, Environment and Earth Sciences, Victoria University of Wellington, Wellington 6012, New Zealand. 6Monash Weather and Climate, Monash University, Clayton, Victoria, 3800, Australia. 7Department of Meteorology, University of Reading, Earley Gate, Reading, RG6 6BB, UK. 8Oxford e-Research Centre, Keble Road, Oxford OX1 3QG, UK. 9Met Office Hadley Centre, FitzRoy Road, Exeter EX1 3PU, UK. 10College of Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter, EX4 4QJ, UK. 11Department of Meteorology, Earth and Environmental Systems Institute, Pennsylvania State University, University Park, PA 16802, USA. 12Royal Meteorological Society, Reading, RG1 7LL, UK. 13BBC Science, BBC White City, 201 Wood Lane, London W12 7TS, UK. 14Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK. 15Abdus Salam International Center for Theoretical Physics, Trieste 34151, Italy. 16The American University of Paris, Paris 75007, France. 17NIWA Wellington, 301 Evans Bay Parade, Hataitai, Wellington 6021, New Zealand. 18National Center for Atmospheric Research, 1850 Table Mesa Dr, Boulder, Colorado 80305, USA. 19Pembroke College, Oxford OX1 1DW, UK. 20Climate Systems Analysis Group, University of Cape Town, Private Bag X3, Rondebosch, Cape Town, South Africa. 21School of Geography, Politics and Sociology, Newcastle University, Newcastle upon Tyne, NE1 7RU, UK.

NATURE GEOSCIENCE | www.nature.com/naturegeoscience © 2012 Macmillan Publishers Limited. All rights reserved.

Supplementary Information

Model description and flux adjustments

HadCM3L 1 is a version of the UK Met Office Unified Model using a horizontal grid of 3.75° longitude by 2.5° latitude, with 19 levels in the vertical in the atmosphere and 20 in the ocean. The model contains an interactive sulphur cycle, simulating the direct and first indirect effects 2.

Typically AOGCMs require long spin-up periods to reach a stable equilibrium, and drifts can occur when atmospheric and oceanic components are coupled together. A technique has been developed to allow a large number of drift-free coupled model simulations to be produced, with no need for a new ocean spin-up when the fast components of the model (atmosphere, land-surface scheme) are perturbed 3. Ten versions of the HadCM3L ocean model coupled to the standard atmosphere are spun up for 200 years, and the necessary flux adjustments corresponding to the climate around 1920 are calculated. Additional flux adjustments arising from atmospheric parameter perturbations are then calculated from the heat flux adjustment (HFA) fields associated with each of the 153 atmospheric parameter configurations when coupled to a slab ocean. Each HFA field is then expressed as a difference from the standard atmospheric physics HFA and, under the assumption of linearity, represents an approximation to the additional flux adjustment resulting from the perturbed atmospheric physics. The total flux adjustment for a given combination of ocean and atmospheric physics is then the combination of one of the 10 ocean flux adjustments with one of the 153 additional flux adjustments from the HFA, giving a total of 1530 different combinations of atmosphere and ocean physics. Each of the 1530 possible combinations (“model versions”), with its associated total flux adjustment, is then run under a set of transient forcings from 1920-2080 and also under control forcing representative of 1900 conditions for the same length of time, in initial condition ensembles as discussed in the next section.
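The construction of the 1530 model versions can be illustrated with a minimal sketch. It assumes that "combination" means a simple sum of the ocean flux adjustment and the additional (HFA-difference) adjustment, which is our reading of the linearity assumption rather than something stated explicitly. The grid size, the random placeholder fields and all function and variable names are hypothetical.

```python
import numpy as np

N_OCEAN = 10    # ocean model versions (from the text)
N_ATMOS = 153   # atmospheric parameter configurations (from the text)
GRID = (4, 6)   # hypothetical coarse grid, for illustration only

rng = np.random.default_rng(0)
# Placeholder fields; real fields would come from the coupled spin-ups
# and the slab-ocean runs.
ocean_fa = rng.normal(size=(N_OCEAN, *GRID))   # one flux adjustment per ocean
hfa = rng.normal(size=(N_ATMOS, *GRID))        # slab-ocean HFA per atmosphere
hfa_standard = rng.normal(size=GRID)           # HFA of the standard atmosphere


def total_flux_adjustment(i_ocean, j_atmos):
    """Ocean flux adjustment plus the HFA difference from standard physics.

    Assumption: the two terms combine additively.
    """
    additional = hfa[j_atmos] - hfa_standard
    return ocean_fa[i_ocean] + additional


# Every (ocean, atmosphere) pair is one "model version".
model_versions = [(i, j) for i in range(N_OCEAN) for j in range(N_ATMOS)]
```

Under this reading, a new atmospheric configuration only requires one extra slab-ocean HFA field, not a new coupled ocean spin-up.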

Each simulation returns “trickle” files on a yearly basis, consisting of monthly time-series averaged over 61 regions, and “upload” files every 10 years containing seasonally averaged full field output. This paper analyzes the “trickle” data only.

Forcings in the BBC Climate Change Ensemble

For a given set of model parameters, transient simulations are forced with changes in the concentrations of well mixed greenhouse gases (global, Fig. SI 1a), ozone (monthly latitude-height field, not shown) and emissions of SO2 (seasonal latitude-longitude field, Fig. SI 1c). Historical values are used from 1920-2000 and the SRES A1B scenario 4 for 2000-2080. An estimate of natural forcing uncertainty is included in the ensemble through a set of historical and future solar and volcanic forcing scenarios (Fig. SI 1b,d). Corresponding control simulations are forced with constant pre-industrial conditions. SRES A1B 4 represents a mid-range emissions scenario and, given the limited impact of emissions scenario by 2050 5, is expected to produce qualitatively similar results to the newer RCP 4.5 mid-range scenario 6.

Anthropogenic forcing. Fig. SI 1a shows an estimate of the radiative forcing due to well-mixed greenhouse gases under the historical and SRES A1B scenario, with forcing estimates taken from Table 6.2 of ref. [7]. The values in Fig. SI 1a are indicative, and the actual radiative forcing will vary between simulations due to physics perturbations. For example, the UKMO QUMP perturbed physics ensemble shows an approximate spread of 1 W/m² in longwave forcing by the end of the 21st century 8. The BBC CCE does not return sufficient diagnostics to calculate this quantity for individual ensemble members.

HadCM3L is coupled to a fully interactive tropospheric sulphur cycle in the BBC CCE, which parameterizes the conversion of sulphur dioxide (SO2) and dimethyl sulphide (DMS) into sulphate aerosol 2, representing the direct and first indirect (cloud albedo) radiative effects 9. A major source of SO2 is anthropogenic emissions (Fig. SI 1c), which are uncertain 10. This is accounted for by introducing an additional parameter, anthsca, which acts as a direct scaling on anthropogenic SO2 emissions in the range [0.5, 1.5]. For comparison with Fig. SI 1c, ref. [2] estimates contributions from background volcanic emissions of 7.5 TgS/yr and from DMS of 40 TgS/yr.

Natural forcing. There is considerable uncertainty in historical estimates of natural forcing, from variations in solar activity and from reflective stratospheric sulphate aerosol due to explosive volcanic eruptions. We generate 5 historical volcanic scenarios from 1920-2000 based on observations of volcanic aerosol in the stratosphere 11,12, specified as the optical depth at 0.55 microns in four equal-area latitude bands: 90°S-30°S, 30°S-0°, 0°-30°N and 30°N-90°N, which are converted into radiative forcing values in Fig. SI 1b 13. The 5 historical scenarios are generated by taking the raw observations from the Sato 11 and Ammann 12 datasets, a logarithmic average of the two (Avg S+A), a damped version of the Sato dataset (Sato -) and an amplified version of the Ammann dataset (Ammann +). For future volcanic forcing, 10 possible scenarios are generated. The first 3 repeat historical forcing from the Sato (1850-1929 and 1920-1999) and Ammann (1890-1969) datasets. A further 7 scenarios are created using proxies from 1400-1959 14, splitting the 560 year period into 7 consecutive 80 year segments. Each scenario ensures that there were no major eruptions over 2000-2005, coinciding with the start of the BBC CCE. Removing scenarios showing major eruptions from 2005-2010 does not affect results.
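Two of the scenario-building steps above can be sketched in a few lines. The sketch assumes that the "logarithmic average" is a geometric mean of two strictly positive optical depth series, and that the proxy record is annual. Both are our assumptions, and the function names and placeholder data are hypothetical.

```python
import numpy as np


def log_average(a, b):
    """Average two strictly positive series in log space (geometric mean).

    Assumption: this is what "logarithmic average" refers to.
    """
    return np.exp(0.5 * (np.log(a) + np.log(b)))


def split_segments(years, values, length=80):
    """Split an annual record into consecutive, non-overlapping segments."""
    n_segments = len(years) // length
    return [
        (years[k * length:(k + 1) * length], values[k * length:(k + 1) * length])
        for k in range(n_segments)
    ]


# Annual proxy record covering 1400-1959 inclusive (560 years).
proxy_years = np.arange(1400, 1960)
proxy_depth = np.full(proxy_years.shape, 0.01)   # placeholder optical depths
segments = split_segments(proxy_years, proxy_depth)
```

Each 80 year segment can then be relabelled as a possible future volcanic forcing scenario.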

Similarly, 5 historical solar forcing scenarios are generated from observed solar activity from 1920-2003 15,16,17,18 (Fig. SI 1d). To allow for the possibility that all have underestimated the actual solar trend, the 20th century changes from ref. [16] are doubled to create a 5th scenario. All historical scenarios start at the same value in 1920 to avoid any large changes in forcing at the start of the simulation.

For each historical scenario, we assume that the future solar forcing will decrease at the same rate as it has increased over the past 80 years, increase at the same rate, or show no trend, corresponding to the solid, dashed and dotted lines respectively in Fig. SI 1d. This gives a total of 15 possible solar forcing scenarios over the period 1920-2080. Given the recent exceptional solar minimum 19 and the level at which the future solar forcing scenarios were fixed at the start of the experiment, some of the scenarios that repeat the past 1920-2003 trend (dashed lines, Fig. SI 1d) may seem unrealistic with hindsight. As a conservative sensitivity study, we have repeated the analysis retaining only simulations corresponding to solar scenarios that reverse any trend over 1920-2003, so that solar forcing decreases in the future (solid lines, Fig. SI 1d). This reduces the upper bound for 2050 global-mean warming by approximately 0.25 K. We cannot attribute this solely to the solar forcing scenario, as simulations with the same model physics as those removed but different solar forcing scenarios were not returned. Therefore, a thorough sampling of the high response region would permit a more extensive investigation into the sensitivity of the upper bound to the future solar forcing scenario.
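The count of 15 solar scenarios (5 historical, each with 3 future continuations) can be made concrete with a rough sketch. It assumes the "rate" is a least-squares linear trend over the final 80 years of the historical record and that the future forcing continues linearly from the last historical value. The source does not specify either detail, and the placeholder historical series and all names are hypothetical.

```python
import numpy as np

# Sign applied to the historical trend for each future continuation.
MODES = {"reverse": -1.0, "repeat": 1.0, "flat": 0.0}


def future_solar(hist_years, hist_forcing, future_years, mode):
    """Extend a historical series with a reversed, repeated or zero trend.

    Assumption: the trend is a linear fit over the last 80 historical years.
    """
    slope = np.polyfit(hist_years[-80:], hist_forcing[-80:], 1)[0]
    return hist_forcing[-1] + MODES[mode] * slope * (future_years - hist_years[-1])


hist_years = np.arange(1920, 2004)      # 1920-2003
future_years = np.arange(2004, 2081)    # 2004-2080
# Five placeholder historical scenarios that all start at zero in 1920.
historical = {
    f"hist_{k}": 0.001 * (k + 1) * (hist_years - 1920) for k in range(5)
}
scenarios = {
    (name, mode): future_solar(hist_years, series, future_years, mode)
    for name, series in historical.items()
    for mode in MODES
}
```

The sensitivity study described above corresponds to retaining only the keys whose mode is "reverse".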

Sulphur cycle responses. Fig. SI 2a shows the diversity of global-decadal-mean sulphate burdens across a randomly selected subset of the transient simulations in the BBC CCE. Each time-series is coloured by the corresponding value of anthsca. The range of burdens in the year 2000 is comparable to the uncertainty estimate from the multi-model AeroCom project 10, indicating that the ensemble is exploring a wide range of sulphur cycle responses. Partway through the project a bug, specific to the climateprediction.net version of the code, was discovered in the sulphur cycle in a number of simulations, which are identified by the grey time-series in Fig. SI 2a. The bug shuts down the mechanism for the oxidation of DMS and SO2 to Aitken mode sulphate, resulting in very small sulphate burdens. Almost all are well outside the uncertainty estimate from ref. [10] and so were removed from the analysis. Fig. SI 2b shows that there is no apparent relationship between the sulphate burden and climate sensitivity of ensemble members in the BBC CCE. There is a very weak (yet significant at the 99% level) correlation between the two of approximately 0.2, despite there being no relationship between climate sensitivity and anthsca, which primarily controls the sulphate burden, indicating that any relationship is due to the model response rather than the ensemble design. Hence, to the extent that the sulphate burden can be used as a first order approximation to the sulphate radiative forcing 20, this indicates that there is no systematic compensation between climate sensitivity and sulphate forcing in the BBC CCE of the magnitude observed in the CMIP-3 ensemble 21.

Given simple energy balance arguments 21,22, one might expect there to be a much more significant relationship between climate sensitivity and sulphate burden when considering only model versions that pass the $r^2$ test. However, we find a relationship of similar magnitude, which hints at two possible causes. One is that whilst the direct sulphate aerosol forcing depends nearly linearly on the sulphate burden, particularly that in the accumulation mode, the indirect sulphate aerosol forcing depends on it in a non-linear fashion 9. Given the limited diagnostics returned, we are unable to calculate either forcing in this ensemble. A second reason is the impact of our reference period of 1961-1990. All results presented in our analysis are from transient-control anomalies with the mean over 1961-1990 subtracted in each region. We chose 1961-1990 as it is a standard reference period with good observational coverage in the surface temperature observations used (Fig. 3), and it avoids using any data when the models may still be spinning up after the coupling of model atmospheres and oceans in 1920 23. Consequently, any process that causes a large response across the ensemble in 1961-1990 mean warming over the control will not be well constrained in our analysis.

We can observe the impact of the removal by considering Fig. SI 3, which shows time-series of individual simulations from a large ensemble of the standard physics model version, varying the initial conditions, volcanic forcing, solar forcing and anthsca. In total the ensemble consists of 197 transient simulations and 31 control simulations (only differing in initial conditions), for which the ensemble mean of the control simulations is removed from each transient simulation. Fig. SI 3a shows the expected relationship that as anthsca is increased the warming relative to the control is reduced, given the increased sulphate radiative forcing. On the other hand, Fig. SI 3b indicates that removing the mean anomaly over 1961-1990 from each transient-control anomaly removes much of this relationship. The reason for this is that sulphate forcing increased significantly, both in absolute terms and compared to greenhouse gas forcing, before the 1961-1990 reference period and subsequently has not changed appreciably.

This can be further understood by considering the variance explained by the 3 factors perturbed across the ensemble, namely solar and volcanic forcing (integrated from the start of the simulation in 1920) and the value of anthsca, through a multiple regression of the transient-control anomaly for each ensemble member onto the 3 quantities for a set of overlapping 20-year means in the simulation. Contrasting Fig. SI 3c,d we see that the difference in total standard deviation between the two cases is largely explained by the reduced contribution due to anthsca when considering simulations with the mean anomaly over 1961-1990 removed.
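The multiple regression described above can be sketched for a single 20-year mean as follows. The attribution of variance to each factor as the squared coefficient times the predictor variance is an assumption that only holds cleanly when the predictors are close to uncorrelated across the ensemble. The synthetic data, coefficients and names are hypothetical and are not values from the ensemble.

```python
import numpy as np


def linear_variance_decomposition(X, y):
    """Regress y on the columns of X and attribute variance to each column.

    X: (n_members, n_factors) explanatory variables
    y: (n_members,) 20-year mean transient-control anomaly
    Returns coefficients, per-factor variance contributions and the
    residual variance.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    beta, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    contributions = beta**2 * Xc.var(axis=0)
    residual = (yc - Xc @ beta).var()
    return beta, contributions, residual


rng = np.random.default_rng(1)
n_members = 197                           # size of the standard physics ensemble
X = rng.normal(size=(n_members, 3))       # columns: solar, volcanic, anthsca
true_beta = np.array([0.1, -0.05, -0.4])  # synthetic, for illustration only
y = X @ true_beta + rng.normal(scale=0.05, size=n_members)

beta, contributions, residual = linear_variance_decomposition(X, y)
```

Repeating the fit for each overlapping 20-year window would give contribution time-series of the kind contrasted in Fig. SI 3c,d.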

Thus by expressing each simulation as an anomaly from the 1961-1990 mean transient-control anomaly prior to the analysis we remove much of the spread across the ensemble due to anthsca, and so we do not see a strong relationship between climate sensitivity and sulphate burden in the constrained ensemble. Indeed, if one considers simulations with a particular (narrow) range of warming over 2001-2010 relative to 1961-1990 across the full perturbed physics ensemble, correlations between climate sensitivity and sulphate burden in the sub-ensemble are not significantly different to that for the whole ensemble (approximately 0.2), whereas if one considers a narrow range of warming over 2001-2010 relative to the control, correlations are much larger (above 0.5 and somewhat independent of the exact range considered).

The implication of this feature is that our analysis is therefore unable to constrain much of the aerosol forcing uncertainty, at least to the extent that we have expressed it through the sulphate burden. It therefore appears that additional observations would be needed in the light of this, which we briefly discuss later. As observed in Fig. SI 2, the total sulphate burden in the ensemble members is reasonably constrained in agreement with estimates from ref. [10], even if we are not able to constrain the impact of differing sulphate burdens on temperature changes.

Drivers of Uncertainty

To build upon the arguments made in the previous section, we have extended the simple linear variance decomposition to cover the full perturbed physics ensemble as shown in Fig. SI 4, for both the transient-control anomaly (Fig. SI 4a) and transient-control anomaly relative to the mean anomaly over 1961-1990 (Fig. SI 4b). We have chosen a low dimensional set of explanatory variables in the linear regression to simplify the interpretation of the results. As before we use the volcanic and solar forcing along with anthsca, and now in addition we add the climate sensitivity for each model atmosphere as a proxy for atmospheric physics and the model ocean vertical diffusivity to represent the ocean physics. This parameter has been found to be a dominant factor in controlling the vertical heat transfer in the ocean across the ensemble 23. We describe the estimation of climate sensitivity in a later section (see Estimation of climate sensitivity).

Fig. SI 4a shows how the balance of drivers evolves through the simulation: over the mid-part of the 20th century approximately half of the variance is attributable to changes in the anthropogenic sulphate emissions scaling factor, and hence total aerosol forcing. As the simulation moves into the 21st century, atmospheric feedbacks play a much larger role, with climate sensitivity being the dominant driver, explaining over 60% of the variance by the last 20 years of the simulation. We refer the reader to refs. [24, 25] for a detailed investigation of the processes governing the strength of atmospheric feedbacks in these simulations. The impact of uncertainty in ocean parameters plays a relatively small role in the ensemble, although we observe the impact increasing further into the simulation, which is expected given the longer time-scales involved in the ocean response.

Comparison with Fig. SI 4b yields similar conclusions regarding the 21st century response, and as observed previously the removal of the mean transient-control anomaly over 1961-1990 removes much of the impact of the varying anthropogenic sulphate emissions, particularly over the period of 1961-1990 used for the comparison with surface temperature observations.

This analysis has only quantified the extent of the linear dependence between the forecast warming and the set of explanatory variables considered. Using a non-linear model 26 did not improve the regression fit appreciably. We note further that these results are somewhat dependent on the sampling of the explanatory variables in the ensemble used for the regression. For example, sampling a wider diversity of ocean configurations would increase the role of the ocean physics in driving the uncertainty, especially in the latter part of the simulations.

When considering the drivers of uncertainty in the ensemble constrained by $r^2$ and the flux adjustments, we see a more complicated set of behaviours (not shown), the interpretation of which is left to future work. We observe that the lowest response ensemble members in the constrained ensemble tend to have low values of climate sensitivity and high values of anthsca (and hence high aerosol forcing), and vice versa for the highest response ensemble members. Fig. SI 4 indicates that differences are largely driven by variations in climate sensitivity.

Observational constraint

Model simulations are compared to the observed spatio-temporal evolution of land 27 and ocean 28 near surface temperatures over 1961-2010. We use observations from the Hadley Centre observation website (http://www.hadobs.org) as of December 2010. Spatial averages are taken to extract the large scale Giorgi and ocean regions used in the analysis (Table SI 3 and Fig. SI 7). Ocean regions are extracted using the HadCM3L land-ocean mask, and individual gridboxes are only included in the spatial average if more than 6 months in a given year are present. The overlap between some ocean and Giorgi regions (Fig. SI 7) is not problematic, as all data are expressed in a reduced dimension space which allows for subsequent redundancy in the data (see Spatial truncation and choice of projection operator).

Time-series for each region are then averaged to 5 year resolution, and expressed as an anomaly from 1961-1990. All AOGCM simulations are prepared identically, although no masking due to missing data in the observations is performed owing to limited post-processing available on client machines. We expect this to make little difference as the spatial coverage of the observations is greater than 90% of the area considered in all 5 year periods. The spatial and temporal scale of the analysis strikes a balance between evaluating the performance of a model including regional scale information and removing smaller scales that climate models do not simulate well 29,30.

The input to the analysis from the observations and AOGCM simulations is therefore an $R \times n_t$ matrix, where $R = 28$ is the number of regions and $n_t = 10$ is the number of 5 year periods over 1961-2010.
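A minimal sketch of how a regional time-series could be reduced to this $R \times n_t$ input matrix is given below. It assumes the regional data have already been averaged to annual means and that no values are missing. The function name, argument names and random placeholder data are hypothetical.

```python
import numpy as np


def prepare_region_matrix(annual, years, start=1961, end=2010,
                          base=(1961, 1990), block=5):
    """Return an (R, n_t) matrix of 5 year means as anomalies from `base`.

    annual: (R, n_years) regional annual means
    years:  (n_years,) calendar year of each column
    """
    in_window = (years >= start) & (years <= end)
    data = annual[:, in_window]
    window_years = years[in_window]
    n_regions = data.shape[0]
    # Non-overlapping 5 year means: 50 years give n_t = 10 columns.
    five_year = data.reshape(n_regions, -1, block).mean(axis=2)
    in_base = (window_years >= base[0]) & (window_years <= base[1])
    baseline = data[:, in_base].mean(axis=1, keepdims=True)
    return five_year - baseline


rng = np.random.default_rng(2)
years = np.arange(1920, 2011)
annual = rng.normal(size=(28, years.size))   # placeholder for 28 regions
Y = prepare_region_matrix(annual, years)
```

Because 1961-1990 spans exactly the first six 5 year periods, the first six columns of each row average to zero by construction.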

The inclusion of additional observational constraints might further reduce the range of warming consistent with the full set of observations, but at the risk of ruling out all but a handful of model versions, which makes any interpretation of uncertainty problematic. The use of the large scale spatio-temporal pattern of surface warming is predominantly motivated by the depth of physical understanding of the forced response 31,32 and has 5 desirable properties as a constraint: (1) a detectable anthropogenic signal, (2) a relatively homogeneous observational record, (3) a consistent simulation by state of the art coupled AOGCMs 32, (4) demonstration that variability can be captured by AOGCM pre-industrial control simulations and (5) a strong relationship with forecast quantities of interest 31.

However, the strength of the recent surface warming as a constraint is weakened by uncertainties in sulphate and natural forcing 33 in the experiment. The pure greenhouse gas signal may provide a stronger constraint 31 but it requires the use of signal separation methods 5,31,32 which may limit the inclusion of regional scale information and blur the distinction between what is directly observed and what is modelled.

It is unlikely that the addition of ocean heat uptake diagnostics would reduce the range of uncertainty estimates from this ensemble, given that much of the spread in ocean heat uptake in the ensemble arises from atmospheric parameter perturbations. Sampling a wider range of ocean heat uptake responses arising from ocean physics would be desirable, although we note that studies using the same model have failed to produce a significant spread due to ocean physics 8. The use of constraints based on mean climatology is also inappropriate given the weak relationship with forecast warming 34 and the potential circularity in the methodology given that atmospheric configurations were selected based on the quality of climatology35 and flux adjustments are applied to all model versions which leads to similar mean climatologies. Flux adjustments can also play a strong role in dictating the seasonal cycle which further complicates the use of intra-annual constraints. Further, research has shown that climatological constraints can give mutually inconsistent results 35, thus making the final result critically dependent on subjective decisions regarding which constraints to use and how to deal with irreducible errors.

CMIP-3 and QUMP simulations

We use simulations from the CMIP-3 ensemble, available from the PCMDI database (http://www-pcmdi.llnl.gov/). Summary details of the data used are shown in Table SI 4.

Specifically we have accessed all CMIP-3 20th century simulations that were extended into the 21st century under the SRES A1B scenario. Some modelling groups only extended a 20th century simulation forced solely with changes in anthropogenic forcing, whilst others used changes in all forcings in the 20th century run. Each CMIP-3 simulation is expressed as a transient-control anomaly, formed by first averaging over transient initial condition ensemble members and then removing the corresponding average of matched (in time) control segments, to ensure consistency with the preparation of BBC CCE simulations. Therefore CMIP-3 models with corresponding pre-industrial control simulations that are shorter than 160 years, the length of simulations in the BBC CCE, are not included. The 17 UKMO QUMP 8,36 simulations are prepared identically.

For each CMIP-3 model, we also extract the full pre-industrial control simulation and split each into two halves, one for covariance estimation and the second for uncertainty analysis as discussed in the next section (Table SI 4).

Statistical model

The key steps in the statistical analysis are shown by the schematic in Fig. SI 5. This section provides an in-depth explanation of the statistical model, expanding upon the details of the goodness-of-fit calculation and uncertainty analysis given in Methods.

To assess the goodness-of-fit, which is our limited expression of the “true” model error, in AOGCM simulations we compare a set of observations, $Y \in \mathbb{R}^{R \times n_t}$, to each AOGCM transient-control simulation, $X \in \mathbb{R}^{R \times n_t}$, where $R = 28$ is the number of regions, $n_t = 10$ the number of 5 year means and $\mathbb{R}^{R \times n_t}$ gives the dimension of the input data. Both observations and model simulations are contaminated with internal variability. The magnitude of the error is compared to the level expected from estimates of internal variability, $V_i \in \mathbb{R}^{R \times n_t}$, which provides grounds for rejecting certain model simulations as being unlikely to be statistically consistent with the observations. This assumes that the difference between the hypothetical perfect model and the observations arises from internal variability alone.

We first project the spatial component of the data into a reduced-dimension space using the operator $P \in \mathbb{R}^{n_s \times R}$, where $n_s$ is the number of retained spatial components. The choice of $P$ and the sensitivity of results are discussed in a later section (see Spatial truncation and choice of projection operator). The projection expresses all data in a $p$ dimensional space (so that $p = n_s n_t$), through $y = \mathrm{vec}(PY)$ for example in the case of the observations, where the vec() operator creates a column vector from a matrix by stacking the columns on top of each other. We assume that all distributions in the reduced-dimension subspace can be modelled with a multivariate Gaussian, as is standard practice 37,38,39 and may be expected due to the extensive aggregation involved.
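The projection and vec() steps can be sketched as below. The operator used here is a random matrix with orthonormal rows, chosen purely for illustration, since the actual choice of $P$ is described in a later section. The value of `n_s` and all names are hypothetical.

```python
import numpy as np


def project(P, Y):
    """Return vec(P Y): project the regions, then stack columns (time steps)."""
    # order="F" flattens column by column, matching the vec() definition.
    return (P @ Y).flatten(order="F")


rng = np.random.default_rng(3)
R, n_t, n_s = 28, 10, 4                   # n_s = 4 is an illustrative choice
Y = rng.normal(size=(R, n_t))             # placeholder for the observations
Q, _ = np.linalg.qr(rng.normal(size=(R, n_s)))
P = Q.T                                   # (n_s, R) with orthonormal rows
y = project(P, Y)                         # length p = n_s * n_t
```

With column stacking, the first `n_s` entries of `y` hold the projected pattern for the first 5 year period, the next `n_s` entries the second period, and so on.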

Outline. We model the observations, $y$, in the reduced dimension space as

$$ y \sim \mathcal{N}_p(\mu_y, C_y), \tag{SI 1} $$

where $\mu_y$ is the unobserved underlying noise free climate signal, $C_y$ accounts for internal variability, observational uncertainty and a component of structural uncertainty, and $\mathcal{N}_p()$ is the $p$ dimensional multivariate Gaussian distribution.

It is difficult to estimate $C_y$ from observations directly without various simplifying assumptions such as AR(1) noise. Given the limited length of, and presence of forced signals in, observations, it is standard practice 40,41,42 to use segments of long pre-industrial AOGCM control simulations to characterize $C_y$. Using all of the available pre-industrial control simulations from the CMIP-3 ensemble allows for structural uncertainty in the estimation of $C_y$ 42. This is particularly important since variance can differ by a factor of two across models 43. Using segments from all available simulations is supported by ref. [42].

Overlapping the $i$th CMIP-3 control simulation at 5 year frequency produces $n_i$ 50-year segments, each the same length as for the observations. The $n_i$ segments are split into two halves of size $n_{i1}$ and $n_{i2}$, the first used in the estimation of $C_y$ and the second for uncertainty analysis. We keep a buffer of at least 50 years between the start of segments in the two sets to avoid any data being present in both. Hence in the case of a 200 year control segment, each half (of length 100 years) would contribute 11 overlapping 50-year segments. The last two columns of Table SI 4 indicate the size of these sets for each CMIP-3 model.
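The segment counting can be checked with a short sketch. It assumes segment starts are measured in years from the beginning of the control simulation and that the simulation is split at its midpoint. The function names are hypothetical.

```python
def segment_starts(n_years, seg_len=50, step=5):
    """Start offsets of overlapping segments within a record of n_years."""
    return list(range(0, n_years - seg_len + 1, step))


def split_halves(n_years, seg_len=50, step=5):
    """Segment starts for the first and second halves of a control run."""
    half = n_years // 2
    first = segment_starts(half, seg_len, step)
    second = [half + s for s in segment_starts(n_years - half, seg_len, step)]
    return first, second


first_half, second_half = split_halves(200)
```

For a 200 year control this gives starts at years 0, 5, ..., 50 in the first half and 100, 105, ..., 150 in the second, so the two sets share no data.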

Bringing all of the segments from the first halves of the simulations together gives $U_1 \in \mathbb{R}^{p \times n_1}$, composed of the set $\{u_{i1}^T\}$, where $u_{i1} = \mathrm{vec}(PV_{i1})$ and $n_1 = 614$ is the total number of overlapping segments from the first half of all of the CMIP-3 controls. Overlapping segments are not independent realisations, thus the number of degrees of freedom of the estimated covariance $C_y$ is much less than $n_1$, and is expected to be around 1.5 times the number of non-overlapping segments 44, about 120 in this case. However, we avoid having to estimate the degrees of freedom by using a non-parametric test in the uncertainty analysis.

Observational uncertainty is not expected to be a significant factor in $C_y$ 45 and hence is not explicitly included in the analysis. As a simple test we explore the sensitivity of our results to the magnitude of $C_y$ in a later section (Fig. SI 12).

We model $x_\theta$, the transient-control anomaly from an ensemble member with parameters and natural forcing scenario defined by $\theta$, similarly,

$$ x_\theta = x_\theta^t - x_\theta^c, \tag{SI 2} $$

$$ \phantom{x_\theta} = \frac{1}{N_t} \sum_{i=1}^{N_t} z_\theta^{it} - \frac{1}{N_c} \sum_{i=1}^{N_c} z_\theta^{ic}, \tag{SI 3} $$

where $z_\theta^{it}$ and $z_\theta^{ic}$ are the $i$th transient and control initial condition ensemble members respectively. Both are modelled as multivariate Gaussian distributions,

$$ z_\theta^{it} \sim \mathcal{N}_p(\mu_\theta^t, C_\theta^t), \tag{SI 4} $$

$$ z_\theta^{ic} \sim \mathcal{N}_p(\mu_\theta^c, C_\theta^c), \tag{SI 5} $$

where $\mu_\theta^*$ represents the unobservable noise free signal, and $C_\theta^*$ represents the covariance structure of variability within components of $\mu_\theta^*$. If initial condition ensemble members are treated as probabilistically independent (i.e. having uncorrelated internal variability), justified given the 40-year spin up prior to using any data and the consideration of “fast” atmospheric variables in this analysis, then

$$ x_\theta^t \sim \mathcal{N}_p\left(\mu_\theta^t, \frac{C_\theta^t}{N_t}\right), \tag{SI 6} $$

$$ x_\theta^c \sim \mathcal{N}_p\left(\mu_\theta^c, \frac{C_\theta^c}{N_c}\right), \tag{SI 7} $$

where $N_c$ and $N_t$ are the number of control and transient initial condition ensemble members respectively.

In practice initial condition ensembles large enough to characterize the dependence of variability on either forcing or θ are not available (Nt,Nc are generally less than 3), and hence we assume that all simulations, both transient and control, in the BBC CCE have the same variability structure, Cx. This represents the variability present in the models, which we characterize using a 1000 year HadCM3 control simulation46, which is subtly different to Cy representing an estimate of variability in the observations.

To estimate $C_x$ the HadCM3 control simulation is overlapped at 5 year frequency and split into two halves as described above. The first half forms $U_{h1} \in \mathbb{R}^{p \times n_{h1}}$, where $n_{h1} = 89$ (Table SI 4), and the second half is used for uncertainty analysis. The de-coupling of variability from forcing and $\theta$ may be questioned given the expected link between the strength of feedbacks and variability 47, although the impact is likely to be much smaller than the variation in the mean. Initial tests using the limited number of model versions with initial condition ensemble sizes larger than 10 have revealed inconclusive results, which are particularly sensitive to the statistical test used to compare estimated covariance matrices. The same covariance is used for the QUMP transient and control simulations, whilst for CMIP-3 simulations $C_y$ is used to represent variability in model simulations. As a simple test, exchanging $C_y$ with $C_x$ and vice-versa does not impact results.

We assume that variability in the transient and control initial condition ensemble averages is uncorrelated, since 40 years passes between the start of simulations and the use of any data, hence

$$x_\theta \sim N_p\!\left(\mu_\theta = \mu^t_\theta - \mu^c_\theta, \left[\frac{1}{N_t} + \frac{1}{N_c}\right] C_x\right). \quad \text{(SI 8)}$$

Under the null hypothesis that the difference between observations and model simulations arises from internal variability alone, and additionally assuming that variability in y and xθ is uncorrelated, the difference between y and xθ is modelled as,

$$y - x_\theta \sim N_p\!\left(0, C_N = C_y + \left[\frac{1}{N_t} + \frac{1}{N_c}\right] C_x\right). \quad \text{(SI 9)}$$

Hence, the null hypothesis assumes that the relationship between the observations and model simulations is given by,

y = xθ + u (SI 10)

where $E(uu^T) = C_N$. Thus if the difference between the model and observations is significantly larger than expected from estimates of u, the model simulation can be rejected as being inconsistent with the observations. The next section describes this procedure more formally.

Uncertainty analysis. We calculate the goodness-of-fit, more formally known as the Mahalanobis distance, between observations, y, and modelled transient-control anomaly, xθ, as

$$r^2_\theta = (y - x_\theta)^T C_N^{-1} (y - x_\theta), \quad \text{(SI 11)}$$

which represents a distance in the p dimensional space under the metric of $C_N$, which normalizes errors by the magnitude and correlation structure expected in estimates of variability in y and xθ. This is equivalent to the mean square error in a transformed pre-whitened space (i.e. one in which all components are uncorrelated and have unit variance).
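The pre-whitening interpretation can be written out explicitly. The following is a sketch only, assuming $C_N$ is symmetric positive definite so that a Cholesky factorization $C_N = LL^T$ exists; the factor $L$ and the whitened residual $z$ are notation introduced here for illustration and do not appear in the original analysis.

```latex
\begin{aligned}
z &= L^{-1}(y - x_\theta), \\
E(zz^T) &= L^{-1} E(uu^T) L^{-T} = L^{-1} C_N L^{-T} = I
  \quad \text{when } y - x_\theta = u \text{ as in SI 10}, \\
r^2_\theta &= (y - x_\theta)^T (LL^T)^{-1} (y - x_\theta) = z^T z = \sum_{k=1}^{p} z_k^2 .
\end{aligned}
```

Under these assumptions the statistic is the sum of squared errors over the p whitened components, which equals the mean square error in the whitened space up to the constant factor p.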

Under the null hypothesis that y and xθ are drawn from the same distribution,

$$H_0 : \mu_y = \mu_\theta, \quad \text{(SI 12)}$$

$r^2_\theta$ should behave as a sample from the distribution of $r^2$ arising from variability in y and xθ alone.

Defining ui as a sample from the distribution under H0 gives

$$r^2_i = u_i^T C_N^{-1} u_i. \quad \text{(SI 13)}$$

The distribution of $r^2_i$ provides a means of rejecting $r^2_\theta$ at a given significance level. The interpretation of this test is the following: if a given $r^2_\theta$ lies in a high percentile of the distribution of $\{r^2_i\}$, i.e. $r^2_\theta > r^2_{1-\alpha}$ where $r^2_{1-\alpha}$ is the 1-α percentile of the distribution $\{r^2_i\}$, then there is less than an α chance of observing an $r^2$ as extreme as $r^2_\theta$ under H0. Thus we reject H0, and hence xθ, at the α significance level.
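The test in SI 11 to SI 13 can be illustrated with a short numerical sketch. It is not the code used in the study: the dimension, the covariance matrix and the noise draws below are synthetic, the function names are hypothetical, and Gaussian draws stand in for the control segments that the analysis uses to build the null distribution.

```python
import numpy as np

def mahalanobis_sq(residual, cov):
    """Return residual^T cov^{-1} residual without forming the inverse explicitly."""
    return float(residual @ np.linalg.solve(cov, residual))

def reject_at_level(r2_theta, r2_null, alpha):
    """Reject H0 when r2_theta exceeds the (1 - alpha) quantile of the null sample."""
    threshold = np.quantile(r2_null, 1.0 - alpha)
    return r2_theta > threshold, threshold

# Synthetic illustration only: p, the covariance and the noise draws are made up.
rng = np.random.default_rng(0)
p = 5
A = rng.normal(size=(p, p))
C_N = A @ A.T + p * np.eye(p)          # any symmetric positive definite matrix
L = np.linalg.cholesky(C_N)

# Null sample {r_i^2}: noise draws u_i with covariance C_N (stand-in for control segments).
u = (L @ rng.normal(size=(p, 2000))).T
r2_null = np.array([mahalanobis_sq(u_i, C_N) for u_i in u])

# A hypothetical model whose difference from the observations is far larger than the noise.
bad_residual = 10.0 * L @ np.ones(p)
r2_bad = mahalanobis_sq(bad_residual, C_N)
rejected, threshold = reject_at_level(r2_bad, r2_null, alpha=0.34)
```

Using `np.linalg.solve` rather than an explicit inverse is a design choice of this sketch; it avoids amplifying rounding error when the covariance is poorly conditioned.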

To avoid a strong bias in uncertainty estimates, an independent set of control segments from those used to estimate $C_x$ and $C_y$ are used to calculate $\{r^2_i\}$ 38 (Table SI 4). Furthermore, establishing the distribution $\{r^2_i\}$, that is the distribution of expected $r^2$ under the null hypothesis, in this way avoids the need to assume a parametric F-distribution and associated difficulties in estimating the number of degrees of freedom in estimated covariances 38. The crucial assumption we do make is that the distribution of goodness-of-fit between the hypothetical perfect model and the observations can be estimated using internal variability generated from segments of AOGCM pre-industrial control simulations.

In the statistical analysis $H_0$ is tested at the 34% significance level, which forms a set $\{\theta_i\}$ for which $H_0$ cannot be rejected at this level. This set of hypothesis tests, each corresponding to a value of θ in the set, all have a 34% chance of rejecting H0 when H0 is true, implying that the hypothetical true set of parameters has a 34% chance of being outside the set defined by $\{\theta_i\}$. The relationship between hypothesis tests and confidence intervals permits the construction of a confidence interval on θ and subsequent functions thereof 48. The relation states that if a hypothesis test fails to reject $\theta_i$ at the α significance level, then the 1-α confidence interval will contain $\theta_i$. Following the relation, confidence intervals can be calculated for any simulated property, e.g. $\Delta T_\theta$, the forecast warming projected for parameters θ, by considering the range in the set $\{\Delta T_{\theta_i}\}$.

This statistical test we have described provides a close approximation to likelihood profiling48, a procedure for calculating confidence intervals where in this case the value of forecast warming would be fixed and the $r^2$ statistic optimized over all regions of parameter space producing the given warming. Doing this exactly is not feasible with such a complex model, but we assume that our sample of ensemble members is sufficient to represent a continuum (i.e. infinite number) and so can identify the optimal $r^2$ at each value of forecast warming. Given this interpretation, no correction is required to the significance level to account for the large number of hypothesis tests (otherwise an infinite correction would be required in the limit of the ensemble size tending to infinity).
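The construction of a confidence interval from the set of non-rejected ensemble members can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the five ensemble members with their warming and goodness-of-fit values are invented, and the threshold is supplied by hand rather than derived from control segments.

```python
import numpy as np

def confidence_interval(delta_T, r2, r2_threshold):
    """Range of forecast warming over members not rejected by the r^2 test."""
    delta_T = np.asarray(delta_T, dtype=float)
    accepted = np.asarray(r2, dtype=float) <= r2_threshold
    if not accepted.any():
        raise ValueError("all ensemble members rejected; interval is empty")
    return delta_T[accepted].min(), delta_T[accepted].max()

# Made-up values for five hypothetical ensemble members.
delta_T = [1.2, 1.8, 2.4, 3.1, 3.9]   # forecast warming (K)
r2 = [40.0, 12.0, 9.0, 14.0, 55.0]    # goodness-of-fit for each member
lo, hi = confidence_interval(delta_T, r2, r2_threshold=15.0)
```

Members below the threshold are treated identically, which mirrors the statement that the interval does not differentiate between model versions that pass the test.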

Applying a formal Bayesian analysis requires a number of additional assumptions, including prior distributions on the model parameters 49. This is problematic given the high-dimension input space and the difficulty in specifying prior distributions over quantities that often do not have direct real world counterparts 50. Furthermore, recent work has highlighted the potential pitfalls in model weighting showing results can be acutely sensitive to the level of unpredictable noise in model simulations and that model weighting implicitly assumes that past model performance will persist through time 51. Whilst we concede that the methodology presented here also is somewhat dependent on the level and structure of noise in observations and model simulations, as expressed through $C_N$ and the sensitivity to which is addressed below (Fig. SI 12), the confidence interval does not attempt to differentiate between model versions that have an $r^2_\theta$ below the threshold value of $r^2_{1-\alpha}$.

We choose not to use a Bayesian approach with explicit subjective weights on parameters, though further analyses could, because the costs associated with the additional inclusion of expert priors over parameters in a high dimensional space are not obviously outweighed by the benefits: we would still be producing an uncertainty range rather than a full probability density function. In this case we have chosen a simpler approach that contains fewer assumptions and gives a reasonably straightforward uncertainty range. We would welcome collaborations on alternative approaches for the interpretation of our ensemble, but note that a full probability density function is likely to remain elusive in the presence of the structural biases to which all the current generation of AOGCMs are subject.

Spatial truncation and choice of projection operator

The projection operator, P, and hence dimension in the $r^2$ calculation is a key choice in the analysis. On one hand increasing the dimension maximizes the use of information contained in the observations, whilst reducing the dimension allows more accurate estimation of $C_N$ and more importantly $C_N^{-1}$ since errors are magnified by the non-linear operation of matrix inversion. In particular we wish to estimate $C_N$ accurately, but also ensure that the proxy for internal variability in the climate system, namely pre-industrial control simulations, faithfully represents variability at the spatial and temporal scales considered.

In principle the full state space dimension of 280 can be used for the calculation of goodness-of-fit, since this represents a significant degree of truncation from the full field output of the model. Rather than do this we choose a more compact representation of the spatial dimension through projection onto a set of Empirical Orthogonal Functions (EOFs). The leading spatial EOFs of the regional surface temperature evolution across the ensemble from 1961-2010 are used. This is appealing from an intuitive standpoint because EOFs form a basis explaining the maximal amount of variance across the ensemble.

To allow each region to contribute to the EOFs proportionally we calculate the area weighted spatial “covariance” matrix across BBC CCE historical data from 1961-2010,

$$C_Z = \frac{1}{m-1} W^{\frac{1}{2}} Z Z^T W^{\frac{1}{2}}, \quad \text{(SI 14)}$$

where $Z \in \mathbb{R}^{R \times m}$ is a matrix consisting of the set of BBC CCE transient minus control simulations over 1961-2010, $\{X_\theta\}$, $W \in \mathbb{R}^{R \times R}$ a diagonal matrix containing the normalized area weight for each region and $m = n_t \times n_{cpdn}$ with $n_{cpdn}$ the number of ensemble members. $C_Z$ is then diagonalized,

$$C_Z = P_Z Q_Z P_Z^T, \quad \text{(SI 15)}$$

where PZ represent a set of spatial EOFs and QZ is a diagonal matrix corresponding to the variance across the ensemble and in time for each EOF. CZ is not strictly a covariance matrix since each simulation is expressed as an anomaly from 1961-1990 rather than the full 1961-2010 period. This results in the last 20 years, 1991-2010, where the climate change signal has been strongest, influencing the EOFs more strongly, i.e. it concentrates on spatial patterns that project onto the climate change signal that we are attempting to constrain. Sample size is not a problem in the estimation of CZ given the ratio of sample size to dimension is approximately 1000.
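The calculation in SI 14 and SI 15 can be sketched numerically. The example below is illustrative only: the function name, the six regions, the equal area weights and the synthetic anomaly matrix are all assumptions made for the sketch, not values from the BBC CCE.

```python
import numpy as np

def weighted_eofs(Z, area_weights):
    """Eigen-decompose C_Z = W^(1/2) Z Z^T W^(1/2) / (m - 1).

    Z            : (R, m) array, regions by (time x ensemble member) samples
    area_weights : (R,) array of normalized area weights (diagonal of W)
    Returns eigenvalues (descending) and EOFs as columns of P_Z.
    """
    R, m = Z.shape
    Zw = np.sqrt(area_weights)[:, None] * Z      # W^(1/2) Z
    C_Z = Zw @ Zw.T / (m - 1)
    evals, evecs = np.linalg.eigh(C_Z)           # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

# Synthetic anomalies: one dominant spatial pattern plus weak noise.
rng = np.random.default_rng(1)
R, m = 6, 500
pattern = np.array([1.0, 1.0, 1.0, 0.5, 0.5, 0.5])
Z = np.outer(pattern, rng.normal(scale=3.0, size=m)) + rng.normal(scale=0.3, size=(R, m))
w = np.full(R, 1.0 / R)                          # equal weights, for simplicity

evals, P_Z = weighted_eofs(Z, w)
explained = evals / evals.sum()                  # fraction of variance per EOF
```

As in the text, the anomalies are not re-centred before forming the product, so the result is a second-moment matrix rather than a strict covariance.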

All of the input data, both observations and model simulations, is expressed in the basis defined by $P_Z$. The projection of a single model simulation onto each EOF yields a time-series of length $n_t$. The dimension of the state space, p, used in the $r^2$ calculation is then $n_s \times n_t$ where $n_s$ is the number of spatial EOFs retained. For example, xθ and y are the projection of weighted data onto the EOFs,

$$x_\theta = \mathrm{vec}\left(P_Z(n_s)^T W^{\frac{1}{2}} X_\theta\right), \quad \text{(SI 16)}$$

$$y = \mathrm{vec}\left(P_Z(n_s)^T W^{\frac{1}{2}} Y\right). \quad \text{(SI 17)}$$

Hence the projection operator is defined as $P = P_Z(n_s)^T W^{\frac{1}{2}}$ where the $n_s$ label implies the retention of the leading $n_s$ spatial EOFs. It is necessary to include the area weighting matrix because the EOFs are orthogonal only with respect to the area-weighted data. Identical projections are performed on the CMIP-3 and QUMP transient minus control simulations and the CMIP-3 pre-industrial control segments.
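The projection in SI 16 and SI 17 can be sketched as below. The function name is hypothetical, a random orthonormal matrix stands in for the EOFs, and the area weights and data are synthetic; the sizes (28 regions, 10 time points, 3 EOFs) are chosen only to be consistent with the dimensions quoted in the text.

```python
import numpy as np

def project(P_Z, area_weights, X, n_s):
    """Return vec(P_Z(n_s)^T W^(1/2) X) for an (R, n_t) field X."""
    P = P_Z[:, :n_s].T * np.sqrt(area_weights)[None, :]   # (n_s, R) projection operator
    loadings = P @ X                                      # (n_s, n_t) EOF time-series
    return loadings.flatten(order="F")                    # column-stacking vec, length n_s * n_t

rng = np.random.default_rng(2)
R, n_t, n_s = 28, 10, 3
P_Z, _ = np.linalg.qr(rng.normal(size=(R, R)))   # stand-in orthonormal EOFs
w = np.full(R, 1.0 / R)                          # stand-in area weights
X_theta = rng.normal(size=(R, n_t))              # stand-in transient minus control field

x_theta = project(P_Z, w, X_theta, n_s)
```

The same function would be applied to the observations to obtain y, so that both vectors live in the same state space of dimension $n_s \times n_t$.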

The eigenspectrum of $C_Z$ is shown on the left axis of Fig. SI 6, along with the cumulative variance explained with successive EOFs on the right. None of the eigenvalues are classed as degenerate by the North ratio 52, and apart from the first EOF there is no clear separation in the spectrum which makes the choice of truncation difficult. The first EOF, explaining over 80% of the variance across the ensemble, is very well correlated with the global-mean and shows a structure corresponding to enhanced warming over Northern Hemisphere land regions (Fig. SI 7). Subsequent EOFs show structures representing hemispheric and land-ocean contrasts, although interpreting subsequent EOFs physically is somewhat difficult given that very few properties of the climate system are orthogonal 53,54.

We note the Amazon region receives a relatively large weight in the first EOF, which is a likely consequence of the enhanced response observed in this particular model 55 compared to other AOGCMs. Our results are unchanged if the analysis is repeated removing the Amazon region.

The choice of 3 spatial EOFs in the main analysis is motivated by explaining over 90% of the ensemble variance, whilst keeping the resulting dimension such that small sample size issues do not dominate the estimation of required covariance matrices. Later we investigate the sensitivity of results to the spatial EOF truncation and also the impact of using an alternative physically motivated dimension reduction using the Karoly Indices 56.

On inter-decadal time-scales and in the leading EOFs, variability simulated by the CMIP-3 controls provides an adequate representation of the variability in observations de-trended using the CMIP-3 all-forcing ensemble mean (Fig. SI 8).

There is even less separation of the eigenspectrum in time and so we do not truncate the temporal dimension of the data. Thus, the $r^2$ measure simply compares observed and modelled time-series of the magnitude of the leading EOFs from 1961-2010. Fig. SI 9 shows the evolution of the time-series of the leading EOFs in the BBC CCE (blue lines) along with the projection of the observations (black lines).

Each modelled time-series is coloured by the goodness-of-fit, darker colours implying closer agreement, with thick blue lines indicating the 66% confidence interval on the EOF loadings. While the confidence intervals for the first 2 EOFs constrain the raw ensemble range, subsequent EOFs leave the range almost unchanged.

Estimation of covariance matrices

Estimating Cy and Cx requires some attention given the ratio of the dimension, p, to the number of samples in U1 and Uh1 respectively. The unbiased sample covariance matrix, C can be estimated from n samples of data, U (where the mean has been removed from each dimension) as,

$$C = \frac{1}{n-1} U U^T. \quad \text{(SI 18)}$$

In the limit that $n \gg p$ this provides a good estimator of the true covariance matrix, Σ, that generated the data. However when the size of n and p is comparable the sample covariance matrix can be a poor estimator of Σ, errors in which are amplified by the non-linear action of inversion57. When $p > n$ the sample covariance matrix is rank-deficient and therefore cannot be inverted: a pseudo-inverse58 must be used instead. The problem arises from the sample covariance matrix overestimating variance in the leading EOFs and correspondingly underestimating variance in low ranked EOFs given the finite size sample 53. Using a truncated version of C tends to retain components in which variance is overestimated, thus making the test unnecessarily conservative, not to mention the fact that the directions in state-space given by retained EOFs can also be heavily biased 57.

Regularizing the covariance estimation can alleviate some of the deficiencies of the sample covariance in this situation by shrinking the sample eigenvalues towards the grand-mean in an attempt to correct for the small sample size bias, whilst keeping the total variance fixed. Following ref. [58], we consider the Ledoit-Wolf regularized covariance estimator 57,

$$C_\rho = (1 - \rho) C + \rho \mu I, \quad \text{(SI 19)}$$

where ρ is the shrinkage factor, µ is the mean of the sample eigenvalues and I is the identity matrix. $C_\rho$ can show large improvements in mean square error (measured through the Frobenius norm, $\|C_\rho - \Sigma\|^2$) relative to C, and is full rank permitting inversion without needing to choose a sharp cut-off in the eigenspectrum.

This form of regularization can be viewed as a truncation, albeit based on firmer statistical grounds than the pseudo-inverse commonly used in climate change detection and attribution studies 38. The sharp cut-off when using the pseudo-inverse is equivalent to assuming the high-rank EOFs have infinite variance, which is a somewhat strong assumption to make.
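The effect of the shrinkage in SI 19 can be seen in a small numerical sketch. It makes two simplifying assumptions that differ from the study: the shrinkage factor ρ is fixed by hand rather than estimated from the data as in refs. [57, 58], and the data are synthetic Gaussian noise. The function names are hypothetical.

```python
import numpy as np

def sample_covariance(U):
    """Unbiased sample covariance of a (p, n) data matrix after removing row means."""
    U = U - U.mean(axis=1, keepdims=True)
    return U @ U.T / (U.shape[1] - 1)

def shrink_covariance(C, rho):
    """C_rho = (1 - rho) C + rho * mu * I, with mu the mean sample eigenvalue."""
    p = C.shape[0]
    mu = np.trace(C) / p          # mean eigenvalue equals trace / p
    return (1.0 - rho) * C + rho * mu * np.eye(p)

rng = np.random.default_rng(3)
p, n = 30, 20                      # fewer samples than dimensions
U = rng.normal(size=(p, n))
C = sample_covariance(U)
C_rho = shrink_covariance(C, rho=0.5)   # rho fixed by hand here, not estimated

rank_C = np.linalg.matrix_rank(C)           # rank-deficient because p > n
rank_C_rho = np.linalg.matrix_rank(C_rho)   # full rank after shrinkage
```

The total variance (the trace) is unchanged by the shrinkage, while the smallest eigenvalue is lifted to at least ρµ, which is what makes the regularized estimate invertible.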

We refer the reader to refs. [57, 58] for specific details on the derivation of Eqn. SI 19, which seeks a value of ρ to minimize the mean square error of $C_\rho$. Other regularization techniques are available 59,60, which all address the problem of poorly estimated components in the sample covariance.

We find that $C_\rho$ provides a more robust estimator of Σ, combating part of the small sample size bias in the covariance estimation. In particular with a regularized covariance estimate, the $r^2$ threshold of the control distribution used for hypothesis testing rises approximately in line with the F-distribution as more EOFs are included, indicating that the bias correction performed by the 2nd half of the controls to give the nominal Type I error rate is small. In contrast when using the sample covariance the required correction can be very large, indicating estimation errors in the covariance.

Sensitivity studies

We investigate the sensitivity of the uncertainty estimate for 2050 global-mean temperatures to a set of assumptions in the analysis, namely flux adjustment threshold, the spatial truncation and covariance estimation technique, estimate of the covariance of variability in the observations, warming over 1961-1990 relative to the control and biases in model climatologies.

Flux adjustment threshold. In the main analysis we used a threshold of ±5 W/m² on the global annual mean total flux adjustment, partially motivated by estimates of observational uncertainty in top-of-atmosphere fluxes 8. Uncertainty estimates are stable over thresholds of 2-10 W/m² and start to increase with larger thresholds owing to the inclusion of high climate sensitivity models indicated by the grey shading in Fig. SI 10a. It is questionable whether an absolute flux threshold or one relative to the tuned standard physics model version is more appropriate, given that the flux adjustment is primarily needed to overcome the low ocean resolution: the red line in Fig. SI 10a shows the impact of using a threshold relative to the standard physics global-mean flux adjustment of −4.7 W/m², which in particular does not impact the upper-bound of our uncertainty estimate.

The high sensitivity models are primarily associated with the low value of one model parameter (Fig. SI 10b), which at least in single-parameter perturbations gives this effect via unrealistic patterns of stratospheric water vapour 61. We note ref. [61] only considered a single-parameter perturbation and Fig. SI 10b indicates that the combination of other parameter perturbations can compensate and produce lower climate sensitivities. Additionally, high climate sensitivity model versions are often associated with large negative flux adjustments (Fig. SI 10b), hence positive (i.e. net downward) TOA imbalance in the control phase, due to the reduction in outgoing longwave radiation observed in the base climate of these model versions 24. The impact of the removal of the flux adjustment constraint is shown by comparing Fig. 2a,b to Fig. SI 10c,d: the inclusion of a number of high sensitivity models into the constrained ensemble causes an increase of approximately 0.4K in the upper bound of the “likely” range for 2050 warming.

Spatial truncation and covariance estimation. Whilst using independent controls for covariance estimation and uncertainty analysis ensures that the Type I error rate is correct, using the sample covariance makes uncertainty estimates acutely sensitive to the truncation as the number of spatial EOFs is increased above 4 (Fig. SI 11, black solid lines). Above 15 spatial EOFs (i.e. a 150 dimensional state-space) all models in the BBC CCE are rejected as being inconsistent with the observations at the 34% significance level used.

This could be because increasing the number of spatial EOFs introduces small scale features that no model can simulate accurately, thus causing estimated $r^2_\theta$ values to increase more quickly than expected if errors were due to internal variability alone. However, replacing the observations with the best fit model simulation (a perfect-model scenario), results in the same qualitative behaviour (Fig. SI 11, black dashed lines). This is attributable to errors in the estimation of covariance matrices, which are amplified as the number of spatial EOFs is increased, introducing components whose variances are underestimated, which subsequently receive high weight under inversion and dominate the $r^2$ calculation.

Conversely, using a regularized estimate of the covariance produces uncertainty estimates that are stable over the range of possible truncations in both cases (Fig. SI 11, red lines). This does not imply that the model simulations are consistent with the observations over all truncations, rather that the regularization automatically down-weights components that are likely to be poorly estimated given the ratio of the dimension to sample size in the covariance estimation. However, in the sample covariance poorly estimated components dominate the r2 calculation.

Should we expect this lack of sensitivity to truncation? Intuitively, in the context of a simple statistical analysis incorporating more data reduces uncertainty (e.g. uncertainty on regression coefficients as the number of data points is increased). However, in this case the additional spatial EOFs bring little new information as they correspond to smaller-scale spatial variability and little variance. The effective number of degrees of freedom 62 of the observational constraint saturates at around 40 in the full dimension state space (c.f. 280 dimensions), indicating that introducing additional spatial EOFs does not bring in new information.

When dealing with the relatively small dimension (3 EOFs) used in the main analysis, a number of the issues highlighted here are not relevant. We have presented this discussion to demonstrate the robustness of our uncertainty estimates to this choice. The low number of EOFs used also supports other assumptions made, namely using CMIP-3 control variability to represent observed variability, the independence of HadCM3L variability from forcing and θ, and the use of Gaussian distributions.

Additionally we have tried a physically-based dimension reduction using the Karoly Indices56 (KI), instead of the spatial EOF based dimension reduction. We create a new projection operator, $P_{KI}$, to extract 4 temperature indices: the global-mean, land-ocean contrast, hemispheric contrast and northern hemisphere meridional temperature gradient (the zonal band from 52.5°-67.5°N minus the zonal band 22.5°-37.5°N). The 4 indices explain approximately 80% of the spatial variance across the BBC CCE. Uncertainty estimates on the far right of Fig. SI 11 show that results based on the Karoly Indices using a regularized covariance are close to those based on a wide range of spatial EOFs. For the sample covariance these uncertainty estimates are in agreement with the results based on a small number of EOFs, which the Karoly Indices are probably an approximation to.

Magnitude of variability in observations, Cy. We have investigated how the uncertainty estimate evolves as the magnitude of estimated variability in the observations is changed through scaling Cy. This assumes that the correlation structure of variability is correct, but the magnitude may not be. We repeat the analysis replacing Cy with αCy, allowing α to vary in the range [0.2, 2], the extent of which is partially informed by the difference in variability across structurally different models43. Using a value of α larger than 1 can be thought of as adding an additional discrepancy term to the analysis to account for structural errors 63, or attempting to include observational uncertainty, since no formal estimate of the spatio-temporal error structure of observed surface temperatures currently exists 27. Fig. SI 12 indicates that the uncertainty estimate of 1.4-3K warming for 2050 is robust to this scaling.
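How the scaling enters the goodness-of-fit can be sketched by substituting αCy into SI 9. The example is a toy illustration: the function names, diagonal covariances, residual vector and ensemble sizes are invented. It also shows only the effect on $r^2_\theta$ for one model; in a full analysis the null distribution $\{r^2_i\}$ would presumably be recomputed with the same rescaled $C_N$, which this sketch does not attempt.

```python
import numpy as np

def noise_covariance(C_y, C_x, N_t, N_c, alpha=1.0):
    """C_N = alpha * C_y + (1/N_t + 1/N_c) * C_x, i.e. SI 9 with C_y rescaled."""
    return alpha * C_y + (1.0 / N_t + 1.0 / N_c) * C_x

def r2(residual, C_N):
    """Squared Mahalanobis distance of the residual under C_N."""
    return float(residual @ np.linalg.solve(C_N, residual))

# Made-up inputs, for illustration only.
C_y = np.diag([1.0, 0.5, 0.25, 0.1])
C_x = np.diag([0.8, 0.6, 0.2, 0.1])
residual = np.array([1.0, -0.5, 0.3, 0.2])     # y - x_theta for one hypothetical model

alphas = np.linspace(0.2, 2.0, 10)
r2_values = [r2(residual, noise_covariance(C_y, C_x, N_t=1, N_c=1, alpha=a))
             for a in alphas]
r2_at_one = r2(residual, noise_covariance(C_y, C_x, N_t=1, N_c=1, alpha=1.0))
```

Increasing α inflates the assumed noise, so the same residual yields a smaller $r^2_\theta$.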

Warming between control and 1961-1990. We have performed the analysis in this study using anomalies relative to a standard 1961-1990 baseline. This however does hide an important difference between model simulations owing to the warming between control simulation (representing conditions around 1900) and the 1961-1990 reference period, expressed as the mean value of the transient-control anomaly over 1961-1990. The distribution of this warming in the global-mean (reconstructed from Giorgi and ocean regions) is shown in Fig. SI 13a along with an estimate of the “observed” change, which we have estimated as the difference between 1891-1910 and 1961-1990 from HadCRUT3 27. We use observed in a loose sense given that the control simulations cannot be compared like for like to any period in the past. However to make a crude comparison, we assume that since the control simulations are spun-up to conditions around 1900 3, representative observations can be taken over the period 1891-1910. The uncertainty in the observed change, shown by the grey band as a 5-95% range, accounts for estimates of observational error 27 and also a component due to internal variability estimated from the CMIP-3 control segments.

We find that the dominant effect in determining this warming over 1961-1990 relative to the control is the scaling on anthropogenic sulphate emissions (anthsca), explaining approximately half of the variance in this quantity (Fig. SI 4a). Previously (see Sulphur cycle responses) we demonstrated that anthsca controls the model sulphate burden and hence at least the direct radiative effects of sulphate aerosols in an approximately linear fashion (Fig. SI 2). Hence, by removing this anomaly from each model simulation we remove much of the variance due to aerosol forcing, which therefore explains why we observed little correlation between climate sensitivity and sulphate burden in the constrained ensemble of simulations passing the $r^2$ test.

The relative paucity of observations at the start of the 20th century, which makes the use of spatial patterns of change problematic, and the difficulty in defining an exact period that control simulations correspond to, preclude the use of the warming as a formal constraint. We have however investigated the sensitivity of our results to applying a threshold on the warming and find that our upper bound is reduced by approximately 0.2K when using the 5-95% uncertainty estimate shown in Fig. SI 13a. Adopting a longer reference period of 1881-1920 for the observations, rather than 1891-1910, changes the results very little.

We are reluctant to use a reference period closer to the coupling of atmospheres and oceans in 1920 to avoid using data when models may still be spinning up.

Fig. SI 13 is identical to Fig. 2b, except the points are coloured by the mean warming over 1961-1990 relative to the control. There appears to be no systematic relationship to the 2050 warming relative to 1961-1990, which is perhaps expected given that the model climate sensitivity is the dominant factor in 2050 warming, explaining approximately 55% of the variance (Fig. SI 4b). Conversely the anthropogenic sulphate scaling is the predominant driver of the warming up to 1961-1990 and there is no a priori relationship between the two across the ensemble. We note that the illustrative models highlighted in Fig. 3 are outside of the uncertainty range for the 1961-1990 warming over the control, although as discussed above we do not use this as a formal constraint.

Biases in climatology. As a final sensitivity study, we have investigated the sensitivity of the uncertainty estimate for 2050 global-mean warming to an additional constraint penalizing biases in the climatology of each model. This is motivated by climate impact studies which generally require absolute values rather than anomalies 64 and arguments that evaluating goodness-of-fit based on anomalies does not lead to consistent results 65. There is a wide range in absolute values of global-mean temperature in CMIP-3 simulations (approximately 12-15°C for pre-industrial controls) relative to the uncertainty in the corresponding observed value, approximately 14 ± 0.3°C for the 1961-1990 global-mean temperature 66,67. Hence the necessary covariances required to perform the analysis with absolute values would be grossly overestimated, implicitly weakening the observational constraint. We therefore use a simpler test, investigating the sensitivity of uncertainty estimates to an additional threshold on the bias in global-mean temperature, defined as the difference between transient simulation mean over 1961-1990 and the observed mean over the same period 66,67. Fig. SI 14 indicates that the uncertainty estimate from the BBC CCE is relatively insensitive to this additional constraint viewed in the light of the uncertainty in the observed value of approximately ±0.3°C 67 and the range of warming that might be expected from internal variability alone (black error bar). Applying this same methodology to CMIP-3 and QUMP gives uncertainty estimates that are much more sensitive to the climatological temperature bias threshold, showing the difficulty of applying additional constraints with small ensemble sizes.

Estimation of climate sensitivity

Equilibrium double CO2 slab model simulations with physics perturbations matching the 153 atmospheric + sulphur cycle versions in the BBC CCE do not exist, so their equilibrium climate sensitivities are estimated using a statistical “emulator”. This approach is becoming widely adopted in the field, whereby a statistical model is trained on past evaluations of a climate model with perturbed physics and then used to predict various output quantities for new parameter combinations 35,68,69. The recent UKCP09 climate projections relied heavily on the use of statistical emulation 63. These algorithms can be thought of at the fundamental level as non-linear regressions of the climate model parameters onto output quantities of interest.

We use the random forest technique 26 to build a statistical emulator of climate sensitivity, based on a 14,001 member perturbed physics ensemble generated from climateprediction.net. The technique is popular in the machine learning community, requiring very little tuning (there are only 3 parameters) and offers very good performance compared to other approaches in common benchmark datasets 26,70.

All simulations are from HadSM3, which consists of the same atmospheric model coupled to a slab thermodynamic ocean rather than the full dynamical ocean in HadCM3L. Further experimental details can be found in ref. [71]. Subsequent simulations have recently been performed to vary parameters continuously rather than the original grid design, which improves the ability of the emulator to learn about how climate sensitivity changes as we vary the model parameters. The 14,001 member ensemble consists of perturbations to all of the atmospheric parameters listed in Table SI 1. The impact of sulphur cycle perturbations on climate sensitivity is very small 2, thus climate sensitivities are based on atmospheric parameters alone.

To demonstrate the accuracy of climate sensitivity estimates we validate predictions using the 14,001 member HadSM3 ensemble. We perform a 10-fold cross-validation splitting the 14,001 member ensemble randomly into 10 segments, and fit 10 versions of the random forest leaving out each segment in turn from the training data. This produces 14,001 out-of-sample predictions (by predicting a given ensemble member using the random forest where it was not used in the fitting), that can be compared to the simulated values (Fig. SI 15a).

The random forest is not expected to perfectly fit the simulated values of climate sensitivity because of internal variability and estimation error. Fig. SI 15a indicates that the RF predictions are very accurate, explaining over 95% of the variance in the simulated values. Integrated over all values of climate sensitivity the root mean-square prediction error is approximately 0.3K (Fig. SI 15b). This is in line with estimates of the uncertainty in the simulated values of climate sensitivity, indicating that the random forest is picking up the predictable structure of the variation of climate sensitivity across the parameter space.
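The cross-validation procedure described above can be sketched in a few lines. Note the substitution: to keep the example self-contained, ordinary least squares is used as a stand-in regressor in place of the random forest, and the parameter values and "climate sensitivities" are synthetic. All function names are hypothetical; any regressor exposing a fit and predict step could be passed in instead.

```python
import numpy as np

def kfold_out_of_sample(X, y, fit, predict, k=10, seed=0):
    """Predict every sample from a model fitted without that sample's fold."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)   # random split into k segments
    y_hat = np.empty(n)
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        model = fit(X[train], y[train])
        y_hat[held_out] = predict(model, X[held_out])
    return y_hat

# Stand-in regressor (NOT a random forest): linear least squares with an intercept.
def fit_ols(X, y):
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

rng = np.random.default_rng(4)
X = rng.uniform(size=(500, 3))                     # hypothetical parameter values
y = 3.0 + X @ np.array([1.5, -0.7, 0.4]) + rng.normal(scale=0.3, size=500)
y_hat = kfold_out_of_sample(X, y, fit_ols, predict_ols, k=10)

rmse = np.sqrt(np.mean((y - y_hat) ** 2))          # out-of-sample prediction error
explained_variance = 1.0 - np.var(y - y_hat) / np.var(y)
```

Because every prediction comes from a model that never saw the corresponding sample, the resulting error reflects both the noise in the simulated values and the estimation error of the emulator.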

The statistical error estimates are approximately 30% smaller than a similar study which found a 1 standard deviation error of approximately 0.45K 68. This can be attributed to the climateprediction.net ensemble being both approximately 50 times larger and exploring a lower dimensional parameter space.

Supplementary References

1. Jones, C. D. & Palmer, J. R. Spinup methods for HadCM3L. Tech. Rep. CRTN 84, Hadley Centre for Climate Prediction and Research (1998).

2. Ackerley, D., Highwood, E. J. & Frame, D. J. Quantifying the effects of perturbing physics of an interactive sulfur scheme using an ensemble of GCMs on the climateprediction.net platform. J. Geophys. Res. 114 (2009). D01203.

3. Frame, D. J. et al. The climateprediction.net BBC climate change experiment: design of the coupled model ensemble. Phil. Trans. Roy. Soc. Lond. A 367, 855–870 (2009).

4. Nakicenovic, N. & Swart, R. Special Report on Emissions Scenarios (Cambridge University Press, 2000).

5. Stott, P. A. & Kettleborough, J. A. Origins and estimates of uncertainty in predictions of twenty-first century temperature rise. Nature 416, 723–726 (2002).

6. Moss, R. et al. Towards New Scenarios For Analysis Of Emissions, Climate Change, Impacts and Response Strategies. Tech. Rep., IPCC (2008).

7. IPCC. Climate Change 2001: The Scientific Basis: Contribution of Working Group I to the Third Assessment Report of the Intergovernmental Panel on Climate Change. Tech. Rep., IPCC Working Group I (2001).

8. Collins, M. et al. Climate model errors, feedbacks and forcings: a comparison of perturbed physics and multi-model ensembles. Climate Dynamics 36, 1737–1766 (2010).

9. Jones, A., Roberts, D. L. & Slingo, A. A climate model study of the indirect radiative forcing by anthropogenic sulphate aerosols. Nature 370, 450–453 (1994).

10. Textor, C. et al. Analysis and quantification of the diversities of aerosol life cycles within AeroCom. Atmos. Chem. Phys. 6, 1777–1813 (2006).

11. Sato, M., Hansen, J. E., McCormick, M. P. & Pollack, J. B. Stratospheric aerosol optical depth, 1850-1990. J. Geophys. Res. 98, 22987–22994 (1993). Doi:10.1029/93JD02553.

12. Ammann, C. M., Meehl, G. A., Washington, W. M. & Zender, C. S. A monthly and latitudinally varying volcanic forcing dataset in simulations of 20th century climate. Geophys. Res. Lett. 30 (2003). Doi:10.1029/2003GL016875.

13. Ricke, K., Morgan, M. G. & Allen, M. R. Regional climate response to solar-radiation management. Nature Geoscience 3, 537–541 (2010).

14. Crowley, T. J. Causes of climate change over the past 1000 years. Science 289 (2000). Doi:10.1126/science.289.5477.270.

15. Hoyt, D. & Schatten, K. A discussion of plausible solar irradiance variations, 1700–1992. J. Geophys. Res. 98A (1993). Doi:10.1029/93JA01944.

16. Lean, J., Beer, J. & Bradley, R. Reconstruction of solar irradiance since 1610: implications for climate change. Geophys. Res. Lett. 3195–3198 (1995). Doi:10.1029/95GL03093.

17. Solanki, S. & Krivova, N. Can solar variability explain global warming since 1970? J. Geophys. Res. 108, 1200 (2003). Doi:10.1029/2002JA009753.

18. Lockwood, M. Personal Communication.

19. Lockwood, M. Solar change and climate: an update in the light of the current exceptional solar minimum. Phil. Trans. Roy. Soc. Lond. A 466, 303–329 (2010).

20. Schultz, C. et al. Radiative forcing by aerosols as derived from the aerocom present-day and pre-industrial simulations. Atmos. Chem. Phys. 6, 5225–5246 (2006).

21. Knutti, R. Why are climate models reproducing the observed global surface warming so well? Geophys. Res. Lett. 35, L18704, doi:10.1029/2008GL034932 (2008).

22. Kiehl, J. Twentieth century climate model response and climate sensitivity. Geophys. Res. Lett. 34, L22710, doi:10.1029/2007GL031383 (2007).

23. Yamazaki, K. Exploring the Impact of Ocean Representation on Ensemble Simulations of Climate Change. D.Phil thesis, University of Oxford (2008).

24. Sanderson, B. M., Piani, C., Ingram, W. J., Stone, D. A. & Allen, M. R. Towards constraining climate sensitivity by linear analysis of feedback patterns in thousands of perturbed-physics GCM simulations. Climate Dynamics 30, 175–190 (2008).

25. Sanderson, B. M., Shell, K. M. & Ingram, W. Climate feedbacks determined using radiative kernels in a multi-thousand member ensemble of AOGCMs. Climate Dynamics 35, 1219–1236 (2010).

26. Breiman, L. Random Forests. Machine Learning 45, 5–32 (2001).

27. Brohan, P., Kennedy, J. J., Harris, I., Tett, S. F. B. & Jones, P. D. Uncertainty estimates in regional and global observed temperature changes: A new data set from 1850. J. Geophys. Res. 111, D12106, doi:10.1029/2005JD006548 (2006).

28. Rayner, N. A. et al. Improved Analyses of Changes and Uncertainties in the Sea Surface Temperature Measured In Situ since the Mid-Nineteenth Century: The HadSST2 Dataset. J. Clim. 19, 446–469 (2006).

29. Stott, P. A. & Tett, S. F. B. Scale-Dependent Detection of Climate Change. J. Clim. 11, 3282–3294 (1998).

30. Masson, D. & Knutti, R. Spatial-Scale Dependence of Climate Model Performance in the CMIP3 Ensemble. J. Clim. 24, 2680–2692 (2011).

31. Allen, M. R., Stott, P. A., Mitchell, J. F. B., Schnur, R. & Delworth, T. L. Quantifying the uncertainty in forecasts of anthropogenic climate change. Nature 407, 617–620 (2000).

32. Stott, P. A. et al. Observational constraints on past attributable warming and predictions of future global warming. J. Clim. 19, 3055–3069 (2006).

33. Gregory, J. M., Stouffer, R. J., Raper, S. C. B., Stott, P. A. & Rayner, N. A. An Observationally Based Estimate of the Climate Sensitivity. J. Clim. 15, 3117–3121 (2002).

34. Knutti, R., Furrer, R., Tebaldi, C., Cermak, J. & Meehl, G. A. Challenges in combining projections from multiple climate models. J. Clim. 23, 2739–2758 (2010).

35. Sanderson, B. M. et al. Constraints on model response to greenhouse gas forcing and the role of subgrid-scale processes. J. Clim. 21, 2384–2400 (2008).

36. Collins, M. et al. Towards quantifying uncertainty in transient climate change. Climate Dynamics 27, 127–147 (2006).

37. Toth, Z. Circulation patterns in phase space: A multinormal distribution? Mon. Weather Rev. 1501–1511 (2001).

38. Allen, M. R. & Tett, S. F. B. Checking for model consistency in optimal fingerprinting. Climate Dynamics 15, 419–434 (1999).

39. Tomassini, L., Reichert, P., Knutti, R., Stocker, T. F. & Borsuk, M. E. Robust Bayesian Uncertainty Analysis of Climate System Properties Using Markov Chain Monte Carlo Methods. J. Clim. 20, 1239–1254 (2007).

40. Allen, M. R. Do it yourself climate prediction. Nature 401 (1999).

41. Stott, P. A. et al. Attribution of twentieth century temperature change to natural and anthropogenic causes. Climate Dynamics 17, 1–21 (2001).

42. Gillett, N. P. et al. Detecting anthropogenic influence with a multi-model ensemble. Geophys. Res. Lett. 29 (2002).

43. Barnett, T. P. Comparison of Near-Surface Air Temperature Variability in 11 Coupled Global Climate Models. J. Clim. 12, 511–518 (1999).

44. Allen, M. R. & Smith, L. A. Detecting irregular oscillations in the presence of coloured noise. J. Clim. 9, 3373–3404 (1996).

45. Hegerl, G. C., Jones, P. D. & Barnett, T. P. Effect of Observational Sampling Error on the Detection of Anthropogenic Climate Change. J. Clim. 14, 198–207 (2001).

46. Collins, M., Tett, S. F. B. & Cooper, C. The internal climate variability of HadCM3, a version of the Hadley Centre coupled model without flux adjustments. Climate Dynamics 17, 61–81 (2001).

47. Wigley, T. M. L. & Raper, S. C. B. Natural variability of the climate system and detection of the greenhouse effect. Nature 344, 324–327 (1990).

48. Pawitan, Y. In all Likelihood: Statistical Modeling and Inference Using Likelihood (Oxford Univ. Press, 2001).

49. Frame, D. J. et al. Constraining climate forecasts: The role of prior assumptions. Geophys. Res. Lett. 32, L09702, doi:10.1029/2004GL022241 (2005).

50. Allen, M. R., Frame, D. J., Kettleborough, J. & Stainforth, D. A. Model error in weather and climate forecasting. In Palmer, T. & Hagedorn, R. (eds.) Predictability of Weather and Climate, 391–427 (Cambridge University Press, 2006).

51. Weigel, A. P., Knutti, R., Liniger, M. & Appenzeller, C. Risks of Model Weighting in Multimodel Climate Projections. J. Clim. 23, 4175–4191 (2010).

52. North, G. R., Bell, T. L., Cahalan, R. F. & Moeng, F. J. Sampling errors in the estimation of empirical orthogonal functions. Mon. Weather Rev. 110, 699–706 (1982).

53. Von Storch, H. & Hannoschock, G. Statistical aspects of estimated principal vectors (EOFs) based on small sample sizes. J. Clim. 24, 716–724 (1986).

54. Dommenget, D. & Latif, M. A cautionary note on the interpretation of EOFs. J. Clim. 15, 216–225 (2002).

55. Cox, P. M., Betts, R. A., Jones, C. D., Spall, S. A. & Totterdell, I. J. Acceleration of global warming due to carbon-cycle feedbacks in a coupled climate model. Nature 408, 184–187 (2000).

56. Karoly, D. J. & Braganza, K. Identifying global climate change using simple indices. Geophys. Res. Lett. 28, 2205–2208 (2001).

57. Ledoit, O. & Wolf, M. A well-conditioned estimator for large-dimensional covariance matrices. J. Multivar. Anal. 88, 365–411 (2004).

58. Ribes, A., Azaïs, J. M. & Planton, S. Adaptation of the optimal fingerprint method for climate change detection using a well-conditioned covariance matrix estimate. Climate Dynamics 33, 707–722 (2009).

59. Furrer, R. & Bengtsson, T. Estimation of high-dimensional covariance matrices in Kalman filter variants. J. Multivar. Anal. 98, 227–255 (2007).

60. Bickel, P. J. & Levina, E. Regularized Estimation of Large Covariance Matrices. Ann. Statist. 36, 199–227 (2008).

61. Joshi, M. M., Webb, M. J., Maycock, A. C. & Collins, M. Stratospheric water vapour and high climate sensitivity in a version of the HadSM3 climate model. Atmos. Chem. Phys. Discuss. 10, 6241–6255 (2010).

62. Bretherton, C. S., Widmann, M., Dymnikov, V. P., Wallace, J. M. & Blade, I. The Effective Number of Spatial Degrees of Freedom of a Time-Varying Field. J. Clim. 12, 1990–2009 (1999).

63. Murphy, J. M. et al. A methodology for probabilistic predictions of regional climate change from perturbed physics ensembles. Phil. Trans. Roy. Soc. Lond. A 365, 1993–2028 (2007).

64. Macadam, I., Pitman, A. J., Whetton, P. H. & Abramowitz, G. Ranking climate models by performance using actual values and anomalies: Implications for climate change impact assessments. Geophys. Res. Lett. 37, L16704 (2010).

65. Reifen, C. & Toumi, R. Climate projections: Past performance no guarantee of future skill? Geophys. Res. Lett. (2009). L13704.

66. Jones, P. D., New, M., Parker, D. E., Martin, S. & Rigor, I. G. Surface air temperature and its changes over the past 150 years. Rev. Geophys. 37, 173–200 (1999).

67. Hansen, J., Ruedy, R., Sato, M. & Lo, K. Global Surface Temperature Change. Rev. Geophys. 48 (2010). RG4004.

68. Rougier, J., Sexton, D. M. H., Murphy, J. M. & Stainforth, D. Analyzing the Climate Sensitivity of the HadSM3 Climate Model Using Ensembles from Different but Related Experiments. J. Clim. 22, 3540–3557 (2009).

69. Sansò, B. & Forest, C. E. Statistical calibration of climate system properties. Applied Statistics 58, 485–503 (2009).

70. Meinshausen, N. Quantile Regression Forests. Journal of Machine Learning Research 7, 983–999 (2006).

71. Stainforth, D. A. et al. Uncertainty in predictions of the climate response to rising levels of greenhouse gases. Nature 433, 403–406 (2005).

Name | Unit | Description | Perturbation
alpham# | Fraction | Albedo of sea ice at melting-point. | 0.5, 0.57, 0.65
ct | 10^-4 s^-1 | Accretion constant. Time constant for conversion of cloud droplets to rain. | 0.5, 1, 4
cw land† | 10^-4 kg m^-3 | Threshold cloud water content for rain over land. | 1, 2, 20
cw sea† | 10^-5 kg m^-3 | As cw land except over the sea. | 2, 5, 50
dtice# | K | Temperature range below melting-point over which sea-ice albedo varies linearly between the melting albedo and the cold ice albedo. | 2, 5, 10
eacf | Fraction | Large-scale cloud coverage when the specific humidity in the grid box is equal to the saturation value. | 0.50, 0.63, 0.67
entcoef | | Scales the mixing between convective clouds and surrounding environmental air. | 0.6, 1, 3, 9
i cnv ice lw‡ | Type | Switch to allow for non-spherical ice particles in the radiation scheme. | 1, 7
i cnv ice sw‡ | Type | '' | 3, 7
i st ice lw‡ | Type | '' | 1, 7
i st ice sw‡ | Type | '' | 2, 7
ice size | m | Radius of ice crystals in clouds as seen by the radiation scheme. | 2.5, 3, 4
rhcrit | % | Critical relative humidity for cloud formation. Relates the grid box atmospheric humidity to the amount of cloud in that grid box. | 0.65, 0.73, 0.90
vf1 | m s^-1 | Cloud ice fall speed. | 0.5, 1, 2

Supplementary Table SI 1: List of perturbed atmospheric physics parameters in the BBC CCE. Bold numbers indicate default values. Parameters with identical symbols after their names are perturbed together. Values for eacf and rhcrit are the average over the 19 model levels in the atmosphere.

Name | Unit | Description | Perturbation
anthsca | Fraction | Scaling factor for anthropogenic sulphur dioxide emissions. | 0.5, 0.8, 1, 1.2, 1.5
cloudtau | s | Timescale for air to transit through a cloud. | 3600, 10800, 32400
l0¶ | 10^-5 | Sulphate mass scavenging parameter. | 2.17, 6.5, 19.5
l1¶ | 10^-5 | '' | 0.99, 2.96, 8.86
num star | 10^6 | Threshold for condensation onto accumulation mode (one of the three modes of sulphate particles). | 0.1, 1, 10
so2 high level | Model level | Model level for high level SO2 emissions. | 1, 3, 5
volsca | Fraction | Scaling factor for magnitude of volcanic emissions. | 1, 2, 3
isopyc surf## | m^2 s^-1 | Isopycnal diffusion at surface. | 200, 1000, 2000
isopyc depth## | m^2 s^-1 | Isopycnal diffusion at depth. | 200, 1000, 2000
vertvisc | 10^-5 m^2 s^-1 | Background ocean tracer vertical viscosity. | 0.5, 1
vdiffsurf§ | 10^-5 m^2 s^-1 | Background ocean tracer vertical diffusivity at surface. | 0.5, 1, 2
vdiffdepth§ | 10^-8 m s^-1 | Increase in background ocean tracer vertical diffusivity with depth. | 0.7, 2.8, 9.6
mllam% | Fraction | Wind mixing energy scaling factor. | 0.3, 0.7
delta SI% | m | Decay of wind mixing energy with depth. | 50, 100
haney | W m^-2 K^-1 | Haney heat coefficient. | 81.88, 163.76
haneyfsact | Fraction | Scaling factor for Haney salinity coefficient. | 0.25, 1

Supplementary Table SI 2: List of perturbed sulphur cycle and ocean physics parameters in the BBC CCE. Bold numbers indicate default values. Parameters with identical numbers after their names are perturbed together.

Name | Label | Min Lat (°N) | Max Lat (°N) | Min Lon (°E) | Max Lon (°E)
Alaska-NW Canada | ALA | 60 | 72 | -170 | -103
E Canada etc. | CGI | 50 | 75 | -103 | -10
Western N America | WNA | 30 | 60 | -130 | -103
Central N America | CNA | 30 | 50 | -103 | -85
Eastern N America | ENA | 25 | 50 | -85 | -50
Central America | CAM | 10 | 30 | -116 | -83
Amazonia | AMZ | -20 | 12 | -82 | -34
Southern S America | SSA | -56 | -20 | -76 | -40
Northern Europe | NEU | 48 | 75 | -10 | 40
S Europe-N Africa | MED | 30 | 48 | -10 | 40
Sahara | SAH | 18 | 30 | -20 | 65
Western Africa | WAF | -12 | 18 | -20 | 22
Eastern Africa | EAF | -12 | 18 | 22 | 52
Southern Africa | SAF | -35 | -12 | 10 | 52
Northern Asia | NAS | 50 | 70 | 40 | 180
Central Asia | CAS | 30 | 50 | 40 | 75
Tibetan Plateau | TIB | 30 | 50 | 75 | 100
Eastern Asia | EAS | 20 | 50 | 100 | 145
Southern Asia | SAS | 5 | 30 | 65 | 100
Southeast Asia | SEA | -11 | 20 | 95 | 155
Northern Australia | NAU | -30 | -11 | 110 | 155
Southern Australia | SAU | -45 | -30 | 110 | 155
North Atlantic Ocean | NATL | 0 | 65 | -94 | 11
South Atlantic Ocean | SATL | -50 | 0 | -68 | 19
North Pacific Ocean | NPAC | 0 | 60 | -79 | 101
South Pacific Ocean | SPAC | -50 | 0 | -71 | 105
North Indian Ocean | NIND | 0 | 25 | 45 | 98
South Indian Ocean | SIND | -50 | 0 | 22 | 146

Supplementary Table SI 3: Definitions of regions used in the analysis. The first 22 are Giorgi regions, and the final 6 refer to ocean basins. The second column gives the label used in Fig. SI 7. Extents for ocean regions correspond to the bounding box defined in the basin mask (Fig. SI 7).

Model ID | Model Name | 20th century forcing | # ens | # ctrl 1 | # ctrl 2
1 | GFDL CM2 0 | all | 1 | 20 (3) | 20 (2)
2 | GFDL CM2 1 | all | 3 | 33 (4) | 33 (4)
3 | GISS Model E H | all | 3 | 17 (3) | 17 (2)
4 | GISS Model E R | all | 5 | 36 (5) | 36 (4)
5 | INMCM3 0 | all | 1 | 11 (1) | 10 (2)
6 | MIROC3 2 MedRes | all | 3 | 36 (5) | 36 (4)
7 | MIUB ECHO G | all | 2 | 20 (3) | 19 (2)
8 | MRI CGCM2 3 2a | all | 3 | 21 (3) | 21 (3)
9 | NCAR CCSM3 0 | all | 2 | 38 (5) | 38 (5)
10 | CCCMA CGCM3 1 T47 | anthropogenic only | 5 | 86 (9) | 86 (10)
11 | CCCMA CGCM3 1 T63 | anthropogenic only | 1 | 21 (3) | 21 (3)
12 | CNRM CM3 | anthropogenic only | 1 | 24 (3) | 24 (3)
13 | CSIRO MK3 0 | anthropogenic only | 1 | 14 (2) | 14 (2)
14 | GISS AOM | anthropogenic only | 2 | 22 (4) | 22 (4)
15 | IAP FGOALS1 0 G | anthropogenic only | 2 | 25 (5) | 26 (4)
16 | IPSL CM4 | anthropogenic only | 1 | 36 (4) | 36 (5)
17 | MPI ECHAM5 | anthropogenic only | 2 | 37 (4) | 36 (5)
18 | UKMO HadCM3 | anthropogenic only | 1 | 89 (9) | 88 (10)
19 | UKMO HadGEM1 | anthropogenic only | 1 | 7 (1) | 8 (2)
- | NCAR PCM1 | - | - | 21 (3) | 21 (3)
- | Total | - | - | 614 (79) | 612 (79)

Supplementary Table SI 4: Details of CMIP-3 simulations used in the analysis. Model ID refers to the numeric label of each model in Fig. SI 10c,d. The “anthropogenic only” label in the 3rd column refers to models that were only forced with changes in anthropogenic forcing in the 20th century run that was extended under the SRES A1B scenario. The fourth column, # ens, refers to the number of transient ensemble members for each model. The final two columns refer to the number of 50-year segments from the corresponding pre-industrial control simulation used for estimating Cy and for uncertainty analysis respectively. The first number in each is the number of samples when overlapping each simulation at 5-year frequency, and the number in brackets refers to the number of non-overlapping segments.

[Figure SI 1 panels: a, Greenhouse gas Forcing (CO2, CH4, N2O, CFC-11, CFC-12 and Total; Radiative Forcing (W/m2) against Year). b, Volcanic Forcing (historical and future scenarios; Radiative Forcing (W/m2) against Year). c, Sulphur Dioxide Emissions by anthsca (Anthropogenic Sulphur Dioxide Emissions (TgS/yr) against Year). d, Solar Forcing (historical and future scenarios; Solar Insolation (W/m2) against Year).]

Supplementary Figure SI 1: Input forcing and sulphate emission scenarios used in the BBC CCE. a, Radiative forcing due to well mixed greenhouse gases under historical and future SRES A1B scenarios. b, Radiative forcing due to volcanic emissions, specified as 5 possible historical scenarios, and 10 future scenarios. Stratospheric optical depths at 0.55 microns specified in each simulation are converted to radiative forcings using the approximate formula, 0.1 units optical depth = −2.5 W/m2 radiative forcing. c, Global annual total anthropogenic SO2 emissions under historical and future SRES A1B scenario which drive the interactive sulphur cycle. Time-series are scaled at 5 values in the range [0.5,1.5] to account for uncertainty in the emissions. d, Solar forcing, specified as 5 possible estimates over 1920-2003 and, for each, 3 possible future scenarios assuming a reverse in the historical trend, continuation of the historical trend, and no net trend in the 21st century.
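As a small worked illustration of the optical depth conversion and the emissions scaling described in this caption, the sketch below applies the approximate linear formula. The function and constant names are hypothetical, the forcing is assumed to be negative (stratospheric aerosol cools the surface), and the input values are illustrative rather than taken from the forcing datasets.

```python
# Approximate conversion: 0.1 units of optical depth at 0.55 microns
# corresponds to -2.5 W/m2, i.e. -25 W/m2 per unit optical depth.
FORCING_PER_UNIT_OPTICAL_DEPTH = -25.0  # W/m2 (sign assumed negative)

# The five scalings on anthropogenic SO2 emissions (anthsca).
ANTHSCA_VALUES = (0.5, 0.8, 1.0, 1.2, 1.5)


def volcanic_forcing(optical_depth):
    """Radiative forcing (W/m2) from stratospheric optical depth."""
    return FORCING_PER_UNIT_OPTICAL_DEPTH * optical_depth


def scaled_emissions(emissions, anthsca):
    """Scale an emissions time-series (TgS/yr) by one anthsca value."""
    if anthsca not in ANTHSCA_VALUES:
        raise ValueError("anthsca must be one of %r" % (ANTHSCA_VALUES,))
    return [anthsca * e for e in emissions]


# Illustrative inputs only.
print(volcanic_forcing(0.1))
print(scaled_emissions([60.0, 70.0, 65.0], 1.5))
```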

[Figure SI 2 panels: a, Global mean Sulphate Burden (Sulphate Burden (mgSO4/m2) against Year, coloured by anthsca, with the AeroCom range marked). b, Sulphate Burden v Climate Sensitivity (Year 2000 Sulphate Burden (mgSO4/m2) against Climate Sensitivity (K), coloured by anthsca, with the AeroCom range marked).]

Supplementary Figure SI 2: Global-decadal-mean sulphate burdens (SO4) from a subset of BBC CCE simulations. a, Time-series of sulphate burdens coloured by anthsca. Simulations with a bug in the sulphur cycle are shown in grey. b, Sulphate burden in 2000 plotted against estimated equilibrium climate sensitivity. In a,b the black bar indicates the mean and ±2 standard deviation range for year 2000 [10].

[Figure SI 3 panels: a, Standard Physics relative to control simulation (Transient − control anomaly (K) against Year, coloured by anthsca). b, Standard Physics relative to 1961−1990 mean anomaly (Transient − control relative to 1961−1990 mean anomaly (K) against Year, coloured by anthsca). c, Variance decomposition of transient−control anomaly. d, Variance decomposition relative to 1961−1990 mean anomaly. Panels c and d show Standard deviation (K) against Year (centre of 20−year period) for Volcanic, Solar, anthsca, Noise and Total.]

Supplementary Figure SI 3: Impact of the removal of the 1961-1990 mean transient-control anomaly in a large ensemble of the standard physics model version varying initial conditions, natural forcing and scaling on anthropogenic sulphate emissions (anthsca). a, Time-series of transient-control anomalies coloured by the value of the scaling on anthropogenic sulphate emissions. b, Time-series of transient-control anomalies, with the mean transient-control anomaly over 1961-1990 removed. c, Decomposition of variance in global 20-year mean transient-control anomalies from the standard physics ensemble, expressed as a contribution from volcanic (red), solar (green) forcing, and anthsca (blue). Black dashed line indicates the unexplained variance in a linear-regression of the 20-year mean transient-control anomalies at each time-point onto the 3 factors. Values are expressed as standard-deviation equivalents and hence should be added in quadrature to calculate the total variance. d, As c except the regression is onto the transient-control anomaly relative to the mean 1961-1990 anomaly.
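Adding standard-deviation equivalents in quadrature can be sketched as follows. The contribution values are hypothetical and chosen only for illustration (they are not read from the figure), the function name is an assumption, and the combination presumes the factors are uncorrelated.

```python
import math


def total_standard_deviation(contributions):
    """Combine standard-deviation contributions (K) in quadrature."""
    return math.sqrt(sum(s * s for s in contributions.values()))


# Hypothetical contributions (K) for a single 20-year period.
example = {"volcanic": 0.03, "solar": 0.02, "anthsca": 0.10, "noise": 0.06}

total = total_standard_deviation(example)

# Each factor's share of the total variance is its squared contribution
# divided by the squared total.
fractions = {name: s * s / total ** 2 for name, s in example.items()}
print(total, fractions)
```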

[Figure SI 4 panels: a, Variance decomposition relative to control. b, Variance decomposition relative to mean 1961−1990 anomaly. Both panels show Standard deviation (K) against Year (centre of 20−year period) for CS, Ocean, Volcanic, Solar, anthsca and Noise.]

Supplementary Figure SI 4: Variance decomposition across the BBC CCE varying initial conditions, natural forcing, scaling on anthropogenic emissions and model physics parameters. a, Decomposition of variance in global 20-year mean transient-control anomalies, expressed as the contribution to the standard deviation, based on a multiple-linear regression of climate sensitivity (orange), ocean vertical diffusivity (magenta), volcanic (red) and solar (green) forcing, and anthsca (blue) onto the anomaly. Contribution of unexplained variance is shown by the black dashed line. b, As a with the regression onto the transient-control anomaly relative to the mean 1961-1990 anomaly.

[Figure SI 5 schematic]

Inputs: Observations, $Y$; AOGCM transient-control, $X$; CMIP-3 control segments, $V_i$.

Step 1. Express in terms of the leading spatial temperature patterns (EOFs) of BBC CCE using projection operator, $P$: $y = \mathrm{vec}(PY)$, $x = \mathrm{vec}(PX)$ and $u_i = \mathrm{vec}(PV_i)$.

Step 2. Estimate $C_N$, the covariance of expected model error due to internal variability, using segments from the 1st half of each control simulation.

Step 3a. Calculate model error for each AOGCM simulation, $r^2 = (y - x)^T C_N^{-1} (y - x)$.

Step 3b. Calculate the distribution of expected model error arising from internal variability alone, $r_i^2 = u_i^T C_N^{-1} u_i$, using segments from the 2nd half of each control.

Step 4. Reject AOGCM simulation if $r^2 > r^2_{1-\alpha}$, where $r^2_{1-\alpha}$ is the $1-\alpha$ quantile of the distribution $\{r_i^2\}$.

Supplementary Figure SI 5: Schematic showing the steps involved in the comparison of observations with each AOGCM transient-control simulation. Giorgi and ocean region data is extracted and averaged to 5 year resolution over 1961-2010 first. α corresponds to the significance level of the test. For example, α = 0.34 generates a 66% confidence interval. vec() is an operator that creates a column vector from a matrix by stacking the columns on top of each other.
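The test in this schematic can be sketched in a few lines, under stated assumptions. The inputs are taken to be already projected onto the EOFs and stacked into vectors. The covariance is estimated here as a plain second moment of the control segments about zero, which is an assumption of this sketch and not necessarily the estimator used in the analysis. All function names are hypothetical, and a small Gaussian elimination routine stands in for a linear algebra library.

```python
import math


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def second_moment(segments):
    """Estimate C_N as the mean outer product of the control segments."""
    n, m = len(segments[0]), len(segments)
    return [[sum(u[i] * u[j] for u in segments) / m for j in range(n)]
            for i in range(n)]


def r_squared(d, C):
    """Quadratic form d^T C^{-1} d."""
    return sum(di * zi for di, zi in zip(d, solve(C, d)))


def empirical_quantile(values, q):
    """Smallest sample value with at least a fraction q at or below it."""
    s = sorted(values)
    return s[max(0, math.ceil(q * len(s)) - 1)]


def reject_simulation(y, x, first_half, second_half, alpha=0.34):
    """Return (rejected, r2) for one simulation x against observations y."""
    C_N = second_moment(first_half)                        # step 2
    r2 = r_squared([a - b for a, b in zip(y, x)], C_N)     # step 3a
    null = [r_squared(u, C_N) for u in second_half]        # step 3b
    return r2 > empirical_quantile(null, 1.0 - alpha), r2  # step 4
```

Estimating the covariance from one half of each control simulation and the reference distribution from the other half keeps the two estimates independent, so the threshold is not biased by reusing the same segments.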

[Figure SI 6: Eigenvalue (logarithmic, left axis) and Cumulative fraction of variance (right axis) against Spatial EOF.]

Supplementary Figure SI 6: Eigenspectrum of the spatial variability in the BBC CCE over 1961-2010 (points, left axis), and cumulative variance explained (solid line, right axis). Dashed lines indicate the truncation of 3 EOFs used in the main analysis.

[Figure SI 7: map of the EOF loading (colour scale from −1.0 to 1.0) against Longitude (deg) and Latitude (deg), with the Giorgi and ocean regions marked by the labels listed in Table SI 3.]

Supplementary Figure SI 7: First spatial EOF of surface temperature evolution from the BBC CCE, explaining approximately 83% of the variance across the ensemble. Colour scale indicates the magnitude of the loading in the EOF after removing the area-weighting contribution. Where Giorgi and ocean regions overlap, the ocean region is shown in front. Letters indicate the label of each region, and dotted lines indicate the extent of the ocean regions (Table SI 3).

[Figure SI 8 panels: a, EOF 1 (83.1% variance). b, EOF 2 (5.33% variance). c, EOF 3 (2.66% variance). d, EOF 4 (1.57% variance). Each panel shows Spectral Density (K2 a−1) against Time Period (years).]

Supplementary Figure SI 8: Power Spectra for the first 4 spatial EOFs over 1961-2010. Blue line, observations with the externally forced trend removed using the CMIP-3 ALL forcing mean over the same period. Grey lines, 50-year segments of non-overlapping CMIP-3 control simulations with the 5-95% interval at each point shown by black dashed lines. Black line, mean spectra over all control segments. Error bars in top left, 5-95% estimation uncertainty on the spectra for observations (blue) and CMIP-3 control mean (black) respectively. All spectra are based on annual mean data, and estimated with a Hanning filter of window length 49 years. In all EOFs control mean variability integrated on time-scales above 10 years is consistent with observed variability based on an F-test.

Supplementary Figure SI 9: Time-series showing the evolution of the amplitudes of the first 4 EOFs in the BBC CCE (blue) and observations (black) from 1961-2010. Blue colouring indicates goodness-of-fit (r2) between observations and ensemble members, plotted in order of increasing agreement from light to dark blue. Thick blue lines indicate the 66% confidence interval. All series have been smoothed using a cubic spline. The first EOF shows a clear climate change signal, subsequent EOFs less so. Time-series are plotted at 5 year mean resolution. Note that the 4th EOF is not used in the r2 calculation.

[Figure SI 10 panels: a, Sensitivity to Flux Adjustment threshold (2041−2060 ΔT (K) against Absolute Total Flux adjustment (W/m2); series r2, Flux, r2 + Flux and r2 + Flux std, with QUMP, IPCC-AR4 and CMIP-3 ranges). b, Climate Sensitivity v Flux Adjustment (Climate Sensitivity (K) against Total Flux adjustment (W/m2), coloured by entcoef). c, 2001−2010 Hindcast (Goodness-of-fit (weighted mean squared error) against 2001−2010 ΔT (K)). d, 2041−2060 Forecast (Goodness-of-fit (weighted mean squared error) against 2041−2060 ΔT (K)). Panels c and d show CPDN-HadCM3L, CPDN-HadCM3L std (D), QUMP-HadCM3, CMIP-3 CTRL and the CMIP-3 models numbered as in Table SI 4, coloured by CS (K).]

Supplementary Figure SI 10: Sensitivity to flux adjustment threshold. a, Sensitivity of the 66% confidence interval (“likely” range) for 2050 warming to the additional constraint on the absolute global annual mean flux adjustment. Dotted black lines show the evolution of the range using a constraint based solely on the flux adjustments, and dashed lines indicate the 66% confidence interval using the surface warming constraint alone. Solid black lines and grey shading indicate the range combining the surface temperature and flux constraints. Red line indicates the range when the flux adjustment is taken instead relative to the standard physics configuration. Shown for comparison are the ranges from the simulations in the CMIP-3 and QUMP ensembles that pass the residual test, and also the IPCC-AR4 subjective range. Dashed vertical black line indicates flux adjustment threshold used in the main analysis. b, Relationship between estimated equilibrium climate sensitivity and global annual mean total flux adjustment (or TOA flux imbalance) in the BBC CCE. Points are coloured by the corresponding model entrainment rate parameter. For each of the 153 climate sensitivity values there are 10 different flux adjustments corresponding to the effect of perturbed ocean physics. Standard physics configuration is shown by D symbol. c,d, As Fig. 2a,b in the main text except the constraint on the global annual mean total flux adjustment is removed. Numbered symbols identify the different CMIP-3 model simulations.

[Figure SI 11: 2041−2060 ΔT (K) against Number of spatial EOFs, with KI at the far right; series Sample obs, Reg. obs, Sample PMS and Reg. PMS.]

Supplementary Figure SI 11: Sensitivity of the 66% confidence interval (“likely” range) for 2050 warming to the number of spatial EOFs included in the analysis. Solid lines show the effect when comparison is made between BBC CCE model simulations and observations, with colours indicating the covariance estimation technique: black, sample covariance and red, regularized covariance (See Eqn. SI 19). Dashed lines show the behaviour in the perfect-model scenario (PMS) when we replace the observations with the best fit ensemble member. Error bars on the far right, KI, indicate the ranges using a set of spatial patterns based on the Karoly Indices [56].

[Figure SI 12: 2041−2060 ΔT (K) against Scaling on estimated observed variability (α); series r2, Flux and r2 + Flux, with QUMP, IPCC-AR4 and CMIP-3 ranges.]

Supplementary Figure SI 12: Sensitivity to scaling on covariance structure of variability in the observations. As Fig. SI 10a, but instead showing the sensitivity of the 66% confidence interval (“likely” range) for 2050 warming to the scaling on the estimated covariance structure of variability in the observations. Dashed black lines show the evolution of the 66% confidence interval using the surface warming constraint and dotted lines show the range using the flux constraint alone. Solid black lines and grey shading indicate the range combining the surface temperature and flux constraints. CMIP-3, QUMP and IPCC-AR4 expert ranges are shown for comparison. The dashed vertical black line indicates the scaling used in the main analysis.

[Figure SI 13. Panel a, "1961-1990 anomaly relative to pre-industrial": horizontal axis, 1961-1990 anomaly relative to control (K); vertical axis, number of simulations; annotation, Observed: 1961-1990 relative to 1891-1910. Panel b, "2041-2060 Forecast": horizontal axis, 2041-2060 ΔT (K); vertical axis, goodness-of-fit (weighted mean squared error); colour scale, ΔT (K). Legend: CPDN-HadCM3L, CPDN-HadCM3L std (D), QUMP-HadCM3, CMIP-3, CMIP-3 CTRL, IPCC-AR4.]

Supplementary Figure SI 13: Impact of warming over 1961-1990 relative to the control simulation. a, Distribution of 1961-1990 reconstructed global-mean warming relative to the control simulations measured by the transient-control anomaly. Vertical black line represents the observed change over 1961-1990 relative to 1891-1910. Grey shading indicates approximate 5-95% uncertainty estimate accounting for observational error and internal variability (see text for details). b, As Fig. 2b except members of the ensemble are coloured by the mean warming over 1961-1990 relative to control for each model configuration.

[Figure SI 14. Vertical axis: 2041-2060 ΔT (K). Horizontal axis: global mean temperature bias (K), 0.0 to 1.4. Legend: BBC CCE, CMIP-3, QUMP.]

Supplementary Figure SI 14: Sensitivity to the absolute bias in global-mean temperature. As Fig. SI 10a, but instead showing the sensitivity of the 66% confidence interval (“likely” range) for 2050 warming to an additional constraint on the absolute bias in global-mean temperature over 1961-1990. Grey shading and solid lines indicate the range for the BBC CCE simulations, dashed lines for CMIP-3 and dotted lines for QUMP. Bias is calculated from global-mean temperature in the transient simulations averaged over 1961-1990, compared to observed values from HadCRUT3. For reference, the observed mean is ≈ 14 ± 0.3 °C. Black error bar on the far right indicates the range of temperature changes in identically prepared pre-industrial control segments from CMIP-3, arbitrarily placed in the vertical.

[Figure SI 15. Panel a, "Predicted v Simulated Climate Sensitivity": horizontal axis, predicted climate sensitivity (K); vertical axis, simulated climate sensitivity (K). Panel b, "Distribution of out-of-sample prediction errors": horizontal axis, prediction error (K); vertical axis, number of simulations.]

Supplementary Figure SI 15: Validation of the random forest statistical model used to estimate equilibrium climate sensitivities. a, Results from a 10-fold cross validation over the 14,001 member training set from climateprediction.net. Shown are simulated climate sensitivities from HadSM3 against out-of-sample predicted values from the random forest. Red line shows the theoretical 1:1 relationship. b, Histogram of prediction error. The root mean square error of approximately 0.3 K is shown by dashed vertical grey lines.
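The out-of-sample validation described in this caption can be illustrated with a minimal sketch of 10-fold cross validation. This is not the code used for the analysis. To stay self-contained it replaces the random forest with a simple least-squares line as a stand-in predictor, and it uses synthetic data rather than the climateprediction.net training set. The function names, the single predictor variable and the sign convention for the error (predicted minus simulated) are all assumptions made for illustration.

```python
import math
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Least-squares straight line; a stand-in for the random forest."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def cross_validated_errors(xs, ys, k=10, seed=0):
    """Return one out-of-sample error (predicted minus simulated) per member."""
    errors = [0.0] * len(xs)
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        # Fit only on members outside the current fold.
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        # Predict the held-out members and record their errors.
        for i in fold:
            errors[i] = (slope * xs[i] + intercept) - ys[i]
    return errors


def rmse(errors):
    """Root mean square of the prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Synthetic toy ensemble (hypothetical): one parameter x and a noisy
# "simulated sensitivity" y in kelvin.
rng = random.Random(1)
xs = [rng.uniform(0.0, 1.0) for _ in range(500)]
ys = [2.0 + 6.0 * x + rng.gauss(0.0, 0.3) for x in xs]

errors = cross_validated_errors(xs, ys, k=10)
print(f"out-of-sample RMSE: {rmse(errors):.2f} K")
```

Each member is predicted exactly once, by a model that never saw it during fitting, so a histogram of `errors` plays the role of panel b and `rmse(errors)` the role of the dashed lines.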
