Terrestrial Biosphere Model Performance for Inter-Annual Variability of Land-Atmosphere CO2 Exchange

Interannual variability in biosphere-atmosphere exchange of CO 2 is driven by a diverse range of biotic and abiotic factors. Replicating this variability thus represents the ‘acid test’ for terrestrial biosphere models. Although such models are commonly used to project responses to both normal and anomalous variability in climate, they are rarely tested explicitly against inter-annual variability in observations. Here, using standardized data from the North American Carbon Program, we assess the performance of 16 terrestrial biosphere models and 3 remote sensing products against long-term measurements of biosphere-atmosphere CO 2 exchange made with eddy-covariance flux towers at 11 forested sites in North America. Instead of focusing on model-data agreement we take a systematic, variability-oriented, approach and show that although the models tend to reproduce the mean magnitude of the observed annual flux variability, they fail to reproduce the timing. Large biases in modeled annual means are evident for all models. Observed interannual variability is found to commonly be on the order of magnitude of the mean fluxes. None of the models consistently reproduce observed interannual variability within measurement uncertainty. Underrepresentation of variability in spring phenology, soil thaw and snowpack melting, and difficulties in reproducing the lagged response to extreme climatic events are identified as systematic errors, common to all models included in this study.


Introduction
The terrestrial biosphere acts as a net sink for atmospheric CO 2 , with global forests absorbing on average 4 Pg C yr -1 (Pan et al., 2011), which, excluding deforestation, offsets roughly half of all anthropogenic emissions from fossil fuel burning and cement production (Pan et al., 2011). Interannual variability in this sink is often on the order of magnitude of the mean (e.g., Zeng et al., 2005;Reichstein et al., 2007a;Pan et al., 2011), and drives interannual variability in the growth rate of atmospheric CO 2 (Bosquet et al., 2000;Knorr et al., 2007). Carbon fluxes in forest ecosystems are tightly coupled to climate Piao et al., 2008;Chen et al., 2009;Dragoni et al., 2011), and anomalous climatic signals generally drive the observed variability in their sink strength (Dunn et al., 2007;Desai, 2010;Le Maire et al., 2010). Such signals tend to affect photosynthesis and respiration (the two processes which determine net ecosystem carbon exchange) to different extents Luyssaert et al., 2007), and therefore provide an excellent test-bed to assess the skill of terrestrial biosphere models.
Terrestrial biosphere models are the primary tools used for predicting the impact of climate variability on terrestrial carbon fluxes. Built around the philosophy that a blend of mechanistic and semi-empirical descriptions can capture functional responses to environmental drivers, they have been used in conjunction with remote sensing products (Zhao and Running, 2010) and data mining tools (Papale and Valentini, 2003) to provide regional and global estimates of terrestrial carbon cycling (e.g., Friend et al., 2010;Beer et al., 2010). They are also commonly used to quantify terrestrial responses to climatic variability, including anomalies and extreme events (Ciais et al., 2005;Reichstein et al., 2007;Vetter et al., 2008;Zhao and Running, 2010). Future model projections of the response of terrestrial carbon cycling to climate change (Heimann and Reichstein, 2008) are necessary to inform policy (IPCC, 2007), though current models show very divergent sensitivity to long-term changes in climate (Friedlingstein et al., 2006).
Biogeochemical models are often shown to capture diel and seasonal dynamics reasonably well (e.g., Braswell et al., 2005). This is not surprising, given the pronounced diurnal and seasonal cycles of climatic drivers. Over yearly and longer time scales, however, studies show poor model performance at reproducing gross fluxes and carbon budgets (e.g., Hanson et al., 2004;Braswell et al., 2005;Siqueira et al. 2006;Urbanski et al. 2007). Such comparison studies are typically restricted to a limited number of models and sites, and a relatively short time series length. Nonetheless, the results suggest that although the response of terrestrial ecosystems to mean climatic drivers is relatively well captured, sensitivity to the impact of variability in climatic drivers may not be, leading to the accumulation of high frequency model error (e.g., Dietze et al., 2012) over longer time scales (Schwalm et al., 2010). No study, however, has yet identified systematic errors in model sensitivity to climatic variability.
In this analysis, we use 16 terrestrial biosphere models and 3 remote sensing products, along with eddy-covariance data from a range of sites included in the North American Carbon Program interim site synthesis, to assess model ability to reproduce observed variability in carbon fluxes. We examine the frequency distribution of temporal anomalies in net ecosystem exchange (NEE), gross primary productivity (GPP), and ecosystem respiration (RE), for two plant functional types. We first assess individual model performance on an annual/interannual scale. As interannual variability can be driven by 'critical' periods within a year (Le Maire et al., 2010), we examine monthly systematic model errors (errors consistent across all models and sites). By using data from sites with a regionally coherent anomalous year, we then assess the possible role of extreme within-year climatic events and lagged effects on model performance for interannual variability in terrestrial carbon cycling.

Methods
All models and data used were obtained through the North American Carbon Program (NACP) (http://www.nacarbon.org/nacp/). To allow for an ensemble approach and reduce the potential for spurious variability, we selected only sites with at least 5 years of data, from plant functional types that were represented by at least 3 such sites. This resulted in a total of 11 forested sites distributed through North America (Table 1). Of those, 6 were deciduous broadleaf, and 5 evergreen needleleaf. This gave a total of 91 site-years for the analysis.
Eddy-covariance flux data (produced by AmeriFlux and Fluxnet-Canada investigators) from the 11 selected sites was processed according to a common protocol from the NACP site level interim synthesis (http://www.nacarbon.org/nacp/). The observed NEE were corrected for storage, despiked (i.e., outlying values removed), and filtered to remove conditions of low turbulence (friction velocity filtered). Flux error estimates were calculated (Barr et al., 2009) by combining random uncertainty (calculated following ) and uncertainty due to the selection of the friction velocity threshold (Barr et al., 2009). Observed monthly and annual NEE values were then calculated using gap-filled data from each site (Barr et al., 2009). The gapfilled NEE values were also partitioned to gross ecosystem photosynthesis (GPP) and ecosystem respiration (RE). Multiple approaches were used in order to quantify additional uncertainty introduced by the partitioning (Desai et al., 2008;Barr et al. 2009).
Gaps in the meteorological forcing data occurred due to instrument failure or quality control. Such gaps were filled using the nearest available climate station in the National Climatic Data Center's Global Surface Summary of the Day (NCDC-GSOD) database.
Sixteen terrestrial biosphere models (Table 2) were run at the sites for the period of available measurements (Table 1). The terrestrial biosphere models simulated carbon cycling with process-based formulations of varying detail for the component carbon fluxes of photosynthesis and respiration. Simulated NEE was based on model specific runs using gap filled observed weather at each site and locally observed values of soil texture according to a standard protocol (Ricciuto et al., 2009). Each model used species or plant functional type specific parameterizations as defined by the individual model teams, with the exception of LoTEC where parameters were optimized using data assimilation (Ricciuto et al., 2008). Three remote sensing products of terrestrial gross primary productivity (MODISc5 (Running et al., 2000), MODISc5.1 (Zhao et al., 2005), BESS ), not included in the North American Carbon Program, were also used to provide annual estimates of GPP for each site.
In order to assess interannual variability, we normalized the measured/modeled values of NEE, GPP and RE by subtracting the long-term calendar year measured/modeled mean for each site from individual site-year flux values, giving F obs and F sim for each flux and year. By comparing the long-term calendar year mean of measured and modeled fluxes, we also identified biases in model estimates of NEE, GPP and RE.
Model-data agreement for interannual variability in annual flux sums was assessed in terms of the normalized root mean squared error (NRMSE) and the χ 2 statistic.
The NRMSE is the root mean square error of model-data mismatch normalized by the magnitude of observed variability at each site: (1) where F represents the observed (obs) or modeled (sim) value of a particular flux, i (NEE, GPP, or RE), for a particular year, l. Note that each flux here is represented as the interannual variability (F obs and F sim ), not the mean flux. σ(F obs ) is the standard deviation of observed interannual variability at site k. NRMSE values are calculated for each model j at site k. The NRMSE thus reports the mean difference between the simulated and observed flux, relative to the variability in the observed flux.
The χ 2 statistic complements the NRMSE by incorporating measurement error. Here it is calculated for each model and PFT as the squared residual between paired model and data points for each flux (after normalization to the long-term mean as described above), relative to the observational error: (2) where δ(F obs ) is uncertainty related to the annual observed value of that flux, 2 normalizes the uncertainty in the observed flux to correspond to a 95% confidence interval. A χ 2 value of less than one indicates that the model agrees with the data relative to data uncertainty. I.e., interannual variability for a model with a χ 2 value of less than one will always fall within 1 standard deviation of data error. Above one, the χ 2 scales model error relative to observation uncertainty.
Interannual variability in observed fluxes commonly stems from specific, short-lived, periods of anomalous fluxes within the year (Krishnan et al., 2008(Krishnan et al., , 2009Chen et al., 2009;Le Maire et al., 2010). We therefore also assessed model performance for variability on a monthly scale. The variability of monthly fluxes between years was calculated in the same way as annual variability, as the difference between the observed or modeled monthly value and the associated long-term mean for the month in question.
By differencing the observed and predicted monthly variability (here termed variance residuals) specific periods during the year at which the models under-or over-represent the observed monthly variability can be identified. We define periods of systematic model error (statistically common to all models) as times when all models show the same-signed bias in variance residuals with 95% confidence. We also assess persistent biases, which are mean biases of more than one month in duration that are not necessarily systematic.
Extreme climatic events, detectable as regionally coherent deviations outside the normal range of variability, provide a strong test of model performance. We identified one such event in our database. At three sites in mid-western Canada, mean spring monthly temperatures in 2002 were between 8 and 10 °C below the long-term mean. We used this event to assess model skill and identify systematic model error.

Biases and the magnitude of variability:
In order to quantify interannual variability, we normalizing all models and data by subtracting respective mean annual totals from individual annual totals. This process identified considerable biases in model estimates of all total annual fluxes at all sites ( Fig. 1). In particular, biases in annual NEE were commonly of similar magnitude to mean observed annual NEE fluxes. The majority of models were biased towards underestimating ecosystem carbon uptake for both deciduous and evergreen sites ( Fig.   1). Note that biases here are reported relative to the observed mean NEE for each site, and therefore have the potential to be particularly larger for sites with low mean annual NEE. See Table 1 for per-site mean annual NEE values.
The magnitude of modeled interannual variability in each annual flux was on average of the same order of that observed (Fig. 2). A large range in model performance was evident (Table 3), but in general the models proved 'flexible' enough to reproduce the observed range of variability. Observed interannual variability in NEE for deciduous broadleaved forests was twice that of evergreen needle-leaved forests, a distinction only reproduced by six of the included models. The magnitude of interannual variability in both GPP and RE was greater (55%, 23%) in deciduous broadleaved forests than in evergreen needle-leaved forests. The remote sensing products, however, consistently predicted higher GPP variability in evergreen than in deciduous forests.

Statistical performance of models on an interannual scale
Although the mean magnitude of model variability on the interannual scale was similar to the mean observed magnitude of variability ( Fig. 2), all models fell outside of the data error of the observed for individual site-years (>1 χ 2 , Fig. 3, S1). This means that the general magnitude of interannual variability was well reproduced, but not the timing. Interannual variability in the annual net ecosystem exchange of evergreen forests was better simulated on average than deciduous forests (Fig. 3). A larger range of model performance was observed for variability in annual GPP than that of RE. Our results suggest that on average the inability of models to match the timing of observed variability in GPP is the main cause of errors in the simulation of interannual variability in NEE, though this is very model dependent (Fig. 3). The MODISc5 remote sensing product performed worse than the average model (Fig. 3, Fig S2). The MODISc5.1 data product proved to be a large improvement over the MODISc5 GPP product. The BESS remote sensing product, a process based model interpretation of remotely sensed data , performed better than either MODIS product for deciduous forests, though that was not the case for evergreens. Although process based models of different types were represented (e.g., light use efficiency vs enzyme kinetic model for GPP, Table 2) no model characteristic performed statistically better than any other (data not shown). This could be due to the limited number of models with particular characteristics.

Variability within the year
The models showed persistent systematic biases (see definition in Methods section) for monthly flux variability. In deciduous forests, models consistently underestimating monthly variability in NEE throughout spring (May and June) (Fig. 4). Model underestimation of variability in deciduous spring NEE fluxes was mostly due to underrepresentation of variability in spring GPP (Fig. 4). A smaller peak in the deciduous GPP variance residuals (predicted monthly variability -observed monthly variability) was also evident in September and October. Variability in deciduous RE showed no bias that was consistent across all models.
Systematic underrepresentation of monthly variability during May was also evident for evergreen forests (Fig. 4). Here, however, model error for NEE was dominated by the lack of variability in modeled RE during spring. Although evergreen forests do not exhibit the marked phenological transitions observed in deciduous forests, all evergreen forests included in this study maintain a large snow-pack throughout winter. Persistent, non-systematic biases were evidenced throughout the year, in particular an overestimation of winter variability in evergreen NEE and GPP, and a persistent underestimation of variability in evergreen RE during the summer.

Response to anomalous climate forcing
Three sites (CA-Ojp, CA-Obs, CA-Oas; see site description Table 1) experienced a regionally coherent extreme climatic event during 2002, where monthly mean temperatures were between 8 (CA-Ojp, CA-Obs) and 10 °C (CA-Oas) below the longterm mean. The anomaly largely affected canopy GPP at all three sites, and to a lesser extent RE (Fig. 5). At CA-Ojp and CA-Obs, anomalously low temperatures during the month of April lowered observed GPP by more than twice the normal range of variability for that month (Fig. 5). The models accurately captured the drop in productivity, with the mean of all model projections capturing both the sign and the magnitude of the April GPP anomaly at both sites. The temperature anomaly was observed at CA-Oas one month later, and again the models accurately reproduced the observed magnitude in anomalous GPP. During the following May, June and July, observed temperature remained colder than normal but returned to within the normal range of variability for the three sites. Observed GPP, however, remained anomalously low during those months and did not return to within the normal range of variability until July at each site. This lagged effect between anomalous climate forcing and resulting fluxes was not well reproduced by the models. At CA-Obs, all models returned to within the normal range of GPP variability in the month directly following the temperature anomaly. The same behavior was apparent at CA-Ojp and CA-Oas, though the average model GPP estimates remained just outside the normal range of variability due to persistent low temperatures. The extended period of low productivity in CA-Oas may be in part also due to consistently low precipitation during the year.
A similar, though smaller anomaly pattern was observable for RE (Fig. 5). Low spring temperatures caused a prolonged anomaly of low ecosystem respiration. The models tended to overestimate the reduction in RE as a result of the colder temperatures. After the initial anomaly, RE as estimated by eddy-covariance measurements took a few months to return to within the normal range of variability. Modeled RE quickly returned to 'normal' at CA-Ojp and CA-Obs. Temporal dynamics at CA-Oas differed from those of the other two sites due to the additional pressure of persistently low temperatures and precipitation during the year.

Discussion
This analysis has shown that, although capable of reproducing the magnitude of interannual variability, terrestrial biosphere models are not consistent with the timing of observations of interannual variability in surface-atmosphere exchanges of CO 2 at midlatitude forests over North America. By examining interannual variability in measured and modeled monthly fluxes, we show that all the models used for the NACP interim site synthesis systematically fail to reproduce observed variability during spring.
Underestimation of spring variability is largest for GPP in deciduous forests, and RE for evergreens, suggesting different processes may be responsible for plant functional type specific model error.
It has been shown that terrestrial biosphere models are typically unable to adequately explain the observed interannual variability in deciduous canopy phenology (Richardson et al., 2012), and that variability in spring GPP often drives observed interannual variability in net ecosystem exchange (Krishnan et al., 2008(Krishnan et al., , 2009). Here we show that this is a systematic cause of the low agreement between modeled and observed interannual variability in terrestrial carbon fluxes.
In a similar fashion, it has been shown that the current available models of snow pack dynamics perform poorly for both spatial and interannual variability. Rutter et al. (2009) tested 33 models of snowpack dynamics across a range of sites, and found that although a model could perform well when tuned to a particular site-year, this did not transfer to good performance for other years at the same site, or other sites. Interactions between snowmelt, soil thaw and water table depth are known to directly affect interannual variability in NEE (Goulden et al., 1998;Dunn et al., 2007;Hu et al., 2010). Results here suggest that this may be a direct systematic contributor to the low agreement between observed and modeled interannual variability in net ecosystem carbon exchange, in particular for evergreen sites. These results do not imply, however, that a lack of phenological variability in canopy or soil dynamics are necessarily the main culprits for the lack of agreement between the observations and output from any one model, as individual models showed large persistent biases at other times of the year (Fig. 3).
The remote sensing products performed comparably to the average process-based model when assessed against interannual variability in GPP. The MODISc5.1 data set is a post-processed version of the MODISc5 data set where corrections are made for poor quality driver data (Zhao et al., 2005). The remote sensing products, which are themselves models, are driven by a global daily meteorological reanalysis dataset not site-specific meteorology and the uncertainties in the meteorological reanalysis can introduce biases in GPP estimates (Zhao et al., 2006). Although estimates of GPP based on remote sensing have been used to evaluate process-based models (e.g., Poulter et al., 2009), results here suggest that estimates of interannual variability from both approaches are subject to similar magnitudes of error.
Although there was a general tendency for the models to persistently underestimate flux variability in summer, it should be noted that the flux data are subject to random error roughly in proportion to the size of the flux Richardson et al., 2008). Even if the model were perfect, modeled variability should be smaller than that observed. Carbon fluxes are typically higher in the summer, and subject to larger uncertainty. The apparent higher variability in the data during summer could therefore be due to random errors in the flux measurements generating larger variability in monthly totals. The analysis of model responses to the regionally coherent climatic anomaly of spring 2002 suggests that models have the potential to correctly reproduce the magnitude of instantaneous biological response to climate anomalies (Desai, 2010). Although the models accurately captured the direct effect of an isolated climate extreme, the models included here failed to accurately reproduce lagged effects of climate anomalies on both gross primary production and ecosystem respiration. Lagged effects of climate variability on ecosystem function have previously been reported (Gough et al., 2009), and our results suggest that such lagged effects are not well incorporated into models.
The nature of such lagged effects depends on the type of climatic anomaly. Spring frosts Marino et al., 2011), for example, are known to directly effect canopy structure, an aspect not currently accounted for in models. The affects of other disturbances, such as ice storms, strong winds and insect outbreaks are known to be poorly represented by models (Liu et al., 2011) and affect long-term carbon dynamics. .
Lagged effects unrelated to disturbances can be caused by changes in nutrient cycling  or changes in the size of carbon pools such as litter , or non-structural carbohydrates (Gough et al., 2009) due to climatic conditions in previous years. Model aspects related to lagged and cumulative effects can be improved through direct comparisons with observations (e.g., Keenan et al., 2009), though many related issues remain (Liu et al., 2011). Although lagged effects are apparent at the three sites showing a coherent regional extreme event, we did not detect similar lagged events for other climatic anomalies in the database. This is likely due to two confounding effects: that smaller anomalous climate signals do not produce lagged (on monthly scales) ecosystem effects, and that biotic effects could play a role in driving some of the interannual variability in observed fluxes .
Open questions remain as to the proportion of interannual variability in land-atmosphere carbon exchange that is directly explainable by variability in climate (Hui et al., 2003;Polley et al., 2010;Richarsdon et al., 2007). Controls on interannual variability can also be manifest in the form of functional changes in the ecosystem, or lagged effects on pool sizes and dynamics. By contrasting the interannual performance of a simple empirical model with fixed parameters against the same model with interannually varying parameters,  reported that forest functional change at a spruce forest was responsible for 55% of interannual variations in land-atmosphere CO 2 exchange. i.e., 45% of the observed variability was not explainable by the direct impacts of climate. Polley et al. (2010) used a similar approach to determine a significant contribution of ecosystem functional change to interannual variability in grasslands.
Using an optimized process-based model, however, Desai (2010) found that 81% of interannual variability in annual CO 2 exchange could be explained by variability in climate for five mature hardwood forests, a value that likely underestimates model performance given that it does not account for observational error. This result supports multi-site synthesis efforts that show that ~79% of interannual variability for midaltitude deciduous broadleaved forests can be explained by variability in temperature (Yuan et al., 2009). Clearly a detailed assessment of the relative roles of climate and functional change on the interannual variability of CO 2 flux across a wide range of sites and climate zones is needed.
We could not distinguish any model structure or characteristic (see Table 2) that tended to give a better model performance. All models are subject to errors resulting from both parameter choice (parameter mis-specification) and model structure (process misrepresentation) (Keenan et al., 2011). The fact that no model structure proved consistently better suggests that parameter error was excessively large. In future efforts, model-data fusion techniques (Wang et al., 2009;Keenan et al., 2011) could aid in reducing the relative magnitude of parameter errors, thus allowing for a more rigorous assessment of model structural differences.
Our estimates of the magnitude of observed interannual variability in land-atmosphere CO 2 exchange (DBF: ~85 gC m -2 ; EVG: 44 gC m -2 , Table. 3) are roughly 50% and 33% of the mean flux respectively. Given that this represents one standard deviation about the mean, variability in ecosystem carbon uptake is commonly on the order of magnitude of the mean. This supports previous results from single sites (Reichstein et al., 2007a), and modeling studies (Zeng et al., 2005), across the range of sites included here. Variability in GPP has been found to be the main contributor to variability in NEE for a variety of terrestrial ecosystems (Luyssaert et al., 2007). Here, we show that for deciduous forests, the interannual variability in GPP is on average 26% greater than that of RE (Table 3). All though on average both GPP and RE show a similar magnitude of variability at the evergreen needle-leaf forest sites, four of the six evergreen sites had higher variability in GPP. This suggests that variability in GPP dominates variability in NEE in deciduous mid-latitude forests, though this rule is not applicable to all sites included here.
Using 91 site-years at 11 long-term eddy-covariance forest sites, we show that terrestrial biosphere models have difficulty in simulating land-atmosphere CO 2 exchange at annual and interannual time scales, with the potential for large biases on the interannual scale, and incorrect simulation of the timing of interannual variability. Instead of focusing on model-data agreement, we present a variability-oriented approach of diagnosing systematic and persistent model-data disagreement. Given that studies of the impact of climate variability on terrestrial fluxes are likely to reveal a more informative picture of biosphere-atmosphere interactions (Le Maire et al., 2010), such a variability orientated approach should greatly aid modeling teams in future model assessment and development. Our results highlight three potential mechanisms -spring canopy phenology, soil thaw and the melting of the snow pack, and lagged effects -common to all models included in the study, which contribute to the low agreement between the models and the observed interannual variability in land-atmosphere CO 2 exchange.
Addressing these issues in future model efforts will be the first step towards improving the sensitivity of models to climatic variability on interannual time scales.   Table 3 for individual model values.   Model GPP R n/a n/a n/a