First, we simulated the carbon and water fluxes using the calibrated Biome-BGC model at the Changbai Mountains forest flux site. Second, Biome-BGC MuSo with its multilayer soil module was applied in the simulation. Third, the daily soil temperature and moisture were assimilated into Biome-BGC MuSo. Fourth, the simulated carbon and water fluxes were evaluated against EC measurements. Finally, three-dimensional relationships among ΔRMSE and climatic and biophysical factors were analyzed. Figure 3 shows the overall methodology of this study, the details of which are presented in the subsequent sections.
Biome-BGC MuSo model
The Biome-BGC Multilayer Soil Module version 4.1 (Biome-BGC MuSo v4.1) was developed to improve the ability of Biome-BGC to simulate carbon and water cycles within terrestrial ecosystems. Biome-BGC MuSo v4.1 improved the multilayer soil module and introduced the management and phenological modules. These three modules are independent of each other in the model. In this study, the management module was deactivated during the spin-up and normal simulation for the forest; that is, the logical flags for planting, thinning, mowing, grazing, harvesting, ploughing, fertilizing, and irrigation were set to 0 (flag = 0). The thicknesses of the soil layers from the surface to the bottom were 5, 15, and 20 cm. Thus, the first, second, and third layers were located at depths of 0–5, 5–20, and 20–40 cm, respectively.
This model runs with a daily time step and requires four input files for execution. The first file is the initialization file containing basic site-related information (e.g., elevation, soil texture, CO_{2} concentration, and N-decomposition data). The second file is the daily meteorological data file and includes daily maximum air temperature, minimum air temperature, precipitation, VPD, solar radiation, and day length. The third file is the ecophysiological file and includes the ecophysiological parameters (e.g., the carbon-to-nitrogen ratios of leaves, fine roots, and coarse roots; the fraction of leaf N in the Rubisco catalytic enzyme; and the maximum stomatal conductance). In this study, the ecophysiological parameter values in Biome-BGC MuSo were taken from the optimized results obtained during the model run. The last input file is a special restart file, which is the output of the spin-up and provides inputs for running the model under normal situations. The spin-up phase was first performed using meteorological data covering the period 1981 to 2002 obtained from the Data Center of Chinese Meteorological Bureau, and its endpoint output served as the input for the normal simulation covering the period 2003 to 2007.
In the carbon flux module of the Biome-BGC MuSo model, GPP is calculated using Farquhar’s photosynthesis routine and data on the catalytic enzyme Rubisco in relation to temperature (Farquhar et al. 1980). Photosynthesis is the only process through which carbon enters the pools of the model. Root maintenance respiration was calculated layer by layer using the soil water content (SWC) and soil temperature of each active layer (which differs from the original Biome-BGC model, where the soil water status and soil temperature are averaged over the whole soil). Growth respiration (GR) in the model was taken as a fixed proportion (30%) of all new tissue growth (Larcher 2003).
The net primary productivity (NPP) was calculated from GPP, maintenance respiration (MR), and GR in the model. The carbon storage of the ecosystem originates from the balance between NPP and heterotrophic respiration (HR), the latter being regulated by decomposition activities. All litter and soil pools decompose through HR. NEE represents the difference between NPP and HR.
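The bookkeeping described above can be sketched in a few lines. This is only an illustration, assuming the conventional relation NPP = GPP − MR − GR and taking NEE as NPP − HR, as stated in the text; the function name and the flux values are hypothetical.

```python
def carbon_balance(gpp, mr, gr, hr):
    """Return (NPP, NEE) from daily fluxes, all in g C m^-2 d^-1.

    Assumes NPP = GPP - MR - GR and NEE = NPP - HR (sign convention of the text).
    """
    npp = gpp - mr - gr
    nee = npp - hr
    return npp, nee


# Made-up daily fluxes, for illustration only.
npp, nee = carbon_balance(gpp=8.0, mr=2.5, gr=1.5, hr=3.0)
print(npp, nee)
```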
The soil flux module generally describes the decomposition of dead plant material, or litter, in addition to SOM, N mineralization, and N balance (Schwalm et al. 2015). Soil hydrology has significant effects on many soil processes (e.g., SOM, N mineralization, and soil evaporation), and thereby on the carbon and water cycles. Therefore, an accurate description of soil hydrology is essential. In the original Biome-BGC model, the soil layer works as a “bucket”, and the soil water flux considers only canopy interception, snowmelt, outflow, and soil evaporation. Therefore, runoff, percolation, diffusion, pond water formation, and transpiration were added into Biome-BGC MuSo.
The movement of water within the soil occurs through percolation and diffusion. Biome-BGC MuSo implements two calculation methods for soil water movement. The first is based on Richards’ equation (Balsamo et al. 2009). The second, the so-called “tipping bucket method” (Ritchie 1998), is based on a semi-empirical estimation of percolation and diffusion fluxes and is generally used in crop modeling. In the first method, hydraulic conductivity (K) and hydraulic diffusivity (D) are used in the percolation and diffusion calculations through the diffusion equation derived from Darcy’s law:
$$ \frac{\partial \theta }{\partial t}=\frac{\partial }{\partial z}\left[D\left(\theta \right)\cdot \frac{\partial \theta }{\partial z}\right]+\frac{\partial K}{\partial z}+S\left(\theta \right) $$
(1)
where D is the hydraulic diffusivity (m^{2}∙s^{−1}), K is the hydraulic conductivity (m∙s^{−1}), and S represents the sources and sinks of soil water such as precipitation, evaporation, transpiration, runoff, and deep percolation. The Clapp–Hornberger formulation (Clapp and Hornberger 1978) was used to calculate K and D. These variables change rapidly and significantly as the SWC changes. K and D were determined for each layer, and the layer-integrated daily-scale form was solved by the finite difference method. The Richards equation was used to investigate soil water movements in this study.
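To illustrate how strongly K and D respond to SWC, the following sketch evaluates the standard Clapp–Hornberger power-law forms, K = K_sat (θ/θ_sat)^(2b+3) and D = b K_sat |ψ_sat| / θ_sat (θ/θ_sat)^(b+2). The function name and all soil parameter values are hypothetical and are not the values used in this study; the finite-difference solution of Eq. 1 itself is not reproduced here.

```python
def clapp_hornberger(theta, theta_sat, k_sat, psi_sat, b):
    """Return (K, D) for a volumetric water content theta (m3 m-3).

    K in m s-1 and D in m2 s-1, using the Clapp-Hornberger power laws.
    psi_sat is the saturated matric potential in m (sign is ignored).
    """
    s = theta / theta_sat                       # relative saturation
    k = k_sat * s ** (2 * b + 3)                # hydraulic conductivity
    d = b * k_sat * abs(psi_sat) / theta_sat * s ** (b + 2)  # diffusivity
    return k, d


# Hypothetical soil parameters and layer water contents (illustration only).
THETA_SAT, K_SAT, PSI_SAT, B = 0.45, 5e-6, -0.2, 5.0
layer_theta = [0.20, 0.30, 0.40]                # 0-5, 5-20, 20-40 cm layers
layer_kd = [clapp_hornberger(t, THETA_SAT, K_SAT, PSI_SAT, B) for t in layer_theta]
for theta, (k, d) in zip(layer_theta, layer_kd):
    print(f"theta={theta:.2f}  K={k:.3e} m/s  D={d:.3e} m2/s")
```

Because of the large exponents, a moderate change in θ shifts K by orders of magnitude, which is why the values are recomputed for each layer.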
Surface runoff occurs when the rate of rainfall exceeds the rate at which water infiltrates the soil. Runoff was simulated using a semi-empirical method (Williams 1991). Under intensive rainfall, when not all of the precipitation can infiltrate, pond water forms on the surface. In Biome-BGC MuSo, evaporation of pond water is assumed to be equal to potential soil evaporation.
The soil temperature of each active layer can be calculated using two methods. The first method involves logarithmic downward dampening of temperature fluctuations within the soil (Zheng et al. 1993). In this method, the soil surface temperature is determined by air temperature changes, considering the insulating effect of snow cover and the shading effect of vegetation. The temperature of intermediate soil layers is calculated assuming a linear temperature change between soil depths of 0 cm and 3 m. The soil temperature below 3 m is assumed to be the mean annual air temperature. The other method uses DSSAT/4M (Sándor and Fodor 2012) to empirically calculate the soil temperature. Because the former method is preferred (Zheng et al. 1993), we selected it in this study and compared the results with measurements obtained at the Changbai Mountains forest flux site.
Ensemble Kalman filter
The EnKF algorithm, used mainly to forecast the error covariance of a model, is based on the Monte Carlo method (Evensen 2003) and can integrate multi-source observations sequentially in time. The basic assumption of this algorithm is that the system and measurement noises are both white and Gaussian. The N ensemble members, generated from the background and observations, are initialized at time t_{0}, and the ensemble of state variables is obtained by adding noise directly (Eq. 2). Independent model runs are then invoked. For each model run, whenever a new observation becomes available, the analysis and regeneration of the state variables are conducted at time t – 1, i.e., before the prediction of the state variables at time t. EnKF involves forecasting and measurement updates and comprises five steps, as given below.

(1) Initialization of the ensemble
The number of ensemble members, N, is first defined. The state variable x is calculated at time t_{0} as follows:
$$ {x}_{t_0,i}=\overline{x_{t_0,i}}+{p}_i $$
(2)
$$ {p}_i\sim N\left(0,\sigma \right) $$
(3)
where x_{t_0,i} is the initialized state vector at time t_{0}; \( \overline{x_{t_0,i}} \) is the expectation of the background; and p_{i} represents the noise, which follows a Gaussian distribution with a mean of 0 and a variance of σ.

(2) Forecasting
The state variables are predicted at time t using input data (time t – 1) and the model operator (Biome-BGC MuSo model):
$$ {x}_{i,t}^f={F}_t{x}_{i,t-1}^a+{B}_t{\mu}_i $$
(4)
where \( {x}_{i,t}^f \) is the forecasted state vector at time t, with superscript f referring to the forecasted value; F_{t} denotes the model operator; \( {x}_{i,t-1}^a \) is the analyzed state vector at time t – 1, with superscript a representing the analyzed value; B_{t} is the control matrix, which applies the effect of each control input in vector μ_{i} to the state vector; and μ_{i} represents the model error, which follows a Gaussian distribution.
Uncertainties of noise in EnKF are reflected by the covariance matrix, with consideration of the error propagation at any time (Moradkhani et al. 2005). The covariance matrix is calculated during the entire forecasting process according to its properties as
$$ {P}_t^f={F}_t{P}_{t-1}^a{F}_t^T+{Q}_t $$
(5)
where \( {P}_t^f \) is the forecast error covariance matrix at time t, and Q_{t} is the covariance of the model error.

(3) Calculation of the Kalman gain matrix
The core of data assimilation lies in the Kalman filter system, and it is assumed that observations are related to the true state. Therefore, the following expression applies for adding observations to the model at time t:
$$ {Z}_t={H}_t{x}_{i,t}^f+{v}_t $$
(6)
where Z_{t} is the observation vector at time t, and H_{t} is the operator that maps the model variable space to the observation space. v_{t} is a Gaussian random error vector with mean zero and observation error covariance R.
The Kalman gain matrix is defined as
$$ {K}_t={P}_t^f{H}_t^T{\left({H}_t{P}_t^f{H}_t^T+{R}_t\right)}^{-1} $$
(7)
The EnKF forecast and analysis error covariance are acquired directly from the ensemble of model simulation as
$$ {P}_t^f=E\left[\left({x}_{i,t}^f-{\overline{x}}_t^f\right){\left({x}_{i,t}^f-{\overline{x}}_t^f\right)}^T\right]=\frac{1}{N-1}{\sum}_{i=1}^N\left({x}_{i,t}^f-{\overline{x}}_t^f\right){\left({x}_{i,t}^f-{\overline{x}}_t^f\right)}^T $$
(8)
$$ {H}_t{P}_t^f{H}_t^T=\frac{1}{N-1}{\sum}_{i=1}^N\left[{H}_t\left({x}_{i,t}^f\right)-{H}_t\left({\overline{x}}_t^f\right)\right]{\left[{H}_t\left({x}_{i,t}^f\right)-{H}_t\left({\overline{x}}_t^f\right)\right]}^T $$
(9)
The variance is based on the uncertainty of the data. The Kalman gain at time t (K_{t}) is given by Eq. 7, and R_{t} is the error covariance of the observations Z_{t}.

(4) Analysis and update
Under the above assumptions, the estimated state and error covariance using the Kalman gain are updated as
$$ {x}_{i,t}^a={x}_{i,t}^f+{K}_t\left({Z}_t-{H}_t{x}_{i,t}^f\right) $$
(10)
$$ {P}_t^a=\left(1-{K}_t{H}_t\right){P}_t^f $$
(11)

(5) Repetition of steps (2), (3), and (4)
Iterations are established when running the algorithm from steps (1) to (5).
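The five steps can be sketched with NumPy as follows. This is a minimal illustration and not the implementation used in this study: a persistence function stands in for the Biome-BGC MuSo model operator, the state vector (soil temperature and volumetric water content), the noise magnitudes, the identity observation operator, and the observation values are all hypothetical, and the analysis step applies Eq. 10 literally to each member.

```python
import numpy as np


def enkf_step(ensemble, model, z, H, R, model_sigma, rng):
    """One forecast/analysis cycle; ensemble has shape (N, n_state)."""
    n_members = ensemble.shape[0]
    # Step 2, forecasting (Eq. 4): propagate each member, add Gaussian model error.
    forecast = np.array([model(x) for x in ensemble])
    forecast = forecast + rng.normal(0.0, model_sigma, size=forecast.shape)
    # Step 3 (Eqs. 7 and 8): sample forecast covariance and Kalman gain.
    anomalies = forecast - forecast.mean(axis=0)
    P_f = anomalies.T @ anomalies / (n_members - 1)
    K = P_f @ H.T @ np.linalg.inv(H @ P_f @ H.T + R)
    # Step 4, analysis (Eq. 10): correct each member with its innovation.
    innovation = z - forecast @ H.T
    return forecast + innovation @ K.T


def persistence(x):
    """Placeholder model operator (state unchanged); NOT Biome-BGC MuSo."""
    return x


rng = np.random.default_rng(0)
n_members = 200
background = np.array([5.0, 0.30])   # hypothetical soil temperature (deg C), VWC
sigma = np.array([1.0, 0.03])        # hypothetical standard deviations of the noise
# Step 1, initialization (Eqs. 2 and 3): background plus Gaussian noise.
ensemble = background + rng.normal(0.0, sigma, size=(n_members, 2))

H = np.eye(2)                        # assumption: both states observed directly
R = np.diag([0.5 ** 2, 0.02 ** 2])   # hypothetical observation error covariance
observations = np.array([[6.0, 0.35], [6.1, 0.35], [5.9, 0.34]])  # made-up values

# Step 5: repeat forecasting, gain calculation, and update for each observation.
for z in observations:
    ensemble = enkf_step(ensemble, persistence, z, H, R,
                         model_sigma=np.array([0.1, 0.005]), rng=rng)

analysis_mean = ensemble.mean(axis=0)
print(analysis_mean)
```

Many EnKF implementations additionally perturb the observation vector for each member so that the analysis spread is not underestimated; that variant is omitted here to stay close to Eq. 10.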
Data assimilation scheme
In this study, soil temperature and moisture were assimilated using Eq. 10, with H equal to (1 1 1 1)^{T}. Whenever daily soil temperature and moisture data were available, the model run was interrupted, EnKF updated the Biome-BGC MuSo state variables, and the simulation was reinitialized with the updated states and rerun until the next observation was available. All simulations were conducted from 2003 to 2007. Model parameters were assigned an uncertainty of 10% and perturbed following a Gaussian distribution (White et al. 2000). Sequential assimilation of observed data can correct some of the uncertainty in model parameters (Das et al. 2008). The ensemble members were generated by randomly sampling model parameter combinations from the perturbed arrays (Ines et al. 2013). Two hundred ensemble members were selected to balance the accuracy and computational time of the EnKF framework. Errors of the soil observations were obtained from the literature (Wang and Pei 2002).
We assimilated daily soil temperature and moisture to increase the number of observations and to update the modeled soil respiration and transpiration. In Biome-BGC MuSo, soil temperature (T_{soil}) is a key variable for calculating root respiration. Thus,
$$ \mathrm{MR}=\sum \limits_1^{n_r}\left({N}_{\mathrm{root}}\cdot {M}_{\mathrm{layer}}\cdot \mathrm{mrpern}\cdot {Q}_{10}^{\frac{T_{\mathrm{soil}\left(\mathrm{layer}\right)}-20}{10}}\right) $$
(12)
where n_{r} is the number of soil layers, N_{root} is the total N content of the soil, M_{layer} is the proportion of the total root mass in the given layer, mrpern is an adjustable ecophysiological parameter, Q_{10} is the fractional change in respiration with a temperature change of 10 °C, and T_{soil(layer)} is the soil temperature of the given layer. The assimilated daily soil temperature, updated through Eq. 10, was used to recalculate root respiration, and the updated value was then used to calculate ER in the next step.
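A short sketch of Eq. 12 shows how an updated soil temperature profile propagates into root maintenance respiration. The function name, parameter values, and temperature profiles are hypothetical and serve only to illustrate the Q_{10} dependence.

```python
def root_maintenance_respiration(n_root, root_fraction, t_soil, mrpern, q10):
    """Eq. 12: sum over layers of N_root * M_layer * mrpern * Q10**((T - 20) / 10)."""
    return sum(
        n_root * m_layer * mrpern * q10 ** ((t_layer - 20.0) / 10.0)
        for m_layer, t_layer in zip(root_fraction, t_soil)
    )


# Hypothetical inputs for the three soil layers (0-5, 5-20, 20-40 cm).
root_fraction = [0.5, 0.3, 0.2]      # M_layer, sums to 1
t_simulated = [12.0, 10.0, 8.0]      # deg C, before assimilation (made up)
t_assimilated = [14.0, 11.0, 8.5]    # deg C, after the EnKF update (made up)

mr_before = root_maintenance_respiration(0.01, root_fraction, t_simulated, 0.2, 2.0)
mr_after = root_maintenance_respiration(0.01, root_fraction, t_assimilated, 0.2, 2.0)
print(mr_before, mr_after)
```

With Q_{10} = 2, a uniform 10 °C warming doubles MR, so even small corrections to the layer temperatures change the respiration term noticeably.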
Soil moisture was calculated using the volumetric water content (VWC), soil layer thickness, and water density in Biome-BGC MuSo. The daily SWC assimilated during the spin-up was converted into the VWC array, which in turn provided reliable SWC during the model simulation phase through the restart file.
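The conversion between VWC and layer water content can be sketched as below, assuming the usual product of VWC, layer thickness, and water density. The layer thicknesses follow the 5, 15, and 20 cm layers described earlier; the function names and VWC values are hypothetical.

```python
WATER_DENSITY = 1000.0                  # kg m-3
LAYER_THICKNESS_M = [0.05, 0.15, 0.20]  # 0-5, 5-20, 20-40 cm layers


def vwc_to_swc(vwc):
    """Convert VWC (m3 m-3) per layer to soil water content (kg H2O m-2)."""
    return [v * dz * WATER_DENSITY for v, dz in zip(vwc, LAYER_THICKNESS_M)]


def swc_to_vwc(swc):
    """Inverse conversion, from kg H2O m-2 per layer back to VWC (m3 m-3)."""
    return [w / (dz * WATER_DENSITY) for w, dz in zip(swc, LAYER_THICKNESS_M)]


vwc_observed = [0.30, 0.28, 0.25]       # made-up daily observations
swc_layers = vwc_to_swc(vwc_observed)
print(swc_layers)
```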
Once the daily observations were assimilated into the model, the initialization processes were implemented, and the soil variables were corrected on a daily basis throughout the model runtime. This study compared normal simulations using calibrated Biome-BGC and Biome-BGC MuSo with simulations that assimilated soil temperature and moisture. All simulations were conducted for the period 2003–2007.
Evaluation and analysis of modeled estimates
To evaluate the simulated carbon and water fluxes, we used the results derived from EC measurements as ground-truth observations and calculated R^{2} (Eq. 13), RMSE (Eq. 14), and relative error (RE; Eq. 15) to evaluate the accuracy of each model simulation. Additionally, a significance test (p-value) was conducted to assess whether the observed patterns could be attributed to chance, i.e., whether the null hypothesis could be rejected.
$$ {R}^2=1-\frac{\sum_{i=1}^t{\left({X}_{\mathrm{obs}}-{X}_{\mathrm{mod}}\right)}^2}{\sum_{i=1}^t{\left({X}_{\mathrm{obs}}-\overline{X_{\mathrm{mod}}}\right)}^2} $$
(13)
$$ \mathrm{RMSE}=\sqrt{\frac{\sum_{i=1}^t{\left|{X}_{\mathrm{obs}}-{X}_{\mathrm{mod}}\right|}_i^2}{t}} $$
(14)
$$ \mathrm{RE}=\frac{\left|{X}_{\mathrm{mod}}-{X}_{\mathrm{obs}}\right|}{{X}_{\mathrm{obs}}} $$
(15)
In these equations, X_{obs} is the observation made at the forest flux site; X_{mod} is the simulated carbon or water flux, and i is the day of the year. t refers to the total number of days or day windows within one year.
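The three metrics can be computed as in the following sketch, which follows Eqs. 13 to 15 as written. In particular, the denominator of R^{2} uses the mean of the modeled values, as in Eq. 13, whereas the more common definition of the coefficient of determination uses the mean of the observations. Function names and sample values are hypothetical.

```python
import numpy as np


def r_squared(obs, mod):
    """Eq. 13 as written: 1 - SS_res / sum((obs - mean(mod))**2)."""
    obs, mod = np.asarray(obs, dtype=float), np.asarray(mod, dtype=float)
    return 1.0 - np.sum((obs - mod) ** 2) / np.sum((obs - mod.mean()) ** 2)


def rmse(obs, mod):
    """Eq. 14: root mean square error over the t samples."""
    obs, mod = np.asarray(obs, dtype=float), np.asarray(mod, dtype=float)
    return float(np.sqrt(np.mean((obs - mod) ** 2)))


def relative_error(obs, mod):
    """Eq. 15: |mod - obs| / obs, evaluated element-wise."""
    obs, mod = np.asarray(obs, dtype=float), np.asarray(mod, dtype=float)
    return np.abs(mod - obs) / obs


# Made-up daily fluxes, for illustration only.
x_obs = [1.0, 2.0, 3.0, 4.0]
x_mod = [1.1, 1.9, 3.2, 3.8]
print(r_squared(x_obs, x_mod), rmse(x_obs, x_mod), relative_error(x_obs, x_mod))
```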
We also analyzed the performance of data assimilation by comparing the difference (ΔRMSE) between RMSE_{DA} and RMSE_{MuSo}, calculated within a moving window of 15 days. A positive ΔRMSE indicates that the accuracy of the model simulation was improved by our proposed data assimilation strategy, and vice versa. We examined the relationships of ΔRMSE with three climatic forcings (Temp, Precip, and PAR) and three biophysical factors (soil temperature, soil moisture, and LAI). This analysis therefore identified the situations showing the most significant improvements after assimilating soil temperature and moisture, thereby providing insights into the application of the proposed method to other forest ecosystems.
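A windowed ΔRMSE can be sketched as follows. Two assumptions are made here that the text does not state explicitly: the 15-day window slides forward one day at a time, and ΔRMSE is taken as RMSE_{MuSo} minus RMSE_{DA}, so that a positive value corresponds to an improvement after assimilation. Function names and the synthetic series are hypothetical.

```python
import numpy as np


def windowed_rmse(obs, mod, window=15, step=1):
    """RMSE within each window of `window` days, advancing by `step` days."""
    obs, mod = np.asarray(obs, dtype=float), np.asarray(mod, dtype=float)
    starts = range(0, len(obs) - window + 1, step)
    return np.array([
        np.sqrt(np.mean((obs[s:s + window] - mod[s:s + window]) ** 2))
        for s in starts
    ])


def delta_rmse(obs, mod_muso, mod_da, window=15, step=1):
    """Assumed sign: RMSE_MuSo - RMSE_DA, positive when assimilation helps."""
    return (windowed_rmse(obs, mod_muso, window, step)
            - windowed_rmse(obs, mod_da, window, step))


# Synthetic 60-day example: the assimilated run has half the bias of the open run.
days = np.arange(60)
obs = 3.0 + np.sin(2.0 * np.pi * days / 60.0)
mod_muso = obs + 1.0
mod_da = obs + 0.5
d_rmse = delta_rmse(obs, mod_muso, mod_da)
print(d_rmse.shape, d_rmse.min(), d_rmse.max())
```

Each window value can then be paired with the window means of the climatic and biophysical factors to examine where assimilation helps most.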