
Performance of statistical and machine learning-based methods for predicting biogeographical patterns of fungal productivity in forest ecosystems

Abstract

Background

The prediction of biogeographical patterns from a large number of driving factors with complex interactions, correlations and non-linear dependencies requires advanced analytical methods and modeling tools. This study compares different statistical and machine learning-based models for predicting fungal productivity biogeographical patterns as a case study for the thorough assessment of the performance of alternative modeling approaches to provide accurate and ecologically-consistent predictions.

Methods

We evaluated and compared the performance of two statistical modeling techniques, namely, generalized linear mixed models and geographically weighted regression, and of four techniques based on different machine learning algorithms, namely, random forest, extreme gradient boosting, support vector machine and artificial neural network, to predict fungal productivity. Model evaluation was conducted using a systematic methodology that combined random, spatial and environmental blocking with an assessment of the ecological consistency of spatially-explicit model predictions according to scientific knowledge.

Results

Fungal productivity predictions were sensitive to the modeling approach and the number of predictors used. Moreover, the importance assigned to different predictors varied between machine learning modeling approaches. Decision tree-based models increased prediction accuracy by more than 10% compared to other machine learning approaches, and by more than 20% compared to statistical models, and resulted in higher ecological consistency of the predicted biogeographical patterns of fungal productivity.

Conclusions

Decision tree-based models were the best approach for prediction, both in environments similar to those sampled and in extrapolation beyond the spatial and climatic range of the modeling data. In this study, we show that proper variable selection is crucial to create robust models for extrapolation in biophysically differentiated areas. This allows the dimensions of the ecosystem space described by the model predictors to be reduced, resulting in higher similarity between the modeling data and the environmental conditions over the whole study area. When dealing with spatio-temporal data in the analysis of biogeographical patterns, environmental blocking is postulated as a highly informative technique for cross-validation to assess the prediction error over larger scales.

Background

Understanding the biogeographical patterns of organisms in natural ecosystems and predicting their distribution is a fundamental challenge in environmental sciences (Ehrlén and Morris 2015). This entails a deep understanding of the ecological mechanisms underpinning their distribution across space and time, which becomes increasingly complex as the number of factors driving these patterns, and the possible interactions and nonlinear dependencies between them, increases (Dixon et al. 1999; Ye et al. 2015). Such complex interrelationships require advanced data analysis methods and modeling tools to yield realistic predictions of natural ecosystem attributes and processes.

Statistical methods traditionally used for this purpose aim at accounting for the several elements that govern these natural mechanisms, trying to reach a parsimonious and robust understanding of ecological patterns (Wood and Thomas 1999). However, since conventional parametric approaches may over-simplify nonlinear relationships between variables and over- or under-estimate the influence of some drivers, they may result in poor predictions and/or descriptions of reality (Ye et al. 2015), especially in the analysis of large databases. To overcome these potential limitations of classic statistical approaches in big data analysis, increased computing power has led to considerable recent growth in the use of analytical methods based on artificial intelligence, such as machine learning (Christin et al. 2019).

Machine learning algorithms are increasingly being used in species distribution and ecological niche modeling (Prasad et al. 2006; Cutler et al. 2007; Hannemann et al. 2015; Liang et al. 2016; Prasad 2018; Gobeyn et al. 2019), forest resources (Stojanova et al. 2010; Görgens et al. 2015) and climate change studies (Thuiller 2003; Bastin et al. 2019), among others. To determine to what extent these "new" methodologies can contribute to improving our understanding and prediction capacity within the field of environmental sciences, comparative studies are required between the models that have been used historically and those driven by artificial intelligence algorithms (Özçelik et al. 2013; Diamantopoulou et al. 2015; Hill et al. 2017; Bonete et al. 2020). Yet, many machine learning algorithms have been developed in recent years, and each of them may be more or less appropriate depending on the specific tasks and research objectives (Thessen 2016). This highlights the need for systematic studies allowing the most suitable methodology to be discerned for a given research objective and dataset. Although several studies have analysed the performance of different analytical approaches (Hill et al. 2017; Bonete et al. 2020; Mayfield et al. 2020), ecological research addressing systematic assessments and comparisons of alternative modeling and predictive methods is scarce, making it difficult to provide clear methodological recommendations about the suitability of different approaches. Besides, in the field of environmental sciences, extrapolations to biophysically differentiated areas are often required, which makes it even more necessary to account for spatial dependencies in the data. Due to spatial autocorrelation in the data, random cross-validation leads to over-optimistic error estimates (Bahn and McGill 2012; Micheletti et al. 2013; Juel et al. 2015; Gasch et al. 2015; Roberts et al. 2017; Meyer et al. 2018; Meyer et al. 2019a), which makes it necessary to use proper, complementary validation methods such as spatial cross-validation (Le Rest et al. 2014; Pohjankukka et al. 2017; Roberts et al. 2017; Meyer et al. 2018; Valavi et al. 2018). Moreover, spatial dependencies in the data can lead to a misinterpretation of some predictors outside the sampling range (Meyer et al. 2018, 2019a).

Biogeographical patterns of fungal dynamics over large scales are a highly relevant question in ecology given the key role of fungi in forest ecosystems (Stokland et al. 2012; Mohan et al. 2014), especially through fungus-tree symbiosis. However, due to the great diversity of fungi and their differential ecological requirements (Glassman et al. 2017), as well as the difficulty of monitoring their dynamics and the large array of potential drivers (Büntgen et al. 2013), little is known about fungal dynamics at such scales. The prediction of biogeographical patterns of fungal dynamics requires large fungal datasets with correct taxonomic identification of the specimens and a consistent sampling methodology across space and time to avoid sample bias (Hao et al. 2020).

In particular, the spatially-explicit prediction of fungal productivity, i.e. mushroom fruiting patterns, is a key feature of fungal dynamics, as it is tightly related to the supply of multiple provisioning, regulating and cultural ecosystem services (Boa 2004). However, the strong dependence of mushroom production on meteorological conditions, among other drivers (Taye et al. 2016; Alday et al. 2017; Collado et al. 2019), makes the prediction of mushroom production challenging, especially in Mediterranean ecosystems with their high inter-annual variability of climatic conditions. The long period of potential fruiting of different mushroom species, resulting from their adaptation to the recurrent climatic pattern of a dry summer followed by a wet autumn (Barnard et al. 2014), makes mushroom yields dependent on a large number of variables. Precipitation and temperature at a weekly scale can be the factors that lengthen, shorten or shift the fruiting period (Gange et al. 2007; Kauserud et al. 2008; Kauserud et al. 2009; Büntgen et al. 2012), and also those that most strongly modulate mushroom production (Karavani et al. 2018). The large number of variables involved and their presumed interactions may often lead to the misconception that fungal productivity is highly stochastic or very difficult to predict. Previous research estimating mushroom productivity over large scales has mainly been based on mixed-effects modeling (de-Miguel et al. 2014; Sánchez-González et al. 2019). Despite being a valid approach, it may have certain limitations that are worth assessing in comparison with alternative methods that remain unexplored.

This study compares different statistical and machine learning models for estimating mushroom productivity at the landscape level, together with a systematic methodology to determine the best approach for predicting mushroom production in forest ecosystems. Using climatic and biophysical data together with in situ fungal records collected weekly over more than 20 years on nearly a hundred permanent plots, we developed spatially explicit, high-resolution continuous maps of mushroom productivity that were also used in the selection of the most suitable methods for predicting this ephemeral and important forest resource. Specifically, we compared two statistical models, namely, generalized linear mixed models (GLMM) and geographically weighted regression models (GWR), and four state-of-the-art machine learning algorithms, namely, random forest (RF), extreme gradient boosting (XGB), support vector machine (SVM) and artificial neural network (ANN).

Methods

Study area and sampling plots

The study area was the region of Catalonia, northeastern Spain, in the western Mediterranean basin. The forest ecosystem types considered in this study were the main Mediterranean pine forest ecosystems, which represent the majority of the forest area of the study region, namely, pure stands of Pinus halepensis, P. sylvestris, P. pinaster, P. nigra and P. uncinata, and mixed stands of P. halepensis and P. nigra, and of P. sylvestris and P. nigra. We used a dataset containing information from 98 permanent monitoring plots for fungal dynamics, sampled on a weekly basis during the main mushroom fruiting period (between August and the end of December) from 1997 to 2019. The plots were distributed randomly and proportionally to the relative surface of the different pine forest ecosystems (Bonet et al. 2010) (Fig. S1). Data were aggregated to an annual basis to create predictive models of annual mushroom productivity. More information about the sampling methods and data can be found in Bonet et al. (2004), Martínez de Aragón et al. (2007) and Table S3.

Climate and biophysical data

Meteorological data for each sampling plot were obtained from the interpolation and altitudinal correction of daily weather records from 201 meteorological stations of the Catalan Meteorological Service and the Spanish Meteorological Agency. Interpolation was conducted with the "meteoland" R package (v0.8.1; De Cáceres et al. 2018), which uses a modification of the DAYMET methodology (Thornton et al. 1997; Thornton and Running 1999). Likewise, to determine the typical climatic conditions across the whole study region, we used the mean of the interpolated daily weather variables for the period between 1991 and 2016 at 1-km resolution. We computed the accumulated monthly rainfall from August to October and the mean, maximum and minimum monthly temperatures for the same period, coinciding with the main mushroom fruiting period.

The total area covered by the different pine forest ecosystems was retrieved from the CORINE habitats map (Commission of the European Community 1991). Biophysical variables such as elevation, slope, aspect and stand basal area were obtained at 20-m resolution from the first coverage of the LIDARCAT project (http://territori.gencat.cat/es/detalls/Article/Mapes_variables_biofisiques_arbrat), based on different LiDAR flights between 2008 and 2011 with a point density of 0.5 points·m⁻².

Analytical methods

We used and compared six different analytical methods to predict annual mushroom productivity. Two analytical approaches were based on statistical methods, namely, generalized linear mixed-effects models (GLMM) and geographically weighted regression (GWR), whereas the other four analytical methods were based on alternative machine learning approaches, namely, random forest (RF), extreme gradient boosting (XGB), support vector machine (SVM) and artificial neural network (ANN).

Statistical model fitting

We used a two-stage modeling approach to take into account the high frequency of "zero" production values in many sample plots over time (Hamilton and Brickell 1983; de-Miguel et al. 2014; Karavani et al. 2018; Collado et al. 2018). The high occurrence of these values arises from the small size of the plots and the stochastic nature of mushroom emergence (de-Miguel et al. 2014).

The first stage determines the probability of mushroom emergence, according to binomial data of presence/absence, using a logistic regression π(X) = E(Y|X) and a logit link function to represent the conditional mean of Y given X (Eqs. 1 and 2).

$$ \pi(x_k) = E(Y \mid x_k) = \frac{1}{1 + e^{-g(x_k)}} $$
(1)
$$ g(x_k) = \ln\left(\frac{\pi(x_k)}{1 - \pi(x_k)}\right) = \beta_0 + \beta_k x_k $$
(2)

where π(xk) is the probability of mushroom occurrence, g(xk) is the logit transformation of π(xk), Y is the dependent variable (mushroom presence/absence), xk is the kth independent variable, β0 is the intercept parameter and βk is the regression coefficient for the kth independent variable.

The second stage modeled the non-zero production values on a logarithmic scale using linear regression y = E(log(Y)|X) + ε. The logarithmic transformation constrains the predicted production to the interval [0, ∞), depending on the values of X (Eq. 3). The proportional bias of the logarithmic regression was corrected with Snowdon's bias correction factor (Snowdon 1991), based on the ratio of the arithmetic sample mean to the mean of the back-transformed predicted values from the regression (Eq. 4):

$$ \ln(\mathrm{prod}) = \beta_0 + \beta_k x_k + \varepsilon $$
(3)
$$ CF = \bar{y} / \bar{\hat{y}} $$
(4)

where ln(prod) represents the non-zero production of mushrooms on the logarithmic scale, β0 is the intercept parameter, βk the regression coefficient for the kth independent variable, ε the random error of the deviation of the observations from the conditional mean of ln(Y), and the correction factor CF is the ratio between the mean of the observed values and the mean of the back-transformed predicted values of the sampling units.

Finally, the total production of mushrooms was obtained as the product of the probability of occurrence and the bias-corrected conditional production of non-zero values (Eq. 5).

$$ \mathrm{yield} = \pi(x_k)\, e^{\ln(\mathrm{prod})}\, CF $$
(5)

where π(xk) is the probability of mushroom occurrence, ln(prod) represents the predicted non-zero production of mushrooms on the logarithmic scale (Eq. 3) and CF is the bias correction factor.

Generalized linear mixed models

Because mushroom data were collected annually from a network of permanent plots, we used GLMMs (de-Miguel et al. 2014; Karavani et al. 2018; Collado et al. 2018). This method can account for the spatial and temporal autocorrelation among observations (Pinheiro and Bates 2000) by adding random effects that group the data by year and plot. In the proposed mixed-effects models, only random effects on the model intercept were considered. All the models were fitted using the "glmer" function from the "lme4" R package (v1.1–21; Bates et al. 2015).
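As a minimal sketch of this two-stage approach with the "lme4" package (the data frame, response and predictor names below are hypothetical placeholders and do not reproduce the exact formulas used in the study):

```r
library(lme4)

# Stage 1 (Eqs. 1-2): occurrence model with random intercepts for plot and year.
occ_mod <- glmer(occurrence ~ prec_aug_oct + tmax_aug + (1 | plot) + (1 | year),
                 data = dat, family = binomial(link = "logit"))

# Stage 2 (Eq. 3): conditional (non-zero) production on the logarithmic scale.
pos      <- subset(dat, production > 0)
prod_mod <- lmer(log(production) ~ prec_aug_oct + tmax_aug + (1 | plot) + (1 | year),
                 data = pos)

# Snowdon's ratio correction factor (Eq. 4): mean observed / mean back-transformed fitted.
cf <- mean(pos$production) / mean(exp(fitted(prod_mod)))

# Combined prediction (Eq. 5) for a hypothetical new-data frame 'newdat',
# using fixed effects only (re.form = NA).
p_occ <- predict(occ_mod,  newdata = newdat, re.form = NA, type = "response")
yield <- p_occ * exp(predict(prod_mod, newdata = newdat, re.form = NA)) * cf
```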

Geographically weighted regression

GWR is a non-stationary modeling technique that describes spatially varying relationships between the dependent variable and the explanatory variables (Wheeler and Páez 2009). The coefficients of a GWR-based model depend on the spatial location of the data and can be estimated for any new location. This means that, given a grid, the estimated coefficients for each point in space vary continuously as a function of the spatial heterogeneity of the relationships.

Coefficients for each regression point were calibrated using the data around it. Owing to the annual sampling methodology and the geographical distribution of plots, plot density was higher in some areas than in others. Consequently, we used an adaptive window according to the spatial density of our plots (Georganos et al. 2017). The occurrence and conditional production models, as well as the optimal adaptive bandwidth, were obtained from the "ggwr" and "gwr.sel" functions, respectively, of the R package "spgwr" (v0.6–32; Bivand and Yu 2017).
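A rough illustration of this GWR workflow with "spgwr" is given below; the formulas, coordinates and column names are assumptions for the example rather than the study's actual models:

```r
library(spgwr)

# Plot coordinates of the modeling data (hypothetical column names).
xy <- cbind(dat$utm_x, dat$utm_y)

# Adaptive bandwidth: fraction of observations used in each local fit,
# selected by cross-validation.
adapt_q <- gwr.sel(log_production ~ prec_aug_oct + tmax_aug,
                   data = dat, coords = xy, adapt = TRUE)

# Occurrence sub-model: generalised (binomial) GWR.
occ_gwr <- ggwr(occurrence ~ prec_aug_oct + tmax_oct,
                data = dat, coords = xy, adapt = adapt_q, family = binomial)

# Conditional production sub-model: Gaussian GWR on the log scale.
prod_gwr <- gwr(log_production ~ prec_aug_oct + tmax_aug,
                data = dat, coords = xy, adapt = adapt_q)

# Local coefficients for each regression point are stored in occ_gwr$SDF / prod_gwr$SDF.
```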

Hyperparameter optimization of machine learning models

To optimize the performance of the machine learning models, it was necessary to tune their respective hyperparameters (Hutter et al. 2011; Bergstra and Bengio 2012; Duarte and Wainer 2017). This needs to be conducted prior to the training of the final predictive models and must also consider the spatial and temporal dependencies of both the modeling and prediction datasets (Schratz et al. 2019). Since hyperparameter tuning based on a resampling method such as k-fold cross-validation may lead to incorrect tuning for models that aim to predict in environmentally different areas (Roberts et al. 2017), tuning was conducted with several alternative resampling techniques, namely, k-fold cross-validation, spatial cross-validation and environmental cross-validation (Roberts et al. 2017).

We used an optimization algorithm based on a search grid (Bergstra and Bengio 2012) implemented in the R package "mlr3tuning" (v0.5.0; Becker et al. 2020), which selects the best hyperparameter configuration according to a given metric. The search space was defined from the Cartesian product of the discretized values of the set of n hyperparameters to be tuned in each model (Table 1). A search grid of resolution 25 was defined, and a total of 250 different hyperparameter configurations were randomly selected from it and evaluated under each resampling strategy. We used 10 folds in each resampling strategy, relying on the R package "mlr3" (v0.9.0; Lang et al. 2019) for the core computational operations and on the extensions "mlr3spatiotempcv" (v0.1.1; Schratz and Becker 2021) and "mlr3learners" (v0.4.3; Lang et al. 2020a) for the resampling and the use of the different models, respectively. In addition, the R packages "paradox" (v0.6.0; Lang et al. 2020b) and "mlr3keras" (v0.1.3; Pfisterer et al. 2021) were also used in hyperparameter tuning.

Table 1 Hyperparameter ranges and types for each machine learning model
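The sketch below illustrates how such a grid search could be set up for one learner (random forest via "ranger") in the mlr3 ecosystem. The task construction, hyperparameter ranges and column names are assumptions; the k-fold resampling shown would be swapped for the spatial or environmental blocking described above, and the study sampled 250 configurations from the grid rather than evaluating it exhaustively as done here:

```r
library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(paradox)

# Regression task built from the plot-level data (hypothetical data frame 'dat').
task    <- TaskRegr$new(id = "mushrooms", backend = dat, target = "production")
learner <- lrn("regr.ranger")

# Discretised search space for two random forest hyperparameters (illustrative ranges).
search_space <- ParamSet$new(list(
  ParamInt$new("mtry",          lower = 1, upper = 5),
  ParamInt$new("min.node.size", lower = 1, upper = 10)
))

instance <- TuningInstanceSingleCrit$new(
  task         = task,
  learner      = learner,
  resampling   = rsmp("cv", folds = 10),  # or a spatial/environmental resampling
  measure      = msr("regr.mse"),
  search_space = search_space,
  terminator   = trm("none")              # stop when the grid is exhausted
)

# Grid of resolution 25 per hyperparameter, as in the study.
tuner <- tnr("grid_search", resolution = 25)
tuner$optimize(instance)

instance$result_learner_param_vals  # best configuration found
```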

Model and variable selection and evaluation

Statistical model evaluation and variable selection were based on current knowledge of forest and mushroom ecology, the statistical significance of model parameters (p < 0.05 or t > |1.96|), the variance inflation factor (VIF) to quantify the severity of multicollinearity, and the parsimony principle. To check the sensitivity/specificity of the binomial classification models we used Receiver Operating Characteristic (ROC) curves, computed with the R package "ROCR" (v1.0–7; Sing et al. 2005).
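For instance, an ROC curve and the corresponding AUC for a fitted occurrence model could be obtained as follows (the model object "occ_mod" and the response column are hypothetical placeholders):

```r
library(ROCR)

# Fitted occurrence probabilities vs. observed presence/absence (0/1).
pred <- prediction(fitted(occ_mod), dat$occurrence)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")

plot(perf)                               # ROC curve
performance(pred, "auc")@y.values[[1]]   # area under the curve
```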

To assess whether GWR improved on GLMM owing to the non-stationary nature of the data, and to avoid attributing to the modeling technique an improvement that was actually due to different predictors, the same explanatory variables as in the GLMM were used. To test the non-stationarity of the independent variables of the GWR models, the local parameters were compared with the global GLMM coefficients. A variable is more likely to be non-stationary if the estimated GLMM coefficient (± standard error) lies outside the first-to-third quartile range of the corresponding local GWR coefficients (Propastin 2009).

For each of the four machine learning algorithms, two models were fitted. The first was trained using a total of 15 biophysical variables, while the second was trained using a subset of them (Table S1). This subset comprised the same five variables used in the statistical models, including climate predictors only. This allowed us to separate the prediction accuracy attributable to the analytical method from that attributable to differences in the predictors or in the number of explanatory variables. The final machine learning models were trained using 100% of the sampled data (henceforth referred to as "modeling data") and the optimal hyperparameter settings. We used the hyperparameters optimized under environmental blocking (Roberts et al. 2017) to train the final models. The relationship between predictors and mushroom productivity was assessed with partial dependence plots (PDPs), a low-dimensional graphical rendering of the marginal relationship between individual predictors and the response, in order to determine whether this relationship lacked ecological sense. The importance of the model predictors was determined from a sensitivity analysis using the R package "rminer" (v1.4.2; Cortez 2016).
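A sensitivity-analysis-based importance of this kind can be computed with "rminer" roughly as sketched below; the model type and the data frame "dat5" are illustrative placeholders rather than the study's final tuned models:

```r
library(rminer)

# Fit an illustrative model and measure input importance by sensitivity analysis
# (rminer's default 1D-SA method).
M   <- fit(production ~ ., data = dat5, model = "randomForest")
imp <- Importance(M, data = dat5)

# Relative importance of each predictor.
round(imp$imp, 3)
```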

Evaluation of the predictive performance and mapping

Since the main purpose of this study was to develop models that accurately predict mushroom productivity, the predictive performance of the resulting models also had to be evaluated outside the range of the training (or fitting) region. To determine the similarity between the modeling data and the environmental conditions of the whole study area, a principal component analysis (PCA) based on both datasets combined was used. Using the location of the modeling data within the space described by the first two components of the PCA, a density map was created based on a two-dimensional kernel density estimation implemented in the "kde2d" function of the R package "MASS" (Venables et al. 2002). By overlapping this density map and the location of each pixel of the study area within the space defined by the two principal components of the PCA, a similarity value (ranging from 0 to 1) was obtained for each pixel of the study area based on the density of the modeling data. This similarity value was classified into three categories: low [0, 0.1], medium (0.1, 0.3] and high (0.3, 1] similarity. This classification mainly aimed at detecting the areas with very low similarity, i.e. with very different climatic conditions compared to the modeling data. The whole procedure was performed separately using the five climatic predictors of the 5-variable models and the 12 climatic predictors of the 15-variable models, respectively.
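A minimal sketch of this similarity mapping, assuming "train_clim" holds the climatic predictors of the modeling plots and "grid_clim" the same predictors for every 1-km pixel (both hypothetical objects):

```r
library(MASS)

# PCA on the modeling data; project the prediction grid into the same space.
pca      <- prcomp(train_clim, center = TRUE, scale. = TRUE)
train_pc <- pca$x[, 1:2]
grid_pc  <- predict(pca, newdata = grid_clim)[, 1:2]

# Two-dimensional kernel density of the modeling data in PCA space.
dens <- kde2d(train_pc[, 1], train_pc[, 2], n = 200)

# Look up the density at each pixel position (pixels outside the density grid
# are simply clamped to its edge in this sketch) and rescale to [0, 1].
ix <- findInterval(grid_pc[, 1], dens$x); ix[ix == 0] <- 1
iy <- findInterval(grid_pc[, 2], dens$y); iy[iy == 0] <- 1
sim <- dens$z[cbind(ix, iy)] / max(dens$z)

# Classify into the three similarity levels used in the study.
sim_class <- cut(sim, breaks = c(-Inf, 0.1, 0.3, 1),
                 labels = c("low", "medium", "high"))
```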

To evaluate and compare the predictive accuracy of the models, different resampling strategies were used (see the section on Hyperparameter optimization of machine learning models). The MSE and bias2 of the models were estimated by averaging, respectively, the MSE and bias2 obtained from each of the 10 folds of each cross-validation strategy.
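Assuming that the squared bias of a fold is computed as the square of its mean prediction error, the per-strategy summaries could be obtained along these lines (the fold assignments and the fitting function are placeholders):

```r
fold_stats <- sapply(1:10, function(k) {
  test  <- dat[folds == k, ]            # 'folds' is a hypothetical fold index vector
  train <- dat[folds != k, ]
  fit   <- train_model(train)           # placeholder for any of the six modeling approaches
  pred  <- predict(fit, newdata = test)
  c(mse   = mean((test$production - pred)^2),
    bias2 = mean(test$production - pred)^2)
})

rowMeans(fold_stats)  # MSE and bias2 averaged over the 10 folds
```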

To generate the landscape-level mushroom productivity maps we used the predictions of the final trained models. These maps were constructed at a resolution of 1 km, in accordance with the resolution of the climatic data. The resulting maps were evaluated on the basis of scientific and expert knowledge about biogeographical fungal productivity patterns in order to assess whether they followed ecologically logical patterns (related to climatic conditions). Thus, we would expect smoothed estimates across the territory driven by the variations of the most important predictors of each model.

Results

Relationships between dependent and explanatory variables

Statistical models showed a statistically significant and positive relationship of mushroom productivity with rainfall in August, September and October (in both the conditional production and occurrence models). On the other hand, the conditional production and occurrence models also showed a statistically significant and negative relationship with the maximum temperature of August and October, respectively (Tables S1, S2, S3). Yet, the coefficients of the GWR models varied according to geographical location (Table S3 and Fig. S2), describing a certain non-stationarity in both precipitation and temperature.

Within GLMM models, PDPs showed an almost linear relationship between the amount of precipitation from August to October and mushroom productivity within the range of the model fitting data. In contrast, GWR showed an accelerating increase in productivity with increasing rainfall, which was accentuated in those locations with a higher precipitation regression coefficient. Besides, and similarly in GLMM and GWR, the maximum temperature in August showed a decelerating decrease in productivity with increasing temperature, while the maximum temperature of October, even though it showed a negative relation, had little influence on mushroom productivity within the range of values of the fitting data (Fig. 1).

Fig. 1 Relationship between annual mushroom productivity and August, September and October precipitation and maximum temperatures in September and October (these are the variables used in the statistical models and in the five-variable machine learning models). ran (random forest), xgb (extreme gradient boosting), svm (support vector machine), ann (artificial neural network), glmm (generalized linear mixed models) and gwr (geographically weighted regression). 05 and 15 refer to the models trained with five and fifteen variables, respectively

Different machine learning models resulted in rather similar relationships between variables although, due to the particularities of each algorithm, the patterns changed slightly between approaches. In contrast to the relationships in the GLMM and GWR models, some of the machine learning models did not show monotonically increasing or decreasing relationships between dependent and explanatory variables. This monotonicity was often broken at the extremes of the range of values of the predictor variables, where the amount of data to train the models was lower (Fig. 1). Moreover, machine learning methods also showed differences in the importance assigned to different predictors. Thus, XGB identified some variables as very important compared to other predictors. Specifically, in the models trained with 15 variables, XGB assigned greater importance to precipitation in August, September and October, minimum temperature in October and aspect. In addition, August precipitation was even more important in the models trained with five variables. Conversely, the importance that RF and SVM assigned across the whole array of predictors was more homogeneous. RF assigned greater importance to the same variables as XGB, while the most important variables in SVM were precipitation in September and October, average temperature in August and September, and minimum temperature in August (Fig. 2).

Fig. 2 Standardized variable importance for random forest (ran), extreme gradient boosting (xgb) and support vector machine (svm) models trained with five (a) and 15 (b) variables. Variable importance values represent the contribution of each variable to the prediction of annual mushroom productivity

The fitted GLMM and GWR models and their coefficients are shown in the supplementary material in Tables S1 to S3. Likewise, the optimal machine learning hyperparameters can be found in Tables S4 and S5.

Predictive accuracy of different methods

In general, machine learning models showed better predictive accuracy, in terms of MSE, than the statistical models. Within the machine learning models, ANN showed a higher error than the other algorithms. Decision tree-based models, namely, RF and XGB, showed no differences between the 15- and five-variable models when k-fold CV-based resampling was used. Under environmental blocking, RF models, as well as SVM and ANN, showed lower accuracy when using five variables instead of 15. In contrast, the prediction error of XGB increased significantly when using 15 variables (instead of five) in the environmental CV, resulting in the lowest accuracy among all machine learning models and equaling the error of the statistical models. Decision tree-based models significantly reduced the bias between predicted and observed values, especially under k-fold CV. On the other hand, the error estimated from a spatial CV with the SVM and ANN models trained with 15 variables was lower than in the five-variable models. Under k-fold CV, the error of the SVM models was higher when using more predictors, whereas for the ANN models it was higher for the more parsimonious models. Although GWR improved on the GLMM predictive accuracy in the k-fold and spatial CV, mainly due to a notable bias reduction, this was not the case in the environmental CV, where the error was similar for both statistical modeling approaches (Table 2).

Table 2 Mean squared error (MSE) and squared bias (bias2) of the different machine learning and statistical models depending on different resampling strategies, namely, k-fold, environmental, and spatial cross-validation. ran (random forest), xgb (extreme gradient boosting), svm (support vector machine), ann (artificial neural network), glmm (generalized linear mixed models) and gwr (geographically weighted regression). 05 and 15 refer to the models trained with five and fifteen variables, respectively

Decision tree-based models increased the predictive accuracy, reducing MSE by up to 10% compared to XGB, by almost 40% compared to ANN and GLMM, and by 20% compared to GWR when using k-fold CV. Similar trends were also obtained using environmental and spatial CV.

Mapping and accuracy of predictions at the landscape level

The spatially explicit predictions from each model at the landscape level resulted in rather similar general patterns between modeling approaches (Fig. 3). Namely, they predicted higher productivity in the northern areas of the study region, characterized by higher altitudes, i.e., the Pyrenees mountain range. The different models also reproduced similar patterns within these areas according to variations in local topography. In addition, RF, XGB and SVM models trained with 15 predictors yielded higher estimates of mushroom productivity in coastal areas compared to the same algorithms based on a subset of five predictors. Those coastal areas represented the least similar bioclimatic conditions compared to the modeling data when using 12 predictors (Fig. 4 and Fig. S3), therefore increasing the area of extrapolation beyond the range of the modeling data. Specifically, the similarity map based on 12 predictors shows that the number of pixels with low and medium similarity increased by 58% (359 km²) and 50% (847 km²), respectively, compared to the similarity map based on five predictors. On the other hand, pixels with high similarity decreased by 28% (1206 km²).

Fig. 3 Landscape-level prediction of total annual mushroom productivity, using ran (random forest), xgb (extreme gradient boosting), svm (support vector machine), ann (artificial neural network), glmm (generalized linear mixed models) and gwr (geographically weighted regression). 05 and 15 labels refer to the models trained with five and fifteen variables, respectively

Fig. 4 Similarity between the climate conditions of the whole study area and the modeling data. The spatial similarity was based on the density of modeling data sampled under environmental conditions similar to those of each prediction location. The per-pixel similarity was obtained by overlapping the pixel position and a density map of the sampled data in a two-dimensional space defined by the first two principal components of a principal component analysis (PCA). The PCA was performed with five and 12 variables according to the climate predictors of the machine learning models (Fig. S3)

RF, XGB and SVM trained with 15 variables also resulted in less smoothed predictions of mushroom yield across the territory compared to estimates based on the subset of five predictors. Furthermore, SVM produced illogical predictions below 0 kg·ha⁻¹·year⁻¹ in a few spatially localized areas when five variables were used, and scattered throughout the territory when using 15 predictors. ANN, in contrast, resulted in very smoothed estimates across the territory, unlike the maps obtained from all the other machine learning methods (Fig. 3).

In addition, mushroom productivity predictions based on RF, XGB, SVM and GWR ranged between 0, in the less productive areas, and approximately 300 to 400 kg·ha⁻¹·year⁻¹ (with some maximum peaks reaching 500 to 600 kg·ha⁻¹·year⁻¹). Lower productivity was predicted by GLMM and ANN for the most productive sites, not exceeding 200 kg·ha⁻¹·year⁻¹ at any point of the study area.

Discussion

To our knowledge, this is the first study addressing a systematic evaluation of the predictive performance of alternative statistical and machine learning models for predicting fungal productivity, and one of the few systematic comparisons between these different predictive approaches within the field of ecological research. It was conducted using one of the largest datasets (if not the largest one) for fungal productivity monitoring, based on a consistent sampling methodology and taxonomic identification of mushrooms over more than 20 years on nearly a hundred permanent sampling plots randomly distributed throughout the study region, which contributes to overcoming most of the practical problems related to the availability of data for modeling fungal resources (Hao et al. 2020).

When dealing with complex ecological interactions between multiple potential explanatory variables, our results show that statistical models, especially GLMM, clearly have lower predictive performance than artificial intelligence-based approaches, in line with previous research (e.g. Smoliński and Radtke 2016; Schratz et al. 2019). They were less accurate and produced large over- or underestimations of mushroom productivity (Table 2), making them unreliable for such purposes compared to the other alternatives. On the other hand, statistical models can be good candidates for detecting the most appropriate variables to be used in machine learning models and for unravelling environmental-ecological relationships between them (Shmueli 2010; Schratz et al. 2019), since the statistical assumptions that shape these models allow the relationships in the data to be approximated by a set of probability distributions. Fitting GWR parameters using a subset of data selected according to geographical location corrected the strong underestimation of fungal productivity produced by the GLMM models under k-fold CV. However, it was not possible to correct the bias in mushroom productivity prediction in environmentally differentiated areas. By considering spatial parameters, we were able to find non-stationary patterns across the territory, denoting that climatic conditions do not affect fungal productivity equally across the landscape.

As demonstrated here, choosing a subset of variables from the statistically significant predictors of statistical models can help to deal with some drawbacks. A problem with selecting a single subset of variables from a machine learning model is that, due to the algorithm itself, importance is quantified differently and could be inappropriate for some of the variables. For example, among decision tree algorithms, XGB selects the variable to be used in each node of the tree from the total set of variables of a model, while RF does so within a random subset of them, giving less important variables a greater probability of being chosen (Hastie et al. 2001). On the other hand, the importance of a set of correlated variables can be distributed among the different predictors (giving lower importance to each one of them), even though the total importance that this set represents for the predictive performance is remarkable (Toloşi and Lengauer 2011). As a result, when the less important variables are discarded, this whole set of predictors may be omitted, causing a notable drop in predictive performance. Moreover, in a group of correlated variables where there is only one true predictor (the one implying real causality), machine learning algorithms could assign similar importance values to the whole set of variables (Archer and Kimes 2008), actually hiding the true predictor. These problems may be aggravated when using a larger number of variables to train the models, where the probability of finding groups of correlated variables is higher. Consequently, each machine learning algorithm gives a different importance to each variable. Therefore, to identify the variables that could best explain the processes that occur in natural ecosystems, and/or to use variable importance to select a subset of predictors for training a more parsimonious model, the above considerations should be taken into account.

The fact that the prediction error obtained from the RF, SVM and ANN models was lower when using 15 predictors under environmental blocking suggests that models using a larger number of predictors may be a better alternative for predicting mushroom productivity at the landscape level. However, the number of combinations of climatic conditions represented by the model predictors increases exponentially with the number of variables (Hughes 1968). This makes it more likely that increasing the number of model predictors will increase the mismatch between the modeling data and the climatic conditions across the whole study area. Therefore, models with a larger number of predictors will probably result in greater extrapolation beyond the range of the modeling data, as shown in our study (Fig. 4 and Fig. S3). Thus, in the 15-variable models, extrapolation beyond the range of the modeling data occurred over a larger extent of the study area compared to the models based on five predictors. Assuming that k-fold CV estimates model accuracy in areas where climatic conditions are similar to the modeling data, while environmental or spatial CV estimates the model prediction error in climatically different areas (Roberts et al. 2017; Meyer et al. 2019a), the assessment of model accuracy for prediction across the study area can be improved based on the similarity in climatic conditions between the modeling data and the whole study area. Thus, random blocking with the five-predictor models informed more appropriately about the magnitude of the prediction error over a larger area compared to the 15-predictor models, because the area with high similarity increased when using fewer predictors, i.e. from ~ 2800 to ~ 4000 km². Conversely, for the less parsimonious models, the prediction error is given by environmental blocking over a larger part of the study area, and by random blocking over a smaller area, than for the models with fewer predictors. To assess which model is more suitable for prediction, one therefore needs to consider the extent of the study area over which the prediction error is quantified through random blocking and through environmental blocking, respectively, and not only whether the error of more or less parsimonious models is higher or lower under each blocking strategy. In our study, when using models with 15 predictors, the entire coastal areas (east) and the Pyrenees mountain range (north) showed low to moderate similarity of climatic conditions compared to the modeling data. In contrast, in the 5-predictor models, the extent of coastal areas with low similarity decreased, while the area of the Pyrenean mountain range with high similarity increased considerably. Thus, it seems that parsimony may be a useful model selection criterion not only for statistical methods but also for machine learning algorithms (Coelho et al. 2018).

As noted, statistical models do not seem to be competitive with machine learning approaches due to their poor predictive performance. Among the machine learning models, the ANN approach had the highest prediction error and also resulted in biogeographic patterns that did not seem to agree with the expected climatic variations throughout the study area. In turn, SVM yielded illogical negative values of mushroom productivity in some areas. Therefore, the best candidate methods are the decision tree-based algorithms, i.e. RF and XGB. Considering the similarities in climatic conditions between the modeling data and the whole study area, we conclude that the best models are the RF and XGB models trained with five predictors. This study shows that, although machine learning algorithms allow models to be trained using a large number of variables, it may be wise to conduct a more thorough selection of model predictors prior to training the final models (Kuhn and Johnson 2013). This further contributes to improving the selection of the best modeling approach for prediction and also provides a methodology that, given the current paucity of data for building process-based models (Hao et al. 2020), can reasonably be used in extrapolation. This is especially relevant in a context of global change, where climatic conditions are predicted to shift beyond the historical climatic ranges.

Conclusions

This study compares different statistical and machine learning models for predicting biogeographical patterns of fungal productivity using a systematic methodology. Decision tree-based models, namely, RF and XGB, performed best in the prediction of fungal productivity in both environmentally similar and environmentally differentiated areas. Therefore, we recommend the use of these algorithms for further research involving the prediction of fungal productivity, both under current bioclimatic conditions and under climate change scenarios. When using these methods, careful selection of predictors allows for more interpretable and computationally less expensive models, as well as for reducing the environmental space described by the model predictors. Accordingly, the range of environmental conditions represented by the predictors in the modeling data can be more similar to the conditions over the whole study area, leading to reduced extrapolation. As a result, predictions can be more ecologically consistent than those of models with a much larger number of predictors. In this regard, the degree of similarity in the range of environmental conditions between the modeling data and the whole study area for prediction is relevant when selecting the most appropriate blocking strategy for estimating model error. In more parsimonious models, where the range of the modeling data may be more representative of the environmental conditions over the whole study area than in more complex models, the magnitude of the prediction error at the landscape level may be better retrieved through random blocking. In contrast, increasing model complexity may require environmental blocking for a more proper characterization of the prediction error at the landscape level. Model and variable selection should therefore also consider the extent of the area within the study region where the magnitude of the prediction error can be quantified more appropriately from either environmental or random blocking. Maps depicting the similarity between the environmental conditions accounted for by the modeling data and those of the whole study area can be useful to identify environmentally different or similar areas, to further assist model selection and the proper characterization of the prediction error based on alternative resampling techniques. Finally, given the multiple environmental factors driving fungal productivity, we highlight the importance of applying such methods using high-resolution environmental information to properly estimate its biogeographical patterns over large scales.

Availability of data and materials

The datasets generated and/or analyzed during the current study are not publicly available due to legal constraints because they are owned by different institutions, but are available from the authors on reasonable request.

Abbreviations

ANN: Artificial Neural Network
bias2: Squared bias
GLMM: Generalized Linear Mixed Models
GWR: Geographically Weighted Regression
MAE: Mean Absolute Error
PCA: Principal Component Analysis
PDP: Partial Dependence Plots
RF: Random Forest
MSE: Mean Squared Error
SVM: Support Vector Machine
XGB: Extreme Gradient Boosting

References


Acknowledgements

Not applicable.

Funding

This work was supported by the Secretariat for Universities and Research of the Ministry of Business and Knowledge of the Government of Catalonia and the European Social Fund. This work was also partially supported by the Spanish Ministry of Science, Innovation and Universities (Grant No. RTI2018–099315-A-I00). J.A.B. benefitted from a Serra-Húnter Fellowship provided by the Government of Catalonia.

Author information

Authors and Affiliations

Authors

Contributions

JMdA, JAB and SdM contributed in the installation of the sampling plots and data collection. AM, SdM and JL analysed the data. AM wrote the manuscript. All authors participated in the review and editing of the manuscript. SdM supervised the work during the whole process. The authors read and approved the final manuscript.

Corresponding author

Correspondence to Albert Morera.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Supplementary Information

Additional file 1: Table S1.

Fixed GLMM coefficients. β1, β2, β3 significance were calculated from t-value, while β3, β4, β5 from p-value. Table S2. Random GLMM coefficients. Table S3. GWR models coefficients. Table S3. Summary of mushroom, climate and physical data of the 98 sampled plots. All these variables were used to train the 15-variable machine learning models, while the 5-variable machine learning models and the statistical models only used those marked with *. The response variable is shown with **. Table S4. Tuned optimal hyperparameters using a k-fold CV. Table S5. Tuned optimal hyperparameters using an environmental CV. Figure S1. Study area, distribution of mushroom productivity monitoring plots (red points) and pine forest ecosystems represented by the sample plots (green area) Coordinates system: WGS 84 / UTM zone 31 N. Figure S2. GWR coefficient estimates according to geographical location. Coefficient of precipitation amount from August to October (A) and maximum temperature in August (B) in conditioned production model. Coefficient of precipitation amount from August to October (A) and maximum temperature in October (D) in occurrence model (C). Figure S3. Similarity in climatic conditions between the modeling data and the whole study area using five or 12 variables. Two-dimensional representation given by the two principal components, namely, PC1 and PC2, of a principal component analysis (PCA) of the modeling data (with a density map) and the environmental conditions over the whole study area (gray dots).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


Cite this article

Morera, A., Martínez de Aragón, J., Bonet, J.A. et al. Performance of statistical and machine learning-based methods for predicting biogeographical patterns of fungal productivity in forest ecosystems. For. Ecosyst. 8, 21 (2021). https://doi.org/10.1186/s40663-021-00297-w
