
Leave-one-out cross-validation for data cubes
Source: R/cross_validate_cube.R
This function performs leave-one-out (LOO) or k-fold (experimental) cross-validation (CV) on a biodiversity data cube to assess the performance of a specified indicator function. It partitions the data by a specified variable, calculates the specified indicator on training data, and compares it with the true values to evaluate the influence of one or more categories on the final result.
Arguments
- data_cube
A data cube object (class 'processed_cube' or 'sim_cube', see b3gbi::process_cube()) or a dataframe (from the $data slot of a 'processed_cube' or 'sim_cube'). To limit runtime, we recommend using a dataframe with a custom function as fun.
- fun
A function which, when applied to data_cube, returns the statistic(s) of interest. This function must return a dataframe with a column diversity_val containing the statistic of interest.
- ...
Additional arguments passed on to fun.
- grouping_var
A string specifying the grouping variable(s) for fun. The output of fun(data_cube) contains one row per group.
- out_var
A string specifying the column by which the data should be left out iteratively. Default is "taxonKey", which can be used for leave-one-species-out CV.
- crossv_method
Method of data partitioning. If crossv_method = "loo" (default), S training partitions are created, where S is the number of unique values in out_var, containing S - 1 rows each. If crossv_method = "kfold", the aggregated data is split into k exclusive partitions containing S / k rows each. K-fold CV is experimental and results should be interpreted with caution.
- k
Number of folds (an integer). Used only if crossv_method = "kfold". Default 5.
- max_out_cats
An integer specifying the maximum number of unique categories in out_var to leave out iteratively. Default is 1000. This can be increased if needed, but keep in mind that a high number of categories in out_var may significantly increase runtime.
- progress
Logical. Whether to show a progress bar. Set to TRUE to display a progress bar, FALSE (default) to suppress it.
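The two partitioning schemes described under crossv_method can be illustrated with a minimal base R sketch. This is not the package's internal implementation: the toy dataframe `toy`, the helper names `loo_partitions()` and `kfold_partitions()`, and the deterministic assignment of categories to folds are all assumptions made for illustration.

```r
# Toy data: 4 categories in the out_var column "taxonKey" (hypothetical values)
toy <- data.frame(
  taxonKey = c(1, 1, 2, 3, 3, 4),
  year     = c(2014, 2015, 2014, 2014, 2015, 2015),
  obs      = c(5, 3, 2, 8, 1, 4)
)

# Leave-one-out: one training set per category, each missing that category
loo_partitions <- function(data, out_var) {
  cats <- unique(data[[out_var]])
  parts <- lapply(cats, function(ct) {
    data[data[[out_var]] != ct, , drop = FALSE]
  })
  names(parts) <- cats
  parts
}

# K-fold: assign categories to k exclusive folds and leave out one fold
# at a time (fold assignment here is deterministic, an assumption)
kfold_partitions <- function(data, out_var, k) {
  cats <- unique(data[[out_var]])
  fold_id <- rep(seq_len(k), length.out = length(cats))
  lapply(seq_len(k), function(f) {
    data[!data[[out_var]] %in% cats[fold_id == f], , drop = FALSE]
  })
}

loo <- loo_partitions(toy, "taxonKey")
kf  <- kfold_partitions(toy, "taxonKey", k = 2)
length(loo)  # one training partition per category
```

In this sketch every leave-one-out training set retains S - 1 = 3 categories, while each k-fold training set drops roughly S / k categories at once, which is why k-fold needs fewer iterations.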
Value
A dataframe containing the cross-validation results with the following columns:
- Cross-validation id (id_cv)
- The grouping variable grouping_var (e.g., year)
- The category left out during each cross-validation iteration (specified out_var with suffix '_out' in lower case)
- The computed statistic values for both training (rep_cv) and true datasets (est_original)
- Error metrics: error (error), squared error (sq_error), absolute difference (abs_error), relative difference (rel_error), and percent difference (perc_error)
- Error metrics summarised by grouping_var: mean relative difference (mre), mean squared error (mse) and root mean squared error (rmse)
See the Details section on how these error metrics are calculated.
Details
This function assesses the influence of each category in out_var on the indicator value by iteratively leaving out one category at a time, similar to leave-one-out cross-validation. K-fold CV works in a similar fashion but is experimental and will not be covered here.
Original Sample Data: \(\mathbf{X} = \{X_{11}, X_{12}, X_{13}, \ldots, X_{sn}\}\)
The initial set of observed data points, where there are \(s\) different categories in out_var and \(n\) total samples across all categories (= the sample size). \(n\) corresponds to the number of cells in a data cube or the number of rows in tabular format.
Statistic of Interest: \(\theta\)
The parameter or statistic being estimated, such as the mean \(\bar{X}\), variance \(\sigma^2\), or a biodiversity indicator. Let \(\hat{\theta}\) denote the estimated value of \(\theta\) calculated from the complete dataset \(\mathbf{X}\).
Cross-Validation (CV) Sample: \(\mathbf{X}_{-s_j}\)
The full dataset \(\mathbf{X}\) excluding all samples belonging to category \(j\). This subset is used to investigate the influence of category \(j\) on the estimated statistic \(\hat{\theta}\).
CV Estimate for Category \(\mathbf{j}\): \(\hat{\theta}_{-s_j}\)
The value of the statistic of interest calculated from \(\mathbf{X}_{-s_j}\), which excludes category \(j\). For example, if \(\theta\) is the sample mean, \(\hat{\theta}_{-s_j} = \bar{X}_{-s_j}\).
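As a small illustration of these definitions, the sketch below computes \(\hat{\theta}\) and each \(\hat{\theta}_{-s_j}\) for the sample mean in base R. The dataframe `x` and its values are hypothetical, and the code is a sketch of the concept rather than the function's own code.

```r
# Toy sample X with s = 3 categories (hypothetical values)
x <- data.frame(
  category = c("a", "a", "b", "b", "c"),
  value    = c(2, 4, 6, 8, 10)
)

# Estimate on the complete dataset: theta_hat
theta_hat <- mean(x$value)

# CV estimates: theta_hat_{-s_j}, the mean after dropping category j
cats <- unique(x$category)
theta_loo <- vapply(
  cats,
  function(j) mean(x$value[x$category != j]),
  numeric(1)
)

theta_hat  # 6
theta_loo  # a: 8, b: 5.33, c: 5
```

Dropping category "a" shifts the mean the most (from 6 to 8), so "a" would be the most influential category in this toy sample.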
Error Measures:
The Error is the difference between the statistic estimated without category \(j\) (\(\hat{\theta}_{-s_j}\)) and the statistic calculated on the complete dataset (\(\hat{\theta}\)).
$$\text{Error}_{s_j} = \hat{\theta}_{-s_j} - \hat{\theta}$$
The Relative Error is the absolute error, normalised by the estimate on the complete dataset \(\hat{\theta}\) plus a small constant \(\epsilon = 10^{-8}\) that avoids division by zero.
$$\text{Rel. Error}_{s_j} = \frac{|\hat{\theta}_{-s_j} - \hat{\theta}|}{\hat{\theta} +\epsilon}$$
The Percent Error is the relative error expressed as a percentage.
$$\text{Perc. Error}_{s_j} = \text{Rel. Error}_{s_j} \times 100 \%$$
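The error measures can be checked by hand in base R. The sketch below uses the rep_cv and est_original values from the first row of the output in the Examples section (2014, taxonKey 2440669). The squared and absolute errors are not defined explicitly here; taking them as the square and the absolute value of the error is an assumption, although it agrees with the printed sq_error and abs_error columns.

```r
# Values copied from the first row of the Examples output
theta_loo <- 7.5        # rep_cv: estimate without category j
theta_hat <- 11.553740  # est_original: estimate on the complete dataset
eps       <- 1e-8       # small constant to avoid division by zero

error      <- theta_loo - theta_hat         # about -4.0537
sq_error   <- error^2                       # assumed definition
abs_error  <- abs(error)                    # assumed definition
rel_error  <- abs_error / (theta_hat + eps) # about 0.3509
perc_error <- rel_error * 100               # about 35.09 %
```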
Summary Measures:
The Mean Relative Error (MRE) is the average of the relative errors over all categories.
$$\text{MRE} = \frac{1}{s} \sum_{j=1}^s \text{Rel. Error}_{s_j}$$
The Mean Squared Error (MSE) is the average of the squared errors.
$$\text{MSE} = \frac{1}{s} \sum_{j=1}^s (\text{Error}_{s_j})^2$$
The Root Mean Squared Error (RMSE) is the square root of the MSE.
$$\text{RMSE} = \sqrt{\text{MSE}}$$
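Since the summary measures are reported per group of grouping_var, they can be sketched in base R with `ave()`, which repeats each group summary on every row of that group. The dataframe `cv` and its values are hypothetical; this illustrates the formulas above and is not the package's implementation.

```r
# Hypothetical per-category errors for two groups (years)
cv <- data.frame(
  year         = c(2014, 2014, 2015, 2015),
  error        = c(-0.4, 0.2, 0.1, -0.1),
  est_original = c(2, 2, 4, 4)
)

cv$rel_error <- abs(cv$error) / (cv$est_original + 1e-8)

# Summaries by group, repeated on each row of the group
cv$mre  <- ave(cv$rel_error, cv$year, FUN = mean)
cv$mse  <- ave(cv$error^2, cv$year, FUN = mean)
cv$rmse <- sqrt(cv$mse)

cv
```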
Examples
# Get example data
# install.packages("remotes")
# remotes::install_github("b-cubed-eu/b3gbi")
library(b3gbi)
cube_path <- system.file(
  "extdata", "denmark_mammals_cube_eqdgc.csv",
  package = "b3gbi")
denmark_cube <- process_cube(
  cube_path,
  first_year = 2014,
  last_year = 2020)

# Function to calculate statistic of interest
# Mean observations per year
mean_obs <- function(data) {
  out_df <- aggregate(obs ~ year, data, mean) # Calculate mean obs per year
  names(out_df) <- c("year", "diversity_val") # Rename columns
  return(out_df)
}
mean_obs(denmark_cube$data)
#>   year diversity_val
#> 1 2014     11.553740
#> 2 2015     11.532206
#> 3 2016      5.532491
#> 4 2017      5.703888
#> 5 2018      5.598413
#> 6 2019      4.802676
#> 7 2020      4.972163

# Perform leave-one-species-out CV
cv_mean_obs <- cross_validate_cube(
  data_cube = denmark_cube$data,
  fun = mean_obs,
  grouping_var = "year",
  out_var = "taxonKey",
  crossv_method = "loo",
  progress = FALSE)
head(cv_mean_obs)
#>   id_cv year taxonkey_out   rep_cv est_original        error     sq_error
#> 1     1 2014      2440669  7.50000     11.55374 -4.053740327 1.643281e+01
#> 2     2 2014      5220081 11.58585     11.55374  0.032109544 1.031023e-03
#> 3     3 2014      2434793 11.21422     11.55374 -0.339521431 1.152748e-01
#> 4     4 2014      2434806 11.32401     11.55374 -0.229733107 5.277730e-02
#> 5     5 2014      2440483 11.56282     11.55374  0.009082393 8.248986e-05
#> 6     6 2014      2433753 11.87914     11.55374  0.325400228 1.058853e-01
#>     abs_error    rel_error  perc_error       mre       mse     rmse
#> 1 4.053740327 0.3508595666 35.08595666 0.0113168 0.2217298 0.470882
#> 2 0.032109544 0.0027791471  0.27791471 0.0113168 0.2217298 0.470882
#> 3 0.339521431 0.0293862784  2.93862784 0.0113168 0.2217298 0.470882
#> 4 0.229733107 0.0198838731  1.98838731 0.0113168 0.2217298 0.470882
#> 5 0.009082393 0.0007860998  0.07860998 0.0113168 0.2217298 0.470882
#> 6 0.325400228 0.0281640593  2.81640593 0.0113168 0.2217298 0.470882
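One possible way to read such a result is to rank the left-out categories by their percent error within a group. The sketch below rebuilds the six rows printed by `head(cv_mean_obs)` as a standalone dataframe (values copied from that output; the name `cv_head` is hypothetical) so it runs without the package installed. With the real result, the same ordering could be applied to `cv_mean_obs` directly.

```r
# Selected columns of the six printed rows (values copied from the output)
cv_head <- data.frame(
  year         = 2014,
  taxonkey_out = c(2440669, 5220081, 2434793, 2434806, 2440483, 2433753),
  perc_error   = c(35.08595666, 0.27791471, 2.93862784,
                   1.98838731, 0.07860998, 2.81640593)
)

# Sort from most to least influential category
ranked <- cv_head[order(cv_head$perc_error, decreasing = TRUE), ]
ranked$taxonkey_out[1]  # 2440669
```

Among these six rows, leaving out taxonKey 2440669 changes the 2014 mean by about 35 %, far more than any of the other five categories.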