
Standardise model inputs (no leakage) for residents and invaders
Source: R/standardise_model_inputs.R
standardise_model_inputs.Rd
Column-wise z-scores environment and resident trait numerics, then
scales invader trait numerics using resident moments only (to avoid
information leakage). Invader factor/character columns are coerced to the
resident levels; unseen levels become NA. Optionally drops invader-only
columns so the resident/invader trait schemas match.
Usage
standardise_model_inputs(
  env_df = NULL,
  traits_res,
  traits_inv = NULL,
  drop_extra_invader_cols = FALSE,
  verbose = TRUE
)
Arguments
- env_df
Optional data.frame (sites × environment). Numeric columns are z-scored; non-numeric columns are preserved.
- traits_res
data.frame (residents × traits). Mixed types allowed; numeric columns are z-scored.
- traits_inv
Optional data.frame (invaders × traits). Must contain at least the trait columns present in traits_res. Numeric columns are scaled using resident means/SDs; factors are coerced to resident levels.
- drop_extra_invader_cols
Logical; if TRUE, invader-only columns are dropped (not used downstream). If FALSE, they are still dropped for alignment but flagged in info$notes.
- verbose
Logical; print messages about what was standardised/coerced.
Value
A named list with components:
- env_df_z
Environment table with numeric columns z-scored (or NULL).
- traits_res_glmm
Resident trait table with numeric columns z-scored.
- traits_inv_glmm
Invader trait table scaled to resident moments and factor levels matched (or NULL).
- moments
list(env_means, env_sds, trait_means_res, trait_sds_res) used for scaling.
- info
list(notes=character()) with human-readable notes.
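Because the scaling moments are returned, standardised values can be mapped back to the original scale. The sketch below is illustrative only: the `moments` list is built by hand as a stand-in, and it assumes the means and SDs are stored as numeric vectors named by column, which this page does not confirm.

```r
# Hypothetical stand-in for std$moments. Element names follow the Value
# section; the named-vector layout is an assumption.
moments <- list(
  trait_means_res = c(size = 2.0),
  trait_sds_res   = c(size = 0.5)
)

z <- c(-1, 0, 3)  # standardised 'size' values

# Invert z = (x - mean) / sd, i.e. x = z * sd + mean
size_raw <- z * moments$trait_sds_res[["size"]] +
  moments$trait_means_res[["size"]]
size_raw
```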
Details
What gets standardised and how
- Environment (env_df): numeric columns are z-scored (mean 0, sd 1); non-numeric columns are kept as-is. Zero variance is guarded by setting sd=1.
- Resident traits (traits_res): numeric columns are z-scored; mixed types are allowed, and non-numeric columns are kept.
- Invader traits (traits_inv): numeric columns are scaled using the resident trait means/SDs only (never computed from invaders). Factor/character columns are coerced to resident levels; unseen levels become NA. Extra invader columns are dropped (with a note).
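The scaling rule above can be sketched in base R. This is a minimal illustration, not the package's internal code: `zscore_with()` is a hypothetical helper, and the use of `sd()` (sample SD, n - 1 denominator) is an assumption.

```r
# Hypothetical helper: z-score x with externally supplied moments.
zscore_with <- function(x, mu, sigma) {
  # Guard: a zero (or undefined) SD is replaced by 1 so the division stays defined.
  if (is.na(sigma) || sigma == 0) sigma <- 1
  (x - mu) / sigma
}

res_size <- c(2, 4, 6, 8)  # resident trait values
inv_size <- c(5, 11)       # invader trait values

# Moments come from residents only.
mu_res <- mean(res_size)
sd_res <- sd(res_size)

res_z <- zscore_with(res_size, mu_res, sd_res)
inv_z <- zscore_with(inv_size, mu_res, sd_res)  # never uses invader mean/SD

# A constant column has sd = 0; the guard maps it to all zeros.
const   <- c(3, 3, 3)
const_z <- zscore_with(const, mean(const), sd(const))
```

Scaling invaders with their own mean and SD would centre them at zero regardless of how far they sit from the resident community, which is the leakage the resident-only rule avoids.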
Returned objects
- env_df_z: environment table with numeric columns standardised (or NULL)
- traits_res_glmm: resident traits with numeric columns standardised
- traits_inv_glmm: invader traits, scaled like residents + factor levels matched (or NULL)
- moments: resident/reference moments used for scaling (env_*, trait_*)
- info$notes: human-readable notes on coercions/drops
Where this is used in the workflow
Called explicitly prior to GLMM fitting and when harmonising invaders, and implicitly by wrappers such as prepare_trait_space() (if available).
Invariants and guards
- Column names and row names are preserved.
- Zero-variance numeric columns use sd=1 so z-scores stay defined.
- Invader trait numerics are always scaled by resident moments (no leakage).
- Invader extra columns are dropped for alignment; missing required columns raise an error.
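The alignment and factor-coercion guards can be approximated in base R as follows. This is a sketch under assumptions: the data, the error message, and the note wording are invented for illustration and may differ from what the function actually emits.

```r
traits_res <- data.frame(
  size = c(1.2, 0.8, 2.5),
  leaf = factor(c("broad", "needle", "broad"))
)
traits_inv <- data.frame(
  size  = c(3.0, 0.5),
  leaf  = c("broad", "unknown"),
  extra = c(TRUE, FALSE),  # invader-only column
  stringsAsFactors = FALSE
)

# Guard 1: every resident column must exist in the invader table.
missing_cols <- setdiff(names(traits_res), names(traits_inv))
if (length(missing_cols) > 0) {
  stop("traits_inv is missing: ", paste(missing_cols, collapse = ", "))
}

# Guard 2: drop invader-only columns and record a note (wording is illustrative).
extra_cols <- setdiff(names(traits_inv), names(traits_res))
notes <- character()
if (length(extra_cols) > 0) {
  notes <- paste("dropped invader-only columns:",
                 paste(extra_cols, collapse = ", "))
}
aligned <- traits_inv[, names(traits_res), drop = FALSE]

# Guard 3: coerce to the resident factor levels; unseen levels become NA.
aligned$leaf <- factor(as.character(aligned$leaf),
                       levels = levels(traits_res$leaf))
aligned
```

Subsetting with `drop = FALSE` keeps the result a data.frame with its row names intact, even when only one trait column remains.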
Examples
# Minimal reproducible example ----------------------------------------------
set.seed(1)
env_df = data.frame(elev = rnorm(5), temp = rnorm(5), zone = factor(sample(c("A","B"), 5, TRUE)))
rownames(env_df) = paste0("s", 1:5)
traits_res = data.frame(
  size = rlnorm(4), leaf = factor(c("broad","needle","broad","needle")),
  stringsAsFactors = FALSE
)
rownames(traits_res) = paste0("sp", 1:4)
traits_inv = data.frame(
  size = c(10, 1), leaf = factor(c("broad","unknown")) # 'unknown' -> NA after coercion
)
rownames(traits_inv) = c("inv1","inv2")
std = standardise_model_inputs(env_df, traits_res, traits_inv, verbose = FALSE)
str(std, 1)
#> List of 5
#> $ env_df_z :'data.frame': 5 obs. of 3 variables:
#> $ traits_res_glmm:'data.frame': 4 obs. of 2 variables:
#> $ traits_inv_glmm:'data.frame': 2 obs. of 2 variables:
#> $ moments :List of 4
#> $ info :List of 1
head(std$traits_res_glmm)
#> size leaf
#> sp1 -0.3871555 broad
#> sp2 -0.4124477 needle
#> sp3 -0.6865489 broad
#> sp4 1.4861521 needle
head(std$traits_inv_glmm) # note: 'leaf' for 'unknown' becomes NA
#> size leaf
#> inv1 31.8128564 broad
#> inv2 0.4876308 <NA>
# Workflow sketch ------------------------------------------------------------
# fit = prepare_inputs(long_df = longDF, ...) # gives fit$inputs$env_df, $traits_res
# inv = simulate_invaders(fit$inputs$traits_res, n_inv = 10)
# std = standardise_model_inputs(fit$inputs$env_df, fit$inputs$traits_res, inv)
# std$traits_res_glmm; std$traits_inv_glmm # pass to GLMM / trait-space steps