Simulating the occurrence process
The gcube workflow for simulating a biodiversity data cube can be divided into three steps or processes:
- Occurrence process
- Detection process
- Grid designation process
This tutorial documents the first part of the gcube simulation workflow, viz. the occurrence process.
Input
The functions are set up such that a single polygon as input is enough to go through this workflow using default arguments. The user can change these arguments to allow for more flexibility. In this tutorial we will demonstrate the different options.
As input, we create a polygon in which we want to simulate occurrences. It represents the spatial extent of the species.
The polygon looks like this.
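For illustration, such a polygon can be created and visualised with the sf and ggplot2 packages; the coordinates below are arbitrary example values.

```r
# Load the packages used in this tutorial
library(gcube)
library(sf)      # working with spatial features
library(ggplot2) # data visualisation

# Create a simple polygon (arbitrary example coordinates)
polygon <- st_sfc(
  st_polygon(list(cbind(
    c(5, 10, 8, 2, 3, 5),
    c(2, 1, 7, 9, 5, 2)
  )))
)

# Visualise the polygon
ggplot() +
  geom_sf(data = polygon) +
  theme_minimal()
```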
Simulate occurrences
We generate occurrence points within the polygon using the simulate_occurrences() function.
Default arguments ensure that an sf object with POLYGON geometry is sufficient to simulate occurrences.
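A minimal sketch of such a default call could look as follows; the polygon is passed as the first argument and a seed is set only to make the example reproducible.

```r
# Simulate occurrences in the polygon using only default arguments
occurrences_df <- simulate_occurrences(polygon, seed = 123)

# Visualise the simulated occurrences on top of the polygon
ggplot() +
  geom_sf(data = polygon) +
  geom_sf(data = occurrences_df) +
  theme_minimal()
```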
The options for user defined arguments are demonstrated in the next subsections.
Changing number of occurrences over time
Say we want to have 100 occurrences within our polygon over 10 years.
You can change the trend in the average number of occurrences over time.
We visualise this with the supporting functions used in simulate_occurrences().
The number of occurrences is always drawn from a Poisson distribution.
Option 1
If we do not specify a temporal function, we draw from a Poisson distribution for each time point with average (the lambda parameter) equal to initial_average_occurrences.
We plot the simulated number of occurrences over time. We see that the average is close to 100 over time as expected. Using a different seed will result in different numbers but the average will be (close to) 100 over time.
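A sketch of how this could look, assuming the time-series helper is simulate_timeseries() with the argument names shown below (treat the exact signature as an assumption):

```r
# Draw the number of occurrences for 10 time points from a Poisson
# distribution with a constant mean (lambda) of 100
n_occ_poisson <- simulate_timeseries(
  initial_average_occurrences = 100,
  n_time_points = 10,
  temporal_function = NA,
  seed = 123
)

# Plot the simulated numbers over time, with the expected average as reference
plot(n_occ_poisson, type = "b",
     xlab = "time point", ylab = "number of occurrences")
abline(h = 100, lty = 2)
```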
Option 2
We can specify a function ourselves, e.g. the internal function simulate_random_walk(), to have a random walk over time.
A random walk is a mathematical concept where each step is determined randomly.
The sd_step parameter refers to the standard deviation of these random steps (drawn from a Normal distribution). A higher value leads to larger steps and potentially greater variability in the path of the random walk.
We plot the simulated number of occurrences over time which follow a random walk. Using a different seed will result in a different random pattern.
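A sketch using the internal random walk function mentioned above; the assumption here is that extra arguments such as sd_step are passed on to the temporal function.

```r
# Let the average number of occurrences follow a random walk over time;
# sd_step is the standard deviation of the (Normal) random steps
n_occ_walk <- simulate_timeseries(
  initial_average_occurrences = 100,
  n_time_points = 10,
  temporal_function = simulate_random_walk,
  sd_step = 1,
  seed = 123
)

plot(n_occ_walk, type = "b",
     xlab = "time point", ylab = "number of occurrences")
```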
Option 3
We can also specify a function ourselves that determines the trend in the average number of occurrences over time. Here we provide an example of a linear trend.
We try out a linear trend with slope equal to 1.
We plot the simulated number of occurrences over time. We see that the average slope is indeed close to 1. Using a different seed will result in different numbers but the average slope will be (close to) 1.
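A sketch of a custom linear trend; the assumption is that a user-supplied temporal function receives the initial average, the number of time points and any extra arguments (here a hypothetical coef argument for the slope).

```r
# Custom temporal function: a linear trend in the average number of
# occurrences with slope `coef` per time point
my_linear_trend <- function(initial_average_occurrences, n_time_points, coef) {
  initial_average_occurrences + coef * (seq_len(n_time_points) - 1)
}

# Use the custom function with slope equal to 1
n_occ_linear <- simulate_timeseries(
  initial_average_occurrences = 100,
  n_time_points = 10,
  temporal_function = my_linear_trend,
  coef = 1,
  seed = 123
)

plot(n_occ_linear, type = "b",
     xlab = "time point", ylab = "number of occurrences")
```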
Changing the degree of spatial clustering
We can also choose the amount of spatial clustering.
We visualise this with the supporting functions used in simulate_occurrences().
Option 1
There are defaults for random and clustered patterns. Let’s look at the default where we have no clustering.
We see values of high sampling probability randomly distributed.
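A sketch of how such a pattern could be created and inspected, assuming the supporting function is create_spatial_pattern() with the argument names shown (an assumption about the exact signature):

```r
library(terra) # handling and plotting raster data

# Create a raster of sampling probabilities with a random spatial pattern
pattern_random <- create_spatial_pattern(
  polygon = polygon,
  resolution = 0.1,
  spatial_pattern = "random",
  seed = 123
)

# Visualise the sampling probabilities
plot(pattern_random)
```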
Option 2
Let’s look at the default where we have clustering (the same as spatial_pattern = 10, see further below).
We see values of high sampling probability clustered together.
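The same sketch with the clustered default:

```r
# Clustered default (equivalent to spatial_pattern = 10, see below)
pattern_clustered <- create_spatial_pattern(
  polygon = polygon,
  resolution = 0.1,
  spatial_pattern = "clustered",
  seed = 123
)

plot(pattern_clustered)
```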
Option 3
We can also change the clustering ourselves.
A larger value for spatial_pattern results in broader cluster areas.
Let’s look at a low value for clustering.
We see values of high sampling probability in multiple, smaller clusters.
Let’s look at a high value for clustering.
We see values of high sampling probability in fewer, larger clusters.
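A sketch with user-defined values; 5 and 100 are arbitrary choices for a low and a high degree of clustering.

```r
# Low value: high-probability cells form multiple, smaller clusters
pattern_low <- create_spatial_pattern(
  polygon = polygon,
  resolution = 0.1,
  spatial_pattern = 5,
  seed = 123
)
plot(pattern_low)

# High value: high-probability cells form fewer, larger clusters
pattern_high <- create_spatial_pattern(
  polygon = polygon,
  resolution = 0.1,
  spatial_pattern = 100,
  seed = 123
)
plot(pattern_high)
```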
The patterns generated above are then used to sample occurrences with a different supporting function.
If we, for example, sample 500 occurrences from the last raster, we see that the sampling follows the expected pattern.
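A sketch of this sampling step, assuming a helper along the lines of sample_occurrences_from_raster() that takes the probability raster and the number of occurrences to sample; the exact name and argument order are assumptions.

```r
# Sample 500 occurrences weighted by the sampling probabilities of the
# last (highly clustered) pattern
sampled_points <- sample_occurrences_from_raster(pattern_high, 500, seed = 123)

# Plot the sampled points on top of the probability raster
plot(pattern_high)
plot(st_geometry(sampled_points), add = TRUE, pch = 16, cex = 0.5)
```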
Example
Now that we know how the supporting functions work, we can generate occurrence points within the polygon using the simulate_occurrences() function.
We can, for example, sample randomly within the polygon over 6 time points, where we use a random walk over time with an initial average number of occurrences equal to 100.
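A sketch of this call; the argument names other than initial_average_occurrences, sd_step and spatial_pattern (e.g. n_time_points and temporal_function) are assumptions about the interface.

```r
# Simulate occurrences over 6 time points with a random walk in the
# average number of occurrences, starting from an average of 100,
# and a random spatial pattern within the polygon
occurrences_ts <- simulate_occurrences(
  polygon,
  initial_average_occurrences = 100,
  n_time_points = 6,
  temporal_function = simulate_random_walk,
  sd_step = 1,
  spatial_pattern = "random",
  seed = 123
)
```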
This is the number of occurrences we have for each time point.
This is the spatial distribution of the occurrences for each time point.
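A sketch of how these results could be inspected, assuming the output is an sf object with a time identifier column (time_point is an assumed column name).

```r
# Number of occurrences per time point
table(occurrences_ts$time_point)

# Spatial distribution of the occurrences for each time point
ggplot() +
  geom_sf(data = polygon) +
  geom_sf(data = occurrences_ts) +
  facet_wrap(~ time_point) +
  theme_minimal()
```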