R/bandwidth_selection_cv_tnkde_sf.R
bw_tnkde_cv_likelihood_calc.Rd
Calculate the cross-validation likelihood for multiple network and time bandwidths in order to select an appropriate pair of bandwidths in a data-driven approach
bw_tnkde_cv_likelihood_calc(
bws_net = NULL,
bws_time = NULL,
lines,
events,
time_field,
w,
kernel_name,
method,
arr_bws_net = NULL,
arr_bws_time = NULL,
diggle_correction = FALSE,
study_area = NULL,
adaptive = FALSE,
trim_net_bws = NULL,
trim_time_bws = NULL,
max_depth = 15,
digits = 5,
tol = 0.1,
agg = NULL,
sparse = TRUE,
zero_strat = "min_double",
grid_shape = c(1, 1),
sub_sample = 1,
verbose = TRUE,
check = TRUE
)
An ordered numeric vector with all the network bandwidths
An ordered numeric vector with all the time bandwidths
A feature collection of linestrings representing the underlying network. The geometries must be simple LineStrings (the function may crash if some geometries are invalid); MultiLineStrings are not supported.
A feature collection of points representing the events on the network. The points will be snapped onto the network, each to its closest line.
The name of the field in events indicating when the events occurred. It must be a numeric field
A vector representing the weight of each event
The name of the kernel to use. Must be one of triangle, gaussian, tricube, cosine, triweight, quartic, epanechnikov or uniform.
The method to use when calculating the NKDE, must be one of simple / discontinuous / continuous (see nkde details for more information)
An array with all the precalculated local network bandwidths (for each event and for each possible combination of network and time bandwidths). The dimensions must be c(length(bws_net), length(bws_time), nrow(events)).
An array with all the precalculated local time bandwidths (for each event and for each possible combination of network and time bandwidths). The dimensions must be c(length(bws_net), length(bws_time), nrow(events)).
A Boolean indicating if the correction factor for edge effect must be used.
A feature collection of polygons representing the limits of the study area.
A Boolean indicating if local bandwidths must be calculated
A numeric vector with the maximum local network bandwidth. If local bandwidths have higher values, they will be replaced by the corresponding value in this vector.
A numeric vector with the maximum local time bandwidth. If local bandwidths have higher values, they will be replaced by the corresponding value in this vector.
When using the continuous or discontinuous method, calculation time and memory use can grow out of control if the network has many small edges (areas with many intersections and many events). To avoid this, it is possible to set a maximum recursion depth here. Considering that the kernel is split at intersections, a value of 10 should yield good estimates in most cases. A larger value can be used without problem for the discontinuous method; for the continuous method, a larger value will strongly increase calculation time.
The number of digits to retain from the spatial coordinates. It ensures that the topology is correct when building the network. Default is 5. Too high a precision (a large number of digits) might break some connections.
A float indicating the minimum distance between the events and the lines' extremities when adding the points to the network. When a point is closer than this distance, it is added at the extremity of the line.
A double indicating a distance within which the events must be aggregated. If NULL, the events are aggregated only by rounding their coordinates.
A Boolean indicating if sparse or regular matrices should be used by the Rcpp functions. These matrices are used to store edge indices between two nodes in a graph. Regular matrices are faster but require more memory, in particular with multiprocessing. Sparse matrices are slightly slower but require much less memory.
A string indicating what to do when the density is 0 while calculating the LOO density estimate for an isolated event. "min_double" (default) replaces the 0 value by the smallest positive double representable on the machine. "remove" removes such events from the final score. The first approach penalizes small bandwidths more strongly.
A vector of two values indicating how the study area must be split when performing the calculation. Default is c(1,1) (no split). A finer grid can reduce memory use and increase speed when a large dataset is used. When using multiprocessing, the work in each grid cell is dispatched between the workers.
A float between 0 and 1 indicating the proportion of quadrats (grid cells) to keep in the calculation. For large datasets, it may be useful to limit the bandwidth evaluation in this way and thus reduce calculation time.
A Boolean, indicating if the function should print messages about the process.
A Boolean indicating if the geometry checks must be run before the operation. This might take some time, but it ensures that the CRS of the provided objects are valid and identical, and that the geometries are valid.
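The effect of the zero_strat argument can be illustrated with a small sketch (not part of the package; the densities below are hypothetical). Replacing a zero leave-one-out density by the smallest positive double adds a very large negative term to the log-likelihood score, whereas removing it leaves the score untouched:

```r
# Hypothetical LOO densities for three events, one of them isolated
loo_densities <- c(0.8, 0.4, 0)

# "min_double": replace 0 by the smallest positive double, which
# contributes a very large negative term (log(.Machine$double.xmin))
d_min <- ifelse(loo_densities == 0, .Machine$double.xmin, loo_densities)
score_min_double <- sum(log(d_min))

# "remove": simply drop the zero densities from the sum
score_remove <- sum(log(loo_densities[loo_densities > 0]))

score_min_double < score_remove  # TRUE: "min_double" penalizes more
```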
A matrix with the cross validation scores. Each row corresponds to a network bandwidth and each column to a time bandwidth (the higher the score, the better).
The function calculates the likelihood cross validation score for several time and network bandwidths in order to find the most appropriate pair. The general idea is to find the pair of bandwidths that would produce the most similar results if one event were removed from the dataset (leave-one-out cross validation). We use here the shortcut formula described in the package spatstat (Baddeley et al. 2021).
\(LCV(h_{net}, h_{time}) = \sum_i \log\hat\lambda_{-i}(x_i)\)
Where the sum is taken over all events \(x_i\) and where \(\hat\lambda_{-i}(x_i)\) is the leave-one-out kernel estimate at \(x_i\) for the pair of bandwidths \((h_{net}, h_{time})\). A higher value indicates a better pair of bandwidths.
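The score above reduces to a simple sum of log densities. As a minimal sketch (lcv_score is a hypothetical helper, not exported by the package), given the vector of leave-one-out density estimates obtained for one pair of bandwidths:

```r
# LCV score for one pair of bandwidths: the sum of the logs of the
# leave-one-out density estimates at each event location
lcv_score <- function(loo_densities) {
  sum(log(loo_densities))
}

# e.g. three events with LOO densities 0.5, 0.2 and 0.1
lcv_score(c(0.5, 0.2, 0.1))
```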
Baddeley A, Turner R, Rubak E (2021). spatstat: Spatial Point Pattern Analysis, Model-Fitting, Simulation, Tests. R package version 2.1-0, https://CRAN.R-project.org/package=spatstat.
# \donttest{
# loading the data
data(mtl_network)
data(bike_accidents)
# converting the Date field to a numeric field (counting days)
bike_accidents$Time <- as.POSIXct(bike_accidents$Date, format = "%Y/%m/%d")
bike_accidents$Time <- difftime(bike_accidents$Time, min(bike_accidents$Time), units = "days")
bike_accidents$Time <- as.numeric(bike_accidents$Time)
bike_accidents <- subset(bike_accidents, bike_accidents$Time >= 89)
# calculating the cross validation values
cv_scores <- bw_tnkde_cv_likelihood_calc(
bws_net = seq(100,1000,100),
bws_time = seq(10,60,5),
lines = mtl_network,
events = bike_accidents,
time_field = "Time",
w = rep(1, nrow(bike_accidents)),
kernel_name = "quartic",
method = "discontinuous",
diggle_correction = FALSE,
study_area = NULL,
max_depth = 10,
digits = 2,
tol = 0.1,
agg = 15,
sparse = TRUE,
grid_shape = c(1, 1),
sub_sample = 1,
verbose = FALSE,
check = TRUE)
# }
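Once the score matrix is obtained, the best pair of bandwidths is the one maximizing the score. The sketch below uses a small toy matrix (with real results, cv_scores would come from the call above, with rows named after the network bandwidths and columns after the time bandwidths):

```r
# Toy cv_scores matrix: rows are network bandwidths, columns are
# time bandwidths; higher values are better
bws_net <- seq(100, 300, 100)
bws_time <- seq(10, 20, 5)
cv_scores <- matrix(c(-120, -110, -115,
                      -105, -100, -108,
                      -118, -112, -116),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(bws_net, bws_time))

# locate the maximum and map it back to the bandwidth values
best <- which(cv_scores == max(cv_scores), arr.ind = TRUE)
best_net <- bws_net[best[1, "row"]]
best_time <- bws_time[best[1, "col"]]
c(best_net, best_time)  # 200 and 15 for this toy matrix
```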