Calculate the Network Kernel Density Estimate based on a network of lines, sampling points, and events

nkde(
  lines,
  events,
  w,
  samples,
  kernel_name,
  bw,
  adaptive = FALSE,
  trim_bw = NULL,
  method,
  div = "bw",
  diggle_correction = FALSE,
  study_area = NULL,
  max_depth = 15,
  digits = 5,
  tol = 0.1,
  agg = NULL,
  sparse = TRUE,
  grid_shape = c(1, 1),
  verbose = TRUE,
  check = TRUE
)

Arguments

lines

A feature collection of linestrings representing the underlying network. The geometries must be simple Linestrings (may crash if some geometries are invalid) without MultiLineSring.

events

events A feature collection of points representing the events on the network. The points will be snapped on the network to their closest line.

w

A vector representing the weight of each event

samples

A feature collection of points representing the locations for which the densities will be estimated.

kernel_name

The name of the kernel to use. Must be one of triangle, gaussian, tricube, cosine, triweight, quartic, epanechnikov or uniform.

bw

The kernel bandwidth (using the scale of the lines), can be a single float or a numeric vector if a different bandwidth must be used for each event.

adaptive

A Boolean, indicating if an adaptive bandwidth must be used

trim_bw

A float, indicating the maximum value for the adaptive bandwidth

method

The method to use when calculating the NKDE, must be one of simple / discontinuous / continuous (see nkde details for more information)

div

The divisor to use for the kernel. Must be "n" (the number of events within the radius around each sampling point), "bw" (the bandwidth) "none" (the simple sum).

diggle_correction

A Boolean indicating if the correction factor for edge effect must be used.

study_area

A feature collection of polygons representing the limits of the study area.

max_depth

when using the continuous and discontinuous methods, the calculation time and memory use can go wild if the network has many small edges (area with many of intersections and many events). To avoid it, it is possible to set here a maximum depth. Considering that the kernel is divided at intersections, a value of 10 should yield good estimates in most cases. A larger value can be used without a problem for the discontinuous method. For the continuous method, a larger value will strongly impact calculation speed.

digits

The number of digits to retain from the spatial coordinates. It ensures that topology is good when building the network. Default is 3. Too high a precision (high number of digits) might break some connections

tol

A float indicating the minimum distance between the events and the lines' extremities when adding the point to the network. When points are closer, they are added at the extremity of the lines.

agg

A double indicating if the events must be aggregated within a distance. If NULL, the events are aggregated only by rounding the coordinates.

sparse

A Boolean indicating if sparse or regular matrices should be used by the Rcpp functions. These matrices are used to store edge indices between two nodes in a graph. Regular matrices are faster, but require more memory, in particular with multiprocessing. Sparse matrices are slower (a bit), but require much less memory.

grid_shape

A vector of two values indicating how the study area must be split when performing the calculus. Default is c(1,1) (no split). A finer grid could reduce memory usage and increase speed when a large dataset is used. When using multiprocessing, the work in each grid is dispatched between the workers.

verbose

A Boolean, indicating if the function should print messages about the process.

check

A Boolean indicating if the geometry checks must be run before the operation. This might take some times, but it will ensure that the CRS of the provided objects are valid and identical, and that geometries are valid.

Value

A vector of values, they are the density estimates at sampling points

Details

The three NKDE methods
Estimating the density of a point process is commonly done by using an ordinary two-dimensional kernel density function. However, there are numerous cases for which the events do not occur in a two-dimensional space but on a network (like car crashes, outdoor crimes, leaks in pipelines, etc.). New methods were developed to adapt the methodology to networks, three of them are available in this package.

  • The simple method: This first method was presented by (Xie and Yan 2008) and proposes an intuitive solution. The distances between events and sampling points are replaced by network distances, and the formula of the kernel is adapted to calculate the density over a linear unit instead of an areal unit.

  • The discontinuous method: The previous method has been criticized by (Okabe et al. 2009) , arguing that the estimator proposed is biased, leading to an overestimation of density in events hot-spots. More specifically, the simple method does not conserve mass and the induced kernel is not a probability density along the network. They thus proposed a discontinuous version of the kernel function on network, which equally "divides" the mass density of an event at intersections.

  • The continuous method: If the discontinuous method is unbiased, it leads to a discontinuous kernel function which is a bit counter-intuitive. Okabe et al. (2009) proposed another version of the kernel, which divides the mass of the density at intersections but adjusts the density before the intersection to make the function continuous.

The three methods are available because, even though that the simple method is less precise statistically speaking, it might be more intuitive. From a purely geographical view, it might be seen as a sort of distance decay function as used in Geographically Weighted Regression.


adaptive bandwidth
It is possible to use adaptive bandwidth instead of fixed bandwidth. Adaptive bandwidths are calculated using the Abramson’s smoothing regimen (Abramson 1982) . To do so, an original fixed bandwidth must be specified (bw parameter), and is used to estimate the priory densitiy at event locations. These densities are then used to calculate local bandwidth. The maximum size of the local bandwidth can be limited with the parameter trim_bw. For more details, see the vignettes.

Optimization parameters
The grid_shape parameter allows to split the calculus of the NKDE according to a grid dividing the study area. It might be necessary for big dataset to reduce the memory used. If the grid_shape is c(1,1), then a full network is built for the area. If the grid_shape is c(2,2), then the area is split in 4 rectangles. For each rectangle, the sample points falling in the rectangle are used, the events and the lines in a radius of the bandwidth length are used. The results are combined at the end and ordered to match the original order of the samples.

The geographical coordinates of the start and end of lines are used to build the network. To avoid troubles with digits, we truncate the coordinates according to the digit parameter. A minimal loss of precision is expected but results in a fast construction of the network.

To calculate the distances on the network, all the events are added as vertices. To reduce the size of the network, it is possible to reduce the number of vertices by adding the events at the extremity of the lines if they are close to them. This is controlled by the parameter tol.

In the same way, it is possible to limit the number of vertices by aggregating the events that are close to each other. In that case, the weights of the aggregated events are summed. According to an aggregation distance, a buffer is drawn around the fist event, all events falling in that buffer are aggregated to the first event, forming a new event. The coordinates of this new event are the means of the original events coordinates. This procedure is repeated until no events are aggregated. The aggregation distance can be fixed with the parameter agg.

When using the continuous and discontinuous kernel, the density is reduced at each intersection crossed. In the discontinuous case, after 5 intersections with four directions each, the density value is divided by 243 leading to very small values. In the same situation but with the continuous NKDE, the density value is divided by approximately 7.6. The max_depth parameters allows the user to control the maximum depth of these two NKDE. The base value is 15, but a value of 10 would yield very close estimates. A lower value might have a critical impact on speed when the bandwidth is large.

When using the continuous and discontinuous kernel, the connections between graph nodes are stored in a matrix. This matrix is typically sparse, and so a sparse matrix object is used to limit memory use. If the network is small (typically when the grid used to split the data has small rectangles) then a classical matrix could be used instead of a sparse one. It significantly increases speed, but could lead to memory issues.

References

Abramson IS (1982). “On bandwidth variation in kernel estimates-a square root law.” The annals of Statistics, 1217--1223.

Okabe A, Satoh T, Sugihara K (2009). “A kernel density estimation method for networks, its computational method and a GIS-based tool.” International Journal of Geographical Information Science, 23(1), 7--32.

Xie Z, Yan J (2008). “Kernel density estimation of traffic accidents in a network space.” Computers, environment and urban systems, 32(5), 396--406.

Examples

# \donttest{
data(mtl_network)
data(bike_accidents)
lixels <- lixelize_lines(mtl_network,200,mindist = 50)
samples <- lines_center(lixels)
densities <- nkde(mtl_network,
                  events = bike_accidents,
                  w = rep(1,nrow(bike_accidents)),
                  samples = samples,
                  kernel_name = "quartic",
                  bw = 300, div= "bw",
                  adaptive = FALSE,
                  method = "discontinuous", digits = 1, tol = 1,
                  agg = 15,
                  grid_shape = c(1,1),
                  verbose=FALSE)
# }