Validation of Hierarchical Climate Regionalization

validClimR computes indices for cluster validation and an objective tree cut for the regional linkage clustering method.

Usage

validClimR(y = NULL, k = NULL, minSize = 1, alpha = 0.05, verbose = TRUE,
    plot = FALSE, colPalette = NULL, pch = 15, cex = 1)

Arguments

y: a dendrogram tree produced by HiClimR.
k: NULL or an integer k > 1 for the number of regions/clusters. k = NULL is supported only for the regional linkage method, in which case the "optimal" number of regions is determined at a user-specified significance level alpha. The other methods require the number of clusters k to be specified, since they are not based on inter-cluster correlation. If k = NULL for any method other than regional linkage, validClimR will be aborted. One can still use validClimR to compute inter-cluster correlation at different numbers of clusters and objectively cut the tree for the other methods, but covering the entire merging history could be computationally expensive for a large number of spatial elements.
minSize: minimum cluster size. The regional linkage method tends to isolate noisy data in small clusters. minSize can be used to exclude these very small clusters from the statSum statistical summary, because they are most likely noisy data that need to be checked in a quality control step. The analysis may then be repeated.
alpha: significance level: the default is alpha = 0.05, which corresponds to a 95% confidence level.
verbose: logical to print processing information if verbose = TRUE.
plot: logical to call the plotting method if plot = TRUE.
colPalette: a color palette or a list of colors such as that generated by rainbow, heat.colors, topo.colors, terrain.colors or similar functions.
pch: either an integer specifying a symbol or a single character to be used as the default in plotting points. See points for possible values.
cex: a numerical value giving the amount by which plotting symbols should be magnified relative to the default = 1.
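
For linkage methods other than regional, the scan over different numbers of clusters described under k can be sketched in base R. This is only an illustration on synthetic data: it uses hclust and cutree from the stats package rather than HiClimR, and the helper maxInterCor and all variable names are hypothetical.

```r
# Synthetic data: 30 spatial elements (rows) x 50 time steps (columns),
# built from three underlying signals plus noise (illustrative only).
set.seed(1)
nTime <- 50
signals <- matrix(rnorm(3 * nTime), nrow = 3)
x <- signals[rep(1:3, each = 10), ] +
    matrix(rnorm(30 * nTime, sd = 0.5), nrow = 30)

# Correlation-based dissimilarity and a Ward tree from base R (not HiClimR)
d <- as.dist(1 - cor(t(x)))
tree <- hclust(d, method = "ward.D2")

# Hypothetical helper: maximum correlation between cluster-mean time series
maxInterCor <- function(x, groups) {
    means <- rowsum(x, groups) / as.vector(table(groups))
    r <- cor(t(means))
    max(r[upper.tri(r)])
}

# Scan a range of candidate cluster numbers
ks <- 2:8
scan <- sapply(ks, function(k) maxInterCor(x, cutree(tree, k = k)))
names(scan) <- ks
print(round(scan, 3))
```

With a tree from HiClimR, the same loop would call validClimR(y, k = k) for each candidate k and compare the reported inter-cluster correlations.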

Value

An object of class HiClimR which produces indices for validating the tree produced by the clustering process. The object is a list with the following components:

cutLevel: the minimum significant correlation used for objective tree cut together with the corresponding confidence level.
clustMean: the cluster means, which are the regions' mean time series for all selected regions.
clustSize: cluster sizes for all selected regions.
clustFlag: a flag, 0 or 1, indicating whether the cluster is used in the validation indices (interCor, intraCor, diffCor, and statSum), based on the minSize minimum cluster size. If clustFlag = 0, the cluster has been excluded because its size is less than minSize. The sum of clustFlag elements represents the number of selected clusters.
interCor: inter-cluster correlations for all selected regions, computed between cluster means. The maximum inter-cluster correlation is a measure of separation or contiguity, and it is used for the objective tree cut (to find the "optimal" number of clusters).
intraCor: intra-cluster correlations for all selected regions, computed between the mean of each cluster and its members. The average intra-cluster correlation is a weighted average over all clusters, and it is a measure of homogeneity.
diffCor: difference between intra-cluster correlation and maximum inter-cluster correlation for all selected regions.
statSum: overall statistical summary for interCor, intraCor, and diffCor.
region: ordered regions vector of size N (the number of spatial elements) for the selected number of clusters, after excluding the small clusters defined by the minSize argument.
regionID: ordered region ID vector whose length equals the selected number of clusters, after excluding the small clusters defined by the minSize argument. It helps in mapping ordered regions to their actual names before ordering. Only the region component uses ordered IDs, while the other components use the names used during the clustering process.
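
To make the components above more concrete, the following sketch computes analogous quantities in base R for a small synthetic data set. It assumes Pearson correlation and simple arithmetic cluster means, which may differ from what validClimR does internally. The cluster labels in groups are made up, and the variable names merely mirror the component names.

```r
# Synthetic data: 12 spatial elements (rows) x 40 time steps (columns)
set.seed(42)
nTime <- 40
signals <- matrix(rnorm(2 * nTime), nrow = 2)
base <- rbind(signals[rep(1, 6), ], signals[rep(2, 5), ], rep(0, nTime))
x <- base + matrix(rnorm(12 * nTime, sd = 0.4), nrow = 12)

# Hypothetical cluster labels: two real clusters and one isolated noisy element
groups <- rep(1:3, times = c(6, 5, 1))
minSize <- 2

# Cluster sizes, selection flags, and cluster-mean time series
clustSize <- as.vector(table(groups))
clustFlag <- as.integer(clustSize >= minSize)
clustMean <- rowsum(x, groups) / clustSize

# Intra-cluster correlation: mean correlation of members with their cluster mean
intraCor <- sapply(seq_along(clustSize), function(g) {
    members <- x[groups == g, , drop = FALSE]
    mean(cor(t(members), clustMean[g, ]))
})

# Inter-cluster correlation between the means of the selected clusters only
sel <- which(clustFlag == 1)
interCor <- cor(t(clustMean[sel, , drop = FALSE]))
diag(interCor) <- NA
maxInterCor <- apply(interCor, 1, max, na.rm = TRUE)

# Difference between intra-cluster and maximum inter-cluster correlation
diffCor <- intraCor[sel] - maxInterCor

# Size-weighted average intra-cluster correlation (homogeneity)
avgIntraCor <- weighted.mean(intraCor[sel], clustSize[sel])
```

Here the single-element cluster gets clustFlag = 0 and is left out of interCor and diffCor, so sum(clustFlag) gives the number of selected clusters.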

Details

The validClimR function is used for validation of a dendrogram tree produced by HiClimR, by computing detailed statistical information for each cluster: cluster means, sizes, intra- and inter-cluster correlations, and an overall summary. It requires the preprocessed data matrix and the tree from the HiClimR function as inputs. An optional parameter k can be used to validate clustering for a selected number of clusters. If k = NULL (the default, which is supported only for the regional linkage method), the tree is cut objectively to find the optimal number of clusters, based on a user-specified significance level (the alpha parameter).

In the regional linkage method, noisy spatial elements are isolated in very small clusters or as individuals, since they do not correlate well with any other elements. They can be excluded from the validation indices (interCor, intraCor, diffCor, and statSum) based on the minSize minimum cluster size. The excluded clusters are identified in the clustFlag component of the validClimR output, which takes a value of 1 for selected clusters or 0 for excluded clusters. The sum of clustFlag elements represents the number of selected clusters. This should be followed by a quality control step before repeating the analysis.
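
The link between the significance level alpha and a minimum significant correlation can be illustrated with the standard two-sided t-test for a Pearson correlation. The sketch below is an assumption about the general idea, not necessarily the computation behind cutLevel. The helper critCor is hypothetical, and n stands for the number of time steps.

```r
# Hypothetical helper: smallest |r| that is significant at level alpha
# for n paired observations, using a two-sided t-test with n - 2 degrees
# of freedom: r = t / sqrt(n - 2 + t^2).
critCor <- function(n, alpha = 0.05) {
    tCrit <- qt(1 - alpha / 2, df = n - 2)
    tCrit / sqrt(n - 2 + tCrit^2)
}

# A stricter significance level demands a larger correlation
r05 <- critCor(n = 41, alpha = 0.05)
r01 <- critCor(n = 41, alpha = 0.01)
print(c(alpha_0.05 = r05, alpha_0.01 = r01))
```

Comparing such a threshold with the maximum inter-cluster correlation is one way to judge whether two regions are still significantly correlated at a given cut.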

References

Hamada S. Badr, Zaitchik, B. F. and Dezfuli, A. K. (2015): A Tool for Hierarchical Climate Regionalization, Earth Science Informatics, 8(4), 949-958, doi:10.1007/s12145-015-0221-7.

Hamada S. Badr, Zaitchik, B. F. and Dezfuli, A. K. (2014): Hierarchical Climate Regionalization, Comprehensive R Archive Network (CRAN), https://cran.r-project.org/package=HiClimR.

Author

Hamada S. Badr <badr@jhu.edu>, Benjamin F. Zaitchik <zaitchik@jhu.edu>, and Amin K. Dezfuli <amin.dezfuli@nasa.gov>. HiClimR is a modification of the hclust function, which is based on Fortran code contributed to STATLIB by F. Murtagh.

Examples

require(HiClimR)

## Load test case data
x <- TestCase$x

## Generate longitude and latitude mesh vectors
xGrid <- grid2D(lon = unique(TestCase$lon), lat = unique(TestCase$lat))
lon <- c(xGrid$lon)
lat <- c(xGrid$lat)

## Hierarchical Climate Regionalization
y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
    continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
    standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE,
    kH = NULL, members = NULL, validClimR = TRUE, k = 12, minSize = 1,
    alpha = 0.01, plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
#> 
#> PROCESSING STARTED
#> 
#> Checking Multivariate Clustering (MVC)...
#> ---> x is a matrix
#> ---> single-variate clustering: 1 variable
#> Checking data...
#> ---> Checking dimensions...
#> ---> Checking row names...
#> ---> Checking column names...
#> Data filtering...
#> ---> Computing mean for each row...
#> ---> Checking rows with mean bellow meanThresh...
#> ---> 4678 rows found, mean ≤  10
#> ---> Computing variance for each row...
#> ---> Checking rows with near-zero-variance...
#> ---> 3951 rows found, variance ≤  0
#> Data preprocessing...
#> ---> Applying mask...
#> ---> Checking columns with missing values...
#> ---> Removing linear trend...
#> ---> Standardizing data...
#> Agglomerative Hierarchical Clustering...
#> ---> Computing correlation/dissimilarity matrix...
#> ---> Starting clustering process...
#> ---> Constructing dendrogram tree...
#> Calling cluster validation...
#> ---> Computing cluster means...
#> ---> Computing inter-cluster correlations...
#> ---> Computing intra-cluster correlations...
#> ---> Computing summary statistics...
#> Generating region map...
#> 
#> PROCESSING COMPLETED
#> 
#> Running Time:
#>    user  system elapsed 
#>   0.245   0.014   0.259 
#> Time difference of 0.2595906 secs

## Validation of Hierarchical Climate Regionalization
z <- validClimR(y, k = 12, minSize = 1, alpha = 0.01, plot = TRUE)
#> ---> Computing cluster means...
#> ---> Computing inter-cluster correlations...
#> ---> Computing intra-cluster correlations...
#> ---> Computing summary statistics...
#> Generating region map...

## Use a specified number of clusters (k = 12)
z <- validClimR(y, k = 12, minSize = 1, alpha = 0.01, plot = TRUE)
#> ---> Computing cluster means...
#> ---> Computing inter-cluster correlations...
#> ---> Computing intra-cluster correlations...
#> ---> Computing summary statistics...
#> Generating region map...

## Apply minimum cluster size (minSize = 25)
z <- validClimR(y, k = 12, minSize = 25, alpha = 0.01, plot = TRUE)
#> ---> Computing cluster means...
#> ---> Computing inter-cluster correlations...
#> ---> Computing intra-cluster correlations...
#> ---> Computing summary statistics...
#> Generating region map...

## The total number of clusters, including small clusters (k = 12 here)
k <- length(z$clustFlag)

## The selected number of clusters, after excluding small clusters (if minSize > 1)
ks <- sum(z$clustFlag)