Validation of Hierarchical Climate Regionalization
validClimR computes indices for cluster validation, and an
objective tree cut for the regional linkage clustering method.
Usage
validClimR(y = NULL, k = NULL, minSize = 1, alpha = 0.05, verbose = TRUE,
  plot = FALSE, colPalette = NULL, pch = 15, cex = 1)
Arguments
- y
a dendrogram tree produced by HiClimR.
- k
NULL or an integer k > 1 for the number of regions/clusters. k = NULL is supported only for the regional linkage method, in which case the "optimal" number of regions is determined at the user-specified significance level alpha. The other methods require the number of clusters k to be specified, since they are not based on inter-cluster correlation; if k = NULL for any method other than regional linkage, validClimR will be aborted. For these methods, validClimR can be run at different numbers of clusters to compute the inter-cluster correlation and cut the tree objectively, although covering the entire merging history can be computationally expensive for a large number of spatial elements.
- minSize
minimum cluster size. The regional linkage method tends to isolate noisy data in small clusters. minSize can be used to exclude these very small clusters from the statSum statistical summary, because they are most likely noisy data that need to be checked in a quality control step. The analysis may then be repeated.
- alpha
significance level: the default is alpha = 0.05, which corresponds to a 95% confidence level.
- verbose
logical; processing information is printed if verbose = TRUE.
- plot
logical; the plotting method is called if plot = TRUE.
- colPalette
a color palette or a list of colors such as that generated by rainbow, heat.colors, topo.colors, terrain.colors or similar functions.
- pch
either an integer specifying a symbol or a single character to be used as the default in plotting points. See points for possible values.
- cex
a numerical value giving the amount by which plotting symbols should be magnified relative to the default = 1.
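The link between alpha and a correlation threshold can be illustrated with a minimal base R sketch. It assumes the standard two-sided t-test for a zero Pearson correlation; the helper name critical_cor and the sample size n = 41 are hypothetical, and the package may compute its cut level differently.

```r
# Smallest |r| that is significantly different from zero at level alpha,
# assuming a two-sided t-test with n - 2 degrees of freedom.
# critical_cor is a hypothetical helper, not part of HiClimR.
critical_cor <- function(n, alpha = 0.05) {
  dof <- n - 2
  t_crit <- qt(1 - alpha / 2, df = dof)  # critical t value
  t_crit / sqrt(dof + t_crit^2)          # invert t = r * sqrt(dof) / sqrt(1 - r^2)
}

# A stricter alpha demands a larger correlation before it counts as significant
r05 <- critical_cor(n = 41, alpha = 0.05)
r01 <- critical_cor(n = 41, alpha = 0.01)
print(c(alpha_0.05 = r05, alpha_0.01 = r01))
```

Under this assumption, lowering alpha raises the threshold, so fewer pairs of cluster means are treated as significantly correlated.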
Value
An object of class HiClimR which produces indices for validating
the tree produced by the clustering process.
The object is a list with the following components:
- cutLevel
the minimum significant correlation used for objective tree cut together with the corresponding confidence level.
- clustMean
the cluster means which are the region's mean timeseries for all selected regions.
- clustSize
cluster sizes for all selected regions.
- clustFlag
a flag, 0 or 1, to indicate the clusters used in the validation indices (interCor, intraCor, diffCor, and statSum), based on the minSize minimum cluster size. If clustFlag = 0, the cluster has been excluded because its size is less than the minSize minimum cluster size. The sum of the clustFlag elements represents the selected number of clusters.
- interCor
inter-cluster correlations for all selected regions, that is, the correlations between cluster means. The maximum inter-cluster correlation is a measure for separation or contiguity, and it is used for objective tree cut (to find the "optimal" number of clusters).
- intraCor
intra-cluster correlations for all selected regions. It is the intra-cluster correlations between the mean of each cluster and its members. The average intra-cluster correlation is a weighted average for all clusters, and it is a measure for homogeneity.
- diffCor
difference between intra-cluster correlation and maximum inter-cluster correlation for all selected regions.
- statSum
overall statistical summary for interCor, intraCor, and diffCor.
- region
ordered regions vector of size N, the number of spatial elements, for the selected number of clusters, after excluding the small clusters defined by the minSize argument.
- regionID
ordered regions ID vector of length equal to the selected number of clusters, after excluding the small clusters defined by the minSize argument. It helps in mapping ordered regions to their actual names before ordering. Only the region component uses ordered IDs, while the other components use the names used during the clustering process.
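The components above can be illustrated on synthetic data with base R only. This is a sketch under stated assumptions, not the package's implementation: the data, the variable names (clust_flag, max_inter, intra, diff_cor), and the threshold min_size = 5 are all hypothetical stand-ins for the corresponding outputs and arguments.

```r
set.seed(1)
n_time <- 60
# Three synthetic regions: each member is a shared signal plus noise
signals <- matrix(rnorm(3 * n_time), nrow = 3)
sizes <- c(30, 20, 2)                       # the last cluster is deliberately tiny
region <- rep(seq_along(sizes), times = sizes)
x <- signals[region, ] +
  matrix(rnorm(length(region) * n_time, sd = 0.5), nrow = length(region))

# Cluster sizes and a 0/1 flag for clusters that reach the minimum size
min_size <- 5
clust_size <- as.vector(table(region))
clust_flag <- as.integer(clust_size >= min_size)
keep <- which(clust_flag == 1)

# Cluster means: one mean time series per region (rows = regions)
clust_mean <- rowsum(x, region) / clust_size

# Inter-cluster correlation: correlations between the retained cluster means
inter <- cor(t(clust_mean[keep, , drop = FALSE]))
diag(inter) <- NA
max_inter <- apply(inter, 1, max, na.rm = TRUE)

# Intra-cluster correlation: average correlation of members with their cluster mean
intra <- sapply(keep, function(k) {
  members <- x[region == k, , drop = FALSE]
  mean(cor(t(members), clust_mean[k, ]))
})

# Size-weighted average as an overall homogeneity measure
intra_avg <- weighted.mean(intra, clust_size[keep])

# Difference between homogeneity and the worst-case separation, per cluster
diff_cor <- intra - max_inter
print(data.frame(cluster = keep, size = clust_size[keep], intra, max_inter, diff_cor))
```

In this sketch the two-member cluster is flagged 0 and left out of every index, which mirrors the role described for minSize.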
Details
The validClimR function is used for validation of a dendrogram tree
produced by HiClimR, by computing detailed statistical information for
each cluster about cluster means, sizes, intra- and inter-cluster correlations,
and overall summary. It requires the preprocessed data matrix and the tree from
the HiClimR function as inputs. An optional parameter can be used to
validate clustering for a selected number of clusters k. If k = NULL,
the default, which is supported only for the regional linkage method, objective cutting
of the tree to find the optimal number of clusters will be applied based on a
user-specified significance level (alpha parameter). In the regional linkage method,
noisy spatial elements are isolated in very small clusters or as individuals, since
they do not correlate well with any other elements. They can be excluded from the
validation indices (interCor, intraCor, diffCor, and statSum),
based on the minSize minimum cluster size. The excluded clusters are identified in
the output of validClimR in clustFlag, which takes a value of 1
for selected clusters or 0 for excluded clusters. The sum of the clustFlag
elements represents the selected number of clusters. The exclusion should be followed by a quality
control step before repeating the analysis.
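For methods that require k, the idea of scanning several values of k and comparing the maximum inter-cluster correlation with a significance threshold can be sketched as follows. The sketch uses stats::hclust with average linkage on synthetic data as a stand-in for a HiClimR tree, a two-sided t-test threshold, and the rule "largest k whose cluster means are not significantly correlated"; all of these are assumptions for illustration, and max_inter_cor is a hypothetical helper.

```r
set.seed(42)
n_time <- 80
# Four synthetic regions of 15 members each: shared signal plus noise
signals <- matrix(rnorm(4 * n_time), nrow = 4)
region_true <- rep(1:4, each = 15)
x <- signals[region_true, ] + matrix(rnorm(60 * n_time, sd = 0.6), nrow = 60)

# Stand-in tree: average linkage on correlation distance (not HiClimR itself)
tree <- hclust(as.dist(1 - cor(t(x))), method = "average")

# Maximum correlation between cluster means for a given partition
max_inter_cor <- function(x, groups) {
  means <- rowsum(x, groups) / as.vector(table(groups))
  r <- cor(t(means))
  max(r[upper.tri(r)])
}

# Significance threshold for a correlation at level alpha (two-sided t-test)
alpha <- 0.05
dof <- n_time - 2
t_crit <- qt(1 - alpha / 2, df = dof)
r_min <- t_crit / sqrt(dof + t_crit^2)

# Scan candidate numbers of clusters
ks <- 2:10
max_r <- sapply(ks, function(k) max_inter_cor(x, cutree(tree, k = k)))

# Assumed rule: keep the largest k whose cluster means are not significantly correlated
ok <- ks[max_r < r_min]
k_cut <- if (length(ok) > 0) max(ok) else NA
print(data.frame(k = ks, max_inter_cor = round(max_r, 3), significant = max_r >= r_min))
```

Each candidate k needs a fresh set of cluster means and correlations, which is why scanning the full merging history becomes expensive for a large number of spatial elements.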
References
Hamada S. Badr, Zaitchik, B. F. and Dezfuli, A. K. (2015): A Tool for Hierarchical Climate Regionalization, Earth Science Informatics, 8(4), 949-958, doi:10.1007/s12145-015-0221-7.
Hamada S. Badr, Zaitchik, B. F. and Dezfuli, A. K. (2014): Hierarchical Climate Regionalization, Comprehensive R Archive Network (CRAN), https://cran.r-project.org/package=HiClimR.
Author
Hamada S. Badr <badr@jhu.edu>, Benjamin F. Zaitchik <zaitchik@jhu.edu>,
and Amin K. Dezfuli <amin.dezfuli@nasa.gov>. HiClimR is
a modification of the hclust function, which is based on
Fortran code contributed to STATLIB by F. Murtagh.
Examples
require(HiClimR)
## Load test case data
x <- TestCase$x
## Generate longitude and latitude mesh vectors
xGrid <- grid2D(lon = unique(TestCase$lon), lat = unique(TestCase$lat))
lon <- c(xGrid$lon)
lat <- c(xGrid$lat)
## Hierarchical Climate Regionalization
y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE,
kH = NULL, members = NULL, validClimR = TRUE, k = 12, minSize = 1,
alpha = 0.01, plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
#>
#> PROCESSING STARTED
#>
#> Checking Multivariate Clustering (MVC)...
#> ---> x is a matrix
#> ---> single-variate clustering: 1 variable
#> Checking data...
#> ---> Checking dimensions...
#> ---> Checking row names...
#> ---> Checking column names...
#> Data filtering...
#> ---> Computing mean for each row...
#> ---> Checking rows with mean bellow meanThresh...
#> ---> 4678 rows found, mean ≤ 10
#> ---> Computing variance for each row...
#> ---> Checking rows with near-zero-variance...
#> ---> 3951 rows found, variance ≤ 0
#> Data preprocessing...
#> ---> Applying mask...
#> ---> Checking columns with missing values...
#> ---> Removing linear trend...
#> ---> Standardizing data...
#> Agglomerative Hierarchical Clustering...
#> ---> Computing correlation/dissimilarity matrix...
#> ---> Starting clustering process...
#> ---> Constructing dendrogram tree...
#> Calling cluster validation...
#> ---> Computing cluster means...
#> ---> Computing inter-cluster correlations...
#> ---> Computing intra-cluster correlations...
#> ---> Computing summary statistics...
#> Generating region map...
#>
#> PROCESSING COMPLETED
#>
#> Running Time:
#> user system elapsed
#> 0.242 0.018 0.260
#> Time difference of 0.2594428 secs
## Validation of Hierarchical Climate Regionalization
z <- validClimR(y, k = 12, minSize = 1, alpha = 0.01, plot = TRUE)
#> ---> Computing cluster means...
#> ---> Computing inter-cluster correlations...
#> ---> Computing intra-cluster correlations...
#> ---> Computing summary statistics...
#> Generating region map...
## Use a specified number of clusters (k = 12)
z <- validClimR(y, k = 12, minSize = 1, alpha = 0.01, plot = TRUE)
#> ---> Computing cluster means...
#> ---> Computing inter-cluster correlations...
#> ---> Computing intra-cluster correlations...
#> ---> Computing summary statistics...
#> Generating region map...
## Apply minimum cluster size (minSize = 25)
z <- validClimR(y, k = 12, minSize = 25, alpha = 0.01, plot = TRUE)
#> ---> Computing cluster means...
#> ---> Computing inter-cluster correlations...
#> ---> Computing intra-cluster correlations...
#> ---> Computing summary statistics...
#> Generating region map...
## The optimal number of clusters, including small clusters
k <- length(z$clustFlag)
## The selected number of clusters, after excluding small clusters (if minSize > 1)
ks <- sum(z$clustFlag)