Validation of Hierarchical Climate Regionalization
validClimR computes indices for cluster validation and an objective tree cut for the regional linkage clustering method.
Usage
validClimR(y = NULL, k = NULL, minSize = 1, alpha = 0.05, verbose = TRUE,
plot = FALSE, colPalette = NULL, pch = 15, cex = 1)
Arguments
- y
a dendrogram tree produced by HiClimR.
- k
NULL or an integer k > 1 for the number of regions/clusters. k = NULL is supported only for the regional linkage method, where the "optimal" number of regions is selected at a user-specified significance level alpha. For the other methods, the number of clusters k must be specified, since they are not based on inter-cluster correlation. If k = NULL for any method other than regional linkage, validClimR will be aborted. One can use the validClimR function to compute the inter-cluster correlation at different numbers of clusters in order to cut the tree objectively for the other methods, although covering the entire merging history can be computationally expensive for a large number of spatial elements.
- minSize
minimum cluster size. The regional linkage method tends to isolate noisy data in small clusters. minSize can be used to exclude these very small clusters from the statSum statistical summary, because they are most likely noisy data that need to be checked in a quality control step. The analysis may then be repeated.
- alpha
significance level: the default alpha = 0.05 corresponds to a 95% confidence level.
- verbose
logical to print processing information if verbose = TRUE.
- plot
logical to call the plotting method if plot = TRUE.
- colPalette
a color palette or a list of colors such as that generated by rainbow, heat.colors, topo.colors, terrain.colors or similar functions.
- pch
either an integer specifying a symbol or a single character to be used as the default in plotting points. See points for possible values.
- cex
a numerical value giving the amount by which plotting symbols should be magnified relative to the default (cex = 1).
Value
An object of class HiClimR which produces indices for validating the tree produced by the clustering process. The object is a list with the following components:
- cutLevel
the minimum significant correlation used for the objective tree cut, together with the corresponding confidence level.
- clustMean
the cluster means, which are the regional mean time series for all selected regions.
- clustSize
cluster sizes for all selected regions.
- clustFlag
a flag (0 or 1) indicating whether each cluster is used in the validation indices (interCor, intraCor, diffCor, and statSum), based on the minSize minimum cluster size. If clustFlag = 0, the cluster has been excluded because its size is less than minSize. The sum of the clustFlag elements is the number of selected clusters.
- interCor
inter-cluster correlations for all selected regions, computed between cluster means. The maximum inter-cluster correlation is a measure of separation or contiguity, and it is used for the objective tree cut (to find the "optimal" number of clusters).
- intraCor
intra-cluster correlations for all selected regions, computed between the mean of each cluster and its members. The average intra-cluster correlation is a weighted average over all clusters, and it is a measure of homogeneity.
- diffCor
difference between the intra-cluster correlation and the maximum inter-cluster correlation for all selected regions.
- statSum
overall statistical summary for interCor, intraCor, and diffCor.
- region
ordered region vector of length N (the number of spatial elements) for the selected number of clusters, after excluding the small clusters defined by the minSize argument.
- regionID
ordered region ID vector whose length equals the selected number of clusters, after excluding the small clusters defined by the minSize argument. It helps in mapping the ordered regions to their actual names before ordering. Only the region component uses ordered IDs, while the other components use the names used during the clustering process.
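The relationship between these components can be illustrated with a minimal base R sketch. It does not call HiClimR and is not the package's implementation: the synthetic data, the average-linkage tree, the threshold min_size = 5, and all variable names are assumptions chosen only to show how cluster means, inter- and intra-cluster correlations, their difference, and a size flag relate to one another.

```r
set.seed(1)
n_time <- 60

# Hypothetical data: 3 regional signals, each shared by 10 spatial elements plus noise
signals <- matrix(rnorm(3 * n_time), nrow = 3)
x <- signals[rep(1:3, each = 10), ] +
  matrix(rnorm(30 * n_time, sd = 0.5), nrow = 30)

# Correlation-based hierarchical clustering (rows are spatial elements)
tree <- hclust(as.dist(1 - cor(t(x))), method = "average")
region <- cutree(tree, k = 3)

# Cluster sizes and cluster means (one mean time series per region)
clust_size <- as.vector(table(region))
clust_mean <- rowsum(x, region) / clust_size

# Inter-cluster correlation: correlations between cluster means,
# summarized by the maximum off-diagonal value for each cluster
inter <- cor(t(clust_mean))
diag(inter) <- NA
inter_max <- apply(inter, 1, max, na.rm = TRUE)

# Intra-cluster correlation: mean correlation between each cluster mean and its members
intra <- sapply(seq_len(nrow(clust_mean)), function(i) {
  members <- x[region == i, , drop = FALSE]
  mean(cor(t(members), clust_mean[i, ]))
})

# Difference between intra-cluster and maximum inter-cluster correlation
diff_cor <- intra - inter_max

# Flag clusters that reach an assumed minimum size, then average over flagged clusters only
min_size <- 5
clust_flag <- as.integer(clust_size >= min_size)
intra_avg <- weighted.mean(intra[clust_flag == 1], clust_size[clust_flag == 1])
```

In this sketch a large positive diff_cor means that members of a cluster agree with their own mean much more closely than that mean agrees with any other cluster mean, which is the homogeneity-versus-separation trade-off the indices are meant to expose.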
Details
The validClimR function is used for validation of a dendrogram tree produced by HiClimR, by computing detailed statistical information for each cluster about cluster means, sizes, intra- and inter-cluster correlations, and an overall summary. It requires the preprocessed data matrix and the tree from the HiClimR function as inputs. An optional parameter can be used to validate the clustering for a selected number of clusters k. If k = NULL (the default, which is supported only for the regional linkage method), the tree is cut objectively to find the optimal number of clusters, based on a user-specified significance level (the alpha parameter). In the regional linkage method, noisy spatial elements are isolated in very small clusters or as individuals, since they do not correlate well with any other elements. They can be excluded from the validation indices (interCor, intraCor, diffCor, and statSum), based on the minSize minimum cluster size. The excluded clusters are identified in the output of validClimR by clustFlag, which takes a value of 1 for selected clusters or 0 for excluded clusters. The sum of the clustFlag elements is the number of selected clusters. This should be followed by a quality control step before repeating the analysis.
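The idea of cutting a tree against a significance level can be sketched in base R. The helper names min_sig_cor and max_inter_cor, the synthetic data, the average-linkage tree, and the selection rule are all assumptions made for illustration; the threshold uses the standard two-sided t-test for a Pearson correlation, and the package's own criterion may differ.

```r
# Smallest correlation that is significant at level alpha for n time samples
# (two-sided t-test on a Pearson correlation; an assumed stand-in for cutLevel)
min_sig_cor <- function(n, alpha = 0.05) {
  dof <- n - 2
  t_crit <- qt(1 - alpha / 2, dof)
  t_crit / sqrt(dof + t_crit^2)
}

# Maximum correlation between cluster means when the tree is cut into k clusters
max_inter_cor <- function(x, tree, k) {
  region <- cutree(tree, k = k)
  clust_mean <- rowsum(x, region) / as.vector(table(region))
  inter <- cor(t(clust_mean))
  max(inter[upper.tri(inter)])
}

set.seed(1)
n_time <- 60

# Hypothetical data: 3 regional signals, each shared by 10 spatial elements plus noise
signals <- matrix(rnorm(3 * n_time), nrow = 3)
x <- signals[rep(1:3, each = 10), ] +
  matrix(rnorm(30 * n_time, sd = 0.5), nrow = 30)
tree <- hclust(as.dist(1 - cor(t(x))), method = "average")

# Scan a range of cluster counts and compare against the threshold
r_min <- min_sig_cor(n_time, alpha = 0.01)
ks <- 2:10
inter_by_k <- sapply(ks, function(k) max_inter_cor(x, tree, k))

# Assumed rule: keep the largest k whose clusters are not significantly correlated
below <- ks[inter_by_k < r_min]
k_opt <- if (length(below) > 0) max(below) else NA_integer_
```

Scanning every k this way recomputes the cluster means for each cut, which is why covering the full merging history becomes expensive for a large number of spatial elements.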
References
Badr, H. S., Zaitchik, B. F. and Dezfuli, A. K. (2015): A Tool for Hierarchical Climate Regionalization, Earth Science Informatics, 8(4), 949-958, doi:10.1007/s12145-015-0221-7.
Badr, H. S., Zaitchik, B. F. and Dezfuli, A. K. (2014): Hierarchical Climate Regionalization, Comprehensive R Archive Network (CRAN), https://cran.r-project.org/package=HiClimR.
Author
Hamada S. Badr <badr@jhu.edu>, Benjamin F. Zaitchik <zaitchik@jhu.edu>, and Amin K. Dezfuli <amin.dezfuli@nasa.gov>. HiClimR is a modification of the hclust function, which is based on Fortran code contributed to STATLIB by F. Murtagh.
Examples
library(HiClimR)
## Load test case data
x <- TestCase$x
## Generate longitude and latitude mesh vectors
xGrid <- grid2D(lon = unique(TestCase$lon), lat = unique(TestCase$lat))
lon <- c(xGrid$lon)
lat <- c(xGrid$lat)
## Hierarchical Climate Regionalization
y <- HiClimR(x, lon = lon, lat = lat, lonStep = 1, latStep = 1, geogMask = FALSE,
continent = "Africa", meanThresh = 10, varThresh = 0, detrend = TRUE,
standardize = TRUE, nPC = NULL, method = "ward", hybrid = FALSE,
kH = NULL, members = NULL, validClimR = TRUE, k = 12, minSize = 1,
alpha = 0.01, plot = TRUE, colPalette = NULL, hang = -1, labels = FALSE)
#>
#> PROCESSING STARTED
#>
#> Checking Multivariate Clustering (MVC)...
#> ---> x is a matrix
#> ---> single-variate clustering: 1 variable
#> Checking data...
#> ---> Checking dimensions...
#> ---> Checking row names...
#> ---> Checking column names...
#> Data filtering...
#> ---> Computing mean for each row...
#> ---> Checking rows with mean bellow meanThresh...
#> ---> 4678 rows found, mean ≤ 10
#> ---> Computing variance for each row...
#> ---> Checking rows with near-zero-variance...
#> ---> 3951 rows found, variance ≤ 0
#> Data preprocessing...
#> ---> Applying mask...
#> ---> Checking columns with missing values...
#> ---> Removing linear trend...
#> ---> Standardizing data...
#> Agglomerative Hierarchical Clustering...
#> ---> Computing correlation/dissimilarity matrix...
#> ---> Starting clustering process...
#> ---> Constructing dendrogram tree...
#> Calling cluster validation...
#> ---> Computing cluster means...
#> ---> Computing inter-cluster correlations...
#> ---> Computing intra-cluster correlations...
#> ---> Computing summary statistics...
#> Generating region map...
#>
#> PROCESSING COMPLETED
#>
#> Running Time:
#> user system elapsed
#> 0.229 0.032 0.261
#> Time difference of 0.2617424 secs
## Validation of Hierarchical Climate Regionalization
z <- validClimR(y, k = 12, minSize = 1, alpha = 0.01, plot = TRUE)
#> ---> Computing cluster means...
#> ---> Computing inter-cluster correlations...
#> ---> Computing intra-cluster correlations...
#> ---> Computing summary statistics...
#> Generating region map...
## Use a specified number of clusters (k = 12)
z <- validClimR(y, k = 12, minSize = 1, alpha = 0.01, plot = TRUE)
#> ---> Computing cluster means...
#> ---> Computing inter-cluster correlations...
#> ---> Computing intra-cluster correlations...
#> ---> Computing summary statistics...
#> Generating region map...
## Apply minimum cluster size (minSize = 25)
z <- validClimR(y, k = 12, minSize = 25, alpha = 0.01, plot = TRUE)
#> ---> Computing cluster means...
#> ---> Computing inter-cluster correlations...
#> ---> Computing intra-cluster correlations...
#> ---> Computing summary statistics...
#> Generating region map...
## The optimal number of clusters, including small clusters
k <- length(z$clustFlag)
## The selected number of clusters, after excluding small clusters (if minSize > 1)
ks <- sum(z$clustFlag)